mirror of
https://github.com/boostorg/tokenizer.git
synced 2026-01-26 07:02:14 +00:00
257 lines
7.6 KiB
HTML
257 lines
7.6 KiB
HTML
<html>
|
|
|
|
<head>
|
|
<meta http-equiv="Content-Type"
|
|
content="text/html; charset=iso-8859-1">
|
|
<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">
|
|
<title>Boost Char Separator</title>
|
|
<!--
|
|
-- Copyright © Jeremy Siek and John Bandela 2001-2002
|
|
--
|
|
-- Permission to use, copy, modify, distribute and sell this software
|
|
-- and its documentation for any purpose is hereby granted without fee,
|
|
-- provided that the above copyright notice appears in all copies and
|
|
-- that both that copyright notice and this permission notice appear
|
|
-- in supporting documentation. Jeremy Siek makes no
|
|
-- representations about the suitability of this software for any
|
|
-- purpose. It is provided "as is" without express or implied warranty.
|
|
-->
|
|
</head>
|
|
|
|
<body bgcolor="#FFFFFF" text="#000000" link="#0000EE"
|
|
vlink="#551A8B" alink="#FF0000">
|
|
|
|
<p><img src="../../c++boost.gif" alt="C++ Boost" width="277"
|
|
height="86"> <br>
|
|
</p>
|
|
|
|
<h1>
|
|
char_separator<Char, Traits>
|
|
</h1>
|
|
|
|
<p>
|
|
The <tt>char_separator</tt> class breaks a sequence of characters into
|
|
tokens based on character delimiters much in the same way that
|
|
<tt>strtok()</tt> does (but without all the evils of non-reentrancy
|
|
and destruction of the input sequence).
|
|
</p>
|
|
|
|
<p>
|
|
The <tt>char_separator</tt> class is used in conjunction with the <a
|
|
href="token_iterator.htm"><tt>token_iterator</tt></a> or <a
|
|
href="tokenizer.htm"><tt>tokenizer</tt></a> to perform tokenizing.
|
|
</p>
|
|
|
|
<h2>Definitions</h2>
|
|
|
|
<p>
|
|
The <tt>strtok()</tt> function does not include matches with the
|
|
character delimiters in the output sequence of tokens. However,
|
|
sometimes it is useful to have the delimiters show up in the output
|
|
sequence, therefore <tt>char_separator</tt> provides this as an
|
|
option. We refer to delimiters that show up as output tokens as
|
|
<b><i>kept delimiters</i></b> and delimiters that do now show up as
|
|
output tokens as <b><i>dropped delimiters</i></b>.
|
|
</p>
|
|
|
|
<p>
|
|
When two delimiters appear next to each other in the input sequence,
|
|
there is the question of whether to output an <b><i>empty
|
|
token</i></b> or to skip ahead. The behaviour of <tt>strtok()</tt> is
|
|
to skip ahead. The <tt>char_separator</tt> class provides both
|
|
options.
|
|
</p>
|
|
|
|
|
|
<h2>Examples</h2>
|
|
|
|
<p>
|
|
This first examples shows how to use <tt>char_separator</tt> as a
|
|
replacement for the <tt>strtok()</tt> function. We've specified three
|
|
character delimiters, and they will not show up as output tokens. We
|
|
have not specified any kept delimiters, and by default any empty
|
|
tokens will be ignored.
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>// char_sep_example_1.cpp
|
|
#include <iostream>
|
|
#include <boost/tokenizer.hpp>
|
|
#include <string>
|
|
|
|
int main()
|
|
{
|
|
std::string str = ";;Hello|world||-foo--bar;yow;baz|";
|
|
typedef boost::tokenizer<boost::char_separator<char> >
|
|
tokenizer;
|
|
boost::char_separator<char> sep("-;|");
|
|
tokenizer tokens(str, sep);
|
|
for (tokenizer::iterator tok_iter = tokens.begin();
|
|
tok_iter != tokens.end(); ++tok_iter)
|
|
std::cout << "<" << *tok_iter << "> ";
|
|
std::cout << "\n";
|
|
return EXIT_SUCCESS;
|
|
}
|
|
</pre>
|
|
</blockquote>
|
|
The output is:
|
|
<blockquote>
|
|
<pre>
|
|
<Hello> <world> <foo> <bar> <yow> <baz>
|
|
</pre>
|
|
</blockquote>
|
|
|
|
|
|
<p>
|
|
The next example shows tokenizing with two dropped delimiters '-' and
|
|
';' and a single kept delimiter '|'. We also specify that empty tokens
|
|
should show up in the output when two delimiters are next to each
|
|
other.
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>// char_sep_example_2.cpp
|
|
#include <iostream>
|
|
#include <boost/tokenizer.hpp>
|
|
#include <string>
|
|
|
|
int main()
|
|
{
|
|
std::string str = ";;Hello|world||-foo--bar;yow;baz|";
|
|
typedef boost::tokenizer<boost::char_separator<char> >
|
|
tokenizer;
|
|
boost::char_separator<char> sep("-;", "|", boost::keep_empty_tokens);
|
|
tokenizer tokens(str, sep);
|
|
for (tokenizer::iterator tok_iter = tokens.begin();
|
|
tok_iter != tokens.end(); ++tok_iter)
|
|
std::cout << "<" << *tok_iter << "> ";
|
|
std::cout << "\n";
|
|
return EXIT_SUCCESS;
|
|
}
|
|
</pre>
|
|
</blockquote>
|
|
The output is:
|
|
<blockquote>
|
|
<pre>
|
|
<> <> <Hello> <|> <world> <|> <> <|> <> <foo> <> <bar> <yow> <baz> <|> <>
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The final example shows tokenizing on punctuation and whitespace
|
|
characters using the default constructor of the
|
|
<tt>char_separator</tt>.
|
|
</p>
|
|
|
|
<blockquote>
|
|
<pre>// char_sep_example_3.cpp
|
|
#include <iostream>
|
|
#include <boost/tokenizer.hpp>
|
|
#include <string>
|
|
|
|
int main()
|
|
{
|
|
std::string str = "This is, a test";
|
|
typedef boost::tokenizer<boost::char_separator<char> > Tok;
|
|
boost::char_separator<char> sep; // default constructed
|
|
Tok tok(str, sep);
|
|
for(Tok::iterator tok_iter = tok.begin(); tok_iter != tok.end(); ++tok_iter)
|
|
std::cout << "<" << *tok_iter << "> ";
|
|
std::cout << "\n";
|
|
return EXIT_SUCCESS;
|
|
}
|
|
</pre>
|
|
</blockquote>
|
|
The output is:
|
|
<blockquote>
|
|
<pre>
|
|
<This> <is> <,> <a> <test>
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<h2>Template parameters</h2>
|
|
|
|
<P>
|
|
<table border>
|
|
<TR>
|
|
<th>Parameter</th><th>Description</th><th>Default</th>
|
|
</tr>
|
|
|
|
<TR><TD><TT>Char</TT></TD>
|
|
<TD>The type of elements within a token, typically <tt>char</tt>.</TD>
|
|
<TD> </TD>
|
|
</TR>
|
|
|
|
<TR><TD><TT>Traits</TT></TD>
|
|
<TD>The <tt>char_traits</tt> for the character type.</TD>
|
|
<TD><tt>char_traits<char></tt></TD>
|
|
</TR>
|
|
|
|
</table>
|
|
|
|
<h2>Model of</h2>
|
|
|
|
<a href="tokenizerfunction.htm">Tokenizer Function</a>
|
|
|
|
|
|
<h2>Members</h2>
|
|
|
|
<hr>
|
|
<pre>
|
|
explicit char_separator(const Char* dropped_delims,
|
|
const Char* kept_delims = "",
|
|
empty_token_policy empty_tokens = drop_empty_tokens)
|
|
</pre>
|
|
|
|
<p>
|
|
This creates a <tt>char_separator</tt> object, which can then be used
|
|
to create a <a href="token_iterator.htm"><tt>token_iterator</tt></a>
|
|
or <a href="tokenizer.htm"><tt>tokenizer</tt></a> to perform
|
|
tokenizing. The <tt>dropped_delims</tt> and <tt>kept_delims</tt> are
|
|
strings of characters where each character is used as delimiter during
|
|
tokenizing. Whenever a delimiter is seen in the input sequence, the
|
|
current token is finished, and a new token begins.
|
|
|
|
The delimiters in <tt>dropped_delims</tt> do not show up as tokens in
|
|
the output whereas the delimiters in <tt>kept_delims</tt> do show up
|
|
as tokens. If <tt>empty_tokens</tt> is <tt>drop_empty_tokens</tt>,
|
|
then empty tokens will not show up in the output. If
|
|
<tt>empty_tokens</tt> is <tt>keep_empty_tokens</tt> then empty tokens
|
|
will show up in the output.
|
|
</p>
|
|
|
|
<hr>
|
|
|
|
<pre>
|
|
explicit char_separator()
|
|
</pre>
|
|
<p>
|
|
The function <tt>std::isspace()</tt> is used to identify dropped
|
|
delimiters and <tt>std::ispunct()</tt> is used to identify kept
|
|
delimiters. In addition, empty tokens are dropped.
|
|
</p>
|
|
|
|
<hr>
|
|
|
|
<pre>
|
|
template <typename InputIterator, typename Token>
|
|
bool operator()(InputIterator& next, InputIterator end, Token& tok)
|
|
</pre>
|
|
|
|
<p>
|
|
This function is called by the <a
|
|
href="token_iterator.htm"><tt>token_iterator</tt></a> to perform
|
|
tokenizing. The user typically does not call this function directly.
|
|
</p>
|
|
|
|
|
|
<hr>
|
|
|
|
<p>© Copyright Jeremy Siek and John R. Bandela 2001-2002. Permission
|
|
to copy, use, modify, sell and distribute this document is granted
|
|
provided this copyright notice appears in all copies. This document is
|
|
provided "as is" without express or implied warranty, and
|
|
with no claim as to its suitability for any purpose.</p>
|
|
</body>
|
|
</html>
|