[/
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Tutorial]

[section Terminology]

First, let's cover some terminology that we'll be using throughout the docs:

A /semantic action/ is an arbitrary bit of logic associated with a parser,
that is only executed when the parser succeeds.

Simpler parsers can be combined to form more complex parsers. Given some
combining operation `C`, and parsers `P0`, `P1`, ... `PN`, `C(P0, P1, ... PN)`
creates a new parser `Q`. This creates a /parse tree/. `Q` is the parent of
`P1`, `P2` is the child of `Q`, etc. The parsers are applied in the top-down
fashion implied by this. When you use `Q` to parse a string, it will use
`P0`, `P1`, etc. to do the actual work. If `P3` is being used to parse the
input, that means that `Q` is as well, since the way `Q` parses is by
dispatching to its children to do some or all of the work. At any point in
the parse, there will be exactly one parser without children that is being
used to parse the input; all other parsers being used are its ancestors in the
parse tree.

A /subparser/ is a parser that is the child of another parser.

The /top-level parser/ is the root of the tree of parsers.

The /current parser/ or /innermost parser/ is the parser with no children that
is currently being used to parse the input.

A /rule/ is a kind of parser that makes building large, complex parsers
easier. A /subrule/ is a rule that is the child of some other rule. The
/current rule/ or /innermost rule/ is the one rule currently being used to
parse the input that has no subrules. Note that while there is always exactly
one current parser, there may or may not be a current rule _emdash_ rules are
one kind of parser, and you may or may not be using them in your top-level
parser.

The /top-level parse/ is the parse operation being performed by the top-level
parser. This term is necessary, because though most parse failures are local
to a particular parser, some parse failures cause the call to _p_ to indicate
failure of the entire parse. For these cases, we say that such a local
failure "causes the top-level parse to fail".

There are a couple of special kinds of parsers that come up often in this
documentation.

One is a /sequence parser/; you will see it created using `operator>>()`, as
in `p1 >> p2 >> p3`. A sequence parser tries to match all of its subparsers
to the input, one at a time, in order. It matches the input iff all its
subparsers do.

The other is an /alternative parser/; you will see it created using
`operator|()`, as in `p1 | p2 | p3`. An alternative parser tries to match its
subparsers to the input, one at a time, in order; it stops as soon as one
subparser matches. It matches the input iff one of its subparsers does.

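To make the two combinators concrete, here is a minimal standalone sketch that
uses only the standard library. It is not how _Parser_ is implemented, and
every name in it (`toy_parser`, `lit`, `seq`, `alt`, `toy_parse`) is
hypothetical; it only illustrates "all children, in order" versus "first child
that matches".

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// A hypothetical "parser": on a match it returns true and advances pos;
// otherwise it returns false and leaves pos untouched.
using toy_parser = std::function<bool(std::string_view, std::size_t &)>;

// Matches exactly the character c.
toy_parser lit(char c)
{
    return [c](std::string_view in, std::size_t & pos) {
        if (pos < in.size() && in[pos] == c) {
            ++pos;
            return true;
        }
        return false;
    };
}

// Sequence: every child must match, one after another.
toy_parser seq(std::vector<toy_parser> children)
{
    return [children](std::string_view in, std::size_t & pos) {
        std::size_t p = pos;
        for (auto const & child : children) {
            if (!child(in, p))
                return false; // pos is not updated on failure
        }
        pos = p;
        return true;
    };
}

// Alternative: children are tried in order; the first match wins.
toy_parser alt(std::vector<toy_parser> children)
{
    return [children](std::string_view in, std::size_t & pos) {
        for (auto const & child : children) {
            std::size_t p = pos;
            if (child(in, p)) {
                pos = p;
                return true;
            }
        }
        return false;
    };
}

// Runs p at the start of in and reports how many characters it consumed.
bool toy_parse(toy_parser const & p, std::string_view in, std::size_t & consumed)
{
    consumed = 0;
    return p(in, consumed);
}
```

With these definitions, `seq({lit('a'), lit('b')})` consumes two characters of
`"abc"` and rejects `"ac"`, while `alt({lit('a'), lit('b')})` accepts `"b"`
through its second child.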
_Parser_ parsers each have an attribute associated with them, or explicitly
have no attribute. An attribute is a value that the parser generates when it
matches the input. For instance, the parser _d_ generates a `double` when it
matches the input. Since it is not possible to write a type trait that
returns the attribute type of a parser, we need notation for concisely
communicating that relationship. _ATTR_ is a notional macro that expands to
the attribute type of the parser passed to it; `_ATTR_np_(_d_)` is `double`.

Next, we'll look at some simple programs that parse using _Parser_. We'll
start small and build up from there.

[endsect]

[section Hello, Whomever]

This is just about the most minimal example of using _Parser_ that one could
write. We take a string from the command line, or `"World"` if none is given,
and then we parse it:

[hello_example]

The expression `*bp::char_` is a parser-expression. It uses one of the many
parsers that _Parser_ provides, _ch_. Like all _Parser_ parsers, it has
certain operations defined on it. In this case, `*bp::char_` is using an
overloaded `operator*()` as the C++ version of a _kl_ operator. Since C++ has
no postfix unary `*` operator, we have to use the one we have, so it is used
as a prefix.

So, `*bp::char_` means "any number of characters". In other words, it really
cannot fail. Even an empty string will match it.

The parse operation is performed by calling the _p_ function, passing the
parser as one of the arguments:

    bp::parse(input, *bp::char_, result);

The arguments here are: `input`, the string to parse; `*bp::char_`, the parser
used to do the parse; and `result`, an out-parameter into which to put the
result of the parse. Don't get too caught up on this method of getting the
parse result out of _p_; there are multiple ways of doing so, and we'll cover
all of them in subsequent examples.

Also, just ignore for now the fact that _Parser_ somehow figured out that the
result type of the `*bp::char_` parser is a `std::string`. There are clear
rules for this that we'll cover later.

The effect of this call to _p_ is not very interesting _emdash_ since the
parser we gave it cannot ever fail, and because we're placing the output in
the same type as the input, it just copies the contents of `input` to
`result`.

[endsect]

[section A Trivial Example]

Let's look at a slightly more complicated example, even if it is still
trivial. Instead of taking any old `char`s we're given, let's require some
structure. Let's parse one or more `double`s, separated by commas.

The _Parser_ parser for `double` is _d_. So, to parse a single `double`, we'd
use _d_. If we wanted to parse two `double`s in a row, we'd use:

    boost::parser::double_ >> boost::parser::double_

`operator>>()` in this expression is the sequence-operator; read it as
"followed by". If we combine the sequence-operator with _kl_, we can get the
parser we want by writing:

    boost::parser::double_ >> *(',' >> boost::parser::double_)

This is a parser that matches at least one `double` _emdash_ because of the
first _d_ in the expression above _emdash_ followed by zero or more instances
of a-comma-followed-by-a-`double`. Notice that we can use `','` directly.
Though it is not a parser, `operator>>()` and the other operators defined on
_Parser_ parsers have overloads that accept character/parser pairs of
arguments; these operator overloads will create the right parser to recognize
`','`.

[trivial_example]

The first example filled in an out-parameter to deliver the result of the
parse. This call to _p_ returns a result instead. As you can see, the result
is contextually convertible to `bool`, and `*result` is some sort of range.
In fact, the return type of this call to _p_ is
`std::optional<std::vector<double>>`. Naturally, if the parse fails,
`std::nullopt` is returned. We'll look at how _Parser_ maps the type of the
parser to the return type, or the filled-in out-parameter's type, a bit later.

If I run it in a shell, this is the result:

[pre
$ example/trivial
Enter a list of doubles, separated by commas. No pressure. 5.6,8.9
Great! It looks like you entered:
5.6
8.9
$ example/trivial
Enter a list of doubles, separated by commas. No pressure. 5.6, 8.9
Good job! Please proceed to the recovery annex for cake.
]

It does not recognize `"5.6, 8.9"`. This is because it expects a comma
followed /immediately/ by a `double`, but I inserted a space after the comma.
The same failure to parse would occur if I put a space before the comma, or
before or after the list of `double`s.

[endsect]

[section A Trivial Example That Gracefully Handles Whitespace]

Let's modify the trivial parser we just saw to ignore any spaces that might
exist among the `double`s and commas. To skip whitespace wherever we find it,
we can pass a /skip parser/ to our call to _p_ (we don't need to touch the
parser passed to _p_). Here, we use `ascii::space`, which matches any ASCII
character `c` for which `std::isspace(c)` is true.

[trivial_skipper_example]

The skip parser, or /skipper/, is run between the subparsers within the
parser passed to _p_. In this case, the skipper is run before the first
`double` is parsed, before any subsequent comma or `double` is parsed, and at
the end. So, the strings `"3.6,5.9"` and `" 3.6 , \t 5.9 "` are parsed the
same by this program.

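The places where the skipper runs can be seen in a hand-written equivalent of
`double_ >> *(',' >> double_)`. This is only a sketch using the standard
library, not _Parser_ code: `skip_space`, `parse_double`, and `parse_list` are
hypothetical names, and the `skip` flag stands in for passing or not passing a
skip parser.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Hypothetical skipper: advances pos past any whitespace.
void skip_space(std::string const & in, std::size_t & pos)
{
    while (pos < in.size() &&
           std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
}

// Hypothetical double parser: the number must start exactly at pos.
bool parse_double(std::string const & in, std::size_t & pos, double & out)
{
    // std::strtod() skips leading whitespace on its own, so reject it here.
    if (pos >= in.size() || std::isspace(static_cast<unsigned char>(in[pos])))
        return false;
    char const * first = in.c_str() + pos;
    char * last = nullptr;
    out = std::strtod(first, &last);
    if (last == first)
        return false;
    pos += static_cast<std::size_t>(last - first);
    return true;
}

// Hand-written analogue of  double_ >> *(',' >> double_)  that must match
// the whole input. When skip is true, the skipper runs before each piece
// and once more at the end.
std::optional<std::vector<double>> parse_list(std::string const & in, bool skip)
{
    std::vector<double> result;
    std::size_t pos = 0;
    double d = 0;

    if (skip)
        skip_space(in, pos);
    if (!parse_double(in, pos, d))
        return std::nullopt;
    result.push_back(d);

    for (;;) {
        if (skip)
            skip_space(in, pos);
        if (pos >= in.size() || in[pos] != ',')
            break;
        ++pos;
        if (skip)
            skip_space(in, pos);
        if (!parse_double(in, pos, d))
            return std::nullopt;
        result.push_back(d);
    }

    if (pos != in.size())
        return std::nullopt;
    return result;
}
```

Calling `parse_list("5.6, 8.9", false)` fails at the space after the comma,
whereas `parse_list(" 3.6 , \t 5.9 ", true)` yields the two values, mirroring
the behavior described above.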
Skipping is an important concept in _Parser_. You can skip anything, not just
ASCII whitespace; there are lots of other things you might want to skip. The
skipper you pass to _p_ can be an arbitrary parser. For example, if you write
a parser for a scripting language, you can write a skipper to skip whitespace,
inline comments, and end-of-line comments.

We'll be using skip parsers almost exclusively in the rest of the
documentation. The ability to ignore the parts of your input that you don't
care about is so convenient that parsing without skipping is a rarity in
practice.

[endsect]

[section Semantic Actions]

Like all parsing systems (lex & yacc, _Spirit_, etc.), _Parser_ has a
mechanism for associating semantic actions with different parts of the parse.
Here is nearly the same program as we saw in the previous example, except that
it is implemented in terms of a semantic action that appends each parsed
`double` to a result, instead of automatically building and returning the
result:

[semantic_action_example]

Run in a shell, it looks like this:

[pre
$ example/semantic_actions
Enter a list of doubles, separated by commas. 4,3
Got one!
Got one!
You entered:
4
3
]

In _Parser_, semantic actions are implemented in terms of invocable objects
that take a single parameter, a reference to a parse-context object. In the
example we used this lambda as our invocable:

[semantic_action_example_lambda]

We're both printing a message to `std::cout` and recording a parsed result in
the lambda. It could do both, either, or neither of these things if you like.
The way we get the parsed `double` in the lambda is by asking the parse
context for it. `_attr(ctx)` is how you ask the parse context for the
attribute produced by the parser to which the semantic action is attached.
There are lots of functions like `_attr()` that can be used to access the
state in the parse context. We'll cover more of them later on. The next
section defines what exactly the parse context is and how it works.

TODO: Briefly introduce rules here.

[endsect]

[section The Parse Context]

Now would be a good time to describe the parse context in some detail. Any
semantic action that you write will need to use the state in the parse
context, so you need to know what's available.

The parse context is a `hana::map` from tag types to elements. Elements are
added to or removed from it at different times during the parse. For
instance, when a parser with a semantic action succeeds, it adds the attribute
it produces to the parse context, then calls the invocable semantic action.
This is efficient to do, because the `hana::map` remains fairly small, usually
around ten elements, and each element is stored as a pointer. Copying the
entire map when mutating the context is therefore fast.

[note All these functions that take the parse context as their first parameter
will be found by Argument-Dependent Lookup. You will probably never need
to qualify them with `boost::parser::`.]

[heading Accessors for data that are always available]

By convention, the names of all _Parser_ functions that take a parse context,
and are therefore intended for use inside semantic actions, contain a leading
underscore.

[heading _pass_]

_pass_ returns a reference to a `bool` indicating the success or failure of
the current parse. This can be used to force the current parse to pass or
fail:

    [](auto & ctx) {
        // If the attribute meets this predicate, fail the parse.
        if (some_condition(_attr(ctx)))
            _pass(ctx) = false;
    }

Note that for a semantic action to be executed, its associated parser must
already have succeeded. So unless you previously wrote `_pass(ctx) = false`
somewhere, `_pass(ctx) = true` does nothing; it's redundant.

[heading _begin_, _end_ and _where_]

_begin_ and _end_ return the beginning and end of the range that you passed to
_p_, respectively. _where_ returns a _v_ indicating the bounds of the input
matched by the current parse. _where_ can be useful if you just want to parse
some text and return a result consisting of where certain elements are
located, without producing any other attributes.

[heading _error_handler_]

_error_handler_ returns a reference to the error handler associated with the
parser passed to _p_. Any error handler must have the following member
functions:

[error_handler_api_1]

[error_handler_api_2]

If you call the second one, the one without the iterator parameter, it will
call the first with `_where(ctx).begin()` as the iterator parameter. The
one without the iterator is the one you will use most often. The one with the
explicit iterator parameter can be useful in situations where you have
messages that are related to each other, associated with multiple locations.
For instance, if you are parsing XML, you may want to report that a close-tag
does not match its associated open-tag by showing the line where the open-tag
was found. That may of course not be located anywhere near
`_where(ctx).begin()`. (A description of _globals_ is below.)

    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                open_tag_msg,
                ctx,
                _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                close_tag_msg,
                ctx);

            // Explicitly fail the parse. Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }

[heading _report_error_ and _report_warning_]

There are also some convenience functions that make the above code a little
less verbose, _report_error_ and _report_warning_:

    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _report_error(ctx, open_tag_msg, _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _report_error(ctx, close_tag_msg);

            // Explicitly fail the parse. Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }

You should use these less verbose functions almost all the time. The only
time you would want to use _error_handler_ is when you are using a custom
error handler, and you want access to some part of its interface besides
`diagnose()`.

[heading Accessors for data that are only sometimes available]

[heading _attr_]

_attr_ returns a reference to the value of the current parser's attribute. It
is available only when the current parser's parse is successful. If the
parser has no semantic action, no attribute gets added to the parse context.
It can be used to read and write the current parser's attribute:

    [](auto & ctx) { _attr(ctx) = 3; }

If the current parser has no attribute, a _n_ is returned.

[heading _val_]

_val_ returns a reference to the value of the attribute of the current rule
being used to parse (if any), and is available even before the rule's parse is
successful. It can be used to set the current rule's attribute, even from a
parser that is a subparser inside the rule. Let's say we're writing a parser
with a semantic action that is within a rule. If we want to set the current
rule's value to whatever this subparser parses, we would write this semantic
action:

    [](auto & ctx) { _val(ctx) = _attr(ctx); }

If there is no current rule, or the current rule has no attribute, a _n_ is
returned.

[heading _globals_]

_globals_ returns a reference to a user-supplied struct that contains whatever
data you want to use during the parse. We'll get into this more later, but
for now, here's how you might use it:

    [](auto & ctx) {
        // black_list is some set of proscribed values that are not allowed.
        if (_globals(ctx).black_list.contains(_attr(ctx)))
            _pass(ctx) = false;
    }

[heading _locals_]

_locals_ returns a reference to one or more values that are local to the
current rule being parsed, if any. If there are two or more local values,
_locals_ returns a reference to a `hana::tuple`. Rules are something we
haven't gotten to yet, but here is how you use _locals_:

    [](auto & ctx) {
        auto & local = _locals(ctx);
        // Use local here. If it is a hana::tuple, access its members like this:
        using namespace hana::literals;
        auto & first_element = local[0_c];
        auto & second_element = local[1_c];
    }

If there is no current rule, or the current rule has no locals, a _n_ is
returned.

[heading _params_]

_params_, like _locals_, applies to the current rule being used to parse, if
any. It also returns a reference to a single value, if the current rule has
only one parameter, or a `hana::tuple` of multiple values if the current rule
has multiple parameters.

If there is no current rule, or the current rule has no parameters, a _n_ is
returned.

[note _n_ is a type that is used as a return value in _Parser_ for parse
context accessors. _n_ is convertible to anything that has a default
constructor, convertible from anything, assignable from anything, and has
templated overloads for all the overloadable operators. The intention is that
a misuse of _val_, _globals_, etc. should compile, and produce an assertion at
runtime. Experience has shown that using a debugger for investigating the
stack that leads to your mistake is a far better user experience than sifting
through compiler diagnostics. See the _rationale_ section for a more detailed
explanation.]

[endsect]

[section Symbol Tables]

When writing a parser, it often comes up that there is a set of strings that,
when parsed, are associated with a set of values 1-to-1. It is tedious to
write parsers that recognize all the possible input strings when you have to
associate each one with an attribute via a semantic action. Instead, we can
use a symbol table.

Say we want to parse Roman numerals, one of the most common work-related
parsing problems. We want to recognize numbers that start with any number of
"M"s, representing thousands, followed by the hundreds, the tens, and the
ones. Any of these may be absent from the input, but not all. Here are three
_Parser_ symbol tables that we can use to recognize ones, tens, and hundreds
values, respectively:

[roman_numeral_symbol_tables]

A _symbols_ maps strings of `char` to their associated attributes. The type
of the attribute must be specified as a template parameter to _symbols_
_emdash_ `int` in this case.

Any "M"s we encounter should add 1000 to the result, and all other values come
from the symbol tables. Here are the semantic actions we'll need to do that:

[roman_numeral_actions]

`add_1000` just adds `1000` to `result`. `add` adds whatever attribute is
produced by its parser to `result`.

Now we just need to put the pieces together to make a parser:

[roman_numeral_parser]

We've got a few new bits in play here, so let's break it down. `'M'_l` is a
/literal parser/. That is, it is a parser that parses a literal `char`, code
point, or string. In this case, a `char` "M" is being parsed. The `_l` bit
at the end is a _udl_ suffix that you can put after any `char`, `char32_t`, or
`char const *` to form a literal parser. You can also make a literal parser
by writing _lit_ for some `x` of one of the previously mentioned types.

Why do we need any of this, considering that we just used a literal `','` in
our previous example? The reason is that `'M'` is not used in an expression
with another _Parser_ parser. It is used within `*'M'_l[add_1000]`. If we'd
written `*'M'[add_1000]`, clearly that would be ill-formed; `char` has no
`operator*()`, nor an `operator[]()`, associated with it.

[tip Any time you want to use a `char`, `char32_t`, or string literal in a
_Parser_ parser, write it as-is if it is combined with a preexisting _Parser_
subparser `p`, as in `'x' >> p`. Otherwise, you need to wrap it in a call to
_lit_, or use the `_l` _udl_ suffix.]

On to the next bit: `-hundreds[add]`. By now, the use of the index operator
should be pretty familiar; it associates the semantic action `add` with the
parser `hundreds`. The `operator-()` at the beginning is new. It means that
the parser it is applied to is optional. You can read it as "zero or one".
So, if `hundreds` is not successfully parsed after `*'M'_l[add_1000]`, nothing
happens, because `hundreds` is allowed to be missing _emdash_ it's optional.
If `hundreds` is parsed successfully, say by matching `"CC"`, the resulting
attribute, `200`, is added to `result` inside `add`.

Here is the full listing of the program. Notice that it would have been
inappropriate to use a whitespace skipper here, since the entire parse is a
single number, so it was removed.

[roman_numeral_example]

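For readers who want to see the idea without the library, the following is a
rough standalone sketch of the same strategy: consume any number of "M"s, then
optionally look up a hundreds, a tens, and a ones symbol. It does not use
_Parser_ at all; `table`, `match_symbol`, and `parse_roman` are hypothetical
names, the table contents are the conventional Roman numeral forms rather than
a copy of the listing above, and the longest-match lookup is an assumption of
this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// A hypothetical symbol table: (string, value) pairs.
using table = std::vector<std::pair<std::string_view, int>>;

// Finds the longest key that is a prefix of in at pos. On success, stores
// the associated value, advances pos past the key, and returns true.
bool match_symbol(
    table const & t, std::string_view in, std::size_t & pos, int & value)
{
    std::size_t best_len = 0;
    for (auto const & [key, val] : t) {
        if (key.size() > best_len && in.substr(pos, key.size()) == key) {
            best_len = key.size();
            value = val;
        }
    }
    pos += best_len;
    return best_len != 0;
}

std::optional<int> parse_roman(std::string_view in)
{
    static table const hundreds = {
        {"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400}, {"D", 500},
        {"DC", 600}, {"DCC", 700}, {"DCCC", 800}, {"CM", 900}};
    static table const tens = {
        {"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
        {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};
    static table const ones = {
        {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
        {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};

    int result = 0;
    std::size_t pos = 0;

    // Plays the role of *'M'_l[add_1000].
    while (pos < in.size() && in[pos] == 'M') {
        result += 1000;
        ++pos;
    }

    // Each lookup is optional, like -hundreds[add].
    int value = 0;
    if (match_symbol(hundreds, in, pos, value))
        result += value;
    if (match_symbol(tens, in, pos, value))
        result += value;
    if (match_symbol(ones, in, pos, value))
        result += value;

    // Something must have matched, and nothing may be left over.
    if (in.empty() || pos != in.size())
        return std::nullopt;
    return result;
}
```

Note how each table lookup both recognizes a string and produces its value,
which is the convenience a symbol table gives you over writing one literal
parser and one semantic action per string.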
[endsect]

[section Mutable Symbol Tables]

The previous example showed how to use a symbol table as a fixed lookup table.
What if we want to add things to the table during the parse? We can do that,
but we need to do so within a semantic action. First, here is our symbol
table, already with a single value in it:

[self_filling_symbol_table_table]

No surprise that it works to use the symbol table as a parser to parse the one
string in the symbol table. Now, here's our parser:

[self_filling_symbol_table_parser]

Here, we've attached the semantic action not to a simple parser like _d_, but
to the sequence parser `(bp::char_ >> bp::int_)`. This sequence parser
contains two parsers, each with its own attribute, so it produces two
attributes as a `hana::tuple`.

[self_filling_symbol_table_action]

Inside the semantic action, we can get the first element of the attribute
tuple using _udls_ provided by Boost.Hana, and `hana::tuple::operator[]()`.
The first attribute, from the _ch_, is `_attr(ctx)[0_c]`, and the second, from
the _i_, is `_attr(ctx)[1_c]`. To add the symbol to the symbol table, we call
`insert()`.

[self_filling_symbol_table_parser]

During the parse, `("X", 9)` is parsed and added to the symbol table. Then,
the second `'X'` is recognized by the symbol table parser. However:

[self_filling_symbol_table_after_parse]

If we parse again, we find that `"X"` did not stay in the symbol table. The
fact that `symbols` was declared const might have given you a hint that this
would happen. Also, notice that the call to `insert()` in the semantic action
uses the parse context; that's where all the symbol table changes are stored
during the parse.

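One way to picture this behavior is a constant table plus a set of additions
that lives only as long as a single parse. The sketch below models that idea
with standard containers; it is an assumption made for illustration, not a
description of how _Parser_ stores symbol table changes, and `symbol_map`,
`toy_parse_state`, `lookup`, and `parse_once` are all hypothetical names.

```cpp
#include <map>
#include <optional>
#include <string>

using symbol_map = std::map<std::string, int>;

// Hypothetical per-parse state; a fresh one is created for every parse.
struct toy_parse_state
{
    symbol_map pending; // symbols recorded during this parse only
};

// Looks in the per-parse additions first, then in the untouched table.
std::optional<int> lookup(
    symbol_map const & table,
    toy_parse_state const & state,
    std::string const & key)
{
    if (auto it = state.pending.find(key); it != state.pending.end())
        return it->second;
    if (auto it = table.find(key); it != table.end())
        return it->second;
    return std::nullopt;
}

// Models one parse: record (new_key, new_value) the way an insert() inside
// a semantic action would, then look up wanted later in the same parse.
std::optional<int> parse_once(
    symbol_map const & table,
    std::string const & new_key,
    int new_value,
    std::string const & wanted)
{
    toy_parse_state state;
    state.pending[new_key] = new_value;
    return lookup(table, state, wanted);
}
```

Because `state` is destroyed when `parse_once()` returns, a symbol added during
one call is visible later in that call but not in the next one, and the
constant table itself is never modified.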
The full program:

[self_filling_symbol_table_example]

[note It is possible to add symbols to a _symbols_ permanently. To do so, you
have to use a mutable _symbols_ object `s`, and add the symbols by calling
`s.add()`, instead of `s.insert()`.]

[endsect]

[section Alternative Parsers]

Frequently, you need to parse something that might have one of several forms.
`operator|()` is overloaded to form alternative parsers. For example:

    namespace bp = boost::parser;
    auto const parser_1 = bp::int_ | bp::eps;

`parser_1` matches an integer, or if that fails, it matches /epsilon/, the
empty string. This is equivalent to writing:

    namespace bp = boost::parser;
    auto const parser_2 = -bp::int_;

However, neither `parser_1` nor `parser_2` is equivalent to writing this:

    namespace bp = boost::parser;
    auto const parser_3 = bp::eps | bp::int_;

The reason is that alternative parsers try each of their subparsers, one at a
time, and stop on the first one that matches. /Epsilon/ matches anything,
since it is zero length and consumes no input. It even matches the end of
input. This means that `parser_3` is equivalent to _e_ by itself.

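The effect of the ordering can be checked with a tiny standalone model of an
ordered alternative. This sketch does not use _Parser_; `matcher`,
`match_int`, `match_eps`, and `match_alt` are hypothetical stand-ins, and
`match_int` only recognizes unsigned decimal digits.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Each hypothetical matcher returns the number of characters it consumed
// from the front of the input, or std::nullopt if it does not match.
using matcher = std::optional<std::size_t> (*)(std::string_view);

// Stand-in for an integer parser: one or more decimal digits.
std::optional<std::size_t> match_int(std::string_view in)
{
    std::size_t n = 0;
    while (n < in.size() && std::isdigit(static_cast<unsigned char>(in[n])))
        ++n;
    if (n == 0)
        return std::nullopt;
    return n;
}

// Stand-in for epsilon: always matches, and consumes nothing.
std::optional<std::size_t> match_eps(std::string_view)
{
    return std::size_t(0);
}

// Ordered alternative: try first, and only if it fails, try second.
std::optional<std::size_t>
match_alt(matcher first, matcher second, std::string_view in)
{
    if (auto n = first(in))
        return n;
    return second(in);
}
```

Here `match_alt(match_int, match_eps, "42")` consumes both digits, but
`match_alt(match_eps, match_int, "42")` consumes nothing, because the epsilon
stand-in succeeds before the integer matcher is ever tried.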
[endsect]

[section The Parsers And Their Uses]

_Parser_ comes with all the parsers most parsing tasks will ever need. (You
can also write your own; we'll cover that later.) Each one is a `constexpr`
object, or a `constexpr` function. Some of the non-functions are also
callable, such as _ch_, which may be used directly, or with arguments, as in
_ch_`('a', 'z')`. Any parser that can be called, whether a function or
callable object, will be called a /callable parser/ from now on. Note that
there are no nullary callable parsers; they each take one or more arguments.

Each callable parser takes one or more /parse arguments/. A parse argument
may be a value or an invocable object that accepts a reference to the parse
context. The reference parameter may be mutable or constant. For example:

    struct get_attribute
    {
        template<typename Context>
        auto operator()(Context & ctx)
        {
            return _attr(ctx);
        }
    };

This can also be a lambda. For example:

    [](auto const & ctx) { return _attr(ctx); }

The operation that produces a value from a parse argument, which may be a
value or a callable taking a parse context argument, is referred to as
/resolving/ the parse argument.

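As a rough illustration of what resolving means, the sketch below returns a
plain value unchanged and calls an invocable with the context. It is a
simplified model under the assumption that "invocable with a context
reference" is the only test applied; `resolve` and `toy_context` are
hypothetical names, not part of the _Parser_ interface.

```cpp
#include <type_traits>

// Hypothetical stand-in for a parse context.
struct toy_context
{
    char last_char;
};

// If arg can be called with the context, call it and return the result;
// otherwise, treat arg as an ordinary value and return a copy of it.
template<typename Context, typename Arg>
auto resolve(Context & ctx, Arg const & arg)
{
    if constexpr (std::is_invocable_v<Arg const &, Context &>)
        return arg(ctx);
    else
        return arg;
}
```

With a `toy_context ctx{'q'}`, `resolve(ctx, 'a')` yields `'a'`, while
`resolve(ctx, [](toy_context & c) { return c.last_char; })` yields `'q'`.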
Some callable parsers take a /parse predicate/. A parse predicate is not
quite the same as a parse argument, because it must be a callable object, and
cannot be a value. A parse predicate's return type must be contextually
convertible to `bool`. For example:

    struct equals_three
    {
        template<typename Context>
        bool operator()(Context const & ctx)
        {
            return _attr(ctx) == 3;
        }
    };

This may of course be a lambda:

    [](auto & ctx) { return _attr(ctx) == 3; }

An example of how parse arguments are used:

    namespace bp = boost::parser;
    // This parser matches one code point that is at least 'a', and at most
    // the value of last_char, which comes from the globals.
    auto last_char = [](auto & ctx) { return _globals(ctx).last_char; };
    auto subparser = bp::char_('a', last_char);

Don't worry for now about what the globals are; the take-away is that
you can make any argument you pass to a parser depend on the current state of
the parse, by using the parse context:

    namespace bp = boost::parser;
    // This parser parses two code points. For the parse to succeed, the
    // second one must be >= 'a' and <= the first one.
    auto set_last_char = [](auto & ctx) { _globals(ctx).last_char = _attr(ctx); };
    auto parser = bp::char_[set_last_char] >> subparser;

Each callable parser returns a new parser, parameterized using the arguments
given in the invocation.

TODO: This is way too long for a tutorial. Put this after the examples, in a
reference section separate from the headers-reference (consider moving other
long tables, too). Instead, just cover a few exemplars, like _i_, _f_, char_,
string.

This table lists all the _Parser_ parsers. For the callable parsers, a
separate entry exists for each possible arity of arguments. For a parser `p`,
if there is no entry for `p` without arguments, `p` is a function, and cannot
itself be used as a parser; it must be called. In the table below:

* each entry is a global object usable directly in your parsers, unless
  otherwise noted;

* "code point" is used to refer to the elements of the input range, which
  assumes that the parse is being done in the Unicode-aware code path (if the
  parse is being done in the non-Unicode code path, read "code point" as
  "`char`");

* _RES_ is a notional macro that expands to the resolution of a parse argument
  or the evaluation of a parse predicate;

* "`_RES_np_(pred) == true`" is a shorthand notation for "`_RES_np_(pred)` is
  contextually convertible to `true`", and likewise for `false`;

* `c` is a character of type `char`, `char8_t`, or `char32_t`;

* `str` is a string literal of type `char const[]`, `char8_t const []`, or
  `char32_t const []`;

* `pred` is a parse predicate;

* `arg0`, `arg1`, `arg2`, ... are parse arguments;

* `a` is a semantic action;

* `r` is an object whose type models `parsable_range_like`; and

* `p`, `p1`, `p2`, ... are parsers.

[note The definition of `parsable_range_like` is:

[parsable_range_like_concept]

It is intended to be a range-like thing; a null-terminated sequence of
characters is considered range-like, given that a pointer `T *` to a
null-terminated string is isomorphic with `view<T *,
boost::text::null_sentinel>`.]

[note Some of the parsers in this table consume no input. All parsers consume
the input they match unless otherwise stated in the table below.]

[table Parsers and Their Semantics
|
|
[[Parser] [Semantics] [Attribute Type] [Notes]]
|
|
|
|
[[ _e_ ]
|
|
[ Matches /epsilon/, the empty string. Always matches, and consumes no input. ]
|
|
[ None. ]
|
|
[]]
|
|
|
|
[[ `_e_(pred)` ]
|
|
[ Fails to match the input if `_RES_np_(pred) == false`. Otherwise, the semantics are those of _e_. ]
|
|
[ None. ]
|
|
[]]
|
|
|
|
[[ _ws_ ]
|
|
[ Matches a single whitespace code point (see note), according to the Unicode White_Space property. ]
|
|
[ None. ]
|
|
[ For more info, see the [@https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt Unicode properties]. _ws_ may consume one code point or two. It only consumes two code points when it matches `"\r\n"`. ]]
|
|
|
|
[[ _eol_ ]
|
|
[ Matches a single newline (see note), following the "hard" line breaks in the Unicode line breaking algorithm. ]
|
|
[ None. ]
|
|
[ For more info, see the [@https://unicode.org/reports/tr14 Unicode Line Breaking Algorithm]. _eol_ may consume one code point or two. It only consumes two code points when it matches `"\r\n"`. ]]
|
|
|
|
[[ _eoi_ ]
|
|
[ Matches only at the end of input, and consumes no input. ]
|
|
[ None. ]
|
|
[]]
|
|
|
|
[[ _attr_np_`(arg0)` ]
|
|
[ Always matches, and consumes no input. Generates the attribute `_RES_np_(arg0)`. ]
|
|
[ `decltype(_RES_np_(arg0))`. ]
|
|
[]]
|
|
|
|
[[ _ch_ ]
|
|
[ Matches any single code point. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ]
|
|
[]]
|
|
|
|
[[ `_ch_(arg0)` ]
|
|
[ Matches exactly the code point `_RES_np_(arg0)`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ]
|
|
[]]
|
|
|
|
[[ `_ch_(arg0, arg1)` ]
|
|
[ Matches the next code point in the input `n`, if `_RES_np_(arg0) <= n && n <= _RES_np_(arg1)`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ]
|
|
[]]
|
|
|
|
[[ `_ch_(r)` ]
|
|
    [ Matches the next code point `n` in the input, if `n` is one of the code points in `r`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ]
|
|
    [ `r` is taken to be in a UTF encoding. The exact UTF used depends on the size of `r`'s element type. If you do not pass UTF encoded ranges for `r`, the behavior of _ch_ is undefined. Note that ASCII is a subset of UTF-8, so ASCII is fine. EBCDIC may not be. `r` is not copied; a reference to it is taken. The lifetime of `_ch_(r)` must be within the lifetime of `r`. This overload of _ch_ does *not* take parse arguments. ]]
|
|
|
|
[[ _cp_ ]
|
|
[ Matches a single code point. ]
|
|
[ `uint32_t` ]
|
|
    [ Similar to _ch_, but with a fixed `uint32_t` attribute type; _cp_ has all the same call operator overloads as _ch_, though they are not repeated here, for brevity. ]]
|
|
|
|
[[ _cu_ ]
|
|
[ Matches a single code point. ]
|
|
[ `char` ]
|
|
    [ Similar to _ch_, but with a fixed `char` attribute type; _cu_ has all the same call operator overloads as _ch_, though they are not repeated here, for brevity. Even though the name "`cu`" suggests that this parser matches at the code unit level, it does not. The name refers to the attribute type generated, much like the names _i_ versus _ui_. ]]
|
|
|
|
[[ `_alnum_` ]
|
|
    [ Matches a single code point for which `std::isalnum()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_alpha_` ]
|
|
[ Matches a single code point for which `std::isalpha()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_blank_` ]
|
|
[ Matches a single code point for which `std::isblank()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_cntrl_` ]
|
|
[ Matches a single code point for which `std::iscntrl()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_digit_` ]
|
|
[ Matches a single code point for which `std::isdigit()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_graph_` ]
|
|
[ Matches a single code point for which `std::isgraph()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_print_` ]
|
|
[ Matches a single code point for which `std::isprint()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_punct_` ]
|
|
[ Matches a single code point for which `std::ispunct()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_space_` ]
|
|
[ Matches a single code point for which `std::isspace()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_xdigit_` ]
|
|
[ Matches a single code point for which `std::isxdigit()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_lower_` ]
|
|
[ Matches a single code point for which `std::islower()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ `_upper_` ]
|
|
[ Matches a single code point for which `std::isupper()` is `true`. ]
|
|
[ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ]
|
|
[ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]]
|
|
|
|
[[ _lit_np_`(c)`]
|
|
[ Matches exactly the given code point `c`. ]
|
|
[ None. ]
|
|
[_lit_ does *not* take parse arguments. ]]
|
|
|
|
[[ `c_l` ]
|
|
[ Matches exactly the given code point `c`. ]
|
|
[ None. ]
|
|
[ This is a _udl_ that represents `_lit_np_(c)`, for example `'F'_l`. ]]
|
|
|
|
[[ _lit_np_`(r)`]
|
|
[ Matches exactly the given string `r`. ]
|
|
[ None. ]
|
|
[ _lit_ does *not* take parse arguments. ]]
|
|
|
|
[[ `str_l` ]
|
|
[ Matches exactly the given string `str`. ]
|
|
[ None. ]
|
|
    [ This is a _udl_ that represents `_lit_np_(str)`, for example `"a string"_l`. ]]
|
|
|
|
[[ `_str_np_(r)`]
|
|
[ Matches exactly `r`, and generates the match as an attribute. ]
|
|
[ `std::string` ]
|
|
[ _str_ does *not* take parse arguments. ]]
|
|
|
|
[[ `str_p`]
|
|
[ Matches exactly `str`, and generates the match as an attribute. ]
|
|
[ `std::string` ]
|
|
    [ This is a _udl_ that represents `_str_np_(str)`, for example `"a string"_p`. ]]
|
|
|
|
[[ _b_ ]
|
|
[ Matches `"true"` or `"false"`. ]
|
|
[ `bool` ]
|
|
[]]
|
|
|
|
[[ _bin_ ]
|
|
[ Matches a binary unsigned integral value. ]
|
|
[ `unsigned int` ]
|
|
[ For example, _bin_ would match `"101"`, and generate an attribute of `5u`. ]]
|
|
|
|
[[ `_bin_(arg0)` ]
|
|
[ Matches exactly the binary unsigned integral value `_RES_np_(arg0)`. ]
|
|
[ `unsigned int` ]
|
|
[]]
|
|
|
|
[[ _oct_ ]
|
|
[ Matches an octal unsigned integral value. ]
|
|
[ `unsigned int` ]
|
|
[ For example, _oct_ would match `"31"`, and generate an attribute of `25u`. ]]
|
|
|
|
[[ `_oct_(arg0)` ]
|
|
[ Matches exactly the octal unsigned integral value `_RES_np_(arg0)`. ]
|
|
[ `unsigned int` ]
|
|
[]]
|
|
|
|
[[ _hex_ ]
|
|
    [ Matches a hexadecimal unsigned integral value. ]
|
|
[ `unsigned int` ]
|
|
[ For example, _hex_ would match `"ff"`, and generate an attribute of `255u`. ]]
|
|
|
|
[[ `_hex_(arg0)` ]
|
|
    [ Matches exactly the hexadecimal unsigned integral value `_RES_np_(arg0)`. ]
|
|
[ `unsigned int` ]
|
|
[]]
|
|
|
|
[[ _us_ ]
|
|
[ Matches an unsigned integral value. ]
|
|
[ `unsigned short` ]
|
|
[]]
|
|
|
|
[[ `_us_(arg0)` ]
|
|
[ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ]
|
|
    [ `unsigned short` ]
|
|
[]]
|
|
|
|
[[ _ui_ ]
|
|
[ Matches an unsigned integral value. ]
|
|
[ `unsigned int` ]
|
|
[]]
|
|
|
|
[[ `_ui_(arg0)` ]
|
|
[ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ]
|
|
[ `unsigned int` ]
|
|
[]]
|
|
|
|
[[ _ul_ ]
|
|
[ Matches an unsigned integral value. ]
|
|
[ `unsigned long` ]
|
|
[]]
|
|
|
|
[[ `_ul_(arg0)` ]
|
|
[ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ]
|
|
    [ `unsigned long` ]
|
|
[]]
|
|
|
|
[[ _ull_ ]
|
|
[ Matches an unsigned integral value. ]
|
|
[ `unsigned long long` ]
|
|
[]]
|
|
|
|
[[ `_ull_(arg0)` ]
|
|
[ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ]
|
|
    [ `unsigned long long` ]
|
|
[]]
|
|
|
|
[[ _s_ ]
|
|
[ Matches a signed integral value. ]
|
|
[ `short` ]
|
|
[]]
|
|
|
|
[[ `_s_(arg0)` ]
|
|
[ Matches exactly the signed integral value `_RES_np_(arg0)`. ]
|
|
    [ `short` ]
|
|
[]]
|
|
|
|
[[ _i_ ]
|
|
[ Matches a signed integral value. ]
|
|
[ `int` ]
|
|
[]]
|
|
|
|
[[ `_i_(arg0)` ]
|
|
[ Matches exactly the signed integral value `_RES_np_(arg0)`. ]
|
|
[ `int` ]
|
|
[]]
|
|
|
|
[[ _l_ ]
|
|
[ Matches a signed integral value. ]
|
|
[ `long` ]
|
|
[]]
|
|
|
|
[[ `_l_(arg0)` ]
|
|
[ Matches exactly the signed integral value `_RES_np_(arg0)`. ]
|
|
[ `long` ]
|
|
[]]
|
|
|
|
[[ _ll_ ]
|
|
[ Matches a signed integral value. ]
|
|
[ `long long` ]
|
|
[]]
|
|
|
|
[[ `_ll_(arg0)` ]
|
|
[ Matches exactly the signed integral value `_RES_np_(arg0)`. ]
|
|
[ `long long` ]
|
|
[]]
|
|
|
|
[[ _f_ ]
|
|
[ Matches a floating-point number. _f_ uses parsing implementation details from _Spirit_. The specifics of what formats are accepted can be found in their _spirit_reals_. Note that only the default `RealPolicies` is supported by _f_. ]
|
|
[ `float` ]
|
|
[]]
|
|
|
|
[[ _d_ ]
|
|
[ Matches a floating-point number. _d_ uses parsing implementation details from _Spirit_. The specifics of what formats are accepted can be found in their _spirit_reals_. Note that only the default `RealPolicies` is supported by _d_. ]
|
|
[ `double` ]
|
|
[]]
|
|
|
|
[[ `_rpt_np_(arg0)[p]` ]
|
|
[ Matches iff `p` matches exactly `_RES_np_(arg0)` times. ]
|
|
[ `std::vector<_ATTR_np_(p)>` ]
|
|
[ The special value _inf_ may be used; it indicates unlimited repetition. `decltype(_RES_np_(arg0))` must be implicitly convertible to `int64_t`. ]]
|
|
|
|
[[ `_rpt_np_(arg0, arg1)[p]` ]
|
|
[ Matches iff `p` matches between `_RES_np_(arg0)` and `_RES_np_(arg1)` times, inclusively. ]
|
|
[ `std::vector<_ATTR_np_(p)>` ]
|
|
[ The special value _inf_ may be used for the upper bound; it indicates unlimited repetition. `decltype(_RES_np_(arg0))` and `decltype(_RES_np_(arg1))` each must be implicitly convertible to `int64_t`. ]]
|
|
|
|
[[ `_if_np_(pred)[p]` ]
|
|
[ Equivalent to `_e_(pred) >> p`. ]
|
|
[ `std::optional<_ATTR_np_(p)>` ]
|
|
[ It is an error to write `_if_np_(pred)`. That is, it is an error to omit the conditionally matched parser `p`. ]]
|
|
|
|
[[ `_sw_np_(arg0)(arg1, p1)(arg2, p2) ...` ]
|
|
    [ Equivalent to `p1` when `_RES_np_(arg0) == _RES_np_(arg1)`, `p2` when `_RES_np_(arg0) == _RES_np_(arg2)`, etc. If there is no such `argN`, the behavior of _sw_ is undefined. ]
|
|
[ `std::variant<_ATTR_np_(p1), _ATTR_np_(p2), ...>` ]
|
|
[ It is an error to write `_sw_np_(arg0)`. That is, it is an error to omit the conditionally matched parsers `p1`, `p2`, .... ]]
|
|
|
|
[[ _symbols_t_ ]
|
|
    [ _symbols_ is an associative container of key, value pairs. Each key is a `std::string` and each value has type `T`. In the Unicode parsing path, the strings are considered to be UTF-8 encoded; in the non-Unicode path, no encoding is assumed. _symbols_ matches the longest prefix `pre` of the input that is equal to one of the keys `k`. If the length `len` of `pre` is zero, and there is no zero-length key, it does not match the input. If `len` is positive, the generated attribute is the value associated with `k`. ]
|
|
[ `T` ]
|
|
[ Unlike the other entries in this table, _symbols_ is a type, not an object. ]]
|
|
]
|
|
|
|
[note A slightly more complete description of the attributes generated by
these parsers is in the next section. The attributes are repeated here so you
can see all the properties of the parsers in one place.]
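
The longest-prefix rule described for _symbols_ in the table above can be
modeled in a few lines of standard C++. This is only a sketch of the matching
rule, not the library's implementation; the names `symbol_entry`,
`symbol_match`, and `match_longest_key` are hypothetical, and a plain array
stands in for the associative container.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical key/value pair; stands in for one entry of a symbol table.
template<typename T>
struct symbol_entry
{
    std::string_view key;
    T value;
};

// Result of a match attempt: whether any key matched, the associated value,
// and how many characters of the input the matching key covers.
template<typename T>
struct symbol_match
{
    bool matched;
    T value;
    std::size_t length;
};

// Finds the longest key that is a prefix of `input`. A zero-length key, if
// present, matches any input with length 0.
template<typename T, std::size_t N>
constexpr symbol_match<T>
match_longest_key(symbol_entry<T> const (&entries)[N], std::string_view input)
{
    symbol_match<T> best{false, T{}, 0};
    for (auto const & e : entries) {
        bool const is_prefix = input.substr(0, e.key.size()) == e.key;
        if (is_prefix && (!best.matched || best.length < e.key.size())) {
            best.matched = true;
            best.value = e.value;
            best.length = e.key.size();
        }
    }
    return best;
}

inline constexpr symbol_entry<int> roman_numerals[] = {
    {"I", 1}, {"II", 2}, {"IV", 4}, {"V", 5}};

// "I" and "IV" are both prefixes of the input; the longer key wins.
static_assert(match_longest_key(roman_numerals, "IV and more").value == 4);
// No key is a prefix of "X", so there is no match.
static_assert(!match_longest_key(roman_numerals, "X").matched);
```

Note that the sketch does a linear scan over the keys; it says nothing about
how the real container is organized.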
|
|
|
|
TODO: int<>, uint<>
|
|
|
|
[endsect]
|
|
|
|
[section Directives]
|
|
|
|
A directive is an element of your parser that doesn't have any meaning by
|
|
itself. Some are second-order parsers that need a first-order parser to do
|
|
the actual parsing. Others influence the parse in some way. Lexically, you
|
|
can spot a directive by its use of `[]`. Non-directives never use `[]`, and
|
|
directives always do.
|
|
|
|
The directives that are second-order parsers are technically directives, but
|
|
since they are also used to create parsers, it is more useful just to focus on
|
|
that. The directives _rpt_ and _if_ were already described in the section on
|
|
parsers; we won't say more about them here.
|
|
|
|
That leaves the directives that affect aspects of the parse:
|
|
|
|
[heading _omit_]
|
|
|
|
`_omit_np_[p]` disables attribute generation for the parser `p`. Not only
|
|
does `_omit_np_[p]` have no attribute, but any attribute generation work that
|
|
normally happens within `p` is skipped.
|
|
|
|
This directive can be useful in cases like this: say you have some fairly
|
|
complicated parser `p` that generates a large and expensive-to-construct
|
|
attribute. Now say that you want to write a function that just counts how
|
|
many times `p` can match a string (where the matches are non-overlapping).
|
|
Instead of using `p` directly, and building all those attributes, or rewriting
|
|
`p` without the attribute generation, use _omit_.
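
To make the counting scenario concrete, here is a sketch in standard C++ that
does not use _Parser_ at all. The hypothetical `match_digits` plays the role
of a parser whose attribute generation has been switched off: it only reports
how much input it matched. `count_matches` is likewise a made-up helper that
counts non-overlapping matches.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical matcher: returns the number of characters matched at the
// start of `input`, or 0 if there is no match. It builds no attribute.
constexpr std::size_t match_digits(std::string_view input)
{
    std::size_t n = 0;
    while (n < input.size() && '0' <= input[n] && input[n] <= '9')
        ++n;
    return n;
}

// Counts non-overlapping matches of `match` in `input`. After a match, the
// search resumes just past it; after a failure, it advances one character.
template<typename Matcher>
constexpr std::size_t count_matches(std::string_view input, Matcher match)
{
    std::size_t count = 0;
    while (!input.empty()) {
        std::size_t const len = match(input);
        if (len == 0) {
            input.remove_prefix(1);
        } else {
            ++count;
            input.remove_prefix(len);
        }
    }
    return count;
}

// Three runs of digits: "12", "345", and "6".
static_assert(count_matches("a12b345c6", match_digits) == 3);
```

The point of the sketch is only that counting needs the extent of each match
and nothing else, which is the work that remains once attributes are omitted.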
|
|
|
|
[heading _raw_]
|
|
|
|
`_raw_np_[p]` changes the attribute from `_ATTR_np_(p)` to a view that
|
|
delimits the subrange of the input that was matched by `p`. The type of the
|
|
view is `_v_<I>`, where `I` is the type of the iterator used within the parse.
|
|
Note that this may not be the same as the iterator type passed to _p_. For
|
|
instance, when parsing UTF-8, the iterator passed to _p_ may be `char8_t const
|
|
*`, but within the parse it will be a UTF-8 to UTF-32 transcoding (converting)
|
|
iterator. Just like _omit_, _raw_ causes all attribute-generation work within
|
|
`p` to be skipped.
|
|
|
|
Similar to the re-use scenario for _omit_ above, _raw_ could be used to find
|
|
the *locations* of all non-overlapping matches of `p` in a string.
|
|
|
|
[heading _lexeme_]
|
|
|
|
`_lexeme_np_[p]` disables use of the skipper, if a skipper is being used,
|
|
within the parse of `p`. This is useful, for instance, if you want to enable
|
|
skipping in most parts of your parser, but disable it only in one section
|
|
where it doesn't belong. If you are skipping whitespace in most of your
|
|
parser, but want to parse strings that may contain spaces, you should use
|
|
_lexeme_:
|
|
|
|
namespace bp = boost::parser;
|
|
    auto const string_parser = bp::lexeme['"' >> *(bp::char_ - '"') >> '"'];
|
|
|
|
Without _lexeme_, our string parser would correctly match `"foo bar"`, but the
|
|
generated attribute would be `"foobar"`.
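
The effect is easy to reproduce outside the library. The sketch below is
hand-written standard C++, not _Parser_ code; `parse_quoted` and
`quoted_result` are hypothetical names, the skipper is reduced to a boolean
flag, and the attribute is stored in a fixed 64-character buffer to keep the
example small.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

struct quoted_result
{
    bool matched = false;
    std::array<char, 64> chars{}; // stand-in for the string attribute
    std::size_t size = 0;

    constexpr std::string_view view() const { return {chars.data(), size}; }
};

constexpr bool is_space(char c) { return c == ' ' || c == '\t' || c == '\n'; }

// Models '"' >> *(char_ - '"') >> '"'. When `skip_inside` is true, whitespace
// is skipped before every character, as an active skipper would do. When it
// is false, the body is parsed without skipping, as inside lexeme[].
constexpr quoted_result parse_quoted(std::string_view input, bool skip_inside)
{
    quoted_result r;
    std::size_t i = 0;
    if (i == input.size() || input[i] != '"')
        return r; // no opening quote
    ++i;
    while (true) {
        if (skip_inside) {
            while (i < input.size() && is_space(input[i]))
                ++i;
        }
        if (i == input.size())
            return r; // no closing quote
        if (input[i] == '"') {
            r.matched = true;
            return r;
        }
        if (r.size == r.chars.size())
            return r; // buffer full; a limitation of this sketch only
        r.chars[r.size++] = input[i++];
    }
}

// With skipping active inside the quotes, the space is lost.
static_assert(parse_quoted("\"foo bar\"", true).view() == "foobar");
// Without skipping, the attribute keeps the space.
static_assert(parse_quoted("\"foo bar\"", false).view() == "foo bar");
```

Both calls match the input; only the generated attribute differs.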
|
|
|
|
[heading _skip_]
|
|
|
|
_skip_ is like the inverse of _lexeme_. It enables skipping in the parse,
|
|
even if it was not enabled before. For example, within a call to _p_ that
|
|
uses a skipper, let's say we have these parsers in use:
|
|
|
|
namespace bp = boost::parser;
|
|
auto const one_or_more = +bp::char_;
|
|
auto const skip_or_skip_not_there_is_no_try = bp::lexeme[bp::skip[one_or_more] >> one_or_more];
|
|
|
|
The use of _lexeme_ disables skipping, but then the use of _skip_ turns it
|
|
back on. The net result is that the first occurrence of `one_or_more` will
|
|
use the skipper passed to _p_; the second will not.
|
|
|
|
_skip_ has another use. You can parameterize skip with a different parser to
|
|
change the skipper just within the scope of the directive. Let's say we
|
|
passed _space_ to _p_, and we're using these parsers somewhere within that
|
|
call:
|
|
|
|
namespace bp = boost::parser;
|
|
auto const zero_or_more = *bp::char_;
|
|
auto const skip_both_ways = zero_or_more >> bp::skip(bp::ws)[zero_or_more];
|
|
|
|
The first occurrence of `zero_or_more` will use the skipper passed to _p_,
|
|
_space_; the second will use _ws_ as its skipper.
|
|
|
|
[endsect]
|
|
|
|
[section Combining Operations]
|
|
|
|
Certain overloaded operators are defined for all parsers in _Parser_. We've
|
|
already seen some of them used in this tutorial, especially `operator>>()` and
|
|
`operator|()`, which are used to form sequence parsers and alternative
|
|
parsers, respectively.
|
|
|
|
Here are all the operators overloaded for parsers. In the tables below:
|
|
|
|
* `c` is a character of type `char` or `char32_t`;
|
|
|
|
* `a` is a semantic action;
|
|
|
|
* `r` is an object whose type models `parsable_range_like` (see _concepts_);
|
|
and
|
|
|
|
* `p`, `p1`, `p2`, ... are parsers.
|
|
|
|
[note Some of the expressions in this table consume no input. All parsers
|
|
consume the input they match unless otherwise stated in the table below.]
|
|
|
|
[table Combining Operations and Their Semantics
|
|
[[Expression] [Semantics] [Attribute Type] [Notes]]
|
|
|
|
[[`!p`] [ Matches iff `p` does not match; consumes no input. ] [None.] []]
|
|
[[`&p`] [ Matches iff `p` matches; consumes no input. ] [None.] []]
|
|
[[`*p`] [ Parses using `p` repeatedly until `p` no longer matches; always matches. ] [`std::vector<_ATTR_np_(p)>`] []]
|
|
[[`+p`] [ Parses using `p` repeatedly until `p` no longer matches; matches iff `p` matches at least once. ] [`std::vector<_ATTR_np_(p)>`] []]
|
|
[[`-p`] [ Equivalent to `p | _e_`. ] [`std::optional<_ATTR_np_(p)>`] []]
|
|
[[`p1 >> p2`] [ Matches iff `p1` matches, and then `p2` matches. ] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `>>` is associative; `p1 >> p2 >> p3`, `(p1 >> p2) >> p3`, and `p1 >> (p2 >> p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]]
|
|
[[`p >> c`] [ Equivalent to `p >> lit(c)`. ] [`_ATTR_np_(p)`] []]
[[`p >> r`] [ Equivalent to `p >> lit(r)`. ] [`_ATTR_np_(p)`] []]
|
|
[[`p1 > p2`] [ Matches iff `p1` matches, and then `p2` matches. No back-tracking is allowed after `p1` matches; if `p1` matches but then `p2` does not, the top-level parse fails. ] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `>` is associative; `p1 > p2 > p3`, `(p1 > p2) > p3`, and `p1 > (p2 > p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]]
|
|
[[`p > c`] [ Equivalent to `p > lit(c)`. ] [`_ATTR_np_(p)`] []]
[[`p > r`] [ Equivalent to `p > lit(r)`. ] [`_ATTR_np_(p)`] []]
|
|
[[`p1 | p2`] [ Matches iff either `p1` matches or `p2` matches. ] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `|` is associative; `p1 | p2 | p3`, `(p1 | p2) | p3`, and `p1 | (p2 | p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]]
|
|
[[`p | c`] [ Equivalent to `p | lit(c)`. ] [`std::optional<_ATTR_np_(p)>`] []]
[[`p | r`] [ Equivalent to `p | lit(r)`. ] [`std::optional<_ATTR_np_(p)>`] []]
|
|
[[`p1 - p2`] [ Equivalent to `!p2 >> p1`. ] [`_ATTR_np_(p1)`] []]
|
|
[[`p - c`] [ Equivalent to `p - lit(c)`. ] [`_ATTR_np_(p)`] []]
|
|
[[`p - r`] [ Equivalent to `p - lit(r)`. ] [`_ATTR_np_(p)`] []]
|
|
[[`p1 % p2`] [ Equivalent to `p1 >> *(p2 >> p1)`. ] [`std::vector<_ATTR_np_(p1)>`] []]
|
|
[[`p % c`] [ Equivalent to `p % lit(c)`. ] [`std::vector<_ATTR_np_(p)>`] []]
|
|
[[`p % r`] [ Equivalent to `p % lit(r)`. ] [`std::vector<_ATTR_np_(p)>`] []]
|
|
[[`p[a]`] [ Matches iff `p` matches. If `p` matches, the semantic action `a` is executed. ] [None.] []]
|
|
]
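
The equivalence given above for `p1 % p2`, namely `p1 >> *(p2 >> p1)`, has a
consequence that is easy to miss: a trailing separator is never consumed,
because the last `p2 >> p1` iteration fails as a whole. The sketch below
spells this out for a comma-separated list of unsigned integers. It is plain
standard C++ written for illustration, not _Parser_ code; `parse_uint`,
`parse_uint_list`, and `list_result` are hypothetical names, and a count and
a sum stand in for the `std::vector` attribute.

```cpp
#include <cstddef>
#include <string_view>

struct list_result
{
    bool matched = false;
    std::size_t count = 0;    // number of elements parsed
    unsigned int sum = 0;     // stand-in for the std::vector attribute
    std::size_t consumed = 0; // characters consumed from the input
};

// Parses one unsigned integer starting at `pos`. Returns the number of
// digits read (0 means no match). Overflow is ignored in this sketch.
constexpr std::size_t
parse_uint(std::string_view input, std::size_t pos, unsigned int & value)
{
    std::size_t n = 0;
    value = 0;
    while (pos + n < input.size() && '0' <= input[pos + n] &&
           input[pos + n] <= '9') {
        value = value * 10 + static_cast<unsigned int>(input[pos + n] - '0');
        ++n;
    }
    return n;
}

// Models `uint_ % ','`, written out as `uint_ >> *(',' >> uint_)`.
constexpr list_result parse_uint_list(std::string_view input)
{
    list_result r;
    unsigned int value = 0;
    std::size_t len = parse_uint(input, 0, value);
    if (len == 0)
        return r; // the first element is mandatory
    r.matched = true;
    r.count = 1;
    r.sum = value;
    r.consumed = len;
    // *(',' >> uint_): an iteration must match the separator and an element.
    // If the element is missing, the iteration is abandoned and the
    // separator is left unconsumed.
    while (r.consumed < input.size() && input[r.consumed] == ',') {
        len = parse_uint(input, r.consumed + 1, value);
        if (len == 0)
            break;
        ++r.count;
        r.sum += value;
        r.consumed += 1 + len;
    }
    return r;
}

static_assert(parse_uint_list("1,2,3").count == 3);
static_assert(parse_uint_list("1,2,").consumed == 3); // trailing ',' is left
static_assert(!parse_uint_list(",1").matched);        // needs a first element
```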
|
|
|
|
There are a couple of special rules not captured in the table above:
|
|
|
|
First, the zero-or-more and one-or-more repetitions (`operator*()` and
|
|
`operator+()`, respectively) may collapse when combined. For any parser `p`,
|
|
`++p` collapses to `+p`; `**p`, `*+p`, and `+*p` each collapse to just `*p`.
|
|
|
|
Second, using _e_ in an alternative parser as any alternative *except* the
|
|
last one is a common source of errors; _Parser_ disallows it. This is true
|
|
because, for any parser `p`, `_e_ | p` is equivalent to _e_, since _e_ always
|
|
matches. This is not true for _e_ parameterized with a condition. For any
|
|
condition `cond`, `_e_(cond)` is allowed to appear anywhere within an
|
|
alternative parser.
|
|
|
|
[endsect]
|
|
|
|
[section Attribute Generation]
|
|
|
|
So far, we've seen several different types of attributes that come from
|
|
different parsers, `int` for _i_, `hana::tuple<char, int>` for
|
|
`boost::parser::char_ >> boost::parser::int_`, etc. Let's get into how this
|
|
works with a bit more rigor.
|
|
|
|
[note Some parsers have no attribute at all. In the tables below, the type of
|
|
the attribute is listed as "None." There is a non-`void` type that is
|
|
returned from each parser that lacks an attribute. This keeps the logic
|
|
simple; having to handle the two cases _emdash_ `void` or non-`void` _emdash_
|
|
would make the library significantly more complicated. The type of this
|
|
non-`void` attribute associated with these parsers is an implementation
|
|
detail. The type comes from the `boost::parser::detail` namespace and is
|
|
pretty useless. You should never see this type in practice. Within semantic
|
|
actions, asking for the attribute of a non-attribute-producing parser (using
|
|
`_attr(ctx)`) will yield a value of the special type `boost::parser::none`.
|
|
When you call _p_ in a form that returns the parsed attribute, and there is no
attribute, it simply returns `bool`; this indicates the success or failure of
the parse.]
|
|
|
|
[heading Parser attributes]
|
|
|
|
This table summarizes the attributes generated for all _Parser_ parsers. In
|
|
the table, _RES_ is a notional macro that expands to the resolution of a parse
argument or the evaluation of a parse predicate; and `x` and `y` represent
|
|
arbitrary objects.
|
|
|
|
[table Parsers and Their Attributes
|
|
[[Parser] [Attribute Type] [Notes]]
|
|
|
|
[[ _e_ ] [ None. ] []]
|
|
[[ _eol_ ] [ None. ] []]
|
|
[[ _eoi_ ] [ None. ] []]
|
|
[[ `_attr_np_(x)` ] [ `decltype(_RES_np_(x))` ][]]
|
|
[[ _ch_ ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing; see below. ]
|
|
[Includes all the `_p` _udls_ that take a single character, and all parsers in the `boost::parser::ascii` namespace.]]
|
|
[[ _cp_ ] [ `uint32_t` ] []]
|
|
[[ _cu_ ] [ `char` ] []]
|
|
[[ `_lit_np_(x)`] [ None. ]
|
|
[Includes all the `_l` _udls_.]]
|
|
[[ `_str_np_(x)`] [ `std::string` ]
|
|
[Includes all the `_p` _udls_ that take a string.]]
|
|
[[ _b_ ] [ `bool` ] []]
|
|
|
|
[[ _bin_ ] [ `unsigned int` ] []]
|
|
[[ _oct_ ] [ `unsigned int` ] []]
|
|
[[ _hex_ ] [ `unsigned int` ] []]
|
|
[[ _us_ ] [ `unsigned short` ] []]
|
|
[[ _ui_ ] [ `unsigned int` ] []]
|
|
[[ _ul_ ] [ `unsigned long` ] []]
|
|
[[ _ull_ ] [ `unsigned long long` ] []]
|
|
|
|
[[ _s_ ] [ `short` ] []]
|
|
[[ _i_ ] [ `int` ] []]
|
|
[[ _l_ ] [ `long` ] []]
|
|
[[ _ll_ ] [ `long long` ] []]
|
|
[[ _f_ ] [ `float` ] []]
|
|
[[ _d_ ] [ `double` ] []]
|
|
|
|
[[ _symbols_t_ ]     [ `T` ] []]
|
|
]
|
|
|
|
_ch_ is a bit odd, since its attribute type is polymorphic. When you use _ch_
|
|
to parse text in the non-Unicode code path (i.e. a string of `char`), the
|
|
attribute is `char`. When you use the exact same _ch_ to parse in the
|
|
Unicode-aware code path, all matching is code point based, and so the
|
|
attribute type is the type used to represent code points. For typical uses,
|
|
that type is `uint32_t`. All parsing of UTF-8 falls under this typical case.
|
|
The only time the code point type will be something different is if you call
|
|
_p_ with a code point sequence whose element type is something besides
|
|
`uint32_t`. For example, when you parse plain `char`s, meaning that the
|
|
parsing is in the non-Unicode code path, the attribute of _ch_ is `char`:
|
|
|
|
auto result = parse("some text", boost::parser::char_);
|
|
    static_assert(std::is_same_v<decltype(result), std::optional<char>>);
|
|
|
|
When you parse UTF-8, the matching is done on a code point basis, and the code
|
|
point type is `uint32_t`, so the attribute type is `uint32_t`:
|
|
|
|
auto result = parse(boost::text::as_utf8("some text"), boost::parser::char_);
|
|
    static_assert(std::is_same_v<decltype(result), std::optional<uint32_t>>);
|
|
|
|
When you parse code points by explicitly giving a code point range to _p_, the
|
|
attribute type is whatever the input range's element type is:
|
|
|
|
auto result = parse(U"some text", boost::parser::char_);
|
|
    static_assert(std::is_same_v<decltype(result), std::optional<char32_t>>);
|
|
|
|
[tip If you know or suspect that you will want to use the same parser in
|
|
Unicode and non-Unicode parsing modes, you can use _cp_ and/or _cu_ to enforce
|
|
a nonpolymorphic attribute type.]
|
|
|
|
|
|
[heading Combining operation attributes]
|
|
|
|
Combining operations of course affect the generation of attributes. In the
|
|
tables below: `m` and `n` are parse arguments that resolve to integral values;
|
|
`pred` is a parse predicate; `arg0`, `arg1`, `arg2`, ... are parse arguments;
|
|
`a` is a semantic action; and `p`, `p1`, `p2`, ... are parsers that generate
|
|
attributes.
|
|
|
|
[table Combining Operations and Their Attributes
|
|
[[Parser] [Attribute Type]]
|
|
|
|
[[`!p`] [None.]]
|
|
[[`&p`] [None.]]
|
|
|
|
[[`*p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`+p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`+*p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`*+p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`-p`] [`std::optional<_ATTR_np_(p)>`]]
|
|
|
|
[[`p1 >> p2`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2)>`]]
|
|
[[`p1 > p2`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2)>`]]
|
|
[[`p1 >> p2 >> p3`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]]
|
|
[[`p1 > p2 >> p3`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]]
|
|
[[`p1 >> p2 > p3`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]]
|
|
[[`p1 > p2 > p3`] [`hana::tuple<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]]
|
|
|
|
[[`p1 | p2`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>`]]
|
|
[[`p1 | p2 | p3`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]]
|
|
|
|
[[`p1 % p2`] [`std::vector<_ATTR_np_(p1)>`]]
|
|
|
|
[[`p[a]`] [None.]]
|
|
|
|
[[`_rpt_np_(arg0)[p]`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`_rpt_np_(arg0, arg1)[p]`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`_if_np_(pred)[p]`] [`std::optional<_ATTR_np_(p)>`]]
|
|
[[`_sw_np_(arg0)(arg1, p1)(arg2, p2)...`]
|
|
[`std::variant<_ATTR_np_(p1), _ATTR_np_(p2), ...>`]]
|
|
]
|
|
|
|
There are a relatively small number of rules that define how the attributes of
sequence parsers and alternative parsers are generated. (Don't worry, there
are examples below.)
|
|
|
|
[heading Sequence parser attribute rules]
|
|
|
|
The attribute generation behavior of sequence parsers is conceptually pretty
|
|
simple:
|
|
|
|
* the attributes of subparsers form a tuple of values;
|
|
|
|
* subparsers that do not generate attributes do not contribute to the
|
|
sequence's attribute;
|
|
|
|
* subparsers that do generate attributes usually contribute an individual
|
|
element to the tuple result; except
|
|
|
|
* when containers of the same element type are next to each other, or
|
|
individual elements are next to containers of their type, the two adjacent
|
|
attributes collapse into one attribute; and
|
|
|
|
* if the result of all that is a degenerate tuple `hana::tuple<T>` (even if
|
|
`T` is a type that means "no attribute"), the attribute becomes `T`.
|
|
|
|
More formally, the attribute generation algorithm works like this. For a
|
|
sequence parser `p`, let the list of attribute types for the subparsers of `p`
|
|
be `{a0, a1, a2, ..., an}`.
|
|
|
|
We get the attribute of `p` by evaluating a compile-time left fold operation,
|
|
`left-fold({a1, a2, ..., an}, a0, OP)`. `OP` is the combining operation that
|
|
takes the current attribute type (initially `a0`) and the next attribute type,
|
|
and returns the new current attribute type. The current attribute type at the
|
|
end is the attribute type for `p`.
|
|
|
|
`OP` attempts to apply a series of rules, one at a time. The rules are noted
|
|
as `A >> B -> C`, where `A` is the current attribute type, `B` is the next
attribute type, and `C` is the new current attribute
|
|
type. In these rules, `C<T>` is a container of `T`; `none` is a special type
|
|
that indicates that there is no attribute; `T` is a type; and `Ts...` is a
|
|
parameter pack of one or more types. Note that `T` may be the special type
|
|
`none`.
|
|
|
|
* `T >> none -> T`
|
|
* `C<T> >> C<T> -> C<T>`
|
|
* `T >> T -> vector<T>`
|
|
* `C<T> >> T -> C<T>`
|
|
* `C<T> >> optional<T> -> C<T>`
|
|
* `T >> C<T> -> C<T>`
|
|
* `optional<T> >> C<T> -> C<T>`
|
|
* `hana::tuple<none> >> T -> hana::tuple<T>`
|
|
* `hana::tuple<Ts...> >> T -> hana::tuple<Ts..., T>`
|
|
|
|
Again, if the result is that the attribute is `hana::tuple<T>`, the attribute
|
|
becomes `T`.
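
The fold described above can be written down as a small piece of template
metaprogramming. What follows is a sketch under several assumptions, not the
library's implementation: `std::tuple` stands in for `hana::tuple`; only
`std::vector` and `std::basic_string` are treated as containers; the state of
the fold is kept as the finished tuple elements plus one "current" element
that may still collapse with its right-hand neighbor; and all of the names
(`none`, `collapse`, `fold`, `sequence_attribute_t`) are made up for the
example.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <tuple>
#include <type_traits>
#include <vector>

struct none {};            // stand-in for "no attribute"
struct not_a_container {}; // marker type used by element_of below

// Assumption: only std::vector and std::basic_string are containers here.
template<typename T>
struct element_of { using type = not_a_container; };
template<typename T, typename A>
struct element_of<std::vector<T, A>> { using type = T; };
template<typename Ch, typename Tr, typename A>
struct element_of<std::basic_string<Ch, Tr, A>> { using type = Ch; };
template<typename T>
using element_of_t = typename element_of<T>::type;
template<typename T>
constexpr bool is_container_v =
    !std::is_same_v<element_of_t<T>, not_a_container>;

template<typename T>
struct tag { using type = T; };

// One step of OP: tries to collapse the current attribute type A with the
// next one, B. Returns tag<void> when none of the collapsing rules applies.
template<typename A, typename B>
constexpr auto collapse(tag<A>, tag<B>)
{
    using EA = element_of_t<A>;
    using EB = element_of_t<B>;
    if constexpr (std::is_same_v<B, none>)
        return tag<A>{};              // T >> none -> T
    else if constexpr (is_container_v<A> && std::is_same_v<A, B>)
        return tag<A>{};              // C<T> >> C<T> -> C<T>
    else if constexpr (std::is_same_v<A, B>)
        return tag<std::vector<A>>{}; // T >> T -> vector<T>
    else if constexpr (is_container_v<A> &&
                       (std::is_same_v<EA, B> ||
                        std::is_same_v<std::optional<EA>, B>))
        return tag<A>{};              // C<T> >> T, C<T> >> optional<T>
    else if constexpr (is_container_v<B> &&
                       (std::is_same_v<A, EB> ||
                        std::is_same_v<A, std::optional<EB>>))
        return tag<B>{};              // T >> C<T>, optional<T> >> C<T>
    else if constexpr (std::is_same_v<A, none>)
        return tag<B>{};              // tuple<none> >> T -> tuple<T>
    else
        return tag<void>{};           // no collapse: B becomes a new element
}

// Left fold. Done... are finished tuple elements; Last is the current one.
template<typename Done, typename Last, typename... Rest>
struct fold;

template<typename... Done, typename Last>
struct fold<std::tuple<Done...>, Last>
{
    // A one-element tuple degenerates to its element type.
    using type = std::conditional_t<
        sizeof...(Done) == 0, Last, std::tuple<Done..., Last>>;
};

template<typename... Done, typename Last, typename Next, typename... Rest>
struct fold<std::tuple<Done...>, Last, Next, Rest...>
{
    using merged = typename decltype(collapse(tag<Last>{}, tag<Next>{}))::type;
    using type = typename std::conditional_t<
        std::is_void_v<merged>,
        fold<std::tuple<Done..., Last>, Next, Rest...>, // append
        fold<std::tuple<Done...>, merged, Rest...>>::type; // collapse
};

template<typename A0, typename... As>
using sequence_attribute_t = typename fold<std::tuple<>, A0, As...>::type;

static_assert(std::is_same_v<sequence_attribute_t<none, none>, none>);
static_assert(std::is_same_v<
              sequence_attribute_t<char, int>, std::tuple<char, int>>);
static_assert(std::is_same_v<
              sequence_attribute_t<char, std::string>, std::string>);
static_assert(std::is_same_v<
              sequence_attribute_t<std::vector<char>, std::string>,
              std::tuple<std::vector<char>, std::string>>);
```

The last assertion shows that two containers collapse only when they are the
same container type; `std::vector<char>` and `std::string` share an element
type but remain separate tuple elements.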
|
|
|
|
[note What constitutes a container in the rules above is determined by the
|
|
`container` concept:
|
|
[container_concept]
|
|
]
|
|
|
|
[heading Alternative parser attribute rules]
|
|
|
|
The rules for alternative parsers are much simpler. For an alternative parser
|
|
`p`, let the list of attribute types for the subparsers of `p` be `{a0, a1,
|
|
a2, ..., an}`. The attribute of `p` is `std::variant<a0, a1, a2, ..., an>`,
|
|
with these exceptions:
|
|
|
|
* all the `none` attributes are left out, but if any were taken out, the
|
|
  attribute becomes a `std::optional`;
|
|
|
|
* if the result is `std::variant<T>`, the result becomes `T` instead; and
|
|
|
|
* if the result is `std::variant<>`, the result becomes `none` instead.
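
These exceptions can also be expressed as a compile-time computation. The
sketch below is an illustration in standard C++, not the library's code, and
every name in it (`none`, `type_list`, `alt_fold`, `alternative_attribute_t`)
is hypothetical. It makes two assumptions beyond the rules listed above: a
result of `none` is not wrapped in `std::optional`, and repeated attribute
types are merged, which is what the `p | p` row of the examples table later
in this section suggests.

```cpp
#include <optional>
#include <type_traits>
#include <variant>

struct none {}; // stand-in for "no attribute"

template<typename... Ts>
struct type_list {};

// Turns the list of kept attribute types into the unwrapped result.
template<typename List>
struct to_result;
template<>
struct to_result<type_list<>> { using type = none; }; // variant<> -> none
template<typename T>
struct to_result<type_list<T>> { using type = T; };   // variant<T> -> T
template<typename T, typename U, typename... Ts>
struct to_result<type_list<T, U, Ts...>>
{
    using type = std::variant<T, U, Ts...>;
};

// Walks the subparser attribute types left to right, dropping `none` (and,
// by assumption, duplicates), and remembering whether a `none` was dropped.
template<typename Kept, bool DroppedNone, typename... Rest>
struct alt_fold;

template<typename... Kept, bool DroppedNone>
struct alt_fold<type_list<Kept...>, DroppedNone>
{
    using unwrapped = typename to_result<type_list<Kept...>>::type;
    // Wrap in std::optional only if a `none` was dropped and something is
    // left to wrap.
    using type = std::conditional_t<
        DroppedNone && !std::is_same_v<unwrapped, none>,
        std::optional<unwrapped>,
        unwrapped>;
};

template<typename... Kept, bool DroppedNone, typename Next, typename... Rest>
struct alt_fold<type_list<Kept...>, DroppedNone, Next, Rest...>
{
    static constexpr bool is_none = std::is_same_v<Next, none>;
    static constexpr bool is_duplicate = (std::is_same_v<Kept, Next> || ...);
    using kept = std::conditional_t<
        is_none || is_duplicate,
        type_list<Kept...>,
        type_list<Kept..., Next>>;
    using type = typename alt_fold<kept, DroppedNone || is_none, Rest...>::type;
};

template<typename... As>
using alternative_attribute_t =
    typename alt_fold<type_list<>, false, As...>::type;

static_assert(std::is_same_v<
              alternative_attribute_t<int, double>,
              std::variant<int, double>>);
static_assert(std::is_same_v<
              alternative_attribute_t<int, none>, std::optional<int>>);
static_assert(std::is_same_v<
              alternative_attribute_t<int, none, double>,
              std::optional<std::variant<int, double>>>);
static_assert(std::is_same_v<alternative_attribute_t<none, none>, none>);
```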
|
|
|
|
[heading Formation of containers in attributes]
|
|
|
|
There are no special rules for forming containers from non-containers. For
|
|
instance, one of the rules above for sequence parsers is `T >> T ->
|
|
vector<T>`. So, you get a vector if you have multiple values in sequence.
|
|
Another rule is that the attribute of `*p` is `std::vector<_ATTR_np_(p)>`. The
|
|
point is, _Parser_ will generate your favorite container out of sequences and
|
|
repetitions, as long as your favorite container is `std::vector`.
|
|
|
|
Another rule for sequence parsers is that a value `x` and a container `c`
|
|
containing elements of `x`'s type will form a single container. However,
|
|
`x`'s type must be exactly the same as the elements in `c`. So, the attribute
|
|
of `char_ >> string("str")` is odd. In the non-Unicode code path, `char_`'s
|
|
attribute type is guaranteed to be `char`, so `_ATTR_np_(char_ >> string("str"))`
|
|
is `std::string`. If you are parsing UTF-8 in the Unicode code path,
|
|
`char_`'s attribute type is `uint32_t`, and `_ATTR_np_(char_ >> string("str"))` is
|
|
therefore `hana::tuple<uint32_t, std::string>`.
|
|
|
|
Again, there are no special rules here.
|
|
|
|
[heading Examples of attributes generated by sequence and alternative parsers]
|
|
|
|
In the table: `a` is a semantic action; and `p`, `p1`, `p2`, ... are parsers
|
|
that generate attributes. Note that only `>>` is used here. `>` has the
|
|
exact same attribute generation rules.
|
|
|
|
[table Sequence and Alternative Combining Operations and Their Attributes
|
|
[[Expression] [Attribute Type]]
|
|
|
|
[[`_e_ >> _e_`] [None.]]
|
|
[[`p >> _e_`] [`_ATTR_np_(p)`]]
|
|
[[`_e_ >> p`] [`_ATTR_np_(p)`]]
|
|
|
|
[[`_cu_ >> _str_np_("str")`] [`std::string`]]
|
|
[[_str_np_`("str") >> `_cu_] [`std::string`]]
|
|
[[`*_cu_ >> _str_np_("str")`] [`hana::tuple<std::vector<char>, std::string>`]]
|
|
[[`_str_np_("str") >> *_cu_`] [`hana::tuple<std::string, std::vector<char>>`]]
|
|
|
|
[[`p >> p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`*p >> p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`p >> *p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`*p >> -p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
[[`-p >> *p`] [`std::vector<_ATTR_np_(p)>`]]
|
|
|
|
[[`_str_np_("str") >> _cu_`] [`std::string`]]
|
|
[[`_cu_ >> _str_np_("str")`] [`std::string`]]
|
|
[[`_str_np_("str") >> -_cu_`] [`std::string`]]
|
|
[[`-_cu_ >> _str_np_("str")`] [`std::string`]]
|
|
|
|
[[`!p1 | p2[a]`] [None.]]
|
|
[[`p | p`] [`_ATTR_np_(p)`]]
|
|
[[`p1 | p2`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>`]]
|
|
[[`p | `_e_] [`std::optional<_ATTR_np_(p)>`]]
|
|
[[`p1 | p2 | _e_`] [`std::optional<std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>>`]]
|
|
[[`p1 | p2[a] | p3`] [`std::optional<std::variant<_ATTR_np_(p1), _ATTR_np_(p3)>>`]]
|
|
]
|
|
|
|
|
|
[heading Directives that affect attribute generation]
|
|
|
|
`_omit_np_[p]` disables attribute generation for the parser `p`.
|
|
`_raw_np_[p]` changes the attribute from `_ATTR_np_(p)` to a view that
|
|
indicates the subrange of the input that was matched by `p`. See _directives_
|
|
for details.
|
|
|
|
[endsect]
|
|
|
|
[section The `parse()` API]
|
|
|
|
There are multiple overloads of _p_. These overloads have some things in
|
|
common:
|
|
|
|
* They each return a value contextually convertible to `bool`.
|
|
|
|
* They each take at least a range to parse and a parser. The "range to parse"
|
|
may be an iterator/sentinel pair or a single range-like object.
|
|
|
|
* They each require forward iterability of the input.
|
|
|
|
* They each accept any input range with an integral element type. This means
|
|
that they can each parse ranges of `char`, `char8_t`, `uint16_t`, `int`,
|
|
etc.
|
|
|
|
* When you call any of the iterator/sentinel pair overloads of _p_, for
|
|
example `_p_np_(first, last, p, _ws_)`, it parses the range `[first, last)`,
|
|
advancing `first` as it goes. If the parse succeeds, the entire input may or
|
|
may not have been matched. The value of `first` will indicate the last
|
|
location within the input that `p` matched. The *whole* input was matched if
|
|
and only if `first == last`.
|
|
|
|
* When you call any of the range-like overloads of _p_, for example `_p_np_(r,
|
|
p, _ws_)`, _p_ only indicates success if *all* of `r` was matched by `p`.
|
|
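The difference between the last two bullets can be modeled without the library at all. The following is a minimal sketch, using a hypothetical toy function `parse_digits` (not part of the library) that matches one or more ASCII digits. The iterator/sentinel form reports a prefix match and advances `first`; the range-like form reports success only when the whole range is consumed.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

// Hypothetical toy "parser": matches one or more ASCII digits.
// Iterator/sentinel form: advances `first` past whatever was matched, and
// reports success even if some input remains.
template<typename I, typename S>
bool parse_digits(I & first, S last)
{
    I const start = first;
    while (first != last &&
           std::isdigit(static_cast<unsigned char>(*first))) {
        ++first;
    }
    return first != start;
}

// Range-like form: succeeds only if the entire range was matched.
inline bool parse_digits(std::string_view r)
{
    auto first = r.begin();
    return parse_digits(first, r.end()) && first == r.end();
}
```

Called on `"123abc"`, the iterator/sentinel form returns `true` and leaves `first` pointing at `'a'`, while the range-like form returns `false`.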
|
|
[heading The overloads]
|
|
|
|
There are eight overloads of _p_, because there are three either/or options in
|
|
how you call it.
|
|
|
|
[heading Iterator/sentinel versus range-like]
|
|
|
|
You can call _p_ with an iterator and sentinel that delimit a range of
|
|
integral values. For example:
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = /* some parser ... */;
|
|
|
|
char const * str_1 = /* ... */;
|
|
// Using null_sentinel, str_1 can point to three billion characters, and
|
|
// we can call parse() without having to find the end of the string first.
|
|
auto result_1 = bp::parse(str_1, boost::text::null_sentinel, p, bp::ws);
|
|
|
|
char str_2[] = /* ... */;
|
|
auto result_2 = bp::parse(std::begin(str_2), std::end(str_2), p, bp::ws);
|
|
|
|
The iterator/sentinel overloads can parse successfully without matching the
|
|
entire input. You can tell if the entire input was matched by checking if
|
|
`first == last` is true after _p_ returns.
|
|
|
|
You can also call _p_ with a range of integral values. When the range is a
|
|
reference to an array of characters, any terminating `0` is ignored; this
|
|
allows calls like `_p_np_("str", p)` to work naturally.
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = /* some parser ... */;
|
|
|
|
std::u8string str_1 = u8"str";
|
|
auto result_1 = bp::parse(str_1, p, bp::ws);
|
|
|
|
// The null terminator is ignored. This call parses s-t-r, not s-t-r-0.
|
|
auto result_2 = bp::parse(U"str", p, bp::ws);
|
|
|
|
char const * str_3 = "str";
|
|
auto result_3 = bp::parse(boost::text::as_utf16(str_3), p, bp::ws);
|
|
|
|
You can also call _p_ with a pointer to a null-terminated string of integral
|
|
values. _p_ considers pointers to null-terminated strings to be ranges,
|
|
since, for any pointer `T *` to a null-terminated string, `T *` is isomorphic
|
|
with `view<T *, boost::text::null_sentinel>`.
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = /* some parser ... */;
|
|
|
|
char const * str_1 = /* ... */ ;
|
|
auto result_1 = bp::parse(str_1, p, bp::ws);
|
|
char8_t const * str_2 = /* ... */ ;
|
|
auto result_2 = bp::parse(str_2, p, bp::ws);
|
|
char16_t const * str_3 = /* ... */ ;
|
|
auto result_3 = bp::parse(str_3, p, bp::ws);
|
|
char32_t const * str_4 = /* ... */ ;
|
|
auto result_4 = bp::parse(str_4, p, bp::ws);
|
|
|
|
int const array[] = { 's', 't', 'r', 0 };
|
|
int const * array_ptr = array;
|
|
auto result_5 = bp::parse(array_ptr, p, bp::ws);
|
|
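To see why a pointer to a null-terminated string can be treated as a range, consider this minimal sketch of a sentinel that compares equal to a pointer whose pointee is zero. `null_sentinel_t` and `count_until_null` are hypothetical names written for this illustration; they are stand-ins for the idea, not the library's own `null_sentinel`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel type: a pointer "reaches" it when it points at zero.
struct null_sentinel_t {};

template<typename T>
constexpr bool operator==(T const * p, null_sentinel_t) { return *p == T(0); }
template<typename T>
constexpr bool operator!=(T const * p, null_sentinel_t) { return *p != T(0); }

// Walks [first, last) without ever computing the end of the string up front.
template<typename T>
constexpr std::size_t count_until_null(T const * first, null_sentinel_t last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
        ++n;
    return n;
}

static_assert(count_until_null("str", null_sentinel_t{}) == 3);
```

The same function works for any integral element type, which mirrors the `int const *` example above.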
|
|
Since there is no way to indicate that `p` matches the input, but only a
|
|
prefix of the input was matched, the range-like (non-iterator/sentinel)
|
|
overloads of _p_ indicate failure if the entire input is not matched.
|
|
|
|
[heading With or without an attribute out-parameter]
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = '"' >> *(bp::char_ - '"') >> '"';
|
|
char const * str = "\"two words\"" ;
|
|
|
|
std::string result_1;
|
|
bool const success = bp::parse(str, p, result_1); // success is true; result_1 is "two words"
|
|
auto result_2 = bp::parse(str, p); // !!result_2 is true; *result_2 is "two words"
|
|
|
|
When you call _p_ *with* an attribute out-parameter and parser `p`, the
|
|
expected type is *something like* `_ATTR_np_(p)`. It doesn't have to be
|
|
exactly that; I'll explain in a bit. The return type is `bool`.
|
|
|
|
When you call _p_ *without* an attribute out-parameter and parser `p`, the
|
|
return type is `std::optional<_ATTR_np_(p)>`. Note that when `_ATTR_np_(p)`
|
|
is itself an `optional`, the return type is
|
|
`std::optional<std::optional<...>>`. Each of those optionals tells you
|
|
something different. The outer one tells you whether the parse succeeded. If
|
|
so, the parser was successful, but it still generates an attribute that is an
|
|
`optional` _emdash_ that's the inner one.
|
|
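The two levels of `optional` are easy to mix up, so here is a small standalone model of them. `parse_optional_minus` is a hypothetical toy function (not a library call) for the grammar "an optional `'-'` and nothing else". The outer `optional` reports whether the parse succeeded, and the inner one is the attribute itself.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

using nested_result = std::optional<std::optional<char>>;

// Hypothetical toy parser for the grammar: an optional '-' and nothing else.
inline nested_result parse_optional_minus(std::string_view in)
{
    if (in.empty())
        return nested_result(std::in_place, std::nullopt); // success, no '-'
    if (in == "-")
        return nested_result(std::in_place, '-');          // success, with '-'
    return std::nullopt;                                   // parse failed
}
```

For the empty input the result is engaged but holds a disengaged inner `optional`; for `"x"` the outer `optional` itself is disengaged.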
|
|
[heading With or without a skipper]
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = '"' >> *(bp::char_ - '"') >> '"';
|
|
char const * str = "\"two words\"" ;
|
|
|
|
auto result_1 = bp::parse(str, p); // !!result_1 is true; *result_1 is "two words"
|
|
auto result_2 = bp::parse(str, p, bp::ws); // !!result_2 is true; *result_2 is "twowords"
|
|
|
|
[heading Compatibility of attribute out-parameters]
|
|
|
|
For any call to _p_ that takes an attribute out-parameter, like `_p_np_("str",
|
|
p, bp::ws, out)`, the call is well-formed for a number of possible types of
|
|
`out`; `decltype(out)` does not need to be exactly `_ATTR_np_(p)`.
|
|
|
|
For instance, this is valid code that does not abort (remember that the
|
|
attribute type of _str_ is `std::string`):
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = bp::string("foo");
|
|
|
|
std::vector<char> result;
|
|
bool const success = bp::parse("foo", p, result);
|
|
assert(success && result == std::vector<char>({'f', 'o', 'o'}));
|
|
|
|
Even though `p` generates a `std::string` attribute, when it actually takes
|
|
the data it generates and writes it into an attribute, it only assumes that
|
|
the attribute is a `container` (see _concepts_), not that it is some
|
|
particular container type. It will happily `insert()` into a `std::string` or
|
|
a `std::vector<char>` all the same. `std::string` and `std::vector<char>` are
|
|
both containers of `char`, but it will also insert into a container with a
|
|
different element type. `p` just needs to be able to insert the elements it
|
|
produces into the attribute-container. As long as an implicit conversion
|
|
allows that to work, everything is fine:
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = bp::string("foo");
|
|
|
|
std::vector<int> result;
|
|
bool const success = bp::parse("foo", p, result);
|
|
assert(success && result == std::vector<int>({'f', 'o', 'o'}));
|
|
|
|
This works, too, even though it requires inserting elements from a generated
|
|
sequence of `uint32_t` into a container of `char` (remember that the attribute
|
|
type of `+_cp_` is `std::vector<uint32_t>`):
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = +bp::cp;
|
|
|
|
std::string result;
|
|
bool const success = bp::parse("foo", p, result);
|
|
assert(success && result == "foo");
|
|
|
|
This next example works as well, even though the change to a container is not
|
|
at the top level. It is an element of the result tuple:
|
|
|
|
namespace bp = boost::parser;
|
|
// p matches one or more non-spaces, followed by a single space, followed by one or more repetitions of "foo".
|
|
auto const p = +(bp::cp - ' ') >> ' ' >> +bp::string("foo");
|
|
|
|
// attr_type is the attribute type generated by p.
|
|
using attr_type = decltype(bp::parse(u8"", p));
|
|
static_assert(
|
|
std::is_same_v<
|
|
attr_type,
|
|
std::optional<
|
|
boost::hana::tuple<std::vector<uint32_t>, std::string>>>);
|
|
|
|
// This is similar to attr_type, with the std::vector<uint32_t> changed to a std::string.
|
|
boost::hana::tuple<std::string, std::string> result;
|
|
bool const success = bp::parse(u8"rôle foofoo", p, result);
|
|
using namespace boost::hana::literals;
|
|
|
|
assert(success); // p matches.
|
|
assert(result[0_c].size() == 5u); // The 4 code points "rôle" get transcoded to 5 UTF-8 code units to fit in the std::string.
|
|
assert(result[0_c] == (char const *)u8"rôle");
|
|
assert(result[1_c] == "foofoo");
|
|
|
|
As indicated in the inline comments, there are a couple of things to take away
|
|
from this example:
|
|
|
|
* If you change a container (such as `std::string` to `std::vector<int>`, or
|
|
`std::vector<uint32_t>` to `std::deque<int>`), the call to _p_ will often
|
|
still be well-formed.
|
|
|
|
* When changing out a container type, if both containers contain integral
|
|
values, and the removed container's element type is 4 bytes in size, and the
|
|
new container's element type is 1 byte in size, _Parser_ assumes that this
|
|
is a UTF-32-to-UTF-8 conversion, and silently transcodes the data when
|
|
inserting into the new container.
|
|
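The size change from 4 code points to 5 code units follows directly from how UTF-8 encodes values above `0x7F`. The sketch below is a plain standard-library encoder, written only to illustrate that arithmetic; `append_utf8` and `to_utf8` are hypothetical helpers, not the transcoding code the library uses, and they assume every input value is a valid Unicode scalar value (no error handling).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Appends the UTF-8 encoding of one code point to `out`.
// Assumes `cp` is a valid Unicode scalar value; invalid input is not checked.
inline void append_utf8(std::string & out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole sequence of 4-byte values into a 1-byte-element string.
inline std::string to_utf8(std::vector<std::uint32_t> const & cps)
{
    std::string out;
    for (std::uint32_t cp : cps)
        append_utf8(out, cp);
    return out;
}
```

Encoding the code points of "rôle" (`0x72, 0xF4, 0x6C, 0x65`) yields five `char`s, because `0xF4` needs two code units. Any other 4-byte integral value is encoded the same way, whether or not it was meant to be a code point.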
|
|
[caution The detection of the need to transcode from UTF-32 to UTF-8 applies to *all* integral values. If you call _p_ with this parser:
|
|
|
|
auto const p = +boost::parser::uint_;
|
|
|
|
using a `std::string` as an out-parameter, it will happily transcode your
|
|
unsigned ints to UTF-8. This is almost certainly not what you want. Don't
|
|
worry, though; this kind of case comes up pretty rarely, but wanting to parse
|
|
in Unicode mode and catch results in UTF-8 strings comes up all the time.]
|
|
|
|
Let's look at a case where another simple-seeming type replacement does *not* work:
|
|
|
|
namespace bp = boost::parser;
|
|
auto const p = +(bp::int_ >> +bp::cp);
|
|
|
|
using attr_type = decltype(bp::parse(u8"", p));
|
|
static_assert(std::is_same_v<
|
|
attr_type,
|
|
std::optional<std::vector<
|
|
boost::hana::tuple<int, std::vector<uint32_t>>>>>);
|
|
|
|
std::vector<boost::hana::tuple<int, std::string>> result;
|
|
#if 0
|
|
bool const success = bp::parse(u8"42 rôle", p, bp::ws, result); // ill-formed!
|
|
#endif
|
|
|
|
In this case, removing a `std::vector<uint32_t>` and putting a `std::string`
|
|
in its place makes the code ill-formed, even though we saw a similar
|
|
replacement earlier. The reason this one does not work is that the replaced
|
|
container is part of the element type of yet another container. At some point
|
|
in the code, `p` would try to insert a `boost::hana::tuple<int,
|
|
std::vector<uint32_t>>` _emdash_ the element type of the attribute type it
|
|
normally generates _emdash_ into a vector of `boost::hana::tuple<int,
|
|
std::string>`s. There's no implicit conversion there, so the code is
|
|
ill-formed.
|
|
|
|
The take-away for this last example is that the ability to arbitrarily swap
|
|
out data types within the type of the attribute you pass to _p_ is very
|
|
flexible, but is also limited to structurally simple cases. When we discuss
|
|
rules in the next section, we'll see how this flexibility in the types of
|
|
attributes can help when writing complicated parsers.
|
|
|
|
[note Those were all examples of swapping out one container type for another.
|
|
They make good examples because that is more likely to be surprising, and so
|
|
it's getting lots of coverage here. You can also do much simpler things like
|
|
parse using a _ui_, and write its attribute into a `double`. In general,
|
|
you can swap any type `T` out of the attribute, as long as `T` is not part of
|
|
the element type for some container within the attribute. ]
|
|
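The container swaps in this section work because elements are inserted one at a time, so only an implicit conversion between element types is needed. Here is a rough standalone model of that idea. `append_all` is a hypothetical helper written for this illustration; the library's actual insertion logic is more involved (for instance, it also performs the transcoding described above).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copies each produced element into any container that
// supports insert(end(), value), relying on implicit element conversion.
template<typename Container, typename Range>
void append_all(Container & out, Range const & produced)
{
    for (auto const & x : produced)
        out.insert(out.end(), x);
}
```

With this model, a produced `std::string` can be written into a `std::vector<char>` or a `std::vector<int>` alike. A `std::vector` of tuples with mismatched element types cannot be filled this way, since no implicit conversion exists between the tuple types.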
|
|
[heading Unicode versus non-Unicode parsing]
|
|
|
|
A call to _p_ either considers the entire input to be in a UTF format (UTF-8,
|
|
UTF-16, or UTF-32), or it considers the entire input to be in some unknown
|
|
encoding. Here is how it deduces which case the call falls under:
|
|
|
|
* If the input range is a sequence of `char8_t`, or if the input is a
|
|
`boost::text::utf8_view`, the input is UTF-8.
|
|
|
|
* Otherwise, if the input is a sequence of 1-byte integral values, the input
|
|
is in an unknown encoding.
|
|
|
|
* Otherwise, the input is in a UTF encoding.
|
|
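The three rules above amount to a small decision on the input's element type. The sketch below models that decision as a `constexpr` function. `input_encoding` and `deduce_encoding` are hypothetical names, and the sketch looks only at the element type; it makes no attempt to model the `boost::text::utf8_view` special case from the first rule.

```cpp
#include <type_traits>

enum class input_encoding { utf8, unknown, utf };

// Hypothetical model of the deduction, driven only by the element type.
template<typename Char>
constexpr input_encoding deduce_encoding()
{
    static_assert(std::is_integral_v<Char>, "input elements must be integral");
#if defined(__cpp_char8_t)
    if (std::is_same_v<Char, char8_t>)
        return input_encoding::utf8;    // Rule 1: char8_t means UTF-8.
#endif
    if (sizeof(Char) == 1)
        return input_encoding::unknown; // Rule 2: other 1-byte types.
    return input_encoding::utf;         // Rule 3: everything else.
}

static_assert(deduce_encoding<char>() == input_encoding::unknown);
static_assert(deduce_encoding<char32_t>() == input_encoding::utf);
```

Note that plain `char` input lands in the "unknown encoding" case even if its bytes happen to be valid UTF-8.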
|
|
[tip If you want to parse in ASCII-only mode, or in some unknown
|
|
encoding, use only sequences of `char`, like `std::string` or `char const
|
|
*`.]
|
|
|
|
[tip If you want to ensure all input is parsed as Unicode, pass the input
|
|
range `r` as `boost::text::as_utf32(r)` _emdash_ that's the first thing that
|
|
happens to it inside _p_ in the Unicode parsing path anyway.]
|
|
|
|
[note Since passing `boost::text::utf8_view` is a special case, and since a
|
|
sequence of `char` is otherwise considered an unknown encoding,
|
|
`boost::parser::parse(boost::text::as_utf8(r), p)` treats `r` as UTF-8, whereas
|
|
`boost::parser::parse(r.begin(), r.end(), p)` does not.]
|
|
|
|
[heading The `trace_mode` parameter to _p_]
|
|
|
|
Debugging parsers is notoriously difficult once they reach a certain size. To
|
|
get a verbose trace of your parse, pass `boost::parser::trace::on` as the final
|
|
parameter to _p_. It will show you the current parser being matched, the
|
|
front of the input, and any attributes generated. If an attribute appears
|
|
which it cannot print using stream insertion, it prints
|
|
`"<<unprintable-value>>"`.
|
|
|
|
TODO: `with_globals()`, `with_error_handler()`
|
|
|
|
[endsect]
|
|
|
|
[section Rules]
|
|
|
|
TODO
|
|
|
|
Getting at one of a rule's arguments and passing it as an argument to another
|
|
parser can be very verbose. __p_ is a variable template that allows you to
|
|
refer to the `n`th argument to the current rule, so that you can, in turn,
|
|
pass it to one of the rule's subparsers:
|
|
|
|
auto const indent_n_def = boost::parser::repeat(boost::parser::_p<0>)[' '_l];
|
|
|
|
Using __p_ can prevent you from having to write a bunch of lambdas that
|
|
each get an argument out of the parse context using `_params_np_(ctx)[0_c]` or
|
|
similar.
|
|
|
|
[endsect]
|
|
|
|
[section Unicode Support]
|
|
|
|
TODO
|
|
|
|
TODO: Unicode in symbol tables
|
|
|
|
[endsect]
|
|
|
|
[section Callback Parsing]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Best Practices]
|
|
|
|
TODO: Parse Unicode from the start.
|
|
|
|
TODO: Write rules, and test them in isolation.
|
|
|
|
TODO: Compile separately when you know the type of your input will not change.
|
|
|
|
[endsect]
|
|
|
|
[section Writing Your Own Parser]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[endsect]
|