mirror of
https://github.com/boostorg/parser.git
synced 2026-01-27 07:02:12 +00:00
697 lines
28 KiB
Plaintext
[/
|
|
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
|
|
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
|
|
/]
|
|
|
|
[section Tutorial]
|
|
|
|
[section Terminology]
|
|
|
|
First, let's cover some terminology that we'll be using throughout the docs:
|
|
|
|
A /semantic action/ is an arbitrary bit of logic associated with a parser,
|
|
that is only executed when the parser succeeds.
|
|
|
|
Simpler parsers can be combined to form more complex parsers. Given some
|
|
combining operation `C`, and parsers `P0`, `P1`, ... `PN`, `C(P0, P1, ... PN)`
|
|
creates a new parser `Q`. This creates a /parse tree/. `Q` is the parent of
|
|
`P1`, `P2` is the child of `Q`, etc. The parsers are applied in the top-down
|
|
fashion implied by this. When you use `Q` to parse a string, it will use
|
|
`P0`, `P1`, etc. to do the actual work. If `P3` is being used to parse the
|
|
input, that means that `Q` is as well, since the way `Q` parses is by
|
|
dispatching to its children to do some or all of the work. At any point in
|
|
the parse, there will be exactly one parser without children that is being
|
|
used to parse the input; all other parsers being used are its ancestors in the
|
|
parse tree.
|
|
|
|
A /subparser/ is a parser that is the child of another parser.
|
|
|
|
The /top-level parser/ is the root of the tree of parsers.
|
|
|
|
The /current parser/ or /innermost parser/ is the parser with no children that
|
|
is currently being used to parse the input.
|
|
|
|
A /rule/ is a kind of parser that makes building large, complex parsers
|
|
easier. A /subrule/ is a rule that is the child of some other rule. The
|
|
/current rule/ or /innermost rule/ is the one rule currently being used to
|
|
parse the input that has no subrules. Note that while there is always exactly
|
|
one current parser, there may or may not be a current rule _emdash_ rules are
|
|
one kind of parser, and you may or may not be using them in your top-level
|
|
parser.
|
|
|
|
The /top-level parse/ is the parse operation being performed by the top-level
|
|
parser. This term is necessary, because though most parse failures are local
|
|
to a particular parser, some parse failures cause the call to _p_ to indicate
|
|
failure of the entire parse. For these cases, we say that such a local
|
|
failure "causes the top-level parse to fail".
|
|
|
|
Next, we'll look at some simple programs that parse using _Parser_. We'll
|
|
start small and build up from there.
|
|
|
|
[endsect]
|
|
|
|
[section Hello, Whomever]
|
|
|
|
This is just about the most minimal example of using _Parser_ that one could
|
|
write. We take a string from the command line, or `"World"` if none is given,
|
|
and then we parse it:
|
|
|
|
[hello_example]
|
|
|
|
The expression `*bp::char_` is a parser-expression. It uses one of the many
|
|
parsers that _Parser_ provides, _ch_. Like all _Parser_ parsers, it has
|
|
certain operations defined on it. In this case, `*bp::char_` is using an
|
|
overloaded `operator*()` as the C++ version of a _kl_ operator. Since C++ has
|
|
no postfix unary `*` operator, we have to use the one we have, so it is used
|
|
as a prefix.
|
|
|
|
So, `*bp::char_` means "any number of characters". In other words, it really
|
|
cannot fail. Even an empty string will match it.
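To make the "cannot fail" point concrete, here is a minimal hand-written sketch of a Kleene-star combinator. It does not use _Parser_ at all; `any_char` and `kleene_star` are hypothetical names invented for this illustration, and the library's real implementation is different.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// A toy "parser": tries to match one character at the front of `in`.  On
// success it returns the character and advances `in` past it.
inline std::optional<char> any_char(std::string_view & in)
{
    if (in.empty())
        return std::nullopt;
    char const c = in.front();
    in.remove_prefix(1);
    return c;
}

// Toy Kleene star: applies `p` until it fails and collects what it matched.
// Zero matches still counts as success, so there is no failure path here.
template<typename P>
std::string kleene_star(P p, std::string_view & in)
{
    std::string result;
    while (auto c = p(in))
        result += *c;
    return result;
}
```

Calling `kleene_star(any_char, in)` on an empty `in` simply returns an empty string, which mirrors the claim that even an empty string matches.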
|
|
|
|
The parse operation is performed by calling the _p_ function, passing the
|
|
parser as one of the arguments:
|
|
|
|
    bp::parse(input, *bp::char_, result);
|
|
|
|
The arguments here are: `input`, the string to parse; `*bp::char_`, the parser
|
|
used to do the parse; and `result`, an out-parameter into which to put the
|
|
result of the parse. Don't get too caught up on this method of getting the
|
|
parse result out of _p_; there are multiple ways of doing so, and we'll cover
|
|
all of them in subsequent examples.
|
|
|
|
Also, just ignore for now the fact that _Parser_ somehow figured out that the
|
|
result type of the `*bp::char_` parser is a `std::string`. There are clear
|
|
rules for this that we'll cover later.
|
|
|
|
The effect of this call to _p_ is not very interesting _emdash_ since the
|
|
parser we gave it cannot ever fail, and because we're placing the output in
|
|
the same type as the input, it just copies the contents of `input` to
|
|
`result`.
|
|
|
|
[endsect]
|
|
|
|
[section A Trivial Example]
|
|
|
|
Let's look at a slightly more complicated example, even if it is still
|
|
trivial. Instead of taking any old `char`s we're given, let's require some
|
|
structure. Let's parse one or more `double`s, separated by commas.
|
|
|
|
The _Parser_ parser for `double` is _d_. So, to parse a single `double`, we'd
|
|
use _d_. If we wanted to parse two `double`s in a row, we'd use:
|
|
|
|
    boost::parser::double_ >> boost::parser::double_
|
|
|
|
`operator>>()` in this expression is the sequence-operator; read it as
|
|
"followed by". If we combine the sequence-operator with _kl_, we can get the
|
|
parser we want by writing:
|
|
|
|
    boost::parser::double_ >> *(',' >> boost::parser::double_)
|
|
|
|
This is a parser that matches at least one `double` _emdash_ because of the
|
|
first _d_ in the expression above _emdash_ followed by zero or more instances
|
|
of a-comma-followed-by-a-`double`. Notice that we can use `','` directly.
|
|
Though it is not a parser, `operator>>()` and the other operators defined on
|
|
_Parser_ parsers have overloads that accept character/parser pairs of
|
|
arguments; these operator overloads will create the right parser to recognize
|
|
`','`.
|
|
|
|
[trivial_example]
|
|
|
|
The first example filled in an out-parameter to deliver the result of the
|
|
parse. This call to _p_ returns a result instead. As you can see, the result
|
|
is contextually convertible to `bool`, and `*result` is some sort of range.
|
|
In fact, the return type of this call to _p_ is
|
|
`std::optional<std::vector<double>>`. Naturally, if the parse fails,
|
|
`std::nullopt` is returned. We'll look at how _Parser_ maps the type of the
|
|
parser to the return type, or the filled in out-parameter's type, a bit later.
|
|
|
|
If I run it in a shell, this is the result:
|
|
|
|
[pre
$ example/trivial
Enter a list of doubles, separated by commas. No pressure. 5.6,8.9
Great! It looks like you entered:
5.6
8.9
$ example/trivial
Enter a list of doubles, separated by commas. No pressure. 5.6, 8.9
Good job! Please proceed to the recovery annex for cake.
]
|
|
|
|
It does not recognize `"5.6, 8.9"`. This is because it expects a comma
|
|
followed /immediately/ by a `double`, but I inserted a space after the comma.
|
|
The same failure to parse would occur if I put a space before the comma, or
|
|
before or after the list of `double`s.
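If it helps to see why the space matters, here is a hand-written analogue of `double_ >> *(',' >> double_)` with no skipping. This is only a sketch under stated assumptions: it uses `std::strtod` rather than _Parser_, `parse_double` and `parse_doubles` are hypothetical helper names, and it treats unconsumed input as failure of the whole parse.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Parses one double at `pos` without skipping leading whitespace.
// std::strtod would silently skip it, so whitespace is rejected up front.
inline std::optional<double> parse_double(std::string const & s, std::size_t & pos)
{
    assert(pos <= s.size());
    if (pos == s.size() || std::isspace(static_cast<unsigned char>(s[pos])))
        return std::nullopt;
    char const * first = s.c_str() + pos;
    char * last = nullptr;
    double const value = std::strtod(first, &last);
    if (last == first)
        return std::nullopt; // nothing consumed: not a number
    pos += static_cast<std::size_t>(last - first);
    return value;
}

// Hand-written analogue of: double_ >> *(',' >> double_)
inline std::optional<std::vector<double>> parse_doubles(std::string const & s)
{
    std::size_t pos = 0;
    std::vector<double> result;
    auto first = parse_double(s, pos);
    if (!first)
        return std::nullopt;
    result.push_back(*first);
    while (pos < s.size() && s[pos] == ',') {
        std::size_t next = pos + 1;
        auto d = parse_double(s, next);
        if (!d)
            break; // the (',' >> double_) group failed; the star stops here
        result.push_back(*d);
        pos = next;
    }
    if (pos != s.size())
        return std::nullopt; // unconsumed input: the whole parse fails
    return result;
}
```

With this sketch, `parse_doubles("5.6,8.9")` produces two values, while `"5.6, 8.9"`, `" 5.6,8.9"`, and `"5.6,8.9 "` all produce `std::nullopt`, for the reasons given above.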
|
|
|
|
[endsect]
|
|
|
|
[section A Trivial Example That Gracefully Handles Whitespace]
|
|
|
|
Let's modify the trivial parser we just saw to ignore any spaces that might
|
|
exist among the `double`s and commas. To skip whitespace wherever we find it,
|
|
we can pass a /skip parser/ to our call to _p_ (we don't need to touch the
|
|
parser passed to _p_). Here, we use `ascii::space`, which matches any ASCII
|
|
character `c` for which `std::isspace(c)` is true.
|
|
|
|
[trivial_skipper_example]
|
|
|
|
The skip parser, or /skipper/, is run between the subparsers within the
|
|
parser passed to _p_. In this case, the skipper is run before the first
|
|
`double` is parsed, before any subsequent comma or `double` is parsed, and at
|
|
the end. So, the strings `"3.6,5.9"` and `" 3.6 , \t 5.9 "` are parsed the
|
|
same by this program.
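The places where the skipper runs can be sketched by hand. The code below is a toy model, not how _Parser_ implements skipping; `skip_ws` and `parse_doubles_skipping` are hypothetical names, and `std::strtod` stands in for the `double` parser.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Toy skipper: advances `pos` past any run of whitespace.
inline void skip_ws(std::string const & s, std::size_t & pos)
{
    assert(pos <= s.size());
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos])))
        ++pos;
}

// Hand-written analogue of double_ >> *(',' >> double_) run with a
// whitespace skipper: skip before every element, and once more at the end.
inline std::optional<std::vector<double>>
parse_doubles_skipping(std::string const & s)
{
    std::size_t pos = 0;
    std::vector<double> result;
    for (;;) {
        skip_ws(s, pos); // skip before the double
        if (pos == s.size())
            return std::nullopt; // a number is required here
        char const * first = s.c_str() + pos;
        char * last = nullptr;
        double const value = std::strtod(first, &last);
        if (last == first)
            return std::nullopt; // not a number
        result.push_back(value);
        pos += static_cast<std::size_t>(last - first);

        skip_ws(s, pos); // skip before the comma, or at the end
        if (pos == s.size())
            return result; // all input consumed
        if (s[pos] != ',')
            return std::nullopt; // something other than a comma follows
        ++pos; // consume the comma
    }
}
```

In this sketch, `"3.6,5.9"` and `" 3.6 , \t 5.9 "` yield equal results, matching the behavior described above.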
|
|
|
|
Skipping is an important concept in _Parser_. You can skip anything, not just
|
|
ASCII whitespace; there are lots of other things you might want to skip. The
|
|
skipper you pass to _p_ can be an arbitrary parser. For example, if you write
|
|
a parser for a scripting language, you can write a skipper to skip whitespace,
|
|
inline comments, and end-of-line comments.
|
|
|
|
We'll be using skip parsers almost exclusively in the rest of the
|
|
documentation. The ability to ignore the parts of your input that you don't
|
|
care about is so convenient that parsing without skipping is a rarity in
|
|
practice.
|
|
|
|
[endsect]
|
|
|
|
[section Semantic Actions]
|
|
|
|
Like all parsing systems (lex & yacc, _Spirit_, etc.), _Parser_ has a
|
|
mechanism for associating semantic actions with different parts of the parse.
|
|
Here is nearly the same program as we saw in the previous example, except that
|
|
it is implemented in terms of a semantic action that appends each parsed
|
|
`double` to a result, instead of automatically building and returning the
|
|
result:
|
|
|
|
[semantic_action_example]
|
|
|
|
Run in a shell, it looks like this:
|
|
|
|
[pre
$ example/semantic_actions
Enter a list of doubles, separated by commas. 4,3
Got one!
Got one!
You entered:
4
3
]
|
|
|
|
In _Parser_, semantic actions are implemented in terms of invocable objects
|
|
that take a single parameter, a reference to a parse-context object. In the example we
|
|
used this lambda as our invocable:
|
|
|
|
[semantic_action_example_lambda]
|
|
|
|
We're both printing a message to `std::cout` and recording a parsed result in
|
|
the lambda. It could do both, either, or neither of these things if you like.
|
|
The way we get the parsed `double` in the lambda is by asking the parse
|
|
context for it. `_attr(ctx)` is how you ask the parse context for the
|
|
attribute produced by the parser to which the semantic action is attached.
|
|
There are lots of functions like `_attr()` that can be used to access the
|
|
state in the parse context. We'll cover more of them later on.
|
|
|
|
[endsect]
|
|
|
|
[section The Parse Context]
|
|
|
|
Now would be a good time to describe the parse context in some detail. Any
|
|
semantic action that you write will need to use the state in the parse
|
|
context, so you need to know what's available.
|
|
|
|
The parse context is a `hana::map` from tag types to elements. Elements are
|
|
added to or removed from it at different times during the parse. For
|
|
instance, when a parser with a semantic action succeeds, it adds the attribute
|
|
it produces to the parse context, then calls the invocable semantic action.
|
|
This is efficient to do, because the `hana::map` remains fairly small, usually
|
|
around ten elements, and each element is stored as a pointer. Copying the
|
|
entire map when mutating the context is therefore fast.
|
|
|
|
[note All these functions that take the parse context as their first parameter
|
|
will be found by Argument-Dependent Lookup. You will probably never need
|
|
to qualify them with `boost::parser::`.]
|
|
|
|
[heading Accessors for data that are always available]
|
|
|
|
_pass_ is a `bool` indicating the success or failure of the current parse.
|
|
This can be used to force the current parse to pass or fail:
|
|
|
|
    [](auto & ctx) {
        // If the attribute meets this predicate, fail the parse.
        if (some_condition(_attr(ctx)))
            _pass(ctx) = false;
    }
|
|
|
|
Note that for a semantic action to be executed, its associated parser must
|
|
already have succeeded. So unless you previously wrote `_pass(ctx) = false`
|
|
somewhere, `_pass(ctx) = true` does nothing; it's redundant.
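The relationship between a parser, its semantic action, and the pass flag can be modeled in a few lines. This is a toy sketch, not the library's mechanism: `toy_context` and `parse_double_with_action` are hypothetical names, the context is a plain struct rather than the real parse context, and `std::strtod` stands in for the parser.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Toy stand-in for a parse context: the attribute of the parser that just
// succeeded, plus a flag the action may clear to veto the match.
struct toy_context
{
    double attr;
    bool pass = true;
};

// Parses a double at the start of `s`, and runs `action` only if that
// succeeds.  Returns the final pass/fail status, which the action may change.
template<typename Action>
bool parse_double_with_action(std::string const & s, Action action)
{
    char * last = nullptr;
    double const value = std::strtod(s.c_str(), &last);
    if (last == s.c_str())
        return false; // the parser failed, so the action is never invoked
    toy_context ctx{value};
    action(ctx);
    return ctx.pass;
}
```

An action such as `[](auto & ctx) { if (ctx.attr < 0) ctx.pass = false; }` turns an otherwise successful match of `"-1"` into a failure. An action that sets `ctx.pass = true` changes nothing, since the flag already starts out `true` by the time the action runs.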
|
|
|
|
_begin_ and _end_ return the beginning and end of the range that you passed to
|
|
_p_, respectively. _where_ returns a _v_ indicating the bounds of the input
|
|
matched by the current parse. _where_ can be useful if you just want to parse
|
|
some text and return a result consisting of where certain elements are
|
|
located, without producing any other attributes.
|
|
|
|
_error_handler_ returns a reference to the error handler associated with the
|
|
parser passed to _p_. Any error handler must have the following member
|
|
functions:
|
|
|
|
[error_handler_api_1]
|
|
|
|
[error_handler_api_2]
|
|
|
|
If you call the second one, the one without the iterator parameter, it will
|
|
call the first with `_where(ctx).begin()` as the iterator parameter. The
|
|
one without the iterator is the one you will use most often. The one with the
|
|
explicit iterator parameter can be useful in situations where you have
|
|
messages that are related to each other, associated with multiple locations.
|
|
For instance, if you are parsing XML, you may want to report that a close-tag
|
|
does not match its associated open-tag by showing the line where the open-tag
|
|
was found. That may of course not be located anywhere near
|
|
`_where(ctx).begin()`. (A description of _globals_ is below.)
|
|
|
|
    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                open_tag_msg,
                ctx,
                _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                close_tag_msg,
                ctx);

            // Explicitly fail the parse. Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }
|
|
|
|
There are also some convenience functions that make the above code a little
|
|
less verbose (_report_error_ and _report_warning_):
|
|
|
|
    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _report_error(ctx, open_tag_msg, _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _report_error(ctx, close_tag_msg);

            // Explicitly fail the parse. Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }
|
|
|
|
You should use these less verbose functions almost all the time. The only
|
|
time you would want to use _error_handler_ is when you are using a custom
|
|
error handler, and you want access to some part of its interface besides
|
|
`diagnose()`.
|
|
|
|
[heading Accessors for data that are only sometimes available]
|
|
|
|
_attr_ is the value of the current parser's attribute. It is available only
|
|
when the current parser's parse is successful. If the parser has no semantic
|
|
action, no attribute gets added to the parse context. It can be used to read
|
|
and write the current parser's attribute:
|
|
|
|
    [](auto & ctx) { _attr(ctx) = 3; }
|
|
|
|
If the current parser has no attribute, a _n_ is returned.
|
|
|
|
_val_ is the value of the attribute of the current rule being used to parse
|
|
(if any), and is available even before the rule's parse is successful. It can
|
|
be used to set the current rule's attribute, even from a parser that is a
|
|
subparser inside the rule. Let's say we're writing a parser with a semantic
|
|
action that is within a rule. If we want to set the current rule's value to
|
|
whatever this subparser parses, we would write this semantic action:
|
|
|
|
    [](auto & ctx) { _val(ctx) = _attr(ctx); }
|
|
|
|
If there is no current rule, or the current rule has no attribute, a _n_ is
|
|
returned.
|
|
|
|
_globals_ returns a reference to a user-supplied struct that contains whatever
|
|
data you want to use during the parse. We'll get into this more later, but
|
|
for now, here's how you might use it:
|
|
|
|
    [](auto & ctx) {
        // black_list is some set of proscribed values that are not allowed.
        if (_globals(ctx).black_list.contains(_attr(ctx)))
            _pass(ctx) = false;
    }
|
|
|
|
_locals_ returns a reference to one or more values that are local to the
|
|
current rule being parsed, if any. If there are two or more local values,
|
|
_locals_ returns a reference to a `hana::tuple`. Rules are something we
|
|
haven't gotten to yet, but here is how you use _locals_:
|
|
|
|
    [](auto & ctx) {
        auto & local = _locals(ctx);
        // Use local here. If it is a hana::tuple, access its members like this:
        using namespace hana::literals;
        auto & first_element = local[0_c];
        auto & second_element = local[1_c];
    }
|
|
|
|
If there is no current rule, or the current rule has no locals, a _n_ is
|
|
returned.
|
|
|
|
_params_, like _locals_, applies to the current rule being used to parse, if
|
|
any. It also returns a reference to a single value, if the current rule has
|
|
only one parameter, or a `hana::tuple` to multiple values if the current rule
|
|
has multiple parameters.
|
|
|
|
If there is no current rule, or the current rule has no parameters, a _n_ is
|
|
returned.
|
|
|
|
[note _n_ is a type that is used as a return value in _Parser_ for parse
|
|
context accessors. _n_ is convertible to anything that has a default
|
|
constructor, convertible from anything, assignable from anything, and has
|
|
templated overloads for all the overloadable operators. The intention is that
|
|
a misuse of _val_, _globals_, etc. should compile, and produce an assertion at
|
|
runtime. Experience has shown that using a debugger for investigating the
|
|
stack that leads to your mistake is a far better user experience than sifting
|
|
through compiler diagnostics. See the rationale section for a more detailed
|
|
explanation.]
|
|
|
|
[endsect]
|
|
|
|
[section Symbol Tables]
|
|
|
|
When writing a parser, it often comes up that there is a set of strings that,
|
|
when parsed, are associated with a set of values 1-to-1. It is tedious to
|
|
write parsers that recognize all the possible input strings when you have to
|
|
associate each one with an attribute via a semantic action. Instead, we can
|
|
use a symbol table.
|
|
|
|
Say we want to parse Roman numerals, one of the most common work-related
|
|
parsing problems. We want to recognize numbers that start with any number of
|
|
"M"s, representing thousands, followed by the hundreds, the tens, and the
|
|
ones. Any of these may be absent from the input, but not all. Here are three
|
|
_Parser_ symbol tables that we can use to recognize ones, tens, and hundreds
|
|
values, respectively:
|
|
|
|
[roman_numeral_symbol_tables]
|
|
|
|
A _symbols_ maps strings of `char` to their associated attributes. The type
|
|
of the attribute must be specified as a template parameter to _symbols_
|
|
_emdash_ `int` in this case.
|
|
|
|
Any "M"s we encounter should add 1000 to the result, and all other values come
|
|
from the symbol tables. Here are the semantic actions we'll need to do that:
|
|
|
|
[roman_numeral_actions]
|
|
|
|
`add_1000` just adds `1000` to `result`. `add` adds whatever attribute is
|
|
produced by its parser to `result`.
|
|
|
|
Now we just need to put the pieces together to make a parser:
|
|
|
|
[roman_numeral_parser]
|
|
|
|
We've got a few new bits in play here, so let's break it down. `'M'_l` is a
|
|
/literal parser/. That is, it is a parser that parses a literal `char`, code
|
|
point, or string. In this case, a `char` "M" is being parsed. The `_l` bit
|
|
at the end is a _udl_ suffix that you can put after any `char`, `char32_t`, or
|
|
`char const *` to form a literal parser. You can also make a literal parser
|
|
by writing _lit_ for some `x` of one of the previously mentioned types.
|
|
|
|
Why do we need any of this, considering that we just used a literal `','` in
|
|
our previous example? The reason is that `'M'` is not used in an expression
|
|
with another _Parser_ parser. It is used within `*'M'_l[add_1000]`. If we'd
|
|
written `*'M'[add_1000]`, clearly that would be ill-formed; `char` has no
|
|
`operator*()`, nor an `operator[]()`, associated with it.
|
|
|
|
[tip Any time you want to use a `char`, `char32_t`, or string literal in a
|
|
_Parser_ parser, write it as-is if it is combined with a preexisting _Parser_
|
|
subparser `p`, as in `'x' >> p`. Otherwise, you need to wrap it in a call to
|
|
_lit_, or use the `_l` _udl_ suffix.]
|
|
|
|
On to the next bit: `-hundreds[add]`. By now, the use of the index operator
|
|
should be pretty familiar; it associates the semantic action `add` with the
|
|
parser `hundreds`. The `operator-()` at the beginning is new. It means that
|
|
the parser it is applied to is optional. You can read it as "zero or one".
|
|
So, if `hundreds` is not successfully parsed after `*'M'_l[add_1000]`, nothing
|
|
happens, because `hundreds` is allowed to be missing _emdash_ it's optional.
|
|
If `hundreds` is parsed successfully, say by matching `"CC"`, the resulting
|
|
attribute, `200`, is added to `result` inside `add`.
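For readers who want to trace the logic without the library, here is a hand-written sketch of the same idea. Everything in it is an assumption made for illustration: `match_symbol` and `parse_roman` are hypothetical names, the tables are the standard Roman-numeral values rather than a copy of the example's tables, and the lookup picks the longest matching key. The `matched_any` check models the requirement stated earlier that not every part may be absent.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

using symbol_table = std::vector<std::pair<std::string_view, int>>;

// Toy symbol-table parser: finds the longest key that is a prefix of `in`,
// consumes it, and returns its value.  Returns nullopt if no key matches.
inline std::optional<int> match_symbol(symbol_table const & table, std::string_view & in)
{
    std::pair<std::string_view, int> const * best = nullptr;
    for (auto const & entry : table) {
        bool const is_prefix = in.substr(0, entry.first.size()) == entry.first;
        if (is_prefix && (!best || best->first.size() < entry.first.size()))
            best = &entry;
    }
    if (!best)
        return std::nullopt;
    in.remove_prefix(best->first.size());
    return best->second;
}

inline std::optional<int> parse_roman(std::string_view in)
{
    static symbol_table const hundreds = {
        {"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400}, {"D", 500},
        {"DC", 600}, {"DCC", 700}, {"DCCC", 800}, {"CM", 900}};
    static symbol_table const tens = {
        {"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
        {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};
    static symbol_table const ones = {
        {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
        {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};

    int result = 0;
    bool matched_any = false;
    while (!in.empty() && in.front() == 'M') { // any number of "M"s
        result += 1000;
        in.remove_prefix(1);
        matched_any = true;
    }
    // Each table is optional: a failed lookup consumes nothing and adds nothing.
    for (symbol_table const * table : {&hundreds, &tens, &ones}) {
        if (auto value = match_symbol(*table, in)) {
            result += *value;
            matched_any = true;
        }
    }
    if (!matched_any || !in.empty())
        return std::nullopt; // nothing matched, or input was left over
    return result;
}
```

In this sketch, `parse_roman("CC")` yields `200` via the hundreds table, and `parse_roman("MCMXCIV")` adds up 1000, 900, 90, and 4.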
|
|
|
|
Here is the full listing of the program. Notice that it would have been
|
|
inappropriate to use a whitespace skipper here, since the entire parse is a
|
|
single number, so it was removed.
|
|
|
|
[roman_numeral_example]
|
|
|
|
[endsect]
|
|
|
|
[section Mutable Symbol Tables]
|
|
|
|
The previous example showed how to use a symbol table as a fixed lookup table.
|
|
What if we want to add things to the table during the parse? We can do that,
|
|
but we need to do so within a semantic action. First, here is our symbol
|
|
table, already with a single value in it:
|
|
|
|
[self_filling_symbol_table_table]
|
|
|
|
No surprise that it works to use the symbol table as a parser to parse the one
|
|
string in the symbol table. Now, here's our parser:
|
|
|
|
[self_filling_symbol_table_parser]
|
|
|
|
Here, we've attached the semantic action not to a simple parser like _d_, but
|
|
to the sequence parser `(bp::char_ >> bp::int_)`. This sequence parser
|
|
contains two parsers, each with its own attribute, so it produces two
|
|
attributes as a `hana::tuple`.
|
|
|
|
[self_filling_symbol_table_action]
|
|
|
|
Inside the semantic action, we can get the first element of the attribute
|
|
tuple using _udls_ provided by Boost.Hana, and `hana::tuple::operator[]()`.
|
|
The first attribute, from the _ch_, is `_attr(ctx)[0_c]`, and the second, from
|
|
the _i_, is `_attr(ctx)[1_c]`. To add the symbol to the symbol table, we call
|
|
`insert()`.
|
|
|
|
[self_filling_symbol_table_parser]
|
|
|
|
During the parse, `("X", 9)` is parsed and added to the symbol table. Then,
|
|
the second `'X'` is recognized by the symbol table parser. However:
|
|
|
|
[self_filling_symbol_table_after_parse]
|
|
|
|
If we parse again, we find that `"X"` did not stay in the symbol table. The
|
|
fact that `symbols` was declared const might have given you a hint that this
|
|
would happen. Also, notice that the call to `insert()` in the semantic action
|
|
uses the parse context; that's where all the symbol table changes are stored
|
|
during the parse.
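One way to picture this is as an overlay: the table itself stays const, and each parse carries its own set of additions that is thrown away when the parse ends. The sketch below models that idea with standard containers only. `toy_parse_state` and `parse_once` are hypothetical names, the `{"c", 8}` entry used in the note after the code is made up for illustration, and nothing here reflects how _Parser_ actually stores the changes.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

using table = std::map<std::string, int>;

// Toy per-parse state: additions made during one parse live here, not in
// the (const) table itself.
struct toy_parse_state
{
    table const & base;
    table added;

    void insert(std::string const & key, int value) { added[key] = value; }

    // Looks in this parse's additions first, then in the original table.
    std::optional<int> find(std::string const & key) const
    {
        if (auto it = added.find(key); it != added.end())
            return it->second;
        if (auto it = base.find(key); it != base.end())
            return it->second;
        return std::nullopt;
    }
};

// One "parse": optionally insert `key`, then look it up.  The state is
// created fresh on every call and discarded on return.
inline std::optional<int> parse_once(
    table const & symbols,
    std::string const & key,
    std::optional<int> to_insert = std::nullopt)
{
    toy_parse_state state{symbols, {}};
    if (to_insert)
        state.insert(key, *to_insert);
    return state.find(key);
}
```

Given a const table holding only `{"c", 8}`, `parse_once(symbols, "X", 9)` finds `9`, but a later `parse_once(symbols, "X")` finds nothing, because the insertion never touched `symbols`.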
|
|
|
|
The full program:
|
|
|
|
[self_filling_symbol_table_example]
|
|
|
|
[note It is possible to add symbols to a _symbols_ permanently. To do so, you
|
|
have to use a mutable _symbols_ object `s`, and add the symbols by calling
|
|
`s.add()`, instead of `s.insert()`.]
|
|
|
|
[endsect]
|
|
|
|
[section Alternative Parsers]
|
|
|
|
Frequently, you need to parse something that might have one of several forms.
|
|
`operator|()` is overloaded to form alternative parsers. For example:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_1 = bp::int_ | bp::eps;
|
|
|
|
`parser_1` matches an integer, or if that fails, it matches /epsilon/, the
|
|
empty string. This is equivalent to writing:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_2 = -bp::int_;
|
|
|
|
However, neither `parser_1` nor `parser_2` is equivalent to writing this:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_3 = bp::eps | bp::int_;
|
|
|
|
The reason is that alternative parsers try each of their subparsers, one at a
|
|
time, and stop on the first one that matches. /Epsilon/ matches anything,
|
|
since it is zero length and consumes no input. It even matches the end of
|
|
input. This means that `parser_3` is equivalent to _e_ by itself.
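The effect of ordering is easy to reproduce with a toy ordered-choice function. This sketch does not use _Parser_; `match_int`, `match_eps`, and `match_alternative` are hypothetical names, and `match_int` only recognizes unsigned decimal digits, which is narrower than a real integer parser.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Each toy parser returns the number of characters it consumed from the
// front of `in`, or -1 on failure.
inline int match_int(std::string_view in)
{
    std::size_t n = 0;
    while (n < in.size() && std::isdigit(static_cast<unsigned char>(in[n])))
        ++n;
    return n == 0 ? -1 : static_cast<int>(n);
}

// Epsilon: always succeeds and consumes nothing.
inline int match_eps(std::string_view) { return 0; }

// Ordered choice: try `first`; fall back to `second` only if `first` fails.
template<typename P1, typename P2>
int match_alternative(P1 first, P2 second, std::string_view in)
{
    int const consumed = first(in);
    return consumed >= 0 ? consumed : second(in);
}
```

With the integer alternative first, the input `"42"` is consumed in full (2 characters). With epsilon first, the same input "matches" after consuming 0 characters, and the integer alternative is never tried.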
|
|
|
|
[endsect]
|
|
|
|
[section The Parsers And Their Uses]
|
|
|
|
TODO
|
|
|
|
TODO: Cover the various collapsing rules with op*, op+, etc.
|
|
|
|
[endsect]
|
|
|
|
[section Combining Operations]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Directives]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Attribute Generation]
|
|
|
|
So far, we've seen several different types of attributes that come from
|
|
different parsers, `int` for _i_, `hana::tuple<char, int>` for
|
|
`boost::parser::char_ >> boost::parser::int_`, etc. Let's get into how this
|
|
works with a bit more rigor.
|
|
|
|
[heading Parser attributes]
|
|
|
|
This table summarizes the attributes generated for all _Parser_ parsers. In
|
|
the table `x` and `y` represent arbitrary objects of any type.
|
|
|
|
[table Parsers and Their Attributes
    [[Parser] [Attribute]]

    [[ _e_ ]                [ None. ]]
    [[ _eol_ ]              [ None. ]]
    [[ _eoi_ ]              [ None. ]]
    [[ _attr_np_`(x)` ]     [ `decltype(x)` ]]
    [[ _ch_ ]               [ /See below./ ]]
    [[ _cp_ ]               [ `uint32_t` ]]
    [[ _cu_ ]               [ `char` ]]
    [[ _lit_np_`(x)`]       [ None. ]]
    [[ _str_np_`(x)`]       [ `std::string` ]]
    [[ _b_ ]                [ `bool` ]]

    [[ _bin_ ]              [ `unsigned int` ]]
    [[ _oct_ ]              [ `unsigned int` ]]
    [[ _hex_ ]              [ `unsigned int` ]]
    [[ _us_ ]               [ `unsigned short` ]]
    [[ _ui_ ]               [ `unsigned int` ]]
    [[ _ul_ ]               [ `unsigned long` ]]
    [[ _ull_ ]              [ `unsigned long long` ]]

    [[ _s_ ]                [ `short` ]]
    [[ _i_ ]                [ `int` ]]
    [[ _l_ ]                [ `long` ]]
    [[ _ll_ ]               [ `long long` ]]
    [[ _f_ ]                [ `float` ]]
    [[ _d_ ]                [ `double` ]]

    [[ _symbols_t_ ]        [ `T` ]]
]
|
|
|
|
Also, all parsers in the `ascii` namespace have the attribute `char`.
|
|
|
|
_ch_ is a bit odd, since its attribute type is polymorphic. When you use _ch_
|
|
to parse text in the non-Unicode code path (i.e. a string of `char`), the
|
|
attribute is `char`. When you use the exact same _ch_ to parse in the
|
|
Unicode-aware code path, all matching is code point based, and so the
|
|
attribute type is the type used to represent code points. For typical uses,
|
|
that type is `uint32_t`. All parsing of UTF-8 falls under this typical case.
|
|
The only time the code point type will be something different is if you call
|
|
_p_ with a code point sequence whose element type is something besides
|
|
`uint32_t`.
|
|
|
|
[heading Combining operation attributes]
|
|
|
|
Combining operations of course affect the generation of attributes. In this
|
|
table: `ATTR()` is a notional macro that expands to the attribute type of the
|
|
parser passed to it; `m` and `n` are integral values; `c` is a
|
|
condition/predicate; `x`, `y1`, `y2`, ... are arbitrary objects; `a` is a
|
|
semantic action; and `p`, `p1`, `p2`, ... are parsers.
|
|
|
|
[table Combining Operations and Their Attributes
    [[Parser] [Attribute]]

    [[`!p`]                 [None.]]
    [[`&p`]                 [None.]]

    [[`*p`]                 [`std::vector<ATTR(p)>`]]
    [[`+p`]                 [`std::vector<ATTR(p)>`]]
    [[`+*p`]                [`std::vector<ATTR(p)>`]]
    [[`*+p`]                [`std::vector<ATTR(p)>`]]
    [[`-p`]                 [`std::optional<ATTR(p)>`]]

    [[`p1 >> p2`]           [`hana::tuple<ATTR(p1), ATTR(p2)>`]]
    [[`p1 > p2`]            [`hana::tuple<ATTR(p1), ATTR(p2)>`]]
    [[`p1 >> p2 >> p3`]     [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
    [[`p1 > p2 >> p3`]      [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
    [[`p1 >> p2 > p3`]      [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
    [[`p1 > p2 > p3`]       [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]

    [[`p1 | p2`]            [`std::variant<ATTR(p1), ATTR(p2)>`]]
    [[`p1 | p2 | p3`]       [`std::variant<ATTR(p1), ATTR(p2), ATTR(p3)>`]]

    [[`p1 % p2`]            [`std::vector<ATTR(p1)>`]]

    [[`p[a]`]               [None.]]

    [[_rpt_np_`(n)[p]`]     [`ATTR(p)`]]
    [[_rpt_np_`(m, n)[p]`]  [`ATTR(p)`]]
    [[_if_np_`(c)[p]`]      [`std::optional<ATTR(p)>`]]
    [[_sw_np_`(x)(y1, p1)(y2, p2)...`] [`std::variant<ATTR(p1), ATTR(p2), ...>`]]
]
|
|
|
|
TODO: Still missing lots of cases, like *char_ (string or vector?); char_ >>
|
|
string("foo"); string("foo") >> string("bar"); *int_ >> *int_; etc.
|
|
|
|
[heading Directives that affect attribute generation]
|
|
|
|
_omit_np_`[p]` disables attribute generation for the parser `p`. Not only
|
|
does _omit_np_`[p]` have no attribute, but any attribute generation work that
|
|
normally happens within `p` is skipped.
|
|
|
|
_raw_np_`[p]` changes the attribute from whatever `p`'s attribute is to
|
|
_v_`<I>`, where `I` is the type of the iterator used within the parse. Note
|
|
that this may not be the same as the iterator type passed to _p_. For
|
|
instance, when parsing UTF-8, the iterator passed to _p_ may be `char8_t const
|
|
*`, but within the parse it will be a UTF-8 to UTF-32 transcoding (converting)
|
|
iterator. Just like _omit_, _raw_ causes all attribute-production work
|
|
within `p` to be skipped.
|
|
|
|
[endsect]
|
|
|
|
[section The `parse()` API]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Rules]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Unicode Support]
|
|
|
|
TODO
|
|
|
|
TODO: Unicode in symbol tables
|
|
|
|
[endsect]
|
|
|
|
[section Callback Parsing]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[endsect]
|