mirror of
https://github.com/boostorg/parser.git
synced 2026-01-25 06:22:13 +00:00
866 lines
36 KiB
Plaintext
[/
|
|
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
|
|
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
|
|
/]
|
|
|
|
[section Tutorial]
|
|
|
|
[section Terminology]
|
|
|
|
First, let's cover some terminology that we'll be using throughout the docs:
|
|
|
|
A /semantic action/ is an arbitrary bit of logic associated with a parser,
|
|
that is only executed when the parser succeeds.
|
|
|
|
Simpler parsers can be combined to form more complex parsers. Given some
|
|
combining operation `C`, and parsers `P0`, `P1`, ... `PN`, `C(P0, P1, ... PN)`
|
|
creates a new parser `Q`. This creates a /parse tree/. `Q` is the parent of
|
|
`P1`, `P2` is the child of `Q`, etc. The parsers are applied in the top-down
|
|
fashion implied by this. When you use `Q` to parse a string, it will use
|
|
`P0`, `P1`, etc. to do the actual work. If `P3` is being used to parse the
|
|
input, that means that `Q` is as well, since the way `Q` parses is by
|
|
dispatching to its children to do some or all of the work. At any point in
|
|
the parse, there will be exactly one parser without children that is being
|
|
used to parse the input; all other parsers being used are its ancestors in the
|
|
parse tree.
|
|
|
|
A /subparser/ is a parser that is the child of another parser.
|
|
|
|
The /top-level parser/ is the root of the tree of parsers.
|
|
|
|
The /current parser/ or /innermost parser/ is the parser with no children that
|
|
is currently being used to parse the input.
|
|
|
|
A /rule/ is a kind of parser that makes building large, complex parsers
|
|
easier. A /subrule/ is a rule that is the child of some other rule. The
|
|
/current rule/ or /innermost rule/ is the one rule currently being used to
|
|
parse the input that has no subrules. Note that while there is always exactly
|
|
one current parser, there may or may not be a current rule _emdash_ rules are
|
|
one kind of parser, and you may or may not be using them in your top-level
|
|
parser.
|
|
|
|
The /top-level parse/ is the parse operation being performed by the top-level
|
|
parser. This term is necessary, because though most parse failures are local
|
|
to a particular parser, some parse failures cause the call to _p_ to indicate
|
|
failure of the entire parse. For these cases, we say that such a local
|
|
failure "causes the top-level parse to fail".
|
|
|
|
Next, we'll look at some simple programs that parse using _Parser_. We'll
|
|
start small and build up from there.
|
|
|
|
[endsect]
|
|
|
|
[section Hello, Whomever]
|
|
|
|
This is just about the most minimal example of using _Parser_ that one could
|
|
write. We take a string from the command line, or `"World"` if none is given,
|
|
and then we parse it:
|
|
|
|
[hello_example]
|
|
|
|
The expression `*bp::char_` is a parser-expression. It uses one of the many
|
|
parsers that _Parser_ provides, _ch_. Like all _Parser_ parsers, it has
|
|
certain operations defined on it. In this case, `*bp::char_` is using an
|
|
overloaded `operator*()` as the C++ version of a _kl_ operator. Since C++ has
|
|
no postfix unary `*` operator, we have to use the one we have, so it is used
|
|
as a prefix.
|
|
|
|
So, `*bp::char_` means "any number of characters". In other words, it really
|
|
cannot fail. Even an empty string will match it.
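To make the "cannot fail" point concrete, here is a minimal sketch in plain standard C++. It does not use _Parser_ at all; `kleene_star`, `is_any_char`, and `is_digit_char` are hypothetical names made up for illustration. The function applies a single-character matcher zero or more times and reports how much input it consumed. Consuming nothing still counts as success.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a single-character parser:
// returns true if `c` matches.
using char_matcher = bool (*)(char);

// Matches any character at all, in the spirit of char_.
inline bool is_any_char(char) { return true; }

// Matches only decimal digits.
inline bool is_digit_char(char c) { return '0' <= c && c <= '9'; }

// Applies `match` zero or more times, starting at the front of `input`.
// Returns the number of characters consumed.  There is no failure case:
// consuming zero characters is a successful match.
inline std::size_t kleene_star(std::string_view input, char_matcher match)
{
    std::size_t consumed = 0;
    while (consumed < input.size() && match(input[consumed]))
        ++consumed;
    return consumed;
}
```

With `is_any_char`, the whole input is always consumed, which is why the example program ends up copying its input. With a pickier matcher such as `is_digit_char`, the star simply stops at the first character that does not match.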
|
|
|
|
The parse operation is performed by calling the _p_ function, passing the
|
|
parser as one of the arguments:
|
|
|
|
    bp::parse(input, *bp::char_, result);
|
|
|
|
The arguments here are: `input`, the string to parse; `*bp::char_`, the parser
|
|
used to do the parse; and `result`, an out-parameter into which to put the
|
|
result of the parse. Don't get too caught up on this method of getting the
|
|
parse result out of _p_; there are multiple ways of doing so, and we'll cover
|
|
all of them in subsequent examples.
|
|
|
|
Also, just ignore for now the fact that _Parser_ somehow figured out that the
|
|
result type of the `*bp::char_` parser is a `std::string`. There are clear
|
|
rules for this that we'll cover later.
|
|
|
|
The effect of this call to _p_ is not very interesting _emdash_ since the
|
|
parser we gave it cannot ever fail, and because we're placing the output in
|
|
the same type as the input, it just copies the contents of `input` to
|
|
`result`.
|
|
|
|
[endsect]
|
|
|
|
[section A Trivial Example]
|
|
|
|
Let's look at a slightly more complicated example, even if it is still
|
|
trivial. Instead of taking any old `char`s we're given, let's require some
|
|
structure. Let's parse one or more `double`s, separated by commas.
|
|
|
|
The _Parser_ parser for `double` is _d_. So, to parse a single `double`, we'd
|
|
use _d_. If we wanted to parse two `double`s in a row, we'd use:
|
|
|
|
    boost::parser::double_ >> boost::parser::double_
|
|
|
|
`operator>>()` in this expression is the sequence-operator; read it as
|
|
"followed by". If we combine the sequence-operator with _kl_, we can get the
|
|
parser we want by writing:
|
|
|
|
    boost::parser::double_ >> *(',' >> boost::parser::double_)
|
|
|
|
This is a parser that matches at least one `double` _emdash_ because of the
|
|
first _d_ in the expression above _emdash_ followed by zero or more instances
|
|
of a-comma-followed-by-a-`double`. Notice that we can use `','` directly.
|
|
Though it is not a parser, `operator>>()` and the other operators defined on
|
|
_Parser_ parsers have overloads that accept character/parser pairs of
|
|
arguments; these operator overloads will create the right parser to recognize
|
|
`','`.
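If it helps to see what that expression asks for, here is a hand-written sketch of the same grammar in plain standard C++. This is not how _Parser_ implements it; `parse_double` and `parse_double_list` are hypothetical helpers, and `std::strtod` stands in for the real `double` parser. The sketch also assumes that the whole input must be consumed for the parse to succeed.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Parses one double at `pos`, with no leading whitespace allowed.
// Advances `pos` past the number on success.
inline std::optional<double>
parse_double(std::string const & input, std::size_t & pos)
{
    if (pos >= input.size() ||
        std::isspace(static_cast<unsigned char>(input[pos])))
        return std::nullopt; // strtod would silently skip whitespace
    char const * first = input.c_str() + pos;
    char * last = nullptr;
    double const value = std::strtod(first, &last);
    if (last == first)
        return std::nullopt; // no number here
    pos += static_cast<std::size_t>(last - first);
    return value;
}

// Hand-written model of: double_ >> *(',' >> double_)
inline std::optional<std::vector<double>>
parse_double_list(std::string const & input)
{
    std::size_t pos = 0;
    std::vector<double> result;

    auto first = parse_double(input, pos); // the leading double_
    if (!first)
        return std::nullopt;
    result.push_back(*first);

    // *(',' >> double_): repeat while comma-then-double matches.
    while (pos < input.size() && input[pos] == ',') {
        std::size_t next = pos + 1;
        auto value = parse_double(input, next);
        if (!value)
            break; // this repetition failed; keep the earlier matches
        result.push_back(*value);
        pos = next;
    }

    // Assumption: the parse only succeeds if all input was consumed.
    if (pos != input.size())
        return std::nullopt;
    return result;
}
```

Note that the loop commits to a comma only when a `double` follows it. A trailing comma, or a comma followed by a space, leaves unconsumed input behind, and the parse as a whole fails.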
|
|
|
|
[trivial_example]
|
|
|
|
The first example filled in an out-parameter to deliver the result of the
|
|
parse. This call to _p_ returns a result instead. As you can see, the result
|
|
is contextually convertible to `bool`, and `*result` is some sort of range.
|
|
In fact, the return type of this call to _p_ is
|
|
`std::optional<std::vector<double>>`. Naturally, if the parse fails,
|
|
`std::nullopt` is returned. We'll look at how _Parser_ maps the type of the
|
|
parser to the return type, or the filled in out-parameter's type, a bit later.
|
|
|
|
If I run it in a shell, this is the result:
|
|
|
|
[pre
|
|
$ example/trivial
|
|
Enter a list of doubles, separated by commas. No pressure. 5.6,8.9
|
|
Great! It looks like you entered:
|
|
5.6
|
|
8.9
|
|
$ example/trivial
|
|
Enter a list of doubles, separated by commas. No pressure. 5.6, 8.9
|
|
Good job! Please proceed to the recovery annex for cake.
|
|
]
|
|
|
|
It does not recognize `"5.6, 8.9"`. This is because it expects a comma
|
|
followed /immediately/ by a `double`, but I inserted a space after the comma.
|
|
The same failure to parse would occur if I put a space before the comma, or
|
|
before or after the list of `double`s.
|
|
|
|
[endsect]
|
|
|
|
[section A Trivial Example That Gracefully Handles Whitespace]
|
|
|
|
Let's modify the trivial parser we just saw to ignore any spaces that might
|
|
exist among the `double`s and commas. To skip whitespace wherever we find it,
|
|
we can pass a /skip parser/ to our call to _p_ (we don't need to touch the
|
|
parser passed to _p_). Here, we use `ascii::space`, which matches any ASCII
|
|
character `c` for which `std::isspace(c)` is true.
|
|
|
|
[trivial_skipper_example]
|
|
|
|
The skip parser, or /skipper/, is run between the subparsers within the
|
|
parser passed to _p_. In this case, the skipper is run before the first
|
|
`double` is parsed, before any subsequent comma or `double` is parsed, and at
|
|
the end. So, the strings `"3.6,5.9"` and `" 3.6 , \t 5.9 "` are parsed the
|
|
same by this program.
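Here is a rough model of where the skipper runs, again in plain standard C++ rather than _Parser_ itself. `skip_space` and `parse_double_list_skipping` are hypothetical names, and `std::strtod` stands in for the `double` parser. The point is only to show the skip step happening before each element and once more at the end.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Hypothetical skipper: advances `pos` past any run of whitespace.
inline void skip_space(std::string const & input, std::size_t & pos)
{
    while (pos < input.size() &&
           std::isspace(static_cast<unsigned char>(input[pos])))
        ++pos;
}

// Model of parsing comma-separated doubles with a whitespace skipper.
inline std::optional<std::vector<double>>
parse_double_list_skipping(std::string const & input)
{
    std::size_t pos = 0;
    std::vector<double> result;
    for (;;) {
        skip_space(input, pos); // skipper runs before each double
        char const * first = input.c_str() + pos;
        char * last = nullptr;
        double const value = std::strtod(first, &last);
        if (last == first)
            return std::nullopt; // expected a double here
        result.push_back(value);
        pos += static_cast<std::size_t>(last - first);

        skip_space(input, pos); // ...before each comma, and at the end
        if (pos == input.size())
            return result;
        if (input[pos] != ',')
            return std::nullopt; // leftover input that is not a comma
        ++pos;
    }
}
```

The grammar itself did not change; only the skip calls were added around it. That mirrors the way the skipper is passed to the parse call instead of being written into the parser.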
|
|
|
|
Skipping is an important concept in _Parser_. You can skip anything, not just
|
|
ASCII whitespace; there are lots of other things you might want to skip. The
|
|
skipper you pass to _p_ can be an arbitrary parser. For example, if you write
|
|
a parser for a scripting language, you can write a skipper to skip whitespace,
|
|
inline comments, and end-of-line comments.
|
|
|
|
We'll be using skip parsers almost exclusively in the rest of the
|
|
documentation. The ability to ignore the parts of your input that you don't
|
|
care about is so convenient that parsing without skipping is a rarity in
|
|
practice.
|
|
|
|
[endsect]
|
|
|
|
[section Semantic Actions]
|
|
|
|
Like all parsing systems (lex & yacc, _Spirit_, etc.), _Parser_ has a
|
|
mechanism for associating semantic actions with different parts of the parse.
|
|
Here is nearly the same program as we saw in the previous example, except that
|
|
it is implemented in terms of a semantic action that appends each parsed
|
|
`double` to a result, instead of automatically building and returning the
|
|
result:
|
|
|
|
[semantic_action_example]
|
|
|
|
Run in a shell, it looks like this:
|
|
|
|
[pre
|
|
$ example/semantic_actions
|
|
Enter a list of doubles, separated by commas. 4,3
|
|
Got one!
|
|
Got one!
|
|
You entered:
|
|
4
|
|
3
|
|
]
|
|
|
|
In _Parser_, semantic actions are implemented in terms of invocable objects
|
|
that take a single parameter, a reference to a parse-context object. In the example we
|
|
used this lambda as our invocable:
|
|
|
|
[semantic_action_example_lambda]
|
|
|
|
We're both printing a message to `std::cout` and recording a parsed result in
|
|
the lambda. It could do both, either, or neither of these things if you like.
|
|
The way we get the parsed `double` in the lambda is by asking the parse
|
|
context for it. `_attr(ctx)` is how you ask the parse context for the
|
|
attribute produced by the parser to which the semantic action is attached.
|
|
There are lots of functions like `_attr()` that can be used to access the
|
|
state in the parse context. We'll cover more of them later on.
|
|
|
|
[endsect]
|
|
|
|
[section The Parse Context]
|
|
|
|
Now would be a good time to describe the parse context in some detail. Any
|
|
semantic action that you write will need to use the state in the parse
|
|
context, so you need to know what's available.
|
|
|
|
The parse context is a `hana::map` from tag types to elements. Elements are
|
|
added to or removed from it at different times during the parse. For
|
|
instance, when a parser with a semantic action succeeds, it adds the attribute
|
|
it produces to the parse context, then calls the invocable semantic action.
|
|
This is efficient to do, because the `hana::map` remains fairly small, usually
|
|
around ten elements, and each element is stored as a pointer. Copying the
|
|
entire map when mutating the context is therefore fast.
|
|
|
|
[note All these functions that take the parse context as their first parameter
|
|
will be found by Argument-Dependent Lookup. You will probably never need
|
|
to qualify them with `boost::parser::`.]
|
|
|
|
[heading Accessors for data that are always available]
|
|
|
|
_pass_ is a `bool` indicating the success or failure of the current parse.
|
|
This can be used to force the current parse to pass or fail:
|
|
|
|
    [](auto & ctx) {
        // If the attribute meets this predicate, fail the parse.
        if (some_condition(_attr(ctx)))
            _pass(ctx) = false;
    }
|
|
|
|
Note that for a semantic action to be executed, its associated parser must
|
|
already have succeeded. So unless you previously wrote `_pass(ctx) = false`
|
|
somewhere, `_pass(ctx) = true` does nothing; it's redundant.
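The interplay between a parser, its attribute, and the pass flag can be modeled in a few lines of plain standard C++. This is a deliberately tiny stand-in, not the real parse context: `toy_context` and `parse_double_with_action` are hypothetical, and `std::strtod` plays the role of the parser.

```cpp
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Hypothetical, heavily simplified stand-in for the parse context.
struct toy_context
{
    double attr = 0.0; // attribute produced by the parser
    bool pass = true;  // the action may clear this to fail the parse
};

// Parses a double from the start of `input`.  On success, runs `action`,
// which may veto the match by setting ctx.pass = false.
inline std::optional<double> parse_double_with_action(
    std::string const & input,
    std::function<void(toy_context &)> const & action)
{
    char const * first = input.c_str();
    char * last = nullptr;
    double const value = std::strtod(first, &last);
    if (last == first)
        return std::nullopt; // parser failed: the action is never run

    toy_context ctx;
    ctx.attr = value;
    action(ctx); // runs only after the parser has succeeded
    if (!ctx.pass)
        return std::nullopt; // the action forced a failure
    return ctx.attr;
}
```

Because `ctx.pass` starts out `true` whenever the action runs, setting it to `true` inside the action changes nothing. Only setting it to `false` has an effect.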
|
|
|
|
_begin_ and _end_ return the beginning and end of the range that you passed to
|
|
_p_, respectively. _where_ returns a _v_ indicating the bounds of the input
|
|
matched by the current parse. _where_ can be useful if you just want to parse
|
|
some text and return a result consisting of where certain elements are
|
|
located, without producing any other attributes.
|
|
|
|
_error_handler_ returns a reference to the error handler associated with the
|
|
parser passed to _p_. Any error handler must have the following member
|
|
functions:
|
|
|
|
[error_handler_api_1]
|
|
|
|
[error_handler_api_2]
|
|
|
|
If you call the second one, the one without the iterator parameter, it will
|
|
call the first with `_where(context).begin()` as the iterator parameter. The
|
|
one without the iterator is the one you will use most often. The one with the
|
|
explicit iterator parameter can be useful in situations where you have
|
|
messages that are related to each other, associated with multiple locations.
|
|
For instance, if you are parsing XML, you may want to report that a close-tag
|
|
does not match its associated open-tag by showing the line where the open-tag
|
|
was found. That may of course not be located anywhere near
|
|
`_where(ctx).begin()`. (A description of _globals_ is below.)
|
|
|
|
    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                open_tag_msg,
                ctx,
                _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _error_handler(ctx).diagnose(
                boost::parser::diagnostic_kind::error,
                close_tag_msg,
                ctx);

            // Explicitly fail the parse.  Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }
|
|
|
|
There are also some convenience functions that make the above code a little
|
|
less verbose (_report_error_ and _report_warning_):
|
|
|
|
    [](auto & ctx) {
        // Assume we have a std::vector of open tags, and another
        // std::vector of iterators to where the open tags were parsed, in our
        // globals.
        if (_attr(ctx) != _globals(ctx).open_tags.back()) {
            std::string open_tag_msg =
                "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:";
            _report_error(ctx, open_tag_msg, _globals(ctx).open_tag_positions.back());
            std::string close_tag_msg =
                "does not match close-tag \"" + _attr(ctx) + "\" here:";
            _report_error(ctx, close_tag_msg);

            // Explicitly fail the parse.  Diagnostics do not affect parse success.
            _pass(ctx) = false;
        }
    }
|
|
|
|
You should use these less verbose functions almost all the time. The only
|
|
time you would want to use _error_handler_ is when you are using a custom
|
|
error handler, and you want access to some part of its interface besides
|
|
`diagnose()`.
|
|
|
|
[heading Accessors for data that are only sometimes available]
|
|
|
|
_attr_ is the value of the current parser's attribute. It is available only
|
|
when the current parser's parse is successful. If the parser has no semantic
|
|
action, no attribute gets added to the parse context. It can be used to read
|
|
and write the current parser's attribute:
|
|
|
|
    [](auto & ctx) { _attr(ctx) = 3; }
|
|
|
|
If the current parser has no attribute, a _n_ is returned.
|
|
|
|
_val_ is the value of the attribute of the current rule being used to parse
|
|
(if any), and is available even before the rule's parse is successful. It can
|
|
be used to set the current rule's attribute, even from a parser that is a
|
|
subparser inside the rule. Let's say we're writing a parser with a semantic
|
|
action that is within a rule. If we want to set the current rule's value to
|
|
whatever this subparser parses, we would write this semantic action:
|
|
|
|
    [](auto & ctx) { _val(ctx) = _attr(ctx); }
|
|
|
|
If there is no current rule, or the current rule has no attribute, a _n_ is
|
|
returned.
|
|
|
|
_globals_ returns a reference to a user-supplied struct that contains whatever
|
|
data you want to use during the parse. We'll get into this more later, but
|
|
for now, here's how you might use it:
|
|
|
|
    [](auto & ctx) {
        // black_list is some set of proscribed values that are not allowed.
        if (_globals(ctx).black_list.contains(_attr(ctx)))
            _pass(ctx) = false;
    }
|
|
|
|
_locals_ returns a reference to one or more values that are local to the
|
|
current rule being parsed, if any. If there are two or more local values,
|
|
_locals_ returns a reference to a `hana::tuple`. Rules are something we
|
|
haven't gotten to yet, but here is how you use _locals_:
|
|
|
|
    [](auto & ctx) {
        auto & local = _locals(ctx);
        // Use local here.  If it is a hana::tuple, access its members like this:
        using namespace hana::literals;
        auto & first_element = local[0_c];
        auto & second_element = local[1_c];
    }
|
|
|
|
If there is no current rule, or the current rule has no locals, a _n_ is
|
|
returned.
|
|
|
|
_params_, like _locals_, applies to the current rule being used to parse, if
|
|
any. It also returns a reference to a single value, if the current rule has
|
|
only one parameter, or a `hana::tuple` to multiple values if the current rule
|
|
has multiple parameters.
|
|
|
|
If there is no current rule, or the current rule has no parameters, a _n_ is
|
|
returned.
|
|
|
|
[note _n_ is a type that is used as a return value in _Parser_ for parse
|
|
context accessors. _n_ is convertible to anything that has a default
|
|
constructor, convertible from anything, assignable from anything, and has
|
|
templated overloads for all the overloadable operators. The intention is that
|
|
a misuse of _val_, _globals_, etc. should compile, and produce an assertion at
|
|
runtime. Experience has shown that using a debugger for investigating the
|
|
stack that leads to your mistake is a far better user experience than sifting
|
|
through compiler diagnostics. See the rationale section for a more detailed
|
|
explanation.]
|
|
|
|
[heading TODO: extended example of deep template stack vs. debugger.]
|
|
|
|
[endsect]
|
|
|
|
[section Symbol Tables]
|
|
|
|
When writing a parser, it often comes up that there is a set of strings that,
|
|
when parsed, are associated with a set of values 1-to-1. It is tedious to
|
|
write parsers that recognize all the possible input strings when you have to
|
|
associate each one with an attribute via a semantic action. Instead, we can
|
|
use a symbol table.
|
|
|
|
Say we want to parse Roman numerals, one of the most common work-related
|
|
parsing problems. We want to recognize numbers that start with any number of
|
|
"M"s, representing thousands, followed by the hundreds, the tens, and the
|
|
ones. Any of these may be absent from the input, but not all. Here are three
|
|
_Parser_ symbol tables that we can use to recognize ones, tens, and hundreds
|
|
values, respectively:
|
|
|
|
[roman_numeral_symbol_tables]
|
|
|
|
A _symbols_ maps strings of `char` to their associated attributes. The type
|
|
of the attribute must be specified as a template parameter to _symbols_
|
|
_emdash_ `int` in this case.
|
|
|
|
Any "M"s we encounter should add 1000 to the result, and all other values come
|
|
from the symbol tables. Here are the semantic actions we'll need to do that:
|
|
|
|
[roman_numeral_actions]
|
|
|
|
`add_1000` just adds `1000` to `result`. `add` adds whatever attribute is
|
|
produced by its parser to `result`.
|
|
|
|
Now we just need to put the pieces together to make a parser:
|
|
|
|
[roman_numeral_parser]
|
|
|
|
We've got a few new bits in play here, so let's break it down. `'M'_l` is a
|
|
/literal parser/. That is, it is a parser that parses a literal `char`, code
|
|
point, or string. In this case, a `char` "M" is being parsed. The `_l` bit
|
|
at the end is a _udl_ suffix that you can put after any `char`, `char32_t`, or
|
|
`char const *` to form a literal parser. You can also make a literal parser
|
|
by writing _lit_ for some `x` of one of the previously mentioned types.
|
|
|
|
Why do we need any of this, considering that we just used a literal `','` in
|
|
our previous example? The reason is that `'M'` is not used in an expression
|
|
with another _Parser_ parser. It is used within `*'M'_l[add_1000]`. If we'd
|
|
written `*'M'[add_1000]`, clearly that would be ill-formed; `char` has no
|
|
`operator*()`, nor an `operator[]()`, associated with it.
|
|
|
|
[tip Any time you want to use a `char`, `char32_t`, or string literal in a
|
|
_Parser_ parser, write it as-is if it is combined with a preexisting _Parser_
|
|
subparser `p`, as in `'x' >> p`. Otherwise, you need to wrap it in a call to
|
|
_lit_, or use the `_l` _udl_ suffix.]
|
|
|
|
On to the next bit: `-hundreds[add]`. By now, the use of the index operator
|
|
should be pretty familiar; it associates the semantic action `add` with the
|
|
parser `hundreds`. The `operator-()` at the beginning is new. It means that
|
|
the parser it is applied to is optional. You can read it as "zero or one".
|
|
So, if `hundreds` is not successfully parsed after `*'M'_l[add_1000]`, nothing
|
|
happens, because `hundreds` is allowed to be missing _emdash_ it's optional.
|
|
If `hundreds` is parsed successfully, say by matching `"CC"`, the resulting
|
|
attribute, `200`, is added to `result` inside `add`.
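To see the whole shape at once, here is a hand-written model of the same grammar in plain standard C++. It is only a sketch under some assumptions: `symbol_table`, `match_symbol`, and `parse_roman` are hypothetical names, the table contents are the usual Roman numeral spellings, the lookup is assumed to prefer the longest matching symbol, and the empty string is rejected outright.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a symbol table: (symbol, value) pairs.
using symbol_table = std::vector<std::pair<std::string_view, int>>;

// Tries every symbol against the front of `input`, taking the longest
// match.  Removes the match from `input` and returns its value, or
// returns nullopt and leaves `input` untouched.
inline std::optional<int>
match_symbol(symbol_table const & table, std::string_view & input)
{
    std::optional<int> best;
    std::size_t best_size = 0;
    for (auto const & [symbol, value] : table) {
        if (symbol.size() > best_size &&
            input.substr(0, symbol.size()) == symbol) {
            best = value;
            best_size = symbol.size();
        }
    }
    input.remove_prefix(best_size);
    return best;
}

// Hand-written model of: *'M' >> -hundreds >> -tens >> -ones
inline std::optional<int> parse_roman(std::string_view input)
{
    static symbol_table const hundreds = {
        {"C", 100}, {"CC", 200}, {"CCC", 300}, {"CD", 400}, {"D", 500},
        {"DC", 600}, {"DCC", 700}, {"DCCC", 800}, {"CM", 900}};
    static symbol_table const tens = {
        {"X", 10}, {"XX", 20}, {"XXX", 30}, {"XL", 40}, {"L", 50},
        {"LX", 60}, {"LXX", 70}, {"LXXX", 80}, {"XC", 90}};
    static symbol_table const ones = {
        {"I", 1}, {"II", 2}, {"III", 3}, {"IV", 4}, {"V", 5},
        {"VI", 6}, {"VII", 7}, {"VIII", 8}, {"IX", 9}};

    if (input.empty())
        return std::nullopt; // assumption: at least one part is required

    int result = 0;
    while (!input.empty() && input.front() == 'M') { // *'M'
        result += 1000;
        input.remove_prefix(1);
    }
    for (symbol_table const * table : {&hundreds, &tens, &ones}) {
        // Optional: a missing group adds nothing and is not an error.
        if (auto value = match_symbol(*table, input))
            result += *value;
    }
    if (!input.empty())
        return std::nullopt; // leftover input: not a Roman numeral
    return result;
}
```

Each optional group either contributes its value or quietly contributes nothing, which is the "zero or one" behavior of `operator-()` described above.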
|
|
|
|
Here is the full listing of the program. Notice that it would have been
|
|
inappropriate to use a whitespace skipper here, since the entire parse is a
|
|
single number, so it was removed.
|
|
|
|
[roman_numeral_example]
|
|
|
|
[endsect]
|
|
|
|
[section Mutable Symbol Tables]
|
|
|
|
The previous example showed how to use a symbol table as a fixed lookup table.
|
|
What if we want to add things to the table during the parse? We can do that,
|
|
but we need to do so within a semantic action. First, here is our symbol
|
|
table, already with a single value in it:
|
|
|
|
[self_filling_symbol_table_table]
|
|
|
|
No surprise that it works to use the symbol table as a parser to parse the one
|
|
string in the symbol table. Now, here's our parser:
|
|
|
|
[self_filling_symbol_table_parser]
|
|
|
|
Here, we've attached the semantic action not to a simple parser like _d_, but
|
|
to the sequence parser `(bp::char_ >> bp::int_)`. This sequence parser
|
|
contains two parsers, each with its own attribute, so it produces two
|
|
attributes as a `hana::tuple`.
|
|
|
|
[self_filling_symbol_table_action]
|
|
|
|
Inside the semantic action, we can get the first element of the attribute
|
|
tuple using _udls_ provided by Boost.Hana, and `hana::tuple::operator[]()`.
|
|
The first attribute, from the _ch_, is `_attr(ctx)[0_c]`, and the second, from
|
|
the _i_, is `_attr(ctx)[1_c]`. To add the symbol to the symbol table, we call
|
|
`insert()`.
|
|
|
|
[self_filling_symbol_table_parser]
|
|
|
|
During the parse, `("X", 9)` is parsed and added to the symbol table. Then,
|
|
the second `'X'` is recognized by the symbol table parser. However:
|
|
|
|
[self_filling_symbol_table_after_parse]
|
|
|
|
If we parse again, we find that `"X"` did not stay in the symbol table. The
|
|
fact that `symbols` was declared const might have given you a hint that this
|
|
would happen. Also, notice that the call to `insert()` in the semantic action
|
|
uses the parse context; that's where all the symbol table changes are stored
|
|
during the parse.
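One way to picture this is as a per-parse overlay on top of a table that never changes. The sketch below is a plain standard C++ model of that idea, not the library's actual data structure; `toy_symbols` and `toy_parse_state` are hypothetical names.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical model of a symbol table's permanent contents.
struct toy_symbols
{
    std::map<std::string, int> permanent;
};

// Hypothetical model of the per-parse state: insertions made during the
// parse go into `overlay`, never into the table itself.
struct toy_parse_state
{
    toy_symbols const & table;          // not modified during the parse
    std::map<std::string, int> overlay; // insertions from semantic actions

    void insert(std::string const & symbol, int value)
    {
        overlay[symbol] = value;
    }

    // Looks in the overlay first, then in the permanent table.
    std::optional<int> find(std::string const & symbol) const
    {
        if (auto it = overlay.find(symbol); it != overlay.end())
            return it->second;
        if (auto it = table.permanent.find(symbol);
            it != table.permanent.end())
            return it->second;
        return std::nullopt;
    }
};
```

Each parse gets a fresh `toy_parse_state`, so whatever one parse inserted is gone by the time the next one starts. That matches the behavior seen when parsing a second time.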
|
|
|
|
The full program:
|
|
|
|
[self_filling_symbol_table_example]
|
|
|
|
[note It is possible to add symbols to a _symbols_ permanently. To do so, you
|
|
have to use a mutable _symbols_ object `s`, and add the symbols by calling
|
|
`s.add()`, instead of `s.insert()`.]
|
|
|
|
[endsect]
|
|
|
|
[section Alternative Parsers]
|
|
|
|
Frequently, you need to parse something that might have one of several forms.
|
|
`operator|()` is overloaded to form alternative parsers. For example:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_1 = bp::int_ | bp::eps;
|
|
|
|
`parser_1` matches an integer, or if that fails, it matches /epsilon/, the
|
|
empty string. This is equivalent to writing:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_2 = -bp::int_;
|
|
|
|
However, neither `parser_1` nor `parser_2` is equivalent to writing this:
|
|
|
|
    namespace bp = boost::parser;
    auto const parser_3 = bp::eps | bp::int_;
|
|
|
|
The reason is that alternative parsers try each of their subparsers, one at a
|
|
time, and stop on the first one that matches. /Epsilon/ matches anything,
|
|
since it is zero length and consumes no input. It even matches the end of
|
|
input. This means that `parser_3` is equivalent to _e_ by itself.
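The effect of ordering is easy to reproduce with a toy model of ordered choice in plain standard C++. Nothing here comes from _Parser_; `toy_parser`, `digits`, `epsilon`, and `alternative` are hypothetical names, and each toy parser just reports how many characters it would consume.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>
#include <vector>

// A toy parser: returns how many characters it consumed from the front
// of the input, or nullopt if it does not match.
using toy_parser =
    std::function<std::optional<std::size_t>(std::string_view)>;

// Matches one or more decimal digits.
inline std::optional<std::size_t> digits(std::string_view input)
{
    std::size_t n = 0;
    while (n < input.size() &&
           std::isdigit(static_cast<unsigned char>(input[n])))
        ++n;
    if (n == 0)
        return std::nullopt;
    return n;
}

// Epsilon: always matches, and consumes nothing.
inline std::optional<std::size_t> epsilon(std::string_view)
{
    return std::size_t{0};
}

// Ordered choice: try each alternative in turn; the first match wins.
inline std::optional<std::size_t>
alternative(std::vector<toy_parser> const & parsers, std::string_view input)
{
    for (auto const & p : parsers) {
        if (auto consumed = p(input))
            return consumed;
    }
    return std::nullopt;
}
```

Given the input `"42"`, the list `{digits, epsilon}` consumes two characters, while `{epsilon, digits}` consumes none, because `epsilon` matches first and `digits` is never tried.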
|
|
|
|
[endsect]
|
|
|
|
[section The Parsers And Their Uses]
|
|
|
|
TODO
|
|
|
|
TODO: Cover the various collapsing rules with op*, op+, etc.
|
|
|
|
[endsect]
|
|
|
|
[section Combining Operations]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Directives]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Attribute Generation]
|
|
|
|
So far, we've seen several different types of attributes that come from
|
|
different parsers, `int` for _i_, `hana::tuple<char, int>` for
|
|
`boost::parser::char_ >> boost::parser::int_`, etc. Let's get into how this
|
|
works with a bit more rigor.
|
|
|
|
[note Some parsers have no attribute at all. In the tables below, the type of
|
|
the attribute is listed as "None." There is a non-`void` type that is
|
|
returned from each parser that lacks an attribute. This keeps the logic
|
|
simple; having to handle the two cases _emdash_ `void` or non-`void` _emdash_
|
|
would make the library significantly more complicated. The type of this
|
|
non-`void` attribute associated with these parsers is an implementation
|
|
detail. The type comes from the `boost::parser::detail` namespace and is
|
|
pretty useless. You should never see this type in practice. Within semantic
|
|
actions, asking for the attribute of a non-attribute-producing parser (using
|
|
`_attr(ctx)`) will yield a value of the special type `boost::parser::none`.
|
|
When you call _p_ in a form that returns the parsed attribute, and the parser
has no attribute, _p_ simply returns `bool`; this indicates the success or
failure of the parse.]
|
|
|
|
[heading Parser attributes]
|
|
|
|
This table summarizes the attributes generated for all _Parser_ parsers. In
|
|
the table `x` and `y` represent arbitrary objects of any type.
|
|
|
|
[table Parsers and Their Attributes
|
|
[[Parser] [Attribute] [Notes]]
|
|
|
|
[[ _e_ ] [ None. ] []]
|
|
[[ _eol_ ] [ None. ] []]
|
|
[[ _eoi_ ] [ None. ] []]
|
|
[[ _attr_np_`(x)` ] [ `decltype(x)` ] []]
|
|
[[ _ch_ ] [ /See below./ ]
|
|
[Includes all the `_p` _udls_ that take a single character, and all parsers in the `boost::parser::ascii` namespace.]]
|
|
[[ _cp_ ] [ `uint32_t` ] []]
|
|
[[ _cu_ ] [ `char` ] []]
|
|
[[ _lit_np_`(x)`] [ None. ]
|
|
[Includes all the `_l` _udls_.]]
|
|
[[ _str_np_`(x)`] [ `std::string` ]
|
|
[Includes all the `_p` _udls_ that take a string.]]
|
|
[[ _b_ ] [ `bool` ] []]
|
|
|
|
[[ _bin_ ] [ `unsigned int` ] []]
|
|
[[ _oct_ ] [ `unsigned int` ] []]
|
|
[[ _hex_ ] [ `unsigned int` ] []]
|
|
[[ _us_ ] [ `unsigned short` ] []]
|
|
[[ _ui_ ] [ `unsigned int` ] []]
|
|
[[ _ul_ ] [ `unsigned long` ] []]
|
|
[[ _ull_ ] [ `unsigned long long` ] []]
|
|
|
|
[[ _s_ ] [ `short` ] []]
|
|
[[ _i_ ] [ `int` ] []]
|
|
[[ _l_ ] [ `long` ] []]
|
|
[[ _ll_ ] [ `long long` ] []]
|
|
[[ _f_ ] [ `float` ] []]
|
|
[[ _d_ ] [ `double` ] []]
|
|
|
|
[[ _symbols_t_ ]        [ `T` ]   []]
|
|
]
|
|
|
|
_ch_ is a bit odd, since its attribute type is polymorphic. When you use _ch_
|
|
to parse text in the non-Unicode code path (i.e. a string of `char`), the
|
|
attribute is `char`. When you use the exact same _ch_ to parse in the
|
|
Unicode-aware code path, all matching is code point based, and so the
|
|
attribute type is the type used to represent code points. For typical uses,
|
|
that type is `uint32_t`. All parsing of UTF-8 falls under this typical case.
|
|
The only time the code point type will be something different is if you call
|
|
_p_ with a code point sequence whose element type is something besides
|
|
`uint32_t`. For example, when you parse plain `char`s, meaning that the
|
|
parsing is in the non-Unicode code path, the attribute of _ch_ is `char`:
|
|
|
|
    auto result = parse("some text", boost::parser::char_);
    static_assert(std::is_same_v<decltype(result), std::optional<char>>);
|
|
|
|
When you parse UTF-8, the matching is done on a code point basis, and the code
|
|
point type is `uint32_t`, so the attribute type is `uint32_t`:
|
|
|
|
    auto result = parse(boost::text::as_utf8("some text"), boost::parser::char_);
    static_assert(std::is_same_v<decltype(result), std::optional<uint32_t>>);
|
|
|
|
When you parse code points by explicitly giving a code point range to _p_, the
|
|
attribute type is whatever the input range's element type is:
|
|
|
|
    auto result = parse(U"some text", boost::parser::char_);
    static_assert(std::is_same_v<decltype(result), std::optional<char32_t>>);
|
|
|
|
|
|
[heading Combining operation attributes]
|
|
|
|
Combining operations of course affect the generation of attributes. In the
|
|
tables below: `ATTR()` is a notional macro that expands to the attribute type
|
|
of the parser passed to it; `m` and `n` are integral values; `c` is a
|
|
condition/predicate; `x`, `y1`, `y2`, ... are arbitrary objects; `a` is a
|
|
semantic action; and `p`, `p1`, `p2`, ... are parsers that generate
|
|
attributes.
|
|
|
|
[table Combining Operations and Their Attributes
|
|
[[Parser] [Attribute]]
|
|
|
|
[[`!p`] [None.]]
|
|
[[`&p`] [None.]]
|
|
|
|
[[`*p`] [`std::vector<ATTR(p)>`]]
|
|
[[`+p`] [`std::vector<ATTR(p)>`]]
|
|
[[`+*p`] [`std::vector<ATTR(p)>`]]
|
|
[[`*+p`] [`std::vector<ATTR(p)>`]]
|
|
[[`-p`] [`std::optional<ATTR(p)>`]]
|
|
|
|
[[`p1 >> p2`] [`hana::tuple<ATTR(p1), ATTR(p2)>`]]
|
|
[[`p1 > p2`] [`hana::tuple<ATTR(p1), ATTR(p2)>`]]
|
|
[[`p1 >> p2 >> p3`] [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
|
|
[[`p1 > p2 >> p3`] [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
|
|
[[`p1 >> p2 > p3`] [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
|
|
[[`p1 > p2 > p3`] [`hana::tuple<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
|
|
|
|
[[`p1 | p2`] [`std::variant<ATTR(p1), ATTR(p2)>`]]
|
|
[[`p1 | p2 | p3`] [`std::variant<ATTR(p1), ATTR(p2), ATTR(p3)>`]]
|
|
|
|
[[`p1 % p2`] [`std::vector<ATTR(p1)>`]]
|
|
|
|
[[`p[a]`] [None.]]
|
|
|
|
[[_rpt_np_`(n)[p]`] [`ATTR(p)`]]
|
|
[[_rpt_np_`(m, n)[p]`] [`ATTR(p)`]]
|
|
[[_if_np_`(c)[p]`] [`std::optional<ATTR(p)>`]]
|
|
[[_sw_np_`(x)(y1, p1)(y2, p2)...`] [`std::variant<ATTR(p1), ATTR(p2), ...>`]]
|
|
]
|
|
|
|
There are a relatively small number of rules that define how sequence parsers
|
|
and alternative parsers' attributes are generated. (Don't worry, there are
|
|
examples below.)
|
|
|
|
[heading Sequence parser attribute rules]
|
|
|
|
The attribute generation behavior of sequence parsers is conceptually pretty
|
|
simple:
|
|
|
|
* the attributes of subparsers form a tuple of values;
|
|
|
|
* subparsers that do not generate attributes do not contribute to the
|
|
sequence's attribute;
|
|
|
|
* subparsers that do generate attributes usually contribute an individual
|
|
element to the tuple result; except
|
|
|
|
* when containers of the same element type are next to each other, or
|
|
individual elements are next to containers of their type, the two adjacent
|
|
attributes collapse into one attribute; and
|
|
|
|
* if the result of all that is a degenerate tuple `hana::tuple<T>` (even if
|
|
`T` is a type that means "no attribute"), the attribute becomes `T`.
|
|
|
|
More formally, the attribute generation algorithm works like this. For a
|
|
sequence parser `p`, let the list of attribute types for the subparsers of `p`
|
|
be `{a0, a1, a2, ..., an}`.
|
|
|
|
We get the attribute of `p` by evaluating a compile-time left fold operation,
|
|
`left-fold({a1, a2, ..., an}, a0, OP)`. `OP` is the combining operation that
|
|
takes the current attribute type (initially `a0`) and the next attribute type,
|
|
and returns the new current attribute type. The current attribute type at the
|
|
end is the attribute type for `p`.
|
|
|
|
`OP` attempts to apply a series of rules, one at a time. The rules are noted
|
|
as `A >> B -> C`, where `A` is the current attribute type, `B` is
|
|
the next attribute type, and `C` is the new current attribute
|
|
type. In these rules, `C<T>` is a container of `T`; `none` is a special type
|
|
that indicates that there is no attribute; `T` is a type; and `Ts...` is a
|
|
parameter pack of one or more types. Note that `T` may be the special type
|
|
`none`.
|
|
|
|
* `T >> none -> T`
|
|
* `C<T> >> C<T> -> C<T>`
|
|
* `T >> T -> vector<T>`
|
|
* `C<T> >> T -> C<T>`
|
|
* `C<T> >> optional<T> -> C<T>`
|
|
* `T >> C<T> -> C<T>`
|
|
* `optional<T> >> C<T> -> C<T>`
|
|
* `hana::tuple<none> >> T -> hana::tuple<T>`
|
|
* `hana::tuple<Ts...> >> T -> hana::tuple<Ts..., T>`
|
|
|
|
Again, if the result is that the attribute is `hana::tuple<T>`, the attribute
|
|
becomes `T`.
|
|
|
|
[note What constitutes a container in the rules above is determined by the
|
|
`container` concept:
|
|
[container_concept]
|
|
]
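
The fold described above can be modeled in a few lines of standard C++. What follows is only a sketch under stated assumptions, not _Parser_'s implementation: `std::tuple` stands in for `hana::tuple`, `none`, `op_t`, `fold`, and `sequence_attr_t` are hypothetical names, `is_container` is a simplified stand-in for the `container` concept, and the rules are read as applying to the last element of the running tuple.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

struct none {};        // stand-in for "no attribute" (hypothetical name)
struct no_element {};  // sentinel meaning "this type is not a container"

// Simplified stand-in for the `container` concept: value_type plus begin()/end().
template<typename T, typename = void>
struct is_container : std::false_type {};
template<typename T>
struct is_container<T, std::void_t<typename T::value_type,
                                   decltype(std::declval<T &>().begin()),
                                   decltype(std::declval<T &>().end())>>
    : std::true_type {};

template<typename T, bool = is_container<T>::value>
struct element { using type = no_element; };
template<typename T>
struct element<T, true> { using type = typename T::value_type; };

// True when X is exactly C's element type, or an optional of it.
template<typename C, typename X>
constexpr bool absorbs =
    std::is_same_v<X, typename element<C>::type> ||
    std::is_same_v<X, std::optional<typename element<C>::type>>;

// OP applied to the last element A of the running tuple and the next type B.
// Yields std::tuple<X> (A is replaced by X) or std::tuple<A, B> (B is appended).
template<typename A, typename B>
using op_t =
    std::conditional_t<std::is_same_v<B, none>, std::tuple<A>,  // T >> none -> T
    std::conditional_t<is_container<A>::value && std::is_same_v<A, B>,
                       std::tuple<A>,                            // C<T> >> C<T> -> C<T>
    std::conditional_t<std::is_same_v<A, B>,
                       std::tuple<std::vector<A>>,               // T >> T -> vector<T>
    std::conditional_t<absorbs<A, B>, std::tuple<A>,             // C<T> >> T, C<T> >> optional<T>
    std::conditional_t<absorbs<B, A>, std::tuple<B>,             // T >> C<T>, optional<T> >> C<T>
    std::conditional_t<std::is_same_v<A, none>, std::tuple<B>,   // tuple<none> >> T -> tuple<T>
                       std::tuple<A, B>>>>>>>;                   // tuple<Ts...> >> T -> tuple<Ts..., T>

// Left fold: Prefix holds settled elements; Last is the element OP may still change.
template<typename Prefix, typename Last, typename... Rest>
struct fold;
template<typename Prefix, typename Merged, typename... Rest>
struct step;

template<typename... Ps, typename Last>
struct fold<std::tuple<Ps...>, Last> { using type = std::tuple<Ps..., Last>; };
template<typename... Ps, typename Last, typename Next, typename... Rest>
struct fold<std::tuple<Ps...>, Last, Next, Rest...>
    : step<std::tuple<Ps...>, op_t<Last, Next>, Rest...> {};

template<typename... Ps, typename X, typename... Rest>
struct step<std::tuple<Ps...>, std::tuple<X>, Rest...>
    : fold<std::tuple<Ps...>, X, Rest...> {};
template<typename... Ps, typename A, typename B, typename... Rest>
struct step<std::tuple<Ps...>, std::tuple<A, B>, Rest...>
    : fold<std::tuple<Ps..., A>, B, Rest...> {};

// Final step: a one-element tuple collapses to its element.
template<typename T> struct collapse { using type = T; };
template<typename T> struct collapse<std::tuple<T>> { using type = T; };

template<typename A0, typename... As>
using sequence_attr_t =
    typename collapse<typename fold<std::tuple<>, A0, As...>::type>::type;

static_assert(std::is_same_v<sequence_attr_t<int, double>, std::tuple<int, double>>);
static_assert(std::is_same_v<sequence_attr_t<none, int, none>, int>);
static_assert(std::is_same_v<sequence_attr_t<char, std::string>, std::string>);
static_assert(std::is_same_v<sequence_attr_t<int, int, std::optional<int>>, std::vector<int>>);
```

The block has no `main()`; it compiles as a translation unit with a C++17 compiler, and the `static_assert`s are the checks. Each template argument of `sequence_attr_t` plays the role of one subparser's attribute type, in sequence order.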
|
|
|
|
[heading Alternative parser attribute rules]
|
|
|
|
The rules for alternative parsers are much simpler. For an alternative parser
|
|
`p`, let the list of attribute types for the subparsers of `p` be `{a0, a1,
|
|
a2, ..., an}`. The attribute of `p` is `std::variant<a0, a1, a2, ..., an>`,
|
|
with these exceptions:
|
|
|
|
* all the `none` attributes are left out, but if any were taken out, the
|
|
resulting attribute is wrapped in a `std::optional`;
|
|
|
|
* if the result is `std::variant<T>`, the result becomes `T` instead; and
|
|
|
|
* if the result is `std::variant<>`, the result becomes `none` instead.
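
These exceptions can be modeled at the type level in standard C++. This is a sketch under assumptions, not the library's implementation: `none`, `list`, `filter`, and `alternative_attr_t` are hypothetical names. It also removes duplicate alternatives, which the bullets above do not state; that step is an assumption based on the `p | p` row of the examples table later in this section, whose attribute is `ATTR(p)`.

```cpp
#include <optional>
#include <type_traits>
#include <variant>

struct none {};                           // stand-in for "no attribute" (hypothetical name)
template<typename... Ts> struct list {};  // plain type list used while filtering

// True when T already appears in Ts... (false for an empty pack).
template<typename T, typename... Ts>
constexpr bool contains = (std::is_same_v<T, Ts> || ...);

// Walk the alternatives: drop `none` (remembering that we did) and skip duplicates.
template<typename Kept, bool Dropped, typename... Rest>
struct filter;

template<typename... Ks, bool Dropped>
struct filter<list<Ks...>, Dropped> {
    using kept = list<Ks...>;
    static constexpr bool dropped = Dropped;
};

template<typename... Ks, bool Dropped, typename T, typename... Rest>
struct filter<list<Ks...>, Dropped, T, Rest...>
    : std::conditional_t<
          std::is_same_v<T, none>,
          filter<list<Ks...>, true, Rest...>,
          std::conditional_t<contains<T, Ks...>,
                             filter<list<Ks...>, Dropped, Rest...>,
                             filter<list<Ks..., T>, Dropped, Rest...>>> {};

// variant<> becomes none, variant<T> becomes T, anything longer stays a variant.
template<typename Kept> struct unwrap;
template<> struct unwrap<list<>> { using type = none; };
template<typename T> struct unwrap<list<T>> { using type = T; };
template<typename T, typename U, typename... Ts>
struct unwrap<list<T, U, Ts...>> { using type = std::variant<T, U, Ts...>; };

template<typename... As>
struct alternative_attr {
    using f = filter<list<>, false, As...>;
    using core = typename unwrap<typename f::kept>::type;
    // Wrap in optional only if a `none` was dropped and something is left.
    using type = std::conditional_t<f::dropped && !std::is_same_v<core, none>,
                                    std::optional<core>, core>;
};
template<typename... As>
using alternative_attr_t = typename alternative_attr<As...>::type;

static_assert(std::is_same_v<alternative_attr_t<int, double>, std::variant<int, double>>);
static_assert(std::is_same_v<alternative_attr_t<int, none>, std::optional<int>>);
static_assert(std::is_same_v<alternative_attr_t<none, none>, none>);
```

As with any type-level sketch, the `static_assert`s are the checks; the block has no `main()` and compiles as a translation unit with a C++17 compiler.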
|
|
|
|
[heading Formation of containers in attributes]
|
|
|
|
There are no special rules for forming containers from non-containers. For
|
|
instance, one of the rules above for sequence parsers is `T >> T ->
|
|
vector<T>`. So, you get a vector if you have multiple values in sequence.
|
|
Another rule is that the attribute of `*p` is `std::vector<ATTR(p)>`. The
|
|
point is, _Parser_ will generate your favorite container out of sequences and
|
|
repetitions, as long as your favorite container is `std::vector`.
|
|
|
|
Another rule for sequence parsers is that a value `x` and a container `c`
|
|
containing elements of `x`'s type will form a single container. However,
|
|
`x`'s type must be exactly the same as the elements in `c`. So, the attribute
|
|
of `char_ >> string("str")` depends on the code path. In the non-Unicode code path, `char_`'s
|
|
attribute type is guaranteed to be `char`, so `ATTR(char_ >> string("str"))`
|
|
is `std::string`. If you are parsing UTF-8 in the Unicode code path,
|
|
`char_`'s attribute type is `uint32_t`, and `ATTR(char_ >> string("str"))` is
|
|
therefore `hana::tuple<uint32_t, std::string>`.
|
|
|
|
Again, there are no special rules here.
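
The exact-match requirement can be checked with nothing but the standard library. In this minimal sketch, `joins_container` is a hypothetical helper, not part of _Parser_; it only tests whether a value type is identical to a container's `value_type`.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper: true when X is exactly the element type of container C.
template<typename X, typename C>
constexpr bool joins_container = std::is_same_v<X, typename C::value_type>;

// Non-Unicode path: char matches std::string's element type, so the two collapse.
static_assert(joins_container<char, std::string>);

// Unicode path: uint32_t is not std::string's element type, so they stay separate.
static_assert(!joins_container<std::uint32_t, std::string>);

// The same test applies to any value/container pair.
static_assert(joins_container<int, std::vector<int>>);
```

Convertibility plays no part in this check: `std::uint32_t` and `char` convert to each other, but the test is on type identity only.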
|
|
|
|
[heading Examples of attributes generated by sequence and alternative parsers]
|
|
|
|
In the table below, `p`, `p1`, `p2`, ... are parsers that generate attributes, and
|
|
`a` is a semantic action. Note that only `>>` is used here. `>` has the
|
|
exact same attribute generation rules.
|
|
|
|
[table (Pseudocode) Combining Operations and Their Attributes
|
|
[[Pseudocode] [Attribute]]
|
|
|
|
[[_e_` >> `_e_] [None.]]
|
|
[[`p >> `_e_] [`ATTR(p)`]]
|
|
[[_e_` >> p`] [`ATTR(p)`]]
|
|
|
|
[[_cu_` >> `_str_np_`("str")`] [`std::string`]]
|
|
[[_str_np_`("str") >> `_cu_] [`std::string`]]
|
|
[[`*`_cu_` >> `_str_np_`("str")`][`hana::tuple<std::vector<char>, std::string>`]]
|
|
[[_str_np_`("str") >> *`_cu_] [`hana::tuple<std::string, std::vector<char>>`]]
|
|
|
|
[[`p >> p`] [`std::vector<ATTR(p)>`]]
|
|
[[`*p >> p`] [`std::vector<ATTR(p)>`]]
|
|
[[`p >> *p`] [`std::vector<ATTR(p)>`]]
|
|
[[`*p >> -p`] [`std::vector<ATTR(p)>`]]
|
|
[[`-p >> *p`] [`std::vector<ATTR(p)>`]]
|
|
|
|
[[_str_np_`("str") >> -`_cu_] [`std::string`]]
|
|
[[`-`_cu_` >> `_str_np_`("str")`][`std::string`]]
|
|
|
|
[[`!p1 | p2[a]`] [None.]]
|
|
[[`p | p`] [`ATTR(p)`]]
|
|
[[`p1 | p2`] [`std::variant<ATTR(p1), ATTR(p2)>`]]
|
|
[[`p | `_e_] [`std::optional<ATTR(p)>`]]
|
|
[[`p1 | p2 | `_e_] [`std::optional<std::variant<ATTR(p1), ATTR(p2)>>`]]
|
|
[[`p1 | p2[a] | p3`] [`std::optional<std::variant<ATTR(p1), ATTR(p3)>>`]]
|
|
]
|
|
|
|
|
|
[heading Directives that affect attribute generation]
|
|
|
|
_omit_np_`[p]` disables attribute generation for the parser `p`. Not only
|
|
does _omit_np_`[p]` have no attribute, but any attribute generation work that
|
|
normally happens within `p` is skipped.
|
|
|
|
_raw_np_`[p]` changes the attribute from `ATTR(p)` to _v_`<I>`, where `I` is
|
|
the type of the iterator used within the parse. Note that this may not be the
|
|
same as the iterator type passed to _p_. For instance, when parsing UTF-8,
|
|
the iterator passed to _p_ may be `char8_t const *`, but within the parse it
|
|
will be a UTF-8 to UTF-32 transcoding (converting) iterator. Just like
|
|
_omit_, _raw_ causes all attribute-generation work within `p` to be skipped.
|
|
|
|
[endsect]
|
|
|
|
[section The `parse()` API]
|
|
|
|
TODO
|
|
|
|
TODO: This is where attribute compatibility is covered.
|
|
|
|
TODO: Be sure to note that parse(as_utf8(...), p) is in the Unicode path and:
|
|
auto r = as_utf8(...)
|
|
parse(r.begin(), r.end(), p) is not.
|
|
|
|
[endsect]
|
|
|
|
[section Rules]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[section Unicode Support]
|
|
|
|
TODO
|
|
|
|
TODO: Unicode in symbol tables
|
|
|
|
[endsect]
|
|
|
|
[section Callback Parsing]
|
|
|
|
TODO
|
|
|
|
[endsect]
|
|
|
|
[endsect]
|