[/ / Distributed under the Boost Software License, Version 1.0. (See accompanying / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) /] [section Tutorial] [section Terminology] First, let's cover some terminology that we'll be using throughout the docs: A /semantic action/ is an arbitrary bit of logic associated with a parser that is executed only when the parser matches. Simpler parsers can be combined to form more complex parsers. Given some combining operation `C`, and parsers `P0`, `P1`, ... `PN`, `C(P0, P1, ... PN)` creates a new parser `Q`. This creates a /parse tree/. `Q` is the parent of `P1`, `P2` is a child of `Q`, etc. The parsers are applied in the top-down fashion implied by this topology. When you use `Q` to parse a string, it will use `P0`, `P1`, etc. to do the actual work. If `P3` is being used to parse the input, that means that `Q` is as well, since the way `Q` parses is by dispatching to its children to do some or all of the work. At any point in the parse, there will be exactly one parser without children that is being used to parse the input; all other parsers being used are its ancestors in the parse tree. A /subparser/ is a parser that is the child of another parser. The /top-level parser/ is the root of the tree of parsers. The /current parser/ or /bottommost parser/ is the parser with no children that is currently being used to parse the input. A /rule/ is a kind of parser that makes building large, complex parsers easier. A /subrule/ is a rule that is the child of some other rule. The /current rule/ or /bottommost rule/ is the one rule currently being used to parse the input that has no subrules. Note that while there is always exactly one current parser, there may or may not be a current rule _emdash_ rules are one kind of parser, and you may or may not be using one at a given point in the parse. The /top-level parse/ is the parse operation being performed by the top-level parser. 
This term is necessary because, though most parse failures are local to a particular parser, some parse failures cause the call to _p_ to indicate failure of the entire parse. For these cases, we say that such a local failure "causes the top-level parse to fail". Throughout the _Parser_ documentation, I will refer to "the call to _p_". Read this as "the call to any one of the functions described in _p_api_". That includes _pp_, _cbp_, and _cbpp_. There are a couple of special kinds of parsers that come up often in this documentation. One is a /sequence parser/; you will see it created using `operator>>()`, as in `p1 >> p2 >> p3`. A sequence parser tries to match all of its subparsers to the input, one at a time, in order. It matches the input iff all its subparsers do. The other is an /alternative parser/; you will see it created using `operator|()`, as in `p1 | p2 | p3`. An alternative parser tries to match all of its subparsers to the input, one at a time, in order; it stops after matching at most one subparser. It matches the input iff one of its subparsers does. _Parser_ parsers each have an /attribute/ associated with them, or explicitly have no attribute. An attribute is a value that the parser generates when it matches the input. For instance, the parser _d_ generates a `double` when it matches the input. Since it is not possible to write a type trait that returns the attribute type of a parser, we need notation for concisely communicating that relationship. _ATTR_ is a notional macro that expands to the attribute type of the parser passed to it; `_ATTR_np_(_d_)` is `double`. Next, we'll look at some simple programs that parse using _Parser_. We'll start small and build up from there. [endsect] [section Hello, Whomever] This is just about the most minimal example of using _Parser_ that one could write. 
We take a string from the command line, or `"World"` if none is given, and then we parse it: [hello_example] The expression `*bp::char_` is a parser-expression. It uses one of the many parsers that _Parser_ provides: _ch_. Like all _Parser_ parsers, it has certain operations defined on it. In this case, `*bp::char_` is using an overloaded `operator*()` as the C++ version of a _kl_ operator. Since C++ has no postfix unary `*` operator, we have to use the one we have, so it is used as a prefix. So, `*bp::char_` means "any number of characters". In other words, it really cannot fail. Even an empty string will match it. The parse operation is performed by calling the _p_ function, passing the parser as one of the arguments: bp::parse(input, *bp::char_, result); The arguments here are: `input`, the range to parse; `*bp::char_`, the parser used to do the parse; and `result`, an out-parameter into which to put the result of the parse. Don't get too caught up on this method of getting the parse result out of _p_; there are multiple ways of doing so, and we'll cover all of them in subsequent sections. Also, just ignore for now the fact that _Parser_ somehow figured out that the result type of the `*bp::char_` parser is a _std_str_. There are clear rules for this that we'll cover later. The effect of this call to _p_ is not very interesting _emdash_ since the parser we gave it cannot ever fail, and because we're placing the output in the same type as the input, it just copies the contents of `input` to `result`. [endsect] [section A Trivial Example] Let's look at a slightly more complicated example, even if it is still trivial. Instead of taking any old `char`s we're given, let's require some structure. Let's parse one or more `double`s, separated by commas. The _Parser_ parser for `double` is _d_. So, to parse a single `double`, we'd just use that. 
If we wanted to parse two `double`s in a row, we'd use: boost::parser::double_ >> boost::parser::double_ `operator>>()` in this expression is the sequence-operator; read it as "followed by". If we combine the sequence-operator with _kl_, we can get the parser we want by writing: boost::parser::double_ >> *(',' >> boost::parser::double_) This is a parser that matches at least one `double` _emdash_ because of the first _d_ in the expression above _emdash_ followed by zero or more instances of a-comma-followed-by-a-`double`. Notice that we can use `','` directly. Though it is not a parser, `operator>>()` and the other operators defined on _Parser_ parsers have overloads that accept character/parser pairs of arguments; these operator overloads will create the right parser to recognize `','`. [trivial_example] The first example filled in an out-parameter to deliver the result of the parse. This call to _p_ returns a result instead. As you can see, the result is contextually convertible to `bool`, and `*result` is some sort of range. In fact, the return type of this call to _p_ is `std::optional<std::vector<double>>`. Naturally, if the parse fails, `std::nullopt` is returned. We'll look at how _Parser_ maps the type of the parser to the return type, or the filled in out-parameter's type, a bit later. If I run it in a shell, this is the result: [pre $ example/trivial Enter a list of doubles, separated by commas. No pressure. 5.6,8.9 Great! It looks like you entered: 5.6 8.9 $ example/trivial Enter a list of doubles, separated by commas. No pressure. 5.6, 8.9 Good job! Please proceed to the recovery annex for cake. ] It does not recognize `"5.6, 8.9"`. This is because it expects a comma followed /immediately/ by a `double`, but I inserted a space after the comma. The same failure to parse would occur if I put a space before the comma, or before or after the list of `double`s. 
[endsect] [section A Trivial Example That Gracefully Handles Whitespace] Let's modify the trivial parser we just saw to ignore any spaces it might find among the `double`s and commas. To skip whitespace wherever we find it, we can pass a /skip parser/ to our call to _p_ (we don't need to touch the parser passed to _p_). Here, we use `ws`, which matches any Unicode whitespace character. [trivial_skipper_example] [tip Even though there is another parser `ascii::space` that we could have used here, and even though we know we're in a non-Unicode parsing context here from the `std::string` input, I did not use `ascii::space`. The `ascii::` namespace parsers exist mostly to aid porting from Boost.Spirit. They are inherently dangerous in a Unicode environment, since they use the `std::is*()` functions (like `std::isspace()`). `ws` is correct, plus it is written in such a way that it matches the ASCII subset of values first, meaning it is no less efficient to use for parsing ASCII than `ascii::space`.] The skip parser, or /skipper/, is run between the subparsers within the parser passed to _p_. In this case, the skipper is run before the first `double` is parsed, before any subsequent comma or `double` is parsed, and at the end. So, the strings `"3.6,5.9"` and `" 3.6 , \t 5.9 "` are parsed the same by this program. Skipping is an important concept in _Parser_. You can skip anything, not just whitespace; there are lots of other things you might want to skip. The skipper you pass to _p_ can be an arbitrary parser. For example, if you write a parser for a scripting language, you can write a skipper to skip whitespace, inline comments, and end-of-line comments. We'll be using skip parsers almost exclusively in the rest of the documentation. The ability to ignore the parts of your input that you don't care about is so convenient that parsing without skipping is a rarity in practice. 
[endsect] [section Semantic Actions] Like all parsing systems (lex & yacc, _Spirit_, etc.), _Parser_ has a mechanism for associating semantic actions with different parts of the parse. Here is nearly the same program as we saw in the previous example, except that it is implemented in terms of a semantic action that appends each parsed `double` to a result, instead of automatically building and returning the result. To do this, we replace the _d_ from the previous example with `_d_[action]`; `action` is our semantic action: [semantic_action_example] Run in a shell, it looks like this: [pre $ example/semantic_actions Enter a list of doubles, separated by commas. 4,3 Got one! Got one! You entered: 4 3 ] In _Parser_, semantic actions are implemented in terms of invocable objects that take a single parameter, a reference to a parse-context object. The parse-context object represents the current state of the parse. In the example we used this lambda as our invocable: [semantic_action_example_lambda] We're both printing a message to `std::cout` and recording a parsed result in the lambda. A semantic action could do both, either, or neither of these things if you like. The way we get the parsed `double` in the lambda is by asking the parse context for it. `_attr(ctx)` is how you ask the parse context for the attribute produced by the parser to which the semantic action is attached. There are lots of functions like `_attr()` that can be used to access the state in the parse context. We'll cover more of them later on. The next section defines what exactly the parse context is and how it works. [endsect] [section The Parse Context] Now would be a good time to describe the parse context in some detail. Any semantic action that you write will need to use state in the parse context, so you need to know what's available. The parse context is an object that stores the current state of the parse _emdash_ the current- and end-iterators, the error handler, etc. 
Data may seem to be "added" to or "removed" from it at different times during the parse. For instance, when a parser `p` with a semantic action `a` succeeds, the context adds the attribute that `p` produces to the parse context, then calls `a`, passing it the context. Though the context object appears to have things added to or removed from it, it does not. In reality, there is no one context object. Contexts are formed at various times during the parse, usually when starting a subparser. Each context is formed by taking the previous context and adding or changing members as needed to form a new context object. When the function containing the new context object returns, its context object (if any) is destructed. This is efficient to do, because the parse context has only about a dozen data members, and each data member is less than or equal to the size of a pointer. Copying the entire context when mutating the context is therefore fast. The context does no memory allocation. [tip All these functions that take the parse context as their first parameter will be found by Argument-Dependent Lookup. You will probably never need to qualify them with `boost::parser::`.] [heading Accessors for data that are always available] By convention, the names of all _Parser_ functions that take a parse context, and are therefore intended for use inside semantic actions, contain a leading underscore. [heading _pass_] _pass_ returns a reference to a `bool` indicating the success or failure of the current parse. This can be used to force the current parse to pass or fail: [](auto & ctx) { // If the attribute fails to meet this predicate, fail the parse. if (!necessary_condition(_attr(ctx))) _pass(ctx) = false; } Note that for a semantic action to be executed, its associated parser must already have succeeded. So unless you previously wrote `_pass(ctx) = false` within your action, `_pass(ctx) = true` does nothing; it's redundant. 
[heading _begin_, _end_ and _where_] _begin_ and _end_ return the beginning and end of the range that you passed to _p_, respectively. _where_ returns a _v_ indicating the bounds of the input matched by the current parse. _where_ can be useful if you just want to parse some text and return a result consisting of where certain elements are located, without producing any other attributes. [heading _error_handler_] _error_handler_ returns a reference to the error handler associated with the parser passed to _p_. Any error handler must have the following member functions: [error_handler_api_1] [error_handler_api_2] If you call the second one, the one without the iterator parameter, it will call the first with `_where(context).begin()` as the iterator parameter. The one without the iterator is the one you will use most often. The one with the explicit iterator parameter can be useful in situations where you have messages that are related to each other, associated with multiple locations. For instance, if you are parsing XML, you may want to report that a close-tag does not match its associated open-tag by showing the line where the open-tag was found. That may of course not be located anywhere near `_where(ctx).begin()`. (A description of _globals_ is below.) [](auto & ctx) { // Assume we have a std::vector of open tags, and another // std::vector of iterators to where the open tags were parsed, in our // globals. if (_attr(ctx) != _globals(ctx).open_tags.back()) { std::string open_tag_msg = "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:"; _error_handler(ctx).diagnose( boost::parser::diagnostic_kind::error, open_tag_msg, ctx, _globals(ctx).open_tags_position.back()); std::string close_tag_msg = "does not match close-tag \"" + _attr(ctx) + "\" here:"; _error_handler(ctx).diagnose( boost::parser::diagnostic_kind::error, close_tag_msg, ctx); // Explicitly fail the parse. Diagnostics do not affect parse success. 
_pass(ctx) = false; } } [heading _report_error_ and _report_warning_] There are also some convenience functions that make the above code a little less verbose, _report_error_ and _report_warning_: [](auto & ctx) { // Assume we have a std::vector of open tags, and another // std::vector of iterators to where the open tags were parsed, in our // globals. if (_attr(ctx) != _globals(ctx).open_tags.back()) { std::string open_tag_msg = "Previous open-tag \"" + _globals(ctx).open_tags.back() + "\" here:"; _report_error(ctx, open_tag_msg, _globals(ctx).open_tags_position.back()); std::string close_tag_msg = "does not match close-tag \"" + _attr(ctx) + "\" here:"; _report_error(ctx, close_tag_msg); // Explicitly fail the parse. Diagnostics do not affect parse success. _pass(ctx) = false; } } You should use these less verbose functions almost all the time. The only time you would want to use _error_handler_ is when you are using a custom error handler, and you want access to some part of its interface besides `diagnose()`. [heading Accessors for data that are only sometimes available] [heading _attr_] _attr_ returns a reference to the value of the current parser's attribute. It is available only when the current parser's parse is successful. If the parser has no semantic action, no attribute gets added to the parse context. It can be used to read and write the current parser's attribute: [](auto & ctx) { _attr(ctx) = 3; } If the current parser has no attribute, a _n_ is returned. [heading _val_] _val_ returns a reference to the value of the attribute of the current rule being used to parse (if any), and is available even before the rule's parse is successful. It can be used to set the current rule's attribute, even from a parser that is a subparser inside the rule. Let's say we're writing a parser with a semantic action that is within a rule. 
If we want to set the current rule's value to whatever this subparser parses, we would write this semantic action: [](auto & ctx) { _val(ctx) = _attr(ctx); } If there is no current rule, or the current rule has no attribute, a _n_ is returned. [heading _globals_] _globals_ returns a reference to a user-supplied struct that contains whatever data you want to use during the parse. We'll get into this more later, but for now, here's how you might use it: [](auto & ctx) { // black_list is some set of values that are not allowed. if (_globals(ctx).black_list.contains(_attr(ctx))) _pass(ctx) = false; } [heading _locals_] _locals_ returns a reference to one or more values that are local to the current rule being parsed, if any. If there are two or more local values, _locals_ returns a reference to a _bp_tup_. Rules with locals are something we haven't gotten to yet, but here is how you use _locals_: [](auto & ctx) { auto & local = _locals(ctx); // Use local here. If boost::parser::tuple aliases to hana::tuple, access // its members like this: using namespace hana::literals; auto & first_element = local[0_c]; auto & second_element = local[1_c]; } If there is no current rule, or the current rule has no locals, a _n_ is returned. [heading _params_] _params_, like _locals_, applies to the current rule being used to parse, if any. It also returns a reference to a single value, if the current rule has only one parameter, or a _bp_tup_ of multiple values if the current rule has multiple parameters. If there is no current rule, or the current rule has no parameters, a _n_ is returned. [note _n_ is a type that is used as a return value in _Parser_ for parse context accessors. _n_ is convertible to anything that has a default constructor, convertible from anything, assignable from anything, and has templated overloads for all the overloadable operators. The intention is that a misuse of _val_, _globals_, etc. should compile, and produce an assertion at runtime. 
Experience has shown that using a debugger for investigating the stack that leads to your mistake is a far better user experience than sifting through compiler diagnostics. See the _rationale_ section for a more detailed explanation.] [heading __no_case_func_] __no_case_func_ returns `true` if the current parse context is inside one or more (possibly nested) _no_case_ directives. [endsect] [section Rule Parsers] This example is very similar to the others we've seen so far. This one is different only because it uses a _r_. As an analogy, think of a parser like _ch_ or _d_ as an individual line of code, and a _r_ as a function. Like a function, a _r_ has its own name, and can even be forward declared. Here is how we define a _r_, which is analogous to forward declaring a function: [rule_intro_rule_definition_rule] This declares the rule itself. The _r_ is a parser, and we can immediately use it in other parsers. That definition is pretty dense; take note of these things: * The first template parameter is a tag type `struct doubles`. Here we've declared the tag type and used it all in one go; you can also use a previously declared tag type. * The second template parameter is the attribute type of the parser. * This rule object itself is called `doubles`. * We've given `doubles` the string name `"doubles"` so that _Parser_ knows what to call it when producing a trace of the parser during debugging. Ok, so if `doubles` is a parser, what does it do? We define the rule's behavior by defining a separate parser that by now should look pretty familiar: [rule_intro_rule_definition_rule_def] This is analogous to writing a definition for a forward-declared function. Note that we used the name `doubles_def`. Right now, the `doubles` rule parser and the `doubles_def` non-rule parser have no connection to each other. That's intentional _emdash_ we want to be able to define them separately. 
To connect them, we declare functions with an interface that _Parser_ understands, and use the tag type `struct doubles` to connect them together. We use a macro for that: [rule_intro_rule_definition_macro] This macro expands to two overloads of a function called `parse_rule()` that each take a `struct doubles` parameter and parse using `doubles_def`. The `_def` suffix is a naming convention that this macro relies on to work. The tag type allows the rule parser, `doubles`, to call one of these overloads when used as a parser. Now that we have the `doubles` parser, we can use it like we might any other parser: [rule_intro_parse_call] [note We used _RULE_ in this example. There is also another macro that allows you to define multiple rules in one macro expansion, _RULES_.] The full program: [rule_intro_example] [note The existence of _rs_ means that you will probably never have to write a low-level parser. You can just put existing parsers together into _rs_ instead.] [endsect] [section Parsing `struct`s] So far, we've seen only simple parsers that parse the same value repeatedly (with or without commas and spaces). It's also very common to parse a few values in a specific sequence. Let's say you want to parse an employee record. Here's a parser you might write: namespace bp = boost::parser; auto employee_parser = bp::lit("employee") >> '{' >> bp::int_ >> ',' >> quoted_string >> ',' >> quoted_string >> ',' >> bp::double_ >> '}'; The attribute type for `employee_parser` is `_bp_tup_`. That's great, in that you got all the parsed data for the record without having to write any semantic actions. It's not so great that you now have to get all the individual elements out by their indices, using `get()`. It would be much nicer to parse into a `struct` that has data members _emdash_ with names _emdash_ of the types listed in the _bp_tup_. Fortunately, this just works in _Parser_. The main requirement is that the `struct` you provide be an aggregate type. 
[parsing_into_a_struct_example] Unfortunately, this is taking advantage of the loose attribute assignment logic; the `employee_parser` parser still has a _bp_tup_ attribute. See _p_api_ for a description of attribute out-param compatibility. For this reason, it's even more common to want to make a rule that returns a specific type like `employee`. Just by giving the rule a `struct` type, we make sure that this parser always generates an `employee` struct as its attribute, no matter where it is in the parse. If we made a simple parser `P` that uses the `employee_p` rule, like `bp::int_ >> employee_p`, `P`'s attribute type would be `_bp_tup_`. [struct_rule_example] Just as you can pass a `struct` as an out-param to `parse()` when the parser's attribute type is a tuple, you can also pass a tuple as an out-param to `parse()` when the parser's attribute type is a struct: // Using the employee_p rule from above, with attribute type employee... _bp_tup_ tup; auto const result = bp::parse(input, employee_p, bp::ws, tup); // Ok! [important This automatic use of `struct`s as if they were tuples depends on a bit of metaprogramming. Due to compiler limits, the metaprogram that detects the number of data members of a `struct` is limited to a maximum number of members. Fortunately, that limit is pretty high _emdash_ 50 members.] [endsect] [section Symbol Tables] When writing a parser, it often comes up that there is a set of strings that, when parsed, are associated with a set of values one-to-one. It is tedious to write parsers that recognize all the possible input strings when you have to associate each one with an attribute via a semantic action. Instead, we can use a symbol table. Say we want to parse Roman numerals, one of the most common work-related parsing problems. We want to recognize numbers that start with any number of "M"s, representing thousands, followed by the hundreds, the tens, and the ones. Any of these may be absent from the input, but not all. 
Here are three _Parser_ symbol tables that we can use to recognize ones, tens, and hundreds values, respectively: [roman_numeral_symbol_tables] A _symbols_ maps strings of `char` to their associated attributes. The type of the attribute must be specified as a template parameter to _symbols_ _emdash_ in this case, `int`. Any "M"s we encounter should add 1000 to the result, and all other values come from the symbol tables. Here are the semantic actions we'll need to do that: [roman_numeral_actions] `add_1000` just adds `1000` to `result`. `add` adds whatever attribute is produced by its parser to `result`. Now we just need to put the pieces together to make a parser: [roman_numeral_parser] We've got a few new bits in play here, so let's break it down. `'M'_l` is a /literal parser/. That is, it is a parser that parses a literal `char`, code point, or string. In this case, a `char` "M" is being parsed. The `_l` bit at the end is a _udl_ suffix that you can put after any `char`, `char32_t`, or `char const *` to form a literal parser. You can also make a literal parser by writing _lit_, passing an argument of one of the previously mentioned types. Why do we need any of this, considering that we just used a literal `','` in our previous example? The reason is that `'M'` is not used in an expression with another _Parser_ parser. It is used within `*'M'_l[add_1000]`. If we'd written `*'M'[add_1000]`, clearly that would be ill-formed; `char` has no `operator*()`, nor an `operator[]()`, associated with it. [tip Any time you want to use a `char`, `char32_t`, or string literal in a _Parser_ parser, write it as-is if it is combined with a preexisting _Parser_ subparser `p`, as in `'x' >> p`. Otherwise, you need to wrap it in a call to _lit_, or use the `_l` _udl_ suffix.] On to the next bit: `-hundreds[add]`. By now, the use of the index operator should be pretty familiar; it associates the semantic action `add` with the parser `hundreds`. 
The `operator-()` at the beginning is new. It means that the parser it is applied to is optional. You can read it as "zero or one". So, if `hundreds` is not successfully parsed after `*'M'[add_1000]`, nothing happens, because `hundreds` is allowed to be missing _emdash_ it's optional. If `hundreds` is parsed successfully, say by matching `"CC"`, the resulting attribute, `200`, is added to `result` inside `add`. Here is the full listing of the program. Notice that it would have been inappropriate to use a whitespace skipper here, since the entire parse is a single number, so it was removed. [roman_numeral_example] [important _symbols_ stores all its strings in UTF-32 internally. If you do Unicode or ASCII parsing, this will not matter to you at all. If you do non-Unicode parsing of a character encoding that is not a subset of Unicode (EBCDIC, for instance), it could cause problems. See the section on _unicode_ for more information.] [endsect] [section Mutable Symbol Tables] The previous example showed how to use a symbol table as a fixed lookup table. What if we want to add things to the table during the parse? We can do that, but we need to do so within a semantic action. First, here is our symbol table, already with a single value in it: [self_filling_symbol_table_table] No surprise that it works to use the symbol table as a parser to parse the one string in the symbol table. Now, here's our parser: [self_filling_symbol_table_parser] Here, we've attached the semantic action not to a simple parser like _d_, but to the sequence parser `(bp::char_ >> bp::int_)`. This sequence parser contains two parsers, each with its own attribute, so it produces two attributes as a tuple. [self_filling_symbol_table_action] Inside the semantic action, we can get the first element of the attribute tuple using _udls_ provided by Boost.Hana, and `boost::hana::tuple::operator[]()`. 
The first attribute, from the _ch_, is `_attr(ctx)[0_c]`, and the second, from the _i_, is `_attr(ctx)[1_c]` (if _bp_tup_ aliases to _std_tup_, you'd use `std::get` or _bp_get_ instead). To add the symbol to the symbol table, we call `insert()`. [self_filling_symbol_table_parser] During the parse, `("X", 9)` is parsed and added to the symbol table. Then, the second `'X'` is recognized by the symbol table parser. However: [self_filling_symbol_table_after_parse] If we parse again, we find that `"X"` did not stay in the symbol table. The fact that `symbols` was declared const might have given you a hint that this would happen. Also, notice that the call to `insert()` in the semantic action uses the parse context; that's where all the symbol table changes are stored during the parse. The full program: [self_filling_symbol_table_example] [note It is possible to add symbols to a _symbols_ permanently. To do so, you have to use a mutable _symbols_ object `s`, and add the symbols by calling `s.insert_for_next_parse()`, instead of `s.insert()`. These two operations are orthogonal, so if you want to both add a symbol to the table for the current top-level parse, and leave it in the table for subsequent top-level parses, you need to call both functions. ] [tip _symbols_ also has a call operator that does exactly what `.insert_for_next_parse()` does. This allows you to chain additions with a convenient syntax, like this: symbols<int> roman_numerals; roman_numerals.insert_for_next_parse("I", 1)("V", 5)("X", 10); ] [important _symbols_ stores all its strings in UTF-32 internally. If you do Unicode or ASCII parsing, this will not matter to you at all. If you do non-Unicode parsing of a character encoding that is not a subset of Unicode (EBCDIC, for instance), it could cause problems. See the section on _unicode_ for more information.] [endsect] [section Alternative Parsers] Frequently, you need to parse something that might have one of several forms. 
`operator|()` is overloaded to form alternative parsers. For example: namespace bp = boost::parser; auto const parser_1 = bp::int_ | bp::eps; `parser_1` matches an integer, or if that fails, it matches /epsilon/, the empty string. This is equivalent to writing: namespace bp = boost::parser; auto const parser_2 = -bp::int_; However, neither `parser_1` nor `parser_2` is equivalent to writing this: namespace bp = boost::parser; auto const parser_3 = bp::eps | bp::int_; // Does not do what you think. The reason is that alternative parsers try each of their subparsers, one at a time, and stop on the first one that matches. /Epsilon/ matches anything, since it is zero length and consumes no input. It even matches the end of input. This means that `parser_3` is equivalent to _e_ by itself. [note For this reason, writing `_e_ | p` for any parser `p` is considered a bug. Debug builds will assert when `_e_ | p` is encountered. ] [endsect] [section The Parsers And Their Uses] _Parser_ comes with all the parsers most parsing tasks will ever need. Each one is a `constexpr` object, or a `constexpr` function. Some of the non-functions are also callable, such as _ch_, which may be used directly, or with arguments, as in _ch_`('a', 'z')`. Any parser that can be called, whether a function or callable object, will be called a /callable parser/ from now on. Note that there are no nullary callable parsers; they each take one or more arguments. Each callable parser takes one or more /parse arguments/. A parse argument may be a value or an invocable object that accepts a reference to the parse context. The reference parameter may be mutable or constant. For example: struct get_attribute { template<typename Context> auto operator()(Context & ctx) { return _attr(ctx); } }; This can also be a lambda. 
For example: [](auto const & ctx) { return _attr(ctx); } The operation that produces a value from a parse argument, which may be a value or a callable taking a parse context argument, is referred to as /resolving/ the parse argument. Some callable parsers take a /parse predicate/. A parse predicate is not quite the same as a parse argument, because it must be a callable object, and cannot be a value. A parse predicate's return type must be contextually convertible to `bool`. For example: struct equals_three { template<typename Context> bool operator()(Context const & ctx) { return _attr(ctx) == 3; } }; This may of course be a lambda: [](auto & ctx) { return _attr(ctx) == 3; } An example of how parse arguments are used: namespace bp = boost::parser; // This parser matches one code point that is at least 'a', and at most // the value of last_char, which comes from the globals. auto last_char = [](auto & ctx) { return _globals(ctx).last_char; }; auto subparser = bp::char_('a', last_char); Don't worry for now about what the globals are; the take-away is that you can make any argument you pass to a parser depend on the current state of the parse, by using the parse context: namespace bp = boost::parser; // This parser parses two code points. For the parse to succeed, the // second one must be >= 'a' and <= the first one. auto set_last_char = [](auto & ctx) { _globals(ctx).last_char = _attr(ctx); }; auto parser = bp::char_[set_last_char] >> subparser; Each callable parser returns a new parser, parameterized using the arguments given in the invocation. This table lists all the _Parser_ parsers. For the callable parsers, a separate entry exists for each possible arity of arguments. For a parser `p`, if there is no entry for `p` without arguments, `p` is a function, and cannot itself be used as a parser; it must be called.
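To make /resolving/ concrete, here is a minimal self-contained sketch (not _Parser_'s actual implementation; the `context` struct and `resolve` function are hypothetical stand-ins) of how a parse argument that is either a plain value or an invocable might be resolved against the parse context:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the parse context, for illustration only.
struct context { char last_char; };

// A parse argument is either a plain value or an invocable taking the
// context; resolving it means calling it in the latter case.
template<typename Context, typename Arg>
auto resolve(Context & ctx, Arg && arg) {
    if constexpr (std::is_invocable_v<Arg &, Context &>)
        return arg(ctx);   // invocable: call it with the parse context
    else
        return arg;        // plain value: use it as-is
}
```

With this sketch, `resolve(ctx, 'z')` yields `'z'` unchanged, while `resolve(ctx, [](auto & c) { return c.last_char; })` yields whatever the context holds at the moment of resolution, which is what lets a parser argument track the state of the parse.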
In the table below: * each entry is a global object usable directly in your parsers, unless otherwise noted; * "code point" is used to refer to the elements of the input range, which assumes that the parse is being done in the Unicode-aware code path (if the parse is being done in the non-Unicode code path, read "code point" as "`char`"); * _RES_ is a notional macro that expands to the resolution of a parse argument or the evaluation of a parse predicate; * "`_RES_np_(pred) == true`" is a shorthand notation for "`_RES_np_(pred)` is contextually convertible to `bool` and `true`"; likewise for `false`; * `c` is a character of type `char`, `char8_t`, or `char32_t`; * `str` is a string literal of type `char const[]`, `char8_t const []`, or `char32_t const []`; * `pred` is a parse predicate; * `arg0`, `arg1`, `arg2`, ... are parse arguments; * `a` is a semantic action; * `r` is an object whose type models `parsable_range_like`; and * `p`, `p1`, `p2`, ... are parsers. [note The definition of `parsable_range_like` is: [parsable_range_like_concept] It is intended to be a range-like thing; a null-terminated sequence of characters is considered range-like, given that a pointer `T *` to a null-terminated string is isomorphic with `subrange<T *>`.] [note A slightly more complete description of the attributes generated by these parsers is in a subsequent section. The attributes are repeated here so you can see all the properties of the parsers in one place.] [note Some of the parsers in this table consume no input. All parsers consume the input they match unless otherwise stated in the table below.] [table Parsers and Their Semantics [[Parser] [Semantics] [Attribute Type] [Notes]] [[ _e_ ] [ Matches /epsilon/, the empty string. Always matches, and consumes no input. ] [ None. ] [ Matching _e_ an unlimited number of times creates an infinite loop, which is undefined behavior in C++.
_Parser_ will assert in debug mode when it encounters `*_e_`, `+_e_`, etc (this applies to unconditional _e_ only). ]] [[ `_e_(pred)` ] [ Fails to match the input if `_RES_np_(pred) == false`. Otherwise, the semantics are those of _e_. ] [ None. ] []] [[ _ws_ ] [ Matches a single whitespace code point (see note), according to the Unicode White_Space property. ] [ None. ] [ Prefer this over _space_, even when parsing ASCII; it is no less efficient to use _ws_, and it makes it easier to switch to Unicode mode later. For more info, see the [@https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt Unicode properties]. _ws_ may consume one code point or two. It only consumes two code points when it matches `"\r\n"`. ]] [[ _eol_ ] [ Matches a single newline (see note), following the "hard" line breaks in the Unicode line breaking algorithm. ] [ None. ] [ For more info, see the [@https://unicode.org/reports/tr14 Unicode Line Breaking Algorithm]. _eol_ may consume one code point or two. It only consumes two code points when it matches `"\r\n"`. ]] [[ _eoi_ ] [ Matches only at the end of input, and consumes no input. ] [ None. ] []] [[ _attr_np_`(arg0)` ] [ Always matches, and consumes no input. Generates the attribute `_RES_np_(arg0)`. ] [ `decltype(_RES_np_(arg0))`. ] []] [[ _ch_ ] [ Matches any single code point. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ] []] [[ `_ch_(arg0)` ] [ Matches exactly the code point `_RES_np_(arg0)`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ] []] [[ `_ch_(arg0, arg1)` ] [ Matches the next code point `n` in the input, if `_RES_np_(arg0) <= n && n <= _RES_np_(arg1)`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ] []] [[ `_ch_(r)` ] [ Matches the next code point `n` in the input, if `n` is one of the code points in `r`. 
] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See _attr_gen_. ] [ `r` is taken to be in a UTF encoding. The exact UTF used depends on `r`'s element type. If you do not pass UTF encoded ranges for `r`, the behavior of _ch_ is undefined. Note that ASCII is a subset of UTF-8, so ASCII is fine. EBCDIC is not. `r` is not copied; a reference to it is taken. The lifetime of `_ch_(r)` must be within the lifetime of `r`. This overload of _ch_ does *not* take parse arguments. ]] [[ _cp_ ] [ Matches a single code point. ] [ `char32_t` ] [ Similar to _ch_, but with a fixed `char32_t` attribute type; _cp_ has all the same call operator overloads as _ch_, though they are not repeated here, for brevity. ]] [[ _cu_ ] [ Matches a single code point. ] [ `char` ] [ Similar to _ch_, but with a fixed `char` attribute type; _cu_ has all the same call operator overloads as _ch_, though they are not repeated here, for brevity. Even though the name "`cu`" suggests that this parser matches at the code unit level, it does not. The name refers to the attribute type generated, much like the names _i_ versus _ui_. ]] [[ `_alnum_` ] [ Matches a single code point for which `std::isalnum()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_alpha_` ] [ Matches a single code point for which `std::isalpha()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_blank_` ] [ Matches a single code point for which `std::isblank()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only.
The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_cntrl_` ] [ Matches a single code point for which `std::iscntrl()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_digit_` ] [ Matches a single code point for which `std::isdigit()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_graph_` ] [ Matches a single code point for which `std::isgraph()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_print_` ] [ Matches a single code point for which `std::isprint()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_punct_` ] [ Matches a single code point for which `std::ispunct()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_space_` ] [ Matches a single code point for which `std::isspace()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_xdigit_` ] [ Matches a single code point for which `std::isxdigit()` is `true`. 
] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_lower_` ] [ Matches a single code point for which `std::islower()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ `_upper_` ] [ Matches a single code point for which `std::isupper()` is `true`. ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing. See the entry for _ch_. ] [ Intended for parsing of ASCII only. The results will be wrong for many, many cases if used for Unicode parsing. ]] [[ _lit_np_`(c)`] [ Matches exactly the given code point `c`. ] [ None. ] [_lit_ does *not* take parse arguments. ]] [[ `c_l` ] [ Matches exactly the given code point `c`. ] [ None. ] [ This is a _udl_ that represents `_lit_np_(c)`, for example `'F'_l`. ]] [[ _lit_np_`(r)`] [ Matches exactly the given string `r`. ] [ None. ] [ _lit_ does *not* take parse arguments. ]] [[ `str_l` ] [ Matches exactly the given string `str`. ] [ None. ] [ This is a _udl_ that represents `_lit_np_(str)`, for example `"a string"_l`. ]] [[ `_str_np_(r)`] [ Matches exactly `r`, and generates the match as an attribute. ] [ _std_str_ ] [ _str_ does *not* take parse arguments. ]] [[ `str_p`] [ Matches exactly `str`, and generates the match as an attribute. ] [ _std_str_ ] [ This is a _udl_ that represents `_str_np_(str)`, for example `"a string"_p`. ]] [[ _b_ ] [ Matches `"true"` or `"false"`. ] [ `bool` ] []] [[ _bin_ ] [ Matches a binary unsigned integral value. ] [ `unsigned int` ] [ For example, _bin_ would match `"101"`, and generate an attribute of `5u`. ]] [[ `_bin_(arg0)` ] [ Matches exactly the binary unsigned integral value `_RES_np_(arg0)`.
] [ `unsigned int` ] []] [[ _oct_ ] [ Matches an octal unsigned integral value. ] [ `unsigned int` ] [ For example, _oct_ would match `"31"`, and generate an attribute of `25u`. ]] [[ `_oct_(arg0)` ] [ Matches exactly the octal unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned int` ] []] [[ _hex_ ] [ Matches a hexadecimal unsigned integral value. ] [ `unsigned int` ] [ For example, _hex_ would match `"ff"`, and generate an attribute of `255u`. ]] [[ `_hex_(arg0)` ] [ Matches exactly the hexadecimal unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned int` ] []] [[ _us_ ] [ Matches an unsigned integral value. ] [ `unsigned short` ] []] [[ `_us_(arg0)` ] [ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned short` ] []] [[ _ui_ ] [ Matches an unsigned integral value. ] [ `unsigned int` ] []] [[ `_ui_(arg0)` ] [ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned int` ] []] [[ _ul_ ] [ Matches an unsigned integral value. ] [ `unsigned long` ] []] [[ `_ul_(arg0)` ] [ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned long` ] []] [[ _ull_ ] [ Matches an unsigned integral value. ] [ `unsigned long long` ] []] [[ `_ull_(arg0)` ] [ Matches exactly the unsigned integral value `_RES_np_(arg0)`. ] [ `unsigned long long` ] []] [[ _s_ ] [ Matches a signed integral value. ] [ `short` ] []] [[ `_s_(arg0)` ] [ Matches exactly the signed integral value `_RES_np_(arg0)`. ] [ `short` ] []] [[ _i_ ] [ Matches a signed integral value. ] [ `int` ] []] [[ `_i_(arg0)` ] [ Matches exactly the signed integral value `_RES_np_(arg0)`. ] [ `int` ] []] [[ _l_ ] [ Matches a signed integral value. ] [ `long` ] []] [[ `_l_(arg0)` ] [ Matches exactly the signed integral value `_RES_np_(arg0)`. ] [ `long` ] []] [[ _ll_ ] [ Matches a signed integral value. ] [ `long long` ] []] [[ `_ll_(arg0)` ] [ Matches exactly the signed integral value `_RES_np_(arg0)`. 
] [ `long long` ] []] [[ _f_ ] [ Matches a floating-point number. _f_ uses parsing implementation details from _Spirit_. The specifics of what formats are accepted can be found in their _spirit_reals_. Note that only the default `RealPolicies` is supported by _f_. ] [ `float` ] []] [[ _d_ ] [ Matches a floating-point number. _d_ uses parsing implementation details from _Spirit_. The specifics of what formats are accepted can be found in their _spirit_reals_. Note that only the default `RealPolicies` is supported by _d_. ] [ `double` ] []] [[ `_rpt_np_(arg0)[p]` ] [ Matches iff `p` matches exactly `_RES_np_(arg0)` times. ] [ `std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>` ] [ The special value _inf_ may be used; it indicates unlimited repetition. `decltype(_RES_np_(arg0))` must be implicitly convertible to `int64_t`. Matching _e_ an unlimited number of times creates an infinite loop, which is undefined behavior in C++. _Parser_ will assert in debug mode when it encounters `_rpt_np_(_inf_)[_e_]` (this applies to unconditional _e_ only). ]] [[ `_rpt_np_(arg0, arg1)[p]` ] [ Matches iff `p` matches between `_RES_np_(arg0)` and `_RES_np_(arg1)` times, inclusively. ] [ `std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>` ] [ The special value _inf_ may be used for the upper bound; it indicates unlimited repetition. `decltype(_RES_np_(arg0))` and `decltype(_RES_np_(arg1))` each must be implicitly convertible to `int64_t`. Matching _e_ an unlimited number of times creates an infinite loop, which is undefined behavior in C++. _Parser_ will assert in debug mode when it encounters `_rpt_np_(n, _inf_)[_e_]` (this applies to unconditional _e_ only). ]] [[ `_if_np_(pred)[p]` ] [ Equivalent to `_e_(pred) >> p`. ] [ `std::optional<_ATTR_np_(p)>` ] [ It is an error to write `_if_np_(pred)`. That is, it is an error to omit the conditionally matched parser `p`. 
]] [[ `_sw_np_(arg0)(arg1, p1)(arg2, p2) ...` ] [ Equivalent to `p1` when `_RES_np_(arg0) == _RES_np_(arg1)`, `p2` when `_RES_np_(arg0) == _RES_np_(arg2)`, etc. If there is no such `argN`, the behavior of _sw_ is undefined. ] [ `std::variant<_ATTR_np_(p1), _ATTR_np_(p2), ...>` ] [ It is an error to write `_sw_np_(arg0)`. That is, it is an error to omit the conditionally matched parsers `p1`, `p2`, .... ]] [[ _symbols_t_ ] [ _symbols_ is an associative container of key, value pairs. Each key is a _std_str_ and each value has type `T`. In the Unicode parsing path, the strings are considered to be UTF-8 encoded; in the non-Unicode path, no encoding is assumed. _symbols_ matches the longest prefix `pre` of the input that is equal to one of the keys `k`. If the length `len` of `pre` is zero, and there is no zero-length key, it does not match the input. If `len` is positive, the generated attribute is the value associated with `k`.] [ `T` ] [ Unlike the other entries in this table, _symbols_ is a type, not an object. ]] ] If you have an integral type `IntType` that is not covered by any of the _Parser_ parsers, you can use a more verbose declaration to declare a parser for `IntType`. If `IntType` were unsigned, you would use `uint_parser`. If it were signed, you would use `int_parser`. For example: constexpr parser_interface<uint_parser<IntType, 16>> hex_int; `uint_parser` and `int_parser` accept three more non-type template parameters after the type parameter. They are `Radix`, `MinDigits`, and `MaxDigits`. `Radix` defaults to `10`, `MinDigits` to `1`, and `MaxDigits` to `-1`, which is a sentinel value meaning that there is no max number of digits. So, if you wanted to parse exactly eight hexadecimal digits in a row in order to recognize Unicode character literals like C++ has (e.g.
`\Udeadbeef`), you could use this parser for the digits at the end: constexpr parser_interface<uint_parser<unsigned int, 16, 8, 8>> hex_int; [endsect] [section Directives] A directive is an element of your parser that doesn't have any meaning by itself. Some are second-order parsers that need a first-order parser to do the actual parsing. Others influence the parse in some way. Lexically, you can spot a directive by its use of `[]`. Non-directives never use `[]`, and directives always do. The directives that are second order parsers are technically directives, but since they are also used to create parsers, it is more useful just to focus on that. The directives _rpt_ and _if_ were already described in the section on parsers; we won't say more about them here. That leaves the directives that affect aspects of the parse: [heading _omit_] `_omit_np_[p]` disables attribute generation for the parser `p`. Not only does `_omit_np_[p]` have no attribute, but any attribute generation work that normally happens within `p` is skipped. This directive can be useful in cases like this: say you have some fairly complicated parser `p` that generates a large and expensive-to-construct attribute. Now say that you want to write a function that just counts how many times `p` can match a string (where the matches are non-overlapping). Instead of using `p` directly, and building all those attributes, or rewriting `p` without the attribute generation, use _omit_. [heading _raw_] `_raw_np_[p]` changes the attribute from `_ATTR_np_(p)` to a view that delimits the subrange of the input that was matched by `p`. The type of the view is `_v_`, where `I` is the type of the iterator used within the parse. Note that this may not be the same as the iterator type passed to _p_. For instance, when parsing UTF-8, the iterator passed to _p_ may be `char8_t const *`, but within the parse it will be a UTF-8 to UTF-32 transcoding (converting) iterator.
Just like _omit_, _raw_ causes all attribute-generation work within `p` to be skipped. Similar to the re-use scenario for _omit_ above, _raw_ could be used to find the *locations* of all non-overlapping matches of `p` in a string. [heading _string_view_] `_string_view_np_[p]` is very similar to `_raw_np_[p]`, except that it changes the attribute of `p` to `std::basic_string_view<C>`, where `C` is the character type of the underlying sequence being parsed. _string_view_ requires that the underlying range being parsed is contiguous. Since this can only be detected in C++20 and later, _string_view_ is not available in C++17 mode. Similar to the re-use scenario for _omit_ above, _string_view_ could be used to find the *locations* of all non-overlapping matches of `p` in a string. Whether _raw_ or _string_view_ is more natural to use to report the locations depends on your use case, but they are essentially the same. [heading _no_case_] `_no_case_np_[p]` enables case-insensitive parsing within the parse of `p`. This applies to the text parsed by `_ch_()`, _str_, and _b_ parsers. The number parsers are already case-insensitive. The case-insensitivity is achieved by doing Unicode case folding on the text being parsed and the values in the parser being matched (see note below if you want to know more about Unicode case folding). In the non-Unicode code path, a full Unicode case folding is not done; instead, only the transformations of values less than `0x100` are done. Examples: namespace bp = boost::parser; auto const street_parser = bp::string(u8"Tobias Straße"); assert(!bp::parse("Tobias Strasse" | bp::as_utf32, street_parser)); // No match. assert(bp::parse("Tobias Strasse" | bp::as_utf32, bp::no_case[street_parser])); // Match! auto const alpha_parser = bp::no_case[bp::char_('a', 'z')]; assert(bp::parse("a" | bp::as_utf32, bp::no_case[alpha_parser])); // Match! assert(bp::parse("B" | bp::as_utf32, bp::no_case[alpha_parser])); // Match!
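The fold-then-compare idea behind _no_case_ can be sketched in plain standard C++. This is an ASCII-only stand-in (the `ascii_fold` and `ci_equal` helpers are hypothetical, not part of _Parser_); real Unicode case folding, which _Parser_ performs, also handles multi-code-point expansions:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// ASCII-only model of case folding: map everything to lower case, then
// compare bitwise.  Unicode case folding is far more involved (e.g.
// folding U'ß' yields "ss"), but the comparison strategy is the same.
inline std::string ascii_fold(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

inline bool ci_equal(std::string const & a, std::string const & b) {
    return ascii_fold(a) == ascii_fold(b);
}
```

Folding both sides before comparing, rather than comparing "case-insensitively" pair by pair, is what makes the multi-code-point expansions mentioned in the note below tractable.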
Everything pretty much does what you'd naively expect inside _no_case_, except that the two-character range version of `char_` has a limitation. It only compares a code point from the input to its two arguments (e.g. `'a'` and `'z'` in the example above). It does not do anything special for multi-code point case folding expansions. For instance, `char_(U'ß', U'ß')` matches the input `U"s"`, which makes sense, since `U'ß'` expands to `U"ss"`. However, that same parser *does not* match the input `U"ß"`! In short, stick to pairs of code points that have single-code point case folding expansions. If you need to support the multi-expanding code points, use the other overload, like: `char_(U"abcd/*...*/ß")`. [note Unicode case folding is an operation that makes text uniformly one case, and if you do it to two bits of text `A` and `B`, then you can compare them bitwise to see if they are the same, except for case. Case folding may sometimes expand a code point into multiple code points (e.g. case folding `"ẞ"` yields `"ss"`). When such a multi-code point expansion occurs, the expanded code points are in the NFKC normalization form.] [heading _lexeme_] `_lexeme_np_[p]` disables use of the skipper, if a skipper is being used, within the parse of `p`. This is useful, for instance, if you want to enable skipping in most parts of your parser, but disable it only in one section where it doesn't belong. If you are skipping whitespace in most of your parser, but want to parse strings that may contain spaces, you should use _lexeme_: namespace bp = boost::parser; auto const string_parser = bp::lexeme['"' >> *(bp::char_ - '"') >> '"']; Without _lexeme_, our string parser would correctly match `"foo bar"`, but the generated attribute would be `"foobar"`. [heading _skip_] _skip_ is like the inverse of _lexeme_. It enables skipping in the parse, even if it was not enabled before.
For example, within a call to _p_ that uses a skipper, let's say we have these parsers in use: namespace bp = boost::parser; auto const one_or_more = +bp::char_; auto const skip_or_skip_not_there_is_no_try = bp::lexeme[bp::skip[one_or_more] >> one_or_more]; The use of _lexeme_ disables skipping, but then the use of _skip_ turns it back on. The net result is that the first occurrence of `one_or_more` will use the skipper passed to _p_; the second will not. _skip_ has another use. You can parameterize _skip_ with a different parser to change the skipper just within the scope of the directive. Let's say we passed _space_ to _p_, and we're using these parsers somewhere within that _p_ call: namespace bp = boost::parser; auto const zero_or_more = *bp::char_; auto const skip_both_ways = zero_or_more >> bp::skip(bp::ws)[zero_or_more]; The first occurrence of `zero_or_more` will use the skipper passed to _p_, which is _space_; the second will use _ws_ as its skipper. [endsect] [section Combining Operations] Certain overloaded operators are defined for all parsers in _Parser_. We've already seen some of them used in this tutorial, especially `operator>>()` and `operator|()`, which are used to form sequence parsers and alternative parsers, respectively. Here are all the operators overloaded for parsers. In the tables below: * `c` is a character of type `char` or `char32_t`; * `a` is a semantic action; * `r` is an object whose type models `parsable_range_like` (see _concepts_); and * `p`, `p1`, `p2`, ... are parsers. [note Some of the expressions in this table consume no input. All parsers consume the input they match unless otherwise stated in the table below.] [table Combining Operations and Their Semantics [[Expression] [Semantics] [Attribute Type] [Notes]] [[`!p`] [ Matches iff `p` does not match; consumes no input. ] [None.] []] [[`&p`] [ Matches iff `p` matches; consumes no input. ] [None.]
[]] [[`*p`] [ Parses using `p` repeatedly until `p` no longer matches; always matches. ] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`] [ Matching _e_ an unlimited number of times creates an infinite loop, which is undefined behavior in C++. _Parser_ will assert in debug mode when it encounters `*_e_` (this applies to unconditional _e_ only). ]] [[`+p`] [ Parses using `p` repeatedly until `p` no longer matches; matches iff `p` matches at least once. ] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`] [ Matching _e_ an unlimited number of times creates an infinite loop, which is undefined behavior in C++. _Parser_ will assert in debug mode when it encounters `+_e_` (this applies to unconditional _e_ only). ]] [[`-p`] [ Equivalent to `p | _e_`. ] [`std::optional<_ATTR_np_(p)>`] []] [[`p1 >> p2`] [ Matches iff `p1` matches and then `p2` matches. ] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `>>` is associative; `p1 >> p2 >> p3`, `(p1 >> p2) >> p3`, and `p1 >> (p2 >> p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]] [[`p >> c`] [ Equivalent to `p >> lit(c)`. ] [`_ATTR_np_(p)`] []] [[`p >> r`] [ Equivalent to `p >> lit(r)`. ] [`_ATTR_np_(p)`] []] [[`p1 > p2`] [ Matches iff `p1` matches and then `p2` matches. No back-tracking is allowed after `p1` matches; if `p1` matches but then `p2` does not, the top-level parse fails. ] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `>` is associative; `p1 > p2 > p3`, `(p1 > p2) > p3`, and `p1 > (p2 > p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]] [[`p > c`] [ Equivalent to `p > lit(c)`. ] [`_ATTR_np_(p)`] []] [[`p > r`] [ Equivalent to `p > lit(r)`. ] [`_ATTR_np_(p)`] []] [[`p1 | p2`] [ Matches iff either `p1` matches or `p2` matches. 
] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>` (See note.)] [ `|` is associative; `p1 | p2 | p3`, `(p1 | p2) | p3`, and `p1 | (p2 | p3)` are all equivalent. This attribute type only applies to the case where `p1` and `p2` both generate attributes; see _attr_gen_ for the full rules. ]] [[`p | c`] [ Equivalent to `p | lit(c)`. ] [`_ATTR_np_(p)`] []] [[`p | r`] [ Equivalent to `p | lit(r)`. ] [`_ATTR_np_(p)`] []] [[`p1 - p2`] [ Equivalent to `!p2 >> p1`. ] [`_ATTR_np_(p1)`] []] [[`p - c`] [ Equivalent to `p - lit(c)`. ] [`_ATTR_np_(p)`] []] [[`p - r`] [ Equivalent to `p - lit(r)`. ] [`_ATTR_np_(p)`] []] [[`p1 % p2`] [ Equivalent to `p1 >> *(p2 >> p1)`. ] [`std::string` if `_ATTR_np_(p1)` is `char`, otherwise `std::vector<_ATTR_np_(p1)>`] []] [[`p % c`] [ Equivalent to `p % lit(c)`. ] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`] []] [[`p % r`] [ Equivalent to `p % lit(r)`. ] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`] []] [[`p[a]`] [ Matches iff `p` matches. If `p` matches, the semantic action `a` is executed. ] [None.] []] ] There are a couple of special rules not captured in the table above: First, the zero-or-more and one-or-more repetitions (`operator*()` and `operator+()`, respectively) may collapse when combined. For any parser `p`, `+(+p)` collapses to `+p`; `**p`, `*+p`, and `+*p` each collapse to just `*p`. Second, using _e_ in an alternative parser as any alternative *except* the last one is a common source of errors; _Parser_ disallows it. This is true because, for any parser `p`, `_e_ | p` is equivalent to _e_, since _e_ always matches. This is not true for _e_ parameterized with a condition. For any condition `cond`, `_e_(cond)` is allowed to appear anywhere within an alternative parser. [note When looking at _Parser_ parsers in a debugger, or when looking at their reference documentation, you may see reference to the template _p_iface_.
This template exists to provide the operator overloads described above. It allows the parsers themselves to be very simple _emdash_ most parsers are just a struct with two member functions. _p_iface_ is essentially invisible when using _Parser_, and you should never have to name this template in your own code. ] [endsect] [section Attribute Generation] So far, we've seen several different types of attributes that come from different parsers, `int` for _i_, `_bp_tup_` for `boost::parser::char_ >> boost::parser::int_`, etc. Let's get into how this works with more rigor. [note Some parsers have no attribute at all. In the tables below, the type of the attribute is listed as "None." There is a non-`void` type that is returned from each parser that lacks an attribute. This keeps the logic simple; having to handle the two cases _emdash_ `void` or non-`void` _emdash_ would make the library significantly more complicated. The type of this non-`void` attribute associated with these parsers is an implementation detail. The type comes from the `boost::parser::detail` namespace and is pretty useless. You should never see this type in practice. Within semantic actions, asking for the attribute of a non-attribute-producing parser (using `_attr(ctx)`) will yield a value of the special type `boost::parser::none`. A call to _p_ in a form that returns the attribute parsed simply returns `bool` when there is no attribute; this indicates the success or failure of the parse.] [heading Parser attributes] This table summarizes the attributes generated for all _Parser_ parsers. In the table below: * _RES_ is a notional macro that expands to the resolution of a parse argument or the evaluation of a parse predicate, and * `x` and `y` represent arbitrary objects. [table Parsers and Their Attributes [[Parser] [Attribute Type] [Notes]] [[ _e_ ] [ None. ] []] [[ _eol_ ] [ None. ] []] [[ _eoi_ ] [ None.
] []] [[ `_attr_np_(x)` ] [ `decltype(_RES_np_(x))` ][]] [[ _ch_ ] [ The code point type in Unicode parsing, or `char` in non-Unicode parsing; see below. ] [Includes all the `_p` _udls_ that take a single character, and all parsers in the `boost::parser::ascii` namespace.]] [[ _cp_ ] [ `char32_t` ] []] [[ _cu_ ] [ `char` ] []] [[ `_lit_np_(x)`] [ None. ] [Includes all the `_l` _udls_.]] [[ `_str_np_(x)`] [ _std_str_ ] [Includes all the `_p` _udls_ that take a string.]] [[ _b_ ] [ `bool` ] []] [[ _bin_ ] [ `unsigned int` ] []] [[ _oct_ ] [ `unsigned int` ] []] [[ _hex_ ] [ `unsigned int` ] []] [[ _us_ ] [ `unsigned short` ] []] [[ _ui_ ] [ `unsigned int` ] []] [[ _ul_ ] [ `unsigned long` ] []] [[ _ull_ ] [ `unsigned long long` ] []] [[ _s_ ] [ `short` ] []] [[ _i_ ] [ `int` ] []] [[ _l_ ] [ `long` ] []] [[ _ll_ ] [ `long long` ] []] [[ _f_ ] [ `float` ] []] [[ _d_ ] [ `double` ] []] [[ _symbols_t_ ] [ `T` ] []] ] _ch_ is a bit odd, since its attribute type is polymorphic. When you use _ch_ to parse text in the non-Unicode code path (i.e. a string of `char`), the attribute is `char`. When you use the exact same _ch_ to parse in the Unicode-aware code path, all matching is code point based, and so the attribute type is the type used to represent code points, `char32_t`. All parsing of UTF-8 falls under this case. For example, when you parse plain `char`s, meaning that the parsing is in the non-Unicode code path, the attribute of _ch_ is `char`: auto result = parse("some text", boost::parser::char_); static_assert(std::is_same_v<decltype(result), std::optional<char>>); When you parse UTF-8, the matching is done on a code point basis, and the code point type is `char32_t`, so the attribute type is `char32_t`: auto result = parse("some text" | boost::parser::as_utf8, boost::parser::char_); static_assert(std::is_same_v<decltype(result), std::optional<char32_t>>); [tip If you know or suspect that you will want to use the same parser in Unicode and non-Unicode parsing modes, you can use _cp_ and/or _cu_ to enforce a non-polymorphic attribute type.]
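Attribute composition throughout _Parser_ uses standard(-like) wrappers around the subparser attribute types. As an illustrative compile-time sketch (with `std::tuple` standing in for _bp_tup_, and the `attr_p1`/`attr_p2` aliases being hypothetical subparser attributes, not anything from the library):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <tuple>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical subparser attribute types, for illustration only
// (e.g. the attributes of int_ and double_).
using attr_p1 = int;
using attr_p2 = double;

// How the combining operations wrap them:
using opt_attr = std::optional<attr_p1>;         // -p1
using seq_attr = std::tuple<attr_p1, attr_p2>;   // p1 >> p2 (std::tuple as stand-in)
using alt_attr = std::variant<attr_p1, attr_p2>; // p1 | p2
using rep_attr = std::vector<attr_p1>;           // *p1, since attr_p1 is not char
using str_attr = std::string;                    // *char_, since its attribute is char

static_assert(std::is_same_v<seq_attr, std::tuple<int, double>>);
static_assert(std::is_same_v<alt_attr, std::variant<int, double>>);
```

The same wrapping rules repeat at every level of a combined parser, which is why deeply nested combinations produce nested tuples, variants, and vectors.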
[heading Combining operation attributes] Combining operations of course affect the generation of attributes. In the tables below: * `m` and `n` are parse arguments that resolve to integral values; * `pred` is a parse predicate; * `arg0`, `arg1`, `arg2`, ... are parse arguments; * `a` is a semantic action; and * `p`, `p1`, `p2`, ... are parsers that generate attributes. [table Combining Operations and Their Attributes [[Parser] [Attribute Type]] [[`!p`] [None.]] [[`&p`] [None.]] [[`*p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`+p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`+*p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`*+p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`-p`] [`std::optional<_ATTR_np_(p)>`]] [[`p1 >> p2`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2)>`]] [[`p1 > p2`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2)>`]] [[`p1 >> p2 >> p3`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]] [[`p1 > p2 >> p3`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]] [[`p1 >> p2 > p3`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]] [[`p1 > p2 > p3`] [`_bp_tup_<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]] [[`p1 | p2`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>`]] [[`p1 | p2 | p3`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2), _ATTR_np_(p3)>`]] [[`p1 % p2`] [`std::string` if `_ATTR_np_(p1)` is `char`, otherwise `std::vector<_ATTR_np_(p1)>`]] [[`p[a]`] [None.]] [[`_rpt_np_(arg0)[p]`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`_rpt_np_(arg0, arg1)[p]`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`_if_np_(pred)[p]`] [`std::optional<_ATTR_np_(p)>`]] [[`_sw_np_(arg0)(arg1, p1)(arg2, p2)...`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2), ...>`]] ] [important In case you did not notice it above,
adding a semantic action to a parser erases the parser's attribute. The attribute is still available inside the semantic action as `_attr(ctx)`.] There are a relatively small number of rules that define how sequence parsers and alternative parsers' attributes are generated. (Don't worry, there are examples below.) [heading Sequence parser attribute rules] The attribute generation behavior of sequence parsers is conceptually pretty simple: * the attributes of subparsers form a tuple of values; * subparsers that do not generate attributes do not contribute to the sequence's attribute; * subparsers that do generate attributes usually contribute an individual element to the tuple result; except * when an individual element is next to a container whose elements are of the element's type (or when an optional of that type is next to such a container), the two adjacent attributes collapse into one attribute; and * if the result of all that is a degenerate tuple `_bp_tup_<T>` (even if `T` is a type that means "no attribute"), the attribute becomes `T`. More formally, the attribute generation algorithm works like this. For a sequence parser `p`, let the list of attribute types for the subparsers of `p` be `a0, a1, a2, ..., an`. We get the attribute of `p` by evaluating a compile-time left fold operation, `left-fold({a1, a2, ..., an}, _bp_tup_<a0>, OP)`. `OP` is the combining operation that takes the current attribute type (initially `_bp_tup_<a0>`) and the next attribute type, and returns the new current attribute type. The current attribute type at the end of the fold operation is the attribute type for `p`. `OP` attempts to apply a series of rules, one at a time. The rules are noted as `X >> Y -> Z`, where `X` is the type of the current attribute, `Y` is the type of the next attribute, and `Z` is the new current attribute type. In these rules, `C<T>` is a container of `T`; `none` is a special type that indicates that there is no attribute; `T` is a type; and `Ts...` is a parameter pack of one or more types.
Note that `T` may be the special type `none`. [note The current attribute is always a tuple (call it `Tup`), so the "current attribute `X`" refers to the last element of `Tup`, not `Tup` itself, except for those rules that explicitly mention `_bp_tup_<>` as part of `X`'s type.] * `none >> T -> T` * `T >> none -> T` * `C<T> >> T -> C<T>` * `T >> C<T> -> C<T>` * `C<T> >> optional<T> -> C<T>` * `optional<T> >> C<T> -> C<T>` * `_bp_tup_<> >> T -> _bp_tup_<T>` * `_bp_tup_<Ts...> >> T -> _bp_tup_<Ts..., T>` Again, if the final result is that the attribute is `_bp_tup_<T>`, the attribute becomes `T`. [note What constitutes a container in the rules above is determined by the `container` concept: [container_concept] ] [heading Alternative parser attribute rules] The rules for alternative parsers are much simpler. For an alternative parser `p`, let the list of attribute types for the subparsers of `p` be `a0, a1, a2, ..., an`. The attribute of `p` is `std::variant<a0, a1, a2, ..., an>`, with the following steps applied: * all the `none` attributes are left out, and if any were left out, the attribute is wrapped in a `std::optional`, like `std::optional<std::variant<...>>`; * if the attribute is `std::variant<T>` or `std::optional<std::variant<T>>`, the attribute becomes instead `T` or `std::optional<T>`, respectively; and * if the attribute is `std::variant<>` or `std::optional<std::variant<>>`, the result becomes `none` instead. [heading Formation of containers in attributes] There are no special rules for forming containers from non-containers. In particular, there is no rule like `T >> T -> std::vector<T>`; two non-container values in sequence form a `_bp_tup_<T, T>`, not a container. Containers come from repetition; for instance, the attribute of `*p` is `std::vector<_ATTR_np_(p)>`. The point is, _Parser_ will generate your favorite container out of sequences and repetitions, as long as your favorite container is `std::vector`. Another rule for sequence containers is that a value `x` and a container `c` containing elements of `x`'s type will form a single container. However, `x`'s type must be exactly the same as the elements in `c`.
So, the attribute of `char_ >> string("str")` is odd. In the non-Unicode code path, `char_`'s attribute type is guaranteed to be `char`, so `_ATTR_np_(char_ >> string("str"))` is _std_str_. If you are parsing UTF-8 in the Unicode code path, `char_`'s attribute type is `char32_t`, and `_ATTR_np_(char_ >> string("str"))` is therefore `_bp_tup_<char32_t, std::string>`. Again, there are no special rules for combining values and containers. Every combination results from an exact match. [heading Examples of attributes generated by sequence and alternative parsers] In the table: `a` is a semantic action; and `p`, `p1`, `p2`, ... are parsers that generate attributes. Note that only `>>` is used here; `>` has the exact same attribute generation rules. [table Sequence and Alternative Combining Operations and Their Attributes [[Expression] [Attribute Type]] [[`_e_ >> _e_`] [None.]] [[`p >> _e_`] [`_ATTR_np_(p)`]] [[`_e_ >> p`] [`_ATTR_np_(p)`]] [[`_cu_ >> _str_np_("str")`] [_std_str_]] [[`_str_np_("str") >> _cu_`] [_std_str_]] [[`*_cu_ >> _str_np_("str")`] [`_bp_tup_<std::string, std::string>`]] [[`_str_np_("str") >> *_cu_`] [`_bp_tup_<std::string, std::string>`]] [[`p >> p`] [`_bp_tup_<_ATTR_np_(p), _ATTR_np_(p)>`]] [[`*p >> p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`p >> *p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`*p >> -p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`-p >> *p`] [`std::string` if `_ATTR_np_(p)` is `char`, otherwise `std::vector<_ATTR_np_(p)>`]] [[`_str_np_("str") >> -_cu_`] [_std_str_]] [[`-_cu_ >> _str_np_("str")`] [_std_str_]] [[`!p1 | p2[a]`] [None.]] [[`p | p`] [`_ATTR_np_(p)`]] [[`p1 | p2`] [`std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>`]] [[`p | _e_`] [`std::optional<_ATTR_np_(p)>`]] [[`p1 | p2 | _e_`] [`std::optional<std::variant<_ATTR_np_(p1), _ATTR_np_(p2)>>`]] [[`p1 | p2[a] | p3`] [`std::optional<std::variant<_ATTR_np_(p1), _ATTR_np_(p3)>>`]] ] [heading Directives that affect attribute generation] `_omit_np_[p]` disables attribute generation for the parser
`p`. `_raw_np_[p]` changes the attribute from `_ATTR_np_(p)` to a view that indicates the subrange of the input that was matched by `p`. `_string_view_np_[p]` is just like `_raw_np_[p]`, except that it produces `std::basic_string_view`s. See _directives_ for details. [endsect] [section The `parse()` API] There are multiple top-level parse functions. They have some things in common: * They each return a value contextually convertible to `bool`. * They each take at least a range to parse and a parser. The "range to parse" may be an iterator/sentinel pair or a single range-like object. * They each require forward iterability of the range to parse. * They each accept any range with a character element type. This means that they can each parse ranges of `char`, `wchar_t`, `char8_t`, `char16_t`, or `char32_t`. * The overloads with `prefix_` in their name take an iterator/sentinel pair. For example, `_pp_np_(first, last, p, _ws_)` parses the range `[first, last)`, advancing `first` as it goes. If the parse succeeds, the entire input may or may not have been matched. After the call to _p_, the value of `first` indicates the last location within the input to which `p` matched. The *whole* input was matched if and only if `first == last` after the call to _p_. * When you call any of the range-like overloads of _p_, for example `_p_np_(r, p, _ws_)`, _p_ only indicates success if *all* of `r` was matched by `p`. [note `wchar_t` is an accepted value type for the input. Please note that this is interpreted as UTF-16 on MSVC, and UTF-32 everywhere else.] [heading The overloads] There are eight overloads of _p_ and _pp_ combined, because there are three either/or options in how you call them. [heading Iterator/sentinel versus range-like] You can call _pp_ with an iterator and sentinel that delimit a range of character values. For example: namespace bp = boost::parser; auto const p = /* some parser ... */; char const * str_1 = /* ...
*/; // Using null_sentinel, str_1 can point to three billion characters, and // we can call prefix_parse() without having to find the end of the string first. auto result_1 = bp::prefix_parse(str_1, bp::null_sentinel, p, bp::ws); char str_2[] = /* ... */; auto result_2 = bp::prefix_parse(std::begin(str_2), std::end(str_2), p, bp::ws); The iterator/sentinel overloads can parse successfully without matching the entire input. You can tell if the entire input was matched by checking if `first == last` is true after _pp_ returns. By contrast, you call _p_ with a range of character values. When the range is a reference to an array of characters, any terminating `0` is ignored; this allows calls like `_p_np_("str", p)` to work naturally. namespace bp = boost::parser; auto const p = /* some parser ... */; std::u8string str_1 = u8"str"; auto result_1 = bp::parse(str_1, p, bp::ws); // The null terminator is ignored. This call parses s-t-r, not s-t-r-0. auto result_2 = bp::parse(U"str", p, bp::ws); char const * str_3 = "str"; auto result_3 = bp::parse(str_3 | boost::parser::as_utf16, p, bp::ws); You can also call _p_ with a pointer to a null-terminated string of character values. _p_ considers pointers to null-terminated strings to be ranges, since, for any pointer `T *` to a null-terminated string, `T *` is isomorphic with `subrange<T *, null_sentinel_t>`. namespace bp = boost::parser; auto const p = /* some parser ... */; char const * str_1 = /* ... */ ; auto result_1 = bp::parse(str_1, p, bp::ws); char8_t const * str_2 = /* ... */ ; auto result_2 = bp::parse(str_2, p, bp::ws); char16_t const * str_3 = /* ... */ ; auto result_3 = bp::parse(str_3, p, bp::ws); char32_t const * str_4 = /* ...
*/ ; auto result_4 = bp::parse(str_4, p, bp::ws); int const array[] = { 's', 't', 'r', 0 }; int const * array_ptr = array; auto result_5 = bp::parse(array_ptr, p, bp::ws); Since there is no way for the caller to see that `p` matched only a prefix of the input, the range-like (non-iterator/sentinel) overloads of _p_ indicate failure if the entire input is not matched. [heading With or without an attribute out-parameter] namespace bp = boost::parser; auto const p = '"' >> *(bp::char_ - '"') >> '"'; char const * str = "\"two words\"" ; std::string result_1; bool const success = bp::parse(str, p, result_1); // success is true; result_1 is "two words" auto result_2 = bp::parse(str, p); // !!result_2 is true; *result_2 is "two words" When you call _p_ *with* an attribute out-parameter and parser `p`, the expected type is *something like* `_ATTR_np_(p)`. It doesn't have to be exactly that; I'll explain in a bit. The return type is `bool`. When you call _p_ *without* an attribute out-parameter and parser `p`, the return type is `std::optional<_ATTR_np_(p)>`. Note that when `_ATTR_np_(p)` is itself an `optional`, the return type is an optional of an optional, `std::optional<std::optional<...>>`. Each of those optionals tells you something different. The outer one tells you whether the parse succeeded. If so, the parser was successful, but it still generates an attribute that is an `optional` _emdash_ that's the inner one.
[heading With or without a skipper] namespace bp = boost::parser; auto const p = '"' >> *(bp::char_ - '"') >> '"'; char const * str = "\"two words\"" ; auto result_1 = bp::parse(str, p); // !!result_1 is true; *result_1 is "two words" auto result_2 = bp::parse(str, p, bp::ws); // !!result_2 is true; *result_2 is "twowords" [heading Compatibility of attribute out-parameters] For any call to _p_ that takes an attribute out-parameter, like `_p_np_("str", p, bp::ws, out)`, the call is well-formed for a number of possible types of `out`; `decltype(out)` does not need to be exactly `_ATTR_np_(p)`. For instance, this is well-formed code that does not abort (remember that the attribute type of _str_ is _std_str_): namespace bp = boost::parser; auto const p = bp::string("foo"); std::vector<char> result; bool const success = bp::parse("foo", p, result); assert(success && result == std::vector<char>({'f', 'o', 'o'})); Even though `p` generates a _std_str_ attribute, when it actually takes the data it generates and writes it into an attribute, it only assumes that the attribute is a `container` (see _concepts_), not that it is some particular container type. It will happily `insert()` into a _std_str_ or a _std_vec_char_ all the same. _std_str_ and _std_vec_char_ are both containers of `char`, but it will also insert into a container with a different element type. `p` just needs to be able to insert the elements it produces into the attribute-container.
As long as an implicit conversion allows that to work, everything is fine: namespace bp = boost::parser; auto const p = bp::string("foo"); std::deque<char> result; bool const success = bp::parse("foo", p, result); assert(success && result == std::deque<char>({'f', 'o', 'o'})); This works, too, even though it requires inserting elements from a generated sequence of `char32_t` into a container of `char` (remember that the attribute type of `+_cp_` is _std_vec_char32_): namespace bp = boost::parser; auto const p = +bp::cp; std::string result; bool const success = bp::parse("foo", p, result); assert(success && result == "foo"); This next example works as well, even though the change to a container is not at the top level. It is an element of the result tuple: namespace bp = boost::parser; // p matches one or more non-spaces, followed by a single space, followed by one or more repetitions of "foo". auto const p = +(bp::cp - ' ') >> ' ' >> +bp::string("foo"); // attr_type is the attribute type generated by p. using attr_type = decltype(bp::parse(u8"", p)); static_assert(std::is_same_v<attr_type, std::optional<boost::hana::tuple<std::vector<char32_t>, std::vector<std::string>>>>); // This is similar to attr_type, with the std::vector<char32_t> changed to a std::string. boost::hana::tuple<std::string, std::vector<std::string>> result; bool const success = bp::parse(u8"rôle foofoo", p, result); using namespace boost::hana::literals; assert(success); // p matches. assert(result[0_c].size() == 5u); // The 4 code points in "rôle" get transcoded to 5 UTF-8 code units to fit in the std::string. assert(result[0_c] == (char const *)u8"rôle"); assert(result[1_c] == std::vector<std::string>({"foo", "foo"})); As indicated in the inline comments, there are a couple of things to take away from this example: * If you change an attribute out-param (such as _std_str_ to `std::vector<char>`, or _std_vec_char32_ to `std::deque<char32_t>`), the call to _p_ will often still be well-formed.
* When changing out a container type, if both containers contain character values, the removed container's element type is `char32_t` (or `wchar_t` for non-MSVC builds), and the new container's element type is `char` or `char8_t`, _Parser_ assumes that this is a UTF-32-to-UTF-8 conversion, and silently transcodes the data when inserting into the new container. Let's look at a case where another simple-seeming type replacement does *not* work: namespace bp = boost::parser; auto const p = +(bp::int_ >> +bp::cp); using attr_type = decltype(bp::parse(u8"", p)); static_assert(std::is_same_v<attr_type, std::optional<std::vector<boost::hana::tuple<int, std::vector<char32_t>>>>>); std::vector<boost::hana::tuple<int, std::string>> result; #if 0 bool const success = bp::parse(u8"42 rôle", p, bp::ws, result); // ill-formed! #endif In this case, removing a _std_vec_char32_ and putting a _std_str_ in its place makes the code ill-formed, even though we saw a similar replacement earlier. The reason this one does not work is that the replaced container is part of the element type of yet another container. At some point in the code, `p` would try to insert a `boost::hana::tuple<int, std::vector<char32_t>>` _emdash_ the element type of the attribute type it normally generates _emdash_ into a vector of `boost::hana::tuple<int, std::string>`s. There's no implicit conversion there, so the code is ill-formed. The take-away for this last example is that the ability to arbitrarily swap out data types within the type of the attribute you pass to _p_ is very flexible, but is also limited to structurally simple cases. When we discuss _rs_ in the next section, we'll see how this flexibility in the types of attributes can help when writing complicated parsers. [note Those were all examples of swapping out one container type for another. They make good examples because that is more likely to be surprising, and so it's getting lots of coverage here. You can also do much simpler things, like parsing using a _ui_ and writing its attribute into a `double`.
In general, you can swap any type `T` out of the attribute, as long as the swap would not result in some ill-formed assignment within the parse. ] [heading Unicode versus non-Unicode parsing] A call to _p_ either considers the entire input to be in a UTF format (UTF-8, UTF-16, or UTF-32), or it considers the entire input to be in some unknown encoding. Here is how it deduces which case the call falls under: * If the range is a sequence of `char8_t`, or if the input is a `boost::parser::utf8_view`, the input is UTF-8. * Otherwise, if the value type of the range is `char`, the input is in an unknown encoding. * Otherwise, the input is in a UTF encoding. [tip If you want to parse in ASCII-only mode, or in some other non-Unicode encoding, use only sequences of `char`, like _std_str_ or `char const *`.] [tip If you want to ensure all input is parsed as Unicode, pass the input range `r` as `r | boost::parser::as_utf32` _emdash_ that's the first thing that happens to it inside _p_ in the Unicode parsing path anyway.] [note Since passing `boost::parser::utf8_view` is a special case, and since a sequence of `char`s `r` is otherwise considered an unknown encoding, `boost::parser::parse(r | boost::parser::as_utf8, p)` treats `r` as UTF-8, whereas `boost::parser::parse(r, p)` does not.] [heading The `trace_mode` parameter to _p_] Debugging parsers is notoriously difficult once they reach a certain size. To get a verbose trace of your parse, pass `boost::parser::trace::on` as the final parameter to _p_. It will show you the current parser being matched, the next few characters to be parsed, and any attributes generated. See the _eh_debugging_ section of the tutorial for details. [heading Globals and error handlers] Each call to _p_ can optionally have a globals object associated with it.
To use a particular globals object with your parser, you call _w_glb_ to create a new parser with the globals object in it: struct globals_t { int foo; std::string bar; }; auto const parser = /* ... */; globals_t globals{42, "yay"}; auto result = boost::parser::parse("str", boost::parser::with_globals(parser, globals)); Every semantic action within that call to _p_ can access the same `globals_t` object using `_globals(ctx)`. The default error handler is great for most needs, but if you want to change it, you can do so by creating a new parser with a call to _w_eh_: auto const parser = /* ... */; my_error_handler error_handler; auto result = boost::parser::parse("str", boost::parser::with_error_handler(parser, error_handler)); [tip If your parsing environment does not allow you to report errors to a terminal, you may want to use [classref boost::parser::callback_error_handler `callback_error_handler`] instead of the default error handler.] [important Globals and the error handler are ignored, if present, on any parser except the top-level parser.] [endsect] [section Rules] We saw in the previous section how _p_ is flexible in what types it will accept as attribute out-parameters. That flexibility is a blessing and a curse. For instance, say you wanted to use the parser `+_ch_` to parse a sequence of `char`, and capture the result in a _std_vec_char_. `+_ch_` generates a _std_str_ attribute when parsing a sequence of `char`, so without that flexibility you'd have to write the result into a string first, including all the allocations that implies, and then you'd have to allocate space in the vector, and copy the entire result. Not great. The flexibility of attribute out-parameters lets you avoid that. On the other hand, if you want to parse your result into a _std_vec_char_, but *accidentally* pass a _std_str_, the code is well-formed. Usually, we expect type mismatches like this to be ill-formed in C++. Fortunately, _rs_ help you address both these concerns.
[heading Using _rs_ to nail down attribute flexibility] Every rule has a specific attribute type. If one is not specified, the rule has no attribute. The fact that the attribute is a specific type allows you to remove attribute flexibility. For instance, say we have a rule defined like this: [rule_intro_rule_definition] You can then use it in a call to _p_, and _p_ will return a `std::optional<std::vector<double>>`: [rule_intro_parse_call] If you call _p_ with an attribute out-parameter, it must be exactly `std::vector<double>`: std::vector<double> vec_result; bp::parse(input, doubles, bp::ws, vec_result); // Ok. std::deque<double> deque_result; bp::parse(input, doubles, bp::ws, deque_result); // Ill-formed! If we wanted to use a `std::deque<double>` as the attribute type of our rule: // Attribute changed to std::deque<double>. bp::rule<class doubles_tag, std::deque<double>> doubles = "doubles"; auto const doubles_def = bp::double_ >> *(',' >> bp::double_); BOOST_PARSER_DEFINE_RULES(doubles); int main() { std::deque<double> deque_result; bp::parse(input, doubles, bp::ws, deque_result); // Ok. } So, the attribute flexibility is still available, but only *within* the rule _emdash_ the parser `bp::double_ >> *(',' >> bp::double_)` can parse into a `std::vector<double>` or a `std::deque<double>`, but the rule `doubles` must parse into only the exact attribute it was declared to generate. The reason for this is that, inside the rule parsing implementation, there is code something like this: using attr_t = _ATTR_np_(doubles_def); attr_t attr; parse(first, last, parser, attr); attribute_out_param = std::move(attr); Where `attribute_out_param` is the attribute out-parameter we pass to _p_. If that final move assignment is ill-formed, the call to _p_ is too.
[heading Using rules to exploit attribute flexibility] So, even though a rule reduces the flexibility of attributes it can generate, the fact that it is so easy to write a new rule means that we can use rules themselves to get the attribute flexibility we want across our code: namespace bp = boost::parser; // We only need to write the definition once... auto const generic_doubles_def = bp::double_ >> *(',' >> bp::double_); bp::rule<class vec_doubles_tag, std::vector<double>> vec_doubles = "vec_doubles"; auto const & vec_doubles_def = generic_doubles_def; // ... and re-use it, BOOST_PARSER_DEFINE_RULES(vec_doubles); // Attribute changed to std::deque<double>. bp::rule<class deque_doubles_tag, std::deque<double>> deque_doubles = "deque_doubles"; auto const & deque_doubles_def = generic_doubles_def; // ... and re-use it again. BOOST_PARSER_DEFINE_RULES(deque_doubles); Now we have one of each, and we did not have to copy any parsing logic that would have to be maintained in two places. [heading Forward declaration] One of the advantages of using rules is that you can declare all your rules up front and then use them immediately afterward. This lets you make rules that use each other, even though their definitions are mutually recursive: namespace bp = boost::parser; // Assume we have some polymorphic type that can be an object/dictionary, // array, string, or int, called `value_type`. bp::rule<class string_tag, std::string> const string = "string"; bp::rule<class object_element_tag, boost::hana::tuple<std::string, value_type>> const object_element = "object-element"; bp::rule<class object_tag, value_type> const object = "object"; bp::rule<class array_tag, value_type> const array = "array"; bp::rule<class value_tag, value_type> const value = "value"; auto const string_def = bp::lexeme['"' >> *(bp::char_ - '"') > '"']; auto const object_element_def = string > ':' > value; auto const object_def = '{'_l >> -(object_element % ',') > '}'; auto const array_def = '['_l >> -(value % ',') > ']'; auto const value_def = bp::int_ | bp::bool_ | string | array | object; BOOST_PARSER_DEFINE_RULES(string, object_element, object, array, value); Here we have a parser for a JavaScript-value-like type `value_type`.
`value_type` may be an array, which itself may contain other arrays, objects, strings, etc. Since we need to be able to parse objects within arrays and vice versa, we need each of those two parsers to be able to refer to each other. [heading _val_] Inside all of a rule's semantic actions, the expression `_val_np_(ctx)` is a reference to the attribute that the rule generates. This can be useful when you want subparsers to build up the attribute in a specific way: namespace bp = boost::parser; using namespace bp::literals; bp::rule<class ints_tag, std::vector<int>> const ints = "ints"; auto twenty_zeros = [](auto & ctx) { _val(ctx).resize(20, 0); }; auto push_back = [](auto & ctx) { _val(ctx).push_back(_attr(ctx)); }; auto const ints_def = "20-zeros"_l[twenty_zeros] | +bp::int_[push_back]; BOOST_PARSER_DEFINE_RULES(ints); [tip That's just an example. It's almost always better to do things without using semantic actions. We could have instead written `ints_def` as `"20-zeros" >> bp::attr(std::vector<int>(20)) | +bp::int_`, which has the same semantics, is a lot easier to read, and is a lot less code.] [heading Locals] The _r_ template takes another template parameter we have not discussed yet. You can pass a third parameter to _r_, which will be available within semantic actions used in the rule as `_locals_np_(ctx)`. This gives your rule some local state, if it needs it: struct foo_locals { char first_value = 0; }; namespace bp = boost::parser; bp::rule<class foo_tag, int, foo_locals> const foo = "foo"; auto record_first = [](auto & ctx) { _locals(ctx).first_value = _attr(ctx); }; auto check_against_first = [](auto & ctx) { char const first = _locals(ctx).first_value; char const attr = _attr(ctx); if (attr == first) _pass(ctx) = false; _val(ctx) = (int(first) << 8) | int(attr); }; auto const foo_def = bp::cu[record_first] >> bp::cu[check_against_first]; BOOST_PARSER_DEFINE_RULES(foo); `foo` matches the input if it can match two elements of the input in a row, but only if they are not the same value.
Without locals, it's a lot harder to write parsers that have to track state as they parse. [heading Parameters] Sometimes, it is convenient to parameterize parsers. Consider this parsing rule from the _yaml_ spec: [pre \[137\] c-flow-sequence(n,c) ::= “\[” s-separate(n,c)? ns-s-flow-seq-entries(n,in-flow(c))? “\]” ] This YAML rule says that the parsing should proceed into two YAML subrules, both of which have these `n` and `c` parameters. It is certainly possible to transliterate these YAML parsing rules to something that uses unparameterized _Parser_ _rs_, but it is quite painful to do so. You give parameters to a _r_ by calling its `with()` member. The values you pass to `with()` are used to create a _bp_tup_ that is available in semantic actions attached to the rule, using `_params_np_(ctx)`. namespace bp = boost::parser; // Declare our rules. bp::rule<class foo_tag> foo = "foo"; bp::rule<class bar_tag> bar = "bar"; // Get the first parameter for this rule. auto first_param = [](auto & ctx) { using namespace boost::hana::literals; return _params(ctx)[0_c]; }; auto const foo_def = bp::repeat(first_param)[' '_l]; // Match ' ' the number of times indicated by the first parameter to foo. // Assume that bar has a locals struct with a local_indent member, and // that set_local_indent and local_indent are lambdas that respectively write // and read _locals(ctx).local_indent. // Parse an integer, and then pass that as a parameter to foo. auto const bar_def = bp::int_[set_local_indent] >> foo.with(local_indent); BOOST_PARSER_DEFINE_RULES(foo, bar); Passing parameters to _rs_ like this allows you to easily write parsers that change the way they parse depending on contextual data that they have already parsed. [heading The __p_ variable template] Getting at one of a rule's arguments and passing it as an argument to another parser can be very verbose.
__p_ is a variable template that allows you to refer to the `n`th argument to the current rule, so that you can, in turn, pass it to one of the rule's subparsers. Using this, `foo_def` above can be rewritten as: auto const foo_def = bp::repeat(bp::_p<0>)[' '_l]; Using __p_ can prevent you from having to write a bunch of lambdas that each get an argument out of the parse context using `_params_np_(ctx)[0_c]` or similar. [endsect] [section Unicode Support] _Parser_ was designed from the start to be Unicode friendly. There are numerous references to the "Unicode code path" and the "non-Unicode code path" in the _Parser_ documentation. Though there are in fact two code paths for Unicode and non-Unicode parsing, the code is not very different in the two code paths, as they are written generically. The only difference is that the Unicode code path parses the input as a range of code points, and the non-Unicode path does not. In effect, this means that, in the Unicode code path, when you call `_p_np_(r, p)` for some input range `r` and some parser `p`, the parse happens as if you called `_p_np_(r | boost::parser::as_utf32, p)` instead. (Of course, it does not matter if `r` is a null-terminated pointer, a proper range, or an iterator/sentinel pair; those all work fine with `boost::parser::as_utf32`.) Matching "characters" within _Parser_'s parsers is assumed to be a code point match. In the Unicode path there is a code point from the input that is matched to each _ch_ parser. In the non-Unicode path, the encoding is unknown, and so each element of the input is considered to be a whole "character" in the input encoding, analogous to a code point. From this point on, I will therefore refer to a single element of the input exclusively as a code point. So, let's say we write this parser: constexpr auto char8_parser = boost::parser::char_('\xcc'); For any _ch_ parser that should match a value or values, the type of the value to match is retained.
So `char8_parser` contains a `char` that it will use for matching. If we had written: constexpr auto char32_parser = boost::parser::char_(U'\xcc'); `char32_parser` would instead contain a `char32_t` that it would use for matching. So, at any point during the parse, if `char8_parser` were being used to match a code point `next_cp` from the input, we would see the moral equivalent of `next_cp == '\xcc'`, and if `char32_parser` were being used to match `next_cp`, we'd see the equivalent of `next_cp == U'\xcc'`. The take-away here is that you can write _ch_ parsers that match specific values, without worrying if the input is Unicode or not because, under the covers, what takes place is a simple comparison of two integral values. [note _Parser_ actually promotes any two values to a common type using `std::common_type` before comparing them. This almost always works, because the input and any parameter passed to _ch_ must be character types. ] Since matches are always done at a code point level (remember, a "code point" in the non-Unicode path is assumed to be a single `char`), you get different results trying to match UTF-8 input in the Unicode and non-Unicode code paths: namespace bp = boost::parser; { std::string str = (char const *)u8"\xcc\x80"; // encodes the code point U+0300 auto first = str.begin(); // Since we've done nothing to indicate that we want to do Unicode // parsing, and we've passed a range of char to parse(), this will do // non-Unicode parsing. std::string chars; assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars)); // Finds one match of the *char* 0xcc, because the value in the parser // (0xcc) was matched against the two code points in the input (0xcc and // 0x80), and the first one was a match. assert(chars == "\xcc"); } { std::u8string str = u8"\xcc\x80"; // encodes the code point U+0300 auto first = str.begin(); // Since the input is a range of char8_t, this will do Unicode // parsing.
The same thing would have happened if we passed // str | boost::parser::as_utf32 or even str | boost::parser::as_utf8. std::string chars; assert(bp::parse(first, str.end(), *bp::char_('\xcc'), chars)); // Finds zero matches of the *code point* 0xcc, because the value in // the parser (0xcc) was matched against the single code point in the // input, 0x0300. assert(chars == ""); } [heading Implicit Transcoding] Additionally, it is expected that most programs will use UTF-8 for the encoding of Unicode strings. _Parser_ is written with this typical case in mind. This means that if you are parsing 32-bit code points (as you always are in the Unicode path), and you want to catch the result in a container `C` of `char` or `char8_t` values, _Parser_ will silently transcode from UTF-32 to UTF-8 and write the attribute into `C`. This means that _std_str_, `std::u8string`, etc. are fine to use as attribute out-parameters for `*_ch_`, and the result will be UTF-8. [note UTF-16 strings as attributes are not supported directly. If you want to use UTF-16 strings as attributes, you may need to do so by transcoding a UTF-8 or UTF-32 attribute to UTF-16 within a semantic action. You can do this by using `boost::parser::as_utf16`.] The treatment of strings as UTF-8 is nearly ubiquitous within _Parser_. For instance, though the entire interface of _symbols_ uses _std_str_ or `std::string_view`, UTF-32 comparisons are used internally. [heading Explicit Transcoding] I mentioned above that the use of `boost::parser::utf*_view` as the range to parse opts you in to Unicode parsing. Here's a bit more about these views and how best to use them. If you want to do Unicode parsing, you're always going to be comparing code points at each step of the parse. As such, you're going to implicitly convert any parse input to UTF-32, if needed. This is what all the parse API functions do internally. 
However, there are times when you have parse input that is a sequence of UTF-8-encoded `char`s, and you want to do Unicode-aware parsing. As mentioned previously, _Parser_ has a special case for `char` inputs, and it will *not* assume that `char` sequences are UTF-8. If you want to tell the parse API to do Unicode processing on them anyway, you can use the `as_utf32` range adaptor. (Note that you can use any of the `as_utf*` adaptors and the semantics will not differ from the semantics below.) namespace bp = boost::parser; auto const p = '"' >> *(bp::char_ - '"' - 0xb6) >> '"'; char const * str = "\"two wörds\""; // ö is two code units, 0xc3 0xb6 auto result_1 = bp::parse(str, p); // Treat each char as a code point (typically ASCII). assert(!result_1); auto result_2 = bp::parse(str | bp::as_utf32, p); // Unicode-aware parsing on code points. assert(result_2); The first call to _p_ treats each `char` as a code point, and since `"ö"` is the pair of code units `0xc3` `0xb6`, the parse matches the second code unit against the `- 0xb6` part of the parser above, causing the parse to fail. This happens because each code unit/`char` in `str` is treated as an independent code point. The second call to _p_ succeeds because, when the parse gets to the code point for `'ö'`, it is `0xf6` (U+00F6), which does not match the `- 0xb6` part of the parser. The other adaptors, `as_utf8` and `as_utf16`, are also provided for completeness. Each of them can transcode any sequence of a character type; a null-terminated string counts as such a sequence. [endsect] [section Callback Parsing] In most parsing cases, being able to generate an attribute that represents the result of the parse, or being able to parse into such an attribute, is sufficient. Sometimes, it is not. If you need to parse a very large chunk of text, the generated attribute may be too large to fit in memory. In other cases, you may want to generate attributes sometimes, and not others. 
_cb_rs_ exist for these kinds of uses. A _cb_r_ is just like a rule, except that it allows the rule's attribute to be returned to the caller via a callback, as long as the parse is started with a call to _cbp_ instead of _p_. Within a call to _p_, a _cb_r_ is identical to a regular _r_. For a rule with no attribute, the signature of a callback function is `void (tag)`, where `tag` is the tag-type used when declaring the rule. For a rule with an attribute `attr`, the signature is `void (tag, attr)`. For instance, with this rule: boost::parser::callback_rule foo = "foo"; this would be an appropriate callback function: void foo_callback(foo_tag) { std::cout << "Parsed a 'foo'!\n"; } For this rule: boost::parser::callback_rule bar = "bar"; this would be an appropriate callback function: void bar_callback(bar_tag, std::string const & s) { std::cout << "Parsed a 'bar' containing " << s << "!\n"; } [important In the case of `bar_callback()`, we don't need to do anything with `s` besides insert it into a stream, so we took it as a `const` lvalue reference. _Parser_ moves all attributes into callbacks, so the signature could also have been `void bar_callback(bar_tag, std::string s)` or `void bar_callback(bar_tag, std::string && s)`.] You opt into callback parsing by parsing with a call to _cbp_ instead of _p_. If you use _cb_rs_ with _p_, they're just regular _rs_. This allows you to choose whether to do "normal" attribute-generating/attribute-assigning parsing with _p_, or callback parsing with _cbp_, without rewriting much parsing code, if any. The only reason all _rs_ are not _cb_rs_ is that you may want to have some _rs_ use callbacks within a parse, and have some that do not. For instance, if you want to report the attribute of _cb_r_ `r1` via callback, `r1`'s implementation may use some rule `r2` to generate some or all of its attribute. See _ex_cb_json_ for an extended example of callback parsing. 
[endsect] [section Error Handling and Debugging] [heading Error handling] _Parser_ has good error reporting built into it. Consider what happens when we fail to parse at an expectation point (created using `operator>()`). If I feed the parser from the _ex_cb_json_ example a file called sample.json containing this input (note the unmatched `'['`): [teletype]`` { "key": "value", "foo": [, "bar": [] } `` This is the error message that is printed to the terminal: [teletype]`` sample.json:3:12: error: Expected ']' here: "foo": [, "bar": [] ^ `` That message is formatted like the diagnostics produced by Clang and GCC. It quotes the line on which the failure occurred, and even puts a caret under the exact position at which the parse failed. This error message is suitable for many kinds of end-users, and interoperates well with anything that supports Clang and/or GCC diagnostics. Most of _Parser_'s error handlers format their diagnostics this way, though you are not bound by that. You can make an error handler type that does whatever you want, as long as it meets the error handler interface. See `error_handler` in _concepts_ for details. The _Parser_ error handlers are: * _default_eh_: Produces formatted diagnostics like the one above, and prints them (errors and warnings alike) to `std::cerr`. It has no associated file name. This handler is `constexpr`-friendly. * _stream_eh_: Produces formatted diagnostics. One or two streams may be used. If two are used, errors go to one stream and warnings go to the other. A file name can be associated with the parse; if it is, that file name will appear in all diagnostics. * _cb_eh_: Produces formatted diagnostics. Reports each diagnostic by passing the formatted message to a callback, rather than streaming it out. A file name can be associated with the parse; if it is, that file name will appear in all diagnostics. This handler is useful for recording the diagnostics in memory. 
* _rethrow_eh_: Does nothing but re-throw any exception that it is asked to handle. Its `diagnose()` member functions are no-ops. [tip If you want to provide your own error handler, but still want to use the same formatting as _Parser_ error handlers, you can use the functions `write_formatted_message()` and `write_formatted_expectation_failure_error_message()` to do that for you.] [heading Fixing ill-formed code] Sometimes, during the writing of a parser, you make a simple mistake that is diagnosed horrifyingly, due to the high number of template instantiations between the line you just wrote and the point of use (usually, the call to _p_). By "sometimes", I mean "almost always and many, many times". _Parser_ has a workaround for situations like this. The workaround is to make the ill-formed code well-formed in as many circumstances as possible, and then do a runtime assert instead. C++ programmers usually try to catch mistakes as early as possible. That usually means making as much bad code ill-formed as possible. Counter-intuitively, this does not work well in parser combinator situations. For an example of just how dramatically different these two debugging scenarios can be with _Parser_, please see the very long discussion in the _n_is_weird_ section of _rationale_. If you are morally opposed to this approach, or just hate fun, good news: you can turn off the use of this technique entirely by defining `BOOST_PARSER_NO_RUNTIME_ASSERTIONS`. [heading Runtime Debugging] Debugging parsers is hard. Any parser above a certain complexity level is nearly impossible to debug simply by looking at the parser's code. Stepping through the parse in a debugger is even worse. 
To provide a reasonable chance of debugging your parsers, _Parser_ has a trace mode that you can turn on simply by providing an extra parameter to _p_ or _cbp_: boost::parser::parse(input, parser, boost::parser::trace::on); Every overload of _p_ and _cbp_ takes this final parameter, which defaults to `boost::parser::trace::off`. If we trace a substantial parser, we will see a *lot* of output. Each code point of the input must be considered, one at a time, to see if a certain rule matches. As an example, let's trace a parse using the JSON parser from _ex_json_. The input is `"null"`. `null` is one of the types that a JavaScript value can have; the top-level parser in the JSON parser example is: auto const value_p_def = number | bp::bool_ | null | string | array_p | object_p; So, a JSON value can be a number, or a Boolean, a `null`, etc. During the parse, each alternative will be tried in turn, until one is matched. I picked `null` because it is relatively close to the beginning of the `value_p_def` alternative parser. Even so, the output is pretty huge. Let's break it down as we go: [teletype]`` [begin value; input="null"] `` Each parser is traced as `[begin foo; ...]`, then the parsing operations themselves, and then `[end foo; ...]`. The name of a rule is used as its name in the `begin` and `end` parts of the trace. Non-rules have a name that is similar to the way the parser looked when you wrote it. Most lines will have the next few code points of the input quoted, as we have here (`input="null"`). [teletype]`` [begin number | bool_ | null | string | ...; input="null"] `` This shows the beginning of the parser *inside* the rule `value` _emdash_ the parser that actually does all the work. In the example code, this parser is called `value_p_def`. Since it isn't a rule, we have no name for it, so we show its implementation in terms of subparsers. Since it is a bit long, we don't print the entire thing. That's why that ellipsis is there. 
[teletype]`` [begin number; input="null"] [begin raw[lexeme[ >> ...]][<>]; input="null"] `` Now we're starting to see the real work being done. `number` is a somewhat complicated parser that does not match `"null"`, so there's a lot to wade through when following the trace of its attempt to do so. One thing to note is that, since we cannot print a name for an action, we just print `"<>"`. Something similar happens when we come to an attribute that we cannot print, because it has no stream insertion operation. In that case, `"<>"` is printed. [teletype]`` [begin raw[lexeme[ >> ...]]; input="null"] [begin lexeme[-char_('-') >> char_('1', '9') >> ... | ... >> ...]; input="null"] [begin -char_('-') >> char_('1', '9') >> *ascii::digit | char_('0') >> -(char_('.') >> ...) >> -( >> ...); input="null"] [begin -char_('-'); input="null"] [begin char_('-'); input="null"] no match [end char_('-'); input="null"] matched "" attribute: <> [end -char_('-'); input="null"] [begin char_('1', '9') >> *ascii::digit | char_('0'); input="null"] [begin char_('1', '9') >> *ascii::digit; input="null"] [begin char_('1', '9'); input="null"] no match [end char_('1', '9'); input="null"] no match [end char_('1', '9') >> *ascii::digit; input="null"] [begin char_('0'); input="null"] no match [end char_('0'); input="null"] no match [end char_('1', '9') >> *ascii::digit | char_('0'); input="null"] no match [end -char_('-') >> char_('1', '9') >> *ascii::digit | char_('0') >> -(char_('.') >> ...) >> -( >> ...); input="null"] no match [end lexeme[-char_('-') >> char_('1', '9') >> ... | ... 
>> ...]; input="null"] no match [end raw[lexeme[ >> ...]]; input="null"] no match [end raw[lexeme[ >> ...]][<>]; input="null"] no match [end number; input="null"] [begin bool_; input="null"] no match [end bool_; input="null"] `` `number` and `boost::parser::bool_` did not match, but `null` will: [teletype]`` [begin null; input="null"] [begin "null" >> attr(null); input="null"] [begin "null"; input="null"] [begin string("null"); input="null"] matched "null" attribute: [end string("null"); input=""] matched "null" attribute: null `` Finally, this parser actually matched, and the match generated the attribute `null`, which is a special value of the type `json::value`. Since we were matching a string literal `"null"`, earlier there was no attribute until we reached the `attr(null)` parser. [teletype]`` [end "null"; input=""] [begin attr(null); input=""] matched "" attribute: null [end attr(null); input=""] matched "null" attribute: null [end "null" >> attr(null); input=""] matched "null" attribute: null [end null; input=""] matched "null" attribute: null [end number | bool_ | null | string | ...; input=""] matched "null" attribute: null [end value; input=""] -------------------- parse succeeded -------------------- `` At the very end of the parse, the trace code prints out whether the top-level parse succeeded or failed. Some things to be aware of when looking at _Parser_ trace output: * There are some parsers you don't know about, because they are not directly documented. For instance, `p[a]` forms an `action_parser` containing the parser `p` and semantic action `a`. This is essentially an implementation detail, but unfortunately the trace output does not hide this from you. * For a parser `p`, the trace-name may be intentionally different from the actual structure of `p`. For example, in the trace above, you see a parser called simply `"null"`. 
This parser is actually `boost::parser::omit[boost::parser::string("null")]`, but what you typically write is just `"null"`, so that's the name used. There are two special cases like this: the one described here for `omit[string]`, and another for `omit[char_]`. * Since there are no other special cases for how parser names are printed, you may see parsers that are unlike what you wrote in your code. In the sections about the parsers and combining operations, you will sometimes see a parser or combining operation described in terms of an equivalent parser. For example, `if_(pred)[p]` is described as "Equivalent to `_e_(pred) >> p`". In a trace, you will not see `if_`; you will see _e_ and `p` instead. * The values of arguments passed to parsers are printed whenever possible. Sometimes, a parse argument is not a value itself, but a callable that produces that value. In these cases, you'll see the resolved value of the parse argument. [endsect] [section Memory Allocation] _Parser_ seldom allocates memory. The exceptions to this are: * _symbols_ allocates memory for the symbol/attribute pairs it contains. If symbols are added during the parse, allocations must also occur then. * The error handlers that can take a file name allocate memory for the file name, if one is provided. * If trace is turned on by passing `boost::parser::trace::on` to a top-level parsing function, the names of parsers are allocated. * When a failed expectation is encountered (using `operator>()`), the name of the failed parser is placed into a _std_str_, which will usually cause an allocation. * _str_'s attribute is a _std_str_, the use of which implies allocation. You can avoid this allocation by explicitly using a different string type for the attribute that does not allocate. * The attribute for `_rpt_np_(p)` in all its forms, including `operator*()`, `operator+()`, and `operator%()`, is `std::vector<_ATTR_np_(p)>`, the use of which implies allocation. 
You can avoid this allocation by explicitly using a different sequence container for the attribute that does not allocate. `boost::container::static_vector` or C++26's `std::inplace_vector` may be useful as such replacements. With the exception of allocating the name of the parser that was expected in a failed expectation situation, _Parser_ does not allocate unless you tell it to, by using _symbols_, using an error handler that takes a file name, turning on trace, or parsing into attributes that allocate. [endsect] [section Best Practices] [heading Parse Unicode from the start] If you want to parse ASCII, using the Unicode parsing API will not actually cost you anything. Your input will be parsed, `char` by `char`, and compared to values that are Unicode code points (which are `char32_t`s). One caveat is that there may be an extra branch on each char, if the input is UTF-8. If your performance requirements can tolerate this, your life will be much easier if you just start with Unicode and stick with it. Starting with Unicode support and UTF-8 input will allow you to properly handle unexpected input, like non-ASCII languages (that's most of them), with no additional effort on your part. For instance, matching whitespace is a little funky in the general (meaning Unicode) case, and matching only the whitespace code points in the ASCII range is an unnecessary limitation. [heading Write rules, and test them in isolation] Treat rules as the unit of work in your parser. Write a rule, test its corners, and then use it to build larger rules or parsers. This allows you to get better coverage with less work, since exercising all the code paths of your rules, one by one, keeps the combinatorial number of paths through your code manageable. [heading Don't rely on the `boost::parser::ascii` parsers] These are broken for many use cases, because they use the implementations from the C library (e.g. `isalnum()`). 
Those implementations do not work for non-ASCII values, *even though they take `int` parameters*. A general implementation of each of these is also difficult even in Unicode parsing, because the meaning of many of them is contextual. For example, whether a code point is lower case or not can depend on where it is within the text you're parsing, and can depend on the language you're parsing. You're better off naming the specific code points or ranges of code points you want to match. The `boost::parser::ascii` parsers are included for ASCII-only users, and for those porting parsers from Spirit. [heading If your parser takes end-user input, give rules names that you would want an end-user to see] A typical error message produced by _Parser_ will say something like, "Expected FOO here", where FOO is some rule or parser. Give your rules names that will read well in error messages like this. For instance, the JSON examples have these rules: bp::rule const escape_seq = "\\uXXXX hexadecimal escape sequence"; bp::rule const escape_double_seq = "\\uXXXX hexadecimal escape sequence"; bp::rule const single_escaped_char = "'\"', '\\', '/', 'b', 'f', 'n', 'r', or 't'"; Some things to note: - `escape_seq` and `escape_double_seq` have the same name-string. To an end-user who is trying to figure out why their input failed to parse, it doesn't matter which kind of result a parser rule generates. They just want to know how to fix their input. For either rule, the fix is the same: put a hexadecimal escape sequence there. - `single_escaped_char` has a terrible-looking name. However, it's not really used as a name anywhere per se. In error messages, it works nicely, though. The error will be "Expected '"', '\', '/', 'b', 'f', 'n', 'r', or 't' here", which is pretty helpful. 
[heading Compile separately when you know the type of your input will not change] If your input type will not change (for instance, if you always parse from a `std::string` and nothing else), you can use separate compilation to keep from recompiling your parsing code over and over in every translation unit that includes it. For instance, in the JSON callback parser example, there is a call to `json::parse()`, which is a template. However, the function template is always instantiated with the same parameter: `json_callbacks`, a type defined in the example. It would be possible to remove the template parameter from `json::parse()`, forward declare `json_callbacks` and `json::parse()`, and define them in a different implementation file. [heading Have a simple test that you can run to find ill-formed-code-as-asserts] Most of these errors are found at parser construction time, so no actual parsing is even necessary. For instance, a test case might look like this: TEST(my_parser_tests, my_rule_test) { my_rule r; } [endsect] [endsect]