2
0
mirror of https://github.com/boostorg/parser.git synced 2026-01-19 16:32:13 +00:00
Files
parser/doc/intro.qbk
2024-03-26 04:14:17 -05:00

310 lines
14 KiB
Plaintext

[/
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
/]
[section Introduction]
_Parser_ is a _comb_ library. That is, it consists of a set of low-level
primitive parsers, and operations that can be used to combine those parsers
into more complicated parsers.
There are primitive parsers that parse /epsilon/ (the empty string), `char`s,
`int`s, `float`s, etc.
There are operations which combine parsers to create new parsers. For
instance, the _kl_ operation takes an existing parser `p` and creates a new
parser that matches zero or more occurrences of whatever `p` matches. Both
callable objects and operator overloads are used for the combining operations.
For instance, `operator*()` is used for _kl_, and you can also write
`repeat(n)[p]` to create a parser for exactly `n` repetitions of `p`.
_Parser_ also tries to accommodate the multiple ways that people often want to
get a parse result out of their parsing code. Some parsing may best be done
by returning an object that represents the result of the parse. Other parsing
may best be done by filling in a preexisting data structure. Yet other
parsing may best be done by parsing small sections of a large document, and
reporting the results of subparsers as they are finished, via callbacks.
_Parser_ accommodates all these ways of working, and even makes it possible to
do callback-based or non-callback-based parsing without rewriting any code
(except by changing the top-level call from _p_ to _cbp_).
All of _Parser_'s public interfaces are sentinel- and range-friendly, just like
the interfaces in `std::ranges`.
_Parser_ is Unicode-aware through and through. When you parse ranges of
`char`, _Parser_ does not assume any particular encoding _emdash_ not Unicode
or any other encoding. Parsing of inputs *other than* plain `char`s assumes
that the input is Unicode. In the Unicode-aware code paths, all parsing is
done by matching code points. This means that you can feed UTF-8 strings into
_Parser_, both as input and within your parser, and the right sort of matching
occurs. For instance, if your parser is trying to match repetitions of the
`char` `'\xcc'` (which is a lead byte from a UTF-8 sequence, and so is
malformed UTF-8 if not followed by an appropriate UTF-8 code unit), it will
*not* match the start of `"\xcc\x80"` (UTF-8 for the code point U+0300).
_Parser_ knows that the matching must be whole-code-point, and so it
interprets the `char` `'\xcc'` as the code point U+00CC.
Error reporting is important to get right, and it is important to make errors
easy to understand, especially for end-users. _Parser_ produces runtime parse
error messages that are very similar to the diagnostics that you get when
compiling with GCC and Clang (it even supports warnings that don't fail the
parse). The exact token associated with a diagnostic can be reported to the
user, with the containing line quoted, and with a marker pointing right at the
token. _Parser_ takes care of this for you; your parser does not need to
include any special code to make this happen. Of course, you can also replace
the error handler entirely, if it doesn't fit your needs.
Debugging complex parsers can be a real nightmare. _Parser_ makes it trivial
to get a trace of your entire parse, with easy-to-read (and very verbose)
indications of where each part of the trace is within the parse, the state of
values produced by the parse, etc. Again, you don't need to write any code to
make this happen _emdash_ you just pass a parameter to _p_.
Dependencies are still a nightmare in C++, so _Parser_ can be used as a purely
standalone library, independent of Boost.
[endsect]
[section Configuration and Optional Features]
_Parser_ can be used entirely on its own. If Boost is available, extra
functionality provided by Boost is also available.
By default, _Parser_ is usable entirely on its own. The only explicit opt-in
use of Boost is the use of Boost.Hana. If you turn on the use of Hana, the
tuple type used throughout _Parser_ will be `boost::hana::tuple` instead of
`std::tuple`. To enable this, simply define `BOOST_PARSER_USE_HANA_TUPLE`.
The Boost.Hana tuple is much nicer, because it has an `operator[]`, and a
whole lot of very useful algorithms; you will see this operator used
throughout the tutorial and examples. I encourage you to use the Hana tuple,
but I realize that it is an often-unfamiliar replacement for `std::tuple`,
which is a C++ vocabulary template, and so that is the default.
[important _Parser_ defines a template alias _bp_tup_ that aliases to _bh_tup_
by default, and _std_tup_ when `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined.
You can future-proof your code slightly by using _bp_tup_, so that the code is
well-formed, whether or not `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined. For
the same reason, _Parser_ also provides a generic _bp_get_ that works with
both kinds of tuple (since _std_tup_ has no `operator[]` and _bh_tup_ does not
work with `std::get`).]
The presence of Boost headers is detected using `__has_include()`. When it is
present, all the typical Boost conventions are used; otherwise, non-Boost
alternatives are used. This applies to the use of `BOOST_ASSERT` versus
`assert`, and printing typenames with Boost.TypeIndex versus with
`std::typeinfo`.
[note If you want to disable the use of the C macro `assert`, you define
`BOOST_DISABLE_ASSERTS`. This is true whether `BOOST_ASSERT` is available or
not. ]
[important _Parser_ uses inline namespaces around definitions of all functions
and types that use the optional Boost features; the name of the inline
namespace varies depending on whether the Boost implementation is used. So if
Boost.TypeIndex is available to one translation unit, but another TU must use
`std::typeinfo`, there are no ODR violations. The exception to this is the
use of `BOOST_ASSERT`/`assert`; assert macros are inherently ODR traps. ]
_Parser_ automatically treats aggregate `struct`s as if they were tuples in
many cases. There is some metaprogramming logic that makes this work, and
this logic has a hard limit on the size of a `struct` that it can operate on.
There is a configuration macro _AGGR_SIZE_ that you can adjust if the default
value is too small. Note that turning this value up significantly can
significantly increase compile times. Also, MSVC seems to have a hard time
with large values; I successfully set this value to `50` on MSVC, but `100`
broke the MSVC build entirely.
_Parser_ uses `std::optional` and `std::variant` internally. There is no way
to change this. However, when _Parser_ generates values as a result of the
parse (see _attr_gen_), it can place them into other implementations of
optional and/or variant, if you tell it to do so. You tell it which templates
are usable as an optional or variant by specializing the associated variable
template. For instance, here is how you would tell _Parser_ that
`boost::optional` is an optional-type:
template<typename T>
constexpr bool boost::parser::enable_optional<boost::optional<T>> = true;
Here's how you would do the same thing for `boost::variant2::variant`:
template<typename... Ts>
constexpr bool boost::parser::enable_variant<boost::variant2::variant<Ts...>> = true;
The requirements on a template used as an optional are pretty simple, since
_Parser_ does almost nothing but assign to them. For a type `O` to be a
usable optional, you must be able to assign to `O`, and `O` must have an
`operator*` that returns the stored value, or a (possibly cv-qualified)
reference to the stored value.
For variants, the requirement is even simpler; the variant type only needs to
be assignable.
[note The only thing affected by `enable_variant` is printing. If your
variant template can be printed with just `std::cout << v` (where `v` is a
variant, obviously), then you don't need to define `enable_variant` for your
variant template.]
_Parser_ uses `std::ranges::subrange` extensively. However, there is no C++17
equivalent. So, there is a `boost::parser::subrange` for C++17 builds. To
switch between these transparently in the code, while keeping CTAD
operational, _Parser_ defines _SUBRNG_. This is the name of the template, so
if you use it in your own code you would use it like `_SUBRNG_<I>` to
instantiate it.
_Parser_ allows you to debug your parsers by passing `trace::on` to _p_.
Sometimes, your run environment does not include a terminal. If you're
running _Parser_ code in the Visual Studio debugger, you can see this trace
output in the Visual Studio debugger output panel rather than in a terminal by
defining `BOOST_PARSER_TRACE_TO_VS_OUTPUT`.
[endsect]
[section This Library's Relationship to Boost.Spirit]
[note If you are familiar with Spirit 2 and/or Spirit X3, you may be
interested in this section. If you are not, and you have not read the
tutorial for _Parser_ yet, very little of this will make sense.]
_Spirit_ is a library that is already in Boost, and it has been around for a
long time.
However, it does not suit user needs in some ways.
* Spirit 2 suffers from very long compile times.
* Spirit 2 has error reporting that requires a lot of user intervention to
work.
* Spirit 2 requires user intervention, including a (long) recompile, to enable
parse tracing.
* Spirit X3 has rules that do not compose well _emdash_ the attributes
produced by a rule can change depending on the context in which you use the
rule.
* Spirit X3 is missing many of the convenient interfaces to parsers that
Spirit 2 had. For instance, you cannot add parameters to a parser (see
description of _locals_ in _more_about_rules_).
* All versions of Spirit have Unicode support, but it is quite difficult to
get working.
I wanted a library that does not suffer from any of the above limitations. It
should be noted that while Spirit X3 only has a couple of flaws in the list
above, the one related to rules is a deal-breaker. The ability to write
rules, test them in isolation, and then re-use them throughout a complex
parser is essential.
Though no version of _Spirit_ (Spirit 2 or Spirit X3) suffers from all those
limitations, there also does not exist any one version that avoids all of
them. _Parser_ does so. However, there are a lot of great ideas in _Spirit_
that have been retained in _Parser_. Both libraries:
* use the same operator overloads to combine parsers;
* use approximately the same set of directives to influence the parse
(e.g. `lexeme[]`);
* are built around a flexible parse context object that has state added to and
removed from it during the parse (again, comparing to Spirit X3).
[heading The Spirit X3 rule problem]
Some readers have wanted a concrete example of my claim that Spirit X3's rules
do not compose well. Consider this program.
#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <set>
#include <string>
#include <vector>
namespace x3 = boost::spirit::x3;
using ints_type = x3::rule<class ints, std::vector<int>>;
BOOST_SPIRIT_DECLARE(ints_type);
x3::rule<class ints, std::vector<int>> ints = "ints";
constexpr auto ints_def = x3::int_ % ',';
BOOST_SPIRIT_DEFINE(ints);
#define FIXED_ATTRIBUTE 0
int main()
{
std::string input = "43, 42";
auto first = input.begin();
auto const last = input.end();
#if FIXED_ATTRIBUTE
std::vector<int> result;
#else
std::set<int> result;
#endif
bool success = x3::phrase_parse(first, last, ints, x3::space, result);
if (success) {
// We want this to print "43 42\n".
for (auto x : result) {
std::cout << x << ' ';
}
std::cout << "\n";
}
return 0;
}
Defining `FIXED_ATTRIBUTE` to be `1` leads to a well-formed program that
prints `"42 43\n"` instead of the desired result. The problem here is that if
you feed an attribute out-param to `x3::phrase_parse()`, you get the
loose-match semantics that Spirit X3 and _Parser_ both do. This is a problem,
because the user explicitly asserted that the type of the `ints` rule's
attribute should be `std::vector<int>`. In my opinion, this code should be
ill-formed with `FIXED_ATTRIBUTE == 1`. To make it well-formed again, the
user could use `ints_def` directly, since it does not specify an attribute
type.
When the user explicitly states that a type is some fixed `T`, a library
should not ignore that. As a user of X3, I was bitten by this in such a way
that I considered X3 to be a nonviable option for my uses. I ran into a
problem that resulted from X3's ignoring one or more of my rules' attributes
so that it made the parse produce the wrong result, and I could see no way to
fix it.
When a library provides wider use cases via genericity, we generally consider
this a good thing. If it is too loose in its semantics, we generally say that
it is type-unsafe. Using _rs_ to nail down type flexibility is one way
_Parser_ tries to enable genericity where it is desired, and let the user turn
it off where it is not.
[endsect]
[section Cheat Sheet]
Here are all the tables containing the various _Parser_ parsers, examples,
etc., all in one place. These are repeated elsewhere in different sections of
the tutorial.
[heading The parsers]
[table_parsers_and_their_semantics]
[heading Operators defined on parsers]
[table_combining_operations]
[heading Attribute generation for certain parsers]
[table_attribute_generation]
[heading Attributes for operations on parsers]
[table_attribute_combinations]
[heading More attributes for operations on parsers]
[table_seq_or_attribute_combinations]
[endsect]