mirror of
https://github.com/boostorg/parser.git
synced 2026-01-24 06:02:12 +00:00
[/
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Introduction]

_Parser_ is a _comb_ library. That is, it consists of a set of low-level
primitive parsers, and operations that can be used to combine those parsers
into more complicated parsers.

There are primitive parsers that parse /epsilon/ (the empty string), `char`s,
`int`s, `float`s, etc.

There are operations that combine parsers to create new parsers. For
instance, the _kl_ operation takes an existing parser `p` and creates a new
parser that matches zero or more occurrences of whatever `p` matches. Both
callable objects and operator overloads are used for the combining operations.
For example, the unary `operator*()` is used for _kl_, and you can also write
`repeat(n)[p]` to create a parser for exactly `n` repetitions of `p`.

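
To make the idea of combining parsers concrete, here is a minimal, self-contained sketch of a Kleene-star combinator and a fixed-count repeat. It illustrates the general technique only and is not how _Parser_ is implemented; the names `toy_parser`, `char_p`, `kleene`, and `repeat_n` are hypothetical and are not part of the library. With these definitions, `kleene(char_p('a'))` consumes three characters of `"aaab"`, and `repeat_n(4, char_p('a'))` fails on the same input.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string_view>

// A toy parser: given the remaining input, return the number of characters
// consumed on success, or std::nullopt on failure.
using toy_parser =
    std::function<std::optional<std::size_t>(std::string_view)>;

// Primitive parser: matches exactly one occurrence of the character c.
toy_parser char_p(char c)
{
    return [c](std::string_view in) -> std::optional<std::size_t> {
        if (!in.empty() && in.front() == c)
            return std::size_t{1};
        return std::nullopt;
    };
}

// Kleene star: matches zero or more occurrences of whatever p matches.
// It never fails; it may succeed after consuming nothing.
toy_parser kleene(toy_parser p)
{
    return [p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t consumed = 0;
        while (auto n = p(in.substr(consumed))) {
            if (*n == 0)
                break; // avoid looping forever on empty matches
            consumed += *n;
        }
        return consumed;
    };
}

// Exactly `count` repetitions of p; fails if fewer are available.
toy_parser repeat_n(std::size_t count, toy_parser p)
{
    return [count, p](std::string_view in) -> std::optional<std::size_t> {
        std::size_t consumed = 0;
        for (std::size_t i = 0; i < count; ++i) {
            auto n = p(in.substr(consumed));
            if (!n)
                return std::nullopt;
            consumed += *n;
        }
        return consumed;
    };
}
```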
_Parser_ also tries to accommodate the multiple ways that people often want to
get a parse result out of their parsing code. Some parsing may best be done
by returning an object that represents the result of the parse. Other parsing
may best be done by filling in a preexisting data structure. Still other
parsing may best be done by parsing small sections of a large document, and
reporting the results of subparsers as they are finished, via callbacks.
_Parser_ accommodates all these ways of working, and even makes it possible to
do callback-based or non-callback-based parsing without rewriting any code
(other than changing the top-level call from _p_ to _cbp_).

All of _Parser_'s public interfaces are sentinel- and range-friendly, just like
the interfaces in `std::ranges`.

_Parser_ is Unicode-aware through and through. When you parse ranges of
`char`, _Parser_ does not assume any particular encoding _emdash_ neither
Unicode nor any other encoding. Parsing of inputs *other than* plain `char`s
assumes that the input is Unicode. In the Unicode-aware code paths, all
parsing is done by matching code points. This means that you can feed UTF-8
strings into _Parser_, both as input and within your parser, and the right
sort of matching occurs. For instance, if your parser is trying to match
repetitions of the `char` `'\xcc'` (which is a lead byte from a UTF-8
sequence, and so is malformed UTF-8 if not followed by an appropriate UTF-8
code unit), it will *not* match the start of `"\xcc\x80"` (UTF-8 for the code
point U+0300). _Parser_ knows that the matching must be whole-code-point, and
so it interprets the `char` `'\xcc'` as the code point U+00CC.

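
The difference between matching code units and matching whole code points can be shown with a small standalone sketch. It does not use _Parser_ at all; `decode_utf8`, `first_byte_matches`, and `first_code_point_matches` are hypothetical helpers, and the decoder assumes well-formed UTF-8 and does no validation. For the input `"\xcc\x80"`, the byte-wise comparison against `'\xcc'` succeeds, while the code-point comparison fails because U+00CC is not U+0300.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into code points.
// No validation is done; malformed input is out of scope for this sketch.
std::vector<char32_t> decode_utf8(std::string_view s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char const b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        if (b < 0x80)      { cp = b;        len = 1; } // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; } // 2-byte lead
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; } // 3-byte lead
        else               { cp = b & 0x07; len = 4; } // 4-byte lead
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Code-unit comparison: does the first byte of the input equal c?
bool first_byte_matches(std::string_view utf8, char c)
{
    return !utf8.empty() && utf8.front() == c;
}

// Code-point comparison: widen the char to a code point (so '\xcc' means
// U+00CC) and compare it with the first decoded code point of the input.
bool first_code_point_matches(std::string_view utf8, char c)
{
    auto const cps = decode_utf8(utf8);
    char32_t const wanted = static_cast<unsigned char>(c);
    return !cps.empty() && cps.front() == wanted;
}
```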
Error reporting is important to get right, and it is important to make errors
easy to understand, especially for end-users. _Parser_ produces runtime parse
error messages that are very similar to the diagnostics that you get when
compiling with GCC and Clang (it even supports warnings that don't fail the
parse). The exact token associated with a diagnostic can be reported to the
user, with the containing line quoted, and with a marker pointing right at the
token. _Parser_ takes care of this for you; your parser does not need to
include any special code to make this happen. Of course, you can also replace
the error handler entirely, if it doesn't fit your needs.

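
The exact text that _Parser_ emits is not reproduced here. As a rough illustration of the compiler-like shape described above (a location, a message, the quoted line, and a marker under the offending position), here is a standalone sketch. `format_diagnostic` and all of its parameters are hypothetical and are not part of _Parser_; the `file:line:column: error:` layout is an assumption modeled on GCC and Clang output, and tabs or multi-byte characters in the quoted line are not accounted for when placing the marker.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: build a compiler-style diagnostic for the character at
// byte offset `pos` in `input`. Lines and columns are reported 1-based.
std::string format_diagnostic(
    std::string_view filename,
    std::string_view input,
    std::size_t pos,
    std::string_view message)
{
    if (pos > input.size())
        pos = input.size();

    // Find the start of the line containing pos, counting lines on the way.
    std::size_t line = 1;
    std::size_t line_start = 0;
    for (std::size_t i = 0; i < pos; ++i) {
        if (input[i] == '\n') {
            ++line;
            line_start = i + 1;
        }
    }

    // Find the end of that line.
    std::size_t line_end = input.find('\n', line_start);
    if (line_end == std::string_view::npos)
        line_end = input.size();

    std::size_t const column = pos - line_start + 1;

    std::string out;
    out += filename;
    out += ':';
    out += std::to_string(line);
    out += ':';
    out += std::to_string(column);
    out += ": error: ";
    out += message;
    out += '\n';
    out += input.substr(line_start, line_end - line_start); // quoted line
    out += '\n';
    out.append(column - 1, ' '); // pad so the marker sits under the column
    out += "^\n";
    return out;
}
```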
Debugging complex parsers can be a real nightmare. _Parser_ makes it trivial
to get a trace of your entire parse, with easy-to-read (and very verbose)
indications of where each part of the trace is within the parse, the state of
values produced by the parse, etc. Again, you don't need to write any code to
make this happen _emdash_ you just pass a parameter to _p_.

Dependencies are still a nightmare in C++, so _Parser_ can be used as a purely
standalone library, independent of Boost.

[endsect]

[section Configuration and Optional Features]

_Parser_ can be used entirely on its own. If Boost is available, extra
functionality provided by Boost is also available.

To use _Parser_ entirely on its own, simply define
`BOOST_PARSER_DISABLE_HANA_TUPLE`. This will force _std_tup_ to be the
tuple template used throughout _Parser_. The Boost.Hana tuple is much nicer,
because it has an `operator[]`; you will see this operator used throughout the
tutorial and examples.

[important _Parser_ defines a template alias _bp_tup_ that refers to _bh_tup_
by default, and to _std_tup_ when `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined.
You can future-proof your code slightly by using _bp_tup_, so that the code is
well-formed whether or not `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined. For
the same reason, _Parser_ also provides a generic _bp_get_ that works with
both kinds of tuple (since _std_tup_ has no `operator[]` and _bh_tup_ does not
work with `std::get`).]

The presence of Boost headers is detected using `__has_include()`. When they
are present, all the typical Boost conventions are used; otherwise, non-Boost
alternatives are used. This applies to the use of `BOOST_ASSERT` versus
`assert`, and to printing type names with Boost.TypeIndex versus with
`std::type_info`.

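
A minimal sketch of this kind of header detection follows. The macro `MY_ASSERT` and the function `checked_half` are hypothetical names chosen for illustration; the macros that _Parser_ uses internally are not shown in this section and may differ. `BOOST_ASSERT` and `<boost/assert.hpp>` are the real Boost.Assert names. The sketch compiles whether or not Boost is installed.

```cpp
#include <cassert>

// __has_include is standard since C++17; guard it for older preprocessors.
#if defined(__has_include)
#  if __has_include(<boost/assert.hpp>)
#    include <boost/assert.hpp>
#    define MY_ASSERT(cond) BOOST_ASSERT(cond)
#  endif
#endif

// Fall back to the standard assert when Boost was not found.
#ifndef MY_ASSERT
#  define MY_ASSERT(cond) assert(cond)
#endif

// Example use: the precondition check goes through whichever assertion
// facility was selected above.
int checked_half(int n)
{
    MY_ASSERT(n % 2 == 0);
    return n / 2;
}
```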
_Parser_ automatically treats aggregate `struct`s as if they were tuples in
many cases. There is some metaprogramming logic that makes this work, and
this logic has a hard limit on the size of a `struct` that it can operate on.
There is a configuration macro _AGGR_SIZE_ that you can adjust if the default
value is too small. Note that raising this value substantially can
noticeably increase compile times. Also, MSVC seems to have a hard time
with large values; I successfully set this value to `50` on MSVC, but `100`
broke the MSVC build entirely.

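
One common way to view an aggregate as a tuple in C++17 is to count its members with a brace-initialization probe and then use structured bindings, with one hand-written case per member count; that per-count case list is why such schemes have a hard upper limit. The sketch below shows that general technique under stated assumptions and is not a copy of _Parser_'s code: `any_type`, `brace_constructible`, `field_count`, `tie_as_tuple`, `max_fields` (set to 3 here), and the example type `point` are all hypothetical. Calling `tie_as_tuple(p)` on a `point` yields a `std::tuple` of references to its three members.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Converts to anything; declared only, since it is used unevaluated.
struct any_type {
    template<typename T>
    operator T() const;
};

// True if T can be brace-initialized from sizeof...(I) initializers.
template<typename T, typename Seq, typename = void>
struct brace_constructible : std::false_type {};

template<typename T, std::size_t... I>
struct brace_constructible<
    T, std::index_sequence<I...>,
    std::void_t<decltype(T{(void(I), any_type{})...})>> : std::true_type {};

constexpr std::size_t max_fields = 3; // hard limit of this sketch

// The member count is the largest N (up to max_fields) that still works.
template<typename T, std::size_t N = max_fields>
constexpr std::size_t field_count()
{
    if constexpr (N == 0)
        return 0;
    else if constexpr (
        brace_constructible<T, std::make_index_sequence<N>>::value)
        return N;
    else
        return field_count<T, N - 1>();
}

// View an aggregate as a tuple of references, one case per member count.
template<typename T>
auto tie_as_tuple(T& s)
{
    static_assert(std::is_aggregate_v<T>, "T must be an aggregate");
    static_assert(
        !brace_constructible<
            T, std::make_index_sequence<max_fields + 1>>::value,
        "T has more members than this sketch supports");
    constexpr std::size_t n = field_count<T>();
    if constexpr (n == 1) { auto& [a] = s; return std::tie(a); }
    else if constexpr (n == 2) { auto& [a, b] = s; return std::tie(a, b); }
    else if constexpr (n == 3) { auto& [a, b, c] = s; return std::tie(a, b, c); }
    else return std::tuple<>{};
}

// Example aggregate used to exercise the sketch.
struct point { int x; double y; char tag; };
```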
[endsect]

[section This Library's Relationship to Boost.Spirit]

[note If you are familiar with Spirit 2 and/or Spirit X3, you may be
interested in this section. If you are not, and you have not read the
tutorial for _Parser_ yet, very little of this will make sense.]

_Spirit_ is a library that is already in Boost, and it has been around for a
long time.

However, it falls short of user needs in several ways:

* Spirit 2 suffers from very long compile times.

* Spirit 2 has error reporting that requires a lot of user intervention to
  work.

* Spirit 2 requires user intervention, including a (long) recompile, to enable
  parse tracing.

* Spirit X3 has rules that do not compose well _emdash_ the attributes
  produced by a rule can change depending on the context in which you use the
  rule.

* Spirit X3 is missing many of the convenient interfaces to parsers that
  Spirit 2 had. For instance, you cannot add parameters to a parser.

* All versions of Spirit have Unicode support, but it is quite difficult to
  get working.

I wanted a library that does not suffer from any of the above limitations.
Note that although Spirit X3 has only a couple of the flaws listed above, the
one related to rules is a deal-breaker. The ability to write rules, test them
in isolation, and then reuse them throughout a complex parser is essential.

Though no version of _Spirit_ (Spirit 2 or Spirit X3) suffers from all of
those limitations, no single version avoids all of them either. _Parser_
does. However, there are a lot of great ideas in _Spirit_ that have been
retained in _Parser_. Both libraries:

* use the same operator overloads to combine parsers;

* use approximately the same set of directives to influence the parse
  (e.g. `lexeme[]`);

* provide loosely-coupled rules that are separately compilable (at least for
  Spirit X3); and

* are built around a flexible parse context object that has state added to and
  removed from it during the parse (again, comparing to Spirit X3).

[endsect]