mirror of https://github.com/boostorg/parser.git synced 2026-01-24 06:02:12 +00:00
[/
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
/]
[section Introduction]
_Parser_ is a _comb_ library. That is, it consists of a set of low-level
primitive parsers, and operations that can be used to combine those parsers
into more complicated parsers.

There are primitive parsers that parse /epsilon/ (the empty string), `char`s,
`int`s, `float`s, etc.

There are operations which combine parsers to create new parsers. For
instance, the _kl_ operation takes an existing parser `p` and creates a new
parser that matches zero or more occurrences of whatever `p` matches. Both
callable objects and operator overloads are used for the combining operations.
For instance, `operator*()` is used for _kl_, and you can also write
`repeat(n)[p]` to create a parser for exactly `n` repetitions of `p`.
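
To make the idea of combining parsers concrete, here is a minimal sketch of a
Kleene star and a fixed-count repetition written in plain C++17. It is not
_Parser_'s implementation and does not use its API. The namespace `toy` and
the names `ch`, `star`, and `repeat` are made up for illustration, and a
"parser" is assumed to be nothing more than a callable that reports how many
characters it consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

namespace toy { // hypothetical names; not part of the library

// A toy parser is any callable that takes the remaining input and returns
// the number of characters consumed, or std::nullopt on failure.
using result = std::optional<std::size_t>;

// Primitive parser: matches exactly one occurrence of the character c.
inline auto ch(char c)
{
    return [c](std::string_view in) -> result {
        if (!in.empty() && in.front() == c)
            return std::size_t{1};
        return std::nullopt;
    };
}

// Kleene star: zero or more occurrences of whatever p matches.
// The resulting parser never fails; it may consume nothing.
template<typename P>
auto star(P p)
{
    return [p](std::string_view in) -> result {
        std::size_t total = 0;
        while (auto n = p(in.substr(total))) {
            if (*n == 0)
                break; // p matched without consuming input; avoid looping forever
            total += *n;
        }
        return total;
    };
}

// Exactly n repetitions of p; fails if any single repetition fails.
template<typename P>
auto repeat(std::size_t n, P p)
{
    return [n, p](std::string_view in) -> result {
        std::size_t total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            auto m = p(in.substr(total));
            if (!m)
                return std::nullopt;
            total += *m;
        }
        return total;
    };
}

inline void example()
{
    auto const many_a = star(ch('a'));
    assert(*many_a("aaab") == 3); // consumes the three leading 'a's
    assert(*many_a("b") == 0);    // zero occurrences is still a match

    assert(*repeat(2, ch('a'))("aaab") == 2);
    assert(!repeat(4, ch('a'))("aaab")); // only three 'a's are available
}

} // namespace toy
```

In _Parser_ itself, the same two combinations are spelled `*p` and
`repeat(n)[p]`, as described above.
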
_Parser_ also tries to accommodate the multiple ways that people often want to
get a parse result out of their parsing code. Some parsing may best be done
by returning an object that represents the result of the parse. Other parsing
may best be done by filling in a preexisting data structure. Yet other
parsing may best be done by parsing small sections of a large document, and
reporting the results of subparsers as they are finished, via callbacks.
_Parser_ accommodates all these ways of working, and even makes it possible to
do callback-based or non-callback-based parsing without rewriting any code
(except by changing the top-level call from _p_ to _cbp_).

All of _Parser_'s public interfaces are sentinel- and range-friendly, just like
the interfaces in `std::ranges`.

_Parser_ is Unicode-aware through and through. When you parse ranges of
`char`, _Parser_ does not assume any particular encoding _emdash_ not Unicode
or any other encoding. Parsing of inputs *other than* plain `char`s assumes
that the input is Unicode. In the Unicode-aware code paths, all parsing is
done by matching code points. This means that you can feed UTF-8 strings into
_Parser_, both as input and within your parser, and the right sort of matching
occurs. For instance, if your parser is trying to match repetitions of the
`char` `'\xcc'` (which is a lead byte from a UTF-8 sequence, and so is
malformed UTF-8 if not followed by an appropriate UTF-8 code unit), it will
*not* match the start of `"\xcc\x80"` (UTF-8 for the code point U+0300).
_Parser_ knows that the matching must be whole-code-point, and so it
interprets the `char` `'\xcc'` as the code point U+00CC.
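
The difference between matching code units and matching code points can be
sketched without _Parser_ at all. The following toy C++17 code decodes the
first code point of a UTF-8 string and compares it against a `char` that is
interpreted as a code point. The namespace `toy` and every function in it are
hypothetical, and the decoder assumes well-formed, non-empty UTF-8 input; it
illustrates the comparison only, not how _Parser_ performs it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

namespace toy { // hypothetical names; not part of the library

// Decodes the first code point of s. Precondition (assumed, not checked):
// s is non-empty and starts with a well-formed UTF-8 sequence.
inline char32_t first_code_point(std::string_view s)
{
    auto byte = [&](std::size_t i) -> char32_t {
        return static_cast<unsigned char>(s[i]);
    };
    char32_t const b0 = byte(0);
    if (b0 < 0x80) // 1 byte: 0xxxxxxx
        return b0;
    if ((b0 & 0xE0) == 0xC0) // 2 bytes: 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (byte(1) & 0x3F);
    if ((b0 & 0xF0) == 0xE0) // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((byte(1) & 0x3F) << 6) |
               (byte(2) & 0x3F);
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return ((b0 & 0x07) << 18) | ((byte(1) & 0x3F) << 12) |
           ((byte(2) & 0x3F) << 6) | (byte(3) & 0x3F);
}

// Code-unit matching: compares the first raw byte with c.
inline bool starts_with_code_unit(std::string_view s, char c)
{
    return !s.empty() && s.front() == c;
}

// Code-point matching: treats c as the code point with the same numeric
// value (so '\xcc' means U+00CC) and compares whole code points.
inline bool starts_with_code_point(std::string_view utf8, char c)
{
    char32_t const wanted = static_cast<unsigned char>(c);
    return !utf8.empty() && first_code_point(utf8) == wanted;
}

inline void example()
{
    std::string_view const combining_grave = "\xcc\x80"; // UTF-8 for U+0300
    assert(first_code_point(combining_grave) == 0x0300);

    // The first byte is 0xCC, so a byte-wise comparison would match...
    assert(starts_with_code_unit(combining_grave, '\xcc'));
    // ...but U+0300 is not U+00CC, so a code-point comparison does not.
    assert(!starts_with_code_point(combining_grave, '\xcc'));

    // "\xc3\x8c" is the UTF-8 encoding of U+00CC, so this one matches.
    assert(starts_with_code_point("\xc3\x8c", '\xcc'));
}

} // namespace toy
```
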
Error reporting is important to get right, and it is important to make errors
easy to understand, especially for end-users. _Parser_ produces runtime parse
error messages that are very similar to the diagnostics that you get when
compiling with GCC and Clang (it even supports warnings that don't fail the
parse). The exact token associated with a diagnostic can be reported to the
user, with the containing line quoted, and with a marker pointing right at the
token. _Parser_ takes care of this for you; your parser does not need to
include any special code to make this happen. Of course, you can also replace
the error handler entirely, if it doesn't fit your needs.

Debugging complex parsers can be a real nightmare. _Parser_ makes it trivial
to get a trace of your entire parse, with easy-to-read (and very verbose)
indications of where each part of the trace is within the parse, the state of
values produced by the parse, etc. Again, you don't need to write any code to
make this happen _emdash_ you just pass a parameter to _p_.

Dependencies are still a nightmare in C++, so _Parser_ can be used as a purely
standalone library, independent of Boost.
[endsect]
[section Configuration and Optional Features]
_Parser_ can be used entirely on its own. If Boost is available, extra
functionality provided by Boost is also available.

To use _Parser_ entirely on its own, simply define
`BOOST_PARSER_DISABLE_HANA_TUPLE`. This will force _std_tup_ to be the
tuple template used throughout _Parser_. The Boost.Hana tuple is much nicer,
because it has an `operator[]`; you will see this operator used throughout the
tutorial and examples.
[important _Parser_ defines a template alias _bp_tup_ that aliases to _bh_tup_
by default, and _std_tup_ when `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined.
You can future-proof your code slightly by using _bp_tup_, so that the code is
well-formed, whether or not `BOOST_PARSER_DISABLE_HANA_TUPLE` is defined. For
the same reason, _Parser_ also provides a generic _bp_get_ that works with
both kinds of tuple (since _std_tup_ has no `operator[]` and _bh_tup_ does not
work with `std::get`).]
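
As a rough illustration of why a generic accessor is useful, here is a sketch
built only on the standard library. With `std::tuple`, the element index must
be a template argument to `std::get`; wrapping the index in a constant object
lets it be passed as an ordinary function argument instead. The namespace
`toy`, the alias `index_c`, and this `get` are assumptions made for the
example; they are not _bp_get_ or any other part of _Parser_.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>

namespace toy { // hypothetical names; not part of the library

// A compile-time index carried as the type of an (empty) object.
template<std::size_t I>
using index_c = std::integral_constant<std::size_t, I>;

// Generic accessor: the index is deduced from the second argument, so call
// sites do not spell out std::get<I> themselves.
template<typename Tuple, std::size_t I>
constexpr decltype(auto) get(Tuple && t, index_c<I>)
{
    return std::get<I>(std::forward<Tuple>(t));
}

inline void example()
{
    std::tuple<int, std::string> t{42, "answer"};

    // std::tuple has no operator[]; elements are reached through std::get.
    assert(std::get<0>(t) == 42);

    // The wrapper returns a reference into the tuple, so it can also assign.
    assert(toy::get(t, index_c<1>{}) == "answer");
    toy::get(t, index_c<0>{}) = 7;
    assert(std::get<0>(t) == 7);
}

} // namespace toy
```

Code written against one accessor like this does not need to change if the
underlying tuple type changes, which is the same motivation given above for
using _bp_tup_ and _bp_get_.
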
The presence of Boost headers is detected using `__has_include()`. When they
are present, all the typical Boost conventions are used; otherwise, non-Boost
alternatives are used. This applies to the use of `BOOST_ASSERT` versus
`assert`, and printing type names with Boost.TypeIndex versus with
`std::type_info`.

_Parser_ automatically treats aggregate `struct`s as if they were tuples in
many cases. There is some metaprogramming logic that makes this work, and
this logic has a hard limit on the size of a `struct` that it can operate on.
There is a configuration macro _AGGR_SIZE_ that you can adjust if the default
value is too small. Note that raising this value by a large amount can
significantly increase compile times. Also, MSVC seems to have a hard time
with large values; I successfully set this value to `50` on MSVC, but `100`
broke the MSVC build entirely.
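
One common way to view an aggregate as a tuple in C++17 is to write one
structured-binding overload per member count, which naturally imposes an upper
bound on the size of the `struct`. The sketch below illustrates that general
technique only. It is an assumption that _Parser_ does something similar, and
`toy`, `count`, `tie_members`, and `point` are hypothetical names. For brevity,
the caller passes the member count explicitly.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <type_traits>

namespace toy { // hypothetical names; not part of the library

// The number of members, carried as a type so it can select an overload.
template<int N>
using count = std::integral_constant<int, N>;

// One overload per supported member count. Each returns a tuple of
// references to the members of x.
template<typename T>
auto tie_members(T & x, count<1>)
{
    auto & [a] = x;
    return std::tie(a);
}

template<typename T>
auto tie_members(T & x, count<2>)
{
    auto & [a, b] = x;
    return std::tie(a, b);
}

template<typename T>
auto tie_members(T & x, count<3>)
{
    auto & [a, b, c] = x;
    return std::tie(a, b, c);
}

// There is no overload for four or more members. That is the hard limit of
// this sketch; raising it means writing (or generating) more overloads,
// which costs compile time.

struct point
{
    int x;
    std::string label;
};

inline void example()
{
    point p{1, "origin"};
    auto members = tie_members(p, count<2>{});

    // The tuple holds references, so writes go through to the struct.
    std::get<0>(members) = 5;
    assert(p.x == 5);
    assert(std::get<1>(members) == "origin");
}

} // namespace toy
```
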
[endsect]
[section This Library's Relationship to Boost.Spirit]
[note If you are familiar with Spirit 2 and/or Spirit X3, you may be
interested in this section. If you are not, and you have not read the
tutorial for _Parser_ yet, very little of this will make sense.]
_Spirit_ is a library that is already in Boost, and it has been around for a
long time.

However, it does not suit user needs in some ways.

* Spirit 2 suffers from very long compile times.
* Spirit 2 has error reporting that requires a lot of user intervention to
work.
* Spirit 2 requires user intervention, including a (long) recompile, to enable
parse tracing.
* Spirit X3 has rules that do not compose well _emdash_ the attributes
produced by a rule can change depending on the context in which you use the
rule.
* Spirit X3 is missing many of the convenient interfaces to parsers that
Spirit 2 had. For instance, you cannot add parameters to a parser.
* All versions of Spirit have Unicode support, but it is quite difficult to
get working.

I wanted a library that does not suffer from any of the above limitations. It
should be noted that while Spirit X3 only has a couple of flaws in the list
above, the one related to rules is a deal-breaker. The ability to write
rules, test them in isolation, and then re-use them throughout a complex
parser is essential.

Though no version of _Spirit_ (Spirit 2 or Spirit X3) suffers from all those
limitations, there also does not exist any one version that avoids all of
them. _Parser_ does so. However, there are a lot of great ideas in _Spirit_
that have been retained in _Parser_. Both libraries:

* use the same operator overloads to combine parsers;
* use approximately the same set of directives to influence the parse
(e.g. `lexeme[]`);
* provide loosely-coupled rules that are separately compilable (at least for
Spirit X3); and
* are built around a flexible parse context object that has state added to and
removed from it during the parse (again, comparing to Spirit X3).
[endsect]