mirror of https://github.com/boostorg/parser.git synced 2026-01-23 17:52:15 +00:00
parser/doc/html/boost_parser/tutorial/unicode_support.html
2024-10-03 20:09:21 -05:00

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode Support</title>
<link rel="stylesheet" href="../../boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../../index.html" title="Chapter 1. Boost.Parser">
<link rel="up" href="../tutorial.html" title="Tutorial">
<link rel="prev" href="algorithms_and_views_that_use_parsers.html" title="Algorithms and Views That Use Parsers">
<link rel="next" href="callback_parsing.html" title="Callback Parsing">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_parser.tutorial.unicode_support"></a><a class="link" href="unicode_support.html" title="Unicode Support">Unicode Support</a>
</h3></div></div></div>
<p>
Boost.Parser was designed from the start to be Unicode friendly. There are
numerous references to the "Unicode code path" and the "non-Unicode
code path" in the Boost.Parser documentation. Though there are in fact
two code paths for Unicode and non-Unicode parsing, the code is not very
different in the two code paths, as they are written generically. The only
difference is that the Unicode code path parses the input as a range of code
points, and the non-Unicode path does not. In effect, this means that, in
the Unicode code path, when you call <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code> for some input range <code class="computeroutput"><span class="identifier">r</span></code>
and some parser <code class="computeroutput"><span class="identifier">p</span></code>, the parse
happens as if you called <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span> <span class="special">|</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code>
instead. (Of course, it does not matter if <code class="computeroutput"><span class="identifier">r</span></code>
is a proper range, or an iterator/sentinel pair; those both work fine with
<code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span></code>.)
</p>
<p>
Matching "characters" within Boost.Parser's parsers is assumed
to be a code point match. In the Unicode path there is a code point from
the input that is matched to each <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code> parser. In the non-Unicode
path, the encoding is unknown, and so each element of the input is considered
to be a whole "character" in the input encoding, analogous to a
code point. From this point on, I will therefore refer to a single element
of the input exclusively as a code point.
</p>
<p>
So, let's say we write this parser:
</p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char8_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
For any <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
parser that should match a value or values, the type of the value to match
is retained. So <code class="computeroutput"><span class="identifier">char8_parser</span></code>
contains a <code class="computeroutput"><span class="keyword">char</span></code> that it will
use for matching. If we had written:
</p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char32_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="identifier">U</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
<code class="computeroutput"><span class="identifier">char32_parser</span></code> would instead
contain a <code class="computeroutput"><span class="keyword">char32_t</span></code> that it would
use for matching.
</p>
<p>
So, at any point during the parse, if <code class="computeroutput"><span class="identifier">char8_parser</span></code>
were being used to match a code point <code class="computeroutput"><span class="identifier">next_cp</span></code>
from the input, we would see the moral equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
<span class="special">==</span> <span class="char">'\xcc'</span></code>,
and if <code class="computeroutput"><span class="identifier">char32_parser</span></code> were
being used to match <code class="computeroutput"><span class="identifier">next_cp</span></code>,
we'd see the equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
<span class="special">==</span> <span class="identifier">U</span><span class="char">'\xcc'</span></code>. The take-away here is that you can write
<code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
parsers that match specific values without worrying whether the input is
Unicode, because, under the covers, what takes place is a simple comparison
of two integral values.
</p>
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
Boost.Parser actually promotes any two values to a common type using <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">common_type</span></code> before comparing them. This
almost always works because the input and any parameter passed to <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
must be character types.
</p></td></tr>
</table></div>
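<p>
To make the promotion step concrete, here is a minimal sketch that uses only
the standard library. The helper <code class="computeroutput">same_value</code> is a hypothetical
name, not a Boost.Parser function. It only illustrates comparing two character
values after converting both to their <code class="computeroutput">std::common_type</code>, and
it makes no claim about how Boost.Parser implements the comparison internally.
</p>
<pre class="programlisting">#include &lt;type_traits&gt;

// Hypothetical helper (not a Boost.Parser function): converts both values
// to their std::common_type, then compares them as plain integers.
template&lt;typename T, typename U&gt;
constexpr bool same_value(T lhs, U rhs)
{
    using common = std::common_type_t&lt;T, U&gt;;
    return static_cast&lt;common&gt;(lhs) == static_cast&lt;common&gt;(rhs);
}

// The character types differ, but only the integral values are compared.
static_assert(same_value(U'a', U'a'));       // char32_t vs. char32_t
static_assert(same_value('a', U'a'));        // char vs. char32_t
static_assert(same_value(u'\xcc', U'\xcc')); // char16_t vs. char32_t
static_assert(!same_value('a', U'b'));
</pre>
<p>
Values of plain <code class="computeroutput">char</code> above <code class="computeroutput">0x7f</code> are
deliberately left out of these checks: whether <code class="computeroutput">char</code> is signed
is implementation-defined, so the promoted value of a literal such as
<code class="computeroutput">'\xcc'</code> can differ between platforms in a naive sketch like
this one.
</p>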
<p>
Since matches are always done at a code point level (remember, a "code
point" in the non-Unicode path is assumed to be a single <code class="computeroutput"><span class="keyword">char</span></code>), you get different results trying to
match UTF-8 input in the Unicode and non-Unicode code paths:
</p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>
<span class="special">{</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="special">(</span><span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*)</span><span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
<span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>
<span class="comment">// Since we've done nothing to indicate that we want to do Unicode</span>
<span class="comment">// parsing, and we've passed a range of char to parse(), this will do</span>
<span class="comment">// non-Unicode parsing.</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>
<span class="comment">// Finds one match of the *char* 0xcc, because the value in the parser</span>
<span class="comment">// (0xcc) was matched against the two code points in the input (0xcc and</span>
<span class="comment">// 0x80), and the first one was a match.</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">"\xcc"</span><span class="special">);</span>
<span class="special">}</span>
<span class="special">{</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
<span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>
<span class="comment">// Since the input is a range of char8_t, this will do Unicode</span>
<span class="comment">// parsing. The same thing would have happened if we passed</span>
<span class="comment">// str | boost::parser::as_utf32 or even str | boost::parser::as_utf8.</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>
<span class="comment">// Finds zero matches of the *code point* 0xcc, because the value in</span>
<span class="comment">// the parser (0xcc) was matched against the single code point in the</span>
<span class="comment">// input, 0x0300.</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">""</span><span class="special">);</span>
<span class="special">}</span>
</pre>
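<p>
The two results above come down to what counts as one element of the input.
The following sketch decodes UTF-8 by hand using only the standard library, so
the difference can be inspected without a parser. <code class="computeroutput">decode_utf8</code>
is a hypothetical helper, not part of Boost.Parser, and it assumes well-formed
input; it is meant as an illustration of the idea, not as a description of the
library's transcoding code.
</p>
<pre class="programlisting">#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;string_view&gt;

// Hypothetical helper (not part of Boost.Parser): decodes well-formed UTF-8
// into code points. Malformed input is not diagnosed.
std::u32string decode_utf8(std::string_view in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i &lt; in.size()) {
        unsigned char const lead = static_cast&lt;unsigned char&gt;(in[i]);
        // The lead byte says how many continuation bytes follow.
        std::size_t const extra =
            lead &lt; 0x80 ? 0 : lead &lt; 0xe0 ? 1 : lead &lt; 0xf0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : lead &amp; (0x3f &gt;&gt; extra);
        for (std::size_t k = 1; k &lt;= extra &amp;&amp; i + k &lt; in.size(); ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp &lt;&lt; 6) | (static_cast&lt;unsigned char&gt;(in[i + k]) &amp; 0x3f);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
</pre>
<p>
With this helper, <code class="computeroutput">decode_utf8("\xcc\x80")</code> yields a single
code point, <code class="computeroutput">0x0300</code>, whereas the undecoded
<code class="computeroutput">std::string</code> holds the two <code class="computeroutput">char</code>s
<code class="computeroutput">0xcc</code> and <code class="computeroutput">0x80</code>. A value of
<code class="computeroutput">0xcc</code> can therefore only match in the undecoded sequence.
</p>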
<h5>
<a name="boost_parser.tutorial.unicode_support.h0"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support.implicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.implicit_transcoding">Implicit
transcoding</a>
</h5>
<p>
Additionally, it is expected that most programs will use UTF-8 for the encoding
of Unicode strings. Boost.Parser is written with this typical case in mind.
This means that if you are parsing 32-bit code points (as you always are
in the Unicode path), and you want to catch the result in a container <code class="computeroutput"><span class="identifier">C</span></code> of <code class="computeroutput"><span class="keyword">char</span></code>
or <code class="computeroutput"><span class="identifier">char8_t</span></code> values, Boost.Parser
will silently transcode from UTF-32 to UTF-8 and write the attribute into
<code class="computeroutput"><span class="identifier">C</span></code>. This means that <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>,
<code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span></code>, etc. are fine to use as attribute
out-parameters for <code class="computeroutput"><span class="special">*</span><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>, and the result
will be UTF-8.
</p>
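<p>
For readers who want to see what a UTF-32 to UTF-8 transcoding step computes,
here is a small standard-library-only sketch. The names
<code class="computeroutput">append_utf8</code> and <code class="computeroutput">to_utf8</code> are
hypothetical and are not part of Boost.Parser; the sketch assumes every input
value is a valid Unicode scalar value and does not reflect the library's actual
implementation.
</p>
<pre class="programlisting">#include &lt;string&gt;
#include &lt;string_view&gt;

// Hypothetical helper (not part of Boost.Parser): appends the UTF-8 encoding
// of one code point to out. Assumes cp is a valid Unicode scalar value.
void append_utf8(std::string &amp; out, char32_t cp)
{
    if (cp &lt; 0x80) {
        out.push_back(static_cast&lt;char&gt;(cp));
    } else if (cp &lt; 0x800) {
        out.push_back(static_cast&lt;char&gt;(0xc0 | (cp &gt;&gt; 6)));
        out.push_back(static_cast&lt;char&gt;(0x80 | (cp &amp; 0x3f)));
    } else if (cp &lt; 0x10000) {
        out.push_back(static_cast&lt;char&gt;(0xe0 | (cp &gt;&gt; 12)));
        out.push_back(static_cast&lt;char&gt;(0x80 | ((cp &gt;&gt; 6) &amp; 0x3f)));
        out.push_back(static_cast&lt;char&gt;(0x80 | (cp &amp; 0x3f)));
    } else {
        out.push_back(static_cast&lt;char&gt;(0xf0 | (cp &gt;&gt; 18)));
        out.push_back(static_cast&lt;char&gt;(0x80 | ((cp &gt;&gt; 12) &amp; 0x3f)));
        out.push_back(static_cast&lt;char&gt;(0x80 | ((cp &gt;&gt; 6) &amp; 0x3f)));
        out.push_back(static_cast&lt;char&gt;(0x80 | (cp &amp; 0x3f)));
    }
}

// Encodes a whole sequence of code points into a std::string of UTF-8.
std::string to_utf8(std::u32string_view in)
{
    std::string out;
    for (char32_t cp : in) {
        append_utf8(out, cp);
    }
    return out;
}
</pre>
<p>
For example, the single code point U+00F6 becomes the two
<code class="computeroutput">char</code>s <code class="computeroutput">0xc3</code> <code class="computeroutput">0xb6</code>,
which is the form a <code class="computeroutput">std::string</code> attribute would hold.
</p>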
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
UTF-16 strings as attributes are not supported directly. If you want to
use UTF-16 strings as attributes, you may need to do so by transcoding
a UTF-8 or UTF-32 attribute to UTF-16 within a semantic action. You can
do this by using <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf16</span></code>.
</p></td></tr>
</table></div>
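<p>
As a rough illustration of what transcoding a UTF-32 attribute to UTF-16
involves, here is a sketch that uses only the standard library.
<code class="computeroutput">to_utf16</code> is a hypothetical helper, not a Boost.Parser
function, and it assumes valid Unicode scalar values as input. In real code
you would more likely reach for the adaptor mentioned in the note.
</p>
<pre class="programlisting">#include &lt;string&gt;
#include &lt;string_view&gt;

// Hypothetical helper (not part of Boost.Parser): encodes code points as
// UTF-16. Assumes every element of in is a valid Unicode scalar value.
std::u16string to_utf16(std::u32string_view in)
{
    std::u16string out;
    for (char32_t cp : in) {
        if (cp &lt; 0x10000) {
            // Code points in the Basic Multilingual Plane take one code unit.
            out.push_back(static_cast&lt;char16_t&gt;(cp));
        } else {
            // Everything else is split into a high and a low surrogate.
            char32_t const v = cp - 0x10000;
            out.push_back(static_cast&lt;char16_t&gt;(0xd800 + (v &gt;&gt; 10)));
            out.push_back(static_cast&lt;char16_t&gt;(0xdc00 + (v &amp; 0x3ff)));
        }
    }
    return out;
}
</pre>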
<p>
The treatment of strings as UTF-8 is nearly ubiquitous within Boost.Parser.
For instance, though the entire interface of <code class="computeroutput"><a class="link" href="../../boost/parser/symbols.html" title="Struct template symbols">symbols</a></code> uses <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>
or <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string_view</span></code>, UTF-32 comparisons are used
internally.
</p>
<h5>
<a name="boost_parser.tutorial.unicode_support.h1"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support.explicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.explicit_transcoding">Explicit
transcoding</a>
</h5>
<p>
I mentioned above that the use of <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">utf</span><span class="special">*</span><span class="identifier">_view</span></code> as the range to parse opts you in
to Unicode parsing. Here's a bit more about these views and how best to use
them.
</p>
<p>
If you want to do Unicode parsing, you're always going to be comparing code
points at each step of the parse. As such, you're going to implicitly convert
any parse input to UTF-32, if needed. This is what all the parse API functions
do internally.
</p>
<p>
However, there are times when you have parse input that is a sequence of
UTF-8-encoded <code class="computeroutput"><span class="keyword">char</span></code>s, and you
want to do Unicode-aware parsing. As mentioned previously, Boost.Parser has
a special case for <code class="computeroutput"><span class="keyword">char</span></code> inputs,
and it will <span class="bold"><strong>not</strong></span> assume that <code class="computeroutput"><span class="keyword">char</span></code> sequences are UTF-8. If you want to tell
the parse API to do Unicode processing on them anyway, you can use the <code class="computeroutput"><span class="identifier">as_utf32</span></code> range adaptor. (Note that you
can use any of the <code class="computeroutput"><span class="identifier">as_utf</span><span class="special">*</span></code> adaptors; the semantics will be the same as
those shown below.)
</p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>
<span class="keyword">auto</span> <span class="keyword">const</span> <span class="identifier">p</span> <span class="special">=</span> <span class="char">'"'</span> <span class="special">&gt;&gt;</span> <span class="special">*(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span> <span class="special">-</span> <span class="char">'"'</span> <span class="special">-</span> <span class="number">0xb6</span><span class="special">)</span> <span class="special">&gt;&gt;</span> <span class="char">'"'</span><span class="special">;</span>
<span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*</span> <span class="identifier">str</span> <span class="special">=</span> <span class="string">"\"two wörds\""</span><span class="special">;</span> <span class="comment">// ö is two code units, 0xc3 0xb6</span>
<span class="keyword">auto</span> <span class="identifier">result_1</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span> <span class="comment">// Treat each char as a code point (typically ASCII).</span>
<span class="identifier">assert</span><span class="special">(!</span><span class="identifier">result_1</span><span class="special">);</span>
<span class="keyword">auto</span> <span class="identifier">result_2</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span> <span class="special">|</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span> <span class="comment">// Unicode-aware parsing on code points.</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">result_2</span><span class="special">);</span>
</pre>
<p>
The first call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
treats each <code class="computeroutput"><span class="keyword">char</span></code> as a code point,
and since <code class="computeroutput"><span class="string">"ö"</span></code> is the
pair of code units <code class="computeroutput"><span class="number">0xc3</span></code> <code class="computeroutput"><span class="number">0xb6</span></code>, the parse matches the second code unit
against the <code class="computeroutput"><span class="special">-</span> <span class="number">0xb6</span></code>
part of the parser above, causing the parse to fail. This happens because
each code unit/<code class="computeroutput"><span class="keyword">char</span></code> in <code class="computeroutput"><span class="identifier">str</span></code> is treated as an independent code point.
</p>
<p>
The second call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
succeeds because, when the parse gets to the code point for <code class="computeroutput"><span class="char">'ö'</span></code>, it is <code class="computeroutput"><span class="number">0xf6</span></code>
(U+00F6), which does not match the <code class="computeroutput"><span class="special">-</span>
<span class="number">0xb6</span></code> part of the parser.
</p>
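<p>
The difference between the two calls can be reproduced without a parser. The
sketch below uses only the standard library and writes the same text out twice
by hand: once as UTF-8 code units and once as code points. The names
<code class="computeroutput">utf8_units</code>, <code class="computeroutput">code_points</code>,
<code class="computeroutput">has_unit</code>, and <code class="computeroutput">has_code_point</code> are
hypothetical and exist only for this illustration.
</p>
<pre class="programlisting">#include &lt;string_view&gt;

// The same text, once as UTF-8 code units and once as code points.
inline constexpr std::string_view utf8_units = "\"two w\xc3\xb6rds\"";
inline constexpr std::u32string_view code_points = U"\"two w\u00f6rds\"";

// True if any single char of s has the given value.
bool has_unit(std::string_view s, unsigned char unit)
{
    for (char c : s) {
        // Compare as unsigned char so values above 0x7f are not sign-extended.
        if (static_cast&lt;unsigned char&gt;(c) == unit) {
            return true;
        }
    }
    return false;
}

// True if any code point of s has the given value.
bool has_code_point(std::u32string_view s, char32_t cp)
{
    return s.find(cp) != std::u32string_view::npos;
}
</pre>
<p>
Here <code class="computeroutput">has_unit(utf8_units, 0xb6)</code> is true, because the second
code unit of <code class="computeroutput">"ö"</code> is <code class="computeroutput">0xb6</code>, while
<code class="computeroutput">has_code_point(code_points, 0xb6)</code> is false, because the only
code point produced by <code class="computeroutput">"ö"</code> is <code class="computeroutput">0xf6</code>.
</p>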
<p>
The other adaptors <code class="computeroutput"><span class="identifier">as_utf8</span></code>
and <code class="computeroutput"><span class="identifier">as_utf16</span></code> are also provided
for completeness. Each can transcode any sequence of character types.
</p>
<div class="important"><table border="0" summary="Important">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Important]" src="../../images/important.png"></td>
<th align="left">Important</th>
</tr>
<tr><td align="left" valign="top"><p>
The <code class="computeroutput"><span class="identifier">as_utfN</span></code> adaptors are
optional, so they don't come with <code class="computeroutput"><span class="identifier">parser</span><span class="special">.</span><span class="identifier">hpp</span></code>.
To get access to them, <code class="computeroutput"><span class="preprocessor">#include</span>
<span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">parser</span><span class="special">/</span><span class="identifier">transcode_view</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>.
</p></td></tr>
</table></div>
<h5>
<a name="boost_parser.tutorial.unicode_support.h2"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support._lack_of__normalization"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support._lack_of__normalization">(Lack
of) normalization</a>
</h5>
<p>
One thing that Boost.Parser does not handle for you is normalization; Boost.Parser
is completely normalization-agnostic. Since all the parsers do their matching
using equality comparisons of code points, you should make sure that your
parsed range and your parsers all use the same normalization form.
</p>
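<p>
To see why this matters, consider the letter "ö", which Unicode can represent
either as one precomposed code point or as a base letter followed by a
combining mark. The sketch below uses only the standard library; the names
<code class="computeroutput">composed</code>, <code class="computeroutput">decomposed</code>, and
<code class="computeroutput">same_code_points</code> are hypothetical and chosen for this
illustration.
</p>
<pre class="programlisting">#include &lt;string_view&gt;

// "ö" as a single precomposed code point, U+00F6 (the NFC form).
inline constexpr std::u32string_view composed = U"\u00f6";

// "ö" as 'o' followed by U+0308 COMBINING DIAERESIS (the NFD form).
inline constexpr std::u32string_view decomposed = U"o\u0308";

// Code-point-by-code-point equality, with no normalization applied.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b)
{
    return a == b;
}

// The two spellings render identically but do not compare equal.
static_assert(!same_code_points(composed, decomposed));
</pre>
<p>
A parser written against one of these forms will not match input that uses
the other, so normalize both to the same form before parsing.
</p>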
</div>
<div class="copyright-footer">Copyright © 2020 T. Zachary Laine<p>
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
</p>
</div>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
</body>
</html>