mirror of
https://github.com/boostorg/parser.git
synced 2026-01-23 17:52:15 +00:00
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode Support</title>
<link rel="stylesheet" href="../../boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../../index.html" title="Chapter 1. Boost.Parser">
<link rel="up" href="../tutorial.html" title="Tutorial">
<link rel="prev" href="algorithms_and_views_that_use_parsers.html" title="Algorithms and Views That Use Parsers">
<link rel="next" href="callback_parsing.html" title="Callback Parsing">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_parser.tutorial.unicode_support"></a><a class="link" href="unicode_support.html" title="Unicode Support">Unicode Support</a>
</h3></div></div></div>
<p>
Boost.Parser was designed from the start to be Unicode friendly. There are
numerous references to the "Unicode code path" and the "non-Unicode
code path" in the Boost.Parser documentation. Although there are in fact
two code paths, one for Unicode parsing and one for non-Unicode parsing,
the code in them differs very little, because both are written generically. The only
difference is that the Unicode code path parses the input as a range of code
points, and the non-Unicode path does not. In effect, this means that, in
the Unicode code path, when you call <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code> for some input range <code class="computeroutput"><span class="identifier">r</span></code>
and some parser <code class="computeroutput"><span class="identifier">p</span></code>, the parse
happens as if you called <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span> <span class="special">|</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code>
instead. (It does not matter whether <code class="computeroutput"><span class="identifier">r</span></code>
is a proper range or an iterator/sentinel pair; both work fine with
<code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span></code>.)
</p>
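<p>
To make "a range of code points" concrete, here is a minimal standalone
sketch of what decoding UTF-8 into code points involves. The helper
<code class="computeroutput">decode_utf8</code> is hypothetical: it is not part of Boost.Parser and
is not meant to show how <code class="computeroutput">as_utf32</code> is implemented. It assumes
well-formed input and does no error handling.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of Boost.Parser: decodes well-formed UTF-8
// into code points.  Malformed input is not diagnosed, to keep this short.
inline std::u32string decode_utf8(std::string_view in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char const lead = static_cast<unsigned char>(in[i]);
        // The lead byte determines how many continuation bytes follow.
        int const extra = lead < 0x80 ? 0 : lead < 0xe0 ? 1 : lead < 0xf0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : lead & (0x3f >> extra);
        ++i;
        // Each continuation byte contributes its low six bits.
        for (int k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3f);
        out.push_back(cp);
    }
    return out;
}

inline void decode_utf8_example()
{
    std::string const bytes = "\xcc\x80"; // UTF-8 encoding of U+0300
    assert(bytes.size() == 2);                       // two char elements
    assert(decode_utf8(bytes).size() == 1);          // one code point
    assert(decode_utf8(bytes)[0] == char32_t{0x0300});
}
```

<p>
The non-Unicode path would see the two elements of <code class="computeroutput">bytes</code>;
a code-point view of the same input sees one element.
</p>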
<p>
Matching "characters" within Boost.Parser's parsers is assumed
to be a code point match. In the Unicode path there is a code point from
the input that is matched to each <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code> parser. In the non-Unicode
path, the encoding is unknown, and so each element of the input is considered
to be a whole "character" in the input encoding, analogous to a
code point. From this point on, I will therefore refer to a single element
of the input exclusively as a code point.
</p>
<p>
So, let's say we write this parser:
</p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char8_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
For any <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
parser that should match a value or values, the type of the value to match
is retained. So <code class="computeroutput"><span class="identifier">char8_parser</span></code>
contains a <code class="computeroutput"><span class="keyword">char</span></code> that it will
use for matching. If we had written:
</p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char32_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="identifier">U</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
<code class="computeroutput"><span class="identifier">char32_parser</span></code> would instead
contain a <code class="computeroutput"><span class="keyword">char32_t</span></code> that it would
use for matching.
</p>
<p>
So, at any point during the parse, if <code class="computeroutput"><span class="identifier">char8_parser</span></code>
were being used to match a code point <code class="computeroutput"><span class="identifier">next_cp</span></code>
from the input, we would see the moral equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
<span class="special">==</span> <span class="char">'\xcc'</span></code>,
and if <code class="computeroutput"><span class="identifier">char32_parser</span></code> were
being used to match <code class="computeroutput"><span class="identifier">next_cp</span></code>,
we'd see the equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
<span class="special">==</span> <span class="identifier">U</span><span class="char">'\xcc'</span></code>. The takeaway here is that you can write
<code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
parsers that match specific values without worrying about whether the input is Unicode,
because, under the covers, what takes place is a simple comparison
of two integral values.
</p>
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
Boost.Parser actually promotes any two values to a common type using <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">common_type</span></code> before comparing them. This
almost always works because the input and any parameter passed to <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
must be character types.
</p></td></tr>
</table></div>
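<p>
As a rough illustration of comparing two character values through a common
type, here is a standalone sketch. The name <code class="computeroutput">equal_as_common</code>
is hypothetical and this is not Boost.Parser's internal code; the sketch
only uses ASCII values, so it sidesteps the implementation-defined signedness
of plain <code class="computeroutput">char</code>.
</p>

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: converts both values to their std::common_type and
// compares the results.  For character types the common type is an integral
// type chosen by the usual arithmetic conversions.
template<typename T, typename U>
constexpr bool equal_as_common(T lhs, U rhs)
{
    using common = std::common_type_t<T, U>;
    static_assert(std::is_integral_v<common>);
    return static_cast<common>(lhs) == static_cast<common>(rhs);
}

// Mixed character types compare by value.
static_assert(equal_as_common('a', U'a'));
static_assert(!equal_as_common('a', U'b'));
static_assert(equal_as_common(u8'a', u'a'));

inline void compare_example()
{
    char const from_parser = 'a';   // value stored in a char_ parser
    char32_t const next_cp = U'a';  // value taken from the input
    assert(equal_as_common(next_cp, from_parser));
}
```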
<p>
Since matches are always done at a code point level (remember, a "code
point" in the non-Unicode path is assumed to be a single <code class="computeroutput"><span class="keyword">char</span></code>), you get different results trying to
match UTF-8 input in the Unicode and non-Unicode code paths:
</p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>

<span class="special">{</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="special">(</span><span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*)</span><span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
    <span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

    <span class="comment">// Since we've done nothing to indicate that we want to do Unicode</span>
    <span class="comment">// parsing, and we've passed a range of char to parse(), this will do</span>
    <span class="comment">// non-Unicode parsing.</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>

    <span class="comment">// Finds one match of the *char* 0xcc, because the value in the parser</span>
    <span class="comment">// (0xcc) was matched against the two code points in the input (0xcc and</span>
    <span class="comment">// 0x80), and the first one was a match.</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">"\xcc"</span><span class="special">);</span>
<span class="special">}</span>
<span class="special">{</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
    <span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

    <span class="comment">// Since the input is a range of char8_t, this will do Unicode</span>
    <span class="comment">// parsing. The same thing would have happened if we passed</span>
    <span class="comment">// str | boost::parser::as_utf32 or even str | boost::parser::as_utf8.</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>

    <span class="comment">// Finds zero matches of the *code point* 0xcc, because the value in</span>
    <span class="comment">// the parser (0xcc) was matched against the single code point in the</span>
    <span class="comment">// input, 0x0300.</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">""</span><span class="special">);</span>
<span class="special">}</span>
</pre>
<h5>
<a name="boost_parser.tutorial.unicode_support.h0"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support.implicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.implicit_transcoding">Implicit
transcoding</a>
</h5>
<p>
Additionally, it is expected that most programs will use UTF-8 for the encoding
of Unicode strings. Boost.Parser is written with this typical case in mind.
This means that if you are parsing 32-bit code points (as you always are
in the Unicode path), and you want to catch the result in a container <code class="computeroutput"><span class="identifier">C</span></code> of <code class="computeroutput"><span class="keyword">char</span></code>
or <code class="computeroutput"><span class="identifier">char8_t</span></code> values, Boost.Parser
will silently transcode from UTF-32 to UTF-8 and write the attribute into
<code class="computeroutput"><span class="identifier">C</span></code>. As a result, <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>,
<code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span></code>, etc. are fine to use as attribute
out-parameters for <code class="computeroutput"><span class="special">*</span><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>, and the result
will be UTF-8.
</p>
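<p>
For readers who want to see what a UTF-32 to UTF-8 conversion amounts to,
here is a minimal standalone sketch. The helper <code class="computeroutput">append_utf8</code>
is hypothetical and is not the library's transcoding code; it assumes each
input value is a valid code point and performs no validation.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Boost.Parser: appends the UTF-8 encoding
// of a single code point to out.  Assumes cp is a valid code point.
inline void append_utf8(std::string & out, char32_t cp)
{
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xc0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xe0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xf0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    }
}

inline void append_utf8_example()
{
    std::string attr;
    append_utf8(attr, char32_t{0x00f6}); // U+00F6, the letter o with diaeresis
    assert(attr == "\xc3\xb6");          // two UTF-8 code units
}
```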
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
UTF-16 strings as attributes are not supported directly. If you want a
UTF-16 string as the result, you may need to transcode
a UTF-8 or UTF-32 attribute to UTF-16 within a semantic action. You can
do this by using <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf16</span></code>.
</p></td></tr>
</table></div>
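<p>
The following standalone sketch shows the arithmetic behind a UTF-32 to
UTF-16 conversion, which is the kind of work such a semantic action would
delegate. The helpers <code class="computeroutput">append_utf16</code> and
<code class="computeroutput">to_utf16</code> are hypothetical names, not Boost.Parser
APIs, and the input is assumed to contain only valid code points.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of Boost.Parser: appends the UTF-16 encoding
// of one code point.  Assumes cp is a valid code point.
inline void append_utf16(std::u16string & out, char32_t cp)
{
    if (cp < 0x10000) {
        // Code points in the Basic Multilingual Plane are a single unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Everything else becomes a high/low surrogate pair.
        char32_t const v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xd800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xdc00 + (v & 0x3ff)));
    }
}

// Hypothetical helper: transcodes a whole UTF-32 attribute.
inline std::u16string to_utf16(std::u32string const & in)
{
    std::u16string out;
    for (char32_t cp : in)
        append_utf16(out, cp);
    return out;
}

inline void to_utf16_example()
{
    std::u32string const attr = {char32_t{0x00f6}, char32_t{0x1F600}};
    std::u16string const utf16 = to_utf16(attr);
    assert(utf16.size() == 3); // one unit, then a surrogate pair
    assert(utf16[0] == char16_t{0x00f6});
    assert(utf16[1] == char16_t{0xd83d});
    assert(utf16[2] == char16_t{0xde00});
}
```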
<p>
The treatment of strings as UTF-8 is nearly ubiquitous within Boost.Parser.
For instance, though the entire interface of <code class="computeroutput"><a class="link" href="../../boost/parser/symbols.html" title="Struct template symbols">symbols</a></code> uses <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>
or <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string_view</span></code>, UTF-32 comparisons are used
internally.
</p>
<h5>
<a name="boost_parser.tutorial.unicode_support.h1"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support.explicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.explicit_transcoding">Explicit
transcoding</a>
</h5>
<p>
I mentioned above that the use of <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">utf</span><span class="special">*</span><span class="identifier">_view</span></code> as the range to parse opts you in
to Unicode parsing. Here's a bit more about these views and how best to use
them.
</p>
<p>
If you want to do Unicode parsing, you're always going to be comparing code
points at each step of the parse. As such, you're going to implicitly convert
any parse input to UTF-32, if needed. This is what all the parse API functions
do internally.
</p>
<p>
However, there are times when you have parse input that is a sequence of
UTF-8-encoded <code class="computeroutput"><span class="keyword">char</span></code>s, and you
want to do Unicode-aware parsing. As mentioned previously, Boost.Parser has
a special case for <code class="computeroutput"><span class="keyword">char</span></code> inputs,
and it will <span class="bold"><strong>not</strong></span> assume that <code class="computeroutput"><span class="keyword">char</span></code> sequences are UTF-8. If you want to tell
the parse API to do Unicode processing on them anyway, you can use the <code class="computeroutput"><span class="identifier">as_utf32</span></code> range adaptor. (Note that you
can use any of the <code class="computeroutput"><span class="identifier">as_utf</span><span class="special">*</span></code> adaptors and the semantics will not differ
from the semantics below.)
</p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>

<span class="keyword">auto</span> <span class="keyword">const</span> <span class="identifier">p</span> <span class="special">=</span> <span class="char">'"'</span> <span class="special">&gt;&gt;</span> <span class="special">*(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span> <span class="special">-</span> <span class="char">'"'</span> <span class="special">-</span> <span class="number">0xb6</span><span class="special">)</span> <span class="special">&gt;&gt;</span> <span class="char">'"'</span><span class="special">;</span>
<span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*</span> <span class="identifier">str</span> <span class="special">=</span> <span class="string">"\"two wörds\""</span><span class="special">;</span> <span class="comment">// ö is two code units, 0xc3 0xb6</span>

<span class="keyword">auto</span> <span class="identifier">result_1</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span> <span class="comment">// Treat each char as a code point (typically ASCII).</span>
<span class="identifier">assert</span><span class="special">(!</span><span class="identifier">result_1</span><span class="special">);</span>
<span class="keyword">auto</span> <span class="identifier">result_2</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span> <span class="special">|</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span> <span class="comment">// Unicode-aware parsing on code points.</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">result_2</span><span class="special">);</span>
</pre>
<p>
The first call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
treats each <code class="computeroutput"><span class="keyword">char</span></code> as a code point,
and since <code class="computeroutput"><span class="string">"ö"</span></code> is the
pair of code units <code class="computeroutput"><span class="number">0xc3</span></code> <code class="computeroutput"><span class="number">0xb6</span></code>, the parse matches the second code unit
against the <code class="computeroutput"><span class="special">-</span> <span class="number">0xb6</span></code>
part of the parser above, causing the parse to fail. This happens because
each code unit/<code class="computeroutput"><span class="keyword">char</span></code> in <code class="computeroutput"><span class="identifier">str</span></code> is treated as an independent code point.
</p>
<p>
The second call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
succeeds because, when the parse gets to the code point for <code class="computeroutput"><span class="char">'ö'</span></code>, it is <code class="computeroutput"><span class="number">0xf6</span></code>
(U+00F6), which does not match the <code class="computeroutput"><span class="special">-</span>
<span class="number">0xb6</span></code> part of the parser.
</p>
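<p>
The difference between the two calls comes down to which sequence of values
the parser is compared against. This standalone sketch makes that visible
without using Boost.Parser at all; <code class="computeroutput">has_code_unit</code> and
<code class="computeroutput">has_code_point</code> are hypothetical helpers, and the two
strings are written out by hand rather than transcoded.
</p>

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: does any char in s have the given value?  Each char
// goes through unsigned char so that 0xb6 compares as 182, not as a negative
// value on platforms where plain char is signed.
inline bool has_code_unit(std::string const & s, unsigned char unit)
{
    return std::any_of(s.begin(), s.end(), [unit](char c) {
        return static_cast<unsigned char>(c) == unit;
    });
}

// Hypothetical helper: does any code point in s have the given value?
inline bool has_code_point(std::u32string const & s, char32_t cp)
{
    return s.find(cp) != std::u32string::npos;
}

inline void units_vs_points_example()
{
    // "wörds" as UTF-8 code units and as code points.
    std::string const units = "w\xc3\xb6rds";
    std::u32string const points = U"w\u00f6rds";

    assert(has_code_unit(units, 0xb6));    // the "- 0xb6" exclusion would trigger here
    assert(!has_code_point(points, 0xb6)); // but not on code points
    assert(has_code_point(points, 0xf6));  // U+00F6 is what the Unicode path sees
}
```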
<p>
The other adaptors <code class="computeroutput"><span class="identifier">as_utf8</span></code>
and <code class="computeroutput"><span class="identifier">as_utf16</span></code> are also provided
for completeness, if you want to use them. Each can transcode any sequence
of character types.
</p>
<div class="important"><table border="0" summary="Important">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Important]" src="../../images/important.png"></td>
<th align="left">Important</th>
</tr>
<tr><td align="left" valign="top"><p>
The <code class="computeroutput"><span class="identifier">as_utfN</span></code> adaptors are
optional, so they don't come with <code class="computeroutput"><span class="identifier">parser</span><span class="special">.</span><span class="identifier">hpp</span></code>.
To get access to them, <code class="computeroutput"><span class="preprocessor">#include</span>
<span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">parser</span><span class="special">/</span><span class="identifier">transcode_view</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>.
</p></td></tr>
</table></div>
<h5>
<a name="boost_parser.tutorial.unicode_support.h2"></a>
<span class="phrase"><a name="boost_parser.tutorial.unicode_support._lack_of__normalization"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support._lack_of__normalization">(Lack
of) normalization</a>
</h5>
<p>
One thing that Boost.Parser does not handle for you is normalization; Boost.Parser
is completely normalization-agnostic. Since all the parsers do their matching
using equality comparisons of code points, you should make sure that your
parsed range and your parsers all use the same normalization form.
</p>
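<p>
To see why the normalization form matters, consider the same visible
character written in two forms. This standalone sketch uses only the
standard library; the function name is hypothetical, and the two strings are
spelled out by hand rather than produced by a normalization routine.
</p>

```cpp
#include <cassert>
#include <string>

inline void normalization_example()
{
    // Precomposed form (NFC): a single code point, U+00F6.
    std::u32string const nfc = U"\u00f6";
    // Decomposed form (NFD): 'o' followed by U+0308, combining diaeresis.
    std::u32string const nfd = U"o\u0308";

    assert(nfc.size() == 1);
    assert(nfd.size() == 2);
    // Code point equality sees two different sequences, even though both
    // render as the same character.
    assert(nfc != nfd);
}
```

<p>
A parser written against one form will therefore not match input in the
other form unless one of them is normalized first.
</p>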
</div>
<div class="copyright-footer">Copyright © 2020 T. Zachary Laine<p>
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
</p>
</div>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
</body>
</html>