parser/doc/html/boost_parser/tutorial/unicode_support.html

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Unicode Support</title>
<link rel="stylesheet" href="../../boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../../index.html" title="Chapter 1. Boost.Parser">
<link rel="up" href="../tutorial.html" title="Tutorial">
<link rel="prev" href="algorithms_and_views_that_use_parsers.html" title="Algorithms and Views That Use Parsers">
<link rel="next" href="callback_parsing.html" title="Callback Parsing">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_parser.tutorial.unicode_support"></a><a class="link" href="unicode_support.html" title="Unicode Support">Unicode Support</a>
</h3></div></div></div>
<p>
        Boost.Parser was designed from the start to be Unicode friendly. There are
        numerous references to the "Unicode code path" and the "non-Unicode
        code path" in the Boost.Parser documentation. Although Unicode and
        non-Unicode parsing do take separate code paths, the two differ very little,
        because both are written generically. The only
        difference is that the Unicode code path parses the input as a range of code
        points, and the non-Unicode path does not. In effect, this means that, in
        the Unicode code path, when you call <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code> for some input range <code class="computeroutput"><span class="identifier">r</span></code>
        and some parser <code class="computeroutput"><span class="identifier">p</span></code>, the parse
        happens as if you called <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse</a><span class="special">(</span><span class="identifier">r</span> <span class="special">|</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">)</span></code>
        instead. (Of course, it does not matter if <code class="computeroutput"><span class="identifier">r</span></code>
        is a proper range, or an iterator/sentinel pair; those both work fine with
        <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf32</span></code>.)
      </p>
<p>
        Matching "characters" within Boost.Parser's parsers is assumed
        to be a code point match. In the Unicode path there is a code point from
        the input that is matched to each <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code> parser. In the non-Unicode
        path, the encoding is unknown, and so each element of the input is considered
        to be a whole "character" in the input encoding, analogous to a
        code point. From this point on, I will therefore refer to a single element
        of the input exclusively as a code point.
      </p>
<p>
        So, let's say we write this parser:
      </p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char8_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
        For any <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
        parser that should match a value or values, the type of the value to match
        is retained. So <code class="computeroutput"><span class="identifier">char8_parser</span></code>
        contains a <code class="computeroutput"><span class="keyword">char</span></code> that it will
        use for matching. If we had written:
      </p>
<pre class="programlisting"><span class="keyword">constexpr</span> <span class="keyword">auto</span> <span class="identifier">char32_parser</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="identifier">U</span><span class="char">'\xcc'</span><span class="special">);</span>
</pre>
<p>
        <code class="computeroutput"><span class="identifier">char32_parser</span></code> would instead
        contain a <code class="computeroutput"><span class="keyword">char32_t</span></code> that it would
        use for matching.
      </p>
<p>
        So, at any point during the parse, if <code class="computeroutput"><span class="identifier">char8_parser</span></code>
        were being used to match a code point <code class="computeroutput"><span class="identifier">next_cp</span></code>
        from the input, we would see the moral equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
        <span class="special">==</span> <span class="char">'\xcc'</span></code>,
        and if <code class="computeroutput"><span class="identifier">char32_parser</span></code> were
        being used to match <code class="computeroutput"><span class="identifier">next_cp</span></code>,
        we'd see the equivalent of <code class="computeroutput"><span class="identifier">next_cp</span>
        <span class="special">==</span> <span class="identifier">U</span><span class="char">'\xcc'</span></code>. The take-away here is that you can write
        <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
        parsers that match specific values without worrying about whether the input
        is Unicode, because, under the covers, what takes place is a simple comparison
        of two integral values.
      </p>
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
          Boost.Parser actually promotes any two values to a common type using <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">common_type</span></code> before comparing them. This
          almost always works because the input and any parameter passed to <code class="computeroutput"><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>
          must be character types.
        </p></td></tr>
</table></div>
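<p>
        To make the note above concrete, here is a minimal sketch of comparing
        two character-like values through their common type. It uses only the
        standard library; <code class="computeroutput">same_value</code> and
        <code class="computeroutput">demo_same_value</code> are hypothetical names
        chosen for illustration, and this is not a copy of Boost.Parser's internal
        comparison.
      </p>

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper (not part of Boost.Parser): convert both values to
// their std::common_type, then compare the results as plain integers.
template <typename A, typename B>
constexpr bool same_value(A a, B b)
{
    using common_t = std::common_type_t<A, B>;
    return static_cast<common_t>(a) == static_cast<common_t>(b);
}

inline void demo_same_value()
{
    // Same value, different character types: the comparison succeeds.
    assert(same_value('a', U'a'));
    assert(same_value(u'\u0300', U'\u0300'));

    // Different values never compare equal, whatever their types.
    assert(!same_value('a', U'b'));
}
```

<p>
        This sketch does nothing special for negative
        <code class="computeroutput">char</code> values (on platforms where
        <code class="computeroutput">char</code> is signed,
        <code class="computeroutput">'\xcc'</code> is negative). How Boost.Parser
        itself treats that case is not covered by this sketch.
      </p>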
<p>
        Since matches are always done at a code point level (remember, a "code
        point" in the non-Unicode path is assumed to be a single <code class="computeroutput"><span class="keyword">char</span></code>), you get different results trying to
        match UTF-8 input in the Unicode and non-Unicode code paths:
      </p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>

<span class="special">{</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="special">(</span><span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*)</span><span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
    <span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

    <span class="comment">// Since we've done nothing to indicate that we want to do Unicode</span>
    <span class="comment">// parsing, and we've passed a range of char to parse(), this will do</span>
    <span class="comment">// non-Unicode parsing.</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>

    <span class="comment">// Finds one match of the *char* 0xcc, because the value in the parser</span>
    <span class="comment">// (0xcc) was matched against the two code points in the input (0xcc and</span>
    <span class="comment">// 0x80), and the first one was a match.</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">"\xcc"</span><span class="special">);</span>
<span class="special">}</span>
<span class="special">{</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span> <span class="identifier">str</span> <span class="special">=</span> <span class="identifier">u8</span><span class="string">"\xcc\x80"</span><span class="special">;</span> <span class="comment">// encodes the code point U+0300</span>
    <span class="keyword">auto</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

    <span class="comment">// Since the input is a range of char8_t, this will do Unicode</span>
    <span class="comment">// parsing.  The same thing would have happened if we passed</span>
    <span class="comment">// str | boost::parser::as_utf32 or even str | boost::parser::as_utf8.</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">chars</span><span class="special">;</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="special">*</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span><span class="special">(</span><span class="char">'\xcc'</span><span class="special">),</span> <span class="identifier">chars</span><span class="special">));</span>

    <span class="comment">// Finds zero matches of the *code point* 0xcc, because the value in</span>
    <span class="comment">// the parser (0xcc) was matched against the single code point in the</span>
    <span class="comment">// input, 0x0300.</span>
    <span class="identifier">assert</span><span class="special">(</span><span class="identifier">chars</span> <span class="special">==</span> <span class="string">""</span><span class="special">);</span>
<span class="special">}</span>
</pre>
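<p>
        The difference between the two blocks above can be reproduced without
        Boost.Parser. The following sketch assumes well-formed UTF-8 input and
        uses two hypothetical helpers, <code class="computeroutput">decode_utf8</code>
        and <code class="computeroutput">leading_matches</code>; it only illustrates
        matching code units versus matching code points, and is not how the
        library is implemented.
      </p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points.
// Assumes well-formed input; it performs no validation.
inline std::u32string decode_utf8(std::string const & s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char const lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xe0) { cp = lead & 0x1f; len = 2; }
        else if (lead < 0xf0) { cp = lead & 0x0f; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3f);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Count how many leading elements equal `value`, element by element.
// Overload for code units (each char is treated as one "character").
inline std::size_t leading_matches(std::string const & units, unsigned char value)
{
    std::size_t n = 0;
    while (n < units.size() && static_cast<unsigned char>(units[n]) == value)
        ++n;
    return n;
}

// Overload for code points.
inline std::size_t leading_matches(std::u32string const & cps, char32_t value)
{
    std::size_t n = 0;
    while (n < cps.size() && cps[n] == value)
        ++n;
    return n;
}
```

<p>
        For the input <code class="computeroutput">"\xcc\x80"</code>,
        <code class="computeroutput">leading_matches(str, 0xcc)</code> finds one
        matching code unit, while
        <code class="computeroutput">leading_matches(decode_utf8(str), 0xcc)</code>
        finds none, because the only code point is U+0300.
      </p>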
<h5>
<a name="boost_parser.tutorial.unicode_support.h0"></a>
        <span class="phrase"><a name="boost_parser.tutorial.unicode_support.implicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.implicit_transcoding">Implicit
        transcoding</a>
      </h5>
<p>
        Additionally, it is expected that most programs will use UTF-8 for the encoding
        of Unicode strings. Boost.Parser is written with this typical case in mind.
        This means that if you are parsing 32-bit code points (as you always are
        in the Unicode path), and you want to catch the result in a container <code class="computeroutput"><span class="identifier">C</span></code> of <code class="computeroutput"><span class="keyword">char</span></code>
        or <code class="computeroutput"><span class="identifier">char8_t</span></code> values, Boost.Parser
        will silently transcode from UTF-32 to UTF-8 and write the attribute into
        <code class="computeroutput"><span class="identifier">C</span></code>. This means that <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>,
        <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">u8string</span></code>, etc. are fine to use as attribute
        out-parameters for <code class="computeroutput"><span class="special">*</span><a class="link" href="../../boost/parser/char_.html" title="Global char_">char_</a></code>, and the result
        will be UTF-8.
      </p>
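<p>
        As a rough picture of what transcoding from UTF-32 to UTF-8 involves,
        here is a standard-library-only sketch. The names
        <code class="computeroutput">append_utf8</code> and
        <code class="computeroutput">to_utf8</code> are hypothetical, the input is
        assumed to contain only valid code points, and Boost.Parser's own
        transcoding may well be implemented differently.
      </p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Assumes cp is a valid Unicode scalar value (no validation is done).
inline void append_utf8(std::string & out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xc0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xe0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    } else {
        out.push_back(static_cast<char>(0xf0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3f)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
    }
}

// Transcode a whole sequence of code points into a UTF-8 std::string,
// which is the kind of container the text above calls C.
inline std::string to_utf8(std::u32string const & cps)
{
    std::string out;
    for (char32_t cp : cps)
        append_utf8(out, cp);
    return out;
}
```

<p>
        With this sketch, the single code point U+0300 becomes the two
        <code class="computeroutput">char</code>s
        <code class="computeroutput">0xcc 0x80</code>, the same bytes used in the
        earlier example.
      </p>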
<div class="note"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top"><p>
          UTF-16 strings are not supported directly as attributes. If you need
          a UTF-16 result, transcode a UTF-8 or UTF-32 attribute to UTF-16
          within a semantic action. You can
          do this by using <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">as_utf16</span></code>.
        </p></td></tr>
</table></div>
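<p>
        For readers unfamiliar with UTF-16, the sketch below shows what
        transcoding a UTF-32 attribute to UTF-16 amounts to, using only the
        standard library. <code class="computeroutput">to_utf16</code> is a
        hypothetical helper that assumes valid code points; in real code you
        would more likely rely on
        <code class="computeroutput">boost::parser::as_utf16</code>, as the note
        says.
      </p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode code points as UTF-16 code units.
// Assumes every element is a valid Unicode scalar value.
inline std::u16string to_utf16(std::u32string const & cps)
{
    std::u16string out;
    for (char32_t cp : cps) {
        if (cp < 0x10000) {
            // Code points in the Basic Multilingual Plane fit in one unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else is split into a high/low surrogate pair.
            char32_t const v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xd800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xdc00 + (v & 0x3ff)));
        }
    }
    return out;
}
```

<p>
        A code point such as U+00F6 stays a single code unit, while a code point
        above U+FFFF, such as U+1F600, becomes two.
      </p>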
<p>
        The treatment of strings as UTF-8 is nearly ubiquitous within Boost.Parser.
        For instance, though the entire interface of <code class="computeroutput"><a class="link" href="../../boost/parser/symbols.html" title="Struct template symbols">symbols</a></code> uses <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>
        or <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string_view</span></code>, UTF-32 comparisons are used
        internally.
      </p>
<h5>
<a name="boost_parser.tutorial.unicode_support.h1"></a>
        <span class="phrase"><a name="boost_parser.tutorial.unicode_support.explicit_transcoding"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support.explicit_transcoding">Explicit
        transcoding</a>
      </h5>
<p>
        I mentioned above that the use of <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">::</span><span class="identifier">utf</span><span class="special">*</span><span class="identifier">_view</span></code> as the range to parse opts you in
        to Unicode parsing. Here's a bit more about these views and how best to use
        them.
      </p>
<p>
        Unicode parsing always compares code points at each step of the parse.
        As such, any parse input is implicitly converted
        to UTF-32, if needed. This is what all the parse API functions
        do internally.
      </p>
<p>
        However, there are times when you have parse input that is a sequence of
        UTF-8-encoded <code class="computeroutput"><span class="keyword">char</span></code>s, and you
        want to do Unicode-aware parsing. As mentioned previously, Boost.Parser has
        a special case for <code class="computeroutput"><span class="keyword">char</span></code> inputs,
        and it will <span class="bold"><strong>not</strong></span> assume that <code class="computeroutput"><span class="keyword">char</span></code> sequences are UTF-8. If you want to tell
        the parse API to do Unicode processing on them anyway, you can use the <code class="computeroutput"><span class="identifier">as_utf32</span></code> range adaptor. (Note that you
        can use any of the <code class="computeroutput"><span class="identifier">as_utf</span><span class="special">*</span></code> adaptors; the semantics are the same as
        those described below.)
      </p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">bp</span> <span class="special">=</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">parser</span><span class="special">;</span>

<span class="keyword">auto</span> <span class="keyword">const</span> <span class="identifier">p</span> <span class="special">=</span> <span class="char">'"'</span> <span class="special">&gt;&gt;</span> <span class="special">*(</span><span class="identifier">bp</span><span class="special">::</span><span class="identifier">char_</span> <span class="special">-</span> <span class="char">'"'</span> <span class="special">-</span> <span class="number">0xb6</span><span class="special">)</span> <span class="special">&gt;&gt;</span> <span class="char">'"'</span><span class="special">;</span>
<span class="keyword">char</span> <span class="keyword">const</span> <span class="special">*</span> <span class="identifier">str</span> <span class="special">=</span> <span class="string">"\"two wörds\""</span><span class="special">;</span> <span class="comment">// ö is two code units, 0xc3 0xb6</span>

<span class="keyword">auto</span> <span class="identifier">result_1</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span>                <span class="comment">// Treat each char as a code point (typically ASCII).</span>
<span class="identifier">assert</span><span class="special">(!</span><span class="identifier">result_1</span><span class="special">);</span>
<span class="keyword">auto</span> <span class="identifier">result_2</span> <span class="special">=</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">parse</span><span class="special">(</span><span class="identifier">str</span> <span class="special">|</span> <span class="identifier">bp</span><span class="special">::</span><span class="identifier">as_utf32</span><span class="special">,</span> <span class="identifier">p</span><span class="special">);</span> <span class="comment">// Unicode-aware parsing on code points.</span>
<span class="identifier">assert</span><span class="special">(</span><span class="identifier">result_2</span><span class="special">);</span>
</pre>
<p>
        The first call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
        treats each <code class="computeroutput"><span class="keyword">char</span></code> as a code point,
        and since <code class="computeroutput"><span class="string">"ö"</span></code> is the
        pair of code units <code class="computeroutput"><span class="number">0xc3</span></code> <code class="computeroutput"><span class="number">0xb6</span></code>, the parse matches the second code unit
        against the <code class="computeroutput"><span class="special">-</span> <span class="number">0xb6</span></code>
        part of the parser above, causing the parse to fail. This happens because
        each code unit/<code class="computeroutput"><span class="keyword">char</span></code> in <code class="computeroutput"><span class="identifier">str</span></code> is treated as an independent code point.
      </p>
<p>
        The second call to <code class="computeroutput"><a class="link" href="../../boost/parser/parse_id2.html" title="Function template parse">parse()</a></code>
        succeeds because, when the parse gets to the code point for <code class="computeroutput"><span class="char">'ö'</span></code>, it is <code class="computeroutput"><span class="number">0xf6</span></code>
        (U+00F6), which does not match the <code class="computeroutput"><span class="special">-</span>
        <span class="number">0xb6</span></code> part of the parser.
      </p>
<p>
        The other adaptors <code class="computeroutput"><span class="identifier">as_utf8</span></code>
        and <code class="computeroutput"><span class="identifier">as_utf16</span></code> are also provided
        for completeness. Each of them can transcode any sequence
        of character types.
      </p>
<div class="important"><table border="0" summary="Important">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Important]" src="../../images/important.png"></td>
<th align="left">Important</th>
</tr>
<tr><td align="left" valign="top"><p>
          The <code class="computeroutput"><span class="identifier">as_utfN</span></code> adaptors are
          optional, so they don't come with <code class="computeroutput"><span class="identifier">parser</span><span class="special">.</span><span class="identifier">hpp</span></code>.
          To get access to them, <code class="computeroutput"><span class="preprocessor">#include</span>
          <span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">parser</span><span class="special">/</span><span class="identifier">transcode_view</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span></code>.
        </p></td></tr>
</table></div>
<h5>
<a name="boost_parser.tutorial.unicode_support.h2"></a>
        <span class="phrase"><a name="boost_parser.tutorial.unicode_support._lack_of__normalization"></a></span><a class="link" href="unicode_support.html#boost_parser.tutorial.unicode_support._lack_of__normalization">(Lack
        of) normalization</a>
      </h5>
<p>
        One thing that Boost.Parser does not handle for you is normalization; Boost.Parser
        is completely normalization-agnostic. Since all the parsers do their matching
        using equality comparisons of code points, you should make sure that your
        parsed range and your parsers all use the same normalization form.
      </p>
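<p>
        A small standard-library illustration of why this matters: the same
        visible character can be spelled with different code point sequences,
        and a plain code point comparison treats them as different. The function
        names below are hypothetical and exist only for this sketch.
      </p>

```cpp
#include <cassert>
#include <string>

// "e with acute accent" as one precomposed code point (NFC form).
inline std::u32string e_acute_composed() { return U"\u00e9"; }

// The same character as "e" followed by a combining acute accent (NFD form).
inline std::u32string e_acute_decomposed() { return U"e\u0301"; }

// Element-wise comparison of code points, with no normalization applied.
inline bool code_points_equal(std::u32string const & a, std::u32string const & b)
{
    return a == b;
}
```

<p>
        Here <code class="computeroutput">code_points_equal(e_acute_composed(),
        e_acute_decomposed())</code> is <code class="computeroutput">false</code>,
        so a parser written with one spelling would not match input that uses the
        other unless both are normalized to the same form first.
      </p>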
</div>
<div class="copyright-footer">Copyright © 2020 T. Zachary Laine<p>
        Distributed under the Boost Software License, Version 1.0. (See accompanying
        file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
      </p>
</div>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="algorithms_and_views_that_use_parsers.html"><img src="../../images/prev.png" alt="Prev"></a><a accesskey="u" href="../tutorial.html"><img src="../../images/up.png" alt="Up"></a><a accesskey="h" href="../../index.html"><img src="../../images/home.png" alt="Home"></a><a accesskey="n" href="callback_parsing.html"><img src="../../images/next.png" alt="Next"></a>
</div>
</body>
</html>