Don't use the stdlib UTF-8 codecvt facet on Cygwin either.

Handle Cygwin the same way as native Windows.
The reason is an issue discovered when converting a UTF-8 sequence of 1000x U+2008A to wchar_t (UTF-16):
UTF-8: "\xF0\xA0\x82\x8A"
The correct result is 1000x L"\xD840\xDC8A"
The first 255 pairs are correct (1020 input bytes consumed), but the low surrogate of the 256th pair becomes `0xDC82`, hinting that it repeats the second-to-last byte (index 1023) instead of reading the correct one.
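
The arithmetic behind those values can be checked with a minimal sketch. The helpers `decode4` and `to_surrogates` are hypothetical names used only for illustration; they are not part of Boost.Locale or of the Cygwin stdlib, and they do no input validation. The sketch only shows that the correct bytes yield `0xD840 0xDC8A`, and that reusing the byte `0x82` in place of the final `0x8A` yields the observed `0xDC82`. It does not reproduce the stdlib bug itself.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: decode one 4-byte UTF-8 sequence into a code point.
// No validation of lead or continuation bytes is performed.
std::uint32_t decode4(unsigned char b0, unsigned char b1, unsigned char b2, unsigned char b3)
{
    return (std::uint32_t(b0 & 0x07) << 18) | (std::uint32_t(b1 & 0x3F) << 12)
           | (std::uint32_t(b2 & 0x3F) << 6) | std::uint32_t(b3 & 0x3F);
}

// Hypothetical helper: split a code point above U+FFFF into a UTF-16
// surrogate pair, returned as (high, low).
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000;
    return {std::uint16_t(0xD800 + (v >> 10)), std::uint16_t(0xDC00 + (v & 0x3FF))};
}
```

Calling `to_surrogates(decode4(0xF0, 0xA0, 0x82, 0x8A))` gives the expected pair, while `to_surrogates(decode4(0xF0, 0xA0, 0x82, 0x82))` gives a low surrogate of `0xDC82`, which matches the corrupted value described above.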
2026-01-19 04:22:08 +00:00 · 2025-10-28 20:14:03 +01:00
parent cd43cdcf0f
commit 848115fcae
1 changed file with 4 additions and 1 deletion
--- a/src/std/codecvt.cpp
+++ b/src/std/codecvt.cpp
@@ -18,11 +18,14 @@ namespace boost { namespace locale { namespace impl_std {
    std::locale
    create_codecvt(const std::locale& in, const std::string& locale_name, char_facet_t type, utf8_support utf)
    {
-#if defined(BOOST_WINDOWS)
+#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
        // This isn't fully correct:
        // It will treat the 2-Byte wchar_t as UTF-16 encoded while it may be UCS-2
        // std::basic_filebuf explicitely disallows using suche multi-byte codecvts
        // but it works in practice so far, so use it instead of failing for codepoints above U+FFFF
+        //
+        // Additionally, the stdlib in Cygwin has issues converting long UTF-8 sequences likely due to left-over
+        // state across buffer boundaries. E.g. the low surrogate after a sequence of 255 UTF-16 pairs gets corrupted.
        if(utf != utf8_support::none)
            return util::create_utf8_codecvt(in, type);
 #endif