mirror of https://github.com/boostorg/locale.git synced 2026-01-19 04:22:08 +00:00

Don't use stdlib UTF-8 codecvt facet for Cygwin too.

Handle that the same as for native Windows.
The reason is an issue discovered when converting a UTF-8 sequence of 1000x U+2008A to wchar_t (UTF-16):
UTF-8: "\xF0\xA0\x82\x8A"
The correct result is 1000x L"\xD840\xDC8A".
The first 255 pairs are correct (1020 input bytes consumed), but the low
surrogate of the 256th pair becomes `0xDC82`, hinting that it repeats the
second last byte (index 1023) instead of reading the correct one.
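
The arithmetic behind these values can be checked with a small standalone sketch. It is not the Boost.Locale or Cygwin implementation; the helper names `decode4`, `high_surrogate`, `low_surrogate` and `convert_4byte_utf8` are hypothetical and exist only for illustration. It assumes well-formed input made up solely of 4-byte UTF-8 sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one 4-byte UTF-8 sequence into a code point.
// No validation: this sketch assumes well-formed input.
constexpr std::uint32_t decode4(unsigned char b0, unsigned char b1, unsigned char b2, unsigned char b3)
{
    return (std::uint32_t(b0 & 0x07) << 18) | (std::uint32_t(b1 & 0x3F) << 12)
           | (std::uint32_t(b2 & 0x3F) << 6) | std::uint32_t(b3 & 0x3F);
}

// High (leading) UTF-16 surrogate for a code point above U+FFFF.
constexpr char16_t high_surrogate(std::uint32_t cp)
{
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Low (trailing) UTF-16 surrogate for a code point above U+FFFF.
constexpr char16_t low_surrogate(std::uint32_t cp)
{
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// Convert a string consisting only of 4-byte UTF-8 sequences to UTF-16.
inline std::u16string convert_4byte_utf8(const std::string& in)
{
    std::u16string out;
    out.reserve(in.size() / 2);
    for(std::size_t i = 0; i + 3 < in.size(); i += 4) {
        const std::uint32_t cp = decode4(in[i], in[i + 1], in[i + 2], in[i + 3]);
        out += high_surrogate(cp);
        out += low_surrogate(cp);
    }
    return out;
}
```

With these helpers, `decode4(0xF0, 0xA0, 0x82, 0x8A)` yields U+2008A, which maps to the pair 0xD840 0xDC8A. Feeding 0x82 a second time in place of the final 0x8A gives a low surrogate of 0xDC82, the corrupted value described above. Calling `convert_4byte_utf8` on the sequence repeated 1000 times produces 2000 UTF-16 code units, which is what a correct converter has to return regardless of its internal buffer size.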
Alexander Grund
2025-10-28 20:14:03 +01:00
parent cd43cdcf0f
commit 848115fcae


@@ -18,11 +18,14 @@ namespace boost { namespace locale { namespace impl_std {
 std::locale
 create_codecvt(const std::locale& in, const std::string& locale_name, char_facet_t type, utf8_support utf)
 {
-#if defined(BOOST_WINDOWS)
+#if defined(BOOST_WINDOWS) || defined(__CYGWIN__)
 // This isn't fully correct:
 // It will treat the 2-Byte wchar_t as UTF-16 encoded while it may be UCS-2
 // std::basic_filebuf explicitely disallows using suche multi-byte codecvts
 // but it works in practice so far, so use it instead of failing for codepoints above U+FFFF
+//
+// Additionally, the stdlib in Cygwin has issues converting long UTF-8 sequences likely due to left-over
+// state across buffer boundaries. E.g. the low surrogate after a sequence of 255 UTF-16 pairs gets corrupted.
 if(utf != utf8_support::none)
 return util::create_utf8_codecvt(in, type);
 #endif