This allows for testing that the ISA-specific code at least compiles,
even if running the tests isn't possible.
The support is only added to b2; CMake still always compiles and runs
the tests, to keep using boost_test_jamfile for easier maintenance. In
the future, similar support can be added to CMake as well.
This adds SSE4.1, AVX2, AVX-512v1 and AVX10.1 implementations of the
from_chars algorithm. The generic implementation is moved to its own
header, and constexpr is relaxed to be enabled only when
is_constant_evaluated is supported.
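The constexpr gating described above can be sketched as follows. This is a
minimal illustrative model, not the actual Boost.UUID internals; the
UUID_SKETCH_CONSTEXPR macro and parse_hex_digit function are hypothetical
names.

```cpp
#include <type_traits>

// The function is marked constexpr only when is_constant_evaluated is
// available, so the runtime branch is free to dispatch to
// non-constexpr SIMD code without breaking constant evaluation.
#if defined(__cpp_lib_is_constant_evaluated)
#define UUID_SKETCH_CONSTEXPR constexpr
#else
#define UUID_SKETCH_CONSTEXPR
#endif

UUID_SKETCH_CONSTEXPR int parse_hex_digit(char c)
{
#if defined(__cpp_lib_is_constant_evaluated)
    if (!std::is_constant_evaluated())
    {
        // At runtime the real implementation would dispatch to the
        // SIMD-accelerated parser here.
    }
#endif
    // Portable path, valid during constant evaluation
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}
```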
The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
in millions of successful from_chars() calls per second:
Char | Generic | SSE4.1 | AVX2 | AVX512v1 | AVX10.1
=========+=========+=================+=================+=================+================
char | 38.571 | 560.645 (14.5x) | 501.505 (13.0x) | 540.038 (14.0x) | 480.778 (12.5x)
char16_t | 37.998 | 479.308 (12.6x) | 425.728 (11.2x) | 416.379 (11.0x) | 392.326 (10.3x)
char32_t | 50.327 | 391.313 (7.78x) | 359.312 (7.14x) | 349.849 (6.95x) | 333.979 (6.64x)
The AVX2 version is slightly slower than SSE4.1 because on Intel
microarchitectures the VEX-coded vpblendvb instruction is slower than
the legacy SSE4.1 pblendvb. The code contains workarounds for this, which
have a slight performance overhead compared to the SSE4.1 version but are
still faster than using vpblendvb. Alternatively, the performance could be
improved by using asm blocks to force using pblendvb in AVX2 code, but this
may potentially cause SSE/AVX transition penalties if the target vector
register happens to have "dirty" upper bits. There's no way to ensure this
doesn't happen, so this is not implemented. AVX512v1 claws back some
performance and uses fewer instructions (i.e. smaller code size).
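The workaround mentioned above boils down to replacing the byte blend with
plain bitwise logic. A scalar model of the idea (a hypothetical sketch, not
the actual vector code):

```cpp
#include <cstdint>

// pblendvb picks, for each byte, b where the mask byte's high bit is
// set and a otherwise. Assuming the mask bytes are fully set (0xFF) or
// fully clear (0x00), the same result can be built from AND/ANDNOT/OR,
// which maps to vpand/vpandn/vpor and avoids the slower VEX-coded
// vpblendvb on Intel.
inline std::uint64_t blend_bytes(std::uint64_t a, std::uint64_t b, std::uint64_t mask)
{
    return (b & mask) | (a & ~mask);
}
```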
The AVX10.1 version is slower, as it uses the vpermi2b instruction from
AVX512_VBMI, which is relatively slow on Intel. It allows reducing the
number of instructions even further, along with the number of vector
constants. The
instruction is faster on AMD Zen 4 and should offer better performance compared
to AVX512v1 code path, although it wasn't tested. This code path is disabled
by default, unless BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B is defined, which
can be used to test and tune performance on AMD and newer Intel CPU
microarchitectures. Thus, by default, AVX10.1 performance should be roughly
equivalent to AVX512v1, barring compiler (mis)optimizations.
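A scalar model of what vpermi2b does per byte may clarify why it saves both
instructions and constants (illustrative only; function and parameter names
are made up):

```cpp
#include <cstdint>

// Each index byte selects one of 128 bytes from two concatenated
// 64-byte table registers: bit 6 of the index picks the table, bits
// 0-5 pick the byte within it. A single instruction thus replaces a
// two-shuffle-plus-blend sequence and the extra vector constants that
// sequence would need.
inline std::uint8_t permi2b_scalar(const std::uint8_t table_a[64],
                                   const std::uint8_t table_b[64],
                                   std::uint8_t index)
{
    std::uint8_t pos = index & 0x3F;                      // bits 0-5: position
    return (index & 0x40) ? table_b[pos] : table_a[pos];  // bit 6: which table
}
```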
The unsuccessful parsing case depends on where the error happens, as the
generic version may terminate sooner if the error is detected at the
beginning of the input string, while the SIMD version performs roughly
the same amount of work but faster. Here are some examples for 8-bit
character types (for larger types the numbers are more or less comparable):
Error | Generic | SSE4.1 | AVX2 | AVX512v1 | AVX10.1
===================+==========+=================+=================+=================+================
EOI at 35 chars | 43.629 | 356.562 (8.17x) | 326.311 (7.48x) | 322.377 (7.39x) | 308.155 (7.06x)
EOI at 1 char | 2645.783 | 444.769 (0.17x) | 400.275 (0.15x) | 404.826 (0.15x) | 403.730 (0.15x)
Missing dash at 23 | 73.878 | 514.303 (6.96x) | 474.694 (6.43x) | 507.949 (6.88x) | 474.077 (6.42x)
Missing dash at 8 | 223.921 | 516.641 (2.31x) | 472.737 (2.11x) | 506.242 (2.26x) | 473.718 (2.12x)
Illegal char at 35 | 47.373 | 368.002 (7.77x) | 333.233 (7.03x) | 318.242 (6.72x) | 301.659 (6.37x)
Illegal char at 0 | 1729.087 | 421.511 (0.24x) | 385.217 (0.22x) | 374.047 (0.22x) | 351.944 (0.20x)
The above table is collected with BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B
defined.
In general, only very early errors tend to perform worse in the SIMD
version, and the majority of cases are still faster.
Besides BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B, the implementation also has
BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM control macro, which, if defined, enables
the use of 512-bit registers for converting from 32-bit character types to 8-bit
integers. This code path is also slower than the 256-bit path on Golden Cove,
and therefore is disabled. The macro is provided primarily to allow for tuning
and experimentation with newer CPU microarchitectures, where the 512-bit path
may become beneficial. All of the above performance numbers were produced
without it.
Moved the generic to_chars implementation to a separate header and made
to_chars.hpp select the implementation based on the enabled SIMD ISA
extensions. Added an x86 implementation leveraging SSSE3 and later
vector extensions. Added detection of these extensions to config.hpp.
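The core operation the SSSE3 path vectorizes is a nibble-to-hex-digit table
lookup. A scalar sketch of the idea (the real code works on whole vectors;
the function name is illustrative):

```cpp
#include <cstdint>

// Each byte is split into its two nibbles and both are mapped through
// a 16-entry hex digit table. With SSSE3 the table lives in an XMM
// register and a single pshufb performs all 16 lookups at once.
inline void byte_to_hex(std::uint8_t byte, char out[2])
{
    static const char digits[] = "0123456789abcdef";
    out[0] = digits[byte >> 4];   // high nibble first
    out[1] = digits[byte & 0x0F]; // low nibble second
}
```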
The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
in millions of to_chars() calls per second with a 16-byte aligned output buffer:
Char | Generic | SSE4.1 | AVX2 | AVX-512
=========+=========+==================+==================+=================
char | 203.190 | 1059.322 (5.21x) | 1053.352 (5.18x) | 1058.089 (5.21x)
char16_t | 184.003 | 848.356 (4.61x) | 1009.489 (5.49x) | 1011.122 (5.50x)
char32_t | 202.425 | 484.801 (2.39x) | 676.338 (3.34x) | 462.770 (2.29x)
The core of the SIMD implementation uses 128-bit vectors; larger vectors
are only used to convert to the target character types. This means that for
1-byte character types all vector implementations are basically the same
(barring the extra ISA flexibility added by AVX) and for 2-byte character
types AVX2 and AVX-512 are basically the same.
For 4-byte character types, AVX-512 showed worse performance than SSE4.1 and
AVX2 on the test system. It isn't clear why that is, but it is possible that
the CPU throttles 512-bit instructions so much that the performance drops
below a 256-bit equivalent. Perhaps there are just not enough 512-bit
instructions for the CPU to power up the full 512-bit pipeline. Therefore,
the AVX-512 code path for 4-byte character types is currently disabled and
the AVX2 path is used instead (which makes AVX2 and AVX-512 versions basically
equivalent). The AVX-512 path can be re-enabled if new CPU
microarchitectures appear that benefit from it.
Higher alignment values of the output buffer were also tested, but they did not
meaningfully improve performance.