This allows testing that the ISA-specific code at least compiles,
even if running the tests isn't possible.
The support is only added to b2; CMake still always compiles and runs
the tests, in order to keep using boost_test_jamfile for easier
maintenance. In the future, similar support can be added to CMake as well.
This adds SSE2 and SSSE3 code paths to from_chars_x86.hpp. The performance
effect on Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of
successful from_chars() calls per second:
Char     | Generic | SSE2            | SSSE3           | SSE4.1          | AVX2            | AVX512v1
=========+=========+=================+=================+=================+=================+================
char     |  40.475 | 327.791 (8.10x) | 465.857 (11.5x) | 555.346 (13.7x) | 504.648 (12.5x) | 539.700 (13.3x)
char16_t |  38.757 | 292.048 (7.54x) | 401.117 (10.3x) | 478.574 (12.3x) | 426.188 (11.0x) | 416.205 (10.7x)
char32_t |  50.200 | 150.900 (3.01x) | 204.588 (4.08x) | 389.882 (7.77x) | 359.591 (7.16x) | 349.663 (6.97x)
In addition, the workarounds to avoid (v)pblendvb instructions have been
extended to Intel Haswell and Broadwell, as these microarchitectures have
poor performance with these instructions (including the SSE4.1 pblendvb).
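For illustration, a byte blend of the kind (v)pblendvb performs can be
emulated with plain SSE2 logic instructions when the mask bytes are known
to be all-ones or all-zeros (as is the case for comparison results). A
minimal sketch, not the actual Boost.UUID internals:

    #include <emmintrin.h> // SSE2

    // Selects bytes of b where the mask byte is 0xFF and bytes of a where
    // it is 0x00; equivalent to pblendvb for such masks, but avoids the
    // instruction.
    inline __m128i blend_bytes_sse2(__m128i a, __m128i b, __m128i mask)
    {
        return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
    }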
Two new experimental control macros have been added:
BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB and
BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB. The former indicates that
(v)pblendvb instructions are slow on the target microarchitectures and
should be avoided. The latter indicates that (v)pblendvb should be used
by the implementation; it is derived from the former and takes precedence
over it. As before, these macros can be used for experimenting and
fine-tuning performance for specific target CPUs. By default,
BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB is defined for Haswell/Broadwell
or if AVX is detected.
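A sketch of how these defaults could be wired up in the preprocessor
(illustrative only; the exact conditions in the actual code may differ,
and __AVX__ merely stands in for "AVX is detected"):

    #if !defined(BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB) && defined(__AVX__)
    // ...a Haswell/Broadwell check would also go here...
    #define BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB
    #endif

    // A user-defined BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB takes
    // precedence; otherwise it is derived from the SLOW_PBLENDVB macro.
    #if !defined(BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB) && \
        !defined(BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB)
    #define BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB
    #endif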
Lastly, the selection between blend-based and shuffle-based character
code conversion has been unified across the various places in the code;
it is now controlled by a single internal macro,
BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS.
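Conceptually, the internal macro then follows from the public controls
along these lines (again an illustration, not the exact logic):

    #if defined(BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB)
    #define BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS
    #endif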
This adds SSE2 code paths to to_chars_x86.hpp. The performance effect on
Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of to_chars() calls
per second with a 16-byte aligned output buffer:
Char     | Generic | SSE2            | SSE4.1           | AVX2             | AVX10.1
=========+=========+=================+==================+==================+=================
char     | 202.314 | 564.857 (2.79x) | 1194.772 (5.91x) | 1192.094 (5.89x) | 1191.838 (5.89x)
char16_t | 188.532 | 457.281 (2.43x) |  795.798 (4.22x) |  935.016 (4.96x) |  938.368 (4.98x)
char32_t | 193.151 | 345.612 (1.79x) |  489.620 (2.53x) |  688.829 (3.57x) |  689.617 (3.57x)
Here, the Generic column was generated with BOOST_UUID_NO_SIMD defined,
and the SSE2 column with -march=x86-64. SSE2 support can be useful when
users need to stay compatible with the base x86-64 ISA.
The load/store helpers use memcpy internally, which is the more correct
way to load and store integers from/to unaligned memory, and to perform
type punning where needed. In particular, it should silence UBSAN errors
about unaligned memory accesses in SIMD algorithms.
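A minimal sketch of what such a memcpy-based helper can look like (the
names are illustrative, not the actual Boost.UUID helpers):

    #include <cstdint>
    #include <cstring>

    // Well-defined for unaligned addresses, unlike dereferencing a pointer
    // obtained via reinterpret_cast; compilers lower this to a single move.
    inline std::uint64_t load_unaligned_u64(const void* p) noexcept
    {
        std::uint64_t v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }

    inline void store_unaligned_u64(void* p, std::uint64_t v) noexcept
    {
        std::memcpy(p, &v, sizeof(v));
    }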
The builtins are sometimes better optimized than the libc function
calls. They also don't require the <cstring> include.
Added an unqualified memcpy function that simply calls either the builtin
or the libc function. This function is intended as a drop-in replacement
for libc memcpy calls where constexpr friendliness is not important. It
is still marked constexpr so that it can be mentioned in other constexpr
functions. To avoid early checks of whether its body can be evaluated in
the context of a constant expression, it is defined as a dummy template.
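A sketch of such a wrapper, assuming a GCC/Clang-style __builtin_memcpy
(the actual definition in the code may differ in details):

    #include <cstddef>
    #include <cstring>

    namespace detail {

    // The unused template parameter makes this a dummy template, so the
    // compiler does not check up front whether the body can appear in a
    // constant expression; that is only diagnosed on actual constexpr use.
    template< typename = void >
    constexpr void* memcpy(void* dst, const void* src, std::size_t size) noexcept
    {
    #if defined(__GNUC__) || defined(__clang__)
        return __builtin_memcpy(dst, src, size);
    #else
        return std::memcpy(dst, src, size);
    #endif
    }

    } // namespace detail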
Marked all functions as noexcept.