The load/store helpers now use memcpy internally, which is the well-defined
way to load and store integers from/to unaligned memory and to perform
type punning. In particular, it should silence UBSan errors about
unaligned memory accesses in SIMD algorithms.
The builtins are sometimes optimized more aggressively than the libc function
calls. They also don't require the <cstring> include.
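A minimal sketch of such a load helper, assuming GCC/Clang-style builtins
(names are illustrative, not the library's actual helpers):

    #include <cstdint>
    #include <cstring>

    inline std::uint32_t load_u32(const void* p) noexcept
    {
        std::uint32_t v;
    #if defined(__GNUC__)
        __builtin_memcpy(&v, p, sizeof(v)); // builtin: no <cstring> needed
    #else
        std::memcpy(&v, p, sizeof(v)); // well-defined unaligned, type-punned read
    #endif
        return v;
    }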
Added an unqualified memcpy function that simply calls either the builtin or
the libc function. This function is intended as a drop-in replacement
for libc memcpy calls where constexpr friendliness is not important.
It is still marked constexpr so that it can be mentioned in other constexpr
functions. To avoid early checking of whether its body can be evaluated in
the context of a constant expression, it is defined as a dummy template.
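A sketch of the dummy-template trick described above (illustrative, not the
exact library code):

    #include <cstddef>

    namespace detail {

    // Making this a template defers checking whether the body is a valid
    // constant expression until the function is instantiated, so marking
    // it constexpr compiles even though memcpy itself cannot actually be
    // evaluated at compile time.
    template< typename = void >
    constexpr void* memcpy(void* dst, const void* src, std::size_t n) noexcept
    {
        return __builtin_memcpy(dst, src, n); // or std::memcpy via libc
    }

    } // namespace detail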
Marked all functions as noexcept.
Following from_chars_x86.hpp, allow users to explicitly enable 512-bit
vectors in to_chars by defining BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM.
This is primarily to allow for experimenting and tuning performance on
newer CPU microarchitectures.
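For example (an illustrative compiler invocation; -march=x86-64-v4 implies
the relevant AVX-512 extensions):

    g++ -O3 -march=x86-64-v4 -DBOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM benchmark.cpp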
This adds SSE4.1, AVX2, AVX-512v1 and AVX10.1 implementations of the
from_chars algorithm. The generic implementation is moved to its own
header, and its constexpr is relaxed so that it is only enabled when
is_constant_evaluated is supported.
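The dispatch between the generic and SIMD paths looks roughly like this
(a sketch with hypothetical macro and helper names; uuid here stands for
boost::uuids::uuid):

    constexpr bool from_chars(const char* first, const char* last, uuid& u) noexcept
    {
    #if defined(BOOST_UUID_HAS_IS_CONSTANT_EVALUATED) // assumed feature macro
        if (!detail::is_constant_evaluated())
            return from_chars_simd(first, last, u); // runtime-only SIMD path
    #endif
        return from_chars_generic(first, last, u); // constexpr-friendly path
    }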
The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
in millions of successful from_chars() calls per second:
Char     | Generic | SSE4.1          | AVX2            | AVX512v1        | AVX10.1
=========+=========+=================+=================+=================+================
char     | 38.571  | 560.645 (14.5x) | 501.505 (13.0x) | 540.038 (14.0x) | 480.778 (12.5x)
char16_t | 37.998  | 479.308 (12.6x) | 425.728 (11.2x) | 416.379 (11.0x) | 392.326 (10.3x)
char32_t | 50.327  | 391.313 (7.78x) | 359.312 (7.14x) | 349.849 (6.95x) | 333.979 (6.64x)
The AVX2 version is slightly slower than SSE4.1 because on Intel
microarchitectures the VEX-coded vpblendvb instruction is slower than
the legacy SSE4.1 pblendvb. The code contains workarounds for this, which
have a slight performance overhead compared to the SSE4.1 version but are
still faster than using vpblendvb. Alternatively, performance could be
improved by using asm blocks to force pblendvb in AVX2 code, but this
may cause SSE/AVX transition penalties if the target vector register
happens to have "dirty" upper bits. There is no way to ensure this
doesn't happen, so this is not implemented. AVX512v1 claws back some of
the performance and uses fewer instructions (i.e. smaller code size).
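One workaround of this kind replaces the blend with a bitwise select when
the mask bytes are known to be all-ones or all-zeros (a sketch, not
necessarily the library's exact code):

    #include <immintrin.h>

    // (b & mask) | (a & ~mask): selects bytes of b where the mask byte is
    // 0xFF. Unlike vpblendvb, which only tests the top bit of each byte,
    // this needs full-byte masks, but compiles to cheap and/andn/or ops.
    inline __m256i blend_bytes(__m256i a, __m256i b, __m256i mask)
    {
        return _mm256_or_si256(_mm256_and_si256(mask, b),
                               _mm256_andnot_si256(mask, a));
    }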
The AVX10.1 version is slower, as it uses the vpermi2b instruction from
AVX512_VBMI, which is relatively slow on Intel. The instruction allows
reducing the number of instructions even further, as well as the number
of vector constants. It is faster on AMD Zen 4 and should offer better
performance than the AVX512v1 code path there, although this wasn't
tested. This code path is disabled by default, unless
BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B is defined, which can be used to
test and tune performance on AMD and newer Intel CPU microarchitectures.
Thus, by default, AVX10.1 performance should be roughly equivalent to
AVX512v1, barring compiler (mis)optimizations.
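For reference, vpermi2b performs a byte-granular lookup across two 64-byte
tables in a single instruction (an assumed usage sketch, not the library's
exact code):

    #include <immintrin.h>

    // Each byte of idx (0..127) selects one byte from the 128-byte
    // concatenation of table_lo and table_hi (requires AVX512_VBMI).
    inline __m512i lookup128(__m512i idx, __m512i table_lo, __m512i table_hi)
    {
        return _mm512_permutex2var_epi8(table_lo, idx, table_hi);
    }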
Performance in the unsuccessful parsing case depends on where the error
happens: the generic version may terminate sooner if the error is detected
at the beginning of the input string, while the SIMD versions perform
roughly the same amount of work regardless, only faster. Here are some
examples for 8-bit character types (for larger character types the numbers
are more or less comparable):
Error              | Generic  | SSE4.1          | AVX2            | AVX512v1        | AVX10.1
===================+==========+=================+=================+=================+================
EOI at 35 chars    | 43.629   | 356.562 (8.17x) | 326.311 (7.48x) | 322.377 (7.39x) | 308.155 (7.06x)
EOI at 1 char      | 2645.783 | 444.769 (0.17x) | 400.275 (0.15x) | 404.826 (0.15x) | 403.730 (0.15x)
Missing dash at 23 | 73.878   | 514.303 (6.96x) | 474.694 (6.43x) | 507.949 (6.88x) | 474.077 (6.42x)
Missing dash at 8  | 223.921  | 516.641 (2.31x) | 472.737 (2.11x) | 506.242 (2.26x) | 473.718 (2.12x)
Illegal char at 35 | 47.373   | 368.002 (7.77x) | 333.233 (7.03x) | 318.242 (6.72x) | 301.659 (6.37x)
Illegal char at 0  | 1729.087 | 421.511 (0.24x) | 385.217 (0.22x) | 374.047 (0.22x) | 351.944 (0.20x)
The above table was collected with BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B
defined.
In general, only very early errors tend to perform worse in the SIMD
version; the majority of cases are still faster.
Besides BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B, the implementation also has
the BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM control macro, which, if defined,
enables the use of 512-bit registers for converting 32-bit character types
to 8-bit integers. This code path is also slower than the 256-bit path on
Golden Cove and is therefore disabled by default. The macro is provided
primarily to allow tuning and experimentation on newer CPU
microarchitectures, where the 512-bit path may become beneficial. All of
the above performance numbers were produced without it.
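A sketch of the 512-bit narrowing step the macro enables (assumed shape;
the actual code may differ):

    #include <immintrin.h>

    // Load 16 32-bit code units into one ZMM register and truncate them
    // to 16 bytes with a single vpmovdb instruction.
    inline __m128i narrow16(const char32_t* p)
    {
        return _mm512_cvtepi32_epi8(_mm512_loadu_si512(p));
    }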
The new BOOST_UUID_USE_AVX512_V1 config macro indicates the presence of
the AVX-512 F, VL, CD, BW and DQ extensions, which are supported by Intel
Skylake-X and similar processors. BOOST_UUID_USE_AVX10_1 is still
retained and indicates support for the full AVX10.1 set. For now, it only
adds support for VBMI, but this list may grow in the future as new
extensions are utilized.
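In terms of compiler-defined macros, the new config macro roughly
corresponds to the following (an illustrative sketch, not the exact
config code):

    #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512CD__) && \
        defined(__AVX512BW__) && defined(__AVX512DQ__)
    #define BOOST_UUID_USE_AVX512_V1
    #endif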
This helper was used to simplify support for older CPUs by selecting
between the _mm_loadu_si128 and _mm_lddqu_si128 intrinsics. That code
has long been removed, and we now always use _mm_loadu_si128 to load
data. Use the intrinsic directly everywhere.
The simd_vector template is a wrapper around an array of elements that
allows automatically reading that array as a SIMD vector. This reduces the
number of reinterpret_casts in SIMD code that uses constants.
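A minimal sketch of such a wrapper (illustrative; the actual simd_vector
template may differ):

    #include <emmintrin.h>
    #include <cstddef>

    template< typename T, std::size_t N >
    struct simd_vector
    {
        alignas(16) T data[N];

        // Read the array as a SIMD vector, keeping the reinterpret_cast
        // in one place instead of at every use site.
        __m128i load() const noexcept
        {
            return _mm_load_si128(reinterpret_cast< const __m128i* >(data));
        }
    };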
This avoids potential character code conversion in ostream and instead
produces the native character type directly in to_chars, which is likely
much faster.
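The pattern is roughly the following (a sketch with a hypothetical helper
signature; uuid here stands for boost::uuids::uuid):

    #include <ostream>

    template< typename Ch, typename Traits >
    std::basic_ostream< Ch, Traits >&
    operator<< (std::basic_ostream< Ch, Traits >& os, uuid const& u)
    {
        Ch buf[36]; // canonical UUID string length
        detail::to_chars(u, buf); // formats directly as Ch, no codecvt involved
        return os.write(buf, 36);
    }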
This removes code duplication with from_chars and allows reusing the
faster implementation of from_chars in operator>>.
Also, align the input character buffer for more efficient memory
accesses.
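A sketch of the resulting operator>> shape (hypothetical helper name; the
actual detail-level API may differ):

    #include <istream>
    #include <ios>

    template< typename Ch, typename Traits >
    std::basic_istream< Ch, Traits >&
    operator>> (std::basic_istream< Ch, Traits >& is, uuid& u)
    {
        alignas(16) Ch buf[36]; // aligned for efficient SIMD loads
        if (!(is.read(buf, 36) && detail::from_chars(buf, buf + 36, u)))
            is.setstate(std::ios_base::failbit);
        return is;
    }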
Also clarified the meaning of BOOST_UUID_USE_AVX10_1 in the docs, as the
previous wording could be read as indicating support for the subset of
AVX-512 that is supported by Skylake-X.