mirror of https://github.com/boostorg/uuid.git synced 2026-01-19 04:42:16 +00:00

737 Commits

Author SHA1 Message Date
Peter Dimov
2ce9519afc Avoid -Wsign-conversion warning in detail/sha1.hpp 2026-01-05 16:54:54 +02:00
Peter Dimov
fd167bba0d Avoid -Wsign-conversion warning in time_generator_v1.hpp 2026-01-05 16:54:54 +02:00
Peter Dimov
a835ddff90 Avoid -Wsign-conversion warning in time_generator_v7.hpp 2026-01-05 16:54:54 +02:00
Peter Dimov
91ffab27d2 Enable stricter warnings (matching Unordered) in test/Jamfile.v2 2026-01-05 16:54:54 +02:00
Peter Dimov
db92124922 Reorder includes 2026-01-05 16:02:11 +02:00
Peter Dimov
0038762216 Merge pull request #186 from Lastique/feature/from_chars_simd
Add SIMD implementation of `from_chars`
2026-01-05 15:53:07 +02:00
Andrey Semashev
0e23b235fc Use load/store helpers from endian.hpp in to/from_chars_x86.hpp.
The load/store helpers use memcpy internally, which is a more correct
way to load and store integers from/to unaligned memory and with
potential type punning. In particular, it should silence UBSAN errors
about unaligned memory accesses in SIMD algorithms.
2026-01-05 14:13:10 +03:00
Andrey Semashev
3698f8df2c Added a missing include. 2026-01-05 14:13:10 +03:00
Andrey Semashev
9dde4978fd Use memcpy/memset/memcmp functions from cstring.hpp in endian.hpp.
This lets integer reads/writes benefit from compiler intrinsics,
when possible.
2026-01-05 14:13:10 +03:00
Andrey Semashev
d358c39a67 Use __builtin_memcpy/memcmp in cstring.hpp.
The builtins are sometimes better optimized than calls to the libc
functions. They also don't need the <cstring> include.

Added unqualified memcpy function that simply calls either the builtin or
the libc function. This function is intended to be a drop-in replacement
for the libc memcpy calls, where constexpr friendliness is not important.
It is still marked as constexpr to allow mentioning it in other constexpr
functions. To avoid early checks of whether its body can be evaluated in the
context of a constant expression, it is defined as a dummy template.

Marked all functions as noexcept.
2026-01-05 14:13:10 +03:00
Andrey Semashev
02574368fc Added GitHub Actions job on Rocketlake ISA. 2026-01-05 14:13:10 +03:00
Andrey Semashev
831a9e6eab Added more tests for from_chars verifying unexpected end of input.
The added tests check unexpected end of input on even and odd character
positions, since these are handled separately in SIMD.
2026-01-05 14:13:10 +03:00
Andrey Semashev
7e50b1aaa7 Added running from_chars tests with SIMD disabled.
Also added to/from_chars tests to CMakeLists.txt.
2026-01-05 14:13:10 +03:00
Andrey Semashev
b7535347ec Allow users to enable 512-bit vectors in to_chars_x86.hpp.
Following from_chars_x86.hpp, allow users to explicitly enable 512-bit
vectors in to_chars by defining BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM.
This is primarily to allow for experimenting and tuning performance on
newer CPU microarchitectures.
2026-01-05 14:13:10 +03:00
Andrey Semashev
3920cc584c Added x86 SIMD implementation of from_chars.
This adds SSE4.1, AVX2, AVX-512v1 and AVX10.1 implementations of the
from_chars algorithm. The generic implementation is moved to its own
header, and constexpr is relaxed so that it is only enabled when
is_constant_evaluated is supported.

The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
in millions of successful from_chars() calls per second:

Char     | Generic | SSE4.1          | AVX2            | AVX512v1        | AVX10.1
=========+=========+=================+=================+=================+================
char     |  38.571 | 560.645 (14.5x) | 501.505 (13.0x) | 540.038 (14.0x) | 480.778 (12.5x)
char16_t |  37.998 | 479.308 (12.6x) | 425.728 (11.2x) | 416.379 (11.0x) | 392.326 (10.3x)
char32_t |  50.327 | 391.313 (7.78x) | 359.312 (7.14x) | 349.849 (6.95x) | 333.979 (6.64x)

The AVX2 version is slightly slower than SSE4.1 because on Intel
microarchitectures the VEX-coded vpblendvb instruction is slower than
the legacy SSE4.1 pblendvb. The code contains workarounds for this, which
have slight performance overhead compared to SSE4.1 version, but are still
faster than using vpblendvb. Alternatively, the performance could be
improved by using asm blocks to force using pblendvb in AVX2 code, but this
may potentially cause SSE/AVX transition penalties if the target vector
register happens to have "dirty" upper bits. There's no way to ensure this
doesn't happen, so this is not implemented. AVX512v1 claws back some
performance and uses fewer instructions (i.e. smaller code size).

The AVX10.1 version is slower as it uses vpermi2b instruction from AVX512_VBMI,
which is relatively slow on Intel. It allows both the number of instructions
and the number of vector constants to be reduced even further. The
instruction is faster on AMD Zen 4 and should offer better performance compared
to AVX512v1 code path, although it wasn't tested. This code path is disabled
by default, unless BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B is defined, which
can be used to test and tune performance on AMD and newer Intel CPU
microarchitectures. Thus, by default, AVX10.1 performance should be roughly
equivalent to AVX512v1, barring compiler (mis)optimizations.

The unsuccessful parsing case depends on where the error happens, as the
generic version may terminate sooner if the error is detected at the
beginning of the input string, while the SIMD version performs roughly
the same amount of work but faster. Here are some examples for 8-bit
character types (for larger types the numbers are more or less comparable):

Error              | Generic  | SSE4.1          | AVX2            | AVX512v1        | AVX10.1
===================+==========+=================+=================+=================+================
EOI at 35 chars    |   43.629 | 356.562 (8.17x) | 326.311 (7.48x) | 322.377 (7.39x) | 308.155 (7.06x)
EOI at 1 char      | 2645.783 | 444.769 (0.17x) | 400.275 (0.15x) | 404.826 (0.15x) | 403.730 (0.15x)
Missing dash at 23 |   73.878 | 514.303 (6.96x) | 474.694 (6.43x) | 507.949 (6.88x) | 474.077 (6.42x)
Missing dash at 8  |  223.921 | 516.641 (2.31x) | 472.737 (2.11x) | 506.242 (2.26x) | 473.718 (2.12x)
Illegal char at 35 |   47.373 | 368.002 (7.77x) | 333.233 (7.03x) | 318.242 (6.72x) | 301.659 (6.37x)
Illegal char at 0  | 1729.087 | 421.511 (0.24x) | 385.217 (0.22x) | 374.047 (0.22x) | 351.944 (0.20x)

The above table is collected with BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B
defined.

In general, only the very early errors tend to perform worse in the SIMD
version and the majority of cases are still faster.

Besides BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B, the implementation also has
BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM control macro, which, if defined, enables
usage of 512-bit registers for converting from 32-bit character types to 8-bit
integers. This code path is also slower than the 256-bit path on Golden Cove,
and therefore is disabled. The macro is provided primarily to allow for tuning
and experimentation with newer CPU microarchitectures, where the 512-bit path
may become beneficial. All of the above performance numbers were produced
without it.
2026-01-05 14:13:10 +03:00
Andrey Semashev
d0c74979a9 Separated Skylake-X level of AVX-512 to a new config macro.
The new BOOST_UUID_USE_AVX512_V1 config macro indicates presence of
AVX-512F, VL, CD, BW and DQ extensions, which are supported by Intel
Skylake-X and similar processors. BOOST_UUID_USE_AVX10_1 is still
retained and indicates support for full AVX10.1 set. For now, it only
adds support for VBMI, but this list may grow in the future as new
extensions are being utilized.
2026-01-05 14:13:10 +03:00
Andrey Semashev
f7718fd7cc Removed load_unaligned_si128 helper function.
This helper was used to simplify support for older CPUs, to select
between _mm_loadu_si128 and _mm_lddqu_si128 intrinsics. That code
has long been removed, and we now always use _mm_loadu_si128 to load
data. Use the intrinsic directly everywhere.
2026-01-05 14:13:10 +03:00
Andrey Semashev
1f7875c97e Added a simd_vector utility to simplify SIMD constants definition.
The simd_vector template is a wrapper around an array of elements that
can automatically be read as a SIMD vector. This reduces the number of
reinterpret_casts in SIMD code that uses constants.
2026-01-05 14:13:10 +03:00
Andrey Semashev
d3b72c2b71 Extracted from_chars_result to a separate header. 2026-01-05 14:13:10 +03:00
Andrey Semashev
79ecd9f563 Suppress 'conditional expression is constant' warning on MSVC. 2026-01-05 14:13:10 +03:00
Peter Dimov
9684559de7 Add Windows jobs to ci.yml using different /arch: values 2026-01-04 20:04:54 +02:00
Peter Dimov
83ab39d277 Update documentation 2026-01-04 04:10:34 +02:00
Peter Dimov
d5cf5c4656 Simplify operator>> 2026-01-04 04:06:55 +02:00
Peter Dimov
b24e19d64e Merge pull request #187 from Lastique/feature/update_iostream_ops
Update iostream operators
2026-01-04 04:02:54 +02:00
Andrey Semashev
211d84bdb1 Use stream character type when calling to_chars in operator<<.
This avoids potential character code conversion in ostream and instead
produces native character type directly in to_chars, which is likely
much faster.
2026-01-03 03:50:38 +03:00
Andrey Semashev
45910f2ace Use from_chars in operator>>.
This removes code duplication with from_chars and allows for reusing
a faster implementation of from_chars in operator>>.

Also, align the input character buffer for more efficient memory
accesses.
2026-01-03 03:49:48 +03:00
Peter Dimov
326e5db863 Merge pull request #185 from Lastique/feature/from_chars_result_op_bool
Add `from_chars_result::operator bool()`
2025-12-29 13:08:27 +02:00
Peter Dimov
6ce513ef20 Merge pull request #184 from Lastique/feature/to_chars_simd
Add SIMD implementation of `to_chars`
2025-12-28 12:09:20 +02:00
Andrey Semashev
454de03dbd Updated docs with the new SIMD macros, added a release note for SIMD to_chars.
Also clarified the meaning of BOOST_UUID_USE_AVX10_1 in the docs, as the previous
wording could be taken to mean that it indicates support for a subset of AVX-512
that is supported in Skylake-X.
2025-12-27 23:53:21 +03:00
Andrey Semashev
ef9c903055 Reorder macos parameters in GitHub Actions CI for better readability in web UI. 2025-12-27 23:53:21 +03:00
Andrey Semashev
e6fe1c45d9 Added GitHub Actions jobs for AVX2-enabled target. 2025-12-27 23:53:21 +03:00
Andrey Semashev
2508c5434e Added running IO and to_chars tests with SIMD disabled. 2025-12-27 23:53:21 +03:00
Andrey Semashev
84afcf6372 Align buffers for to_chars for better performance of SIMD implementation. 2025-12-27 23:53:21 +03:00
Andrey Semashev
839c431152 Added x86 SIMD implementation of to_chars.
Moved the generic to_chars implementation to a separate header and made
to_chars.hpp select the implementation based on the enabled SIMD ISA
extensions. Added an x86 implementation leveraging SSSE3 and later
vector extensions. Added detection of the said extensions to config.hpp.

The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
in millions of to_chars() calls per second with a 16-byte aligned output buffer:

Char     | Generic | SSE4.1           | AVX2             | AVX-512
=========+=========+==================+==================+=================
char     | 203.190 | 1059.322 (5.21x) | 1053.352 (5.18x) | 1058.089 (5.21x)
char16_t | 184.003 |  848.356 (4.61x) | 1009.489 (5.49x) | 1011.122 (5.50x)
char32_t | 202.425 |  484.801 (2.39x) |  676.338 (3.34x) |  462.770 (2.29x)

The core of the SIMD implementation uses 128-bit vectors; larger vectors
are only used to convert to the target character types. This means that for
1-byte character types all vector implementations are basically the same
(barring the extra ISA flexibility added by AVX) and for 2-byte character
types AVX2 and AVX-512 are basically the same.

For 4-byte character types, AVX-512 showed worse performance than SSE4.1 and
AVX2 on the test system. It isn't clear why that is, but it is possible that
the CPU throttles 512-bit instructions so much that the performance drops
below a 256-bit equivalent. Perhaps there are just not enough 512-bit
instructions for the CPU to power up the full 512-bit pipeline. Therefore,
the AVX-512 code path for 4-byte character types is currently disabled and
the AVX2 path is used instead (which makes AVX2 and AVX-512 versions basically
equivalent). The AVX-512 path can be enabled again if new CPU microarchitectures
appear that will benefit from it.

Higher alignment values of the output buffer were also tested, but they did not
meaningfully improve performance.
2025-12-27 23:52:15 +03:00
Peter Dimov
f797b2617f Improve from_chars, taking advantage of the fact that 0-9, A-F and a-f are consecutive in both ASCII and EBCDIC 2025-12-26 22:18:59 +02:00
Peter Dimov
b945f3ee23 Update documentation 2025-12-26 21:33:11 +02:00
Andrey Semashev
31bc20e30e Added from_chars_result::operator bool(). 2025-12-26 21:58:49 +03:00
Peter Dimov
c8f97785a4 Prefer SIMD/runtime performance over constexpr-ness when __builtin_is_constant_evaluated is not available 2025-12-26 19:27:20 +02:00
Peter Dimov
c963533f73 Make to_chars constexpr 2025-12-25 19:20:58 +02:00
Peter Dimov
e78661b1e2 Suppress msvc-14.1 warnings in test_hash_value_cx 2025-12-25 13:48:11 +02:00
Peter Dimov
963bb373d0 Attempt to fix GCC 5 2025-12-25 13:34:49 +02:00
Peter Dimov
09dd0dd608 Make hash_value constexpr 2025-12-25 12:41:38 +02:00
Peter Dimov
bd765b558c Make uuid::is_nil, swap, and relational operators constexpr 2025-12-25 06:44:07 +02:00
Peter Dimov
5a5797d465 Add more constexpr to class uuid 2025-12-24 17:45:11 +02:00
Peter Dimov
bfee791f47 Update documentation 2025-12-23 19:48:35 +02:00
Peter Dimov
dd1145756e Add VS2026 to .drone.jsonnet 2025-12-23 12:39:04 +02:00
Peter Dimov
da0ce8406a Disable test_string_generator_cx2 under GCC 5 2025-12-23 11:37:33 +02:00
Peter Dimov
7b063dde86 Fix return type of from_chars_is_* functions 2025-12-23 11:29:40 +02:00
Peter Dimov
83d3da399c Add test_string_generator_cx2.cpp 2025-12-23 04:21:49 +02:00
Peter Dimov
a851cf33ca Add test_string_generator_2.cpp 2025-12-23 04:15:37 +02:00