Moved the generic to_chars implementation to a separate header and made
to_chars.hpp select the implementation based on the enabled SIMD ISA
extensions. Added an x86 implementation leveraging SSSE3 and later
vector extensions. Added detection of these extensions to config.hpp.
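For reference, the detection amounts to testing the compiler-predefined
macros for the respective ISA extensions, roughly as in the sketch below.
The macro names defined here are placeholders for illustration, not the
actual names used in config.hpp, and the exact feature set checked for
AVX-512 is an assumption:

  // Rough sketch of the ISA detection; placeholder macro names.
  #if defined(__SSSE3__)
  #define TO_CHARS_HAS_SSSE3
  #endif
  #if defined(__SSE4_1__)
  #define TO_CHARS_HAS_SSE4_1
  #endif
  #if defined(__AVX2__)
  #define TO_CHARS_HAS_AVX2
  #endif
  // Byte/word operations on 512-bit vectors require AVX-512BW
  // in addition to the AVX-512 foundation.
  #if defined(__AVX512F__) && defined(__AVX512BW__)
  #define TO_CHARS_HAS_AVX512
  #endif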
The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3,
measured in millions of to_chars() calls per second with a 16-byte aligned output buffer:
Char     | Generic | SSE4.1           | AVX2             | AVX-512
=========+=========+==================+==================+=================
char     | 203.190 | 1059.322 (5.21x) | 1053.352 (5.18x) | 1058.089 (5.21x)
char16_t | 184.003 |  848.356 (4.61x) | 1009.489 (5.49x) | 1011.122 (5.50x)
char32_t | 202.425 |  484.801 (2.39x) |  676.338 (3.34x) |  462.770 (2.29x)
The core of the SIMD implementation uses 128-bit vectors; larger vectors
are only used to convert to the target character types. This means that for
1-byte character types all vector implementations are essentially the same
(barring the extra ISA flexibility added by AVX), and for 2-byte character
types the AVX2 and AVX-512 code paths are essentially the same.
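To illustrate the difference, here is a sketch using raw intrinsics (not the
actual library code; the function names are made up): once the 128-bit core
has produced 16 ASCII digit bytes, the wider ISAs only change how those bytes
are zero-extended to the destination character type. For char16_t, SSE4.1
needs two conversions while AVX2 handles all 16 characters at once, which
matches the gap between those two columns above:

  #include <immintrin.h>

  // Sketch only: zero-extend 16 digit bytes produced by the 128-bit core
  // to char16_t.
  inline void widen16_sse41(__m128i digits, char16_t* out)
  {
      // SSE4.1: two conversions of 8 characters each.
      __m128i lo = _mm_cvtepu8_epi16(digits);
      __m128i hi = _mm_cvtepu8_epi16(_mm_unpackhi_epi64(digits, digits));
      _mm_storeu_si128(reinterpret_cast<__m128i*>(out), lo);
      _mm_storeu_si128(reinterpret_cast<__m128i*>(out) + 1, hi);
  }

  inline void widen16_avx2(__m128i digits, char16_t* out)
  {
      // AVX2: one conversion covering all 16 characters.
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
          _mm256_cvtepu8_epi16(digits));
  }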
For 4-byte character types, AVX-512 showed worse performance than SSE4.1 and
AVX2 on the test system. It isn't clear why that is, but it is possible that
the CPU throttles 512-bit instructions so much that the performance drops
below a 256-bit equivalent. Perhaps there are just not enough 512-bit
instructions for the CPU to power up the full 512-bit pipeline. Therefore,
the AVX-512 code path for 4-byte character types is currently disabled and
the AVX2 path is used instead (which makes the AVX2 and AVX-512 versions
effectively equivalent). The AVX-512 path can be re-enabled if future CPU
microarchitectures benefit from it.
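As a sketch of what this fallback amounts to (again, illustrative intrinsics
rather than the actual code): under AVX2 the 16 digit bytes are expanded to
char32_t in two halves, whereas the disabled AVX-512 path would collapse this
into a single conversion:

  #include <immintrin.h>

  // Illustrative only: expand 16 digit bytes to 16 char32_t values with AVX2.
  inline void widen32_avx2(__m128i digits, char32_t* out)
  {
      __m256i lo = _mm256_cvtepu8_epi32(digits);                             // low 8 bytes
      __m256i hi = _mm256_cvtepu8_epi32(_mm_unpackhi_epi64(digits, digits)); // high 8 bytes
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), lo);
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(out) + 1, hi);
  }

  // The currently disabled AVX-512 variant would do the same in one step:
  //   _mm512_storeu_si512(out, _mm512_cvtepu8_epi32(digits));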
Higher alignment values of the output buffer were also tested, but they did not
meaningfully improve performance.