For details, see issue #695.
If a filename contains a special character (\n, 0x0a, LF):
- Put '\' (0x5c) at the beginning of the line.
- Escape the special character with '\'.
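A minimal C sketch of the escaping rule above. The function name `demo_format_line` and its buffer-based interface are hypothetical, not xxhsum's actual code. Escaping '\' itself as `\\` is an assumption borrowed from the GNU coreutils checksum convention, because the rule above only names LF. The two spaces between hash and filename follow the usual checksum-line layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: formats "<hash>  <filename>" into out.
 * If the filename contains '\n' (or '\', an assumption taken from
 * GNU coreutils), the whole line is prefixed with '\' and each special
 * character is escaped with '\'.
 * Returns the number of bytes written (excluding NUL), or -1 if cap is too small. */
static int demo_format_line(const char* hex, const char* name, char* out, size_t cap)
{
    size_t n = 0;
    int const escape = (strchr(name, '\n') != NULL) || (strchr(name, '\\') != NULL);

/* Appends one char, always keeping room for the terminating NUL. */
#define PUT(c) do { if (n + 1 >= cap) return -1; out[n++] = (char)(c); } while (0)
    if (escape) PUT('\\');                 /* line-level marker */
    for (const char* p = hex; *p; ++p) PUT(*p);
    PUT(' '); PUT(' ');
    for (const char* p = name; *p; ++p) {
        if (escape && *p == '\n')      { PUT('\\'); PUT('n'); }
        else if (escape && *p == '\\') { PUT('\\'); PUT('\\'); }
        else                           PUT(*p);
    }
#undef PUT
    out[n] = '\0';
    return (int)n;
}
```

For example, hash `abcd` with the filename "a", LF, "b" produces the line `\abcd  a\nb`, while a plain name such as `x.txt` is written unchanged after the hash.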
- XXH_SIZE_OPT is a value from 0 to 2 that indicates how much xxHash
should care about code size; it defaults to 1 for -Os/-Oz and to 0
otherwise.
- XXH_NO_STREAM disables the streaming API.
- These two interact: if XXH_SIZE_OPT == 2 and XXH_NO_STREAM is not
defined, XXH32 and XXH64 use the streaming API for single-shot hashing.
- TODO: apply this to XXH3 as well.
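A sketch of how these defaults and the interaction could be resolved in the preprocessor. It relies on GCC and Clang defining `__OPTIMIZE_SIZE__` under -Os and -Oz. `DEMO_ONESHOT_VIA_STREAM` and the two accessor functions are hypothetical names used only for illustration. This block is not the library's own code.

```c
/* Default: 1 when compiling for size (-Os/-Oz), otherwise 0. */
#ifndef XXH_SIZE_OPT
#  if defined(__OPTIMIZE_SIZE__)
#    define XXH_SIZE_OPT 1
#  else
#    define XXH_SIZE_OPT 0
#  endif
#endif

#if XXH_SIZE_OPT < 0 || XXH_SIZE_OPT > 2
#  error "XXH_SIZE_OPT must be 0, 1 or 2"
#endif

/* Hypothetical flag: at level 2, one-shot XXH32/XXH64 reuse the
 * streaming code, but only if streaming has not been disabled. */
#if XXH_SIZE_OPT >= 2 && !defined(XXH_NO_STREAM)
#  define DEMO_ONESHOT_VIA_STREAM 1
#else
#  define DEMO_ONESHOT_VIA_STREAM 0
#endif

static int demo_size_opt(void)           { return XXH_SIZE_OPT; }
static int demo_oneshot_via_stream(void) { return DEMO_ONESHOT_VIA_STREAM; }
```

Building with `-DXXH_SIZE_OPT=2` turns the flag on, and adding `-DXXH_NO_STREAM` turns it back off, because there is no streaming code left to share.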
Add dedicated install targets to allow the user to install only xxhsum,
the static library, the shared library, etc.
This is especially useful on embedded systems, where dynamic libraries are
not always supported by the toolchain.
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Ensures there is no use of <stdlib.h> within libxxhash,
at the expense of the `XXH*_createState()` functions, which always return `NULL`.
This is not a concern when the state is allocated statically,
or when one-shot functions (like `XXH64()`) are used.
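The trade-off described above can be illustrated with a toy streaming API. Every name here (`demo_state_t`, `demo_createState`, `DEMO_NO_STDLIB`, and the rest) is hypothetical, and the mixing step is a placeholder, not xxHash. The point is only that a heap-allocating constructor degrades to returning `NULL`, while a statically allocated state keeps working.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef DEMO_NO_STDLIB
#  define DEMO_NO_STDLIB 1  /* hypothetical switch: build without <stdlib.h> */
#endif

/* Hypothetical state layout, not xxHash's real one. */
typedef struct { uint64_t acc; uint64_t total_len; } demo_state_t;

#if DEMO_NO_STDLIB
/* No malloc() available: creation always fails. */
static demo_state_t* demo_createState(void) { return NULL; }
#else
#  include <stdlib.h>
static demo_state_t* demo_createState(void) { return (demo_state_t*)malloc(sizeof(demo_state_t)); }
#endif

static void demo_reset(demo_state_t* s, uint64_t seed) { s->acc = seed; s->total_len = 0; }

static void demo_update(demo_state_t* s, const void* data, size_t len)
{
    const unsigned char* b = (const unsigned char*)data;
    for (size_t i = 0; i < len; i++) s->acc = s->acc * 31 + b[i];  /* toy mixing, NOT xxHash */
    s->total_len += len;
}

static uint64_t demo_digest(const demo_state_t* s) { return s->acc ^ s->total_len; }

/* Statically allocated state: needs no allocator at all. */
static demo_state_t g_state;
```

With the real library, the equivalent is declaring an `XXH64_state_t` on the stack or as a static object (which requires its full definition to be visible), or simply calling a one-shot function.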
- Use memcpy on ARMv6 and lower when unaligned access is supported.
- GCC is internally inconsistent about whether unaligned access is available
on ARMv6, so some code paths use byte shifts and others do not.
- aligned(1) is better on everything else.
- All of this appears to be safe even on GCC 4.9.
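The three ways of reading an unaligned 32-bit value mentioned above can be sketched side by side. The function names are hypothetical. The `may_alias` attribute on the aligned(1) typedef is an addition made here so the sketch also respects strict aliasing. The commit does not say whether xxHash uses it.

```c
#include <stdint.h>
#include <string.h>

/* memcpy: always well defined; compilers lower it to a single load
 * when the target permits unaligned access. */
static uint32_t demo_read32_memcpy(const void* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

#if defined(__GNUC__)
/* aligned(1) typedef: tells GCC/Clang the value may live at any address.
 * On a typedef, GCC allows the aligned attribute to lower alignment. */
typedef uint32_t __attribute__((aligned(1), may_alias)) demo_unalign32;

static uint32_t demo_read32_aligned1(const void* p)
{
    return *(const demo_unalign32*)p;
}
#endif

/* Byte shifts: portable fallback, little-endian result on any host. */
static uint32_t demo_read32_le_shift(const void* p)
{
    const uint8_t* b = (const uint8_t*)p;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

The memcpy and aligned(1) variants return native-endian values and should produce the same result. The byte-shift version always assembles a little-endian value, which is why it tends to cost extra instructions when the compiler cannot prove that a plain load is allowed.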
- Leave out the alignment check if unaligned access is supported on ARM.
It improves GCC code generation on ARMv6 and ARMv7:
https://gcc.godbolt.org/z/nfnEKsjzc. It has a minor effect on
AArch64: +4 instructions with -O2, -156 instructions with -O3. Overall,
while GCC emits proper instructions for packed structures in isolation,
it often gets confused after inlining; using an explicit alignment value
on an integer type fixes this issue.
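A sketch of a compile-time switch that drops the runtime alignment check where unaligned loads are known to be fine. `__ARM_FEATURE_UNALIGNED` is the ACLE macro for ARM targets with unaligned access enabled. `DEMO_FORCE_ALIGN_CHECK` and `demo_sum32` are hypothetical names, and summing words stands in for real hashing work.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical knob: only test pointer alignment at runtime on targets
 * where unaligned loads might be slow or unsupported. */
#ifndef DEMO_FORCE_ALIGN_CHECK
#  if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) \
   || defined(__ARM_FEATURE_UNALIGNED)
#    define DEMO_FORCE_ALIGN_CHECK 0
#  else
#    define DEMO_FORCE_ALIGN_CHECK 1
#  endif
#endif

/* Sums nwords native-endian 32-bit words starting at input. */
static uint32_t demo_sum32(const void* input, size_t nwords)
{
    const unsigned char* p = (const unsigned char*)input;
    uint32_t sum = 0;
#if DEMO_FORCE_ALIGN_CHECK
    if (((uintptr_t)p & 3) == 0) {
        /* Aligned path: direct loads (assumes the buffer holds uint32_t data). */
        const uint32_t* w = (const uint32_t*)(const void*)p;
        for (size_t i = 0; i < nwords; i++) sum += w[i];
        return sum;
    }
#endif
    /* Generic path: memcpy handles any alignment. */
    for (size_t i = 0; i < nwords; i++) {
        uint32_t v;
        memcpy(&v, p + 4 * i, sizeof v);
        sum += v;
    }
    return sum;
}
```

When the check is compiled out, only the generic path remains. This avoids the branch and gives the compiler a single load pattern to optimize after inlining.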
32-bit ARM changes:
- Force GCC to unroll XXH3_accumulate_512 on scalar ARM -> 20% faster on
ARMv6
- Use `XXH_FORCE_MEMORY_ACCESS=1` in ARM strict alignment mode, which avoids
surprising calls to memcpy()
XXH3_64bits on a Raspberry Pi 4B (Cortex-A72), GCC 10.2.1:
- Raspbian armhf (-march=armv6 -mfpu=vfp -mfloat-abi=hard -munaligned-access)
0.85 GB/s -> 1.2 GB/s. Note that there is still room for improvement: clang 11 reaches 1.4 GB/s.
- ARMv6, no unaligned access (-march=armv6 -mno-unaligned-access)
0.3 GB/s -> 0.85 GB/s (no longer calls memcpy())
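A scalar accumulate step in the style of XXH3_accumulate_512, with GCC's `#pragma GCC unroll` (available since GCC 8) requesting full unrolling. The names are hypothetical and the round is written from the usual description of XXH3's scalar loop: each lane adds the raw input to its neighbour and a 32x32->64 product of the keyed input to itself. Treat it as an illustration, not a copy of xxhash.h. The guard limits the pragma to GCC on 32-bit ARM, matching the commit's scope.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ACC_NB 8  /* 8 x 64-bit lanes = one 64-byte stripe */

/* Little-endian 64-bit read, independent of host endianness. */
static uint64_t demo_readLE64(const unsigned char* p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

static void demo_accumulate_512_scalar(uint64_t acc[DEMO_ACC_NB],
                                       const unsigned char* input,
                                       const unsigned char* secret)
{
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8 && defined(__arm__)
#  pragma GCC unroll 8   /* ask GCC to fully unroll the 8-lane loop */
#endif
    for (size_t lane = 0; lane < DEMO_ACC_NB; lane++) {
        uint64_t const data_val = demo_readLE64(input + 8 * lane);
        uint64_t const data_key = data_val ^ demo_readLE64(secret + 8 * lane);
        acc[lane ^ 1] += data_val;                            /* swap adjacent lanes */
        acc[lane] += (data_key & 0xFFFFFFFFu) * (data_key >> 32); /* 32x32->64 multiply */
    }
}
```

Full unrolling turns `lane ^ 1` into constant indices, so the accumulators can stay in registers instead of being addressed through memory on each iteration.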
AArch64 changes:
- Moved the scalar loop above the NEON loop, which allows GCC to interleave
the scalar and NEON instructions
- On AArch64, GCC now uses a raw cast instead of `vld1q`, which GCC treated as
an intrinsic rather than a load.
- This also hides the `vreinterpret` calls
- Clang and ARMv7-A still use the safer `vld1q_u8`
- Slight reordering of the NEON instructions
Pixel 4a (Cortex-A76), GCC 11.1.0: 9.8 GB/s -> 11.1 GB/s
Raspberry Pi 4B (Cortex-A72), GCC 10.2.1: 4.2 GB/s -> 4.3 GB/s
*GCC is now faster than Clang for aarch64.*
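A sketch of the load selection described above. `demo_load_u64x2` is a hypothetical helper. `vld1q_u8`, `vreinterpretq_u64_u8` and `vst1q_u64` are standard ACLE NEON intrinsics. The raw-cast branch relies on AArch64 permitting unaligned vector loads, as the change assumes, and is not strictly portable C. A memcpy fallback keeps the sketch compiling on non-ARM hosts.

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>
#endif

/* Loads two 64-bit lanes starting at p into out[0..1]. */
static void demo_load_u64x2(const void* p, uint64_t out[2])
{
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
    /* AArch64 GCC: a raw cast is seen as an ordinary load that GCC can
     * schedule and interleave freely (no vreinterpret needed). */
    uint64x2_t const v = *(const uint64x2_t*)p;
    vst1q_u64(out, v);
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Clang and ARMv7-A: vld1q_u8 has no alignment requirement;
     * the reinterpret only changes the vector's element type. */
    uint64x2_t const v = vreinterpretq_u64_u8(vld1q_u8((const uint8_t*)p));
    vst1q_u64(out, v);
#else
    /* No NEON: plain memcpy so the sketch builds anywhere. */
    memcpy(out, p, 2 * sizeof(uint64_t));
#endif
}
```

All three branches should yield the same two native-endian lanes. They differ only in how much freedom the compiler has to reorder the load.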
for variant redirectors (`xxh32sum`, `xxh64sum` and `xxh128sum`).
Fixes #647, reported by @jpalus.
Also: slightly updated the man page text for clarity and accuracy.