Bitwise (logical) instructions on AArch64 use a different encoding for their
immediate constant arguments, which requires a different asm constraint.
Otherwise, an invalid instruction could be generated, resulting in a
compilation error.
Fixes https://github.com/boostorg/atomic/issues/41.
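For illustration, here is a minimal sketch (not taken from the library) of the
distinction, assuming GCC's documented AArch64 machine constraints: "I" accepts
ADD-style arithmetic immediates, while "L" accepts 64-bit logical (bitmask)
immediates.

    #include <cstdint>

    std::uint64_t add_imm(std::uint64_t v)
    {
        // "I": immediate encodable in an ADD instruction
        __asm__("add %0, %0, %1" : "+r" (v) : "I" (42));
        return v;
    }

    std::uint64_t and_imm(std::uint64_t v)
    {
        // "L": immediate encodable in a 64-bit logical instruction (AND/ORR/EOR).
        // Reusing the arithmetic constraint here could let through a constant
        // that AND cannot encode, producing an invalid instruction.
        __asm__("and %0, %0, %1" : "+r" (v) : "L" (0xffull));
        return v;
    }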
In most instructions, an omitted destination register defaults to the first
input register. Removing the explicit destination therefore has no practical
effect other than, perhaps, inclining the assembler to generate a shorter
encoding for the instruction in Thumb mode.
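Illustrative sketch (assuming a Thumb-2 capable target and GCC inline asm; the
function itself is made up): with the destination omitted, it defaults to the
first source register, which may let the assembler select the narrow 16-bit
encoding.

    unsigned int increment(unsigned int v)
    {
        // equivalent to "adds %0, %0, #1"; the two-operand form leaves the
        // encoding choice to the assembler
        __asm__("adds %0, #1" : "+l" (v) : : "cc");
        return v;
    }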
The check is already made in the size-specific specializations of the extra
operations, where needed. For 128-bit operations there are no LSE instructions,
which means the common template should still use the more efficient op
operations instead of fetch_op.
The operations returning the result of the operation instead of the original
value require fewer registers and are therefore more efficient. The exception
is the AArch64 LSE extension, which adds dedicated atomic instructions that are
presumably more efficient than ll/sc loops.
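As a rough illustration of the register pressure argument (a sketch on AArch64,
not the backend's actual code): an add that returns the result only needs the
result register and the ll/sc status register, whereas a fetch_add would
additionally have to keep the original value live in its own register across
the store-exclusive.

    #include <cstdint>

    std::uint32_t add_return_result(std::uint32_t volatile& storage, std::uint32_t v)
    {
        std::uint32_t result, tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "ldxr %w[result], %[storage]\n\t"
            "add %w[result], %w[result], %w[value]\n\t"
            "stxr %w[tmp], %w[result], %[storage]\n\t"
            "cbnz %w[tmp], 1b\n\t"
            : [result] "=&r" (result), [tmp] "=&r" (tmp), [storage] "+Q" (storage)
            : [value] "r" (v)
        );
        return result;
    }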
This is the second iteration of the backends, which were both tested
on a QEMU VM and did not show any test failures. The essential difference
from the first version is that in AArch64 we now initialize the success flag
in the asm blocks in compare_exchange operations rather than relying on
the compiler initializing it before passing it into the asm block as an in-out
parameter. Apparently, this sometimes didn't work for some reason, which
made compare_exchange_strong return an incorrect value and broke the
futex-based mutexes in the lock pool.
The above change was also applied to AArch32, along with minor corrections
to the asm block constraints.
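A minimal sketch of the idea on AArch64 (a weak CAS, not the backend's actual
code): the success flag is an output-only operand and is given its initial
value by an instruction inside the asm block, instead of being pre-initialized
in C++ and passed in as an in-out ("+r") operand.

    #include <cstdint>

    bool cas_weak(std::uint32_t volatile& storage, std::uint32_t& expected, std::uint32_t desired)
    {
        std::uint32_t success, original;
        __asm__ __volatile__
        (
            "mov %w[success], #0\n\t"                      // initialize the flag inside the asm block
            "ldaxr %w[original], %[storage]\n\t"
            "cmp %w[original], %w[expected]\n\t"
            "b.ne 1f\n\t"
            "stlxr %w[success], %w[desired], %[storage]\n\t"
            "eor %w[success], %w[success], #1\n\t"         // stlxr writes 0 on success, flip to 1
            "1:\n\t"
            : [success] "=&r" (success), [original] "=&r" (original), [storage] "+Q" (storage)
            : [expected] "r" (expected), [desired] "r" (desired)
            : "cc", "memory"
        );
        expected = original;
        return !!success;
    }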
During testing in a VM, occasional fallback_wait_fuzz test failures were
observed when the AArch64 asm-based backend was used to implement futexes in
the lock pool. It is not clear yet what causes the failures, but they
don't appear when the __atomic* intrinsics are used. Further investigation
is needed.
The AArch32 asm-based backend is practically untested and has much in
common with AArch64, so I'm disabling it as well until at least the problem
with AArch64 is resolved.
- Use dedicated registers to return success from compare_exchange methods.
- Pre-initialize success flag outside asm blocks.
- Use "+Q" constraints for memory operands in 64-bit operations. This allows
to remove "memory" clobber.
- Avoid using lots of conditional instructions in 64-bit compare_exchange
operations. Simplify success flag derivation.
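A minimal sketch of the "+Q" point (shown on AArch64 for brevity, not the
actual backend code): the storage is described to the compiler as a read-write
memory operand of the asm block, so a blanket "memory" clobber is not needed
just to cover that object.

    #include <cstdint>

    void store_relaxed(std::uint64_t volatile& storage, std::uint64_t v)
    {
        __asm__ __volatile__
        (
            "str %x[value], %[storage]\n\t"
            : [storage] "+Q" (storage)
            : [value] "r" (v)
        );
    }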
ARMv8 (AArch32) is significantly different from ARMv7, which warrants the
addition of a separate asm-based backend:
- It adds exclusive load/store instructions with acquire/release semantics,
  which obsoletes the use of explicit dmb instructions in most atomic
  operations (see the sketch below).
- It deprecates "it" hints for some instructions, as well as hints that apply
  to more than one following instruction.
- It does not require elaborate code for switching between Thumb and A32
  modes, as it supports the Thumb-2 extension.
- It always supports instructions for bytes and halfwords.
The old ARM backend is now restricted to ARMv6 and ARMv7.
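Sketch of the first point (assuming GCC inline asm on an ARMv8 AArch32 target,
not the backend's actual code): the acquire/release exclusives carry the
ordering themselves, so no explicit dmb is needed around the ll/sc loop.

    #include <cstdint>

    std::uint32_t exchange_seq_cst(std::uint32_t volatile& storage, std::uint32_t v)
    {
        std::uint32_t original, tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "ldaex %[original], %[storage]\n\t"      // load-acquire exclusive
            "stlex %[tmp], %[value], %[storage]\n\t" // store-release exclusive
            "teq %[tmp], #0\n\t"
            "bne 1b\n\t"
            : [original] "=&r" (original), [tmp] "=&r" (tmp), [storage] "+Q" (storage)
            : [value] "r" (v)
            : "cc", "memory"
        );
        return original;
    }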
We no longer use the alignment attributes (except for alignas, when
available) to align the 128-bit storage for atomics. Instead we rely on
type_with_alignment for that. Although it may still use the same attributes
to achieve the required alignment, this is an implementation detail and may
not correspond to BOOST_NO_ALIGNMENT exactly.
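A rough sketch of the approach (not the actual storage definition):
boost::type_with_alignment provides a type with the requested alignment, and
the 128-bit buffer inherits that alignment by being placed in a union with it.

    #include <boost/type_traits/type_with_alignment.hpp>

    union aligned_uint128_storage
    {
        unsigned char data[16];
        boost::type_with_alignment< 16u >::type aligner; // how the alignment is achieved is type_with_alignment's detail
    };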
The check for BOOST_HAS_INT128 was not really relevant to begin with
because __int128 is not guaranteed to have alignment of 16.
In any case, all current compilers targeting x86 do support alignment of
16, so the checks weren't doing anything.
We need to explicitly link with synchronization.lib when the
WaitOnAddress API is enabled at compile time for ARM targets. Since
this library is only available on newer Windows SDKs, we have to perform
a configure check for whether it is available.
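For reference, a sketch of the dependency (the library's actual configure
logic may differ): WaitOnAddress is declared in synchapi.h and documented to
require synchronization.lib, so a translation unit using it has to pull that
library in, e.g. via a linker pragma on MSVC.

    #define _WIN32_WINNT 0x0602 // Windows 8 or later is needed for WaitOnAddress
    #include <windows.h>

    #if defined(_MSC_VER)
    #pragma comment(lib, "synchronization.lib")
    #endif

    void wait_until_changed(volatile long& value, long expected)
    {
        while (value == expected)
            WaitOnAddress(&value, &expected, sizeof(long), INFINITE);
    }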
There is no guarantee of atomicity of plain loads and stores of anything
larger than a byte on an arbitrary hardware architecture. However, all
modern architectures seem to guarantee atomicity of loads and stores of
suitably aligned objects at least up to the pointer size, so we use that
as the threshold. For larger objects we have to use intrinsics to
guarantee atomicity.
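A sketch of the criterion (names are illustrative, not the library's actual
traits): plain loads and stores are used only for objects no larger than a
pointer and aligned at least to their size; anything bigger goes through the
intrinsic- or asm-based path.

    #include <cstddef>

    template< std::size_t Size, std::size_t Alignment >
    struct use_plain_loads_stores
    {
        static constexpr bool value = Size <= sizeof(void*) && Alignment >= Size;
    };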
The old operations template is replaced with core_operations, which falls
back to core_arch_operations, which falls back to core_operations_emulated.
The core_operations layer is intended for more or less architecture-neutral
backends, like the one based on gcc __atomic* intrinsics. It may fall back
to core_arch_operations where the compiler does not support a given operation
or where the architecture-specific backend is more efficient. For example,
where gcc does not implement 128-bit
atomic operations via __atomic* intrinsics, we support them in the
core_arch_operations backend, which uses inline assembler blocks.
The old emulated_operations template is largely unchanged and was renamed to
core_operations_emulated for naming consistency. All other operation templates
were also renamed for consistency (e.g. generic_wait_operations ->
wait_operations_generic).
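A schematic sketch of the resulting layering (simplified and hypothetical; the
real templates have different parameters and select the backend via header
dispatch rather than inheritance, which is used here only to show the fallback
order):

    template< std::size_t Size, bool Signed, bool Interprocess >
    struct core_operations_emulated;        // lock-based fallback

    template< std::size_t Size, bool Signed, bool Interprocess >
    struct core_arch_operations :           // asm blocks for the target architecture
        public core_operations_emulated< Size, Signed, Interprocess >
    {
    };

    template< std::size_t Size, bool Signed, bool Interprocess >
    struct core_operations :                // compiler intrinsics, e.g. gcc __atomic*
        public core_arch_operations< Size, Signed, Interprocess >
    {
    };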
Fence operations have been extracted to a separate set of structures:
fence_operations, fence_arch_operations and fence_operations_emulated. These
are similar to the core operations described above. This structuring also
makes it possible to fall back from fence_operations to fence_arch_operations
when the latter is more optimal.
The net result of these changes is that 128-bit and 64-bit atomic operations
should now be consistently supported on all architectures that support them.
Previously, only x86 was supported via local hacks for gcc and clang.
The initialization is not needed for the code, but it is needed to make
tools like valgrind happy. Otherwise, the tools would mark the instructions
as accessing uninitialized data.
Also, changed the dummy variable to a byte. This may allow for a more relaxed
alignment requirement.
The backend implements core and extra atomic operations using gcc asm blocks.
The implementation supports extensions added in ARMv8.1 and ARMv8.3. It supports
both little-endian and big-endian targets.
Currently, the code has not been tested on real hardware. It has been tested
on a QEMU VM.
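As an example of what the ARMv8.1 (LSE) path enables (a sketch, not the
backend's actual code, assuming the compiler targets armv8.1-a or later): a
fetch_add becomes a single instruction instead of an ll/sc loop.

    #include <cstdint>

    std::uint32_t fetch_add_seq_cst(std::uint32_t volatile& storage, std::uint32_t v)
    {
        std::uint32_t original;
        __asm__ __volatile__
        (
            "ldaddal %w[value], %w[original], %[storage]\n\t"
            : [original] "=r" (original), [storage] "+Q" (storage)
            : [value] "r" (v)
            : "memory"
        );
        return original;
    }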
The previous change to increase the delay didn't help, so we're instead
changing the expectation - the first woken thread is allowed to receive
value3 on wake up.
Occasionally, the IPC notify_one test fails on Windows because the first
of the woken threads receives value3 from wait(). This is possible if
the thread lingers in wait() for some reason. Increase the delay
before the second notification slightly to reduce the likelihood
of this happening.
The ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition, allows the
use of ldrex and other load-exclusive instructions without a matching strex in
Section A3.4.5 "Load-Exclusive and Store-Exclusive usage restrictions". In
Section A3.5.3 "Atomicity in the ARM architecture" it states that ldrexd
atomically loads a 64-bit value from suitably aligned memory. This makes
the strexd added in eb50aea437 unnecessary.
The ARM Architecture Reference Manual Armv8, for the Armv8-A architecture
profile, does not state explicitly that ldrexd can be used without a matching
strexd, but does not prohibit it either in Section E2.10.5 "Load-Exclusive and
Store-Exclusive instruction usage restrictions".
Although we don't need to store anything after the load, we need to issue
strexd to reset the exclusive access mark on the storage address. So we
immediately store the loaded value back.
The ldrexd+strexd technique is described in the ARM Architecture Reference
Manual ARMv8, Section B2.2.1. Although it is described for ARMv8, the
technique should be valid for previous versions as well.
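A sketch of the technique with GCC inline asm (patterned after typical ldrexd
usage, not the backend's verbatim code; %H names the second register of the
pair, and ldrexd in A32 state needs an even/odd pair, which the compiler is
expected to allocate for a 64-bit operand):

    #include <cstdint>

    std::uint64_t load_64(std::uint64_t volatile& storage)
    {
        std::uint64_t original;
        std::uint32_t tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "ldrexd %0, %H0, [%2]\n\t"     // atomic 64-bit load
            "strexd %1, %0, %H0, [%2]\n\t" // store the value back to clear the exclusive access mark
            "teq %1, #0\n\t"
            "bne 1b\n\t"
            : "=&r" (original), "=&r" (tmp)
            : "r" (&storage)
            : "cc", "memory"
        );
        return original;
    }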
The implementation uses GetTickCount/GetTickCount64 internally,
which is a steady, sufficiently low-precision time source.
We need the clock to have relatively low precision so that wait
tests don't fail spuriously because the blocked threads wake up
too soon, according to more precise clocks.
boost::chrono::system_clock currently has an acceptably low precision,
but it is not a steady clock.
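A sketch of such a clock (using std::chrono for self-containedness; the actual
test helper is based on Boost.Chrono and may differ): steady, millisecond
resolution, backed by GetTickCount64.

    #include <windows.h>
    #include <chrono>

    struct tick_count_clock
    {
        typedef std::chrono::milliseconds duration;
        typedef duration::rep rep;
        typedef duration::period period;
        typedef std::chrono::time_point< tick_count_clock, duration > time_point;
        static constexpr bool is_steady = true;

        static time_point now() noexcept
        {
            return time_point(duration(static_cast< rep >(GetTickCount64())));
        }
    };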
Forced inline is mostly used to ensure the compiler is able to treat memory
order arguments as constants. It is also useful for constant propagation
on other arguments. This is not very useful for the emulated backend, so
we might as well allow the compiler to not inline the functions.
When the emulated wait function is inlined, the compiler sometimes generates
code that acts as if a wrong value is returned from the wait function. The
compiler simply "forgets" to save the atomic value into an object on the
stack, which makes it later use a bogus value as the "returned" value.
Preventing inlining seems to work around the problem.
Discovered by wait_api notify_one/notify_all test failures for struct_3_bytes.
Oddly enough, the same test for uint32_t did not fail.
Checking the capability macros is not good enough because ipc_atomic_ref
can be non-lock-free even when the macro (and ipc_atomic) indicates lock-free.
We now check the is_always_lock_free property to decide whether to run or skip
tests for a given IPC atomic type.
Also, made struct_3_bytes output more informative.
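A sketch of the gating (assuming ipc_atomic_ref exposes is_always_lock_free
the same way atomic_ref does; the actual test harness code differs):

    #include <boost/atomic.hpp>

    template< typename T >
    bool can_run_lock_free_ipc_ref_tests()
    {
        // The per-size capability macro may report lock-free while ipc_atomic_ref
        // for this particular T is not (e.g. due to its required alignment).
        return boost::atomics::ipc_atomic_ref< T >::is_always_lock_free;
    }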