As the names suggest, the methods perform the corresponding operation and test
if the result is not zero.
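For example, with the new semantics (a usage sketch, Boost 1.67 and later):

    #include <boost/atomic.hpp>
    #include <cassert>

    int main()
    {
        boost::atomic< int > n(2);
        assert(n.sub_and_test(1));   // 2 - 1 == 1, non-zero, so the call returns true
        assert(!n.sub_and_test(1));  // 1 - 1 == 0, zero, so the call returns false
    }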
Also, for the emulated fetch_complement, take care of integral promotion, which
could mess up the storage bits that were not part of the value on backends
where the storage is larger than the value. This could in turn break CAS on
the atomic value as it compares the whole storage.
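As a minimal illustration (not the Boost.Atomic source), consider an 8-bit value
kept in a 32-bit storage: operator~ promotes its operand to int, so the
complement also sets the 24 bits that are not part of the value, and a later CAS
comparing the whole storage would fail spuriously unless the result is truncated
back to the value width:

    #include <cstdint>

    inline std::uint32_t complemented_bits(std::uint8_t value)
    {
        // Wrong: ~value is computed as an int, so the upper storage bits become 1:
        //     return static_cast< std::uint32_t >(~value);

        // Correct: truncate the complement back to the value width first.
        return static_cast< std::uint32_t >(static_cast< std::uint8_t >(~value));
    }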
The compiler, surprisingly, uses ebx for memory operands, which messed up the
save/restore logic in asm blocks, resulting in the memory operand (which was
supposed to be the pointer to the atomic storage) being incorrect.
First, clang (and, apparently, recent gcc as well) is able to deal with ebx
around the asm blocks by itself, which makes it unnecessary to save and restore
the register in the asm blocks. Therefore, for those compilers we now use the
non-PIC branch in PIC mode as well. This sidesteps the original problem with
clang.
Second, since we can't be sure if other compilers are able to pull the same
trick, the PIC branches of code have been updated to avoid any memory operand
constraints and use the explicitly calculated pointer in a register instead. We
also no longer use a scratch slot on the stack to save ebx but instead use esi
for that, which is also conveniently used for one of the inputs. This should
be slightly faster as well. The downside is that we're possibly wasting one
register for storing the pointer to the storage, but there seems to be no way
around it.
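A minimal sketch of this arrangement (illustrative only, not the actual
Boost.Atomic source; the function name and operand choices are made up): the
pointer to the storage is computed by the compiler into an explicit register,
while ebx is preserved by swapping it with esi, which also carries the low half
of the desired value:

    #include <cstdint>

    inline bool cas64(volatile std::uint64_t* storage, std::uint64_t& expected, std::uint64_t desired)
    {
        bool success;
        __asm__ __volatile__
        (
            "xchgl %%ebx, %%esi\n\t"         // load the low half of desired into ebx, stash ebx in esi
            "lock; cmpxchg8b (%[dest])\n\t"  // compare edx:eax with *dest, store ecx:ebx on match
            "xchgl %%ebx, %%esi\n\t"         // restore ebx (and esi)
            "sete %[success]\n\t"
            : "+A" (expected), [success] "=q" (success)
            : [dest] "D" (storage),
              "S" (static_cast< std::uint32_t >(desired)),       // low half of desired
              "c" (static_cast< std::uint32_t >(desired >> 32))  // high half of desired
            : "cc", "memory"
        );
        return success;
    }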
Apparently, gcc versions up to 4.6 inclusive have problems allocating
eax:edx register pairs in asm statements for 32-bit x86 targets. Those
compilers are now included in the existing workaround.
Also, for clang, removed the __sync-based workaround for the exchange()
implementation and switched to the asm branch with the workaround instead. This
should produce more efficient code.
Clang failed to compile such code. Given that gcc 7 also complained about
missing displacements in memory operands, this trick is no longer effective
with newer compilers.
Instead, the assembler code has been refactored to avoid having to specify
any displacements at all, offloading this work to the compiler. We hope that
the compiler will be smart enough to not overallocate registers for every
memory operand used in the inline assembler. At least, recent gcc and clang are
able to do this and generate code comparable to what was achieved previously.
Additionally, it was possible to reduce assembler code in several places by
removing mov instructions that set up input registers or handle the results.
Instead, we now rely on the compiler to do this work when satisfying the
constraints of the asm blocks.
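A minimal sketch of the resulting style (illustrative, not the actual
Boost.Atomic code): both the memory operand and the value register are left to
the compiler via constraints, so the asm block contains no displacements and no
setup movs:

    #include <cstdint>

    inline std::uint32_t fetch_add32(volatile std::uint32_t* storage, std::uint32_t v)
    {
        __asm__ __volatile__
        (
            "lock; xaddl %[value], %[dest]\n\t"  // old value ends up in the register chosen for v
            : [value] "+r" (v), [dest] "+m" (*storage)
            :
            : "memory"
        );
        return v;
    }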
In the 32-bit x86 load and store code, improved support for targets that have
SSE but not SSE2: SSE alone is sufficient to perform the loads/stores. Also,
the scratch xmm register is now picked by the compiler.
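As an illustration of the idea (a sketch only, assuming this concerns the
64-bit loads/stores on 32-bit x86; not the actual Boost.Atomic code), movlps
from SSE is enough to transfer 64 bits at once, and the scratch register is
left to the compiler through an "=x" temporary:

    #include <cstdint>

    inline std::uint64_t load64_via_sse(const volatile std::uint64_t* storage)
    {
        typedef float xmm_t __attribute__((__vector_size__(16)));
        xmm_t scratch;  // the compiler picks which xmm register this lives in
        std::uint64_t value;
        __asm__ __volatile__
        (
            "xorps %[scratch], %[scratch]\n\t"   // break the dependency on the previous contents
            "movlps %[storage], %[scratch]\n\t"  // 64-bit load into the low half of the xmm register
            "movlps %[scratch], %[value]\n\t"    // 64-bit store into the result slot
            : [scratch] "=&x" (scratch), [value] "=m" (value)
            : [storage] "m" (*storage)
            : "memory"
        );
        return value;
    }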
Clang has a bug where it does not advertise support for atomics implemented via
cmpxchg8b, even if the instruction is enabled on the command line. We have
to work around the same problem with cmpxchg16b on 64-bit x86 as well, so
we apply the same approach here - we implement all atomic ops through DCAS
ourselves.
This fix adds a check for cmpxchg8b to the capabilities definition.
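With such a CAS primitive in place (cas64 below is a hypothetical stand-in,
e.g. the cmpxchg8b-based helper sketched earlier), the remaining operations can
be expressed as compare-and-swap loops, for example:

    #include <cstdint>

    bool cas64(volatile std::uint64_t* storage, std::uint64_t& expected, std::uint64_t desired);

    inline std::uint64_t fetch_add64(volatile std::uint64_t* storage, std::uint64_t v)
    {
        std::uint64_t old_val = *storage;              // initial guess; the loop corrects any tearing
        while (!cas64(storage, old_val, old_val + v))  // cas64 reloads old_val on failure
        {
        }
        return old_val;
    }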
This makes the result of (op)_and_test more consistent with other
methods such as test_and_set and bit_test_and_set, as well as the
methods used in the C++ standard library.
This is a breaking change. Users can define the
BOOST_ATOMIC_HIGHLIGHT_OP_AND_TEST macro to generate warnings on each
use of the changed functions. This will help users port from Boost
1.66 to newer Boost releases.
More info at:
https://github.com/boostorg/atomic/issues/11
http://boost.2283326.n4.nabble.com/atomic-op-and-test-naming-tc4701445.html
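For instance (assuming the macro is consumed by the Boost.Atomic headers at
compile time), it can be enabled like this, or with an equivalent -D compiler
switch:

    // Enable the porting aid before including Boost.Atomic; every use of the
    // changed *_and_test functions will then be flagged with a warning.
    #define BOOST_ATOMIC_HIGHLIGHT_OP_AND_TEST
    #include <boost/atomic.hpp>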
Gcc 7 removed support for __atomic intrinsics on 16-byte operands on x86-64 and
instead always generates library calls to libatomic, thus breaking compilation
of users' code, which now has to be linked with that library. Also, the assembler tends
to generate warnings when implicit zero displacement is used in memory operands.
The new implementation of 8 and 16-bit ops uses the lbarx/stbcx and
lharx/sthcx instructions available in Power8 and later architectures.
This makes it possible to use smaller storage types, similar to those used by
compiler intrinsics.
Also added detection of 128-bit instructions lqarx/stqcx, which can
later be used to implement 128-bit ops.
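A minimal sketch of an 8-bit operation built on these instructions (assumed,
not the actual Boost.Atomic code; memory ordering instructions are omitted for
brevity):

    #include <cstdint>

    inline std::uint8_t fetch_add8(volatile std::uint8_t* storage, std::uint8_t v)
    {
        std::uint8_t original, tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "lbarx %0,0,%3\n\t"   // load the byte and set the reservation
            "add %1,%0,%4\n\t"    // compute the new value
            "stbcx. %1,0,%3\n\t"  // conditional store; fails if the reservation was lost
            "bne- 1b\n\t"         // retry on failure
            : "=&r" (original), "=&r" (tmp), "+m" (*storage)
            : "r" (storage), "r" (v)
            : "cc"
        );
        return original;
    }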
Use ldrexb/h and strexb/h on ARMv7 and later to implement byte/halfword-wide
atomic ops. On the older ARM versions we still have to use the 32-bit
widening implementation.
Also allowed immediate constants in some of the operations to improve
generated code.
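A minimal sketch of a byte-wide operation using these instructions (assumed,
not the actual Boost.Atomic code; memory barriers are omitted for brevity):

    #include <cstdint>

    inline std::uint8_t fetch_add8(volatile std::uint8_t* storage, std::uint8_t v)
    {
        std::uint32_t original, result, tmp;
        __asm__ __volatile__
        (
            "1:\n\t"
            "ldrexb %[original], %[storage]\n\t"        // exclusive byte load
            "add %[result], %[original], %[value]\n\t"  // compute the new value
            "strexb %[tmp], %[result], %[storage]\n\t"  // exclusive byte store; tmp == 0 on success
            "teq %[tmp], #0\n\t"
            "bne 1b\n\t"                                // retry if the exclusive store failed
            : [original] "=&r" (original), [result] "=&r" (result), [tmp] "=&r" (tmp),
              [storage] "+Q" (*storage)
            : [value] "r" (static_cast< std::uint32_t >(v))
            : "cc", "memory"
        );
        return static_cast< std::uint8_t >(original);
    }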
Common ARM code extracted to a separate header to reuse with extra ops.
This allows for more flexibility in register allocation and potentially
more efficient code. Also, the temporary register was not exactly
customizable in the previous code, so it should have been cleaned up
anyway.
In order to support a more flexible definition of the extra operations for
different platforms, define extra_operations as an add-on to the existing
operations template. The extra_operations template will be used only by
the non-standard operations added by Boost.Atomic.
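A simplified skeleton of the arrangement (the member signature is illustrative
and omits memory ordering arguments; this is not the actual Boost.Atomic
header):

    #include <cstddef>

    // Base is the core operations template for the platform; extra_operations
    // layers the non-standard operations on top of it and can be specialized
    // per platform/size where a more efficient implementation exists.
    template< typename Base, std::size_t Size, bool Signed >
    struct extra_operations : public Base
    {
        typedef typename Base::storage_type storage_type;

        // Generic fallback expressed through the base operations.
        static storage_type fetch_complement(storage_type volatile& storage)
        {
            storage_type old_val = Base::load(storage);
            while (!Base::compare_exchange_weak(storage, old_val, static_cast< storage_type >(~old_val)))
            {
            }
            return old_val;
        }
    };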
This is an attempt to improve generated code in the calling application that
involves CAS in a tight loop. The necessity to cast between the value type and
the storage type for the `expected` argument results in inefficient code
that involves copying of the expected value and also saving the CAS result on
the stack. This has been observed at least with gcc 6.3 with a tight loop
on the user's side.
When we can ensure that the storage type can safely alias other types, and the
value type has the same size as the storage type, we can simplify CAS by
performing type punning on the `expected` reference instead of copying it back
and forth.
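A minimal sketch of the difference, using GCC/clang __atomic builtins in place
of the actual Boost.Atomic internals (the function names are made up):

    #include <cstdint>
    #include <cstring>

    // Without punning: `expected` is copied into a storage-typed temporary
    // before the CAS and copied back afterwards, which tends to spill to the
    // stack in a tight loop.
    template< typename T >
    bool cas_with_copies(std::uint32_t volatile& storage, T& expected, T desired)
    {
        std::uint32_t expected_s = 0, desired_s = 0;
        std::memcpy(&expected_s, &expected, sizeof(T));
        std::memcpy(&desired_s, &desired, sizeof(T));
        bool const success = __atomic_compare_exchange_n(
            &storage, &expected_s, desired_s, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        std::memcpy(&expected, &expected_s, sizeof(T));  // write the observed value back
        return success;
    }

    // With punning: when the storage type may alias other types and
    // sizeof(T) == sizeof(storage), the `expected` reference is reinterpreted
    // directly and both copies disappear.
    template< typename T >
    bool cas_with_punning(std::uint32_t volatile& storage, T& expected, T desired)
    {
        std::uint32_t desired_s = 0;
        std::memcpy(&desired_s, &desired, sizeof(T));
        return __atomic_compare_exchange_n(
            &storage, reinterpret_cast< std::uint32_t* >(&expected), desired_s,
            false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }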