[/
      Copyright Oliver Kowalke 2016.
      Distributed under the Boost Software License, Version 1.0.
      (See accompanying file LICENSE_1_0.txt or copy at
      http://www.boost.org/LICENSE_1_0.txt
]

[section:performance Performance]

Performance measurements were taken using `std::chrono::high_resolution_clock`,
with overhead corrections.
The code was compiled with gcc-6.2.0, using build options:
variant = release, optimization = speed.
Tests were executed on an Intel Core i7-4770S 3.10GHz, 4 cores with 8
hyperthreads (4C/8T), running Linux (4.7.0/x86_64).

Measurements labelled 1C/1T were run in a single-threaded process.

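The overhead correction can be illustrated with a short sketch: the clock's own
cost is estimated first and subtracted from each sample. The helper names and
the number of calibration iterations below are assumptions made for this
illustration, not taken from the benchmark sources.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>

    using clock_type = std::chrono::high_resolution_clock;

    // estimate the cost of reading the clock itself
    clock_type::duration clock_overhead() {
        auto overhead = clock_type::duration::max();
        for ( std::size_t i = 0; i < 10; ++i) {
            auto start = clock_type::now();
            auto stop = clock_type::now();
            overhead = std::min( overhead, stop - start);
        }
        return overhead;
    }

    // time a callable once, subtracting the measured clock overhead
    template< typename Fn >
    clock_type::duration timed( Fn && fn, clock_type::duration overhead) {
        auto start = clock_type::now();
        fn();
        auto diff = clock_type::now() - start;
        return diff > overhead ? diff - overhead : clock_type::duration::zero();
    }
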
The [@https://github.com/atemerev/skynet microbenchmark ['syknet]] from
|
|
Alexander Temerev was ported and used for performance measurements.
|
|
At the root the test spawns 10 threads-of-execution (ToE), e.g.
|
|
actor/goroutine/fiber etc.. Each spawned ToE spawns additional 10 ToEs ...
|
|
until 100000 ToEs are created. ToEs return back ther ordinal numbers
|
|
(0 ... 99999), which are summed on the previous level and sent back upstream,
|
|
until reaching the root. The test was run 10-20 times, producing a range of
|
|
values for each measurement.
|
|
|
|
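A simplified, single-threaded port of the benchmark to __boost_fiber__ might
look like the sketch below; the channel capacities and the use of
`buffered_channel` are choices made for this sketch only, not a copy of the
benchmark sources. For the published numbers the fibers additionally used
__fixedsize_stack__ stacks and the schedulers listed further down.

    #include <cstdint>
    #include <functional>
    #include <iostream>

    #include <boost/fiber/all.hpp>

    using channel_type = boost::fibers::buffered_channel< std::uint64_t >;

    // each ToE either reports its ordinal number or spawns 10 children
    // and reports the sum of their results
    void skynet( channel_type & c, std::uint64_t num, std::uint64_t size, std::uint64_t div) {
        if ( 1 == size) {
            c.push( num);
        } else {
            channel_type rc{ 16 };
            for ( std::uint64_t i = 0; i < div; ++i) {
                auto sub_num = num + i * size / div;
                boost::fibers::fiber{ skynet, std::ref( rc), sub_num, size / div, div }.detach();
            }
            std::uint64_t sum = 0;
            for ( std::uint64_t i = 0; i < div; ++i) {
                std::uint64_t value = 0;
                rc.pop( value);
                sum += value;
            }
            c.push( sum);
        }
    }

    int main() {
        channel_type rc{ 2 };
        boost::fibers::fiber{ skynet, std::ref( rc), std::uint64_t( 0), std::uint64_t( 100000), std::uint64_t( 10) }.detach();
        std::uint64_t result = 0;
        rc.pop( result);
        std::cout << result << std::endl; // sum of 0 ... 99999 == 4999950000
        return 0;
    }
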
[table time per actor/erlang process/goroutine (other languages) (average over 100,000)
    [
        [Haskell | stack-1.0.4]
        [Erlang | erts-7.0]
        [Go | go1.6.1 (GOMAXPROCS == default)]
        [Go | go1.6.1 (GOMAXPROCS == 8)]
    ]
    [
        [0.32 \u00b5s]
        [0.64 \u00b5s - 1.21 \u00b5s]
        [1.52 \u00b5s - 1.64 \u00b5s]
        [0.70 \u00b5s - 0.98 \u00b5s]
    ]
]

`std::thread` cannot be tested at this time (C++14) because the API does not
allow setting the thread stack size (the default is 2MB on Linux and 1MB on
Windows). An out-of-memory error is likely. With pthreads the stack size was
set to 8kB.

[table time per thread (average over *10,000* - unable to spawn 100,000 threads)
    [
        [pthread]
        [`std::thread`]
    ]
    [
        [14.4 \u00b5s - 20.8 \u00b5s]
        [18.8 \u00b5s - 28.1 \u00b5s]
    ]
]

The test utilizes 4 cores with Simultaneous MultiThreading enabled (8 logical
CPUs). The fiber stacks are allocated by __fixedsize_stack__.

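For reference, a fiber can be given such a stack through the allocator-aware
constructor; the 8kB size below is an assumption for this sketch, chosen to
mirror the pthread stack size above, not a recommendation.

    #include <memory>

    #include <boost/fiber/all.hpp>

    int main() {
        // pass an explicit stack allocator to the fiber; 8kB is an example value
        boost::fibers::fiber f{
            std::allocator_arg,
            boost::fibers::fixedsize_stack( 8 * 1024),
            []{ /* fiber body */ } };
        f.join();
        return 0;
    }
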
As the benchmark shows, the memory allocation algorithm is significant for
performance in a multithreaded environment. The tests use glibc[s] memory
allocation algorithm (based on ptmalloc2) as well as Google[s]
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] (via
linkflags="-ltcmalloc").[footnote
Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo
["An Experimental Study on Memory Allocators in Multicore and
Multithreaded Applications], PDCAT [,]11 Proceedings of the 2011 12th
International Conference on Parallel and Distributed Computing, Applications
and Technologies, pages 92-98]

The __shared_work__ scheduling algorithm uses one global queue, containing
fibers ready to run, shared between all threads. The work is distributed equally
over all threads.
In the __work_stealing__ scheduling algorithm, each thread has its own local
queue. Fibers that are ready to run are pushed to and popped from the local
queue. If the queue runs out of ready fibers, fibers are stolen from the local
queues of other participating threads.

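As a rough sketch of how the benchmark threads are set up (not the benchmark
driver itself), every participating thread installs the __work_stealing__
algorithm with the same thread count. The count of 4 threads and the use of a
fiber barrier to keep the worker threads alive are assumptions made for this
illustration.

    #include <cstdint>
    #include <thread>
    #include <vector>

    #include <boost/fiber/all.hpp>
    #include <boost/fiber/algo/work_stealing.hpp>

    int main() {
        std::uint32_t thread_count = 4; // example value: 4 participating threads in total
        // the main thread takes part too and installs the same algorithm
        boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( thread_count);
        boost::fibers::barrier b{ thread_count };
        std::vector< std::thread > workers;
        for ( std::uint32_t i = 1; i < thread_count; ++i) {
            workers.emplace_back( [&b, thread_count]{
                // every worker registers the algorithm with the identical thread count
                boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( thread_count);
                b.wait(); // keep this thread's scheduler available until the work is done
            });
        }
        // ... launch fibers here; ready fibers may be stolen by idle worker threads
        b.wait(); // release the worker threads once the main thread is done
        for ( auto & t : workers) {
            t.join();
        }
        return 0;
    }
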
[table time per fiber (average over 100,000)
    [
        [fiber (4C/8T, work stealing, tcmalloc)]
        [fiber (4C/8T, work stealing)]
        [fiber (4C/8T, work sharing, tcmalloc)]
        [fiber (4C/8T, work sharing)]
        [fiber (1C/1T, round robin, tcmalloc)]
        [fiber (1C/1T, round robin)]
    ]
    [
        [0.13 \u00b5s - 0.26 \u00b5s]
        [0.35 \u00b5s - 0.66 \u00b5s]
        [0.62 \u00b5s - 0.80 \u00b5s]
        [0.90 \u00b5s - 1.11 \u00b5s]
        [0.90 \u00b5s - 1.03 \u00b5s]
        [0.91 \u00b5s - 1.28 \u00b5s]
    ]
]


[section:tweaking Tweaking]

__boost_fiber__ enables synchronization of fibers running on different threads
by default. This is accomplished by spinlocks (using atomics).
(BOOST_FIBERS_SPINLOCK_STD_MUTEX defined at the compiler[s] command line enables
`std::mutex` instead of spinlocks).

[heading disable synchronization]

With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
compiler[s] command line, synchronization between fibers (in different
threads) is disabled. This is acceptable if the application is single threaded
and/or fibers are not synchronized between threads.

[heading TTAS locks]

Spinlocks are implemented as TTAS locks, i.e. the spinlock tests the lock
before calling an atomic exchange. This strategy helps to reduce the cache
line invalidations triggered by acquiring/releasing the lock.

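The idea can be sketched as follows; this is a minimal illustration of a TTAS
lock in general, not Boost.Fiber's spinlock implementation.

    #include <atomic>

    // minimal TTAS lock: spin on a plain load first and attempt the
    // (cache-line invalidating) exchange only when the lock looks free
    class ttas_lock {
    private:
        std::atomic< bool >     locked_{ false };

    public:
        void lock() noexcept {
            for (;;) {
                // test: read-only wait, the cache line stays shared
                while ( locked_.load( std::memory_order_relaxed) ) {
                }
                // test-and-set: attempt the atomic exchange
                if ( ! locked_.exchange( true, std::memory_order_acquire) ) {
                    return;
                }
            }
        }

        void unlock() noexcept {
            locked_.store( false, std::memory_order_release);
        }
    };
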
[heading spin-wait loop]

A lock is considered under high contention if a thread repeatedly fails to
acquire the lock because some other thread was faster.
Waiting for a short time lets other threads finish before trying to enter the
critical section again. While busy waiting on the lock, relaxing the CPU (via
the pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait
loop (a sketch follows the list below):

* prevents expensive pipeline flushes (speculatively executed load and compare
  instructions are not pushed to the pipeline)
* another hardware thread (simultaneous multithreading) can get a time slice
* it delays for a few CPU cycles, but this is necessary to prevent starvation

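A busy-wait loop with such a relax hint might look like the following sketch;
the helper names, the x86-only pause intrinsic and the test limit of 100
(conceptually corresponding to BOOST_FIBERS_SPIN_MAX_TESTS) are assumptions for
illustration, not the library's code.

    #include <thread>

    #if defined(__i386__) || defined(__x86_64__)
    # include <immintrin.h>
    #endif

    // hint to the CPU that the code is in a spin-wait loop
    inline void cpu_relax() noexcept {
    #if defined(__i386__) || defined(__x86_64__)
        _mm_pause();               // emits the x86 'pause' instruction
    #else
        std::this_thread::yield(); // portable fallback for this sketch
    #endif
    }

    // bounded busy wait: relax for a limited number of tests, then give
    // up the time slice so the lock holder can make progress
    template< typename Pred >
    void spin_wait( Pred && locked, unsigned max_tests = 100) {
        unsigned tests = 0;
        while ( locked() ) {
            if ( tests++ < max_tests) {
                cpu_relax();
            } else {
                std::this_thread::yield();
            }
        }
    }
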
It is obvious that this strategy is useless on single-core systems because the
lock can only be released if the thread gives up its time slice in order to let
other threads run. The macro BOOST_FIBERS_SPIN_SINGLE_CORE disables active
spinning; in other words, the operating system is notified (via
`std::this_thread::yield()`) that the thread gives up its time slice and the
operating system switches to another thread.

[heading exponential back-off]

The macro BOOST_FIBERS_SPIN_MAX_TESTS determines how many times the CPU
iterates in the spin-wait loop before yielding the thread or blocking in
futex-wait.
The spinlock tracks how many times the thread failed to acquire the lock.
The higher the contention, the longer the thread should back off.
A ["Binary Exponential Backoff] algorithm together with a randomized contention
window is utilized for this purpose.
BOOST_FIBERS_SPIN_MAX_COLLISIONS determines the upper limit of collisions
between threads after which the thread waits on a futex.

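The following sketch shows the shape of such a randomized binary exponential
back-off; the cap of 10 doublings and the use of `std::minstd_rand` are
assumptions made for this illustration, not Boost.Fiber's constants. After the
collision limit is reached, the futex-based spinlock variants suspend the
thread instead of spinning further.

    #include <random>
    #include <thread>

    // sketch of binary exponential back-off with a randomized contention window
    void backoff( unsigned collisions) {
        static thread_local std::minstd_rand generator{ std::random_device{}() };
        // the contention window doubles with every collision, capped here at 2^10
        unsigned max_spins = 1u << ( collisions < 10 ? collisions : 10);
        std::uniform_int_distribution< unsigned > dist{ 0, max_spins };
        unsigned spins = dist( generator);
        for ( unsigned i = 0; i < spins; ++i) {
            std::this_thread::yield(); // or a 'pause' hint as in the sketch above
        }
    }
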
[table macros for tweaking
    [
        [Macro]
        [Effect on Boost.Fiber]
    ]
    [
        [BOOST_FIBERS_NO_ATOMICS]
        [no multithreading support, all atomics removed, no synchronization
        between fibers running in different threads]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
        [`std::mutex` used inside spinlock]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS]
        [spinlock with test-test-and-swap on shared variable]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
        [spinlock with test-test-and-swap on shared variable, adaptive retries
        while busy waiting]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
        [spinlock with test-test-and-swap on shared variable, suspend on
        futex after a certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
        [spinlock with test-test-and-swap on shared variable, adaptive retries
        while busy waiting, suspend on futex after a certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPIN_SINGLE_CORE]
        [on single core machines with multiple threads, yield thread
        (`std::this_thread::yield()`) after collisions]
    ]
    [
        [BOOST_FIBERS_SPIN_MAX_TESTS]
        [max number of retries while busy spinning]
    ]
    [
        [BOOST_FIBERS_SPIN_MAX_COLLISIONS]
        [max number of collisions between contending threads]
    ]
]

[endsect]


[endsect]