[/
      Copyright Oliver Kowalke 2016.
      Distributed under the Boost Software License, Version 1.0.
      (See accompanying file LICENSE_1_0.txt or copy at
      http://www.boost.org/LICENSE_1_0.txt
]

[section:performance Performance]

Performance measurements were taken using `std::chrono::high_resolution_clock`,
with overhead corrections.
The code was compiled with gcc-6.2.0, using build options:
variant = release, optimization = speed.
Tests were executed on an Intel Core i7-4770S 3.10GHz, 4 cores with 8
hyperthreads (4C/8T), running Linux (4.7.0/x86_64).

Measurements labelled 1C/1T were run in a single-threaded process.

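The overhead correction can be illustrated with a short sketch: the clock's own
cost is estimated first and subtracted from each sample. The helper names and
the number of calibration iterations below are assumptions made for this
illustration, not taken from the benchmark sources.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>

    using clock_type = std::chrono::high_resolution_clock;

    // estimate the cost of reading the clock itself
    clock_type::duration clock_overhead() {
        auto overhead = clock_type::duration::max();
        for ( std::size_t i = 0; i < 10; ++i) {
            auto start = clock_type::now();
            auto stop = clock_type::now();
            overhead = std::min( overhead, stop - start);
        }
        return overhead;
    }

    // time a callable once, subtracting the measured clock overhead
    template< typename Fn >
    clock_type::duration timed( Fn && fn, clock_type::duration overhead) {
        auto start = clock_type::now();
        fn();
        auto diff = clock_type::now() - start;
        return diff > overhead ? diff - overhead : clock_type::duration::zero();
    }
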
The [@https://github.com/atemerev/skynet microbenchmark ['syknet]] from
|
|
Alexander Temerev was ported and used for performance measurements.
|
|
At the root the test spawns 10 threads-of-execution (ToE), e.g.
|
|
actor/goroutine/fiber etc.. Each spawned ToE spawns additional 10 ToEs ...
|
|
until 100000 ToEs are created. ToEs return back ther ordinal numbers
|
|
(0 ... 99999), which are summed on the previous level and sent back upstream,
|
|
until reaching the root. The test was run 10-20 times, producing a range of
|
|
values for each measurement.
|
|
|
|
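A simplified, single-threaded port of the benchmark to __boost_fiber__ might
look like the sketch below; the channel capacities and the use of
`buffered_channel` are choices made for this sketch only, not a copy of the
benchmark sources. For the published numbers the fibers additionally used
__fixedsize_stack__ stacks and the schedulers listed further down.

    #include <cstdint>
    #include <functional>
    #include <iostream>

    #include <boost/fiber/all.hpp>

    using channel_type = boost::fibers::buffered_channel< std::uint64_t >;

    // each ToE either reports its ordinal number or spawns 10 children
    // and reports the sum of their results
    void skynet( channel_type & c, std::uint64_t num, std::uint64_t size, std::uint64_t div) {
        if ( 1 == size) {
            c.push( num);
        } else {
            channel_type rc{ 16 };
            for ( std::uint64_t i = 0; i < div; ++i) {
                auto sub_num = num + i * size / div;
                boost::fibers::fiber{ skynet, std::ref( rc), sub_num, size / div, div }.detach();
            }
            std::uint64_t sum = 0;
            for ( std::uint64_t i = 0; i < div; ++i) {
                std::uint64_t value = 0;
                rc.pop( value);
                sum += value;
            }
            c.push( sum);
        }
    }

    int main() {
        channel_type rc{ 2 };
        boost::fibers::fiber{ skynet, std::ref( rc), std::uint64_t( 0), std::uint64_t( 100000), std::uint64_t( 10) }.detach();
        std::uint64_t result = 0;
        rc.pop( result);
        std::cout << result << std::endl; // sum of 0 ... 99999 == 4999950000
        return 0;
    }
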
[table time per actor/erlang process/goroutine (other languages) (average over 100,000)
    [
        [Haskell | stack-1.0.4]
        [Erlang | erts-7.0]
        [Go | go1.6.1 (GOMAXPROCS == default)]
        [Go | go1.6.1 (GOMAXPROCS == 8)]
    ]
    [
        [0.32 \u00b5s]
        [0.64 \u00b5s - 1.21 \u00b5s]
        [1.52 \u00b5s - 1.64 \u00b5s]
        [0.70 \u00b5s - 0.98 \u00b5s]
    ]
]

`std::thread` cannot be tested at this time (C++14) because the API does not
allow setting the thread stack size (the default is 2MB on Linux and 1MB on
Windows). An out-of-memory error is likely. With pthreads the stack size was
set to 8kB.

[table time per thread (average over *10,000* - unable to spawn 100,000 threads)
    [
        [pthread]
        [`std::thread`]
    ]
    [
        [14.4 \u00b5s - 20.8 \u00b5s]
        [18.8 \u00b5s - 28.1 \u00b5s]
    ]
]

The test utilizes 4 cores with Simultaneous MultiThreading enabled (8 logical
CPUs). The fiber stacks are allocated by __fixedsize_stack__.

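For reference, a fiber can be given such a stack through the allocator-aware
constructor; the 8kB size below is an assumption for this sketch, chosen to
mirror the pthread stack size above, not a recommendation.

    #include <memory>

    #include <boost/fiber/all.hpp>

    int main() {
        // pass an explicit stack allocator to the fiber; 8kB is an example value
        boost::fibers::fiber f{
            std::allocator_arg,
            boost::fibers::fixedsize_stack( 8 * 1024),
            []{ /* fiber body */ } };
        f.join();
        return 0;
    }
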
As the benchmark shows, the memory allocation algorithm is significant for
performance in a multithreaded environment. The tests use glibc[s] memory
allocation algorithm (based on ptmalloc2) as well as Google[s]
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] (via
linkflags="-ltcmalloc").[footnote
Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo
["An Experimental Study on Memory Allocators in Multicore and
Multithreaded Applications], PDCAT [,]11 Proceedings of the 2011 12th
International Conference on Parallel and Distributed Computing, Applications
and Technologies, pages 92-98]

The __shared_work__ scheduling algorithm uses one global queue, containing
fibers ready to run, shared between all threads. The work is distributed equally
over all threads.
In the __work_stealing__ scheduling algorithm, each thread has its own local
queue. Fibers that are ready to run are pushed to and popped from the local
queue. If the queue runs out of ready fibers, fibers are stolen from the local
queues of other participating threads.

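As a rough sketch of how the benchmark threads are set up (not the benchmark
driver itself), every participating thread installs the __work_stealing__
algorithm with the same thread count. The count of 4 threads and the use of a
fiber barrier to keep the worker threads alive are assumptions made for this
illustration.

    #include <cstdint>
    #include <thread>
    #include <vector>

    #include <boost/fiber/all.hpp>
    #include <boost/fiber/algo/work_stealing.hpp>

    int main() {
        std::uint32_t thread_count = 4; // example value: 4 participating threads in total
        // the main thread takes part too and installs the same algorithm
        boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( thread_count);
        boost::fibers::barrier b{ thread_count };
        std::vector< std::thread > workers;
        for ( std::uint32_t i = 1; i < thread_count; ++i) {
            workers.emplace_back( [&b, thread_count]{
                // every worker registers the algorithm with the identical thread count
                boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( thread_count);
                b.wait(); // keep this thread's scheduler available until the work is done
            });
        }
        // ... launch fibers here; ready fibers may be stolen by idle worker threads
        b.wait(); // release the worker threads once the main thread is done
        for ( auto & t : workers) {
            t.join();
        }
        return 0;
    }
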
[table time per fiber (average over 100,000)
    [
        [fiber (4C/8T, work stealing, tcmalloc)]
        [fiber (4C/8T, work stealing)]
        [fiber (4C/8T, work sharing, tcmalloc)]
        [fiber (4C/8T, work sharing)]
        [fiber (1C/1T, round robin, tcmalloc)]
        [fiber (1C/1T, round robin)]
    ]
    [
        [0.13 \u00b5s - 0.26 \u00b5s]
        [0.35 \u00b5s - 0.66 \u00b5s]
        [0.62 \u00b5s - 0.80 \u00b5s]
        [0.90 \u00b5s - 1.11 \u00b5s]
        [0.90 \u00b5s - 1.03 \u00b5s]
        [0.91 \u00b5s - 1.28 \u00b5s]
    ]
]


[section:tweaking Tweaking]

__boost_fiber__ enables synchronization of fibers running on different threads
by default. This is accomplished by spinlocks (using atomics).
(BOOST_FIBERS_SPINLOCK_STD_MUTEX defined at the compiler[s] command line enables
`std::mutex` instead of spinlocks).

[heading disable synchronization]

With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
compiler[s] command line, synchronization between fibers (in different
threads) is disabled. This is acceptable if the application is single threaded
and/or fibers are not synchronized between threads.

[heading TTAS locks]

Spinlocks are implemented as TTAS locks, i.e. the spinlock tests the lock
before calling an atomic exchange. This strategy helps to reduce the cache
line invalidations triggered by acquiring/releasing the lock.

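The idea can be sketched as follows; this is a minimal illustration of a TTAS
lock in general, not Boost.Fiber's spinlock implementation.

    #include <atomic>

    // minimal TTAS lock: spin on a plain load first and attempt the
    // (cache-line invalidating) exchange only when the lock looks free
    class ttas_lock {
    private:
        std::atomic< bool >     locked_{ false };

    public:
        void lock() noexcept {
            for (;;) {
                // test: read-only wait, the cache line stays shared
                while ( locked_.load( std::memory_order_relaxed) ) {
                }
                // test-and-set: attempt the atomic exchange
                if ( ! locked_.exchange( true, std::memory_order_acquire) ) {
                    return;
                }
            }
        }

        void unlock() noexcept {
            locked_.store( false, std::memory_order_release);
        }
    };
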
[heading spin-wait loop]

A lock is considered under high contention if a thread repeatedly fails to
acquire the lock because some other thread was faster.
Waiting for a short time lets other threads finish before trying to enter the
critical section again. While busy waiting on the lock, relaxing the CPU (via
the pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait
loop (a sketch follows the list below):

* prevents expensive pipeline flushes (speculatively executed load and compare
  instructions are not pushed to the pipeline)
* another hardware thread (simultaneous multithreading) can get a time slice
* it delays for a few CPU cycles, but this is necessary to prevent starvation

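A busy-wait loop with such a relax hint might look like the following sketch;
the helper names, the x86-only pause intrinsic and the test limit of 100
(conceptually corresponding to BOOST_FIBERS_SPIN_MAX_TESTS) are assumptions for
illustration, not the library's code.

    #include <thread>

    #if defined(__i386__) || defined(__x86_64__)
    # include <immintrin.h>
    #endif

    // hint to the CPU that the code is in a spin-wait loop
    inline void cpu_relax() noexcept {
    #if defined(__i386__) || defined(__x86_64__)
        _mm_pause();               // emits the x86 'pause' instruction
    #else
        std::this_thread::yield(); // portable fallback for this sketch
    #endif
    }

    // bounded busy wait: relax for a limited number of tests, then give
    // up the time slice so the lock holder can make progress
    template< typename Pred >
    void spin_wait( Pred && locked, unsigned max_tests = 100) {
        unsigned tests = 0;
        while ( locked() ) {
            if ( tests++ < max_tests) {
                cpu_relax();
            } else {
                std::this_thread::yield();
            }
        }
    }
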
It is obvious that this strategy is useless on single-core systems because the
lock can only be released if the thread gives up its time slice in order to let
other threads run. The macro BOOST_FIBERS_SPIN_SINGLE_CORE disables active
spinning; in other words, the operating system is notified (via
`std::this_thread::yield()`) that the thread gives up its time slice and the
operating system switches to another thread.

[heading exponential back-off]

The macro BOOST_FIBERS_SPIN_MAX_TESTS determines how many times the CPU
iterates in the spin-wait loop before yielding the thread or blocking in
futex-wait.
The spinlock tracks how many times the thread failed to acquire the lock.
The higher the contention, the longer the thread should back off.
A ["Binary Exponential Backoff] algorithm together with a randomized contention
window is utilized for this purpose.
BOOST_FIBERS_SPIN_MAX_COLLISIONS determines the upper limit of collisions
between threads after which the thread waits on a futex.

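The following sketch shows the shape of such a randomized binary exponential
back-off; the cap of 10 doublings and the use of `std::minstd_rand` are
assumptions made for this illustration, not Boost.Fiber's constants. After the
collision limit is reached, the futex-based spinlock variants suspend the
thread instead of spinning further.

    #include <random>
    #include <thread>

    // sketch of binary exponential back-off with a randomized contention window
    void backoff( unsigned collisions) {
        static thread_local std::minstd_rand generator{ std::random_device{}() };
        // the contention window doubles with every collision, capped here at 2^10
        unsigned max_spins = 1u << ( collisions < 10 ? collisions : 10);
        std::uniform_int_distribution< unsigned > dist{ 0, max_spins };
        unsigned spins = dist( generator);
        for ( unsigned i = 0; i < spins; ++i) {
            std::this_thread::yield(); // or a 'pause' hint as in the sketch above
        }
    }
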
[table macros for tweaking
    [
        [Macro]
        [Effect on Boost.Fiber]
    ]
    [
        [BOOST_FIBERS_NO_ATOMICS]
        [no multithreading support, all atomics removed, no synchronization
        between fibers running in different threads]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
        [`std::mutex` used inside spinlock]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS]
        [spinlock with test-test-and-swap on shared variable]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
        [spinlock with test-test-and-swap on shared variable, adaptive retries
        while busy waiting]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
        [spinlock with test-test-and-swap on shared variable, suspend on
        futex after a certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
        [spinlock with test-test-and-swap on shared variable, adaptive retries
        while busy waiting, suspend on futex after a certain number of retries]
    ]
    [
        [BOOST_FIBERS_SPIN_SINGLE_CORE]
        [on single core machines with multiple threads, yield thread
        (`std::this_thread::yield()`) after collisions]
    ]
    [
        [BOOST_FIBERS_SPIN_MAX_TESTS]
        [max number of retries while busy spinning]
    ]
    [
        [BOOST_FIBERS_SPIN_MAX_COLLISIONS]
        [max number of collisions between contending threads]
    ]
]

[endsect]


[endsect]