On the costs of asynchronous abstractions
The biggest force behind the evolution of Boost.Redis was my struggle to come up with a high-level connection abstraction capable of multiplexing Redis commands from independent sources while concurrently handling server pushes. This journey taught me many important lessons, many of which are related to the design and performance of asynchronous programs based on Boost.Asio.
In this article I will share some of the lessons learned, especially
those related to the performance costs of abstractions such as
async_read_until that tend to overschedule into the event-loop. In
this context I will also briefly comment on how the topics discussed
here influenced my views on the proposed
P2300
(a.k.a. Senders and Receivers), which is likely to become the basis of
networking in upcoming C++ standards.
Although the analysis presented in this article uses the Redis communication protocol for illustration, I expect it to be useful in general since RESP3 shares many similarities with other widely used protocols such as HTTP.
Parsing \r\n-delimited messages
The Redis server communicates with its clients by exchanging data
serialized in
RESP3 format.
Among the data types supported by this specification, the
\r\n-delimited messages are some of the most frequent in a typical
session. The table below shows some examples
| Command | Response | Wire format | RESP3 name |
|---|---|---|---|
| PING | PONG | +PONG\r\n | simple-string |
| INCR | 42 | :42\r\n | number |
| GET | null | _\r\n | null |
Redis also supports command pipelines, which provide a way of optimizing round-trip times by batching commands. A pipeline composed of the commands shown in the previous table looks like this
| Sent in a |
| single write |
+--------+ | | +-------+
| | --------> PING + INCR + GET --------> | |
| | | |
| Client | | Redis |
| | | |
| | <-------- "+PONG\r\n:42\r\n_\r\n" <-------- | |
+--------+ |<------>|<---->|<-->| +-------+
| |
| Responses |
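As a rough illustration, the batch above can be sent in a single write along these lines. This is only a sketch: it uses Redis's inline command form for brevity, assumes asio coroutines, and send_pipeline as well as the key names are made up for the example.
// Sketch: the three commands leave in one write; the three responses come
// back in a single \r\n-separated payload (e.g. "+PONG\r\n:42\r\n_\r\n" as in
// the diagram above). Assumes an already-connected tcp::socket and the usual
// asio using-declarations (awaitable, use_awaitable, async_write).
awaitable<void> send_pipeline(tcp::socket& socket)
{
   std::string request = "PING\r\nINCR cnt\r\nGET key\r\n"; // one batch
   co_await async_write(socket, boost::asio::buffer(request), use_awaitable);
   // The replies are then read back and split on the \r\n separators.
}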
Messages that use delimiters are so common in networking that a
facility called async_read_until for reading them incrementally from
a socket is already part of Boost.Asio. The coroutine below uses it to
print message contents to the screen
awaitable<void> parse_resp3_simple_msgs(tcp::socket socket)
{
for (std::string buffer;;) {
auto n = co_await async_read_until(socket, dynamic_buffer(buffer), "\r\n");
std::cout << buffer.substr(1, n - 3) << std::endl;
// Consume the buffer.
buffer.erase(0, n);
}
}
If we pay attention to the buffer content as it is parsed by the code above, we can see that it is rotated fairly often, for example
"+PONG\r\n:100\r\n+OK\r\n_\r\n"
":100\r\n+OK\r\n_\r\n"
"+OK\r\n_\r\n"
"_\r\n"
""
When I first noticed these apparently excessive buffer rotations I
was concerned they would severely impact the performance of
Boost.Redis. To measure the magnitude of this impact I came up with an
experimental implementation of Asio's dynamic_buffer that consumed
the buffer less eagerly than the std::string::erase call used
above. Instead of erasing immediately, the implementation increased a
buffer offset up to a certain threshold and only then triggered a
(larger) rotation. This is illustrated in the diagram below
|<---- offset threshold ---->|
| |
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
| # Initial offset
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<------>| # After 1st message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<-------------->| # After 2nd message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<--------------------->| # After 3rd message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<-------------------------->| # Threshold crossed after the 4th message
"+PONG\r\n"
| # After rotation
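The experimental implementation itself is not reproduced here, but the idea can be sketched in a few lines. The name lazy_buffer and its interface are mine, for illustration only, not part of Asio or Boost.Redis.
// Sketch of the lazy-consume idea: consume() only advances an offset and the
// (larger) rotation happens once the offset crosses a threshold.
#include <cstddef>
#include <string>
#include <string_view>

class lazy_buffer {
public:
   explicit lazy_buffer(std::size_t threshold) : threshold_(threshold) {}

   // The unconsumed region, i.e. what the parser still has to look at.
   std::string_view data() const noexcept
   { return std::string_view{data_}.substr(offset_); }

   void append(std::string_view chunk) { data_.append(chunk); }

   // Marks n bytes as consumed; rotates only when the threshold is crossed.
   void consume(std::size_t n)
   {
      offset_ += n;
      if (offset_ >= threshold_) {
         data_.erase(0, offset_); // one larger rotation instead of many small ones
         offset_ = 0;
      }
   }

private:
   std::string data_;
   std::size_t offset_ = 0;
   std::size_t threshold_;
};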
After comparing the performance of the two versions I was surprised to
find there was no difference! That was also very suspicious
since some RESP3 aggregate types contain a considerable number of
separators. For example, a map with two pairs [(key1, value1), (key2, value2)] encoded in RESP3 requires nine rotations in total (one per \r\n separator)
"%2\r\n$4\r\nkey1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"$4\r\nkey1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"key1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
...
It was evident that something more costly was overshadowing the buffer
rotations. But it could not be the search for the separator, since
that performs comparably to the rotations themselves. It is also easy to
show that the overhead is not related to any IO operation, since the
problem persists if the buffer is never consumed (which causes the
function to be called repeatedly with the same string). Once these two
factors are taken off the table, we are driven to the conclusion that
calling async_read_until has an intrinsic cost; let us see what
that is.
Async operations that complete synchronously considered harmful
Assume the scenario described earlier where async_read_until is used
to parse multiple \r\n-delimited messages. The following is a
detailed description of what happens behind the scenes
1. async_read_until calls socket.async_read_some repeatedly until the separator \r\n shows up in the buffer
"<read1>" # Read 1: needs more data.
"<read1><read2>" # Read 2: needs more data.
"<read1><read2>" # Read 3: needs more data.
"<read1><read2><read3>" # Read 4: needs more data.
"<read1><read2><read3>\r\n<bonus bytes>" # separator found, done.
2. The last call to socket.async_read_some happens to read past the separator \r\n (depicted as <bonus bytes> above), resulting in bonus (maybe incomplete) messages in the buffer
| 1st async_read_some | 2nd async_read_some |
| | |
"+message content here \r\n:100\r\n+OK\r\n_\r\n+incomplete respo"
| | | |
| Message wanted |<-- bonus msgs --->|<--incomplete-->|
| | msg |
| | |
| |<---------- bonus bytes ----------->|
3. The buffer is consumed and async_read_until is called again. However, since the buffer already contains the next message, this is an IO-less call
":100\r\n+OK\r\n_\r\n+not enough byt"
| | |
| No IO required | Need more |
| to parse these | data |
| messages. | |
The fact that step 3. doesn't perform any IO implies the operation can complete synchronously, but because this is an asynchronous function Boost.Asio by default won't call the continuation before the function returns. The implementation must therefore enqueue it for execution, as depicted below
OP5 ---> OP4 ---> OP3 ---> OP2 ---> OP1 # Reschedules the continuation
|
OP1 schedules its continuation |
+-----------------------------------+
|
|
OP6 ---> OP5 ---> OP4 ---> OP3 ---> OP2 # Reschedules the continuation
|
OP2 schedules its continuation |
+-----------------------------------+
|
|
OP7 ---> OP6 ---> OP5 ---> OP4 ---> OP3
When summed up, the excessive rescheduling of continuations leads to performance degradation at scale. But since this is an event-loop there is no way around rescheduling entirely, as doing otherwise would allow a task to monopolize the event-loop and prevent other tasks from making progress. The best that can be done is to avoid overscheduling, so let us determine how much rescheduling is too much.
The intrinsic latency of an event-loop
An event-loop is a design pattern originally used to handle events
external to the application, such as GUIs, networking and other forms
of IO. If we take this literally, it becomes evident that the way
async_read_until works is incompatible with an event-loop since
searching for the separator is not an external event and as such
should not have to be enqueued for execution.
Once we constrain ourselves to events that have an external origin,
such as anything related to IO (including any form of IPC), the
scheduling overhead is reduced considerably, since the latency
of the transport layer eclipses whatever time it takes to schedule the
continuation. For example, according to
these
benchmarks, the time it takes to schedule a task in the
asio::io_context is approximately 50ns.
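For reference, the order of magnitude of that figure can be checked with a trivial micro-benchmark along the following lines. This is only a rough sketch: it measures posting plus running a no-op handler, and the result depends heavily on hardware and build settings.
// Rough estimate of the per-task scheduling cost of asio::io_context.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main()
{
   boost::asio::io_context ioc;
   constexpr int n = 1'000'000;

   auto const start = std::chrono::steady_clock::now();
   for (int i = 0; i != n; ++i)
      boost::asio::post(ioc, []{});   // enqueue a no-op task
   ioc.run();                         // drain the n handlers
   auto const elapsed = std::chrono::steady_clock::now() - start;

   std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count() / n
             << " ns per scheduled task\n";
}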
To give the reader an idea about the magnitude of this number, if rescheduling alone were to account for 1% of the runtime of an app that uses asynchronous IO to move around data in chunks of size 128kb, then this app would have a throughput of approximately 24Gbs. At such high throughput multiple other factors kick in before any scheduling overhead even starts to manifest.
It is therefore safe to say that only asynchronous operations that perform no IO, or are not bound to any IO at all, are likely to overschedule in the sense described above. Those cases can usually be avoided; this is what worked for Boost.Redis:
- async_read_until was replaced with calls to socket.async_read_some plus an incremental parser that does not do any IO.
- Channel try_ functions are used to check whether send and receive operations can complete without suspension, for example try_send before async_send and try_receive before async_receive (see also try_send_via_dispatch for a more aggressive optimization); a sketch of this pattern follows this list.
- Individual requests are coalesced into a single payload to reduce the number of writes on the socket. This is only possible because Redis supports pipelining (good protocols help!).
- The socket read size was increased to 4kb to reduce the number of reads, a benefit that outweighs the cost of rotating more data in the buffer.
- The resp3::async_read abstraction was dropped. When I started developing Boost.Redis there was convincing precedent for having a resp3::async_read function to read complete RESP3 messages from a socket:

| Name | Description |
|---|---|
| asio::async_read | Reads n bytes from a stream. |
| beast::http::async_read | Reads a complete HTTP message. |
| beast::websocket::stream::async_read | Reads a complete Websocket message. |
| redis::async_read | Reads a complete RESP3 message. |

It turns out, however, that this function is also vulnerable to immediate completions, since in command pipelines multiple responses show up in the buffer after a single call to socket.async_read_some. When that happens, each call to resp3::async_read is IO-less.
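Here is a minimal sketch of the try-before-async pattern mentioned in the second item above, using asio::experimental::channel. The consume coroutine and the process callback are made up for the illustration and error handling is omitted.
// try_receive completes inline when a message is already buffered,
// avoiding a suspension and a trip through the event-loop.
#include <boost/asio.hpp>
#include <boost/asio/experimental/channel.hpp>
#include <string>

namespace asio = boost::asio;
using channel_type =
   asio::experimental::channel<void(boost::system::error_code, std::string)>;

asio::awaitable<void> consume(channel_type& ch, void (*process)(std::string))
{
   for (;;) {
      std::string msg;
      bool ready = ch.try_receive(
         [&](boost::system::error_code, std::string m) { msg = std::move(m); });

      if (!ready)
         msg = co_await ch.async_receive(asio::use_awaitable); // suspend only when needed

      process(std::move(msg));
   }
}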
Sometimes it is not possible to avoid asynchronous operations that complete synchronously; in the following sections we will see how to favor throughput over fairness in Boost.Asio.
Calling the continuation inline
In Boost.Asio it is possible to customize how an algorithm executes the continuation when an immediate completion occurs. This includes the ability to call it inline, thereby avoiding the costs of excessive rescheduling. Here is how it works
// (default) The continuation is enqueued for execution, regardless of
// whether it is immediate or not.
async_read_until(socket, buffer, "\r\n", continuation);
// Immediate completions are executed in exec2 (otherwise equal to the
// version above). The continuation is called inline if exec2 is the
// same executor that is running the operation.
async_read_until(socket, buffer, "\r\n", bind_immediate_executor(exec2, continuation));
To compare the performance of both cases I have written a small
function that calls async_read_until in a loop with a buffer that is
never consumed so that all completions are immediate. The version
below uses the default behaviour
void read_safe(tcp::socket& s, std::string& buffer)
{
auto continuation = [&s, &buffer](auto ec, auto n)
{
read_safe(s, buffer); // Recursive call
};
// This won't cause stack exhaustion because the continuation is
// not called inline but posted in the event loop.
async_read_until(s, dynamic_buffer(buffer), "\r\n", continuation);
}
To optimize away some of the rescheduling the version below uses the
bind_immediate_executor customization to call the continuation
reentrantly and then breaks the stack from time to time to avoid
exhausting it
void read_reentrant(tcp::socket& s, std::string& buffer)
{
   static int counter = 0;

   auto cont = [&](auto, auto)
   {
      read_reentrant(s, buffer); // Recursive call
   };

   // Breaks the callstack after 16 inline calls.
   if (++counter % 16 == 0) {
      post(s.get_executor(), [cont](){ cont(error_code{}, 0); });
      return;
   }

   // Continuation called reentrantly.
   async_read_until(s, dynamic_buffer(buffer), "\r\n",
      bind_immediate_executor(s.get_executor(), cont));
}
The diagram below shows what the reentrant chain of calls in the code above looks like from the event-loop point of view
OP5 ---> OP4 ---> OP3 ---> OP2 ---> OP1a # Completes immediately
|
|
... |
OP1b # Completes immediately
|
Waiting for OP5 to |
reschedule its |
continuation OP1c # Completes immediately
|
|
... |
OP1d # Break the call-stack
|
+-----------------------------------+
|
OP6 ---> OP5 ---> OP4 ---> OP3 ---> OP2
Unsurprisingly, the reentrant code is 3x faster than the one that relies on the default behaviour (keep in mind that this is a best-case scenario; in the general case not all completions are immediate). Although faster, this strategy has some downsides
- The overall operation is not as fast as possible since it still has to reschedule from time to time to break the call stack. The less it reschedules, the higher the risk of exhausting it.
- It is too easy to forget to break the stack. For example, the programmer might decide to branch somewhere into another chain of asynchronous calls that also uses this strategy. To avoid exhaustion, all such branches would have to be safeguarded with a manual rescheduling, i.e. post.
- It requires additional layers of complexity, such as bind_immediate_executor in addition to bind_executor.
- It is non-compliant with stricter guidelines that prohibit reentrant code.
- There is no simple way of choosing the maximum allowed number of reentrant calls for each function in a way that covers different use cases and users. Library writers and users would be tempted to use a small value, reducing the performance advantage.
- If the socket is always ready for reading, the task will monopolize IO for up to 16 iterations, which might cause stutter in unrelated tasks as depicted below
Unfairness
+----+----+----+ +----+----+----+ +----+----+----+
Socket-1 | | | | | | | | | | | |
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-2 | | | | | |
+----+ +----+ +----+
From the aesthetic point of view the code above is also unpleasant, as it breaks the function's asynchronous contract by injecting reentrant behaviour. It gives me the same kind of feeling I have about recursive mutexes.
Note: It is worth mentioning that a similar strategy is used to break the call stack of repeating algorithms in stdexec, but this time based on P2300 and not on Boost.Asio.
Coroutine tail-calls
In the previous section we have seen how to avoid overscheduling by instructing the asynchronous operation to call the completion inline when an immediate completion occurs. It turns out, however, that coroutine support for tail-calls provides a way to completely sidestep this problem. This feature is described by Lewis Baker as follows
A tail-call is one where the current stack-frame is popped before the call and the current function’s return address becomes the return-address for the callee. ie. the callee will return directly the the [sic] caller of this function.
This means (at least in principle) that a library capable of using tail-calls when an immediate completion occurs neither has to reschedule the continuation nor call it inline. To test how this feature compares to the other styles I have used Boost.Cobalt. The code looks as follows
// Warning: risks unfairness and starvation of other tasks.
task<void> read_until_unfair()
{
for (int i = 0; i != repeat; ++i) {
co_await async_read_until(s, dynamic_buffer(buffer), "\r\n", cobalt::use_op);
}
}
The results of this comparison are listed in the table below
| Time/s | Style | Configuration | Library |
|---|---|---|---|
| 1.0 | Coroutine | await_ready optimization | Boost.Cobalt |
| 4.8 | Callback | Reentrant | Boost.Asio |
| 10.3 | Coroutine | use_op | Boost.Cobalt |
| 14.9 | Callback | Regular | Boost.Asio |
| 15.6 | Coroutine | asio::deferred | Boost.Asio |
As the reader can see, cobalt::use_op ranks 3rd and is considerably
faster (10.3 vs 15.6) than the Asio equivalent that uses
default rescheduling. However, by trading rescheduling for tail-calls
the code above can now monopolize the event-loop, resulting in
unfairness if the socket happens to receive data at a higher rate
than other tasks. If data is received continuously
on a socket that is always ready for reading, other tasks will starve
Starvation
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-1 | | | | | | | | | | | | |
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-2 Starving ...
To avoid this problem the programmer is forced to reschedule from time to time, in the same way we did for the reentrant calls
task<void> read_until_fair()
{
   for (int i = 0; i != repeat; ++i) {
      if (i % 16 == 0) {
         // Reschedules to address unfairness and starvation of
         // other tasks.
         co_await post(cobalt::use_op);
         continue;
      }
      co_await async_read_until(s, dynamic_buffer(buffer), "\r\n", cobalt::use_op);
   }
}
Delegating fairness-safety to applications is a dangerous game. This is a problem the Tokio community had to deal with before the Tokio runtime started enforcing rescheduling (after 256 successful operations)
If data is received faster than it can be processed, it is possible that more data will have already been received by the time the processing of a data chunk completes. In this case, .await will never yield control back to the scheduler, other tasks will not be scheduled, resulting in starvation and large latency variance.
Currently, the answer to this problem is that the user of Tokio is responsible for adding yield points in both the application and libraries. In practice, very few actually do this and end up being vulnerable to this sort of problem.
Safety in P2300 (Senders and Receivers)
As of this writing, the C++ standards committee (WG21) has been pursuing the standardization of a networking library for almost 20 years. One of the biggest obstacles that prevented it from happening was a disagreement on what the asynchronous model that underlies networking should look like. Until 2021 that model was basically Boost.Asio executors, but in this poll the committee decided to abandon that front and concentrate efforts on the new P2300 proposal, also known as senders and receivers. The decision was quite abrupt
The original plan about a week earlier than the actual writing of this paper was to write a paper that makes a case for standardizing the Networking TS.
and opinions turned out to be very strong against Boost.Asio (see this for how each voter backed their vote)
The whole concept is completely useless, there's no composed code you can write with it.
The part of that debate that interests us most here is stated in P2471, which compares Boost.Asio with P2300
Yes, default rescheduling each operation and default not rescheduling each operation, is a poor trade off. IMO both options are poor. The one good option that I know of that can prevent stack exhaustion is first-class tail-recursion in library or language
ASIO has chosen to require that every async operation must schedule the completion on a scheduler (every read, every write, etc..).
sender/receiver has not decided to require that the completion be scheduled.
This is why I consider tail-call the only good solution. Scheduling solutions are all inferior (give thanks to Lewis for this shift in my understanding :) ).
Although tail-calls solve the problem of stack exhaustion, as we have
seen above, they make the code vulnerable to unfairness and starvation
and are therefore not an alternative to default rescheduling, as the
quotation above implies. To deal with the lack of
default rescheduling, libraries and applications built on top of P2300
have to address the aforementioned problems, layer after layer. For
example,
stdexec has invented something
called
trampoline-scheduler
to protect repeating algorithms such as repeat_effect_until from
exhausting the stack. This construct however is built around
reentrancy, allowing
sixteen
levels of inline calls by default. While in Boost.Asio it is possible to use
reentrancy as an optimization for corner cases, here it is made the
modus operandi; the downsides of this approach have already been stated in a
previous section so I won't repeat them here.
Also, the fact that a special scheduler is needed by specific algorithms is a problem in its own right, since it contradicts one of the main selling points of P2300, namely that of being generic. For example, P2464R0 shows the code below
void
run_that_io_operation(
scheduler auto sched,
sender_of<network_buffer> auto wrapping_continuation)
{
// snip
}
and states
I have no idea what the sched's concrete type is. I have no idea what the wrapping_continuation's concrete type is. They're none of my business, ...
Hence, by being generic, the algorithms built on top of P2300 are also unsafe (against stack exhaustion, unfairness and starvation). Otherwise, if library writers require a specific scheduler to ensure safety, then the algorithms automatically become non-generic: pick your poison!
The proposers of P2300 claim that it doesn't address safety because it should be seen as the low-level building blocks of asynchronous programming, and that it is the role of higher-level libraries to deal with that. This claim however does not hold since, as we have just seen, Boost.Asio also provides those building blocks but does so in a safe way. In fact, during the whole development of Boost.Redis I never had to think about these kinds of problems because safety is built in from the ground up.
Avoiding coroutine suspension with await_ready
Now let us get back to the first place in the table above, which uses
the await_ready optimization from Boost.Cobalt. This API provides
users with the ability to avoid coroutine suspension altogether in
case the separator is already present in the buffer. It works by
defining a struct with the following interface
struct read_until : cobalt::op<error_code, std::size_t> {
...
void ready(cobalt::handler<error_code, std::size_t> handler) override
{
// Search for the separator in buffer and call the handler if found
}
void initiate(cobalt::completion_handler<error_code, std::size_t> complete) override
{
// Regular call to async_read_until.
async_read_until(socket, buffer, delim, std::move(complete));
}
};
and the code that uses it
for (int i = 0; i != repeat; ++i) {
co_await read_until(socket, dynamic_buffer(buffer));
}
In essence, what the code above does is skip a call to
async_read_until by first checking with the ready function whether
the forthcoming operation is going to complete immediately. The
nice thing about it is that the programmer can apply this optimization
only when a performance bottleneck is detected, without planning for it
in advance. The drawback, however, is that it requires reimplementing
the search for the separator in the body of the ready function,
defeating the purpose of using async_read_until in the first place, as
(again) it would have been simpler to reformulate the operation in
terms of socket.async_read_some directly.
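For completeness, here is a minimal sketch of that reformulation. It is not the Boost.Redis parser, just the idea, and it assumes asio coroutines with use_awaitable: all complete messages already in the buffer are parsed without touching the socket, so IO-less completions never reach the event-loop.
// Sketch: parse \r\n-delimited messages with socket.async_read_some and an
// in-buffer search. Assumes: using boost::asio::awaitable, use_awaitable;
// using boost::asio::ip::tcp; plus <iostream> and <string>.
awaitable<void> parse_resp3_simple_msgs(tcp::socket socket)
{
   std::string buffer;
   std::size_t offset = 0; // start of the unparsed region

   for (;;) {
      // Drain every complete message already in the buffer: no IO, no
      // suspension and no trip through the event-loop.
      while (true) {
         auto pos = buffer.find("\r\n", offset);
         if (pos == std::string::npos)
            break;
         std::cout << buffer.substr(offset + 1, pos - offset - 1) << "\n";
         offset = pos + 2;
      }
      buffer.erase(0, offset); // a single rotation per read
      offset = 0;

      // Only now touch the socket to fetch more data.
      char chunk[4096];
      auto n = co_await socket.async_read_some(boost::asio::buffer(chunk), use_awaitable);
      buffer.append(chunk, n);
   }
}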
Acknowledgements
Thanks to Klemens Morgenstern for answering questions about Boost.Cobalt.