# On the costs of asynchronous abstractions
The biggest force behind the evolution of
[Boost.Redis](https://github.com/boostorg/redis) was my struggle to
come up with a high-level connection abstraction capable of
multiplexing Redis commands from independent sources while
concurrently handling server pushes. This journey taught me many
important lessons, many of which are related to the design and
performance of asynchronous programs based on Boost.Asio.
In this article I will share some of the lessons learned, especially
those related to the performance costs of _abstractions_ such as
`async_read_until` that tend to overschedule work on the event-loop. In
this context I will also briefly comment on how the topics discussed
here influenced my views on the proposed
[P2300](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2300r7.html)
(a.k.a. Senders and Receivers), which is likely to become the basis of
networking in upcoming C++ standards.
Although the analysis presented here uses the Redis communication
protocol for illustration, I expect it to be useful in general since
[RESP3](https://github.com/antirez/RESP3/blob/master/spec.md) shares
many similarities with other widely used protocols such as HTTP.
## Parsing `\r\n`-delimited messages
The Redis server communicates with its clients by exchanging data
serialized in
[RESP3](https://github.com/antirez/RESP3/blob/master/spec.md) format.
Among the data types supported by this specification, the
`\r\n`-delimited messages are some of the most frequent in a typical
session. The table below shows some examples
Command | Response | Wire format | RESP3 name
---------|----------|---------------|---------------------
PING | PONG | `+PONG\r\n` | simple-string
INCR | 42 | `:42\r\n` | number
GET | null | `_\r\n` | null
Redis also supports command pipelines, which provide a way of
optimizing round-trip times by batching commands. A pipeline composed
of the commands shown in the previous table looks like this
```
| Sent in a |
| single write |
+--------+ | | +-------+
| | --------> PING + INCR + GET --------> | |
| | | |
| Client | | Redis |
| | | |
| | <-------- "+PONG\r\n:42\r\n_\r\n" <-------- | |
+--------+ |<------>|<---->|<-->| +-------+
| |
| Responses |
```
Messages that use delimiters are so common in networking that a
facility called `async_read_until` for reading them incrementally from
a socket is already part of Boost.Asio. The coroutine below uses it to
print message contents to the screen
```cpp
awaitable<void> parse_resp3_simple_msgs(tcp::socket socket)
{
   for (std::string buffer;;) {
      auto n = co_await async_read_until(socket, dynamic_buffer(buffer), "\r\n");
      std::cout << buffer.substr(1, n - 3) << std::endl;
      // Consume the buffer.
      buffer.erase(0, n);
   }
}
```
If we pay attention to the buffer content as it is parsed by the code
above we can see it is rotated fairly often, for example
```
"+PONG\r\n:100\r\n+OK\r\n_\r\n"
":100\r\n+OK\r\n_\r\n"
"+OK\r\n_\r\n"
"_\r\n"
""
```
When I first noticed these apparently excessive buffer rotations I
was concerned they would severely impact the performance of
Boost.Redis. To measure the magnitude of this impact I came up with an
experimental implementation of Asio's `dynamic_buffer` that consumed
the buffer less eagerly than the `std::string::erase` call used
above. For that, the implementation increased a buffer offset up
to a certain threshold and only then triggered a (larger) rotation.
This is illustrated in the diagram below
```
|<---- offset threshold ---->|
| |
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
| # Initial message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<------>| # After 1st message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<-------------->| # After 2nd message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<--------------------->| # After 3rd message
"+PONG\r\n:100\r\n+OK\r\n_\r\n+PONG\r\n"
|<-------------------------->| # 4th message crosses the threshold
"+PONG\r\n"
| # After rotation
```
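The experimental implementation itself is not shown here, but its core idea can be sketched with a small standalone class (a simplified illustration under my own naming, not Asio's `DynamicBuffer` interface; the class name and the default threshold are made up):
```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Consuming only advances an offset; the expensive rotation (erase) happens
// once the offset crosses the threshold, so rotations are rarer but larger.
class lazy_consume_buffer {
public:
   explicit lazy_consume_buffer(std::size_t threshold = 1024)
   : threshold_{threshold} {}

   // Unconsumed bytes.
   std::string_view data() const noexcept
   { return std::string_view{storage_}.substr(offset_); }

   void append(char const* p, std::size_t n) { storage_.append(p, n); }

   // Marks n bytes as consumed without necessarily rotating the string.
   void consume(std::size_t n)
   {
      offset_ += n;
      if (offset_ >= threshold_) {
         storage_.erase(0, offset_); // The (larger) rotation.
         offset_ = 0;
      }
   }

private:
   std::string storage_;
   std::size_t offset_ = 0;
   std::size_t threshold_;
};
```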
After comparing the performance of the two versions I was surprised
to find there was no difference at all! That was also very suspicious,
since some RESP3 aggregate types contain a considerable number of
separators. For example, a map with two pairs `[(key1, value1),
(key2, value2)]` encoded in RESP3 requires ten rotations in total
```
"%2\r\n$4\r\nkey1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"$4\r\nkey1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"key1\r\n$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
"$6\r\nvalue1\r\n$4\r\nkey2\r\n$6\r\nvalue2\r\n"
...
```
It was evident that something more costly was overshadowing the buffer
rotations. It couldn't be the search for the separator, whose cost is
comparable to that of the rotations. It is also easy to show that the
overhead is not related to any IO operation, since the problem persists
if the buffer is never consumed (which causes the function to be
called with the same string repeatedly). Once these two factors
are taken off the table, we are driven to the conclusion that
calling `async_read_until` has an intrinsic cost. Let us see what
that is.
### Async operations that complete synchronously considered harmful
Assume the scenario described earlier where `async_read_until` is used
to parse multiple `\r\n`-delimited messages. The following is a
detailed description of what happens behind the scenes
1. `async_read_until` calls `socket.async_read_some` repeatedly
until the separator `\r\n` shows up in the buffer
```
"<read1>" # Read 1: needs more data.
"<read1><read2>" # Read 2: needs more data.
"<read1><read2>" # Read 3: needs more data.
"<read1><read2><read3>" # Read 4: needs more data.
"<read1><read2><read3>\r\n<bonus bytes>" # separator found, done.
```
2. The last call to `socket.async_read_some` happens to read past
the separator `\r\n` (depicted as `<bonus bytes>` above),
resulting in bonus (maybe incomplete) messages in the buffer
```
| 1st async_read_some | 2nd async_read_some |
| | |
"+message content here \r\n:100\r\n+OK\r\n_\r\n+incomplete respo"
| | | |
| Message wanted |<-- bonus msgs --->|<--incomplete-->|
| | msg |
| | |
| |<---------- bonus bytes ----------->|
```
3. The buffer is consumed and `async_read_until` is called again.
However, since the buffer already contains the next message this
is an IO-less call
```
":100\r\n+OK\r\n_\r\n+not enough byt"
| | |
| No IO required | Need more |
| to parse these | data |
| messages. | |
```
The fact that step 3 doesn't perform any IO means the operation can
complete synchronously, but because this is an asynchronous function,
Boost.Asio by default won't call the continuation before the
initiating function returns. The implementation must therefore enqueue
it for execution, as depicted below
```
OP5 ---> OP4 ---> OP3 ---> OP2 ---> OP1 # Reschedules the continuation
|
OP1 schedules its continuation |
+-----------------------------------+
|
|
OP6 ---> OP5 ---> OP4 ---> OP3 ---> OP2 # Reschedules the continuation
|
OP2 schedules its continuation |
+-----------------------------------+
|
|
OP7 ---> OP6 ---> OP5 ---> OP4 ---> OP3
```
Summed up, this excessive rescheduling of continuations leads to
performance degradation at scale. But since this is an event-loop
there is no way around rescheduling altogether, as doing otherwise
would allow a task to monopolize the event-loop, preventing other
tasks from making progress. The best that can be done is to avoid
_overscheduling_, so let us determine how much rescheduling is too
much.
## The intrinsic latency of an event-loop
An event-loop is a design pattern originally used to handle events
external to the application, such as GUIs, networking and other forms
of IO. If we take this literally, it becomes evident that the way
`async_read_until` works is incompatible with an event-loop since
_searching for the separator_ is not an external event and as such
should not have to be enqueued for execution.
Once we constrain ourselves to events that have an external origin,
such as anything related to IO, including any form of IPC, the
scheduling overhead is reduced considerably, since the latency
of the transport layer eclipses whatever time it takes to schedule the
continuation. For example, according to
[these](https://www.boost.org/doc/libs/develop/libs/cobalt/doc/html/index.html#posting_to_an_executor)
benchmarks, the time it takes to schedule a task in the
`asio::io_context` is approximately `50ns`.
To give the reader an idea of the magnitude of this number: if
rescheduling alone were to account for 1% of the runtime of an app
that uses asynchronous IO to move data around in 128 KiB chunks, each
chunk would have to take about 5µs (50ns is 1% of 5µs), which
corresponds to a throughput of approximately 24 GiB/s. At such a
high throughput multiple other factors kick in before any scheduling
overhead even starts to manifest.
It is therefore safe to say that only asynchronous operations that
perform no IO, or are not bound to any IO, are ever likely to
overschedule in the sense described above. Those cases can usually be
avoided; here is what worked for Boost.Redis
1. `async_read_until` was replaced with calls to
   `socket.async_read_some` plus an incremental parser that does no
   IO (a minimal sketch is shown after this list).
2. Channel `try_` functions are used to check if send and receive
operations can be called without suspension. For example,
`try_send` before `async_send` and `try_receive` before
`async_receive` ([see also](https://github.com/chriskohlhoff/asio/commit/fe4fd7acf145335eeefdd19708483c46caeb45e5)
`try_send_via_dispatch` for a more aggressive optimization).
3. Coalescing of individual requests into a single payload to reduce
   the number of necessary writes on the socket; this is only
   possible because Redis supports pipelining (good protocols
   help!).
4. Increased the socket read size to 4kb to reduce the number of
   reads (the gain from fewer reads outweighs the cost of rotating
   more data in the buffer).
5. Dropped the `resp3::async_read` abstraction. When I started
developing Boost.Redis there was convincing precedent for having
a `resp3::async_read` function to read complete RESP3 messages
from a socket
   Name                                    | Description
   ----------------------------------------|-------------------------------------
   `asio::async_read`                      | Reads `n` bytes from a stream.
   `beast::http::async_read`               | Reads a complete HTTP message.
   `beast::websocket::stream::async_read`  | Reads a complete WebSocket message.
   `redis::async_read`                     | Reads a complete RESP3 message.
   It turns out, however, that this function is also vulnerable to
   immediate completions, since in command pipelines multiple
   responses show up in the buffer after a single call to
   `socket.async_read_some`. When that happens, each call to
   `resp3::async_read` is IO-less.
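To make item 1 concrete, here is a rough sketch of such a read loop for the `\r\n`-delimited messages used earlier; the plain `find` stands in for the incremental RESP3 parser used by Boost.Redis, and the snippet assumes the same using-declarations and the `use_awaitable` token of the earlier examples:
```cpp
awaitable<void> read_loop(tcp::socket socket)
{
   std::string incoming;
   char tmp[4096];
   for (;;) {
      // Drain every complete message already buffered: pure parsing,
      // no IO and no trip through the event-loop.
      for (auto pos = incoming.find("\r\n"); pos != std::string::npos;
           pos = incoming.find("\r\n")) {
         std::cout << incoming.substr(1, pos - 1) << std::endl;
         incoming.erase(0, pos + 2);
      }
      // Only when more data is needed do we touch the socket.
      auto n = co_await socket.async_read_some(buffer(tmp), use_awaitable);
      incoming.append(tmp, n);
   }
}
```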
Sometimes it is not possible to avoid asynchronous operations that
complete synchronously. In the following sections we will therefore
see how favoring throughput over fairness works in Boost.Asio.
### Calling the continuation inline
In Boost.Asio it is possible to customize how an algorithm executes
the continuation when an immediate completion occurs, including
the ability to call it inline, thereby avoiding the costs of
excessive rescheduling. Here is how it works
```cpp
// (default) The continuation is enqueued for execution, regardless of
// whether the completion is immediate or not.
async_read_until(socket, buffer, "\r\n", continuation);

// Immediate completions are executed in exec2 (otherwise equal to the
// version above). The continuation is called inline if exec2 is the
// same executor that is running the operation.
async_read_until(socket, buffer, "\r\n", bind_immediate_executor(exec2, continuation));
```
To compare the performance of the two cases I have written a small
function that calls `async_read_until` in a loop with a buffer that is
never consumed, so that all completions are immediate. The version
below uses the default behaviour
```cpp
void read_safe(tcp::socket& s, std::string& buffer)
{
   auto continuation = [&s, &buffer](auto ec, auto n)
   {
      read_safe(s, buffer); // Recursive call
   };

   // This won't cause stack exhaustion because the continuation is
   // not called inline but posted in the event loop.
   async_read_until(s, dynamic_buffer(buffer), "\r\n", continuation);
}
```
To optimize away some of the rescheduling, the version below uses the
`bind_immediate_executor` customization to call the continuation
reentrantly, breaking the stack from time to time to avoid
exhausting it
```cpp
void read_reentrant(tcp::socket& s, std::string& buffer)
{
   static int counter = 0;

   auto cont = [&s, &buffer](auto, auto)
   {
      read_reentrant(s, buffer); // Recursive call
   };

   // Breaks the callstack after 16 inline calls.
   if (++counter % 16 == 0) {
      post(s.get_executor(), [cont](){ cont(error_code{}, 0); });
      return;
   }

   // Continuation called reentrantly.
   async_read_until(s, dynamic_buffer(buffer), "\r\n",
      bind_immediate_executor(s.get_executor(), cont));
}
```
The diagram below shows what the reentrant chain of calls in the code
above looks like from the event-loop point of view
```
OP5 ---> OP4 ---> OP3 ---> OP2 ---> OP1a # Completes immediately
|
|
... |
OP1b # Completes immediately
|
Waiting for OP5 to |
reschedule its |
continuation OP1c # Completes immediately
|
|
... |
OP1d # Break the call-stack
|
+-----------------------------------+
|
OP6 ---> OP5 ---> OP4 ---> OP3 ---> OP2
```
Unsurprisingly, the reentrant code is 3x faster than the one that
relies on the default behaviour (don't forget that this is a best-case
scenario; in the general case not all completions are immediate).
Although faster, this strategy has some downsides
- The overall operation is not as fast as possible since it still
  has to reschedule from time to time to break the call stack. The
  less it reschedules, the higher the risk of exhausting the stack.
- It is too easy to forget to break the stack. For example, the
  programmer might decide to branch somewhere into another chain of
  asynchronous calls that also uses this strategy. To avoid
  exhaustion, all such branches would have to be safeguarded with
  manual rescheduling, i.e. `post`.
- It requires additional layers of complexity such as
  `bind_immediate_executor` in addition to `bind_executor`.
- It is not compliant with stricter
  [guidelines](https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Developing_Safety-Critical_Code)
  that prohibit reentrant code.
- There is no simple way of choosing the maximum allowed number of
  reentrant calls for each function in a way that covers different
  use cases and users. Library writers and users would be tempted
  into using a small value, reducing the performance advantage.
- If the socket is always ready for reading, the task will
  monopolize IO for up to `16` iterations, which might cause
  stutter in unrelated tasks as depicted below
```
Unfairness
+----+----+----+ +----+----+----+ +----+----+----+
Socket-1 | | | | | | | | | | | |
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-2 | | | | | |
+----+ +----+ +----+
```
From an aesthetic point of view the code above is also unpleasant, as
it breaks the function's asynchronous contract by injecting reentrant
behaviour. It gives me the same kind of feeling I have about
[recursive
mutexes](http://www.zaval.org/resources/library/butenhof1.html).
Note: It is worth mentioning here that a similar
[strategy](https://github.com/NVIDIA/stdexec/blob/6f23dd5b1d523541ce28af32fc2603403ebd36ed/include/exec/trampoline_scheduler.hpp#L52)
is used to break the call stack of repeating algorithms in
[stdexec](https://github.com/NVIDIA/stdexec), but this time based
on
[P2300](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2300r7.html)
and not on Boost.Asio.
### Coroutine tail-calls
In the previous section we have seen how to avoid overscheduling by
instructing the asynchronous operation to call the completion inline
on immediate completion. It turns out, however, that coroutine support
for _tail-calls_ provides a way to sidestep this problem completely.
This feature is described by
[Baker](https://lewissbaker.github.io/2020/05/11/understanding_symmetric_transfer)
as follows
> A tail-call is one where the current stack-frame is popped before
> the call and the current functions return address becomes the
> return-address for the callee. ie. the callee will return directly
> the the [sic] caller of this function.
This means (at least in principle) that a library capable of using
tail-calls when an immediate completion occurs neither has to
reschedule the continuation nor call it inline. To test how this
feature compares to the other styles I have used Boost.Cobalt. The
code looks as follows
```cpp
// Warning: risks unfairness and starvation of other tasks.
task<void> read_until_unfair()
{
   for (int i = 0; i != repeat; ++i) {
      co_await async_read_until(s, dynamic_buffer(buffer), "\r\n", cobalt::use_op);
   }
}
```
The result of this comparison is listed in the table below
Time/s | Style     | Configuration               | Library
-------|-----------|-----------------------------|-------------
1.0    | Coroutine | `await_ready` optimization  | Boost.Cobalt
4.8    | Callback  | Reentrant                   | Boost.Asio
10.3   | Coroutine | `use_op`                    | Boost.Cobalt
14.9   | Callback  | Regular                     | Boost.Asio
15.6   | Coroutine | `asio::deferred`            | Boost.Asio
As the reader can see, `cobalt::use_op` ranks 3rd and is considerably
faster (10.3 vs 15.6) than the Asio equivalent that uses default
rescheduling. However, by trading rescheduling for tail-calls
the code above can now monopolize the event-loop, resulting in
unfairness if the socket happens to receive data at a higher rate
than other tasks. If data happens to be received continuously
on a socket that is always ready for reading, other tasks will starve
```
Starvation
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-1 | | | | | | | | | | | | |
+----+----+----+----+----+----+----+----+----+----+----+----+
Socket-2 Starving ...
```
To avoid this problem the programmer is forced to reschedule from time
to time, in the same way we did for the reentrant calls
```cpp
task<void> read_until_fair()
{
   for (int i = 0; i != repeat; ++i) {
      if (i % 16 == 0) {
         // Reschedules to address unfairness and starvation of
         // other tasks.
         co_await post(cobalt::use_op);
         continue;
      }
      co_await async_read_until(s, dynamic_buffer(buffer), "\r\n", cobalt::use_op);
   }
}
```
Delegating fairness-safety to applications is a dangerous game.
This is a
[problem](https://tokio.rs/blog/2020-04-preemption) the Tokio
community had to deal with before the Tokio runtime started enforcing
rescheduling (after 256 successful operations)
> If data is received faster than it can be processed, it is possible
> that more data will have already been received by the time the
> processing of a data chunk completes. In this case, .await will
> never yield control back to the scheduler, other tasks will not be
> scheduled, resulting in starvation and large latency variance.
> Currently, the answer to this problem is that the user of Tokio is
> responsible for adding yield points in both the application and
> libraries. In practice, very few actually do this and end up being
> vulnerable to this sort of problem.
### Safety in P2300 (Senders and Receivers)
As of this writing, the C++ standards committee (WG21) has been
pursuing the standardization of a networking library for almost 20
years. One of the biggest obstacles that prevented it from happening
was a disagreement on what the _asynchronous model_ that underlies
networking should look like. Until 2021 that model was basically
Boost.Asio _executors_, but in this
[poll](https://www.reddit.com/r/cpp/comments/q6tgod/c_committee_polling_results_for_asynchronous/)
the committee decided to abandon that front and concentrate efforts on
the new [P2300](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2300r7.html)
proposal, also known as _senders and receivers_. The decision was
quite [abrupt](https://isocpp.org/files/papers/P2464R0.html)
> The original plan about a week earlier than the actual writing of
> this paper was to write a paper that makes a case for standardizing
> the Networking TS.
and opinions turned out to be very strong against Boost.Asio (see
[this](https://api.csswg.org/bikeshed/?force=1&url=https://raw.githubusercontent.com/brycelelbach/wg21_p2459_2022_january_library_evolution_poll_outcomes/main/2022_january_library_evolution_poll_outcomes.bs)
for how each voter backed their vote)
> The whole concept is completely useless, there's no composed code
> you can write with it.
The part of that debate that interests us most here is stated in
[P2471](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2471r1.pdf),
which compares Boost.Asio with P2300
> Yes, default rescheduling each operation and default not
> rescheduling each operation, is a poor trade off. IMO both options
> are poor. The one good option that I know of that can prevent stack
> exhaustion is first-class tail-recursion in library or language
> ASIO has chosen to require that every async operation must schedule
> the completion on a scheduler (every read, every write, etc..).
> sender/receiver has not decided to
> require that the completion be scheduled.
> This is why I consider tail-call the only good solution. Scheduling
> solutions are all inferior (give thanks to Lewis for this shift in
> my understanding :) ).
Although tail-calls solve the problem of stack-exhaustion, as we have
seen above, they make the code vulnerable to unfairness and starvation
and are therefore not an alternative to default-rescheduling as the
quotation above implies. To deal with the lack of
default-rescheduling, libraries and applications built on top of P2300
have to address the aforementioned problems, layer after layer. For
example,
[stdexec](https://github.com/NVIDIA/stdexec) has invented something
called a
_[trampoline-scheduler](https://github.com/NVIDIA/stdexec/blob/e7cd275273525dbc693f4bf5f6dc4d4181b639e4/include/exec/trampoline_scheduler.hpp)_
to protect repeating algorithms such as `repeat_effect_until` from
exhausting the stack. This construct, however, is built around
reentrancy, allowing
[sixteen](https://github.com/NVIDIA/stdexec/blob/83cdb92d316e8b3bca1357e2cf49fc39e9bed403/include/exec/trampoline_scheduler.hpp#L52)
levels of inline calls by default. While in Boost.Asio it is possible to use
reentrancy as an optimization for corner cases, here it is made its
_modus operandi_; my opinion about this has already been stated in a
previous section so I won't repeat it here.
Also, the fact that a special scheduler is needed by specific
algorithms is a problem on its own, since it contradicts one of the
main selling points of P2300, namely that of being _generic_. For
example, [P2464R0](https://isocpp.org/files/papers/P2464R0.html) uses
the code below as an example
```cpp
void
run_that_io_operation(
scheduler auto sched,
sender_of<network_buffer> auto wrapping_continuation)
{
// snip
}
```
and states
> I have no idea what the sched's concrete type is. I have no idea
> what the wrapping_continuation's concrete type is. They're none of
> my business, ...
Hence, by being generic, the algorithms built on top of P2300 are also
unsafe (against stack-exhaustion, unfairness and starvation). Otherwise,
if library writers require a specific scheduler to ensure safety, then
the algorithms automatically become non-generic. Pick your poison!
The proposers of P2300 claim that it doesn't address safety because it
should be seen as the low-level building blocks of asynchronous
programming, and that it is the role of higher-level libraries to deal
with that. This claim, however, does not hold since, as we have just
seen, Boost.Asio also provides those building blocks but does so in a
safe way. In fact, during the whole development of Boost.Redis I never
had to think about these kinds of problems because safety is built in
from the ground up.
### Avoiding coroutine suspension with `await_ready`
Now let us get back to the first place in the table above, which uses
the `await_ready` optimization from Boost.Cobalt. This API provides
users with the ability to avoid coroutine suspension altogether in
case the separator is already present in the buffer. It works by
defining a `struct` with the following interface
```cpp
struct read_until : cobalt::op<error_code, std::size_t> {
   ...

   void ready(cobalt::handler<error_code, std::size_t> handler) override
   {
      // Search for the separator in the buffer and call the handler if found.
   }

   void initiate(cobalt::completion_handler<error_code, std::size_t> complete) override
   {
      // Regular call to async_read_until.
      async_read_until(socket, buffer, delim, std::move(complete));
   }
};
```
and the code that uses it
```cpp
for (int i = 0; i != repeat; ++i) {
   co_await read_until(socket, dynamic_buffer(buffer));
}
```
In essence, what the code above does is skip a call to
`async_read_until` by first checking with the `ready` function whether
the forthcoming operation is going to complete immediately. The
nice thing about it is that the programmer can apply this optimization
only when a performance bottleneck is detected, without planning for it
in advance. The drawback, however, is that it requires reimplementing
the search for the separator in the body of the `ready` function,
defeating the purpose of using `async_read_until` in the first place, as
(again) it would have been simpler to reformulate the operation in
terms of `socket.async_read_some` directly.
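For illustration, a possible `ready` body might look like the sketch below; the `buffer` member and the exact way the `cobalt::handler` is invoked are assumptions based on the interface shown above, not code taken from Boost.Cobalt or Boost.Redis:
```cpp
void ready(cobalt::handler<error_code, std::size_t> handler) override
{
   // Re-implements the separator search: if a complete message is already
   // buffered, complete right away, skipping suspension and IO entirely.
   if (auto pos = buffer.find("\r\n"); pos != std::string::npos)
      handler(error_code{}, pos + 2); // Bytes up to and including "\r\n".
   // Otherwise the handler is not invoked and initiate() runs instead.
}
```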
## Acknowledgements
Thanks to Klemens Morgenstern for answering questions about
Boost.Cobalt.