mirror of
https://github.com/boostorg/statechart.git
synced 2026-01-26 19:12:09 +00:00
331 lines
19 KiB
HTML
331 lines
19 KiB
HTML
<html>
|
||
|
||
<head>
|
||
<meta http-equiv="Content-Language" content="en-us">
|
||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||
<meta name="GENERATOR" content="Microsoft FrontPage 6.0">
|
||
<meta name="ProgId" content="FrontPage.Editor.Document">
|
||
<link rel="stylesheet" type="text/css" href="../../../boost.css">
|
||
<title>The Boost Statechart Library - Performance</title>
|
||
</head>
|
||
|
||
<body link="#0000ff" vlink="#800080">
|
||
|
||
<table border="0" cellpadding="7" cellspacing="0" width="100%" summary="header">
|
||
<tr>
|
||
<td valign="top" width="300">
|
||
<h3><a href="../../../index.htm">
|
||
<img alt="C++ Boost" src="../../../boost.png" border="0" width="277" height="86"></a></h3>
|
||
</td>
|
||
<td valign="top">
|
||
<h1 align="center">The Boost Statechart Library</h1>
|
||
<h2 align="center">Performance</h2>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
<hr>
|
||
<dl class="index">
|
||
<dt><a href="#Speed versus scalability tradeoffs">Speed versus scalability
|
||
tradeoffs</a></dt>
|
||
<dt><a href="#Memory management customization">Memory management
|
||
customization</a></dt>
|
||
<dt><a href="#RTTI customization">RTTI customization</a></dt>
|
||
<dt><a href="#Resource usage">Resource usage</a></dt>
|
||
</dl>
|
||
<h2><a name="Speed versus scalability tradeoffs">Speed versus scalability
|
||
tradeoffs</a></h2>
|
||
<p>Quite a bit of effort has gone into making the library fast for small
|
||
simple machines <b>and</b> scaleable at the same time (this applies only to
|
||
<code>state_machine<></code>, there still is some room for optimizing <code>
|
||
fifo_scheduler<></code>, especially for multi-threaded builds). While I
|
||
believe it should perform reasonably in most applications, the scalability
|
||
does not come for free. Small, carefully handcrafted state machines will thus
|
||
easily outperform equivalent Boost.Statechart machines. To get a picture of how big
|
||
the gap is, I implemented a simple benchmark in the BitMachine example. The
|
||
Handcrafted example is a handcrafted variant of the 1-bit-BitMachine
|
||
implementing the same benchmark.</p>
|
||
<p>I tried to create a fair but somewhat unrealistic <b>worst-case</b>
|
||
scenario:</p>
|
||
<ul>
|
||
<li>For both machines exactly one object of the only event is allocated
|
||
before starting the test. This same object is then sent to the machines
|
||
over and over</li>
|
||
<li>The Handcrafted machine employs GOF-visitor double dispatch. The states
|
||
are preallocated so that event dispatch & transition amounts to nothing more
|
||
than two virtual calls and one pointer assignment</li>
|
||
</ul>
|
||
<p>The Benchmarks - running on
|
||
a 3.2GHz Intel Pentium 4 - produced the following
|
||
dispatch and transition times per event:</p>
|
||
<ul>
|
||
<li>Handcrafted:<ul>
|
||
<li>MSVC7.1: 10ns</li>
|
||
<li>GCC3.4.2: 20ns</li>
|
||
<li>Intel9.0: 20ns</li>
|
||
</ul>
|
||
</li>
|
||
<li>1-bit-BitMachine with customized memory management:<ul>
|
||
<li>MSVC7.1: 150ns</li>
|
||
<li>GCC3.4.2: 320ns</li>
|
||
<li>Intel9.0: 170ns</li>
|
||
</ul></li>
|
||
</ul>
|
||
<p>Although this is a big difference I still think it will not be noticeable
|
||
in most real-world applications. No matter whether an application uses
|
||
handcrafted or Boost.Statechart machines it will...</p>
|
||
<ul>
|
||
<li>almost never run into a situation where a state machine is swamped with
|
||
as many events as in the benchmarks. A state machine will almost always spend a good deal of time waiting for events
|
||
(which typically come from a human operator, from machinery or
|
||
from electronic devices over often comparatively slow I/O channels).
|
||
Parsers are just about the only application of FSMs where this is not the
|
||
case. However, parser FSMs are usually not directly specified on the state
|
||
machine level but on a higher one that is better suited for the task.
|
||
Examples of such higher levels are: Boost.Regex, Boost.Spirit, XML Schemas,
|
||
etc. Moreover, the nature of parsers allows for a number of optimizations
|
||
that are not possible in a general-purpose FSM framework.<br>
|
||
Bottom line: While it is possible to implement a parser with this library,
|
||
it is almost never advisable to do so because other approaches lead to
|
||
better performing and more expressive code</li>
|
||
<li>often run state machines in their own threads. This adds considerable
|
||
locking and thread-switching overhead. Performance tests with the PingPong
|
||
example, where two asynchronous state machines exchange events, gave the
|
||
following times to process one event and perform the resulting in-state
|
||
reaction (using the library with <code>boost::fast_pool_allocator<></code>):<ul>
|
||
<li>Single-threaded (no locking and waiting): 840ns / 840ns</li>
|
||
<li>Multi-threaded with one thread (the scheduler uses mutex locking but
|
||
never has to wait for events): 6500ns / 4800ns</li>
|
||
<li>Multi-threaded with two threads (both schedulers use mutex locking and
|
||
exactly one always waits for an event): 14000ns / 7000ns</li>
|
||
</ul>
|
||
<p>As mentioned above, there definitely is some room to improve the
|
||
timings for the asynchronous machines. Moreover, these are very crude
|
||
benchmarks, designed to show the overhead of locking and thread context
|
||
switching. The overhead in a real-world application will typically be
|
||
smaller and other operating systems can certainly do better in this area.
|
||
However, I strongly believe that on most platforms the threading overhead is
|
||
usually larger
|
||
than the time that Boost.Statechart spends for event dispatch and transition. Handcrafted machines will
|
||
inevitably have the same overhead, making raw single-threaded dispatch and
|
||
transition speed much less important</li>
|
||
<li>almost always allocate events with <code>new</code> and destroy them
|
||
after consumption. This will add a few cycles, even if event memory
|
||
management is customized</li>
|
||
<li>often use state machines that employ orthogonal states and other
|
||
advanced features. This forces the handcrafted machines to use a more
|
||
adequate and more time-consuming book-keeping</li>
|
||
</ul>
|
||
<p>Therefore, in real-world applications event dispatch and transition not
|
||
normally constitutes a bottleneck and the relative gap between handcrafted and
|
||
Boost.Statechart machines also becomes much smaller than in the worst-case scenario.</p>
|
||
<h3>Detailed performance data</h3>
|
||
<p>In an effort to identify the main performance bottleneck, the example
|
||
program "Performance" has been written. It measures the time that is
|
||
spent to process one event in different BitMachine variants. In contrast to
|
||
the BitMachine example, which uses only transitions, Performance uses
|
||
a varying number of in-state reactions together with transitions. The only
|
||
difference between in-state-reactions and transitions is that the former
|
||
neither enter nor exit any states. Apart from that, the same amount of code
|
||
needs to be run to dispatch an event and execute the resulting action.</p>
|
||
<p>The following diagrams show the average time the library spends to process
|
||
one event, depending on the percentage of in-state reactions employed. 0%
|
||
means that all triggered reactions are transitions. 100% means that all
|
||
triggered reactions are in-state reactions. I draw the following conclusions
|
||
from these measurements:</p>
|
||
<ul>
|
||
<li>The fairly linear course of the curves suggests that the measurements
|
||
with a 100% in-state reaction ratio are accurate and not merely a product of
|
||
optimizations in the compiler. Such optimizations might have been possible
|
||
due to the fact that in the 100% case it is known at compile-time that the
|
||
current state will never change</li>
|
||
<li>The data points with 100% in-state reaction ratio and speed optimized
|
||
RTTI show that modern compilers seem to inline the complex-looking dispatch code so aggressively that dispatch is reduced to
|
||
little more than it actually is, one virtual function call followed by a
|
||
linear search for a suitable reaction. For instance, in the case of the 1-bit Bitmachine, Intel9.0 seems to produce dispatch code that is equally efficient
|
||
like the two virtual function calls in the Handcrafted machine</li>
|
||
<li>On all compilers and in all variants the time spent in event dispatch is
|
||
dwarfed by the time spent to exit the current state and enter the target
|
||
state. It is worth noting that BitMachine is a flat and non-orthogonal state
|
||
machine, representing a close-to-worst case. Real-world machines will often
|
||
exit and enter multiple states during a transition, what further dwarfs pure
|
||
dispatch time. This makes the implementation of constant-time dispatch (as
|
||
requested by a few people during formal review) an undertaking with little
|
||
merit. Instead, the future optimization effort will concentrate on
|
||
state-entry and state-exit</li>
|
||
<li>Intel9.0 seems to have problems to optimize/inline code as soon as the
|
||
amount of code grows over a certain threshold. Unlike with the other two
|
||
compilers, I needed to compile the tests for the 1, 2, 3 and 4-bit BitMachine
|
||
into separate executables to get good performance. Even then was the
|
||
performance overly bad for the 4-bit BitMachine. It was much worse when I
|
||
compiled all 4 tests into the same executable. This surely looks like a bug
|
||
in the compiler</li>
|
||
</ul>
|
||
<h4>Out of the box</h4>
|
||
<p>The library is used as is, without any optimizations/modifications.</p>
|
||
<p><img border="0" src="PerformanceNormal1.gif" width="371" height="284"><img border="0" src="PerformanceNormal2.gif" width="371" height="284"><img border="0" src="PerformanceNormal3.gif" width="371" height="284"><img border="0" src="PerformanceNormal4.gif" width="371" height="284"></p>
|
||
<h4>Native RTTI</h4>
|
||
<p>The library is used with <code>
|
||
<a href="configuration.html#Application Defined Macros">
|
||
BOOST_STATECHART_USE_NATIVE_RTTI</a></code> defined.</p>
|
||
<p><img border="0" src="PerformanceNative1.gif" width="371" height="284"><img border="0" src="PerformanceNative2.gif" width="371" height="284"><img border="0" src="PerformanceNative3.gif" width="371" height="284"><img border="0" src="PerformanceNative4.gif" width="371" height="284"></p>
|
||
<h4>Customized memory-management</h4>
|
||
<p>The library is used with customized memory
|
||
management (<code>boost::fast_pool_allocator</code>).</p>
|
||
<p><img border="0" src="PerformanceCustom1.gif" width="371" height="284"><img border="0" src="PerformanceCustom2.gif" width="371" height="284"><img border="0" src="PerformanceCustom3.gif" width="371" height="284"><img border="0" src="PerformanceCustom4.gif" width="371" height="284"></p>
|
||
<h3>Double dispatch</h3>
|
||
<p>At the heart of every state machine lies an implementation of double
|
||
dispatch. This is due to the fact that the incoming event <b>and</b> the
|
||
active state define exactly which <a href="definitions.html#Reaction">reaction</a>
|
||
the state machine will produce. For each event dispatch, one virtual call is
|
||
followed by a linear search for the appropriate reaction, using one RTTI
|
||
comparison per reaction. The following alternatives were considered but
|
||
rejected:</p>
|
||
<ul>
|
||
<li><a href="http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic
|
||
visitor</a>: This double-dispatch variant satisfies all scalability
|
||
requirements but performs badly due to costly inheritance tree cross-casts.
|
||
Moreover, a state must store one v-pointer for <b>each</b> reaction what
|
||
slows down construction and makes memory management customization
|
||
inefficient. In addition, C++ RTTI must inevitably be turned on, with
|
||
negative effects on executable size. Boost.Statechart originally employed acyclic
|
||
visitor and was about 4 times slower than it is now (MSVC7.1 on Intel
|
||
Pentium M). The dispatch speed might be better on other platforms but the
|
||
other negative effects will remain</li>
|
||
<li>
|
||
<a href="http://www.isbiel.ch/~due/courses/c355/slides/patterns/visitor.pdf">
|
||
GOF Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine
|
||
depend upon all events. That is, whenever a new event is added there is no
|
||
way around recompiling the whole state machine. This is contrary to the
|
||
scalability requirements</li>
|
||
<li>Single two-dimensional array of function pointers: To satisfy requirement 6, it
|
||
should be possible to spread a single state machine over several translation
|
||
units. This however means that the dispatch table must be filled at runtime
|
||
and the different translation units must somehow make themselves "known", so
|
||
that their part of the state machine can be added to the table. There simply
|
||
is no way to do this automatically <b>and</b> portably. The only portable
|
||
way that a state machine distributed over several translation units could
|
||
employ table-based double dispatch relies on the user. The programmer(s)
|
||
would somehow have to <b>manually</b> tie together the various pieces of the
|
||
state machine. Not only does this scale badly but is also quite
|
||
error-prone</li>
|
||
</ul>
|
||
<h2><a name="Memory management customization">Memory management customization</a></h2>
|
||
<p>Out of the box, everything (event objects, state objects, internal data, etc.) is allocated
|
||
through <code>std::allocator< void ></code> (the default for the Allocator
|
||
template parameter). This
|
||
should be satisfactory for applications meeting the following prerequisites:</p>
|
||
<ul>
|
||
<li>There are no deterministic reaction time (hard real-time) requirements</li>
|
||
<li>The application will never run long enough for heap fragmentation to
|
||
become a problem. This is of course an issue for all long running programs
|
||
not only the ones employing this library. However, it should be noted that
|
||
fragmentation problems could show up earlier than with traditional FSM
|
||
frameworks</li>
|
||
</ul>
|
||
<p>Should an application not meet these prerequisites, Boost.Statechart's memory
|
||
management can be customized as follows:</p>
|
||
<ul>
|
||
<li>By passing a model of the standard Allocator concept to the class
|
||
templates that support a corresponding parameter (<code>event<></code>,
|
||
<code>state_machine<></code>,
|
||
<code>asynchronous_state_machine<></code>, <code>fifo_scheduler<></code>,
|
||
<code>fifo_worker<></code>). This redirects all allocations to the passed
|
||
custom allocator and should satisfy the needs of just about any project</li>
|
||
<li>Additionally, it is possible to <b>separately</b> customize <b>state</b> memory
|
||
management by overloading <code>operator
|
||
new()</code> and <code>operator delete()</code> for all state classes but
|
||
this is probably only useful under rare circumstances</li>
|
||
</ul>
|
||
<h2><a name="RTTI customization">RTTI customization</a></h2>
|
||
<p>RTTI is used for event dispatch and <code>state_downcast<>()</code>.
|
||
Currently, there are exactly two options:</p>
|
||
<ol>
|
||
<li>By default, a speed-optimized internal implementation is employed</li>
|
||
<li>The library can be instructed to use native C++ RTTI instead by defining
|
||
<code><a href="configuration.html#Application Defined Macros">
|
||
BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li>
|
||
</ol>
|
||
<p>There are 2 reasons to favor 2:</p>
|
||
<ul>
|
||
<li>When a state machine (or parts of it) are compiled into a DLL, problems
|
||
could arise from the use of the internal RTTI mechanism (see the FAQ item "<a href="faq.html#How_can_I_compile_a_state_machine_into_a_dynamic_link_library_(DLL)">How
|
||
can I compile a state machine into a dynamic link library (DLL)?</a>"). Using
|
||
option 2 is one way to work around such problems (on some platforms, it seems
|
||
to be the only way)</li>
|
||
<li>State and event
|
||
objects need to store one pointer less, meaning that in the best case the
|
||
memory footprint of a state machine object could shrink by 15% (an empty event
|
||
is typically 30% smaller, what can be an advantage when there are bursts of
|
||
events rather than a steady flow). However, on most platforms executable size
|
||
grows when C++ RTTI is turned on. So, given the small per machine object
|
||
savings, this only makes sense in applications where both of the following
|
||
conditions hold:<ul>
|
||
<li>Event dispatch will never become a
|
||
bottleneck</li>
|
||
<li>There is a need to reduce the memory allocated at runtime (at the cost
|
||
of a larger executable)</li>
|
||
</ul>
|
||
<p>Obvious candidates are embedded systems where the executable resides in
|
||
ROM. Other candidates are applications running a large number of identical
|
||
state machines where this measure could even reduce the <b>overall</b> memory
|
||
footprint</li>
|
||
</ul>
|
||
<h2><a name="Resource usage">Resource usage</a></h2>
|
||
<h3>Memory</h3>
|
||
<p>On a 32-bit box, one empty active state typically needs less than 50 bytes
|
||
of memory. Even <b>very</b> complex machines will usually have less than 20
|
||
simultaneously active states so just about every machine should run with less
|
||
than one kilobyte of memory (not counting event queues). Obviously, the
|
||
per-machine memory footprint is offset by whatever state-local members the
|
||
user adds.</p>
|
||
<h3>Processor cycles</h3>
|
||
<p>The following ranking should give a rough picture of what feature will
|
||
consume how many cycles:</p>
|
||
<ol>
|
||
<li><code>state_cast<>()</code>: By far the most cycle-consuming feature.
|
||
Searches linearly for a suitable state, using one <code>dynamic_cast</code>
|
||
per visited state</li>
|
||
<li>State entry and exit: Profiling of the fully optimized 1-bit-BitMachine
|
||
suggested that roughly 3 quarters of the total event processing time is spent destructing the
|
||
exited state and constructing the entered state. Obviously, transitions
|
||
where the <a href="definitions.html#Innermost common context">innermost
|
||
common context</a> is "far" from the leaf states and/or with lots of
|
||
orthogonal states can easily cause the destruction and construction of quite
|
||
a few states leading to significant amounts of time spent for a transition</li>
|
||
<li><code>state_downcast<>()</code>: Searches linearly for the requested
|
||
state, using one virtual call and one RTTI comparison per visited state</li>
|
||
<li>Deep history: For all innermost states inside a state passing either
|
||
<code>has_deep_history</code> or <code>has_full_history</code> to its
|
||
state base class, a binary search
|
||
through the (usually small) history map must be performed on each exit.
|
||
History slot allocation is performed exactly once, at first exit</li>
|
||
<li>Shallow history: For all direct inner states of a state passing either
|
||
<code>has_shallow_history</code> or <code>has_full_history</code> to its
|
||
state base class, a binary search
|
||
through the (usually small) history map must be performed on each exit.
|
||
History slot allocation is performed exactly once, at first exit</li>
|
||
<li>Event dispatch: One virtual call followed by a linear search for a
|
||
suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI
|
||
comparison per visited reaction</li>
|
||
<li>Orthogonal states: One additional virtual call for each exited state <b>
|
||
if</b> there is more than one active leaf state before a transition. It should
|
||
also be noted that the worst-case event dispatch time is multiplied in the
|
||
presence of orthogonal states. For example, if two orthogonal leaf states
|
||
are added to a given state configuration, the worst-case time is tripled</li>
|
||
</ol>
|
||
<hr>
|
||
<p>Revised
|
||
<!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->17 December, 2005<!--webbot bot="Timestamp" endspan i-checksum="38520" --></p>
|
||
<p><i><EFBFBD> Copyright <a href="mailto:ahd6974-spamgroupstrap@yahoo.com">Andreas Huber D<>nni</a>
|
||
2003-2005. <font color="#FF0000"><b>The link refers to a
|
||
<a href="http://en.wikipedia.org/wiki/Honeypot">spam honeypot</a>. Please remove the words spam and trap
|
||
to obtain my real address.</b></font></i></p>
|
||
<p><i>Distributed under the Boost Software License, Version 1.0. (See
|
||
accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or
|
||
copy at <a href="http://www.boost.org/LICENSE_1_0.txt">
|
||
http://www.boost.org/LICENSE_1_0.txt</a>)</i></p>
|
||
|
||
</body>
|
||
|
||
</html>
|