<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

<head>
<title>Applying classic memory allocation strategies to C++ containers (v0.1)</title>

<style>
p {text-align:justify}
li {text-align:justify}
blockquote.note
{
   background-color:#E0E0E0;
   padding-left: 15px;
   padding-right: 15px;
   padding-top: 1px;
   padding-bottom: 1px;
}
ins {background-color:#FFFFA0}
del {background-color:#FFFFA0}
cpp_comment {color:#007F33}
pre {color:#000066}
code {color:#000066}
</style>
</head>

<body>

<h1>Applying classic memory allocation strategies to C++ containers (v0.1)</h1>
<p>
Ion Gaztañaga <<a href="mailto:igaztanaga@gmail.com">igaztanaga@gmail.com</a>><br>
</p>

<h1>Table of Contents</h1>

<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Chapter1">Chapter 1: In-place expansion for <code>vector</code></a>
  <ul>
  <li><a href="#WhatWeTried">What we tried: Existing proposals for in-place expansion</a></li>
  <li><a href="#NewFunctionsForExpansion">New functions for expansion capabilities</a></li>
  <li><a href="#AConfigurableAllocator">A configurable allocator for testing</a></li>
  <li><a href="#TestingForward">Testing forward expansion in <code>vector</code></a></li>
  <li><a href="#TestingBackwards">Testing backwards expansion in <code>vector</code></a></li>
  <li><a href="#TestingShrinkToFit">Testing shrink to fit in <code>vector</code></a></li>
  </ul>
</li>
<li><a href="#Chapter2">Chapter 2: Burst allocation for node containers</a>
  <ul>
  <li><a href="#BurstAllocationIntro">Burst allocation introduction</a></li>
  <li><a href="#BurstAllocationInfrastructure">Burst allocation infrastructure</a></li>
  <li><a href="#BurstInterfaceForAllocators">Burst allocation interface for allocators</a></li>
  <li><a href="#TestingBurstAllocation">Testing burst allocation</a></li>
  </ul>
</li>
<li><a href="#Chapter3">Chapter 3: Adaptive pools for node containers</a>
  <ul>
  <li><a href="#SimpleSegregatedStorage">Simple segregated storage: advantages and problems</a></li>
  <li><a href="#AdaptivePoolsExplained">Adaptive pools explained</a></li>
  <li><a href="#PresentingAdaptivePool">Presenting <code>adaptive_pool</code></a></li>
  <li><a href="#TestingAdaptivePool">Testing <code>adaptive_pool</code></a></li>
  </ul>
</li>
</ul>
<h1><a name="Introduction"></a>Introduction</h1>
<p>
I'm sure that many C++ programmers have wondered where good old realloc
fits in C++. And that's a good question. Could we improve <code>std::vector</code> performance
using memory expansion mechanisms to avoid too many copies? But <code>std::vector</code>
is not the only container that could benefit from classic memory allocation strategies: What if we
could take advantage of the insertion of multiple elements in <code>std::list</code> using
a burst allocation mechanism that could amortize costs (mutex locks, searches...) that
can't be amortized when using single node allocation strategies? Some space-saving
strategies also come to mind with node containers: Can we find a mechanism to build a
memory allocator that offers the size-saving benefits of a simple segregated storage
strategy (see <a href="http://www.boost.org/libs/pool">Boost.Pool</a>) without suffering
from oversized node pools when most of the nodes have been deallocated?
</p>
<p>
This article will explain how some classic improvement strategies have been applied to
C++ allocators. Many of these mechanisms are being used for shared-memory allocators and
containers (see <a href="http://www.boost.org/libs/interprocess">Boost.Interprocess</a>)
where size restrictions are tighter than with heap allocators and other strategies (e.g.
per-thread pools) are not applicable. So we'll be using <a href="http://www.boost.org/libs/interprocess">
Boost.Interprocess</a> containers to measure the speed up that we can obtain when using
those classic strategies in C++.
</p>
<p>
Obviously, these C++ allocators need some underlying general purpose allocator, since
new and delete don't offer expansion and burst capabilities. In this article we'll
use a modified <a href="http://g.oswego.edu/dl/html/malloc.html">Doug Lea Malloc, aka
DLmalloc</a> memory allocator with new basic functions that will offer the necessary
underlying machinery to implement all these classic strategies. <b>DLmalloc</b> is known to be
very size and speed efficient, and this allocator is used as the basis of many malloc
implementations, including multithreaded allocators built on top of <b>DLmalloc</b>
(see <a href="http://directory.fsf.org/project/ptmalloc2/">ptmalloc2</a>,
<a href="http://www.malloc.de/en/">ptmalloc3</a> or
<a href="http://www.nedprod.com/programs/portable/nedmalloc/index.html">nedmalloc</a>). So
improvements obtained by the classic strategies applied to C++ will measure real-world
speedups, even when the allocator is very efficient.
</p>
<p>
I've developed some additional classes, like STL allocators, pools and tests. I've put
all this code under the Boost Software License, and I've respected Boost guidelines so that
the developed code can be integrated into Boost.Container.
</p>
<h1><a name="Chapter1"></a>Chapter 1: In-place expansion for <code>vector</code></h1>

<h2><a name="WhatWeTried"></a>What we tried: Existing proposals for in-place expansion</h2>

<p>
There've been a couple of C++ proposals to add in-place expansion mechanisms to standard
allocators (<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1953.html">N1953:
Upgrading the Interface of Allocators using API Versioning</a> written by Howard Hinnant
and <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html">N2045:
Improving STL allocators</a> written by me based on Hinnant's proposal). Unfortunately
they didn't receive much interest in the committee, but there is hope that we could get
some of this functionality for C++1x if the C++ community shows some interest. Let's
review a bit what these proposals said:
</p>
<p>Hinnant's proposal adds some new functionalities to C++ allocators:
<ol>
<li>
The ability to know the size of a memory buffer allocated with the allocator.
<li>
The ability to expand a previously obtained memory buffer (similar to realloc but with no
memory copy).
<li>
The ability to shrink a previously obtained memory buffer (similar to realloc but with no
memory copy).
<li>
Three size concepts when allocating memory: <b>preferred size</b>, <b>limit size</b> and <b>received size</b>.</li>
</ol>
<P></P>
<p>
The first functionality is interesting to avoid wasting some precious bytes that the memory
allocator adds at the end of the buffer due to alignment reasons or because of its own
memory allocation algorithm. As an example, <b>DLmalloc</b> on 32 bit systems will need a
minimum of 16 bytes for every allocation (4 bytes are payload and 12 bytes are available
for the user): if we try to initialize <code>std::string</code> with a 4 character
narrow string (<code>"abcd"</code>) we will waste 8 bytes, because we have no way to know
that there is room in our newly allocated buffer that could be used for longer strings
without going again through the allocator. Other allocators waste even
more memory. In 64 bit systems the minimum size (and thus, the waste) is bigger:
32 bytes (24 bytes available for the user) for <b>DLmalloc</b>.
</p>
<p>
The second functionality is the classical use case of realloc in C: instead of
allocating a new buffer, the programmer tries to expand the current one. If the
expansion succeeds, there is no need to copy data to the new buffer (in C++, that means
avoiding potentially costly copy-constructors). The third one is exactly the inverse: a
vector that has grown considerably now holds only a few elements, <code>vector.capacity()</code> is
much bigger than <code>vector.size()</code> and we'd like to return memory
to the system. Many programmers use the <a href="http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Shrink-to-fit">
shrink-to-fit trick</a> shown below to return memory to the system, but this trick needs
a possibly throwing copy, whereas a shrink to fit operation implemented
by the allocator would not need any copying at all.
</p>
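<p>
For reference, here is a minimal sketch of the shrink-to-fit trick mentioned above (using a
plain <code>std::vector<int></code> for illustration): the temporary copy-constructs just
<code>v.size()</code> elements and then swaps buffers with the original vector.
</p>
<pre>
std::vector<int> v;
<cpp_comment>//... v grows a lot, then most elements are erased ...</cpp_comment>

<cpp_comment>//Classic trick: copy into a right-sized temporary and swap buffers.</cpp_comment>
<cpp_comment>//The copy may throw; an allocator-implemented shrink would not copy at all.</cpp_comment>
std::vector<int>(v).swap(v);
</pre>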
<p>
The last functionality is really interesting because a vector usually increases its capacity
using a growth factor (usually 1.5 or 2.0) to avoid too many reallocations. That is, when we insert
new objects in a vector and <code>vector.capacity() == vector.size()</code>, then the
vector allocates much more memory than the size needed to insert a new object. When
allocating new memory to insert a new object, <b>limit size</b> is <code>size() + 1</code>,
<b>preferred size</b> is <code>size()*growth_factor</code> and <b>received size</b> is the real
size of the buffer. When inserting a new element in a vector, it is usually better to
expand the memory to hold the new element than to allocate a new bigger buffer.
</p>
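<p>
A quick worked example (the concrete numbers are illustrative, not taken from any specific
allocator): a vector with <code>size() == capacity() == 100</code> and a growth factor of 1.5
would request a <b>limit size</b> of <code>101</code> (the minimum it can accept) and a
<b>preferred size</b> of <code>150</code>. The allocator might return a buffer whose real size,
reported via <b>received size</b>, is <code>160</code> elements because of its internal chunk
granularity; the vector can then set <code>capacity() == 160</code> instead of wasting those
extra slots.
</p>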
<p>
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html">N2045</a>
makes some additions:
<ol>
<li>
The ability to expand the current buffer not only forward, but also backwards (forward
expansion, backwards expansion or both).
<li>
Offers the possibility to allocate or expand the buffer in a single call to avoid excessive
locking.
<li>
Separates array allocation and node allocation, similarly to <code>operator new</code> and <code>operator
new[]</code>.
<li>
Centralizes all the expansion, shrinking and allocation functionality in a single function.
<li>
Pushes <b>limit size</b> and <b>preferred size</b> concepts also to new allocations and shrink to
fit operations.</li>
</ol>
<P></P>
<p>
The first addition is useful for embedded systems and other size-limited memories, like
fixed-size shared memory. If a vector can only be expanded forward, then it can't reuse
preceding free memory if that preceding memory is not big enough (that is, 1.5 or 2.0 times bigger than
the current buffer). With backwards expansion, this limitation disappears and
backwards expansion can be combined with forward expansion to optimize memory usage.
</p>
<p>
The vector has to copy elements backwards, so backwards expansion needs exactly the
same number of copies as a new allocation. Still, backwards expansion can often be implemented
very efficiently by memory algorithms that have fast access to previous and next memory buffers.
Backwards expansion also has some downsides:
<ol>
<li>
The code handling data copy for a backwards expanded buffer is not trivial (the new buffer
has both already constructed objects and raw memory).
<li>
In order to keep old objects aligned (and thus reusable for being move assigned) the
underlying C function implementing backwards expansion must guarantee
that the distance between the new beginning of the backwards expanded buffer and the old
beginning is a multiple of the size of the objects already constructed in the buffer (and of course,
a multiple of the alignment required by the memory allocation algorithm); see the example after this
list. A buffer holding a type with <code>sizeof(Type) == 12</code> can't be expanded backwards 16 bytes, because the
distance between the old beginning and the new beginning is not a multiple of 12, and thus, the
old objects would be partially overwritten and couldn't be safely move-assigned.</li>
</ol>
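<p>
The example announced above, as a minimal sketch (the function names are mine, not from any
proposal): the smallest valid backwards expansion step is the least common multiple of the object
size and the allocator's alignment, so for <code>sizeof(Type) == 12</code> and 8 byte alignment
the buffer may only grow backwards in steps of 24 bytes.
</p>
<pre>
#include <cstddef>

<cpp_comment>//Greatest common divisor and least common multiple (Euclid's algorithm)</cpp_comment>
std::size_t gcd(std::size_t a, std::size_t b)
{  while(b){ std::size_t t = a % b; a = b; b = t; } return a;  }

std::size_t lcm(std::size_t a, std::size_t b)
{  return a / gcd(a, b) * b;  }

<cpp_comment>//Smallest amount a buffer of objects of size "object_size" can be</cpp_comment>
<cpp_comment>//expanded backwards when the allocator aligns to "alignment" bytes:</cpp_comment>
std::size_t min_backwards_step(std::size_t object_size, std::size_t alignment)
{  return lcm(object_size, alignment);  }

<cpp_comment>//min_backwards_step(12, 8) == 24: a 16 byte step would misalign the old objects</cpp_comment>
</pre>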
<P></P>
<p>
The second addition tries to compact into a single call several attempts to expand or
allocate memory, because usually a vector will first try to expand memory and, if this fails,
it will try to allocate a new buffer. Since checking for expansion
capabilities is quite fast compared to mutex locking and other checks performed by the
allocator, both functions can be combined in a single call.
</p>
<p>
The third addition (separating node and array allocation) is based on the fact that, with
the addition of expansion functions, node allocators based on classic segregated storage
pools won't be able to act as allocators with expansion capabilities. This might not sound
like a big problem for pure-node containers like <code>std::list</code>, but some hybrid
containers like <code>std::unordered_map</code> wouldn't be able to use both segregated
storage (for nodes) and expansion features (for the bucket array).
</p>
<p>
Furthermore, some allocators (see <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2554.pdf">
N2554 The Scoped Allocator Model (Rev 2)</a>) can be propagated through containers, so
having an allocator that can use in-place expansion capabilities and node-pooling
capabilities is very interesting. And this can only be achieved if the allocator uses
different functions for single element and array operations.
</p>
<p>
The last two
additions are just optimizations to reduce locking and to minimize the number of allocation
calls.
</p>
<p>
Hinnant's proposal was implemented in the Freescale (then Metrowerks) standard library
(MSL) with great success. My approach was implemented in <a href="http://www.boost.org/libs/interprocess">
Boost.Interprocess</a>. This article is an evolution of my last proposal, applying
these techniques to a widely used heap allocator.
</p>
<h2><a name="NewFunctionsForExpansion"></a>New functions for expansion capabilities</h2>

As mentioned, <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2045.html">
N2045</a> proposed an advanced new function for allocators, called <code>allocation_command</code>,
that offers the possibility to request buffer expansion (combined or not with allocation if the
expansion fails) and shrinking. This is the
explanation of the function:
<pre>
enum allocation_type
{
   <cpp_comment>//Bitwise OR (|) combinable values</cpp_comment>
   allocate_new        = ...,
   expand_fwd          = ...,
   expand_bwd          = ...,
   shrink_in_place     = ...,
   try_shrink_in_place = ...,
   nothrow_allocation  = ...
};

template<class T>
std::pair<T *, bool> allocation_command
   ( allocation_type command, std::size_t limit_size
   , std::size_t preferred_size, std::size_t &received_size, T *reuse_ptr = 0);
</pre>
<p>
<b>Preconditions for this function</b>:
</p>
<p>
<ul>
<li>
The parameter <code>command</code> must contain at least one of these values: <code>shrink_in_place</code>,
<code>try_shrink_in_place</code>, <code>expand_fwd</code>, <code>expand_bwd</code> or <code>allocate_new</code>.
</li>
<li>
If the parameter <code>command</code> contains the value <code>shrink_in_place</code> it can't
contain any of these values: <code>expand_fwd</code>, <code>expand_bwd</code>,
<code>allocate_new</code>, <code>try_shrink_in_place</code>.
</li>
<li>
If the parameter <code>command</code> contains the value <code>try_shrink_in_place</code> it can't
contain any of these values: <code>expand_fwd</code>, <code>expand_bwd</code>,
<code>allocate_new</code>, <code>shrink_in_place</code>.
</li>
<li>
If the parameter <code>command</code> contains <code>expand_fwd</code> or <code>expand_bwd</code>, the
parameter <code>reuse_ptr</code>
must be non-null and returned by a previous allocation function.
<li>
If the parameter <code>command</code> contains the value <code>shrink_in_place</code>,
the parameter <code>limit_size</code>
must be equal or greater than the parameter <code>preferred_size</code>.
<li>
If the parameter <code>command</code> contains any of these values: <code>expand_fwd</code> or <code>
expand_bwd</code>, the parameter <code>limit_size</code> must be equal or less than the
parameter <code>preferred_size</code>.</li>
</ul>
<P></P>
<p>
<b>Which are the effects of this function</b>:
</p>
<p>
<ul>
<li>
If the parameter <code>command</code> contains the value <code>shrink_in_place</code>
(with optional additional <code>nothrow_allocation</code>), the function will
try to reduce the size of the memory block referenced by pointer <code>reuse_ptr</code> to the
value <code>preferred_size</code> moving only the end of the block. If it's not possible, it
will try to reduce the size of the memory block as much as possible as long as this results in <code>
size(p) <= limit_size</code>. Success is reported only if this results in <code>preferred_size
<= size(p)</code> and <code>size(p) <= limit_size</code>.
</li>
<li>
If the parameter <code>command</code> contains the value <code>try_shrink_in_place</code>
(with optional additional <code>nothrow_allocation</code>), the function will act as if
a <code>shrink_in_place</code> was demanded but it won't shrink the buffer. This function
is useful to know if a shrink operation will succeed with the given parameters and to
obtain in the parameter <code>received_size</code> a value that is guaranteed to succeed
as <code>limit_size</code> on a subsequent shrink in place operation. This function is
useful to know exactly how many objects a caller should destroy with the given
<code>limit_size</code> and <code>preferred_size</code>.
</li>
<li>
If the parameter <code>command</code> only contains the value <code>expand_fwd</code> (with
optional additional <code>nothrow_allocation</code>), the allocator will try to increase the
size of the memory block referenced by pointer <code>reuse_ptr</code> moving only the end of the block to the
value <code>preferred_size</code>. If it's not possible, it will try to increase the size of
the memory block as much as possible as long as this results in <code>size(p) >= limit_size</code>.
Success is reported only if this results in <code>limit_size <= size(p)</code>.
<li>
If the parameter <code>command</code> only contains the value <code>expand_bwd</code> (with
optional additional <code>nothrow_allocation</code>), the allocator will try to increase the
size of the memory block referenced by pointer <code>reuse_ptr</code> only moving the start of
the block to a returned new position <code>new_ptr</code>. If it's not possible, it will try
to move the start of the block as much as possible as long as this results in <code>size(new_ptr)
>= limit_size</code>. Success is reported only if this results in <code>limit_size <=
size(new_ptr)</code>.
<li>
If the parameter <code>command</code> only contains the value <code>allocate_new</code> (with
optional additional <code>nothrow_allocation</code>), the allocator will try to allocate
memory for <code>preferred_size</code> objects. If it's not possible it will try to allocate
memory for at least <code>limit_size</code>
objects.
<li>
If the parameter <code>command</code> only contains a combination of <code>expand_fwd</code> and
<code>allocate_new</code> (with optional additional <code>nothrow_allocation</code>), the
allocator will try first the forward expansion. If this fails, it will try a new
allocation.
<li>
If the parameter <code>command</code> only contains a combination of <code>expand_bwd</code> and
<code>allocate_new</code> (with optional additional <code>nothrow_allocation</code>), the
allocator will try first to obtain <code>preferred_size</code> objects using both methods if
necessary. If this fails, it will try to obtain <code>limit_size</code>
objects using both methods if necessary.
<li>
If the parameter <code>command</code> only contains a combination of <code>expand_fwd</code> and
<code>expand_bwd</code> (with optional additional <code>nothrow_allocation</code>), the
allocator will try first forward expansion. If this fails it will try to obtain <code>preferred_size</code>
objects using backwards expansion or a combination of forward and backwards expansion. If this
fails, it will try to obtain <code>limit_size</code>
objects using both methods if necessary.
<li>
If the parameter <code>command</code> only contains a combination of <code>allocate_new</code>, <code>expand_fwd</code>
and <code>expand_bwd</code> (with optional additional <code>nothrow_allocation</code>), the
allocator will try first forward expansion. If this fails it will try to obtain <code>preferred_size</code>
objects using new allocation, backwards expansion or a combination of forward and backwards
expansion. If this fails, it will try to obtain <code>limit_size</code>
objects using the same methods.
<li>
The allocator always writes the size of the expanded/allocated/shrunk memory block in <code>received_size</code>.
On failure the allocator writes in <code>received_size</code> a possibly successful <code>limit_size</code>
parameter for a new call.</li>
</ul>
<P></P>
<p>
<b>Throws an exception if two conditions are met</b>:
<ul>
<li>
The allocator is unable to allocate/expand/shrink the memory or there is an error in
preconditions.
<li>
The parameter <code>command</code> does not contain <code>nothrow_allocation</code>.</li>
</ul>
<P></P>
<p>
<b>This function returns</b>:
<ul>
<li>
The address of the allocated memory or the new address of the expanded memory as the first
member of the pair. If the parameter <code>command</code> contains <code>nothrow_allocation</code>,
the first member will be 0 if the allocation/expansion fails or there is an error in
preconditions.
<li>
The second member of the pair will be false if the memory has been allocated, true if the
memory has been expanded. If the first member is 0, the second member has an undefined value.</li>
</ul>
<P></P>
<p>
<b>Notes</b>:
<ul>
<li>
If the user chooses <code>char</code>
as template argument the returned buffer will be suitably aligned to hold any type.
<li>
If the user chooses <code>char</code> as template argument and a backwards expansion is
performed, although properly aligned, the returned buffer might not be suitable because the
distance between the new beginning and the old beginning might not be a multiple of the type the
user wants to construct, since due to internal restrictions the expansion can be slightly
bigger than the requested bytes. <b>When performing backwards expansion, if you have already
constructed objects in the old buffer, it's important to correctly specify the type</b>.</li>
</ul>
</p>
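<p>
To make the intended use concrete, here is a minimal sketch (my own illustration, not code from
any proposal) of how a vector-like container could grow using <code>allocation_command</code>,
trying forward expansion first and falling back to a new allocation in the same call. It assumes
<code>a</code> is such an allocator, <code>size</code> is the current element count and
<code>old_buffer</code> is the current buffer:
</p>
<pre>
std::size_t received_size = 0;
std::pair<int*, bool> ret = a.allocation_command
   ( expand_fwd | allocate_new   <cpp_comment>//try in-place first, else allocate</cpp_comment>
   , size + 1                    <cpp_comment>//limit size: the minimum we can accept</cpp_comment>
   , size + size/2               <cpp_comment>//preferred size: growth factor 1.5</cpp_comment>
   , received_size               <cpp_comment>//real size of the obtained buffer</cpp_comment>
   , old_buffer);                <cpp_comment>//buffer to be expanded, if possible</cpp_comment>

if(ret.second){
   <cpp_comment>//Expanded in place: old objects are still valid, just construct the new one</cpp_comment>
}
else{
   <cpp_comment>//New buffer at ret.first: move/copy old objects, then deallocate old_buffer</cpp_comment>
}
<cpp_comment>//In both cases the new capacity is received_size</cpp_comment>
</pre>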
<p>
Obviously, we need an underlying C function that offers these possibilities.
I've added the following function to the modified dlmalloc library:
<pre>
typedef struct boost_cont_command_ret_impl
{
   void *first;
   int second;
}boost_cont_command_ret_t;

boost_cont_command_ret_t boost_cont_allocation_command
   ( allocation_type command , size_t sizeof_object
   , size_t limit_objects , size_t preferred_objects
   , size_t *received_objects , void *reuse_ptr );
</pre>
</p>
<p>
This function does the same job as <code>allocation_command</code>, so it takes the
same options plus an additional parameter indicating the size of the object (especially
important for backwards expansion). This function does not throw exceptions because C
has no exceptions but, other than these two issues, its use is exactly the same
as <code>allocation_command</code>.
</p>
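<p>
A minimal sketch of the C-level call (again my own illustration; it assumes the
<code>allocation_type</code> values shown earlier are visible to C code, and that, since this
function cannot throw, failure is reported with a null <code>first</code> member, mirroring
the <code>nothrow_allocation</code> behaviour described above):
</p>
<pre>
size_t received_objects = 0;
boost_cont_command_ret_t ret = boost_cont_allocation_command
   ( expand_fwd | allocate_new  <cpp_comment>/* command */</cpp_comment>
   , sizeof(int)                <cpp_comment>/* sizeof_object */</cpp_comment>
   , 101                        <cpp_comment>/* limit_objects */</cpp_comment>
   , 150                        <cpp_comment>/* preferred_objects */</cpp_comment>
   , &received_objects
   , old_buffer );              <cpp_comment>/* buffer to expand, if possible */</cpp_comment>

if(ret.first != 0){
   <cpp_comment>/* ret.second != 0 -> expanded in place; otherwise a new buffer */</cpp_comment>
}
</pre>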
<h2><a name="AConfigurableAllocator"></a>A configurable allocator for testing</h2>

<p>
To compare apples to apples, we can't just compare the standard allocator against an
allocator based on <b>DLmalloc</b> with expansion capabilities. For this purpose we develop a configurable
allocator (called, surprisingly, <code>ba::allocator</code>). Here's the declaration:
</p>
<pre>
namespace boost {
namespace container {

template<class T, unsigned Version = 2, unsigned int AllocationDisableMask = 0>
class allocator;

} <cpp_comment>//namespace container {</cpp_comment>
} <cpp_comment>//namespace boost {</cpp_comment>
</pre>
<p>
The first template parameter is the type of the objects to be allocated. The second is
the version of the allocator (see <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1953.html">
N1953</a> for details). If Version is 1, then the allocator is a standard conforming
classic allocator around <code>dlmalloc</code> and <code>dlfree</code> with no
additional features. If Version is 2, then the allocator has expansion capabilities and
other features like specifying limit and preferred sizes and obtaining the real size of
an allocated buffer, separated node and array allocations, etc...
</p>
<p>
The third parameter is a mask that can be used to disable some capabilities. For
example, specifying <code>expand_bwd | expand_fwd</code> will disable forward
and backwards expansion and the allocator will only provide new buffer allocations.
</p>
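<p>
For example, the allocator configurations used in the tests below could be spelled like this
(a sketch; the <code>ba</code> namespace alias is assumed to refer to
<code>boost::container</code>, and the exact scoping of the <code>allocation_type</code>
enumerators is an assumption):
</p>
<pre>
namespace ba = boost::container;

typedef ba::allocator<int, 1> AllocatorPlusV1;     <cpp_comment>//plain dlmalloc/dlfree wrapper</cpp_comment>
typedef ba::allocator<int, 2,
   ba::expand_bwd | ba::expand_fwd> AllocatorPlusV2Mask; <cpp_comment>//V2, expansion disabled</cpp_comment>
typedef ba::allocator<int, 2> AllocatorPlusV2;     <cpp_comment>//fully featured</cpp_comment>
</pre>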
<h2><a name="TestingForward"></a>Testing forward expansion in <code>vector</code></h2>

<p>
Let's first present the class that will be used in our performance tests. It's a simple
wrapper over an <code>int</code>, with a copy constructor that performs work
equivalent to most move-constructors in C++0x. Of course performance gains will be
bigger for copyable but non-moveable heavy objects, but I think this class is fair
enough.
</p>
<pre>
class MyInt
{
   int int_;

   public:
   MyInt(int i = 0) : int_(i){}

   MyInt(const MyInt &other)
      : int_(other.int_)
   {}

   MyInt & operator=(const MyInt &other)
   {
      int_ = other.int_;
      return *this;
   }
};
</pre>
<p>
Now we'll push back elements one by one into a vector, and measure the time using
<a href="http://www.boost.org/libs/date_time">Boost.DateTime</a>:
</p>
<pre>
unsigned int numalloc = 0, numexpand = 0, capacity = 0;

ptime tini = microsec_clock::universal_time();

for(unsigned int r = 0; r != num_iterations; ++r){
   bi::vector<MyInt, IntAllocator> v;
   v.reset_alloc_stats(); <cpp_comment>//Reset allocation statistics</cpp_comment>
   for(unsigned int e = 0; e != num_elements; ++e)
      v.push_back(e);
   numalloc += v.num_alloc; numexpand += v.num_expand_fwd;
   capacity = static_cast<unsigned int>(v.capacity());
}

ptime tend = microsec_clock::universal_time();
<cpp_comment>//...</cpp_comment>
</pre>
<p>
We'll test 4 different allocators:
<ol>
<li><b>StdAllocator</b>: equivalent to <code>std::allocator<int></code>, the standard
allocator that comes with the C++ compiler</li>
<li><b>AllocatorPlusV1</b>: equivalent to <code>ba::allocator<int, 1></code>, an STL
conformant allocator around <code>dlmalloc</code> and <code>dlfree</code>.</li>
<li><b>AllocatorPlusV2Mask</b>: equivalent to <code>ba::allocator<int, 2, expand_bwd |
expand_fwd></code>, an allocator with capabilities to specify a limit size and a
preferred size and to obtain the real size of a buffer, but with expansion capabilities disabled.</li>
<li><b>AllocatorPlusV2</b>: equivalent to <code>ba::allocator<int, 2></code>, a fully
equipped improved allocator. (Note: there is no need to disable backwards expansion because
forward expansion always has preference over new allocations and backwards expansion.)</li>
</ol>
</p>
<p>
and we'll use several <code>[num_iterations, num_elements]</code> combinations: <code>[10000,
10000], [100000, 1000], [1000000, 100] and [10000000, 10]</code>
</p>
<p>
For each allocator we'll show the following information:
<ul>
<li><code>push_back</code> speed (normalized to AllocatorPlusV1)</li>
<li>The number of calls to the allocator per iteration</li>
<li>The number of new allocations per iteration</li>
<li>The number of expansions per iteration</li>
<li>The final capacity of the vector</li>
</ul>
Now let's see the results of the test in a Windows XP machine (AMD 1.47
GHz) using Visual C++ 2003:
</p>
<img src="fwd_expansion_msvc-7.1_1.jpg"></img>
<img src="fwd_expansion_msvc-7.1_2.jpg"></img>
<img src="fwd_expansion_msvc-7.1_3.jpg"></img>
<p>
As we see in the picture, <b>DLmalloc</b> is faster than the standard allocator
provided by MSVC 7.1 (StdAllocator vs. AllocatorPlusV1). We also see that
V2 allocators make fewer calls to allocator functions (for size=10, 4 times vs. 6 times),
because using the <i>received_size</i> parameter of <code>allocation_command</code> the
container can take advantage of the real size of the newly allocated small buffer.
</p>

<p>
For big sizes (size=10000), thanks to the expansion capabilities of AllocatorPlusV2,
it is twice as fast as AllocatorPlusV2Mask and 2.5 times faster than the standard
conforming, <b>DLmalloc</b> based AllocatorPlusV1. As we can see in the third graph, only
AllocatorPlusV2 performs expand in place, whereas the rest only allocate new buffers.
</p>

<p>
Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using the gcc-4.1 compiler
and will check <code>push_back</code> times:
</p>

<img src="fwd_expansion_linux-gcc-4.1.jpg"></img>

<p>
We see that the standard allocator is faster than our V1 (a standard conforming <b>DLmalloc</b> wrapper)
by a small margin for some sizes and slightly slower for bigger sizes.
That's because glibc malloc is based on ptmalloc2, which is in turn
based on <b>DLmalloc</b>, but uses per-thread pools, minimizing mutex locks.
</p>

<p>
Like in Windows, the expand-in-place capable allocator is twice as fast as the V1 allocator,
confirming that forward expansion offers a considerable speed up. This speed up will be
noticeable in initialization phases when vectors are filled with many back insertions. The
speed-up will be more noticeable if vector elements have no move constructors
(e.g. third-party classes that can't be updated with move semantics).
</p>
<h2><a name="TestingBackwards"></a>Testing backwards expansion in <code>vector</code></h2>

<p>
Historically, backwards expansion has not been widely used as an allocation strategy, basically
because it does not offer the speed improvements offered by forward expansion for
usual vector implementations. Similarly to new allocation, we need to move old objects
(an implementation might avoid data-moving if it manages uninitialized data in the beginning of
the memory buffer, but that's not usual). Backwards expansion has the ability to reduce the peak
memory used by the vector (while allocating new memory, the vector has to maintain the old
buffer until it copies data to the new one), effectively reusing free memory placed just before the
buffer to be expanded.
</p>

<p>
As we've said before, the backwards expansion algorithm has to calculate the LCM (least common multiple)
of the memory allocation alignment (8 bytes in <b>DLmalloc</b>) and the size of the object.
For very efficient algorithms like <b>DLmalloc</b>, I've found that calculating the LCM using the
traditional <a href="http://en.wikipedia.org/wiki/Euclidean_algorithm">Euclidean</a> algorithm was a big
overhead that made backwards expansion considerably slower than new
allocation. It's necessary to take advantage of the power of 2 memory alignment and implement
optimizations for object sizes that are also powers of 2 or multiples of 2, 4, or 8 bytes, and use
optimized versions to calculate the LCM. Fortunately,
these optimizations made backwards expansion slightly faster than a new allocation, making
backwards expansion better than new allocation in every aspect.
</p>
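<p>
A minimal sketch of the kind of optimization meant above (my own illustration, not the actual
<b>DLmalloc</b> patch): when the alignment is a power of 2, the GCD of the object size and the
alignment is just the largest power of 2 dividing the object size, capped by the alignment, so
no Euclidean loop is needed.
</p>
<pre>
#include <cstddef>

<cpp_comment>//GCD of "size" with a power-of-2 "alignment", without Euclid's loop:</cpp_comment>
<cpp_comment>//(size & -size) isolates the largest power of 2 that divides size</cpp_comment>
std::size_t gcd_with_pow2(std::size_t size, std::size_t alignment)
{
   std::size_t pow2_factor = size & (std::size_t(0) - size);
   return pow2_factor < alignment ? pow2_factor : alignment;
}

<cpp_comment>//LCM of the object size and a power-of-2 alignment in a few cheap operations</cpp_comment>
std::size_t lcm_with_pow2(std::size_t size, std::size_t alignment)
{
   return size / gcd_with_pow2(size, alignment) * alignment;
}

<cpp_comment>//lcm_with_pow2(12, 8) == 24, lcm_with_pow2(16, 8) == 16</cpp_comment>
</pre>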
<p>
To test backwards expansion we'll execute the same <code>push_back</code> test
using the following 4 allocators:

<ol>
<li>
<b>StdAllocator</b>: equivalent to <code>std::allocator<int></code>, the standard
allocator that comes with the C++ compiler
<li>
<b>AllocatorPlusV1</b>: equivalent to <code>ba::allocator<int, 1></code>, an STL
conformant allocator around <code>dlmalloc</code> and <code>dlfree</code>.
<li>
<b>AllocatorPlusV2Mask</b>: equivalent to <code>ba::allocator<int, 2, expand_bwd |
expand_fwd></code>, an allocator with capabilities to specify a limit size and a
preferred size and to obtain the real size of a buffer, but with expansion capabilities disabled.
<li>
<b>AllocatorPlusV2</b>: equivalent to <code>ba::allocator<int, 2, expand_fwd></code>,
a fully equipped improved allocator that disables forward expansion (forward expansion
has priority over backwards expansion, so we need to disable it to measure only backwards
expansion).</li>
</ol>

Now let's see the results of the test in a Windows XP machine (AMD 1.47
GHz) using Visual C++ 2003:
</p>

<img src="bwd_expansion_msvc-7.1.jpg"></img>

<p>
And now in a Linux (Suse 10.1) virtual machine (VMWare 6) using the gcc-4.1 compiler:
</p>

<img src="bwd_expansion_linux-gcc-4.1.jpg"></img>
<p>
We see that for most cases backwards expansion (AllocatorPlusV2) is slightly faster than new
allocation (AllocatorPlusV1) and new allocation with <code>received_size</code> capabilities
(AllocatorPlusV2Mask).
</p>

<h2><a name="TestingShrinkToFit"></a>Testing shrink to fit in <code>vector</code></h2>

<p>
<a href="http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Shrink-to-fit">
Shrink-to-fit</a> is a widely used idiom that can be improved with the <code>shrink_in_place</code>
option provided by the <code>allocation_command</code> method. To test this function we'll use the following test:
</p>
<pre>
const std::size_t Step = 5;

ptime tini = microsec_clock::universal_time();

typedef boost::interprocess::detail::integral_constant
   <unsigned, boost::interprocess::detail::version<Allocator>::value> alloc_version;

for(unsigned int r = 0; r != num_iterations; ++r){
   <cpp_comment>//Create a vector with num_elements size</cpp_comment>
   bi::vector<MyInt, IntAllocator> v(num_elements);
   v.reset_alloc_stats();
   <cpp_comment>//Empty the vector erasing the last Step elements and calling shrink_to_fit()</cpp_comment>
   for(unsigned int e = num_elements; e != 0; e -= Step){
      v.erase(v.end() - Step, v.end());
      v.shrink_to_fit();
   }
}

ptime tend = microsec_clock::universal_time();
</pre>

<p>
That is, we'll initialize the vector with an initial size, erase the last Step elements (5 in this case),
and shrink the capacity of the vector to fit. We'll repeat this operation for different
<code>[num_iterations, num_elements]</code> combinations: <code>[100,
10000], [1000, 200] and [10000, 500]</code>
</p>
<p>
To test shrink to fit we'll execute the test using the following 3 allocators:

<ol>
<li>
<b>StdAllocator</b>: equivalent to <code>std::allocator<int></code>, the standard
allocator that comes with the C++ compiler
<li>
<b>AllocatorPlusV1</b>: equivalent to <code>ba::allocator<int, 1></code>, an STL
conformant allocator around <code>dlmalloc</code> and <code>dlfree</code>.
<li>
<b>AllocatorPlusV2</b>: equivalent to <code>ba::allocator<int, 2, expand_fwd></code>,
a fully equipped improved allocator with shrink to fit capability.</li>
</ol>

Now let's see the results of the test in a Windows XP machine (AMD 1.47
GHz) using Visual C++ 2003:
</p>

<img src="shrink_to_fit_msvc-7.1_1.jpg"></img>
<img src="shrink_to_fit_msvc-7.1_2.jpg"></img>
<p>
As the initial vector size grows, <code>AllocatorPlusV2</code> shines (up to 90 times faster!)
because a shrink to fit operation is a <b>no-throw</b>
operation that can have <b>amortized constant-time</b> complexity, due to the fact that an allocator usually implements
shrink to fit with a deallocation: the allocator divides the old buffer into two independent buffers and
deallocates the second buffer. On the other hand, vectors with allocators with no shrink to fit functionality
need to allocate a new buffer, move elements to the new one (moving elements is an operation linear
in the size of the vector) and deallocate the old buffer.
</p>
<p>
Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using the gcc-4.1 compiler
and will check <code>shrink_to_fit</code> times:
</p>

<img src="shrink_to_fit_linux-gcc-4.1_1.jpg"></img>

<p>
Results are similar to the Windows results, with <code>AllocatorPlusV2</code> shining more and more as the initial vector
size grows.
</p>
<h1><a name="Chapter2"></a>Chapter 2: Burst allocation for node containers</h1>

<h2><a name="BurstAllocationIntro"></a>Burst allocation and deallocation introduction</h2>

<p>
Many times, operations can be performed faster if they are packed in groups, because some
costs are not related to the number of operations, but to the number of groups. An example
of such operations is the <code>template<class Iterator> iterator insert(Iterator begin, Iterator end)</code>
operation in vectors. If <code>Iterator</code> is a forward iterator, most implementations
calculate the distance between <code>begin</code> and <code>end</code> so that they know
the size of the buffer needed to hold all those elements. With this information, they only
perform one allocation to insert the whole range.
</p>

<p>
Other containers can't take advantage of this possibility because they are <b>node</b>
containers, meaning that each element is obtained with a call to the allocator that
allocates a single element. Since each element has its own independent memory, these
containers offer stronger iterator validity guarantees, and pointers and references
to inserted objects are not easily invalidated. On the other hand, inserting N elements in
those containers requires N calls to the allocator.
</p>
<p>
To solve this, the idea is to have a function that can allocate many objects in a single
call, guaranteeing that <b>each object can be individually deallocated</b>. Some memory management
algorithms offer this possibility, even with the possibility to specify a different buffer size
for each operation in a group. For example, <b>DLmalloc</b> offers the following two functions:

<pre>
void** independent_calloc(size_t n_elements, size_t element_size, void* chunks[]);
</pre>

<p>
<b>Explanation</b>: <code>independent_calloc</code> is similar to calloc, but instead of returning a
single cleared space, it returns an array of pointers to n_elements
independent elements that can hold contents of size elem_size, each
of which starts out cleared, and can be independently freed,
realloc'ed etc. The elements are guaranteed to be adjacently
allocated (this is not guaranteed to occur with multiple callocs or
mallocs), which may also improve cache locality in some
applications. [...]
</p>

<pre>
void** independent_comalloc(size_t n_elements, size_t sizes[], void* chunks[]);
</pre>

<p>
<b>Explanation</b>: <code>independent_comalloc</code> allocates, all at once, a set of n_elements
chunks with sizes indicated in the "sizes" array. It returns an array of pointers to these elements,
each of which can be independently freed, realloc'ed etc. The elements are guaranteed to be
adjacently allocated (this is not guaranteed to occur with multiple callocs or mallocs),
which may also improve cache locality in some applications. [...]
</p>

<p>
For these two functions each element must be individually freed when it is no longer
needed and objects are guaranteed to be contiguously allocated (to help cache locality).
Also, the "chunks" argument is optional (i.e., may be null, which is
probably the most typical usage). If it is null, the returned array
is itself dynamically allocated and should also be freed when it is
no longer needed. Otherwise, the chunks array must be of at least
n_elements in length. It is filled in with the pointers to the
chunks.
</p>
<p>
According to the <b>DLmalloc</b> documentation, <code>independent_calloc</code>
simplifies and speeds up implementations of many
kinds of pools. It may also be useful when constructing large data
structures that initially have a fixed number of fixed-sized nodes,
but the number is not known at compile time, and some of the nodes
may later need to be freed:
</p>

<pre>
struct Node { int item; struct Node* next; };

struct Node* build_list() {
   struct Node** pool;
   int i, n = read_number_of_nodes_needed();
   if (n <= 0) return 0;
   pool = (struct Node**) independent_calloc(n, sizeof(struct Node), 0);
   if (pool == 0) die();
   <cpp_comment>// organize into a linked list...</cpp_comment>
   struct Node* first = pool[0];
   for (i = 0; i < n-1; ++i)
      pool[i]->next = pool[i+1];
   free(pool); <cpp_comment>// Can now free the array (or not, if it is needed later)</cpp_comment>
   return first;
}
</pre>
</p>
<p>
According to the <b>DLmalloc</b> documentation, <code>independent_comalloc</code>
can be used to speed up allocation in cases where several structs or
objects must always be allocated at the same time:

<pre>
struct Head { ... };
struct Foot { ... };

void send_message(char* msg) {
   int msglen = strlen(msg);
   size_t sizes[3] = { sizeof(struct Head), msglen, sizeof(struct Foot) };
   void* chunks[3];
   if (independent_comalloc(3, sizes, chunks) == 0)
      die();
   struct Head* head = (struct Head*)(chunks[0]);
   char* body = (char*)(chunks[1]);
   struct Foot* foot = (struct Foot*)(chunks[2]);
   <cpp_comment>// ...</cpp_comment>
}
</pre>
</p>
<p>
Unfortunately, these two functions have some downsides when working with some containers:

<ul>
<li>Addresses of allocated nodes are passed in an array that is externally
provided or allocated by the function (so it must be deallocated after pointers to those
chunks have been used). This is quite rigid (a <code>list</code> is forced to allocate an
array to hold pointers to these nodes) and imposes an unnecessary overhead on node containers.</li>
<li>The guarantee that the memory will be contiguous is not necessary (although it might help)
for node containers and can require the allocation of quite big memory blocks. It's desirable to allocate
several contiguous chunks (say, a memory page) but not to force all elements to be
contiguous in memory.</li>
</ul>
</p>
<h2><a name="BurstAllocationInfrastructure"></a>Burst allocation infrastructure</h2>

<p>
The first issue (the required external or internally allocated array) can be avoided if every memory node is linked
into an intrusive singly linked list offering an iterator-like interface. The intrusive list is built in the first bytes
reserved for the user data, because the minimum node size guarantees 12 bytes on 32 bit systems and 24 bytes on 64 bit systems.
Obviously, once the user has overwritten that memory, there is no way to iterate to the next node, so the programmer must be careful. Let's
present these new functions:

<pre>
<cpp_comment>/* Iterator functions */</cpp_comment>
typedef /**/ multialloc_it_t;   <cpp_comment>/* Iterator type */</cpp_comment>
BOOST_CONTAINER_INIT_END_IT(IT) <cpp_comment>/* Postcondition: BOOST_CONTAINER_IS_END_IT(IT) != 0 */</cpp_comment>
BOOST_CONTAINER_IS_END_IT(IT)   <cpp_comment>/* Is an end iterator? */</cpp_comment>
BOOST_CONTAINER_NEXT_IT(IT)     <cpp_comment>/* operator ++ emulation */</cpp_comment>
BOOST_CONTAINER_ADDR_IT(IT)     <cpp_comment>/* operator * emulation, returns the address of the memory */</cpp_comment>
</pre>

These functions offer an emulation of a C++ input iterator.

<pre>
<cpp_comment>/* Multiallocation chain functions */</cpp_comment>
typedef /**/ multialloc_it_chain_t;                 <cpp_comment>/* Chain type */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_INIT(CHAIN)                <cpp_comment>/* Postcondition: BOOST_CONTAINER_IT_CHAIN_IT(IT_CHAIN) is an end iterator */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_PUSH_BACK(CHAIN, MEM)      <cpp_comment>/* Push back the node MEM in the intrusive linked list chain */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_PUSH_FRONT(CHAIN, MEM)     <cpp_comment>/* Push front the node MEM in the intrusive linked list chain */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_SPLICE_BACK(CHAIN, CHAIN2) <cpp_comment>/* Splice chain2 at the end of the first chain */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_IT(IT_CHAIN)               <cpp_comment>/* Return an input iterator that traverses the chain */</cpp_comment>
BOOST_CONTAINER_IT_CHAIN_SIZE(IT_CHAIN)             <cpp_comment>/* Returns the size of the chain */</cpp_comment>
</pre>

These functions offer the possibility of building an intrusive chain (list) of nodes previously allocated by <b>DLmalloc</b>
and obtaining an iterator that can traverse them. These functions are used to build a range that can be deleted
in a single call using <code>boost_cont_multidealloc()</code> (see below).

<pre>
<cpp_comment>/* Some defines for "contiguous_elements" parameters */</cpp_comment>
DL_MULTIALLOC_ALL_CONTIGUOUS     <cpp_comment>/* Allocate all elements contiguous in memory */</cpp_comment>
DL_MULTIALLOC_DEFAULT_CONTIGUOUS <cpp_comment>/* The implementation chooses the maximum contiguous elements */</cpp_comment>
<cpp_comment>
/* Burst allocation:
   Allocates "n_elements" nodes of size "elem_size" with a maximum (if possible) of "contiguous_elements".
   On success returns an input iterator. On failure returns an "end" iterator. */</cpp_comment>
multialloc_it_t boost_cont_multialloc_nodes (size_t n_elements, size_t elem_size, size_t contiguous_elements);
<cpp_comment>
/* Burst allocation:
   Allocates "n_arrays" of elements, whose size is "sizeof_element" and whose lengths are specified in the "lengths"
   parameter. Contiguous memory won't be (if possible) bigger than "contiguous_elements"*sizeof_element.
   On success returns an input iterator. On failure returns an "end" iterator. */</cpp_comment>
multialloc_it_t boost_cont_multialloc_arrays (size_t n_arrays, const size_t *lengths, size_t sizeof_element, size_t contiguous_elements);
<cpp_comment>
/* Burst deallocation: deallocates all the range specified by the iterator */</cpp_comment>
void boost_cont_multidealloc(multialloc_it_t it);
</pre>
</p>
<p>
<code>boost_cont_multialloc_nodes</code> is a function that allocates multiple nodes of the same size in a single call, returning
an iterator to the range. <code>boost_cont_multialloc_arrays</code> is a function that allocates multiple arrays in a single call, returning
an iterator to the range. The user can specify the (suggested) number of contiguous elements in both functions.
<code>boost_cont_multidealloc</code> deallocates a range of arrays or nodes that have been allocated
by <code>boost_cont_multialloc_nodes</code>, <code>boost_cont_multialloc_arrays</code> or allocated by other <b>DLmalloc</b> functions and
chained using the <code>BOOST_CONTAINER_IT_CHAIN_XXX</code> functions.
</p>
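<p>
A minimal usage sketch of this C-level interface (my own illustration; it assumes
<code>BOOST_CONTAINER_NEXT_IT</code> advances the iterator in place, in the spirit of
<code>operator++</code>, and reuses the <code>Node</code> struct from the earlier example):
</p>
<pre>
<cpp_comment>/* Burst-allocate 100 nodes, letting the implementation choose contiguity */</cpp_comment>
multialloc_it_t it = boost_cont_multialloc_nodes
   (100, sizeof(struct Node), DL_MULTIALLOC_DEFAULT_CONTIGUOUS);
if(BOOST_CONTAINER_IS_END_IT(it))
   die();  <cpp_comment>/* allocation failed */</cpp_comment>

while(!BOOST_CONTAINER_IS_END_IT(it)){
   void *mem = BOOST_CONTAINER_ADDR_IT(it);
   BOOST_CONTAINER_NEXT_IT(it); <cpp_comment>/* advance first: the link lives in the node's first bytes */</cpp_comment>
   <cpp_comment>/* ... construct/use the node in "mem"; it can later be freed individually ... */</cpp_comment>
}
</pre>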
<h2><a name="BurstInterfaceForAllocators"></a>Burst interface for allocators</h2>

<p>
Now that we have a low-level swiss army knife, we can define a C++ interface for burst allocation:

<pre>
template<class T>
class allocator
{
   <cpp_comment>//...</cpp_comment>
   typedef /**/ multiallocation_iterator; <cpp_comment>//wrapper over multialloc_it_t</cpp_comment>
   typedef /**/ multiallocation_chain;    <cpp_comment>//wrapper over multialloc_it_chain_t</cpp_comment>
   <cpp_comment>
   //------------------------------------------------------------------
   // Functions for array allocation (expandable memory)
   //------------------------------------------------------------------

   //Allocates many elements of size elem_size in a contiguous block
   //of memory. Elements must be individually deallocated with deallocate()</cpp_comment>
   multiallocation_iterator allocate_many(size_type elem_size, size_type n_elements);
   <cpp_comment>
   //Allocates n_elements elements, each one of size elem_sizes[i]
   //Elements must be individually deallocated with deallocate()</cpp_comment>
   multiallocation_iterator allocate_many(const size_type *elem_sizes, size_type n_elements);
   <cpp_comment>
   //Deallocates the memory arrays pointed to by the iterator range starting at it</cpp_comment>
   void deallocate_many(multiallocation_iterator it);
   <cpp_comment>
   //------------------------------------------------------------------
   //Functions for node (size == 1, non-expandable memory) allocation
   //------------------------------------------------------------------

   //Allocates many elements of size == 1.
   //Elements must be individually deallocated with deallocate_one()</cpp_comment>
   multiallocation_iterator allocate_individual(size_type num_elements);
   <cpp_comment>
   //Deallocates the memory nodes pointed to by the iterator range starting at it</cpp_comment>
   void deallocate_individual(multiallocation_iterator it);
};
</pre>

Burst allocation also follows the separation between node (size == 1) and array allocation functions,
so that node allocation can use segregated storage mechanisms, whereas array allocation can use
another approach that can be combined with the expansion functions (<code>allocation_command</code>).
</p>
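<p>
As a sketch of how a node container could use this interface (my own illustration; it assumes
a value-initialized <code>multiallocation_iterator</code> is an end iterator, mirroring the C
interface above, and <code>list_node</code> stands in for the container's internal node type):
</p>
<pre>
typedef boost::container::allocator<list_node> NodeAllocator;

NodeAllocator a;
NodeAllocator::multiallocation_iterator it = a.allocate_individual(n), end_it;

while(it != end_it){
   void *mem = &*it;
   ++it; <cpp_comment>//advance before constructing: the intrusive link lives inside the node</cpp_comment>
   <cpp_comment>//construct the node in "mem" and link it into the container...</cpp_comment>
}
</pre>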
<p>
We now have all that we need to implement burst allocation and deallocation for STL containers. Let's measure
whether this leads to any speed improvement.
</p>
<h2><a name="TestingBurstAllocation"></a>Testing burst allocation</h2>

<p>
A good place to take advantage of burst allocation is range insertion. Other operations (range assignment,
copy construction) are also good places to implement burst allocation. So let's try this code with several
allocators to test insertion time:
</p>

<pre>
tini = microsec_clock::universal_time();

bi::list<MyInt, IntAllocator> l;
for(unsigned int r = 0; r != num_iterations; ++r){
   l.insert(l.end(), num_elements, MyInt(r));
}

tend = microsec_clock::universal_time();
</pre>
<p>
That is, we'll try insertions of different range sizes (<code>num_elements</code>) and different
iteration counts (<code>num_iterations</code>). We'll end up with the same number of elements in the
list, so we can compare results more easily. To measure burst deallocation we'll preprocess several
iterator ranges and measure the time to completely empty the list:
</p>

<pre>
<cpp_comment>//Now preprocess ranges to erase</cpp_comment>
std::vector<typename bi::list<MyInt, IntAllocator>::iterator> ranges_to_erase;
ranges_to_erase.push_back(l.begin());

for(unsigned int r = 0; r != num_iterations; ++r){
   typename bi::list<MyInt, IntAllocator>::iterator next_pos(ranges_to_erase[r]);
   std::size_t n = num_elements;
   while(n--){ ++next_pos; }
   ranges_to_erase.push_back(next_pos);
}

<cpp_comment>//Measure range erasure function</cpp_comment>
tini = microsec_clock::universal_time();

for(unsigned int r = 0; r != num_iterations; ++r){
   l.erase(ranges_to_erase[r], ranges_to_erase[r+1]);
}

tend = microsec_clock::universal_time();
</pre>

<p>
As shown, a good place to take advantage of burst deallocation is range erasure, along with <code>clear()</code> and destructors.
</p>
<p>
We'll test 3 different allocators:
<ol>
<li><b>StdAllocator</b>: equivalent to <code>std::allocator<int></code>, the standard
allocator that comes with the C++ compiler</li>
<li><b>AllocatorPlusV1</b>: equivalent to <code>ba::allocator<int, 1></code>, an STL
conformant allocator around <code>dlmalloc</code> and <code>dlfree</code> (no burst allocation).</li>
<li><b>AllocatorPlusV2</b>: equivalent to <code>ba::allocator<int, 2></code>, a burst-allocation capable
allocator.</li>
</ol>
and we'll use these test values for <code>[num_iterations, num_elements]</code> combinations: <code>[200,
10000], [2000, 1000], [20000, 100] and [200000, 10]</code>
</p>

<p>
For each allocator we'll show the following information:
<ul>
<li>Insertion speed per element (normalized to AllocatorPlusV1 speed)</li>
<li>Erasure speed per element (normalized to AllocatorPlusV1 speed)</li>
</ul>
Now let's see the results of the test in a Windows XP machine (AMD 1.47
GHz) using Visual C++ 2003:
</p>
<img src="burst_allocation_msvc-7.1_1.jpg"></img>
<p>
The graph shows that allocation speed per element is constant for AllocatorPlusV1 (surprisingly, not so constant for
Visual's standard allocator) and this is pretty logical, since each element needs an allocation. AllocatorPlusV2 is
using <code>DL_MULTIALLOC_DEFAULT_CONTIGUOUS</code>, which allocates contiguous nodes up to 4096 bytes. That's why
AllocatorPlusV2 performance stabilizes for big ranges, reaching a 3x speedup.
</p>

<img src="burst_allocation_msvc-7.1_2.jpg"></img>

<p>
Burst deallocation improvements are more modest. Why?

<ul>
<li>Each burst-allocated node is individually
deallocatable and <code>list</code> must construct a node chain (iterating through all nodes) before calling
<code>deallocate_individual</code>. That is, we must traverse nodes twice (once to construct the chain and
once again internally in the C function).</li>
<li>Burst allocation takes advantage of locality, allocating big blocks of memory (in a single heap memory search)
and creating chunks from this block. Burst deallocation must be able to deallocate nodes that can be far apart, so
there is no chance to replace N deallocations with a single, big deallocation.</li>
</ul>
</p>

<p>
This means that burst deallocation is just a loop of plain deallocations, amortizing costs like mutex locks, checks,
function calls and returns and other common operations. Nevertheless, the test shows a 30% speedup compared to
node-by-node deallocation with quite big ranges (>1000) and still noticeable speedups (20%) with small ranges (>=10).
</p>

<p>
Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1:
</p>
<img src="burst_allocation_linux-gcc-4.1_1.jpg"></img>
|
|
|
|
<img src="burst_allocation_linux-gcc-4.1_2.jpg"></img>
|
|
|
|
<p>
In Linux, the standard allocator (ptmalloc2, based on DLmalloc) performs very well, and the lock reduction
achieved by its per-thread-pool approach offers performance similar to burst allocation (AllocatorPlusV2)
for small ranges. With bigger ranges (>=100) AllocatorPlusV2 maintains (more or less) its performance while
StdAllocator can't keep up. As in the WinXP Visual 7.1 tests, AllocatorPlusV2 is faster (up to 2x in some cases)
than AllocatorPlusV1 for big ranges.
</p>
<p>
Burst deallocation is not faster than the standard allocator in Linux, surely because ptmalloc2 reduces
mutex locks just like burst deallocation does, while burst deallocation needs to traverse the range to be
erased twice, as explained before. When compared to AllocatorPlusV1, that is, a wrapper around
<code>dlmalloc</code> and <code>dlfree</code>, burst deallocation (AllocatorPlusV2) is 30% faster for
big ranges.
</p>
<h1><a name="Chapter3"></a>Chapter 3: Adaptive pools for node containers</h1>

<h2><a name="SimpleSegregatedStorage"></a>Simple segregated storage: advantages and problems</h2>
<p>
When developing allocators to improve node container performance, one of the most widely used approaches
is simple segregated storage (e.g. <a href="http://www.boost.org/libs/pool">Boost.Pool</a>).
The idea behind simple segregated storage is to divide a big memory portion ("block") allocated through a
general purpose memory manager into fixed-size "nodes".
A memory manager ("pool") holds a list of "blocks" and a singly linked list of free nodes.
Since all nodes are the same size, simple segregated storage can't serve nodes of different sizes.
To solve this, an application creates a different pool for each size, and usually different types with the same size
share the same pool.
</p>
<p>
The free node list is intrusive (since the node is free, the list pointer is built into the node),
so that <b>there is no size overhead</b> for each node. Simple segregated storage is very fast:
<ul>
<li>Allocation: take the first free node of the free list.
If there are no free nodes, a new block is allocated and partitioned. Complexity: amortized O(1)</li>
<li>Deallocation: put the node at the front of the free list. Complexity: O(1)</li>
</ul>
</p>
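<p>
A minimal sketch of a simple segregated storage pool follows (illustrative only; error handling and
block tracking are omitted):
</p>

<pre>
#include <cstddef>
#include <new>

class simple_pool
{
   struct free_node{ free_node* next; };
   free_node*  free_list_;
   std::size_t node_size_, nodes_per_block_;

   public:
   simple_pool(std::size_t node_size, std::size_t nodes_per_block)
      : free_list_(0)
      , node_size_(node_size < sizeof(free_node) ? sizeof(free_node) : node_size)
      , nodes_per_block_(nodes_per_block)
   {}

   void* allocate()
   {
      if(!free_list_)
         this->priv_refill();       //allocate and partition a new block
      free_node* n = free_list_;    //pop the first free node: O(1)
      free_list_ = n->next;
      return n;
   }

   void deallocate(void* p)
   {
      free_node* n = static_cast<free_node*>(p);
      n->next    = free_list_;      //push on the free list: O(1)
      free_list_ = n;
   }

   private:
   void priv_refill()
   {
      //One big allocation partitioned into intrusively linked nodes.
      //Note that the block pointer itself is not remembered, which is
      //exactly what makes trimming expensive (see below).
      char* block = static_cast<char*>(::operator new(node_size_*nodes_per_block_));
      for(std::size_t i = 0; i != nodes_per_block_; ++i)
         this->deallocate(block + i*node_size_);
   }
};
</pre>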
<p>
Now the question is: when does the pool return fully free "blocks" (allocated each time the free node list was empty) to the
general purpose memory manager so that they can be reused by other pools or non-pool allocators? The common answer is: <b>usually never</b>.
Why? Simple segregated storage erases all bookkeeping data from the node and this makes "block" deallocation very expensive: the pool
needs to traverse the free node list (nodes are <b>not</b> ordered by their original block) and check the address of each node
against the address of each "block" just to find out whether all the nodes of a "block" are free and the block can be safely deallocated. If
we have B blocks and N nodes, this is an O(N*B) operation. Usually N (and B) can be quite high and this operation can take <b>minutes</b>
if the application has allocated several tens of thousands of nodes. That's why the option to trim the memory is provided only as
an <b>optional</b> and <b>explicit</b> operation.
</p>
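<p>
The following pseudo-C++ shows why explicit trimming is so expensive (the <code>block</code> and
<code>free_node</code> types are illustrative): for each of the B blocks, the whole free list of N nodes
must be scanned.
</p>

<pre>
//O(N) per block: count the free nodes whose address falls inside 'b'
std::size_t free_nodes_in_block(const block &b, const free_node* free_list)
{
   std::size_t count = 0;
   for(const free_node* n = free_list; n; n = n->next)
      if(b.contains_address(n)) ++count;
   return count;
}
//Repeated for every block -> O(N*B); a block can be returned to the
//general purpose allocator only if count == nodes_per_block.
</pre>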
<p>
A manual operation makes allocator changes non-portable (How is that manual trimming activated? How long does it take?) so many
applications just ignore this operation. This also has a drawback: if an application allocates a big amount of nodes in a container
and then destroys the container, the pool will hold a big list of free nodes, but that memory won't be reused by any container
that doesn't use a pool with the same node size, leading to excessive memory waste.
</p>
<h2><a name="AdaptivePoolsExplained"></a>Adaptive pools explained</h2>

<h3>Adaptive pool and memory alignment</h3>
<p>
Can we obtain the small size overhead of simple segregated storage and a fast, continuous, memory trimming operation?
The idea behind adaptive pools is simple: when deallocating a node, an efficient self-trimming node pool needs
fast access to the "block" containing the node. If so, the pool can check how many free nodes are in that "block"
(the "block" has a header storing the number of free nodes) and if all nodes are free, the pool can return
that block to the general purpose memory manager.
</p>
<p>
The problem is that storing bookkeeping data (a pointer to the block containing that node) in a node has
a size overhead. But there is an extremely fast way to obtain a pointer to the block containing the node with zero
per-node size overhead: <b>memory alignment</b>.
</p>
<p>
Suppose a 4KB "block" also aligned to 4KB: a node just needs a mask operation to obtain the address of the block.
Once we have it, we can check the number of free nodes and deallocate the block if all the nodes are free. And once we
have fast access to the "block" we can develop a more complicated trimming scheme that favors the creation of free "blocks"
and improves trimming probabilities.
</p>
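<p>
In code, the mask operation is just a couple of instructions (a minimal sketch assuming 4KB blocks):
</p>

<pre>
#include <cstdint>

const std::uintptr_t BlockSize = 4096u;   //block size == block alignment

//Any address inside a 4KB-aligned 4KB block masks down to the block start
inline void* block_from_node(void* node)
{
   return reinterpret_cast<void*>
      (reinterpret_cast<std::uintptr_t>(node) & ~(BlockSize - 1u));
}
</pre>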
<h3>Alignment can lead to excessive memory waste</h3>

<p>
Allocators usually have a payload that stores bookkeeping information used for deallocation. If an adaptive pool allocates 4KB
aligned to 4KB, it will really allocate (4KB + payload) bytes where the start of the non-payload section is aligned
to 4KB. This means that the memory adjacent to this aligned block won't be suitable for a new 4KB aligned allocation, leading
to excessive memory waste (a 4KB allocation would waste the next 4KB - payload bytes of memory).
</p>
<p>
There is a simple solution: allocate (4KB - payload) bytes aligned to 4KB. This simple
solution makes the contiguous memory usable for a future 4KB aligned allocation, leading to optimal memory usage. That's why
knowing the size of the payload is important for aligned memory allocations where the size of the allocation is not fixed.
</p>
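<p>
For example, assuming a 16-byte payload (the real value is implementation-dependent), the request would
look like this, using the aligned allocation primitive presented later in this chapter:
</p>

<pre>
const std::size_t block_size = 4096;
const std::size_t payload    = 16;                     //assumed overhead
const std::size_t request    = block_size - payload;   //4080 bytes
//void* block = boost_cont_memalign(request, block_size);
//payload + request == 4096, so the next aligned block can start
//exactly 4096 bytes after this one: no wasted gap.
</pre>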
<h3>Managing blocks</h3>

<p>
Once we have access from a node to its block, we can implement different policies for blocks.
Once a block is completely free we can just deallocate it, but the allocation pattern
could lead to several half-used blocks, increasing in practice the memory usage
we're trying to minimize.
</p>
<p>
Alternatively, we can implement a strategy that minimizes the number of in-use
blocks. We can try to always allocate nodes from the blocks with the fewest free nodes,
leading, at least in theory, to more fully used blocks and more fully free blocks that
could be deallocated faster. Obviously, this block management must be very fast: searching for the
most used block can ruin performance, especially when compared with the efficient
simple segregated storage. The same happens when deallocating a node: if block
management needs too many operations, deallocation can perform very badly.
</p>
<h3>Minimizing alignment requirements</h3>

<p>
Adaptive pools can impose excessive alignment requirements when many nodes are allocated in a block.
If a block is configured to hold at least 256 nodes and the size of a node is 24 bytes, the required size
for the block is (block payload + 24*256) > 4096 -> 8192 bytes, which also needs 8192-byte alignment. This
means that increasing the number of nodes a block can hold also increases the required alignment.
</p>
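<p>
In other words, the block size (and thus the alignment) is the size needed for the header plus the nodes,
rounded up to the next power of two. A small sketch of that computation:
</p>

<pre>
//Round (header + nodes*node_size) up to the next power of two
std::size_t block_size_for(std::size_t header, std::size_t nodes, std::size_t node_size)
{
   std::size_t needed = header + nodes*node_size;
   std::size_t s = 1u;
   while(s < needed) s <<= 1u;   //e.g. header + 256*24 -> 8192
   return s;                     //block size == required alignment
}
</pre>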
<p>
We can reduce the excessive alignment if we divide each block into sub-blocks. The first sub-block contains the
payload of the block and the rest of the sub-blocks just store a pointer to the first sub-block. This leads to an
additional degree of freedom: block overhead. The overhead per block (the fraction of the block not used to store nodes)
determines the size (and the alignment) of a sub-block, whereas the number of nodes per block determines the
number of sub-blocks per block. This trick keeps the alignment requirement <b>lower</b>
than the approach that does not use sub-blocks and <b>constant</b> with respect to the number of nodes per block.
On the other hand, deallocation requires an additional indirection (from node to sub-block and from sub-block
to the block header) and a slightly more complicated block initialization.
</p>
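<p>
A sketch of that double indirection (names are illustrative): the node masks down to its sub-block, and the
sub-block header stores a pointer to the block header.
</p>

<pre>
#include <cstdint>

struct block_header;                            //free node count, tree hooks...
struct subblock_header{ block_header* block; }; //first word of each sub-block

inline block_header* block_from_node(void* node, std::uintptr_t subblock_size)
{
   //First hop: mask the node address down to its sub-block...
   subblock_header* s = reinterpret_cast<subblock_header*>
      (reinterpret_cast<std::uintptr_t>(node) & ~(subblock_size - 1u));
   //...second hop: follow the stored pointer to the block header
   return s->block;
}
</pre>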
<h2><a name="PresentingAdaptivePool"></a>Presenting <code>adaptive_pool</code></h2>

<p>
<code>adaptive_pool</code> is built on top of the aligned memory allocation function
from our modified <b>DLmalloc</b> library. Alignment must be a power of two:

<pre>
void* boost_cont_memalign(size_t bytes, size_t alignment);
</pre>
</p>
<p>
<code>adaptive_pool</code> is implemented as follows:

<ul>
<li>Each block stores an intrusive singly linked list of free nodes.</li>
<li>The pool stores an intrusive red-black tree of blocks with free nodes, ordered by free node count.</li>
<li>The pool deallocates all fully free blocks except a previously configured
number of blocks.</li>
<li>The user can configure the overhead per block so that alignment requirements can be minimized.</li>
</ul>
</p>
<p>This is the declaration of <code>adaptive_pool</code>:
<pre>
const std::size_t ADP_nodes_per_block   = unspecified;
const std::size_t ADP_max_free_blocks   = unspecified;
const std::size_t ADP_overhead_percent  = unspecified;
const std::size_t ADP_only_alignment    = unspecified;

template < class T
         , unsigned Version = 2
         , std::size_t NodesPerBlock   = ADP_nodes_per_block
         , std::size_t MaxFreeBlocks   = ADP_max_free_blocks
         , std::size_t OverheadPercent = ADP_overhead_percent
         >
class adaptive_pool;
</pre>
</p>
<p>
Description of the template parameters:
<ol>
<li>The type of the objects to be allocated.</li>
<li>The version of the allocator. As with the allocators presented in previous chapters,
a Version 2 allocator offers burst allocation.</li>
<li><i>NodesPerBlock</i> is the minimum number of nodes per block we want
to allocate.</li>
<li><i>MaxFreeBlocks</i> is the maximum number of fully free blocks that the pool will hold
in the intrusive red-black tree (the rest of the fully free blocks will be deallocated).</li>
<li><i>OverheadPercent</i> is a parameter that configures the maximum memory overhead of each
block (and thus, the number of sub-blocks).
If this parameter is equal to <code>ADP_only_alignment</code>, then no sub-blocks will be used
and <i>NodesPerBlock</i> will determine the alignment required by blocks.</li>
</ol>
</p>
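<p>
A hedged usage sketch given the declaration above (<code>bc::list</code> stands for this paper's
<code>list</code> container; the exact headers are not shown here):
</p>

<pre>
//Version 2 (burst-capable), at least 256 nodes per block, at most 2 cached
//fully free blocks, and a maximum 2% overhead per block
typedef adaptive_pool<int, 2, 256, 2, 2> pool_t;

bc::list<int, pool_t> l;
for(int i = 0; i != 100000; ++i)
   l.push_back(i);
l.clear();  //fully free blocks beyond MaxFreeBlocks return to DLmalloc
</pre>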
<p>
To allocate a new node <code>adaptive_pool</code> needs to:
<ul>
<li>See if there are blocks with free nodes in the tree. If not, allocate a new one.</li>
<li>Take the first free node of the first block (the one with the fewest free nodes).</li>
<li>Decrease the count of fully free blocks if the block was fully free.</li>
<li>Erase the block from the tree if the block has no free nodes left.</li>
</ul>
</p>
<p>
To deallocate a node <code>adaptive_pool</code> needs to:
<ul>
<li>Obtain the block from the node.</li>
<li>Put the node in the free list of the block.</li>
<li>Reorder the block tree according to the new free node count.</li>
<li>Increase the count of fully free blocks if the block is now fully free.</li>
<li>Insert the block in the tree if its free node count has passed from 0 to 1.</li>
<li>Deallocate fully free blocks if the fully free block count is higher than the configured limit.</li>
</ul>
</p>
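<p>
Putting both lists together, a pseudo-C++ sketch of the two operations (member names such as
<code>block_tree_</code> are assumptions, not the real implementation):
</p>

<pre>
void* allocate_node()
{
   if(block_tree_.empty())                     //no block with free nodes
      block_tree_.insert(allocate_new_block());
   block_header &b = *block_tree_.begin();     //block with fewest free nodes
   void* node = b.pop_free_node();
   if(b.was_fully_free())       --free_block_count_;
   if(b.free_node_count() == 0)  block_tree_.erase(b);
   return node;
}

void deallocate_node(void* node)
{
   block_header &b = *block_from_node(node);   //the mask operation
   b.push_free_node(node);
   if(b.free_node_count() == 1)  block_tree_.insert(b);   //0 -> 1
   else                          block_tree_.reorder(b);
   if(b.fully_free() && ++free_block_count_ > MaxFreeBlocks)
      priv_deallocate_free_blocks();           //trim on the fly
}
</pre>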
<h2><a name="TestingAdaptivePool"></a>Testing <code>adaptive_pool</code></h2>

<p>
In this test we'll measure 8 allocators:

<ol>
<li><b>AllocatorPlusV1</b>: equivalent to <code>ba::allocator<int, 1></code>, an STL-conformant
allocator around <code>dlmalloc</code> and <code>dlfree</code> (no burst allocation).</li>
<li><b>AllocatorPlusV2</b>: equivalent to <code>ba::allocator<int, 2></code>, a burst-allocation capable
allocator.</li>
<li><b>AdPoolAlignOnlyV1</b>: an adaptive pool with no sub-blocks and no burst allocation.</li>
<li><b>AdPoolAlignOnlyV2</b>: an adaptive pool with no sub-blocks but with burst allocation.</li>
<li><b>AdPool2PercentV1</b>: an adaptive pool with sub-blocks (configured with 2% overhead per block) and no burst allocation.</li>
<li><b>AdPool2PercentV2</b>: an adaptive pool with sub-blocks (configured with 2% overhead per block) but with burst allocation.</li>
<li><b>SimpleSegregatedStorageV1</b>: a classical simple segregated storage allocator with no burst allocation.</li>
<li><b>SimpleSegregatedStorageV2</b>: a classical simple segregated storage allocator with burst allocation.</li>
</ol>
</p>
<p>
All pool allocators (allocators 3-8) are configured with a minimum of 256 nodes per block.
For an int list on 32-bit systems, the required alignment
is 4096 bytes for allocators 3 and 4 (adaptive pools with no sub-blocks) and 2048 bytes for allocators 5 and 6
(adaptive pools with sub-blocks).
</p>
<p>
For each allocator we'll show the following information:
<ul>
<li><b>Insertion speed</b> per element (normalized to AllocatorPlusV1 speed)</li>
<li><b>Erasure speed</b> per element (normalized to AllocatorPlusV1 speed)</li>
<li><b>System bytes after insertion</b>: the amount of system bytes (memory cached by dlmalloc
from the operating system) reported by dlmalloc after inserting all elements</li>
<li><b>Memory overhead</b>: the amount (in percentage) of payload memory per node</li>
<li><b>In use bytes after erasure</b>: the amount of bytes privately held by allocators
after erasing all elements</li>
<li><b>System bytes after erasure</b>: the amount of system bytes (memory cached by dlmalloc
from the operating system) reported by dlmalloc after erasing all elements</li>
</ul>
</p>
<p>The test will be the same as the one used in the section <a href="#TestingBurstAllocation">Testing burst allocation</a>.
Now let's see the results of the test on a Windows XP machine (AMD 1.47 GHz) using Visual C++ 2003:
</p>

<img src="adaptive_pool_msvc-7.1_1.jpg"></img>
<p>
In general, insertion speeds are better for pool allocators than for <b>DLmalloc</b>. Analyzing only V1 allocators
(no burst allocation) we can see that <code>SimpleSegregatedStorageV1</code> performs best, followed by
<code>AdPoolAlignOnlyV1</code> and <code>AdPool2PercentV1</code>. This is quite logical: simple
segregated storage minimizes all block operations because it does
not deallocate free blocks on the fly. <code>AdPoolAlignOnlyV1</code> and <code>AdPool2PercentV1</code>
perform quite similarly.
</p>
<p>
When comparing V2 allocators (burst-allocation enabled) we see that the performance of all pool allocators is very
similar. The reason for this is that the slightly higher costs of adaptive pools are practically minimized when the pool
allocates several nodes in a row: all needed blocks are preallocated in a single call, error handling in loops can be
simplified leading to tighter loops... Curiously, the performance of a burst-enabled <b>DLmalloc</b>
(<code>AllocatorPlusV2</code>) is on par with pools with burst allocation. This is not a surprise: <b>DLmalloc</b> burst
allocation uses exactly the same mechanism as pools: allocating big blocks and creating nodes from them. The difference, though,
is in the overhead per node.
</p>
<img src="adaptive_pool_msvc-7.1_2.jpg"></img>
<p>
When analyzing deallocations, results for V1 allocators are pretty similar, with <code>SimpleSegregatedStorageV1</code>
leading the test. We must remember that <code>SimpleSegregatedStorageV1</code> does not perform any block management
or deallocation, so this is the expected behavior. Adaptive pool deallocation times sit exactly between <b>DLmalloc</b>
(<code>AllocatorPlusV1</code>) and segregated storage times. Adaptive pools perform quite complicated block management,
ordering free blocks by free node count and deallocating completely free blocks if their number exceeds
a configured value (2 blocks in this test). As expected, <code>AdPoolAlignOnlyV2</code> performs marginally better than
<code>AdPool2PercentV2</code> because deallocating a node allocated with <code>AdPool2PercentV2</code> (adaptive
pool with sub-blocks) needs an additional indirection to obtain the block header from the node (first we reach the
header of the sub-block and this header stores the address of the block header).
</p>
<p>
When analyzing burst-enabled allocators, however, the picture changes: adaptive pools perform similarly to
<code>SimpleSegregatedStorageV2</code>. Again, burst allocation allows the elimination of some redundant checks
and management operations, which yields better performance and amortization of block management costs.
</p>

<p>
For big ranges, V2 (burst) adaptive pools are even better than simple segregated storage by a small margin.
</p>
<img src="adaptive_pool_msvc-7.1_3.jpg"></img>
<p>
Now let's see the amount of memory allocated from the operating system by <b>DLmalloc</b> with each allocator
(V1 and V2 allocate practically the same amount of memory so we only represent V2 allocators). The <b>DLmalloc</b>
wrapper (<code>AllocatorPlusV2</code>) allocates more memory than the rest of the allocators. Allocated nodes are
12 bytes on 32-bit systems so <b>DLmalloc</b> needs 12 bytes + overhead (4 bytes) = 16 bytes per node. With 2000000
nodes this yields approximately 32 MB. For pools, since the overhead is practically nonexistent, the total memory
is around 24MB (12 bytes per node).
</p>
<img src="adaptive_pool_msvc-7.1_4.jpg"></img>

<p>
In the previous graph we can see the overhead (in percentage) for each allocator. As expected, <b>DLmalloc</b>
shows a 33% overhead (4 bytes for each 12-byte node) whereas pools show less than 2% overhead (remember that
we've configured <code>AdPool2PercentVX</code> with a maximum of 2 percent overhead). As expected,
<code>SimpleSegregatedStorageVX</code> offers the minimum overhead, but adaptive pools (with their additional
headers for block management) are also very size-efficient.
</p>
<img src="adaptive_pool_msvc-7.1_5.jpg"></img>

<p>
The previous picture shows how much in-use memory <b>DLmalloc</b> reports after deallocating all the nodes.
Note that this does not count the OS memory <b>DLmalloc</b> still caches, but memory that has been allocated
through <b>DLmalloc</b> and has not been deallocated yet through <b>DLmalloc</b>. In other words, memory still
held by pools.
</p>
<p>
As expected, the <b>DLmalloc</b> wrappers (<code>AllocatorPlusVX</code>) have deallocated all their memory (no leaks!).
Also, the simple segregated storage allocators have not deallocated a single block, so they still hold all the memory.
Adaptive pools (<code>AdPoolAlignOnlyVX</code> and <code>AdPool2PercentVX</code>) only hold the number of blocks
we've configured: 2 blocks x 4096 bytes per block = 8192 bytes. So we clearly see that adaptive pools
not only offer a low memory overhead and fast allocation times, but they also return memory to the underlying
memory allocator!
</p>
<p>
Once allocators return memory to <b>DLmalloc</b>, <b>DLmalloc</b> can trim its cached OS memory and return
that memory to the OS. The following graph shows the amount of memory <b>DLmalloc</b> caches after all
nodes have been deallocated:
</p>

<img src="adaptive_pool_msvc-7.1_6.jpg"></img>
<p>
Cached memory usually depends on the number of contiguous free pages that <b>DLmalloc</b> holds
internally, so the exact numbers are not important. The only thing this graph shows is that adaptive pools,
after returning memory to <b>DLmalloc</b>, also encourage memory trimming so that <b>DLmalloc</b> can
return memory to the operating system.
</p>
<p>
Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1:
</p>

<img src="adaptive_pool_linux-gcc-4.1_1.jpg"></img>
<p>
Insertion speeds for V1 allocators are more or less as expected, with segregated storage leading the group
and the non-pool <b>DLmalloc</b> wrapper being the slowest one. V2 (burst) allocators are faster, with similar
performance between adaptive pools and simple segregated storage as the insertion range grows.
</p>
<img src="adaptive_pool_linux-gcc-4.1_2.jpg"></img>

<p>
Erasure speeds are similar to those on Windows, with <code>SimpleSegregatedStorageVX</code> leading the test
and adaptive pools coming next.
</p>
</body>
</html>