mirror of
https://github.com/boostorg/histogram.git
synced 2026-01-30 20:02:13 +00:00
190 lines
12 KiB
HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>Rationale</title>
<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../index.html" title="Chapter 1. Boost.Histogram">
<link rel="up" href="../index.html" title="Chapter 1. Boost.Histogram">
<link rel="prev" href="notes.html" title="Notes">
<link rel="next" href="references.html" title="References">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
<td align="center"><a href="../../../../../index.html">Home</a></td>
<td align="center"><a href="../../../../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="notes.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="references.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="boost_histogram.rationale"></a><a class="link" href="rationale.html" title="Rationale">Rationale</a>
</h2></div></div></div>
<div class="toc"><dl class="toc">
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.design_principles">Design principles</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.interface_convenience">Interface
convenience</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.language_transparency">Language
transparency</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.powerful_binning_strategies">Powerful
binning strategies</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.performance_and_memory_efficiency">Performance
and memory-efficiency</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.weighted_counts_and_variance_estimates">Weighted
counts and variance estimates</a></span></dt>
<dt><span class="section"><a href="rationale.html#boost_histogram.rationale.serialization_and_zero_suppression">Serialization
and zero-suppression</a></span></dt>
</dl></div>
<p>
I designed the histogram based on a decade of experience working with
Big Data, more precisely in the fields of particle physics and astroparticle
physics. In many ways, the <a href="https://root.cern.ch" target="_top">ROOT</a> histograms served as an example of <span class="bold"><strong>how not to do it</strong></span>.
</p>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.design_principles"></a><a class="link" href="rationale.html#boost_histogram.rationale.design_principles" title="Design principles">Design principles</a>
</h3></div></div></div>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
<li class="listitem">
"Do one thing and do it well", Doug McIlroy
</li>
<li class="listitem">
The <a href="https://www.python.org/dev/peps/pep-0020" target="_top">Zen of Python</a>
(also applies to other languages).
</li>
</ul></div>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.interface_convenience"></a><a class="link" href="rationale.html#boost_histogram.rationale.interface_convenience" title="Interface convenience">Interface
convenience</a>
</h3></div></div></div>
<p>
A histogram should have the same consistent interface whatever the dimension.
Like <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code> it should <span class="bold"><strong>just
work</strong></span>: users shouldn't be forced to make <span class="bold"><strong>a
priori</strong></span> choices among several histogram classes and options every time
they encounter a new data set.
</p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.language_transparency"></a><a class="link" href="rationale.html#boost_histogram.rationale.language_transparency" title="Language transparency">Language
transparency</a>
</h3></div></div></div>
<p>
Python is a great language for data analysis, so the histogram needs Python
bindings. The histogram should be usable as an interface between a complex
simulation or data-storage system written in C++ and data-analysis/plotting
in Python: define the histogram in Python, let it be filled on the C++ side,
and then get it back for further data analysis or plotting.
</p>
<p>
Data analysis in Python is NumPy-based, so NumPy support is a must.
</p>
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.powerful_binning_strategies"></a><a class="link" href="rationale.html#boost_histogram.rationale.powerful_binning_strategies" title="Powerful binning strategies">Powerful
binning strategies</a>
</h3></div></div></div>
<p>
The histogram supports half a dozen different binning strategies, conveniently
encapsulated in axis objects. There is the standard sorting of real-valued
data into bins of equal or varying width, but also binning of angles or integer
values.
</p>
<p>
Extra bins that count over- and underflow values are added by default. This
feature can be turned off individually for each axis. The extra bins do not
disturb normal bin counting. On an axis with <code class="computeroutput"><span class="identifier">n</span></code>
bins, the first bin has the index <code class="computeroutput"><span class="number">0</span></code>,
the last bin <code class="computeroutput"><span class="identifier">n</span><span class="special">-</span><span class="number">1</span></code>, while the under- and overflow bins are accessible
at <code class="computeroutput"><span class="special">-</span><span class="number">1</span></code>
and <code class="computeroutput"><span class="identifier">n</span></code>, respectively.
</p>
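<p>
The index convention described above can be sketched in a few lines of Python.
This is an illustration only, not the library's implementation: the names
<code class="computeroutput">regular_axis_index</code> and <code class="computeroutput">fill</code>,
and the choice to keep the two extra bins at the ends of a flat list, are
assumptions made for this example.
</p>

```python
def regular_axis_index(x, n, lower, upper):
    """Map x to a bin index on an axis with n equal-width bins over [lower, upper).

    Returns -1 for underflow and n for overflow (hypothetical helper).
    """
    if x < lower:
        return -1
    if x >= upper:
        return n
    # Clamp to n - 1 to guard against rounding at the upper edge.
    return min(int((x - lower) / (upper - lower) * n), n - 1)


def fill(counts, x, n, lower, upper):
    """Increment the cell for x in a list of n + 2 counts.

    The list is shifted by one so that index -1 (underflow) lands in cell 0
    and index n (overflow) lands in cell n + 1.
    """
    counts[regular_axis_index(x, n, lower, upper) + 1] += 1
```

<p>
With this layout the regular bins keep their indices from
<code class="computeroutput">0</code> to <code class="computeroutput">n-1</code>,
so code that ignores the extra bins is unaffected by them.
</p>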
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.performance_and_memory_efficiency"></a><a class="link" href="rationale.html#boost_histogram.rationale.performance_and_memory_efficiency" title="Performance and memory-efficiency">Performance
and memory-efficiency</a>
</h3></div></div></div>
<p>
Dense storage in memory is a must for high performance. Unfortunately, the
<a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_top">curse
of dimensionality</a> quickly becomes a problem as the number of dimensions
grows, leading to histograms which consume large amounts (up to GBs) of memory.
</p>
<p>
Fortunately, having many dimensions typically reduces the number of counts
per bin, since counts get spread over many dimensions. The histogram uses
an adaptive count size per bin to be as memory-efficient as possible, by
starting with the smallest integer size per bin of 1 byte and increasing
as needed up to 8 bytes. A <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">vector</span></code>
grows in <span class="bold"><strong>length</strong></span> as new elements are added,
while the count storage grows in <span class="bold"><strong>depth</strong></span>.
</p>
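<p>
The idea of a storage that grows in depth can be sketched as follows. This is
a minimal illustration under stated assumptions, not the library's storage
code: the class name <code class="computeroutput">AdaptiveCounts</code>, the
helper <code class="computeroutput">typecode_for</code>, and the policy of
widening every cell as soon as one cell would overflow are all hypothetical.
</p>

```python
from array import array


def typecode_for(nbytes):
    """Return an unsigned array typecode whose items occupy nbytes bytes."""
    for code in "BHILQ":
        if array(code).itemsize == nbytes:
            return code
    raise ValueError(f"no unsigned integer type with {nbytes} bytes")


class AdaptiveCounts:
    """Counter array that widens all cells when one is about to overflow."""

    WIDTHS = (1, 2, 4, 8)  # bytes per bin, smallest first

    def __init__(self, nbins):
        self._level = 0
        self._data = array(typecode_for(self.WIDTHS[0]), [0] * nbins)

    @property
    def bytes_per_bin(self):
        return self._data.itemsize

    def __getitem__(self, i):
        return self._data[i]

    def increment(self, i):
        # Largest value representable at the current width.
        largest = (1 << (8 * self._data.itemsize)) - 1
        if self._data[i] == largest:
            self._grow()
        self._data[i] += 1

    def _grow(self):
        # Copy all counts into an array with the next larger cell width.
        if self._level + 1 == len(self.WIDTHS):
            raise OverflowError("count exceeds 8 bytes")
        self._level += 1
        code = typecode_for(self.WIDTHS[self._level])
        self._data = array(code, self._data.tolist())
```

<p>
The number of cells never changes in this sketch; only the width of each cell
does, which is the sense in which the storage grows in depth rather than in
length.
</p>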
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.weighted_counts_and_variance_estimates"></a><a class="link" href="rationale.html#boost_histogram.rationale.weighted_counts_and_variance_estimates" title="Weighted counts and variance estimates">Weighted
counts and variance estimates</a>
</h3></div></div></div>
<p>
A histogram categorizes and counts, so the natural data type for the counts
is an integer. However, in particle physics, histograms are
also often filled with weighted events, for example, to make sure that two
histograms look the same in one variable, while the distribution of another,
correlated variable is a subject of study.
</p>
<p>
The histogram can be filled with either weighted or unweighted counts. In
the weighted case, the sum of weights is stored in a <code class="computeroutput"><span class="identifier">double</span></code>.
The histogram provides a variance estimate in both cases. In the unweighted
case, the estimate is computed from the count itself, using Poisson statistics.
In the weighted case, the sum of squared weights is stored alongside the
sum of weights in a second <code class="computeroutput"><span class="identifier">double</span></code>,
and used to compute a variance estimate.
</p>
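<p>
The two variance estimates can be sketched as follows. The sketch assumes the
usual conventions: for an unweighted count the estimate equals the count itself,
and for weighted fills it equals the sum of squared weights. The names
<code class="computeroutput">poisson_variance</code> and
<code class="computeroutput">WeightedBin</code> are hypothetical and chosen for
this example only.
</p>

```python
def poisson_variance(count):
    """Variance estimate for an unweighted count: the count itself."""
    return float(count)


class WeightedBin:
    """One bin that tracks the sum of weights and the sum of squared weights."""

    def __init__(self):
        self.sum_w = 0.0   # sum of weights, the bin content
        self.sum_w2 = 0.0  # sum of squared weights, the variance estimate

    def fill(self, weight=1.0):
        self.sum_w += weight
        self.sum_w2 += weight * weight

    @property
    def value(self):
        return self.sum_w

    @property
    def variance(self):
        return self.sum_w2
```

<p>
When every weight is 1, the sum of squared weights equals the count, so the
weighted estimate reduces to the unweighted one.
</p>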
</div>
<div class="section">
<div class="titlepage"><div><div><h3 class="title">
<a name="boost_histogram.rationale.serialization_and_zero_suppression"></a><a class="link" href="rationale.html#boost_histogram.rationale.serialization_and_zero_suppression" title="Serialization and zero-suppression">Serialization
and zero-suppression</a>
</h3></div></div></div>
<p>
Serialization is implemented using <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">serialization</span></code>.
Pickling in Python is implemented based on the C++ serialization code. To
ensure portability of the pickled histogram, the pickle string is an ASCII
representation of the histogram, generated with the <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">archive</span><span class="special">::</span><span class="identifier">text_oarchive</span></code>.
It would be great to switch to a portable binary representation in the future,
when that becomes available.
</p>
<p>
To reduce the size of the string, run-length encoding (zero-suppression) is
applied to sequences of zeros. Partly filled histograms often contain long sequences
of zeros.
</p>
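<p>
Zero-suppression by run-length encoding can be sketched as follows. The token
layout used here (a <code class="computeroutput">0</code> followed by the length
of the run) is an assumption made for illustration; the text above does not
specify the format the library actually writes, and the function names
<code class="computeroutput">encode</code> and
<code class="computeroutput">decode</code> are hypothetical.
</p>

```python
def encode(counts):
    """Replace each run of zeros by the pair (0, run length)."""
    out = []
    i = 0
    while i < len(counts):
        if counts[i] == 0:
            j = i
            while j < len(counts) and counts[j] == 0:
                j += 1
            out += [0, j - i]  # marker followed by the number of zeros
            i = j
        else:
            out.append(counts[i])
            i += 1
    return out


def decode(tokens):
    """Invert encode(): expand each (0, run length) pair back into zeros."""
    out = []
    it = iter(tokens)
    for token in it:
        if token == 0:
            out.extend([0] * next(it))
        else:
            out.append(token)
    return out
```

<p>
In this sketch an isolated zero costs two tokens instead of one, so the scheme
only pays off when zeros come in long runs, as they do in partly filled
histograms.
</p>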
</div>
</div>
<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
<td align="left"></td>
<td align="right"><div class="copyright-footer">Copyright © 2016 Hans Dembinski<p>
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
</p>
</div></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="notes.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="references.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
</body>
</html>