histogram/doc/guide.qbk
Hans Dembinski 4a2e3db0e0 upload edits on docs and minor code fixes (#152)
* renaming naked to remove_cvref_t
* more static_asserts to check implicit conditions
* use new BOOST_TEST_TRAIT_SAME
* reorganize and add operator tests, fixes for array and map adaptors
* improved map adaptor
* storage does not have to support scaling operator anymore
* doc improvements
2019-02-10 22:45:00 +01:00


[section:guide User guide]
Boost.Histogram is designed to make simple things simple, yet complex things possible. Correspondingly, this guide covers the basic usage first, and the advanced usage in later sections. For an alternative quick start guide, have a look at the [link histogram.getting_started Getting started] section.
[section Make a histogram]
A histogram consists of a [link histogram.concepts.storage storage] and a sequence of [link histogram.concepts.axis axis] objects. The storage represents a grid of cells of counters. The axis objects map input values to indices, which are used to look up the cell. You don't normally have to worry about the storage, since the library provides a very good default. There are many interesting axis types to choose from, but for now let us stick to the most common axis, the [classref boost::histogram::axis::regular regular] axis. It represents equidistant intervals on the real line.
Use the convenient factory function [funcref boost::histogram::make_histogram make_histogram] to make the histograms. In the following example, a histogram with a single axis is created.
[import ../examples/guide_make_static_histogram.cpp]
[guide_make_static_histogram]
An axis object defines how input values are mapped to bins, which means that it defines the number of bins along that axis and a mapping function from input values to bins. If you provide one axis, the histogram is one-dimensional. If you provide two, it is two-dimensional, and so on.
In the example above, the compiler knows the number of axes and their type at compile-time, because this information can be deduced from the arguments to [funcref boost::histogram::make_histogram make_histogram]. This gives you the best performance, but sometimes you only know the axis configuration at run-time, usually because it depends on run-time user input. In that case, you can create a sequence of axes at run-time and pass it to an overload of the factory function. Here is an example.
[import ../examples/guide_make_dynamic_histogram.cpp]
[guide_make_dynamic_histogram]
[note
When the axis types are known at compile-time, the histogram stores them in a `std::tuple`. If they are only known at run-time, it uses a `std::vector`. In almost all ways, the two versions of the histogram act identically, except that the compile-time version is faster. The [link histogram.overview.rationale.structure.host rationale] has more details on this point.
]
The factory function [funcref boost::histogram::make_histogram make_histogram] uses the default storage type, which provides safe counting, is fast, and memory efficient. If you want to create a histogram with another storage type, use [funcref boost::histogram::make_histogram_with make_histogram_with]. To learn more about other storage types and how to create your own, have a look at the section [link histogram.guide.expert Advanced Usage].
[section Choose the right axis]
The library provides a number of useful axis types. Here is some advice on when to use which.
[variablelist
[
[
[classref boost::histogram::axis::regular]
]
[
Axis over an interval on the real line with bins of equal width. Value-to-index conversion is O(1) and very fast. The axis does not allocate memory dynamically. The axis is very flexible thanks to transforms (see below). Due to finite precision of floating point calculations, bin edges may not be exactly at expected values. If you need bin edges at exactly defined floating point values, use the next axis.
]
]
[
[
[classref boost::histogram::axis::variable]
]
[
Axis over an interval on the real line with bins of variable width. Value-to-index conversion is O(log(N)). The axis allocates memory dynamically to store the bin edges. Use this if the regular axis with transforms cannot represent the binning you want. If you need bin edges at exactly defined floating point values, use this axis.
]
]
[
[
[classref boost::histogram::axis::integer]
]
[
Axis over an integer sequence [i, i+1, i+2, ...]. Can also handle real input values, then it represents bins with a fixed bin width of 1. Value-to-index conversion is O(1) and faster than for the [classref boost::histogram::axis::regular regular] axis. Does not allocate memory dynamically. Use this when your input consists of a sequence of integers.
]
]
[
[
[classref boost::histogram::axis::category]
]
[
Axis over a set of unique values of an arbitrary equal-comparable type. Value-to-index conversion is O(N), but faster than [classref boost::histogram::axis::variable variable] axis for N < 10, the typical use case. The axis allocates memory dynamically to store the values.
]
]
]
Check the class descriptions for more information about each axis type. You can write your own axis types, too, see [link histogram.guide.expert Advanced usage].
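To make the cost figures quoted above more concrete, here is a minimal sketch of the two lookup strategies, written against the standard library only. The helpers `regular_index` and `variable_index` are hypothetical names for illustration and are not the library's implementation. The sketch assumes semi-open bins, returns `-1` for values below the range and the number of bins for values at or above the upper edge, and does not handle NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not library code: index of x on an axis with n
// equal-width bins over the semi-open interval [lo, hi). The cost is O(1).
// Returns -1 below the range and n at or above the upper edge.
int regular_index(double x, int n, double lo, double hi) {
  assert(n > 0 && lo < hi && !std::isnan(x));
  if (x < lo) return -1;
  if (x >= hi) return n;  // the upper edge itself is excluded
  const int i = static_cast<int>((x - lo) / (hi - lo) * n);
  return std::min(i, n - 1);  // guard against floating point round-off
}

// Hypothetical helper, not library code: index of x on an axis with
// arbitrary sorted bin edges. k edges define k - 1 bins, and the binary
// search makes the lookup O(log N).
int variable_index(double x, const std::vector<double>& edges) {
  assert(edges.size() >= 2 && std::is_sorted(edges.begin(), edges.end()));
  // upper_bound finds the first edge greater than x, so a value equal to
  // a lower edge falls into the bin that starts at that edge
  const auto it = std::upper_bound(edges.begin(), edges.end(), x);
  return static_cast<int>(it - edges.begin()) - 1;
}
```

With four bins over `[0, 1)`, `regular_index(0.25, 4, 0.0, 1.0)` yields `1`, because `0.25` is the lower edge of the second bin. The variable version stores its edges verbatim, which is why it can place edges at exactly defined floating point values.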
Here is an example which shows the basic use case for each axis type.
[import ../examples/guide_axis_basic_demo.cpp]
[guide_axis_basic_demo]
[note All builtin axes over the real-line use semi-open bin intervals by convention. As a mnemonic, think of iterator ranges from `begin` to `end`, where `end` is also not included.]
As mentioned in the previous example, you can assign an optional label to any axis to keep track of what the axis is about. Assume you have census data and you want to investigate how yearly income correlates with age. You could do:
[import ../examples/guide_axis_with_labels.cpp]
[guide_axis_with_labels]
Without the metadata it would be difficult to see which axis was covering which quantity. Metadata is the only axis property that can be modified after construction by the user. Axis objects with different metadata do not compare equal.
[section Axis configuration]
All builtin axis types have template arguments for customization. All arguments have reasonable defaults so you can use empty brackets. If your compiler supports C++17, you can drop the brackets altogether. Suitable arguments are then deduced from the constructor call. The template arguments are in order:
[variablelist
[
[Value]
[
The value type is the argument type of the `index()` method. An argument passed to the axis must be implicitly convertible to this type.
]
]
[
[Transform (only [classref boost::histogram::axis::regular regular] axis)]
[
A class that implements a bijective transform between the data space and the space in which the bins are equi-distant. Users can define their own transforms and use them with the axis.
]
]
[
[Metadata]
[
The axis uses an instance of this type to store metadata. It is a `std::string` by default, but it can be any copyable type. If you want to save a small amount of stack memory per axis, you can pass the empty `boost::histogram::axis::null_type` here.
]
]
[
[Options]
[
Compile-time options for the axis, see [enumref boost::histogram::axis::option]. This is used to enable/disable under- and overflow bins, to make an axis circular, or to enable dynamic growth of the axis beyond the initial range.
]
]
[
[Allocator (only [classref boost::histogram::axis::variable variable] and [classref boost::histogram::axis::category category] axes)]
[
Allocator that is used to request memory dynamically to store values. If you don't know what an allocator is you can safely ignore this argument.
]
]
]
[section Transforms]
A transform is the second template argument of the [classref boost::histogram::axis::regular regular] axis. By default, it is the identity transform, which just forwards the value. Transforms allow you to choose the faster stack-allocated regular axis over the generic [classref boost::histogram::axis::variable variable] axis in more cases.
A common need is a regular binning in the logarithm of the input value. This can be achieved with a [classref boost::histogram::axis::transform::log log transform]. The following example shows the builtin transforms.
[import ../examples/guide_axis_with_transform.cpp]
[guide_axis_with_transform]
As shown in the example, due to the finite precision of floating point calculations, the bin edges of a transformed regular axis may not be exactly at the expected values. If you need exact correspondence, use a [classref boost::histogram::axis::variable variable] axis.
Users may write their own transforms and use them with the builtin [classref boost::histogram::axis::regular regular] axis, by implementing a type that matches the [link histogram.concepts.transform transform concept].
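The idea behind a transformed regular axis can be sketched without the library: apply the forward transform, bin uniformly in the transformed space, and use the inverse transform to recover bin edges. The type `log_axis_sketch` below is a hypothetical illustration, not the library's log transform, and its member names are not taken from the transform concept. It assumes `0 < lo < hi` and does not handle NaN.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch, not library code: n bins that are equidistant in
// log(x) over the semi-open interval [lo, hi), with 0 < lo < hi.
struct log_axis_sketch {
  int n;
  double lo, hi;

  // forward transform into the space where all bins have equal width
  static double forward(double x) { return std::log(x); }
  // inverse transform back into the data space
  static double inverse(double u) { return std::exp(u); }

  // O(1) value-to-index conversion; -1 is underflow, n is overflow
  int index(double x) const {
    assert(n > 0 && 0.0 < lo && lo < hi && !std::isnan(x));
    if (x < lo) return -1;
    if (x >= hi) return n;
    const double z = (forward(x) - forward(lo)) / (forward(hi) - forward(lo));
    const int i = static_cast<int>(z * n);
    return i < n ? i : n - 1;  // guard against floating point round-off
  }

  // lower edge of bin i, obtained through the inverse transform
  double lower_edge(int i) const {
    const double u = forward(lo) + i * (forward(hi) - forward(lo)) / n;
    return inverse(u);
  }
};
```

For three bins over `[1, 1000)`, the edges come out near 1, 10, and 100. Because they pass through `std::log` and `std::exp`, they are only approximately equal to those values, which is the finite-precision effect described above.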
[endsect]
[section Options]
A set of options can be used to configure each axis type, by OR-ing [enumref boost::histogram::axis::option option identifiers].
[*Under- and overflow bins]
By default, under- and overflow bins are added automatically for each axis (except if adding them would make no sense). If you create an axis with 20 bins, the histogram will actually have 22 bins along that axis. The extra bins are very useful, as explained in the [link histogram.overview.rationale.uoflow rationale]. If the input cannot exceed the axis range, you can disable the extra bins to save memory. Example:
[import ../examples/guide_axis_with_uoflow_off.cpp]
[guide_axis_with_uoflow_off]
The [classref boost::histogram::axis::category category] axis comes only with an overflow bin, which counts all input values that are not part of the initial set.
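The role of the overflow bin on a category axis can be illustrated with a small standalone sketch. The type `category_axis_sketch` is hypothetical and only mimics the behavior described here: known values map to their position in the initial set, and every other value maps to one extra index.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch, not library code: maps a value to its position in
// a fixed set of categories by linear search, which is O(N).
struct category_axis_sketch {
  std::vector<std::string> values;

  int size() const { return static_cast<int>(values.size()); }

  // Known values get indices 0 .. size() - 1. Any other value gets
  // size(), which plays the role of the overflow bin.
  int index(const std::string& v) const {
    for (int i = 0; i < size(); ++i)
      if (values[i] == v) return i;
    return size();
  }
};
```

With the categories `red`, `green`, and `blue`, the value `green` maps to index `1`, while an unknown value such as `pink` maps to index `3`, one past the last regular bin.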
[*Circular axes]
Each builtin axis except the [classref boost::histogram::axis::category category] axis can be made circular. This means that the axis is periodic at its ends, like a polar angle that wraps around after 360 degrees. This is particularly useful if you want to make a histogram over a polar angle. Example:
[import ../examples/guide_axis_circular.cpp]
[guide_axis_circular]
A circular axis cannot have an underflow bin; passing both options together generates a compile-time error. Since the highest bin wraps around to the lowest bin, there is no possibility for overflow either. However, an overflow bin is still added by default if the value is a floating point type, to catch NaNs.
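The wrap-around can be sketched as follows. The helper `circular_index` is a hypothetical illustration, not the library's implementation. It assumes finite input, so the NaN handling through the overflow bin is left out.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not library code: index of x on a circular axis
// with n bins that cover one period starting at lo. Values outside
// [lo, lo + period) are wrapped back into the range.
int circular_index(double x, int n, double lo, double period) {
  assert(n > 0 && period > 0.0 && std::isfinite(x));
  double z = (x - lo) / period;  // position in units of whole periods
  z -= std::floor(z);            // keep the fractional part in [0, 1)
  const int i = static_cast<int>(z * n);
  return i < n ? i : n - 1;      // guard against floating point round-off
}
```

With four bins over 360 degrees, the angles 10 and 370 fall into the same bin, and -10 falls into the last bin, so no value ends up in an underflow or overflow cell.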
[*Growing axes]
To-do.
[endsect] [/ options]
[endsect] [/ axis configuration]
[endsect] [/ choose the right axis]
[endsect] [/ make a histogram]
[section Fill a histogram]
After you have created a histogram, you want to insert tuples of values, which are possibly multi-dimensional and of different types. This is done with the flexible `operator()` call, which you typically do in a loop. Some extra parameters can be passed to the method as shown in the next example.
[import ../examples/guide_fill_histogram.cpp]
[guide_fill_histogram]
`operator()` either takes `N` arguments or a container with `N` elements, where `N` is equal to the number of axes of the histogram. It finds the corresponding bin, and increments the bin counter by one.
`operator()(weight(x), ...)` does the same as the first call, but increments the bin counter by the value `x`. The type of `x` is not restricted; usually it is a real number. The `weight(x)` helper class must be the first argument. You can freely mix calls with and without a `weight`. Calls without a `weight` act like the weight is `1`.
Why weighted increments are sometimes useful, especially in a scientific context, is explained [link histogram.overview.rationale.weights in the rationale]. If you don't see the point, you can just ignore this type of call. This feature does not affect the performance of the histogram if you don't use it.
[note The first call to a weighted fill internally switches the default storage from integral counters to another type, which holds two real numbers per bin, one for the sum of weights (the weighted count), and another for the sum of weights squared (the variance of the weighted count). This is not necessary for unweighted fills, because the two sums are identical if all weights are `1`. The default storage automatically optimizes this case by using only one integral number per bin as long as no weights are encountered.]
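The two sums mentioned in the note can be illustrated with a single cell. The type `weighted_cell_sketch` is a hypothetical stand-in for a bin counter, not the library's storage. It shows why unweighted fills need only one number per bin.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of one cell, not library code: tracks the sum of
// weights and the sum of squared weights.
struct weighted_cell_sketch {
  double sum_w = 0.0;   // weighted count
  double sum_w2 = 0.0;  // variance estimate of the weighted count

  // a fill without an explicit weight behaves like a weight of 1
  void fill(double w = 1.0) {
    assert(std::isfinite(w));
    sum_w += w;
    sum_w2 += w * w;
  }

  double value() const { return sum_w; }
  double variance() const { return sum_w2; }
};
```

After two fills with weight `1`, both sums are `2`, so a single integer would carry the same information. After fills with weights `2.0` and `0.5`, the value is `2.5` but the variance is `4.25`, and the two numbers must be stored separately.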
[endsect]
[section Access bin counts]
After the histogram has been filled, you want to access the counts per bin at some point. You may want to visualize the counts, or compute some quantities like the mean from the data distribution approximated by the histogram.
To access each bin, you use a multi-dimensional index, which consists of a sequence of bin indices for each axis in order. You can use this index to access the value and the variance estimate for each bin, using the method `histogram::at(...)` (in analogy to `std::vector::at`). It accepts integral indices, one for each axis of the histogram, and returns the associated bin counter type. The bin counter type then allows you to access the count value and its variance.
The calls are demonstrated in the next example.
[import ../examples/guide_access_bin_counts.cpp]
[guide_access_bin_counts]
[note The numbers returned by `value()` and `variance()` are always equal, if weighted fills are not used. The internal structure, which handles the bin counters, has been optimised for this common case. Internally only a single integral number per bin is used until a weighted fill, then the counters internally switch to storing two real numbers per bin. If the very first call to `histogram(...)` is already a weighted increment, the two real numbers per bin are allocated directly without any superfluous conversion from integral counters to double counters. This special case is efficiently handled.]
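One common way to map a multi-dimensional index onto a flat grid of cells is a stride computation. The following sketch shows the general technique only. The helper `linear_offset` is hypothetical, and the layout with the first axis varying fastest is an assumption made for illustration, since this guide does not state how the library arranges its cells. Under- and overflow cells are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not library code: flat offset of the cell addressed
// by one index per axis. The first axis is assumed to vary fastest.
std::size_t linear_offset(const std::vector<int>& idx,
                          const std::vector<int>& extents) {
  assert(idx.size() == extents.size());
  std::size_t offset = 0;
  std::size_t stride = 1;
  for (std::size_t d = 0; d < idx.size(); ++d) {
    assert(idx[d] >= 0 && idx[d] < extents[d]);
    offset += stride * static_cast<std::size_t>(idx[d]);
    stride *= static_cast<std::size_t>(extents[d]);  // cells per step of the next axis
  }
  return offset;
}
```

For a grid with extents 3 and 4, the index pair (1, 2) maps to offset `1 + 3 * 2 = 7`.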
[endsect]
[section Arithmetic operators]
Some arithmetic operations are supported for histograms. Histograms are...
* equal comparable
* addable (adding histograms with non-matching axes is an error)
* multipliable and divisible by a number
These operations are commutative, except for division. Dividing a number by a histogram is not implemented.
Two histograms compare equal, if...
* all axes compare equal, including axis labels
* all values and variance estimates compare equal
Adding histograms is useful, if you want to parallelize the filling of a histogram over several threads or processes. Fill independent copies of the histogram in worker threads, and then add them all up in the main thread.
Multiplying by a number is useful to re-weight histograms before adding them, for those who need to work with weights. Multiplying by a factor `x` has a different effect on value and variance of each bin counter. The value is multiplied by `x`, but the variance is multiplied by `x*x`. This follows from the properties of the variance, as explained in [link histogram.overview.rationale.variance the rationale].
[warning Because of the special behavior of the variance, adding a histogram to itself is not identical to multiplying the original histogram by two, as far as the variance is concerned.]
[note Scaling a histogram automatically converts the bin counters from an integral number per bin to two real numbers per bin, if that has not happened already, because value and variance are different after the multiplication.]
Here is an example which demonstrates the supported operators.
[import ../examples/guide_histogram_operators.cpp]
[guide_histogram_operators]
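The difference between adding a histogram to itself and scaling it by two can be traced on a single cell. The type `cell_sketch` and the functions `add` and `scale` are hypothetical and only follow the rules stated in this section: addition sums values and variances, while scaling by `x` multiplies the value by `x` and the variance by `x*x`.

```cpp
#include <cassert>

// Hypothetical sketch of one cell, not library code.
struct cell_sketch {
  double value;
  double variance;
};

// adding two cells: both the values and the variances add up
cell_sketch add(cell_sketch a, cell_sketch b) {
  assert(a.variance >= 0.0 && b.variance >= 0.0);
  return {a.value + b.value, a.variance + b.variance};
}

// scaling by x: the value scales with x, the variance with x * x
cell_sketch scale(cell_sketch a, double x) {
  assert(a.variance >= 0.0);
  return {a.value * x, a.variance * x * x};
}
```

Starting from a cell with value `3` and variance `3`, `add(c, c)` gives value `6` and variance `6`, whereas `scale(c, 2.0)` gives value `6` and variance `12`. The values agree, the variances do not.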
[endsect]
[section Reductions]
When you have a high-dimensional histogram, sometimes you want to remove some axes and look at the equivalent lower-dimensional version obtained by summing over the counts along the removed axes. Perhaps you found out that there is no interesting structure along an axis, so it is not worth keeping that axis around, or you want to visualize 1d or 2d projections of a high-dimensional histogram.
For this purpose use the `histogram::reduce_to(...)` method, which returns a new reduced histogram with fewer axes. The method accepts indices (one or more), which indicate the axes that are kept. The static histogram only accepts compile-time numbers, while the dynamic histogram also accepts runtime numbers and iterators over numbers.
Here is an example that illustrates this.
[import ../examples/guide_histogram_reduction.cpp]
[guide_histogram_reduction]
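What a reduction computes can be shown on a plain two-dimensional grid of counts. The functions below are hypothetical helpers, not the library's `reduce_to`. They assume counts stored in a flat vector with the first axis varying fastest, which is a choice made for this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not library code: keep the first axis of an
// nx-by-ny grid and sum over the second. Cell (i, j) is counts[i + nx * j].
std::vector<int> project_to_first_axis(const std::vector<int>& counts,
                                       std::size_t nx, std::size_t ny) {
  assert(counts.size() == nx * ny);
  std::vector<int> out(nx, 0);
  for (std::size_t j = 0; j < ny; ++j)
    for (std::size_t i = 0; i < nx; ++i)
      out[i] += counts[i + nx * j];
  return out;
}

// Hypothetical helper, not library code: keep the second axis and sum
// over the first.
std::vector<int> project_to_second_axis(const std::vector<int>& counts,
                                        std::size_t nx, std::size_t ny) {
  assert(counts.size() == nx * ny);
  std::vector<int> out(ny, 0);
  for (std::size_t j = 0; j < ny; ++j)
    for (std::size_t i = 0; i < nx; ++i)
      out[j] += counts[i + nx * j];
  return out;
}
```

For a 3-by-2 grid holding the counts 1 to 6, keeping the first axis yields 5, 7, 9 and keeping the second axis yields 6, 15. The total count of 21 is preserved in both projections.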
[endsect]
[section Streaming]
Simple ostream operators are shipped with the library, which are internally used by the unit tests. These give text representations of axis and histogram configurations, but do not show the histogram content. They may be useful for debugging, but users are encouraged to write their own ostream operators. Therefore, the headers with the builtin implementations are not included by the super header `#include <boost/histogram.hpp>`, so that users can use their own implementations. The following example shows the effect of output streaming.
[import ../examples/guide_histogram_streaming.cpp]
[guide_histogram_streaming]
[endsect]
[section Serialization]
The library supports serialization via [@boost:/libs/serialization/index.html Boost.Serialization]. The serialization code is not included by the super header `#include <boost/histogram.hpp>`, so that the library can be used without Boost.Serialization.
[import ../examples/guide_histogram_serialization.cpp]
[guide_histogram_serialization]
[endsect]
[section:expert Advanced usage]
The library is customizable and extensible by users. Users can create new axis types and use them with the histogram, or implement a custom storage policy, or use a builtin storage policy with a custom counter type.
[section User-defined axis class]
In C++, users can implement their own axis class without touching any library code. The custom axis is just passed to the histogram factories [funcref boost::histogram::make_histogram make_histogram] and [funcref boost::histogram::make_histogram_with make_histogram_with]. The custom axis class must meet the requirements of the [link histogram.concepts.axis axis concept].
The simplest way to make a custom axis is to derive from a builtin class. Here is a contrived example of a custom axis that inherits from the [classref boost::histogram::axis::integer integer axis] and accepts c-strings representing numbers.
[import ../examples/guide_custom_modified_axis.cpp]
[guide_custom_modified_axis]
Alternatively, you can also make an axis completely from scratch. A minimal axis is a functor that maps an input to a bin index. The index has a range `[0, AxisType::size())`.
[import ../examples/guide_custom_minimal_axis.cpp]
[guide_custom_minimal_axis]
Such a minimal axis works, even though it lacks convenience features provided by the builtin axis types. For example, one cannot iterate over this axis. Not even a bin description can be queried, because `operator[]` is not implemented. It is up to the user to implement these optional aspects.
[endsect]
[section User-defined storage policy]
Histograms can be created which use a custom storage class with the factory function [funcref boost::histogram::make_histogram_with]. This factory function accepts many standard containers as storage: vectors, arrays, and maps. These are automatically wrapped with a [classref boost::histogram::storage_adaptor] to provide the storage interface needed by the library.
A `std::vector` may provide higher performance in some cases than the default storage. The counter type can then be chosen by the user. Usually, this would be an integral or floating point type. This storage may be faster than the default storage for low-dimensional histograms (or not, one has to measure).
[warning The no-overflow-guarantee is only valid if the [classref boost::histogram::unlimited_storage default storage] is used. If you change the storage policy, you need to know what you are doing.]
Users who work exclusively with weighted fills should use a `std::vector<double>`, which will be faster than the default storage. Users may also store complex accumulators in the vector. [classref boost::histogram::accumulators::weighted_sum] tracks a variance estimate together with the sum of weights. [classref boost::histogram::accumulators::mean] computes the mean of the samples that are sorted into the cell.
An interesting alternative to a `std::vector` is to use a `std::array`. The latter provides a storage with a fixed maximum capacity (the size of the array). `std::array` allocates the memory on the stack. Using this in combination with a static histogram allows one to create histograms completely on the stack, which is very fast.
Finally, a `std::map` or `std::unordered_map` is adapted into a sparse storage, where empty cells do not consume any memory. In exchange, the memory consumption per cell is much larger than for a vector or array, and the cells are usually not located in a contiguous memory section.
Here is an example of a histogram constructed with an alternative storage policy.
[import ../examples/guide_custom_storage.cpp]
[guide_custom_storage]
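The trade-off of a sparse storage can be seen in a standalone sketch that keeps only touched cells in a hash map. The type `sparse_counts_sketch` is hypothetical and is not how the library's storage adaptor is implemented. It only illustrates that untouched cells read as zero without occupying memory.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical sketch, not library code: counters for a large grid where
// only cells that were incremented at least once are stored.
struct sparse_counts_sketch {
  std::size_t n_cells;  // logical number of cells in the grid
  std::unordered_map<std::size_t, unsigned> cells;

  explicit sparse_counts_sketch(std::size_t n) : n_cells(n) {}

  // the map entry is created on the first increment of a cell
  void increment(std::size_t i) {
    assert(i < n_cells);
    ++cells[i];
  }

  // reading an untouched cell returns zero and does not allocate
  unsigned at(std::size_t i) const {
    assert(i < n_cells);
    const auto it = cells.find(i);
    return it == cells.end() ? 0u : it->second;
  }

  std::size_t allocated() const { return cells.size(); }
};
```

A grid with a million logical cells in which only two distinct cells were filled stores just two entries. Each entry carries the key and the bookkeeping of the hash map in addition to the counter, which is the higher per-cell cost mentioned above.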
[endsect]
[endsect]
[endsect]