mirror of https://github.com/boostorg/bloom.git synced 2026-01-19 16:12:10 +00:00

Candidate Boost Bloom Library

(Candidate) Boost.Bloom provides the class template boost::bloom::filter that can be configured to implement a classical Bloom filter as well as variations discussed in the literature such as blocked filters, split block/multi-block filters, and more.

Example

#include <boost/bloom/filter.hpp>
#include <cassert>
#include <string>

int main()
{
  // Bloom filter of strings with 5 bits set per insertion
  using filter = boost::bloom::filter<std::string, 5>;

  // create filter with a capacity of 1'000'000 bits
  filter f(1'000'000);

  // insert elements (they can't be erased, Bloom filters are insert-only)
  f.insert("hello");
  f.insert("Boost");
  //...

  // elements inserted are always correctly checked as such
  assert(f.may_contain("hello") == true);

  // elements not inserted may incorrectly be reported as present, with a
  // false positive rate (FPR) that depends on the array capacity, the
  // number of bits set per element and, more generally, on how the
  // boost::bloom::filter was specified
  if(f.may_contain("bye")) { // likely false
    //...
  }
}

Filter definition

A boost::bloom::filter can be regarded as an array of buckets selected pseudo-randomly (based on a hash function) upon insertion: each of the buckets is passed to a subfilter that marks one or more of its bits according to some associated strategy.

template<
  typename T, std::size_t K,
  typename Subfilter = block<unsigned char, 1>, std::size_t BucketSize = 0,
  typename Hash = boost::hash<T>, typename Allocator = std::allocator<T>  
>
class filter;
  • T: type of the elements inserted.
  • K: number of buckets marked per insertion.
  • Subfilter: type of subfilter used (more on this later).
  • BucketSize: the number of buckets is just the capacity of the underlying array (in bytes), divided by BucketSize. When BucketSize is specified as zero, the value sizeof(Subfilter::value_type) is used.

The default configuration with block<unsigned char,1> corresponds to a classical Bloom filter setting K bits per element, uniformly distributed across the array.
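To make the classical case concrete, here is a minimal toy filter that marks K single-bit positions per insertion. It is only an illustration: the class name toy_bloom_filter, the use of std::hash and the splitmix64-style mixer are assumptions made for this sketch and do not reflect how boost::bloom::filter is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy classical Bloom filter over strings: K bit positions are derived
// from the element's hash and set on insertion. Hypothetical sketch,
// not the library's implementation.
template<std::size_t K>
class toy_bloom_filter
{
public:
  // m_bits: capacity of the bit array (must be positive)
  explicit toy_bloom_filter(std::size_t m_bits)
    : bits_((m_bits + 7) / 8), m_(m_bits)
  {
    assert(m_bits > 0);
  }

  void insert(const std::string& x)
  {
    std::uint64_t h = mix(std::hash<std::string>{}(x));
    for(std::size_t i = 0; i < K; ++i) {
      std::size_t pos = static_cast<std::size_t>(h % m_);
      bits_[pos / 8] |= static_cast<unsigned char>(1u << (pos % 8));
      h = mix(h); // next pseudo-random position
    }
  }

  bool may_contain(const std::string& x) const
  {
    std::uint64_t h = mix(std::hash<std::string>{}(x));
    for(std::size_t i = 0; i < K; ++i) {
      std::size_t pos = static_cast<std::size_t>(h % m_);
      if(!(bits_[pos / 8] & (1u << (pos % 8)))) return false;
      h = mix(h);
    }
    return true; // all K bits set: present or false positive
  }

private:
  // splitmix64-style finalizer (an arbitrary choice for this sketch)
  static std::uint64_t mix(std::uint64_t z)
  {
    z += 0x9e3779b97f4a7c15ull;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
    return z ^ (z >> 31);
  }

  std::vector<unsigned char> bits_;
  std::size_t                m_;
};
```

Because insert and may_contain visit the same K positions, an inserted element is never reported as absent; only false positives are possible.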

Overlapping buckets

BucketSize can be any value other than (and typically less than) sizeof(Subfilter::value_type). When this is the case, subfilters in adjacent buckets act on overlapping byte ranges. Far from being a problem, this improves the resulting false positive rate (FPR) of the filter. The downside is that buckets will generally not be properly aligned in memory, which may result in more cache misses.
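The overlap can be visualized with a small helper that computes which bytes a bucket touches. This sketch assumes buckets start every BucketSize bytes and that the subfilter acts on sizeof(Subfilter::value_type) bytes from there; the names byte_range, bucket_bytes and overlap_bytes are hypothetical and not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Half-open byte range [first, last) of the underlying array.
struct byte_range { std::size_t first, last; };

// Bytes touched by the subfilter acting on bucket i, assuming buckets
// start every bucket_size bytes and span value_size bytes.
inline byte_range bucket_bytes(
  std::size_t i, std::size_t bucket_size, std::size_t value_size)
{
  assert(bucket_size > 0 && value_size > 0);
  return { i * bucket_size, i * bucket_size + value_size };
}

// Number of bytes shared by two consecutive buckets.
inline std::size_t overlap_bytes(
  std::size_t bucket_size, std::size_t value_size)
{
  return value_size > bucket_size ? value_size - bucket_size : 0;
}
```

For an 8-byte subfilter value and BucketSize = 1, bucket 3 covers bytes [3, 11) and shares 7 bytes with bucket 4; with BucketSize = 8 there is no overlap.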

Provided subfilters

block<Block, K'>

Sets K' bits in an underlying value of the unsigned integral type Block (e.g. unsigned char, uint32_t, uint64_t). So, a filter<T, K, block<Block, K'>> will set K*K' bits per element. The tradeoff here is that insertion/lookup will be (much) faster than with filter<T, K*K'> while the FPR will be worse (larger). FPR is better the wider Block is.

multiblock<Block, K'>

Instead of setting K' bits in a Block value, this subfilter sets one bit on each of the elements of a Block[K'] subarray. This improves FPR but impacts performance with respect to block<Block, K'>, among other things because cacheline boundaries can be crossed when accessing the subarray.
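The difference between the two marking strategies can be sketched as follows, using uint64_t as Block. The function names and the splitmix64-style mixer that derives bit positions are assumptions for illustration only; the library's actual bit selection may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// splitmix64-style mixer used to derive successive bit positions
// (an arbitrary choice for this sketch).
inline std::uint64_t mix(std::uint64_t z)
{
  z += 0x9e3779b97f4a7c15ull;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
  return z ^ (z >> 31);
}

// block-style marking: up to kp bits set inside one 64-bit value
// (positions may collide, so fewer than kp bits can end up set).
inline void mark_block(std::uint64_t& value, std::uint64_t hash, std::size_t kp)
{
  for(std::size_t i = 0; i < kp; ++i) {
    hash = mix(hash);
    value |= std::uint64_t(1) << (hash & 63);
  }
}

// multiblock-style marking: exactly one bit set in each of the kp
// elements of a subarray values[0..kp).
inline void mark_multiblock(std::uint64_t* values, std::uint64_t hash, std::size_t kp)
{
  for(std::size_t i = 0; i < kp; ++i) {
    hash = mix(hash);
    values[i] |= std::uint64_t(1) << (hash & 63);
  }
}

// Number of bits set in v.
inline int count_bits(std::uint64_t v)
{
  int c = 0;
  for(; v; v &= v - 1) ++c;
  return c;
}
```

The multiblock variant spreads the same number of bits over kp times as much memory, which is what lowers the FPR and also what makes it more likely to touch a second cacheline.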

fast_multiblock32<K'>

Statistically equivalent to multiblock<uint32_t, K'>, but uses faster SIMD-based algorithms when SSE2, AVX2 or Neon are available. The approach is similar to that of Apache Kudu BlockBloomFilter, but that implementation is fixed to K' = 8 whereas we accept any value.

fast_multiblock64<K'>

Statistically equivalent to multiblock<uint64_t, K'>, but uses a faster SIMD-based algorithm when AVX2 is available.

Estimating FPR

For a classical Bloom filter, the theoretical false positive rate, under some simplifying assumptions, is given by

FPR(n,m,k)=\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k \text{ for large } m,

where n is the number of elements inserted in the filter, m its capacity in bits and k the number of bits set per insertion (see a derivation of this formula). For a given inverse load factor c=m/n, the optimum k is the integer closest to:

k_{\text{opt}}=c\cdot\ln2,

yielding a minimum attainable FPR of 1/2^{k_{\text{opt}}} \approx 0.6185^{c}.
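The formulas above translate directly into code. The following is a small sketch; the function names fpr, fpr_approx and optimum_k are chosen for this example and are not the library's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Theoretical FPR of a classical Bloom filter with n elements,
// m bits of capacity and k bits set per insertion.
inline double fpr(double n, double m, double k)
{
  return std::pow(1.0 - std::pow(1.0 - 1.0 / m, k * n), k);
}

// Large-m approximation of the same quantity.
inline double fpr_approx(double n, double m, double k)
{
  return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Integer closest to c*ln(2), with c = m/n (clamped to at least 1).
inline std::size_t optimum_k(double n, double m)
{
  assert(n > 0 && m > 0);
  double k = std::round(m / n * std::log(2.0));
  return k < 1.0 ? 1 : static_cast<std::size_t>(k);
}
```

For n = 100'000 and m = 1'000'000 (c = 10), optimum_k returns 7 and fpr_approx gives roughly 0.0082, in line with 0.6185^10.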

In the case of a Boost.Bloom block filter of the form filter<T, K, block<Block, K'>>, we can extend the approach from Putze et al. to derive the (approximate but very precise) formula:

FPR_{\text{block}}(n,m,b,k,k')=\left(\sum_{i=0}^{\infty} \text{Pois}(i,nbk/m) \cdot FPR(i,b,k')\right)^{k},

where

\text{Pois}(i,\lambda)=\frac{\lambda^i e^{-\lambda}}{i!}

is the probability mass function of a Poisson distribution with mean \lambda, and b is the size of Block in bits. If we're using multiblock<Block,K'>, we have

FPR_\text{multiblock}(n,m,b,k,k')=\left(\sum_{i=0}^{\infty} \text{Pois}(i,nbkk'/m) \cdot FPR(i,b,1)^{k'}\right)^{k}.
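The two Poisson-weighted sums can be evaluated numerically by truncating the series well past the mean. The sketch below does so; the helper poisson_expectation, the function names and the truncation bound are assumptions of this example rather than anything prescribed by the library.

```cpp
#include <cassert>
#include <cmath>

// Classical FPR(n, m, k).
inline double fpr(double n, double m, double k)
{
  return std::pow(1.0 - std::pow(1.0 - 1.0 / m, k * n), k);
}

// Sum over i >= 0 of Pois(i, lambda) * term(i), truncated far beyond
// the mean. Assumes lambda is moderate, so exp(-lambda) does not
// underflow.
template<typename F>
double poisson_expectation(double lambda, F term)
{
  assert(lambda >= 0.0);
  double p = std::exp(-lambda); // Pois(0, lambda)
  double sum = 0.0;
  int imax = static_cast<int>(lambda + 12.0 * std::sqrt(lambda) + 50.0);
  for(int i = 0; i <= imax; ++i) {
    sum += p * term(i);
    p *= lambda / (i + 1);      // Pois(i+1, lambda) from Pois(i, lambda)
  }
  return sum;
}

// FPR_block(n, m, b, k, k'), with b the size of Block in bits.
inline double fpr_block(double n, double m, double b, double k, double kp)
{
  double lambda = n * b * k / m;
  double s = poisson_expectation(
    lambda, [=](int i) { return fpr(i, b, kp); });
  return std::pow(s, k);
}

// FPR_multiblock(n, m, b, k, k').
inline double fpr_multiblock(double n, double m, double b, double k, double kp)
{
  double lambda = n * b * k * kp / m;
  double s = poisson_expectation(
    lambda, [=](int i) { return std::pow(fpr(i, b, 1.0), kp); });
  return std::pow(s, k);
}
```

Evaluating, say, fpr_block(1e5, 1e6, 64, 1, 7) against fpr_multiblock(1e5, 1e6, 64, 1, 7) and fpr(1e5, 1e6, 7) gives a quick way to compare the three designs at the same number of bits set per insertion.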

As we have commented before, in general

FPR_\text{block}(n,m,b,k,k') \geq FPR_\text{multiblock}(n,m,b,k,k') \geq FPR(n,m,kk'),

that is, block and multi-block filters have worse FPR than the classical filter for the same number of bits set per insertion, but they will be much faster. We have the particular case

FPR_{\text{block}}(n,m,b,k,1)=FPR_{\text{multiblock}}(n,m,b,k,1)=FPR(n,m,k),

which follows simply from the observation that using {block|multiblock}<Block, 1> behaves exactly as a classical Bloom filter.
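The particular case can also be checked directly on the formula. With k' = 1, and writing \lambda = nbk/m, the Poisson sum has a closed form and collapses to the large-m form of the classical FPR:

```latex
\sum_{i=0}^{\infty} \text{Pois}(i,\lambda)\cdot FPR(i,b,1)
  = \sum_{i=0}^{\infty} \frac{\lambda^i e^{-\lambda}}{i!}
    \left(1-\left(1-\frac{1}{b}\right)^{i}\right)
  = 1 - e^{-\lambda}\, e^{\lambda(1-1/b)}
  = 1 - e^{-\lambda/b}
  = 1 - e^{-kn/m},
\qquad\text{so}\qquad
FPR_{\text{block}}(n,m,b,k,1) = \left(1 - e^{-kn/m}\right)^{k} \approx FPR(n,m,k).
```

The same computation applies to multiblock<Block, 1>, since both sums coincide when k' = 1.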

We don't know of any simple closed formula for the FPR of block and multiblock filters when BucketSize is not the "natural" size (sizeof(Block) for block<Block, K'>, sizeof(Block)*K' for multiblock<Block, K'>), that is, when subfilter values overlap. We can use the following approximations (s = BucketSize in bits):

FPR_{\text{block}}(n,m,b,s,k,k')=\left(\sum_{i=0}^{\infty} \text{Pois}\left(i,\frac{n(2b-s)k}{m}\right) \cdot FPR(i,2b-s,k')\right)^{k},

FPR_\text{multiblock}(n,m,b,s,k,k')=\left(\sum_{i=0}^{\infty} \text{Pois}\left(i,\frac{n(2bk'-s)k}{m}\right) \cdot FPR\left(i,\frac{2bk'-s}{k'},1\right)^{k'}\right)^{k},

where the replacement of b with 2b-s (or bk' with 2bk'-s for multiblock filters) accounts for the fact that the window of hashing positions affecting a particular bit spreads due to overlapping. Note that the formulas reduce to the non-overlapping case when s takes its default value (b for block, bk' for multiblock). These approximations are acceptable for low values of k' but tend to underestimate the actual FPR as k' grows. In general, the use of overlapping improves (decreases) FPR by a factor ranging from 0.6 to 0.9 for typical filter configurations.
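A numerical sketch of the overlapping approximations follows, using the same truncated Poisson sum idea. All function names are hypothetical, and the results inherit the limitations stated above (in particular, underestimation as k' grows).

```cpp
#include <cassert>
#include <cmath>

// Classical FPR(n, m, k).
inline double fpr(double n, double m, double k)
{
  return std::pow(1.0 - std::pow(1.0 - 1.0 / m, k * n), k);
}

// Truncated sum over i >= 0 of Pois(i, lambda) * term(i); assumes a
// moderate lambda so that exp(-lambda) does not underflow.
template<typename F>
double poisson_expectation(double lambda, F term)
{
  assert(lambda >= 0.0);
  double p = std::exp(-lambda);
  double sum = 0.0;
  int imax = static_cast<int>(lambda + 12.0 * std::sqrt(lambda) + 50.0);
  for(int i = 0; i <= imax; ++i) {
    sum += p * term(i);
    p *= lambda / (i + 1);
  }
  return sum;
}

// Approximate FPR of filter<T, K, block<Block, K'>, BucketSize>,
// with b = bits in Block and s = BucketSize in bits.
inline double fpr_block_overlap(
  double n, double m, double b, double s, double k, double kp)
{
  double w = 2.0 * b - s; // effective window in bits
  double sum = poisson_expectation(
    n * w * k / m, [=](int i) { return fpr(i, w, kp); });
  return std::pow(sum, k);
}

// Approximate FPR of filter<T, K, multiblock<Block, K'>, BucketSize>.
inline double fpr_multiblock_overlap(
  double n, double m, double b, double s, double k, double kp)
{
  double w = 2.0 * b * kp - s; // effective window in bits
  double sum = poisson_expectation(
    n * w * k / m,
    [=](int i) { return std::pow(fpr(i, w / kp, 1.0), kp); });
  return std::pow(sum, k);
}
```

Passing s = b (or s = bk' for multiblock) reproduces the non-overlapping formulas, so the effect of overlapping can be gauged by comparing, for instance, fpr_block_overlap(n, m, 64, 8, k, k') with fpr_block_overlap(n, m, 64, 64, k, k').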

Experimental results

Provided in a dedicated repo.
