Bloom Filters
Motivation
A Bloom filter for a set \(X = \{x_1, x_2, …, x_n\}\) from some universe \(\mathbb{U}\) allows one to test whether a given item \(x \in \mathbb{U}\) is an element of \(X\). Hash tables can already answer \(x \in X\) queries in \(O(1)\) expected time and \(O(n)\) space.
Bloom filters also answer queries in \(O(1)\) time, but with considerably less space, at the cost of allowing false positives. False positives make Bloom filters unsuitable as an exact membership data structure, but they are commonly used as filters or sanity checks in front of more complex data structures.
Sample Real-World Use-Case
The Chromium project has an optimization_guide::BloomFilter class that keeps track of a set of strings, e.g. a list of domains provided by some blocklist/allowlist server. WTF::BloomFilter keeps track of unsigned data, e.g. hashes of CSS selectors for the purpose of getting a unique selector.
Pseudocode
A Bloom filter consists of an array \(B[0 .. m-1]\) of bits, together with \(k\) hash functions \(h_1, h_2, …, h_k: \mathbb{U} \to \{0, 1, …, m-1\}\).
def make_bloom_filter([x_1, ..., x_n]):
    for h in [0, ..., m-1]:
        B[h] = 0                     # Zero out all bits in B
    for x_i in [x_1, ..., x_n]:      # For each member of X
        for h_j in [h_1, ..., h_k]:  # For each hash function
            B[h_j(x_i)] = 1          # Set the bit that h_j maps x_i to
    return B
def maybe_in_X(B, y):
    for h_j in [h_1, ..., h_k]:      # For each hash function
        if B[h_j(y)] == 0:
            return False             # Guaranteed correct
    return True                      # Not necessarily correct
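As a concrete illustration, here is a minimal runnable Python sketch of the pseudocode above. All names and parameters are illustrative, and the salted-SHA-256 trick used to derive the \(k\) hash functions only approximates the independent ideal hashes the analysis below assumes:

import hashlib

def _hash(j, x, m):
    # j-th "hash function": salt the input with j, then reduce modulo m.
    digest = hashlib.sha256(f"{j}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") % m

def make_bloom_filter(items, m, k):
    B = [0] * m                       # Zero out all bits in B
    for x in items:                   # For each member of X
        for j in range(k):            # For each hash function
            B[_hash(j, x, m)] = 1     # Set the bit that h_j maps x to
    return B

def maybe_in_X(B, y, k):
    m = len(B)
    for j in range(k):
        if B[_hash(j, y, m)] == 0:
            return False              # Guaranteed correct
    return True                       # Not necessarily correct

B = make_bloom_filter(["a.example", "b.example"], m=64, k=4)
assert maybe_in_X(B, "a.example", k=4)   # Inserted items are always found
print(maybe_in_X(B, "c.example", k=4))   # Usually False; True would be a false positive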
Going by the discussion of the bias of Monte Carlo decision algorithms, a Bloom filter is a false-biased algorithm.
Notice that the various hash functions \(h_j\) can be computed in parallel.
False-Positive Rate
For the purpose of analysis, the hash functions are assumed to be mutually independent, ideal random functions. Fortunately, the real-world behavior of Bloom filters appears consistent with this unrealistic theoretical analysis.
For all indices \(h, i\) and \(j\), \( \mathbb{P}\{h_j(x_i) = h\} = 1/m \) because \(h_j\) is assumed ideally random. Therefore:
$$ p = \mathbb{P}\{B[h] = 0\} = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} $$
\(\left(1 - \frac{1}{m}\right)\) is the probability that a particular \(h_j(x_i)\) does not land in a specific \(B[h]\). For \(B[h]\) to remain zero, we must have \(kn\) “misses”, hence \( \left(1 - \frac{1}{m}\right)^{kn} \).
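A quick numeric check, with arbitrarily chosen example values of \(m\), \(n\), and \(k\), shows how close the exact expression is to the exponential approximation:

import math

m, n, k = 1 << 20, 100_000, 5        # Arbitrary example parameters
exact = (1 - 1 / m) ** (k * n)       # P{B[h] = 0}, exact
approx = math.exp(-k * n / m)        # e^(-kn/m)
print(exact, approx)                 # Both print as ~0.6209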
Sedgewick and Wayne list some approximations that tend to show up in the analysis of algorithms (a couple are sanity-checked numerically after the list):
- Harmonic Sum: \(H_N = 1 + 1/2 + 1/3 + … + 1/N \approx ln(N) \)
- Triangular Sum: \(1 + 2 + 3 + … + N \approx N^2/2\)
- Geometric Sum when \(N = 2^n\): \(1 + 2 + 4 + … + N = 2N - 1 \)
- Stirling’s Approximation: \(lg(N!) = lg(1) + lg(2) + lg(3) + … + lg(N) \approx N \cdot lg(N)\)
- Binomial Coefficients when \(k\) is a small constant: \( \binom{N}{k} \approx N^k/k! \)
- Exponential: \( (1 - 1/x)^x \approx 1/e \)
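These are easy to verify numerically; a sketch with an arbitrary \(N\):

import math

N = 1_000_000
print(sum(1 / i for i in range(1, N + 1)),   # H_N ~14.3927
      math.log(N))                           # ln(N) ~13.8155; gap is Euler's constant ~0.5772
print(math.lgamma(N + 1) / math.log(2),      # lg(N!) ~1.849e7, via the log-gamma function
      N * math.log2(N))                      # N·lg(N) ~1.993e7; same order, coarser match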
The expected number of 0-bits in \(B\) is \(\approx mp\). Moreover, Chernoff Bounds imply that the number of 0-bits is close to \(mp\) with high probability. Therefore, the probability of a false-positive is very close to:
$$ \left(1 - p\right)^k = \left(1 - e^{-kn/m}\right)^k $$
If all other parameters are held constant, the false-positive rate increases with \(n\) (# items) and decreases with \(m\) (# bits). The optimal value of \(k\) (# hash functions) is \(k = ln(2) \cdot (m/n)\).
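A short sketch (with assumed values of \(m\) and \(n\)) that scans integer values of \(k\) and confirms the best one lands next to \(ln(2) \cdot (m/n)\):

import math

m, n = 1 << 16, 4_096                         # Assumed parameters: m/n = 16 bits per item

def fp_rate(k):
    return (1 - math.exp(-k * n / m)) ** k    # (1 - e^(-kn/m))^k

best_k = min(range(1, 33), key=fp_rate)
print(best_k, math.log(2) * m / n)            # 11 vs ~11.09
print(fp_rate(best_k))                        # ~4.6e-4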
One derivation of the optimal \(k\) considers the logarithm of the false-positive rate:
$$ ln \left((1 - p)^k\right) = k \cdot ln(1 - p) = -\frac{m}{n} \cdot ln(p) \cdot ln(1-p)$$
and argues that, by symmetry, it is minimized at \(p = 1/2\).
I don’t follow the math that arrives at \(p = 1/2\). Given that \(m/n > 0\) and the two are held constant, minimizing \(-ln(p) \cdot ln(1-p)\) should give us the optimal \(p\). That said, setting the derivative to zero yields the unwieldy \(\frac{ln(1-p)}{p} = \frac{ln(p)}{1-p}\), which can be eyeballed to \(p = 1/2\) and the neighborhood test performed, as sketched below. A graphical visualization also helps.
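The neighborhood test is a one-liner in code (a minimal sketch):

import math

def g(p):
    # Log of the false-positive rate, divided by the positive constant m/n
    return -math.log(p) * math.log(1 - p)

for p in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(p, g(p))       # Smallest (most negative) at p = 0.5: ~-0.4805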
When is taking the logarithm a good idea? Presumably because the logarithm is monotonically increasing, so minimizing \(ln(f)\) also minimizes \(f\), and here it turns the exponent \(k\) into a factor. Should have taken that optimization class in college!
This gives a false-positive rate of \((1/2)^{ln(2) \cdot (m/n)} \approx (0.6185)^{m/n}\). In practice, \(k\) must be an integer, but we can get pretty close to this rate if \(m \gg n\).
We can therefore achieve any desired false-positive rate \(\delta > 0\) using a Bloom filter of size:
$$ m = \Bigg \lceil \frac{lg(1/\delta)}{ln(2)}n \Bigg \rceil = \Theta\left(n log(1/\delta)\right) $$
and \(k = \lceil lg(1/\delta) \rceil\) hash functions. For example, with a 32n-bit table (one integer per item) and 22 hash functions, we get \(\delta \approx 2 \cdot 10^{-7}\).
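The sizing formula translates directly into code (a sketch; the values of \(n\) and \(\delta\) below are examples):

import math

def bloom_parameters(n, delta):
    m = math.ceil(math.log2(1 / delta) / math.log(2) * n)  # m = ceil(lg(1/delta)/ln(2) * n)
    k = math.ceil(math.log2(1 / delta))                    # k = ceil(lg(1/delta))
    return m, k

print(bloom_parameters(1_000, 2.4e-7))   # (31725, 22): about 32 bits and 22 hashes, per the example above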
References
- Algorithms > Analysis of Algorithms. Sedgewick, Robert; Kevin Wayne. 2011.
True to Stigler’s Law of Eponymy (no scientific discovery is named after its original discoverer), Ross Ashby (1960) did a probabilistic analysis of Calvin Mooers’s Zatocoding (1947), a coding system for library cards. This predates Burton Bloom’s work (1970).