# Bloom Filters

True to Stigler’s Law of Eponymy (no scientific discovery is named after its original discoverer), Bloom filters predate Burton Bloom’s 1970 work: Ross Ashby (1960) gave a probabilistic analysis of Calvin Mooers’s Zatocoding (1947), a superimposed coding system for library cards.

### Motivation

A Bloom filter for a set $$X = \{x_1, x_2, …, x_n\}$$ drawn from some universe $$\mathbb{U}$$ allows one to test whether a given item $$x \in \mathbb{U}$$ is an element of $$X$$. Hash tables can already answer $$x \in X$$ queries in $$O(1)$$ expected time and $$O(n)$$ space.

Bloom filters also answer queries in $$O(1)$$ time, but with considerably less space, at the cost of allowing false positives. False positives make Bloom filters unsuitable as an exact membership data structure, but they are commonly used as filters or sanity checks in front of more expensive data structures.

#### Sample Real-World Use-Case

The Chromium project has an optimization_guide::BloomFilter class that keeps track of a set of strings, e.g. a list of domains provided by some blocklist/allowlist server. WTF::BloomFilter keeps track of unsigned data, e.g. hashes of CSS selectors, for the purpose of computing a unique selector.

### Pseudocode

A Bloom filter consists of an array $$B[0 .. m-1]$$ of bits, together with $$k$$ hash functions $$h_1, h_2, …, h_k: \mathbb{U} \to \{0, 1, …, m-1\}$$.

```
def make_bloom_filter([x_1, ..., x_n]):
    for h in [0, ..., m-1]:
        B[h] = 0                        # Zero-out all bits in B
    for x_i in [x_1, ..., x_n]:         # For each member of X
        for h_j in [h_1, ..., h_k]:     # For each hash function
            B[h_j(x_i)] = 1             # Set the bit that h_j maps x_i to
    return B

def maybe_in_X(B, y):
    for h_j in [h_1, ..., h_k]:
        if B[h_j(y)] == 0: return False # Guaranteed correct
    return True                         # Not necessarily correct
```
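A runnable sketch of the same pseudocode in Python; the salted-SHA-256 hash functions are an illustrative stand-in for the ideal random hashes assumed in the analysis below:

```python
import hashlib

def make_hashes(k, m):
    """Illustrative stand-in for k ideal random hash functions into
    {0, ..., m-1}: SHA-256 with a per-function salt."""
    def make_h(j):
        return lambda x: int(hashlib.sha256(f"{j}:{x}".encode()).hexdigest(), 16) % m
    return [make_h(j) for j in range(k)]

def make_bloom_filter(items, m, hashes):
    B = [0] * m                   # zero-out all bits in B
    for x in items:               # for each member of X
        for h in hashes:          # for each hash function
            B[h(x)] = 1           # set the bit that h maps x to
    return B

def maybe_in_X(B, hashes, y):
    for h in hashes:
        if B[h(y)] == 0:
            return False          # guaranteed correct
    return True                   # not necessarily correct

m, k = 64, 3
hashes = make_hashes(k, m)
B = make_bloom_filter(["apple", "banana"], m, hashes)
assert maybe_in_X(B, hashes, "apple")   # members always answer True
```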


Going by the discussion of the bias of Monte Carlo decision algorithms, a Bloom filter is a false-biased algorithm: a False answer is guaranteed correct, while a True answer may be a false positive.

Notice that the various hash functions $$h_j$$ can be computed in parallel.

### False-Positive Rate

For the purposes of analysis, the hash functions are assumed to be mutually independent, ideal random functions. Fortunately, the real-world behavior of Bloom filters appears consistent with this unrealistic theoretical analysis.

For all indices $$h, i$$ and $$j$$, $$\mathbb{P}\{h_j(x_i) = h\} = 1/m$$ because $$h_j$$ is assumed ideally random. Therefore:

$$p = \mathbb{P}\{B[h] = 0\} = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

$$\left(1 - \frac{1}{m}\right)$$ is the probability that a particular $$h_j(x_i)$$ does not land in a specific $$B[h]$$. For $$B[h]$$ to remain zero, all $$kn$$ hashes must “miss” it, hence $$\left(1 - \frac{1}{m}\right)^{kn}$$.
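A quick simulation (an illustrative sketch, with Python’s `random` module playing the role of the ideal hash functions) shows the fraction of 0-bits tracking $$e^{-kn/m}$$:

```python
import math
import random

random.seed(1)
m, n, k = 10_000, 1_000, 5

# Each of the kn ideal random hashes sets one uniformly chosen bit.
B = [0] * m
for _ in range(n * k):
    B[random.randrange(m)] = 1

empirical_p = B.count(0) / m          # observed fraction of 0-bits
predicted_p = math.exp(-k * n / m)    # e^{-kn/m} ≈ 0.6065
assert abs(empirical_p - predicted_p) < 0.02
```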

Some approximations tend to show up in the analysis of algorithms:

• Harmonic Sum: $$H_N = 1 + 1/2 + 1/3 + … + 1/N \approx \ln(N)$$
• Triangular Sum: $$1 + 2 + 3 + … + N \approx N^2/2$$
• Geometric Sum when $$N = 2^n$$: $$1 + 2 + 4 + … + N = 2N - 1$$
• Stirling’s Approximation: $$\lg(N!) = \lg(1) + \lg(2) + \lg(3) + … + \lg(N) \approx N \cdot \lg(N)$$
• Binomial Coefficients when $$k$$ is a small constant: $$\binom{N}{k} \approx N^k/k!$$
• Exponential: $$(1 - 1/x)^x \approx 1/e$$
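A few of these approximations can be sanity-checked numerically with a throwaway Python sketch:

```python
import math

N = 1_000_000

# Harmonic sum: H_N - ln(N) tends to the Euler–Mascheroni constant ≈ 0.5772
H = sum(1 / i for i in range(1, N + 1))
assert abs(H - math.log(N)) < 0.58

# Exponential: (1 - 1/x)^x -> 1/e
x = 1e6
assert abs((1 - 1 / x) ** x - 1 / math.e) < 1e-6

# Binomial coefficient for small constant k: C(N, k) ≈ N^k / k!
kk = 3
assert abs(math.comb(N, kk) / (N ** kk / math.factorial(kk)) - 1) < 1e-5
```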

The expected number of 0-bits in $$B$$ is $$\approx mp$$. Moreover, Chernoff Bounds imply that the number of 0-bits is close to $$mp$$ with high probability. Therefore, the probability of a false-positive is very close to:

$$\left(1 - p\right)^k = \left(1 - e^{-kn/m}\right)^k$$

Why base the false-positive rate on the expected number of 0-bits and the Chernoff bounds? Presumably because the bits of $$B$$ are not independent; concentration justifies treating each of a query’s $$k$$ probes as hitting a 1-bit with probability $$\approx 1 - p$$.

If all other parameters are held constant, the false-positive rate increases with $$n$$ (# items) and decreases with $$m$$ (# bits). The optimal value of $$k$$ (# hash functions) is $$k = \ln(2) \cdot (m/n)$$.
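A brute-force check (an illustrative sketch) confirms that the integer $$k$$ minimizing the false-positive rate sits next to $$\ln(2) \cdot (m/n)$$:

```python
import math

def false_positive_rate(k, m, n):
    """(1 - e^{-kn/m})^k from the analysis above."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 1000, 100                  # 10 bits per item
best_k = min(range(1, 30), key=lambda k: false_positive_rate(k, m, n))
assert best_k == 7                # ln(2) * (m/n) ≈ 6.93, rounded to an integer
```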

The argument for the optimal $$k$$ considers the logarithm of the false-positive rate:

$$\ln\left((1 - p)^k\right) = k \cdot \ln(1 - p) = -\frac{m}{n} \cdot \ln(p) \cdot \ln(1-p)$$

which, by symmetry, is minimized at $$p = 1/2$$.

I don’t follow the math that arrives at $$p = 1/2$$. Given that $$m/n > 0$$ and is held constant, minimizing $$-\ln(p) \cdot \ln(1-p)$$ should give us the optimal $$p$$. That said, setting the derivative to zero yields the unwieldy $$\frac{\ln(1-p)}{p} = \frac{\ln(p)}{1-p}$$, which can be eyeballed to $$p = 1/2$$, and the neighborhood test performed. Wolfram|Alpha at least has a graphical visualization.
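A quick Python sketch supports the symmetry argument: $$-\ln(p) \cdot \ln(1-p)$$ is symmetric about $$p = 1/2$$, and a grid search finds its minimum there:

```python
import math

def g(p):
    """The quantity -ln(p) * ln(1 - p) being minimized (m/n > 0 dropped)."""
    return -math.log(p) * math.log(1 - p)

# Symmetry: g(p) == g(1 - p), so a unique interior minimizer must be p = 1/2.
assert abs(g(0.3) - g(0.7)) < 1e-12

# A grid search over (0, 1) confirms the minimum lands at p = 1/2.
p_star = min((i / 1000 for i in range(1, 1000)), key=g)
assert abs(p_star - 0.5) < 1e-9
```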

When is taking the logarithm a good idea? Presumably because the logarithm is monotonically increasing, so it preserves the minimizer, and here it turns the power $$(1-p)^k$$ into the product $$k \cdot \ln(1-p)$$, which is easier to differentiate. I should have taken that optimization class in college!

This gives the false-positive rate of $$(1/2)^{\ln(2) \cdot (m/n)} \approx (0.6185)^{m/n}$$. In practice, $$k$$ must be an integer, but we can get pretty close to this rate if $$m \gg n$$.

We can therefore achieve any desired false-positive rate $$\delta > 0$$ using a Bloom filter of size:

$$m = \Bigg\lceil \frac{\lg(1/\delta)}{\ln(2)} \cdot n \Bigg\rceil = \Theta\left(n \log(1/\delta)\right)$$

and $$k = \lceil \lg(1/\delta) \rceil$$ hash functions. For example, with a 32n-bit table (one 32-bit integer per item) and 22 hash functions, we get $$\delta \approx 2 \cdot 10^{-7}$$.
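The sizing formulas can be wrapped in a small helper (the name `bloom_parameters` is hypothetical); the worked example above is checked in reverse:

```python
import math

def bloom_parameters(n, delta):
    """Table size m (bits) and hash count k for n items and a target
    false-positive rate delta, per the formulas above."""
    m = math.ceil(math.log2(1 / delta) / math.log(2) * n)
    k = math.ceil(math.log2(1 / delta))
    return m, k

m, k = bloom_parameters(1000, 0.01)   # 1% rate: just under 10 bits/item, k = 7

# Reverse check of the worked example: a 32n-bit table with 22 hash functions.
n, m, k = 1, 32, 22
delta = (1 - math.exp(-k * n / m)) ** k
assert 1e-7 < delta < 3e-7            # ≈ 2e-7
```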

## Primary References

1. Algorithms > Filtering and Streaming. Jeff Erickson. http://jeffe.cs.illinois.edu/teaching/algorithms/notes/06-bloom.pdf . Accessed Nov 7, 2021.

## References

1. Stigler's law of eponymy. https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy . Accessed Nov 7, 2021.
2. Algorithms > Analysis of Algorithms. Sedgewick, Robert; Kevin Wayne. 2011.
3. minimize -ln(p)ln(1-p) - Wolfram|Alpha. https://www.wolframalpha.com/input/?i=minimize+-ln%28p%29ln%281-p%29 . Accessed Nov 8, 2021.