# Bloom Filters

True to Stigler’s Law of Eponymy (no scientific discovery is named after its original discoverer), Bloom filters predate Burton Bloom’s 1970 work: Ross Ashby (1960) gave a probabilistic analysis of Calvin Mooers’s Zatocoding (1947), a superimposed coding system for library cards.

### Motivation

A Bloom filter for a set $$X = \{x_1, x_2, …, x_n\}$$ drawn from some universe $$\mathbb{U}$$ allows one to test whether a given item $$x \in \mathbb{U}$$ is an element of $$X$$. Hash tables can already answer $$x \in X$$ queries in $$O(1)$$ expected time and $$O(n)$$ space.

Bloom filters also answer queries in $$O(1)$$ time, but with considerably less space, at the cost of allowing false positives. False positives make Bloom filters unsuitable as an exact membership data structure, but they are commonly used as filters or sanity checks in front of more expensive data structures.

#### Sample Real-World Use-Case

The Chromium project has an optimization_guide::BloomFilter class that keeps track of a set of strings, e.g. a list of domains provided by some blocklist/allowlist server. WTF::BloomFilter keeps track of unsigned data, e.g. hashes of CSS selectors, for the purpose of computing a unique selector.

### Pseudocode

A Bloom filter consists of an array $$B[0 .. m-1]$$ of bits, together with $$k$$ hash functions $$h_1, h_2, …, h_k: \mathbb{U} \to \{0, 1, …, m-1\}$$.

```
def make_bloom_filter([x_1, ..., x_n]):
    for h in [0, ..., m-1]:
        B[h] = 0                        # Zero-out all bits in B
    for x_i in [x_1, ..., x_n]:         # For each member of X
        for h_j in [h_1, ..., h_k]:     # For each hash function
            B[h_j(x_i)] = 1             # Set the bit that h_j maps x_i to
    return B

def maybe_in_X(B, y):
    for h_j in [h_1, ..., h_k]:
        if B[h_j(y)] == 0: return False # Guaranteed correct
    return True                         # Not necessarily correct
```
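A runnable sketch of the same pseudocode in Python; the salted-SHA-256 hash functions are an illustrative stand-in for the ideal random hashes assumed in the analysis below:

```python
import hashlib

def make_hashes(k, m):
    """Illustrative stand-in for k ideal random hash functions into
    {0, ..., m-1}: SHA-256 with a per-function salt."""
    def make_h(j):
        return lambda x: int(hashlib.sha256(f"{j}:{x}".encode()).hexdigest(), 16) % m
    return [make_h(j) for j in range(k)]

def make_bloom_filter(items, m, hashes):
    B = [0] * m                   # zero-out all bits in B
    for x in items:               # for each member of X
        for h in hashes:          # for each hash function
            B[h(x)] = 1           # set the bit that h maps x to
    return B

def maybe_in_X(B, hashes, y):
    for h in hashes:
        if B[h(y)] == 0:
            return False          # guaranteed correct
    return True                   # not necessarily correct

m, k = 64, 3
hashes = make_hashes(k, m)
B = make_bloom_filter(["apple", "banana"], m, hashes)
assert maybe_in_X(B, hashes, "apple")   # members always answer True
```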


Going by the discussion of the bias of Monte Carlo decision algorithms, a Bloom filter is a false-biased algorithm: a False answer is guaranteed correct, while a True answer may be a false positive.

Notice that the various hash functions $$h_j$$ can be computed in parallel.

### False-Positive Rate

For the purposes of analysis, the hash functions are assumed to be mutually independent, ideal random functions. Fortunately, the real-world behavior of Bloom filters appears consistent with this unrealistic theoretical analysis.

For all indices $$h, i$$ and $$j$$, $$\mathbb{P}\{h_j(x_i) = h\} = 1/m$$ because $$h_j$$ is assumed ideally random. Therefore:

$$p = \mathbb{P}\{B[h] = 0\} = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

$$\left(1 - \frac{1}{m}\right)$$ is the probability that a particular $$h_j(x_i)$$ does not land in a specific $$B[h]$$. For $$B[h]$$ to remain zero, all $$kn$$ hashes must “miss” it, hence $$\left(1 - \frac{1}{m}\right)^{kn}$$.
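A quick simulation (an illustrative sketch, with Python’s `random` module playing the role of the ideal hash functions) shows the fraction of 0-bits tracking $$e^{-kn/m}$$:

```python
import math
import random

random.seed(1)
m, n, k = 10_000, 1_000, 5

# Each of the kn ideal random hashes sets one uniformly chosen bit.
B = [0] * m
for _ in range(n * k):
    B[random.randrange(m)] = 1

empirical_p = B.count(0) / m          # observed fraction of 0-bits
predicted_p = math.exp(-k * n / m)    # e^{-kn/m} ≈ 0.6065
assert abs(empirical_p - predicted_p) < 0.02
```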

Some approximations tend to show up in the analysis of algorithms:

• Harmonic Sum: $$H_N = 1 + 1/2 + 1/3 + … + 1/N \approx \ln(N)$$
• Triangular Sum: $$1 + 2 + 3 + … + N \approx N^2/2$$
• Geometric Sum when $$N = 2^n$$: $$1 + 2 + 4 + … + N = 2N - 1$$
• Stirling’s Approximation: $$\lg(N!) = \lg(1) + \lg(2) + \lg(3) + … + \lg(N) \approx N \cdot \lg(N)$$
• Binomial Coefficients when $$k$$ is a small constant: $$\binom{N}{k} \approx N^k/k!$$
• Exponential: $$(1 - 1/x)^x \approx 1/e$$
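A few of these approximations can be sanity-checked numerically with a throwaway Python sketch:

```python
import math

N = 1_000_000

# Harmonic sum: H_N - ln(N) tends to the Euler–Mascheroni constant ≈ 0.5772
H = sum(1 / i for i in range(1, N + 1))
assert abs(H - math.log(N)) < 0.58

# Exponential: (1 - 1/x)^x -> 1/e
x = 1e6
assert abs((1 - 1 / x) ** x - 1 / math.e) < 1e-6

# Binomial coefficient for small constant k: C(N, k) ≈ N^k / k!
kk = 3
assert abs(math.comb(N, kk) / (N ** kk / math.factorial(kk)) - 1) < 1e-5
```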

The expected number of 0-bits in $$B$$ is $$\approx mp$$. Moreover, Chernoff Bounds imply that the number of 0-bits is close to $$mp$$ with high probability. Therefore, the probability of a false-positive is very close to:

$$\left(1 - p\right)^k = \left(1 - e^{-kn/m}\right)^k$$

Why base the false-positive rate on the expected number of 0-bits and the Chernoff bounds? Presumably because the bits of $$B$$ are not independent; concentration justifies treating each of a query’s $$k$$ probes as hitting a 1-bit with probability $$\approx 1 - p$$.

If all other parameters are held constant, the false-positive rate increases with $$n$$ (# items) and decreases with $$m$$ (# bits). The optimal value of $$k$$ (# hash functions) is $$k = \ln(2) \cdot (m/n)$$.
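A brute-force check (an illustrative sketch) confirms that the integer $$k$$ minimizing the false-positive rate sits next to $$\ln(2) \cdot (m/n)$$:

```python
import math

def false_positive_rate(k, m, n):
    """(1 - e^{-kn/m})^k from the analysis above."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 1000, 100                  # 10 bits per item
best_k = min(range(1, 30), key=lambda k: false_positive_rate(k, m, n))
assert best_k == 7                # ln(2) * (m/n) ≈ 6.93, rounded to an integer
```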

The argument for the optimal $$k$$ considers the logarithm of the false-positive rate:

$$\ln\left((1 - p)^k\right) = k \cdot \ln(1 - p) = -\frac{m}{n} \cdot \ln(p) \cdot \ln(1-p)$$

which, by symmetry, is minimized at $$p = 1/2$$.

I don’t follow the math that arrives at $$p = 1/2$$. Given that $$m/n > 0$$ and is held constant, minimizing $$-\ln(p) \cdot \ln(1-p)$$ should give us the optimal $$p$$. That said, setting the derivative to zero yields the unwieldy $$\frac{\ln(1-p)}{p} = \frac{\ln(p)}{1-p}$$, which can be eyeballed to $$p = 1/2$$, and the neighborhood test performed. Wolfram|Alpha at least has a graphical visualization.
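A quick Python sketch supports the symmetry argument: $$-\ln(p) \cdot \ln(1-p)$$ is symmetric about $$p = 1/2$$, and a grid search finds its minimum there:

```python
import math

def g(p):
    """The quantity -ln(p) * ln(1 - p) being minimized (m/n > 0 dropped)."""
    return -math.log(p) * math.log(1 - p)

# Symmetry: g(p) == g(1 - p), so a unique interior minimizer must be p = 1/2.
assert abs(g(0.3) - g(0.7)) < 1e-12

# A grid search over (0, 1) confirms the minimum lands at p = 1/2.
p_star = min((i / 1000 for i in range(1, 1000)), key=g)
assert abs(p_star - 0.5) < 1e-9
```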

When is taking the logarithm a good idea? Presumably because the logarithm is monotonically increasing, so it preserves the minimizer, and here it turns the power $$(1-p)^k$$ into the product $$k \cdot \ln(1-p)$$, which is easier to differentiate. I should have taken that optimization class in college!

This gives the false-positive rate of $$(1/2)^{\ln(2) \cdot (m/n)} \approx (0.6185)^{m/n}$$. In practice, $$k$$ must be an integer, but we can get pretty close to this rate if $$m \gg n$$.

We can therefore achieve any desired false-positive rate $$\delta > 0$$ using a Bloom filter of size:

$$m = \Bigg\lceil \frac{\lg(1/\delta)}{\ln(2)} \cdot n \Bigg\rceil = \Theta\left(n \log(1/\delta)\right)$$

and $$k = \lceil \lg(1/\delta) \rceil$$ hash functions. For example, with a 32n-bit table (one 32-bit integer per item) and 22 hash functions, we get $$\delta \approx 2 \cdot 10^{-7}$$.
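The sizing formulas can be wrapped in a small helper (the name `bloom_parameters` is hypothetical); the worked example above is checked in reverse:

```python
import math

def bloom_parameters(n, delta):
    """Table size m (bits) and hash count k for n items and a target
    false-positive rate delta, per the formulas above."""
    m = math.ceil(math.log2(1 / delta) / math.log(2) * n)
    k = math.ceil(math.log2(1 / delta))
    return m, k

m, k = bloom_parameters(1000, 0.01)   # 1% rate: just under 10 bits/item, k = 7

# Reverse check of the worked example: a 32n-bit table with 22 hash functions.
n, m, k = 1, 32, 22
delta = (1 - math.exp(-k * n / m)) ** k
assert 1e-7 < delta < 3e-7            # ≈ 2e-7
```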

## Primary References

1. Algorithms > Filtering and Streaming. Jeff Erickson. http://jeffe.cs.illinois.edu/teaching/algorithms/notes/06-bloom.pdf . Accessed Nov 7, 2021.

## References

1. Stigler's law of eponymy. https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy . Accessed Nov 7, 2021.
2. Algorithms > Analysis of Algorithms. Sedgewick, Robert; Kevin Wayne. 2011.
3. minimize -ln(p)ln(1-p) - Wolfram|Alpha. https://www.wolframalpha.com/input/?i=minimize+-ln%28p%29ln%281-p%29 . Accessed Nov 8, 2021.