Bloom Filters
Motivation
A Bloom filter for a set \(X = \{x_1, x_2, …, x_n\}\) from some universe \(\mathbb{U}\) allows one to test whether a given item \(x \in \mathbb{U}\) is an element of \(X\). Hash tables can already answer \(x \in X\) queries in \(O(1)\) expected time and \(O(n)\) space.
Bloom filters also answer queries in \(O(1)\) time, but with considerably less space, at the cost of allowing false positives. False positives make Bloom filters unsuitable as an exact membership data structure, but they are commonly used as filters or sanity checks in front of more complex data structures.
Sample Real-World Use-Case
The Chromium project has an optimization_guide::BloomFilter class that keeps track of a set of strings, e.g. a list of domains provided by some blocklist/allowlist server. WTF::BloomFilter keeps track of unsigned data, e.g. hashes of CSS selectors for the purpose of getting a unique selector.
Pseudocode
A Bloom filter consists of an array \(B[0 .. m-1]\) of bits, together with \(k\) hash functions \(h_1, h_2, …, h_k: \mathbb{U} \to \{0, 1, …, m-1\}\).
def make_bloom_filter([x_1, ..., x_n]):
    for h in [0, ..., m-1]:
        B[h] = 0                     # Zero out all bits in B
    for x_i in [x_1, ..., x_n]:      # For each member of X
        for h_j in [h_1, ..., h_k]:  # For each hash function
            B[h_j(x_i)] = 1          # Set the bit that h_j maps x_i to
    return B
def maybe_in_X(B, y):
    for h_j in [h_1, ..., h_k]:      # For each hash function
        if B[h_j(y)] == 0:
            return False             # Guaranteed correct
    return True                      # Not necessarily correct
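As a concrete illustration, here is a minimal runnable Python sketch of the pseudocode above. All names and parameters are illustrative, and the salted-SHA-256 trick used to derive the \(k\) hash functions only approximates the independent ideal hashes the analysis below assumes:

import hashlib

def _hash(j, x, m):
    # j-th "hash function": salt the input with j, then reduce modulo m.
    digest = hashlib.sha256(f"{j}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") % m

def make_bloom_filter(items, m, k):
    B = [0] * m                       # Zero out all bits in B
    for x in items:                   # For each member of X
        for j in range(k):            # For each hash function
            B[_hash(j, x, m)] = 1     # Set the bit that h_j maps x to
    return B

def maybe_in_X(B, y, k):
    m = len(B)
    for j in range(k):
        if B[_hash(j, y, m)] == 0:
            return False              # Guaranteed correct
    return True                       # Not necessarily correct

B = make_bloom_filter(["a.example", "b.example"], m=64, k=4)
assert maybe_in_X(B, "a.example", k=4)   # Inserted items are always found
print(maybe_in_X(B, "c.example", k=4))   # Usually False; True would be a false positive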
Going by the discussion of the bias of Monte Carlo decision algorithms, a Bloom filter is a false-biased algorithm.
Notice that the various hash functions \(h_j\) can be computed in parallel.
False-Positive Rate
For the purpose of analysis, the hash functions are assumed to be mutually independent, ideal random functions. Fortunately, the real-world behavior of Bloom filters appears consistent with this unrealistic theoretical analysis.
For all indices \(h, i\) and \(j\), \( \mathbb{P}\{h_j(x_i) = h\} = 1/m \) because \(h_j\) is assumed ideally random. Therefore:
$$ p = \mathbb{P}\{B[h] = 0\} = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} $$
\(\left(1 - \frac{1}{m}\right)\) is the probability that a particular \(h_j(x_i)\) does not land in a specific \(B[h]\). For \(B[h]\) to remain zero, we must have \(kn\) “misses”, hence \( \left(1 - \frac{1}{m}\right)^{kn} \).
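A quick numeric check, with arbitrarily chosen example values of \(m\), \(n\), and \(k\), shows how close the exact expression is to the exponential approximation:

import math

m, n, k = 1 << 20, 100_000, 5        # Arbitrary example parameters
exact = (1 - 1 / m) ** (k * n)       # P{B[h] = 0}, exact
approx = math.exp(-k * n / m)        # e^(-kn/m)
print(exact, approx)                 # Both print as ~0.6209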
Sedgewick and Wayne list some approximations that tend to show up in the analysis of algorithms (a couple are sanity-checked numerically after the list):
- Harmonic Sum: \(H_N = 1 + 1/2 + 1/3 + … + 1/N \approx ln(N) \)
- Triangular Sum: \(1 + 2 + 3 + … + N \approx N^2/2\)
- Geometric Sum when \(N = 2^n\): \(1 + 2 + 4 + … + N = 2N - 1 \)
- Stirling’s Approximation: \(lg(N!) = lg(1) + lg(2) + lg(3) + … + lg(N) \approx N \cdot lg(N)\)
- Binomial Coefficients when \(k\) is a small constant: \( \binom{N}{k} \approx N^k/k! \)
- Exponential: \( (1 - 1/x)^x \approx 1/e \)
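These are easy to verify numerically; a sketch with an arbitrary \(N\):

import math

N = 1_000_000
print(sum(1 / i for i in range(1, N + 1)),   # H_N ~14.3927
      math.log(N))                           # ln(N) ~13.8155; gap is Euler's constant ~0.5772
print(math.lgamma(N + 1) / math.log(2),      # lg(N!) ~1.849e7, via the log-gamma function
      N * math.log2(N))                      # N·lg(N) ~1.993e7; same order, coarser match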
The expected number of 0-bits in \(B\) is \(\approx mp\). Moreover, Chernoff Bounds imply that the number of 0-bits is close to \(mp\) with high probability. Therefore, the probability of a false-positive is very close to:
$$ \left(1 - p\right)^k = \left(1 - e^{-kn/m}\right)^k $$
If all other parameters are held constant, the false-positive rate increases with \(n\) (# items) and decreases with \(m\) (# bits). The optimal value of \(k\) (# hash functions) is \(k = ln(2) \cdot (m/n)\).
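A short sketch (with assumed values of \(m\) and \(n\)) that scans integer values of \(k\) and confirms the best one lands next to \(ln(2) \cdot (m/n)\):

import math

m, n = 1 << 16, 4_096                         # Assumed parameters: m/n = 16 bits per item

def fp_rate(k):
    return (1 - math.exp(-k * n / m)) ** k    # (1 - e^(-kn/m))^k

best_k = min(range(1, 33), key=fp_rate)
print(best_k, math.log(2) * m / n)            # 11 vs ~11.09
print(fp_rate(best_k))                        # ~4.6e-4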
One derivation of the optimal \(k\) considers the logarithm of the false-positive rate:
$$ ln \left((1 - p)^k\right) = k \cdot ln(1 - p) = -\frac{m}{n} \cdot ln(p) \cdot ln(1-p)$$
and argues that, by symmetry, it is minimized at \(p = 1/2\).
I don’t follow the math that arrives at \(p = 1/2\). Given that \(m/n > 0\) and the two are held constant, minimizing \(-ln(p) \cdot ln(1-p)\) should give us the optimal \(p\). That said, setting the derivative to zero yields the unwieldy \(\frac{ln(1-p)}{p} = \frac{ln(p)}{1-p}\), which can be eyeballed to \(p = 1/2\) and the neighborhood test performed, as sketched below. A graphical visualization also helps.
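The neighborhood test is a one-liner in code (a minimal sketch):

import math

def g(p):
    # Log of the false-positive rate, divided by the positive constant m/n
    return -math.log(p) * math.log(1 - p)

for p in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(p, g(p))       # Smallest (most negative) at p = 0.5: ~-0.4805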
When is taking the logarithm a good idea? Presumably because the logarithm is monotonically increasing, so minimizing \(ln(f)\) also minimizes \(f\), and here it turns the exponent \(k\) into a factor. Should have taken that optimization class in college!
This gives a false-positive rate of \((1/2)^{ln(2) \cdot (m/n)} \approx (0.6185)^{m/n}\). In practice, \(k\) must be an integer, but we can get pretty close to this rate if \(m \gg n\).
We can therefore achieve any desired false-positive rate \(\delta > 0\) using a Bloom filter of size:
$$ m = \Bigg \lceil \frac{lg(1/\delta)}{ln(2)}n \Bigg \rceil = \Theta\left(n log(1/\delta)\right) $$
and \(k = \lceil lg(1/\delta) \rceil\) hash functions. For example, with a 32n-bit table (one integer per item) and 22 hash functions, we get \(\delta \approx 2 \cdot 10^{-7}\).
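The sizing formula translates directly into code (a sketch; the values of \(n\) and \(\delta\) below are examples):

import math

def bloom_parameters(n, delta):
    m = math.ceil(math.log2(1 / delta) / math.log(2) * n)  # m = ceil(lg(1/delta)/ln(2) * n)
    k = math.ceil(math.log2(1 / delta))                    # k = ceil(lg(1/delta))
    return m, k

print(bloom_parameters(1_000, 2.4e-7))   # (31725, 22): about 32 bits and 22 hashes, per the example above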
References
- Algorithms > Analysis of Algorithms. Sedgewick, Robert; Kevin Wayne. 2011.
True to Stigler’s Law of Eponymy (no scientific discovery is named after its original discoverer), Ross Ashby (1960) did a probabilistic analysis of Calvin Mooers’s Zatocoding (1947), a coding system for library cards. This predates Burton Bloom’s work (1970).