CSC263 – Week 5, Lecture 1
Cristyn Howard
Monday, February 5, 2018
Q: A webserver needs to reference a large list of blacklisted websites. The entire list
cannot be stored in main memory, because it is too large. How can we store it?
A: In a BLOOM FILTER
Bloom Filters
• probablistic (i.e. approximate) dictionary that stores FS : a summary/fingerprint of a dynamic set S
BF[0, …, m − 1] : array of m bits, initialized to 0.
{h1, …, ht} : set of t < m hash fuctions where hi : U → {0, ..., m − 1}
m, t : parameters defining # of bits & # of hash functions, respectively
• Operations:
Θ(n) INSERT(FS , x) : for i from 1 to t, BF[hi(x)] = 1;
∗ element passed to all hash functions, produces a t-tuple of indices ∈ {0, ..., m − 1} that form fingerprint of x
∗ bits at all indices in fingerprint of x permanently set to 1
Θ(n) SEARCH(FS , x) : for i from 1 to t, if (BF[hi(x)] == 0) return FALSE, else return TRUE
∗ ∗ ∗
– cannot delete elements from S
– SEARCH may return false positives - may say that x is likely in S when it is not.
• Use when: space constraints exist, false positives are acceptable, and when deleting is not necessary.
Returns: {FALSE ≡ x definitely not in FS }||{TRUE ≡ x probably in FS }
If any index in the hash of x is 0, then x has definitely not been placed in S.
However, if x is not in S, and the union of all indices in the hashes of all elements stored in S contains all indices in the hash of x, then SEARCH will return a false positive.
very space and time efficient way to store and search large sets. • Limitations:
• Benefits:
1
.
False Positive Probability Analysis
Assume n elements inserted. Compute probability that SEARCH(BF, x) returns a false positive, i.e. a positive when x has not been inserted.
Recall that hash functions have uniform, independent probability distributions.
2