Bloom Filter
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 18, 2020
1
Basic definitions
Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}
Cash register model stream:
Sequence of m elements a1,…, am where ai Î U.
Elements of U may or may not occur once or several times in the stream. Example: A sequence of email addresses, a sequence of IP addresses,…
Filtering a data stream (today’s lecture):
Select elements with property x from the stream
2
Filtering data stream
Each element of data stream is a tuple.
Given a list of keys S, determine which stream elements are
in S.
Obvious solution: Hash table
We do not have enough memory to store all S in a hash table. We might process millions of filters on the same stream.
3
Applications
Email spam filtering:
We know approximately 1 billion “good” email addresses.
If an email comes from one of these, it is NOT spam.
http://www.junkemailfilter.com
4
Applications
Publish-subscribe systems:
You are collecting a lots of message (new articles).
People express interest in certain sets of keywords.
Determine whether each message matches user’s interest.
https://docs.oracle.com
5
First cut solution
Given a set of keys S.
Create a bit array B of n bits, initially at 0s.
Choose a hash function h with range [0, n).
Hash each member s Î S to one of n buckets, and set that
bit to 1, i.e. B[h(s)] = 1.
Hash each element a of the stream and output only those that hash to bit that was set to 1.
Output a if B[h(a)] = 1.
6
Item
Output the item since it may be in S. Item hashes to a bucket that at least one of the items in S hashed to.
Hash func h
0010001011000
Bit array B It hashes to a bucket set
Drop the item.
to 0 so it is surely not in S.
Create false positives but no false negatives:
If the item is in S, we surely output it (no false negatives). If not, we may still output it (false positives).
7
Parameter settings
|S| = 1 billion email addresses. |B| = 1 GB = 8 billion bits.
If the email is in S, then it hash to bucket that set to 1. It always gets through (no false negatives).
Approximately 1/8 of the bit are set to 1. So about 1/8 of the email addresses not in S get through to the output (false positives).
Often, less than 1/8, since we might have collisions, i.e. more than one email address hash to the same position.
8
Analysis: Balls and bins model
We want to analyze the number of false positives.
General problem: If we randomly throw m balls into n bins,
what is the probability that a bin get at least one ball?
Our case:
Balls = elements of S
Bins = buckets
Randomly throwing = hashing
9
Analysis: Balls and bins model
We have m balls, n bins.
Prob. that a bin does not have any ball: (1-1/n)m
Prob. that a bin has at least one ball: 1-(1-1/n)m
Since (1-1/n)m ~ e-m/n, the prob. that a bin gets at least one ball is: 1 – e-m/n
Fraction of 1s in the array B = prob. of false positives = 1 – e-m/n
Example: m = 1 billions, n = 8 billion bits = 1GB, the fraction of 1s in B is 0.1175 (~1/8 = 0.125).
10
Bloom filter
|S| = m, |B| = n
Use k independent hash function h1, … , hk with range [0, n).
Construct Bloom filter: Set B to all 0s.
Hash each member s Î S to one of n buckets using k hash functions, and set B[hi(s)] = 1 for i = 1, … , k.
Filtering:
When a stream element a arrives,
o IfB[hi(a)]=1foralli=1,…,k,thendeclareaÎS. o Otherwise, discard the element a.
11
Bloom filter
Example:
|S| = 3, |B| = 18
S = {x, y, z}
k=3
Discard w since all B[h(w)] are not all 1s.
12
Bloom filter analysis
What fraction of the bit vector B are 1s? Randomly throw km balls into n bins.
So fraction of 1s is 1 – e-km/n
Prob. that 1 position of a is 1s: 1 – e-km/n
Prob. that all k positions of a is 1s: (1 – e-km/n)k So, false positive probability is: (1 – e-km/n)k
13
False positive prob.
Parameter settings
|S| = 1 billion, n = 8 billion k = 1: (1 – e-1/8) = 0.1175
0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04
k=2:(1–e-1/4)2 =0.0493 Can we increase k to ∞?
No
Optimal value of k:
0.020 2 4 6 8 10 12 14 16 18 20
k = nln(2)/m
Our case, optimal k = 8ln(2) » 6 Erroratk=6:(1–e-3/4)6 =0.0216
Number of hash functions, k
14
Storage system
Bloom filter speeds up answers in a key-value storage system.
Values are stored on a disk which has slow access times while bloom filter decisions are much faster.
Source: Wikipedia
15
Bloom filter: Summary
Bloom filters guarantee no false negatives and use limited memory, which is suitable for processing data stream.
Great for pre-processing before more expensive check.
Very suitable for hardware implementation. Hash function computation can be parallelized.
16
Homework
Implement the Bloom filter algorithm on the dataset from Assignment 1, using word tokens from vocal.kos.text:
Description: Consider that word tokens in the file as “good” words and words not mentioned in the file as “bad” words.
Goal: Implement an add-on for Gmail to help users filter out “bad” words efficiently while writing an email.
17