程序代写代做代考 algorithm AI Bloom Filter

Bloom Filter
COMPCSI 753: Algorithms for Massive Data
Instructor: Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 18, 2020
1

Basic definitions
 Let U be a universe of size n, i.e. U = {1, 2, 3, … , n}
 Cash register model stream:
 Sequence of m elements a1,…, am where ai Î U.
 Elements of U may or may not occur once or several times in the stream.  Example: A sequence of email addresses, a sequence of IP addresses,…
 Filtering a data stream (today’s lecture):
 Select elements with property x from the stream
2

Filtering data stream
 Each element of data stream is a tuple.
 Given a list of keys S, determine which stream elements are
in S.
 Obvious solution: Hash table
 We do not have enough memory to store all S in a hash table.  We might process millions of filters on the same stream.
3

Applications
 Email spam filtering:
 We know approximately 1 billion “good” email addresses.
 If an email comes from one of these, it is NOT spam.
http://www.junkemailfilter.com
4

Applications
 Publish-subscribe systems:
 You are collecting a lots of message (new articles).
 People express interest in certain sets of keywords.
 Determine whether each message matches user’s interest.
https://docs.oracle.com
5

First cut solution
 Given a set of keys S.
 Create a bit array B of n bits, initially at 0s.
 Choose a hash function h with range [0, n).
 Hash each member s Î S to one of n buckets, and set that
bit to 1, i.e. B[h(s)] = 1.
 Hash each element a of the stream and output only those that hash to bit that was set to 1.
 Output a if B[h(a)] = 1.
6

Item
Output the item since it may be in S. Item hashes to a bucket that at least one of the items in S hashed to.
Hash func h
0010001011000
Bit array B It hashes to a bucket set
Drop the item.
to 0 so it is surely not in S.
 Create false positives but no false negatives:
 If the item is in S, we surely output it (no false negatives).  If not, we may still output it (false positives).
7

Parameter settings
 |S| = 1 billion email addresses. |B| = 1 GB = 8 billion bits.
If the email is in S, then it hash to bucket that set to 1. It always gets through (no false negatives).
 Approximately 1/8 of the bit are set to 1. So about 1/8 of the email addresses not in S get through to the output (false positives).
 Often, less than 1/8, since we might have collisions, i.e. more than one email address hash to the same position.
8

Analysis: Balls and bins model
 We want to analyze the number of false positives.
 General problem: If we randomly throw m balls into n bins,
what is the probability that a bin get at least one ball?
 Our case:
 Balls = elements of S
 Bins = buckets
 Randomly throwing = hashing
9

Analysis: Balls and bins model
 We have m balls, n bins.
 Prob. that a bin does not have any ball: (1-1/n)m
 Prob. that a bin has at least one ball: 1-(1-1/n)m
 Since (1-1/n)m ~ e-m/n, the prob. that a bin gets at least one ball is: 1 – e-m/n
 Fraction of 1s in the array B = prob. of false positives = 1 – e-m/n
 Example: m = 1 billions, n = 8 billion bits = 1GB, the fraction of 1s in B is 0.1175 (~1/8 = 0.125).
10

Bloom filter
 |S| = m, |B| = n
 Use k independent hash function h1, … , hk with range [0, n).
 Construct Bloom filter:  Set B to all 0s.
 Hash each member s Î S to one of n buckets using k hash functions, and set B[hi(s)] = 1 for i = 1, … , k.
 Filtering:
 When a stream element a arrives,
o IfB[hi(a)]=1foralli=1,…,k,thendeclareaÎS. o Otherwise, discard the element a.
11

Bloom filter
 Example:
 |S| = 3, |B| = 18
 S = {x, y, z}
k=3
 Discard w since all B[h(w)] are not all 1s.
12

Bloom filter analysis
 What fraction of the bit vector B are 1s?  Randomly throw km balls into n bins.
 So fraction of 1s is 1 – e-km/n
 Prob. that 1 position of a is 1s: 1 – e-km/n
 Prob. that all k positions of a is 1s: (1 – e-km/n)k  So, false positive probability is: (1 – e-km/n)k
13

False positive prob.
Parameter settings
 |S| = 1 billion, n = 8 billion  k = 1: (1 – e-1/8) = 0.1175
0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04
k=2:(1–e-1/4)2 =0.0493  Can we increase k to ∞?
No
 Optimal value of k:
0.020 2 4 6 8 10 12 14 16 18 20
 k = nln(2)/m
 Our case, optimal k = 8ln(2) » 6 Erroratk=6:(1–e-3/4)6 =0.0216
Number of hash functions, k
14

Storage system
 Bloom filter speeds up answers in a key-value storage system.
 Values are stored on a disk which has slow access times while bloom filter decisions are much faster.
Source: Wikipedia
15

Bloom filter: Summary
 Bloom filters guarantee no false negatives and use limited memory, which is suitable for processing data stream.
 Great for pre-processing before more expensive check.
 Very suitable for hardware implementation.  Hash function computation can be parallelized.
16

Homework
 Implement the Bloom filter algorithm on the dataset from Assignment 1, using word tokens from vocal.kos.text:
 Description: Consider that word tokens in the file as “good” words and words not mentioned in the file as “bad” words.
 Goal: Implement an add-on for Gmail to help users filter out “bad” words efficiently while writing an email.
17