SECTION 8: HASHING APPLICATIONS, SET
RESEMBLANCE & PRIMALITY TESTING
ISABELLE ZHENG
TABLE OF CONTENTS
¡ Hashing Applications: Bloom Filters and Fingerprinting ¡ Set resemblance
¡ Primality Testing
¡ Section Problems
BLOOM FILTERS
A Bloom filter is a probabilistic data structure used for set membership problems. It is more space efficient than conventional hashing schemes.
• There are ! bits and ” hash functions #$, #&, … #( .
• When adding an element ) to the set, set bits #$()), #&()), … #(()) to 1.
• To check if ) is already in the set, check if the corresponding bits are set to 1.
With Bloom filters, we trade away correctness for space — it’s possible we say ) is in the Bloom filter when it is not. However, with Bloom filters, we only need ! bits of memory.
BLOOM FILTERS EXAMPLE
0
0
0
0
0
0
0
* = {#”,#’}
!”#” =1,!’#” =4 !”#’ =5,!’#’ =4 !”#- =5,!’#- =1
0
0
0
0
FINGERPRIN TING
Goal: use a short, identifying “fingerprint” for some pattern ! to pattern match in a larger file.
• Hash each set of |!| consecutive characters (sliding window) into a 16 bit value (or other size) by taking mods.
• Randomly select some prime #.
• Instead of taking mods for every set of |!| consecutive characters naively, we can just modify the hashed value
for the previous set of |!| characters, which we call $. Let % be the leftmost digit of $ and & be the rightmost digit of our new number $’.
$’= 10$−10,-.% +& 012#
When we update, we remove the leftmost digit % and insert a new rightmost digit &. We can use multiple primes to make the probability of a false positive small.
FINGERPRINTING EXAMPLE
|”|=5 #=13 $% = 10 $ −10*+, – +/ 012#
3
1
4
1
5
2
SET RESEMBLANCE
Our goal: determine whether or not two documents are “near duplicates” – the document similarity problem
How?
• Define set resemblance
• Find a way to estimate resemblance efficiently
• Turn document similarity into a set resemblance problem
DEFINING RESEMBLANCE
Consider two sets A and B. We define the resemblance of A and B (also called the Jaccard Coefficient) to be: !”#”$%&'()” *,, =. *,, = |*∩,|
|*∪,| . *,, = 0 If two sets are disjoint
How long does it take to compare two sets this way?
Notice that:
0≤.*,, ≤1
. *, , = 1 If two sets are identical
5((7)
5(( log () 5(()
Naive
Sort, then compare Using hashing
RANDOM PERMUTATIONS
We need a “black box” !! that will efficiently output random permutations on our universe. For example, !! 1,. = “,(.) !! 50,. = “23(.)
Say our random permutations are in the family: ” ∶ 0, 15 → [0, 15] Then, they might look like:
+
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
“,(.)
9
2
14
11
6
3
7
8
15
10
4
13
12
0
1
5
“0(.)
3
4
7
12
6
14
1
5
2
8
15
7
11
13
10
9
We use “: 4 to denote the set of elements obtained by computing !!(;, .) for every . in 4 (”calling card”). Ifwehaveaset 4={3,5,11,4} whatis”0 4 ?
ESTIMATING RESEMBLANCE
Ifwecompute!” # and!” $ ,notethatmin{!” # }=min{!” $ }onlyifsomeelement+suchthat !” x = min{!” # }=min{!” $ }.
Then, +, the minimum of the union of two sets # ∪ $, has to lie in the intersection # ∩ $. Pr[min{!” # } = min{!” $ }]
=Pr[min{!” #∪$ }=min{!” #∩$ }] = |#∩$|=5(#,$)
|#∪$|
We can just estimate resemblance by taking many permutations and computing their minimums! Then our estimate for resemblance is just:
9:;<=>;? @AB 5 #, $ = # A@ =>;Dh?:
# A@ F?B=G;>;