COMPSCI 753 Algorithms for Massive Data Semester 2, 2020
Tutorial 1: Locality-sensitive Hashing
Ninh Pham
1 Computing MinHash signatures and estimating Jaccard sim- ilarities
Given the 4 sets S1 = {c,f},S2 = {a,b},S3 = {d,e},S4 = {a,c,e}. 1. Present these sets as a binary matrix.
2. Compute the minhash values of each set using the permutation π = {b, e, a, f, c, d}. Solution:
Elements Integers S1 S2 S3 S4
a00101 b10100 c21001 d30010 e40011 f51000
Table 1: Binary matrix presents the sets.
Integer π Elements S1 S2 S3 S4
1ba0101 4eb0100 0ac1001 5fd0010 2ce0011 3df1000
Following the procedure from the lecture note, we have the answer: Hash function S1 S2 S3 S4
π abca 1
2 Fast computing MinHash signatures
Since it is not feasible to permute a very large matrix explicitly, we will simulate random permutations by using random universal hash functions below:
h1(x)=2x+1 mod6,h2(x)=3x+2 mod6,andh3(x)=5x+2 mod6.
1. Compute the minhash values using these universal hash functions. Note that you have
to map a string to an integer, e.g. a → 0,b → 1,…
2. Which of these hash functions are true permutations?
3. How close are the estimated Jaccard similarities of the six pairs of columns to the true Jaccard similarities?
Solution:
Integers S1 S2 S3 S4 h1(x)=2x+1 mod6 h2(x)=3x+2 mod6 h3(x)=5x+2 mod6
00101122 10100351 21001520 30010155 40011324 51000553
Hash functions S1 S2 S3 S4
h1(x) 5111 h2(x) 2222 h3(x) 0140
Table 4: The minhash values with universal hash functions.
S1 S2 S3 S4 S1 1 0 0 1/4 S2 0 1 0 1/3 S3 0 0 1 1/4 S4 1/4 1/3 1/4 1
Table 5: The actual Jaccard similarity values
S1 S2 S3 S4 S1 1 1/3 1/3 2/3 S2 1/3 1 2/3 2/3 S3 1/3 2/3 1 2/3 S4 2/3 2/3 2/3 1
Table 6: The estimated Jaccard similarity using these 3 hash functions. 2
3 Tuning the parameters for LSH
Evaluate the S-curve 1 − (1 − sr)b, i.e. the probability of being a candidate pair, for s = {0.1, 0.2, . . . , 0.9} using the following values of r and b.
1. r=3andb=10. 2. r=6andb=20. 3. r=5andb=50.
For each value (r,b) above, compute the threshold, that is the value of s which the value of 1−(1−sr)b is exactly 1/2. How is it different from our approximation (1/b)1/r? Which setting we should use in order to achieve the false negatives of 70%-similar pairs at most 5% and false positives of 30%-similar pairs at most 15%.
Solution:
(3, 10) 0.0100
0.0772 0.2394 0.4839 0.7369 0.9123 0.9850 0.9992 1.0000
Exact s Estimate (1/b)1/r
(6, 20) 0.0000
0.0013 0.0145 0.0788 0.2702 0.6154 0.9182 0.9977 1.0000
(3, 10) 0.4062
0.4642
(5, 50) 0.0005
0.0159 0.1145 0.4023 0.7956 0.9825 0.9999 1.0000 1.0000
(6, 20) 0.5694
s
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
(5, 50) 0.4244
It is clearly that we need to use r = 5 and b = 50 since pairs is 0.9999 and 30%-similar pairs is 0.1145.
3
0.6070
the probability of collision of 70%-similar
0.4573