COMP9313: Big Data Management
Course web site: http://www.cse.unsw.edu.au/~cs9313/
Chapter 7.2:
Finding Similar Items
The big picture (three steps, from document to candidate pairs):
❖ Step 1: Shingling: Convert a document to the set of strings of length k that appear in the document
❖ Step 2: Min-Hashing: Convert the sets to signatures: short integer vectors that represent the sets, and reflect their similarity
❖ Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents; the output is the candidate pairs: those pairs of signatures that we need to test for similarity
LSH: First Cut
❖ Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s=0.8)
❖ LSH – General idea: Use a function f(x,y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
❖ For Min-Hash matrices:
➢ Hash columns of signature matrix M to many buckets
➢ Each pair of documents that hashes into the same bucket is a candidate pair
Candidates from Min-Hashing
❖ Pick a similarity threshold s (0 < s < 1)
❖ Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows:
M (i, x) = M (i, y) for at least frac. s values of i
➢ We expect documents x and y to have the same (Jaccard) similarity as their signatures
LSH for Min-Hash Matrices
❖ Big idea: Hash columns of signature matrix M several times
❖ Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
❖ Candidate pairs are those that hash to the same bucket
[Figure: signature matrix M, divided into b bands of r rows each; one column of M is one signature.]
Partition M into Bands
❖ Divide matrix M into b bands of r rows
➢ For each band, hash its portion of each column to a hash table with k buckets
➢ Make k as large as possible
❖ Candidate column pairs are those that hash to the same bucket for ≥ 1 band
❖ Tune b and r to catch most similar pairs, but few non-similar pairs
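The banding procedure above can be sketched as follows. This is a minimal illustrative implementation (the toy signature matrix and the use of Python's built-in `hash` as the bucket hash are assumptions, not from the slides):

```python
# Sketch of the banding technique: hash each band of each signature column
# to buckets; columns sharing a bucket in >= 1 band become candidate pairs.
from collections import defaultdict
from itertools import combinations

def lsh_candidates(M, b, r):
    """M: signature matrix as a list of columns (one signature per document).
    Each signature must have b*r rows."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(M):
            # Hash this band's portion of the column to a bucket.
            band_slice = tuple(sig[band * r:(band + 1) * r])
            buckets[hash(band_slice)].append(col)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates

# Toy example: columns 0 and 1 agree on the first band, so they become
# a candidate pair even though their second bands differ.
M = [[1, 2, 3, 4], [1, 2, 9, 9], [5, 6, 7, 8]]
print(lsh_candidates(M, b=2, r=2))  # {(0, 1)}
```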
Hashing Bands
[Figure: one band's bucket array — columns 2 and 6 land in the same bucket, so they are probably identical (a candidate pair); columns 6 and 7 land in different buckets, so they are surely different in this band.]
❖ Regardless of what those columns look like in the other three bands, this pair of columns will be a candidate pair
❖ Two columns that do not agree in band 1 have three other chances to become a candidate pair; they might be identical in any one of these other bands.
Simplifying Assumption
❖ There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
❖ Hereafter, we assume that “same bucket” means “identical in that band”
❖ Assumption needed only to simplify analysis, not for correctness of algorithm
Example of Bands
Assume the following case:
❖ Suppose 100,000 columns of M (100k docs)
❖ Signatures of 100 integers (rows)
❖ Therefore, signatures take 40 MB (100,000 × 100 × 4 bytes)
❖ Choose b = 20 bands of r = 5 integers/band
❖ Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
❖ Find pairs of s = 0.8 similarity; set b = 20, r = 5
❖ Assume: sim(C1, C2) = 0.8
➢ Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
❖ Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
❖ Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 ≈ 0.00035
➢ i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
➢ We would find 99.965% of the pairs of truly similar documents
C1, C2 are 30% Similar
❖ Find pairs of s = 0.8 similarity; set b = 20, r = 5
❖ Assume: sim(C1, C2) = 0.3
➢ Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
❖ Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
❖ Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)^20 ≈ 0.0474
➢ In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
➢ They are false positives: we will have to examine them (they are candidate pairs), but it will then turn out their similarity is below threshold s
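The two worked examples (b = 20 bands, r = 5 rows per band) can be checked directly; a short sketch:

```python
# Verify the false-negative and false-positive rates for b = 20, r = 5.
b, r = 20, 5

def p_candidate(t, b, r):
    """Probability that two columns with similarity t share >= 1 bucket."""
    return 1 - (1 - t ** r) ** b

# sim = 0.8: probability that NO band is identical (a false negative)
fn = (1 - 0.8 ** r) ** b
print(fn)   # roughly 3.6e-4: about 1/3000 of the 80%-similar pairs missed

# sim = 0.3: probability that >= 1 band is identical (a false positive)
fp = p_candidate(0.3, b, r)
print(fp)   # roughly 0.047: about 4.7% of such pairs become candidates
```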
LSH Involves a Tradeoff
❖ Pick:
➢ The number of Min-Hashes (rows of M)
➢ The number of bands b, and
➢ The number of rows r per band
to balance false positives/negatives
❖ Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
Analysis of LSH
What We Want
[Figure: the ideal step function — probability of sharing a bucket (y-axis) vs. similarity t = sim(C1, C2) of two sets (x-axis): no chance if t < s, probability = 1 if t > s, with the jump at the similarity threshold s.]
What 1 Band of 1 Row Gives You
[Figure: with a single row, the curve is the diagonal — probability of sharing a bucket = probability of equal hash-values = similarity t = sim(C1, C2).]
❖ The probability that the minhash signatures for the documents agree in any one particular row of the signature matrix is t (sim(C1, C2) )
❖ Pick any band (r rows)
➢ Prob. that all rows in a band are equal = t^r
➢ Prob. that some row in a band is unequal = 1 − t^r
❖ Prob. that no band is identical = (1 − t^r)^b
❖ Prob. that at least 1 band is identical = 1 − (1 − t^r)^b
What b Bands of r Rows Give You
[Figure: the S-curve — probability of sharing a bucket, 1 − (1 − t^r)^b, vs. similarity t = sim(C1, C2) of two sets. The curve is built up from: all rows of a band equal (t^r), some row of a band unequal (1 − t^r), no bands identical ((1 − t^r)^b), at least one band identical (1 − (1 − t^r)^b); it rises steeply around s ≈ (1/b)^(1/r).]
Picking r and b: the S-curve
❖ Given a similarity threshold s, pick r and b to get the best S-curve for the prob. that at least 1 band is identical
➢ Example: 50 hash-functions (r = 5, b = 10)
[Figure: prob. of sharing a bucket vs. similarity for r = 5, b = 10; the blue area marks the false negative rate and the black area the false positive rate.]
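The shape of this S-curve can be explored numerically. A small sketch for the r = 5, b = 10 example (the sample similarity values are assumptions chosen to show the steep rise; note that at the rule-of-thumb threshold t^r = 1/b the curve sits near 1 − 1/e ≈ 0.63, not exactly 1/2):

```python
# The S-curve 1 - (1 - t^r)^b and its approximate threshold (1/b)^(1/r).
def share_prob(t, r, b):
    return 1 - (1 - t ** r) ** b

r, b = 5, 10                    # 50 hash-functions, as in the example
s_approx = (1 / b) ** (1 / r)   # where the curve rises most steeply
print(round(s_approx, 3))       # 0.631

for t in (0.2, 0.4, s_approx, 0.8):
    # probability stays tiny below the threshold, then shoots up
    print(round(t, 3), round(share_prob(t, r, b), 3))
```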
LSH Summary
❖ Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
❖ Check in main memory that candidate pairs really do have similar signatures
❖ Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents
Summary: 3 Steps
❖ Shingling: Convert documents to sets
➢ We used hashing to assign each shingle an ID
❖ Min-Hashing: Convert large sets to short signatures, while preserving similarity
➢ We used similarity preserving hashing to generate signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
➢ We used hashing to get around generating random permutations
❖ Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
➢ We used hashing to find candidate pairs of similarity ≥ s
Distance Measures
❖ Generalized LSH is based on some kind of “distance” between points.
➢ Similar points are “close.”
❖ Example: Jaccard similarity is not a distance; 1 minus Jaccard similarity is.
❖ d is a distance measure if it is a function from pairs of points to real numbers such that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
Some Euclidean Distances
❖ L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
➢ The most common notion of “distance.”
❖ L1 norm: sum of the absolute differences in each dimension.
➢ Manhattan distance = distance if you had to travel along coordinates only.
➢ Example: for points a and b that differ by 4 in one dimension and 3 in the other:
L2-norm: dist(a,b) = √(4² + 3²) = 5
L1-norm: dist(a,b) = 4 + 3 = 7
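The two distances above can be computed directly; a minimal sketch (the concrete coordinates (0,0) and (4,3) are assumed to reproduce the 4-by-3 gap in the slide's figure):

```python
# L2 (Euclidean) and L1 (Manhattan) distances between two points.
import math

def l2(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

a, b = (0, 0), (4, 3)   # assumed coordinates: differ by 4 and by 3
print(l2(a, b))  # 5.0
print(l1(a, b))  # 7
```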
Some Non-Euclidean Distances
❖ Jaccard distance for sets = 1 minus Jaccard similarity.
❖ Cosine distance for vectors = angle between the vectors.
❖ Edit distance for strings = number of inserts and deletes to change one string into another.
Cosine Distance
❖ Think of a point as a vector from the origin [0,0,...,0] to its location.
❖ Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1.p2/(|p1||p2|).
➢ Example: p1 = [1,0,2,-2,0]; p2 = [0,0,3,0,0].
➢ p1.p2 = 6; |p1| = |p2| = √9 = 3.
➢ cos(θ) = 6/9; θ is about 48 degrees.
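The cosine example above, worked through in code:

```python
# Angle between p1 and p2 via the normalized dot-product.
import math

p1 = [1, 0, 2, -2, 0]
p2 = [0, 0, 3, 0, 0]
dot = sum(a * b for a, b in zip(p1, p2))     # 6
n1 = math.sqrt(sum(a * a for a in p1))       # 3.0
n2 = math.sqrt(sum(b * b for b in p2))       # 3.0
angle = math.degrees(math.acos(dot / (n1 * n2)))
print(round(angle, 1))   # 48.2 degrees
```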
Edit Distance
❖ The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
❖ An equivalent definition: d(x,y) = |x| + |y| - 2|LCS(x,y)|.
➢ LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y.
❖ Example:
➢ x = abcde; y = bcduve.
➢ Turn x into y by deleting a, then inserting u and v after d. Edit distance = 3.
➢ Or, computing edit distance through the LCS, note that LCS(x,y) = bcde.
➢ Then: |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2×4 = 3 = edit distance.
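The LCS identity above gives a direct way to compute this insert/delete-only edit distance; a minimal memoized sketch:

```python
# Edit distance (inserts/deletes only) via d(x,y) = |x| + |y| - 2*|LCS(x,y)|.
from functools import lru_cache

def edit_distance(x, y):
    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the LCS of x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        return max(lcs(i - 1, j), lcs(i, j - 1))
    return len(x) + len(y) - 2 * lcs(len(x), len(y))

print(edit_distance("abcde", "bcduve"))  # 3
```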
Hash Functions Decide Equality
❖ There is a subtlety about what a “hash function” is, in the context of LSH families.
❖ A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.
❖ Example: the family of minhash functions computes minhash values and says “yes” iff they are the same.
❖ Shorthand: “h(x) = h(y)” means h says “yes” for pair of elements x and y.
LSH Families Defined
❖ Suppose we have a space S of points with a distance measure d.
❖ A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
1. If d(x,y) < d1, then the probability over all h in H that h(x) = h(y) is at least p1.
2. If d(x,y) > d2, then the probability over all h in H that h(x) = h(y) is at most p2.
LS Families: Illustration
[Figure: probability that h(x) = h(y) vs. distance d(x,y) — high probability (at least p1) when d(x,y) < d1, low probability (at most p2) when d(x,y) > d2.]
Example: LS Family
❖ Claim: H is a (1/3, 3/4, 2/3, 1/4)-sensitive family for S and d.
➢ If distance < 1/3 (so similarity > 2/3), then the probability that minhash values agree is > 2/3.
➢ If distance > 3/4 (so similarity < 1/4), then the probability that minhash values agree is < 1/4.
For Jaccard similarity, minhashing gives us a (d1,d2,(1-d1),(1-d2))-sensitive family for any d1 < d2.
Exact Approach to Finding Similar Items
Similarity Join
Finding pairs of records whose similarity on their join attributes is at least a threshold τ
❖ Given two collections of records R and S, a similarity function sim(., .), and a threshold τ, the set similarity join between R and S, is to find all record pairs r (from R) and s (from S), such that sim(r, s) >= τ.
❖ Given the above example, and set τ=0.5, the results are: (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1) (similarity 0.5), (r3, s2) (similarity 0.5).
❖ LSH can solve this problem approximately.
Similarity Join Application: Record Linkage
[Figure: example tables to be linked, with name variants such as “Keanu L.”.]
[Figure: two-step solution — Step 1: similarity join on Table R and Table S; Step 2: verification step.]
A Naïve Solution
❖ Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
❖ Reduce: (a,23), (a,29), (a,50), … → Verify each pair: (23, 29), (23, 50), (29, 50), …
❖ Too much data to transfer
❖ Too many pairs to verify
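The naïve Map/Reduce steps above can be imitated in memory to see why the pair count explodes. A toy sketch (the three example records are assumptions consistent with the ids 23, 29, 50 used above):

```python
# Naive token-based join: map each id to (token, id), group by token,
# then collect every id pair per group for verification.
from collections import defaultdict
from itertools import combinations

records = {23: {"a", "b", "c"}, 29: {"a", "d"}, 50: {"a", "c"}}  # toy data

# Map: emit (token, id) for every token of every record.
groups = defaultdict(list)
for rid, tokens in records.items():
    for tok in tokens:
        groups[tok].append(rid)

# Reduce: every pair of ids sharing any token must be verified.
pairs = set()
for ids in groups.values():
    pairs.update(combinations(sorted(ids), 2))
print(sorted(pairs))   # [(23, 29), (23, 50), (29, 50)]
```

Even in this tiny example every record pairs with every other, because a single frequent token (here "a") links them all — the frequency-skew problem that prefix filtering addresses next.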
Solving frequency skew: prefix filtering
❖ Sort tokens by frequency (ascending)
❖ Prefix of a set: its least frequent tokens
❖ Prefixes of similar sets should share tokens
[Figure: record r1 with its tokens sorted by ascending frequency; the prefix is the first few (rarest) tokens.]
Prefix filtering: example
[Figure: two records of 5 tokens each, tokens sorted by frequency.]
❖ Each set has 5 tokens
❖ “Similar”: they share at least 4 tokens
❖ Prefix length: 2
➢ If two records are similar (share ≥ 4 of their 5 tokens), each prefix of length 5 − 4 + 1 = 2 must contain at least one shared token
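A toy check of this prefix idea (the token lists `r1`/`r2` are hypothetical, pre-sorted by ascending global frequency as the slides require):

```python
# Two 5-token records sharing >= 4 tokens must share a token in their
# length-2 prefixes, so comparing prefixes is enough to find candidates.
def prefix(tokens, p):
    return set(tokens[:p])   # tokens assumed sorted by ascending frequency

r1 = ["x", "a", "b", "c", "d"]   # hypothetical sorted token lists
r2 = ["y", "a", "b", "c", "d"]   # shares 4 of 5 tokens with r1
print(prefix(r1, 2) & prefix(r2, 2))   # {'a'} — the prefixes overlap
```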
Hadoop Solution: Overview
❖ Stage 1: Order tokens by frequency
❖ Stage 2: Finding “similar” id pairs (verification)
❖ Stage 3: remove duplicates
Stage 1: Sort tokens by frequency
➢ MapReduce phase 1: Compute token frequencies
➢ MapReduce phase 2: Sort tokens by frequency
Stage 2: Find “similar” id pairs
➢ Partition using prefixes
➢ Verify similarity
Compute the Length of Shared Tokens
❖ Jaccard Similarity: sim(r, s) = |r ∩ s| / |r ∪ s|
❖ If sim(r, s) >= τ, then l = |r ∩ s| >= |r ∪ s| × τ >= max(|r|, |s|) × τ
❖ Given a record r, you can compute the prefix length as p = |r| − l + 1
❖ If r and s are a candidate pair, they must share at least one token in the first (|r| − l + 1) tokens
❖ Given a record r = (A, B, C, D) and p = 2, the mapper emits (A, r) and (B, r)
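A Stage-2 mapper following the formulas above might look like this sketch. The threshold τ = 0.75 is an assumption chosen so that the record (A, B, C, D) gets prefix length p = 2, matching the example; using l = ⌈|r| × τ⌉ (the part of the bound a mapper can compute from r alone) only makes the prefix longer, so no candidate is lost:

```python
# Prefix-length computation and mapper emits for one record.
import math

def prefix_emits(rid, tokens, tau):
    """tokens assumed sorted by ascending global frequency."""
    l = math.ceil(len(tokens) * tau)   # lower bound on |r ∩ s| if similar
    p = len(tokens) - l + 1            # prefix length
    return [(tok, rid) for tok in tokens[:p]]

# r = (A, B, C, D) with p = 2: the mapper emits (A, r) and (B, r).
print(prefix_emits("r", ["A", "B", "C", "D"], 0.75))
# [('A', 'r'), ('B', 'r')]
```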
Stage 3: Remove Duplicates
More Optimization Strategies
❖ Project 3: Do it using Spark on Google Dataproc
❖ It is your job to design more optimization strategies. The faster, the better.
❖ Thinking:
➢ How to compute the prefix length of a single record when processing it?
➢ How to pass the sorted list to each worker?
➢ Is it necessary to generate duplicate pairs?
References
❖ Chapter 3 of Mining of Massive Datasets.