
COMP9313: Big Data Management
Course web site: http://www.cse.unsw.edu.au/~cs9313/

Chapter 7.2:
Finding Similar Items

(figure: the three-step pipeline)
Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

LSH: First Cut
❖ Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s=0.8)
❖ LSH – General idea: Use a function f(x,y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
❖ For Min-Hash matrices:
➢ Hash columns of signature matrix M to many buckets
➢ Each pair of documents that hashes into the same bucket is a candidate pair

Candidates from Min-Hashing
❖ Pick a similarity threshold s (0 < s < 1)
❖ Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows: M(i, x) = M(i, y) for at least fraction s of the values of i
➢ We expect documents x and y to have the same (Jaccard) similarity as their signatures

LSH for Min-Hashing
❖ Big idea: Hash columns of signature matrix M several times
❖ Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
❖ Candidate pairs are those that hash to the same bucket
(figure: signature matrix M, one signature per column, r rows per band)

Partition M into Bands
❖ Divide matrix M into b bands of r rows
❖ For each band, hash its portion of each column to a hash table with k buckets
➢ Make k as large as possible
❖ Candidate column pairs are those that hash to the same bucket for ≥ 1 band
❖ Tune b and r to catch most similar pairs, but few non-similar pairs

Hashing Bands
(figure: columns hashed band by band into buckets)
❖ Columns 2 and 6 are probably identical (candidate pair)
❖ Columns 6 and 7 are surely different
❖ Regardless of what those columns look like in the other three bands, this pair of columns will be a candidate pair
❖ Two columns that do not agree in band 1 have three other chances to become a candidate pair; they might be identical in any one of these other bands
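The banding scheme above can be sketched as a short in-memory program. This is a minimal illustration, not the course's reference implementation: document ids and signatures are made up, and each band's slice of the signature is used directly as the bucket key (a stand-in for hashing it into one of k buckets).

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """Find candidate pairs by hashing each of the b bands (r rows each).

    signatures: dict mapping doc id -> list of b*r minhash values.
    Returns the set of pairs that share a bucket in at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            # The band's slice of the signature acts as the bucket key.
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[chunk].append(doc)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

# Toy example: d1 and d2 agree in the first band, d3 matches nobody.
sigs = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],
    "d3": [7, 8, 7, 8, 7, 8],
}
print(lsh_candidates(sigs, b=2, r=3))  # {('d1', 'd2')}
```

Note that d1 and d2 become candidates despite disagreeing in the second band: one matching band is enough.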
Simplifying Assumption
❖ There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
❖ Hereafter, we assume that “same bucket” means “identical in that band”
❖ Assumption needed only to simplify analysis, not for correctness of the algorithm

Example of Bands
Assume the following case:
❖ Suppose 100,000 columns of M (100k docs)
❖ Signatures of 100 integers (rows)
❖ Therefore, signatures take 40MB
❖ Choose b = 20 bands of r = 5 integers/band
❖ Goal: Find pairs of documents that are at least s = 0.8 similar

C1, C2 are 80% Similar
❖ Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
❖ Assume: sim(C1, C2) = 0.8
➢ Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
❖ Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
❖ Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 ≈ 0.00035
➢ i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
➢ We would find 99.965% of the pairs of truly similar documents

C1, C2 are 30% Similar
❖ Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
❖ Assume: sim(C1, C2) = 0.3
➢ Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
❖ Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
❖ Probability C1, C2 identical in at least 1 of the 20 bands: 1 − (1 − 0.00243)^20 ≈ 0.0474
➢ In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
➢ They are false positives: we will have to examine them (they are candidate pairs), but then it will turn out their similarity is below threshold s

LSH Involves a Tradeoff
❖ Pick:
➢ The number of Min-Hashes (rows of M),
➢ The number of bands b, and
➢ The number of rows r per band
to balance false positives/negatives
❖ Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up

Analysis of LSH: What We Want
(figure: the ideal step function; probability of sharing a bucket is 0 ("no chance") if t < s and 1 if t ≥ s; x-axis: similarity t = sim(C1, C2) of two sets; the jump is at the similarity threshold s)

What 1 Band of 1 Row Gives You
(figure: a straight line; the probability of sharing a bucket, i.e., the probability of equal hash-values, equals the similarity t = sim(C1, C2) of the two sets)

❖ The probability that the minhash signatures for the documents agree in any one particular row of the signature matrix is t = sim(C1, C2)
❖ Pick any band (r rows)
➢ Prob. that all rows in the band are equal = t^r
➢ Prob. that some row in the band is unequal = 1 − t^r
❖ Prob. that no band is identical = (1 − t^r)^b
❖ Prob. that at least 1 band is identical = 1 − (1 − t^r)^b
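These formulas can be checked numerically against the running example (b = 20 bands of r = 5 rows); a small sketch, with the function name chosen here for illustration:

```python
def p_candidate(t, r, b):
    """Probability that two columns with similarity t become a candidate
    pair under banding: 1 - (1 - t^r)^b."""
    return 1 - (1 - t ** r) ** b

# b = 20, r = 5, as in the example slides:
print(round(p_candidate(0.8, 5, 20), 4))  # ≈ 0.9996: few false negatives
print(round(p_candidate(0.3, 5, 20), 4))  # ≈ 0.0475: some false positives
```

This reproduces the two numbers from the example: 80%-similar pairs are found with probability ≈ 0.99965, while ≈ 4.74% of 30%-similar pairs slip through as false positives.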

What b Bands of r Rows Give You
❖ Prob. of sharing a bucket = prob. that at least one band is identical = 1 − (1 − t^r)^b
➢ All rows of a band equal means that band is identical; some row unequal means that band differs
❖ The threshold (where the S-curve rises steeply) is approximately s ≈ (1/b)^(1/r)
(figure: the S-curve of 1 − (1 − t^r)^b against the similarity t = sim(C1, C2) of two sets)
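The threshold approximation above is easy to evaluate; a minimal sketch (function name illustrative) for the running example's parameters:

```python
def approx_threshold(r, b):
    """Approximate similarity at which the S-curve 1 - (1 - t^r)^b
    rises most steeply: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

# For b = 20 bands of r = 5 rows:
print(round(approx_threshold(5, 20), 3))  # ≈ 0.549
```

So with b = 20 and r = 5, pairs with similarity well above ≈ 0.55 are very likely to become candidates, and pairs well below it are unlikely to.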

Example: b = 20, r = 5
❖ Prob. that at least 1 band is identical, as a function of the similarity threshold s:

s      1 − (1 − s^5)^20
0.2    0.006
0.3    0.047
0.4    0.186
0.5    0.470
0.6    0.802
0.7    0.975
0.8    0.9996

Picking r and b
❖ Picking r and b to get the best S-curve
➢ 50 hash-functions (r = 5, b = 10)
(figure: the resulting S-curve, prob. of sharing a bucket vs. similarity, both axes from 0 to 1; the blue area is the false negative rate, the black area the false positive rate)

LSH Summary
❖ Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
❖ Check in main memory that candidate pairs really do have similar signatures
❖ Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

Summary: 3 Steps
❖ Shingling: Convert documents to sets
➢ We used hashing to assign each shingle an ID
❖ Min-Hashing: Convert large sets to short signatures, while preserving similarity
➢ We used similarity preserving hashing to generate signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
➢ We used hashing to get around generating random permutations
❖ Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
➢ We used hashing to find candidate pairs of similarity ≥ s

Distance Measures
❖ Generalized LSH is based on some kind of “distance” between points.
➢ Similar points are “close.”
❖ Example: Jaccard similarity is not a distance; 1 minus Jaccard similarity is.
❖ d is a distance measure if it is a function from pairs of points to real numbers such that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).

Some Euclidean Distances
❖ L2 norm: d(x,y) = square root of the sum of the squares of the differences between x and y in each dimension.
➢ The most common notion of “distance.”
❖ L1 norm: sum of the absolute differences in each dimension.
➢ Manhattan distance = distance if you had to travel along coordinates only.
❖ Example (points a and b differing by 4 and 3 in the two dimensions):
➢ L2 norm: dist(a,b) = √(4² + 3²) = 5
➢ L1 norm: dist(a,b) = 4 + 3 = 7

Non-Euclidean Distances
❖ Jaccard distance for sets = 1 minus Jaccard similarity.
❖ Cosine distance for vectors = angle between the vectors.
❖ Edit distance for strings = number of inserts and deletes to change one string into another.

Cosine Distance
❖ Think of a point as a vector from the origin [0,0,…,0] to its location.
❖ Two points’ vectors make an angle, whose cosine is the normalized dot-product of the vectors: p1·p2 / (|p1||p2|).
➢ Example: p1 = [1,0,2,-2,0]; p2 = [0,0,3,0,0].
➢ p1·p2 = 6; |p1| = |p2| = √9 = 3.
➢ cos(θ) = 6/9; θ is about 48 degrees.

Edit Distance
❖ The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
❖ An equivalent definition: d(x,y) = |x| + |y| − 2|LCS(x,y)|.
➢ LCS = longest common subsequence = any longest string obtained both by deleting from x and by deleting from y.
❖ Example:
➢ x = abcde; y = bcduve.
➢ Turn x into y by deleting a, then inserting u and v after d. Edit distance = 3.
➢ Or, computing edit distance through the LCS, note that LCS(x,y) = bcde.
➢ Then: |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3 = edit distance.

Hash Functions Decide Equality
❖ There is a subtlety about what a “hash function” is, in the context of LSH families.
❖ A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.
❖ Example: the family of minhash functions computes minhash values and says “yes” iff they are the same.
❖ Shorthand: “h(x) = h(y)” means h says “yes” for the pair of elements x and y.
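The cosine and edit-distance examples above can be verified with a short sketch; the helper names are illustrative, and the LCS is computed with standard dynamic programming:

```python
import math

def cosine_angle(p1, p2):
    """Angle (degrees) between two vectors, via the normalized dot product."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm = math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / norm))

def lcs_len(x, y):
    """Length of the longest common subsequence of strings x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x):
        for j, cy in enumerate(y):
            if cx == cy:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete-only edit distance: |x| + |y| - 2|LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_len(x, y)

print(round(cosine_angle([1, 0, 2, -2, 0], [0, 0, 3, 0, 0])))  # 48 (degrees)
print(edit_distance("abcde", "bcduve"))                        # 3
```

Both slide examples check out: the angle between p1 and p2 is about 48 degrees, and the edit distance of abcde and bcduve is 3 via LCS length 4.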
LSH Families Defined
❖ Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
1. If d(x,y) < d1, then the probability over all h in H that h(x) = h(y) is at least p1.
2. If d(x,y) > d2, then the probability over all h in H that h(x) = h(y) is at most p2.

LS Families: Illustration
(figure: probability that h(x) = h(y) vs. distance d(x,y); high probability, at least p1, when the distance is below d1; low probability, at most p2, when the distance is above d2)

Example: LS Family
❖ Claim: the family H of minhash functions is a (1/3, 3/4, 2/3, 1/4)-sensitive family for the space S of sets under d = Jaccard distance.
➢ If distance < 1/3 (so similarity > 2/3), then the probability that minhash values agree is > 2/3
➢ If distance > 3/4 (so similarity < 1/4), then the probability that minhash values agree is < 1/4
❖ For Jaccard similarity, minhashing gives us a (d1, d2, (1 − d1), (1 − d2))-sensitive family for any d1 < d2

Exact Approach to Finding Similar Items

Similarity Join
❖ Finding pairs of records with a similarity on their join attributes > t

Similarity Join
❖ Given two collections of records R and S, a similarity function sim(·, ·), and a threshold τ, the set similarity join between R and S is to find all record pairs r (from R) and s (from S) such that sim(r, s) ≥ τ.
❖ Given the above example, with τ = 0.5, the results are: (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1) (similarity 0.5), (r3, s2) (similarity 0.5).
❖ LSH can solve this problem approximately.
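A brute-force version of this definition can be sketched directly; the records below are hypothetical stand-ins (the slide's example table is not reproduced here), and the function names are illustrative:

```python
def jaccard(r, s):
    """Jaccard similarity of two token sets."""
    return len(r & s) / len(r | s)

def set_similarity_join(R, S, tau):
    """Exact (brute-force) set similarity join: all pairs with sim >= tau."""
    return [(i, j, jaccard(r, s))
            for i, r in R.items()
            for j, s in S.items()
            if jaccard(r, s) >= tau]

# Hypothetical records, not the slide's table:
R = {"r1": {"a", "b", "c", "d"}}
S = {"s1": {"a", "b", "c"}}
print(set_similarity_join(R, S, 0.5))  # [('r1', 's1', 0.75)]
```

The brute-force join compares every pair, which is exactly the quadratic cost that the LSH and prefix-filtering techniques in this chapter avoid.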

Application: Record linkage
(figure: matching person records, e.g. “Keanu L.”, across two tables)
❖ Step 1: Similarity join between the two tables
❖ Step 2: Verification step

A Naïve Solution
❖ Map: <23, (a,b,c)> → (a, 23), (b, 23), (c, 23)
❖ Reduce: (a,23), (a,29), (a,50), … → verify each pair (23, 29), (23, 50), (29, 50), …
❖ Too much data to transfer
❖ Too many pairs to verify
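The naïve job can be simulated in memory to see why it is wasteful; the record ids and tokens below are illustrative:

```python
from collections import defaultdict
from itertools import combinations

# Records are token sets, keyed by record id.
records = {23: {"a", "b", "c"}, 29: {"a", "d"}, 50: {"a", "c"}}

# Map: emit (token, record id) for *every* token of every record.
inverted = defaultdict(list)
for rid, tokens in records.items():
    for tok in tokens:
        inverted[tok].append(rid)

# Reduce: for each token, every pair of ids sharing it must be verified.
pairs = set()
for ids in inverted.values():
    pairs.update(combinations(sorted(ids), 2))

print(sorted(pairs))  # [(23, 29), (23, 50), (29, 50)]
```

Even in this tiny example the pair (23, 50) is generated twice, once by token a and once by token c; with a frequent token shared by thousands of records, the reducer for that token must verify a quadratic number of pairs.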

Solving frequency skew: prefix filtering
❖ Sort tokens by frequency (ascending)
❖ Prefix of a set: its least frequent tokens
❖ Prefixes of similar sets should share tokens
(figure: record r1 with its tokens sorted by ascending frequency; the prefix is the leading, least frequent part)

Prefix filtering: example
(figure: Record 1 and Record 2, each with 5 tokens sorted by frequency; their length-2 prefixes are highlighted)
❖ Each set has 5 tokens
❖ “Similar”: they share at least 4 tokens
❖ Prefix length: 5 − 4 + 1 = 2
➢ If the two length-2 prefixes were disjoint, the records could share at most 3 tokens, so any pair sharing 4 tokens must collide on a prefix token
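This pigeonhole property can be checked by brute force over random token sets; a sketch in which the universe size, trial count, and seed are arbitrary choices:

```python
import random

random.seed(0)
universe = list(range(20))  # the global token order (ascending frequency)

def prefix(rec, p):
    """First p tokens of the record under the global order."""
    return set(sorted(rec)[:p])

# Property: if two 5-token sets share >= 4 tokens, their length-2
# prefixes (5 - 4 + 1 = 2) must overlap.
for _ in range(10_000):
    r = set(random.sample(universe, 5))
    s = set(random.sample(universe, 5))
    if len(r & s) >= 4:
        assert prefix(r, 2) & prefix(s, 2), "prefix filter missed a pair"
print("property holds on 10,000 random pairs")
```

No counterexample exists: since r has only 3 tokens outside its prefix, two records with disjoint prefixes cannot share 4 tokens.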

Hadoop Solution: Overview
❖ Stage 1: Order tokens by frequency
❖ Stage 2: Finding “similar” id pairs (verification)
❖ Stage 3: Remove duplicates

Stage 1: Sort tokens by frequency
➢ MapReduce phase 1: compute token frequencies
➢ MapReduce phase 2: sort tokens by frequency

Stage 2: Find “similar” id pairs
➢ Partition records using prefix tokens
➢ Verify the similarity of each candidate pair

Compute the Length of Shared Tokens
❖ Jaccard similarity: sim(r, s) = |r ∩ s| / |r ∪ s|
❖ If sim(r, s) ≥ τ, then the overlap l = |r ∩ s| satisfies l ≥ |r ∪ s| · τ ≥ max(|r|, |s|) · τ
❖ Given a record r, you can compute the prefix length as p = |r| − l + 1
❖ If r and s are a candidate pair, they must share at least one token in their first (|r| − l + 1) tokens
❖ Given a record r = (A, B, C, D) and p = 2, the mapper emits (A, r) and (B, r)
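A sketch of the mapper's prefix emission follows. It assumes the per-record overlap bound l = ⌈|r| · τ⌉, computed from r alone since |s| is unknown at map time, and τ = 0.75 is chosen here only so that the slide's p = 2 example works out; both are assumptions, not values fixed by the slides.

```python
import math

def prefix_tokens(record, tau):
    """Tokens used as partitioning keys for `record`.

    `record` is a tuple of tokens already sorted by ascending global
    frequency. A pair meeting Jaccard threshold tau must share at least
    l = ceil(|r| * tau) tokens, hence at least one token among the first
    p = |r| - l + 1 positions.
    """
    l = math.ceil(len(record) * tau)
    p = len(record) - l + 1
    return record[:p]

# Slide example: r = (A, B, C, D) with p = 2, so emit (A, r) and (B, r).
r = ("A", "B", "C", "D")
for token in prefix_tokens(r, tau=0.75):
    print((token, r))  # ('A', r) then ('B', r)
```

Records are then grouped by these prefix tokens, and only records meeting in some group are verified, which is far fewer pairs than the naïve all-tokens approach.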

Stage 3: Remove Duplicates

More Optimization Strategies
❖ Project 3: Do it using Spark on Google Dataproc
❖ It is your job to design more optimization strategies. The faster, the better.
❖ Thinking:
➢ How to compute the prefix length of a single record when processing it?
➢ How to pass the sorted token list to each worker?
➢ Is it necessary to generate duplicate pairs?

References
❖ Chapter 3 of Mining of Massive Datasets.

End of Chapter 7.2