
Finding similar items: Locality-sensitive Hashing
COMPCSI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 4, 2020

Recap: Universal hashing
• Given a random choice of a hash function h from a universal family, with h: U → {0, 1, …, n-1}.
• Our dictionary needs O(n) space and O(1) expected time per operation (search, insert, delete).
• The expectation is over the random choice of the hash function; it is independent of the input set.

Outline
• Similarity search applications
• Finding similar documents:
  o Shingling
  o MinHash
  o Locality-sensitive hashing (LSH)

Scene completion problem
• Problem:
  o Patch up holes in an image by finding similar image regions in a huge image database on the Web.
Source: http://graphics.cs.cmu.edu/projects/scene-completion/

Scene completion problem

Scene completion problem
10 nearest neighbors from a collection of 20,000 images

Scene completion problem
10 nearest neighbors from a collection of 2 million images


Other similarity search applications
• Web crawlers and near-duplicate web pages (mirror pages).
Source: https://medium.com/@promptcloud/data-crawling-real-time-bidding-concept-mechanism-807ad8c0e0c7

Other similarity search applications
• Collaborative-filtering recommender systems require finding similar users or items.
Source: https://dataaspirant.com/2015/05/25/collaborative-filtering-recommendation-engine-implementation-in-python/

A common metaphor
• We can present many problems as finding "similar" items or finding near-neighbors in high dimensions.
• More examples in machine learning:
  o k-nearest neighbor models (classification/regression/outlier detection)
Source: https://www.python-course.eu/k_nearest_neighbor_classifier.php

A common metaphor
• We can present many problems as finding "similar" items or finding near-neighbors in high dimensions.
• More examples in machine learning:
  o Clustering
Source: https://data-flair.training/blogs/clustering-in-r-tutorial/

General problem for today's lecture
• Input:
  o A data set X of n points in high-dimensional space.
  o A similarity function sim and a similarity threshold s.
• Goal:
  o Find all similar pairs xi, xj in X such that sim(xi, xj) ≥ s.
• Solutions:
  o Naïve solution: O(n^2) pairwise comparisons.
  o Magic: can this be done in O(n)?

Jaccard similarity
• The Jaccard similarity of two sets S and T is the ratio of their intersection size to their union size:
  J(S, T) = |S ∩ T| / |S ∪ T|
• Example: with 4 elements in the intersection and 8 in the union, J(S, T) = 4/8 = 1/2.
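As a minimal Python sketch of this definition (the concrete sets below are invented so that the intersection has 4 elements and the union has 8, matching the example):

def jaccard(s: set, t: set) -> float:
    """Jaccard similarity: |S ∩ T| / |S ∪ T|."""
    if not s and not t:
        return 1.0  # convention for two empty sets
    return len(s & t) / len(s | t)

# 4 shared elements, 8 elements in the union, as in the example.
S = {1, 2, 3, 4, 5, 6}
T = {3, 4, 5, 6, 7, 8}
print(jaccard(S, T))  # 0.5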

Finding similar documents
• Input:
  o n documents (e.g. web pages).
  o The Jaccard similarity function J and a similarity threshold s.
• Goal:
  o Find all near-duplicate pairs S, T such that J(S, T) ≥ s.
• Problems:
  o Many small pieces of one document (e.g. words, phrases) can appear out of order in another.
  o n documents are too large to fit into main memory.
  o Algorithmic challenge: O(n^2) pairs of documents to compare.

Outline
• Similarity search applications
• Finding similar documents:
  o Shingling
  o MinHash
  o Locality-sensitive hashing (LSH)

3 essential steps
• Shingling:
  o Convert documents to sets.
• MinHash:
  o Convert large sets to short signatures, while preserving Jaccard similarity.
• Locality-sensitive hashing (LSH):
  o Focus on candidate pairs that are likely to come from similar documents, to break the O(n^2) barrier.

Framework picture

Doc → Shingling → Sets of strings of length k that appear in the doc
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity

Documents as high-dimensional data
• Motivation:
  o A: A rose is red, a rose is white.
  o B: A rose is white, a rose is red.
  o C: A rose is a rose is a rose.
• Simple approaches:
  o Document = set of words in the document.
  o Document = set of "important" words in the document.
  o These do not work well: we need to take the order of words into account.

Shingling: Convert documents to sets
• A k-shingle is a sequence of k tokens (e.g. characters, words) that appears in the document.
• Document = the set of its k-shingles.
• Example with k = 3 (word shingles):
  o A: A rose is red, a rose is white.
    A' = {a rose is, rose is red, is red a, red a rose, rose is white}.
  o B: A rose is white, a rose is red.
    B' = {a rose is, rose is white, is white a, white a rose, rose is red}.
  o C: A rose is a rose is a rose.
    C' = {a rose is, rose is a, is a rose}.
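A minimal sketch of word-level shingling (the tokenization and punctuation handling are illustrative choices, not the course's reference code):

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-shingles (k consecutive words) of a document."""
    # Lowercase and strip punctuation so that "red," and "red" match.
    words = [w.strip(".,!?") for w in text.lower().split()]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

A = "A rose is red, a rose is white."
C = "A rose is a rose is a rose."
print(shingles(A))  # {'a rose is', 'rose is red', 'is red a', 'red a rose', 'rose is white'}
print(shingles(C))  # {'a rose is', 'rose is a', 'is a rose'}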

Compressing shingles
• To compress long shingles, we can hash each of them to a 4-byte integer.
• Represent a document as the set of hash values (integers) of its k-shingles.
• Example with k = 3:
  o A' = {a rose is, rose is red, is red a, red a rose, rose is white}
    A' = {1, 3, 8, 5, 7}.
  o B' = {a rose is, rose is white, is white a, white a rose, rose is red}
    B' = {1, 7, 9, 2, 3}.
  o C' = {a rose is, rose is a, is a rose}
    C' = {1, 4, 6}.
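A possible realization in Python; zlib.crc32 is an arbitrary choice of 32-bit hash here, and the slide's small IDs (1, 3, 8, ...) are purely illustrative:

import zlib

def shingle_ids(shingles: set) -> set:
    """Map each shingle to a 4-byte (32-bit) integer ID."""
    return {zlib.crc32(s.encode("utf-8")) & 0xFFFFFFFF for s in shingles}

A_ids = shingle_ids({"a rose is", "rose is red", "is red a",
                     "red a rose", "rose is white"})
print(A_ids)  # five 32-bit integers representing document A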

Working assumption
• If documents are similar, they share many shingles and hence have a high Jaccard similarity.
• We should pick k large enough to differentiate documents:
  o k = 5 for short documents.
  o k = 10 for long documents.

Framework picture

Doc → Shingling → Sets of strings of length k that appear in the doc
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity

Encoding a set as bit vectors
• Encoding for ease of presenting the MinHash idea:
  o Encode a set as a binary vector.
  o Each dimension is an element of the universal set.
• Examples over the universal set U = {1, 2, 3, 4, 5, 6, 7}:
  o A = {1, 2, 5, 6, 7} → A = 1100111
  o B = {1, 2, 3, 6} → B = 1110010
  o C = {1, 6, 7} → C = 1000011

From sets to a Boolean matrix
• A = {1, 2, 5, 6, 7}, B = {1, 2, 3, 6}, C = {1, 6, 7}, D = {2, 3, 4, 5}
• Rows are shingles, columns are documents:

Shingle | A B C D
      1 | 1 1 1 0
      2 | 1 1 0 1
      3 | 0 1 0 1
      4 | 0 0 0 1
      5 | 1 0 0 1
      6 | 1 1 1 0
      7 | 1 0 1 0
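A small sketch that builds this Boolean matrix from the four sets (rows are shingles 1-7):

sets = {"A": {1, 2, 5, 6, 7}, "B": {1, 2, 3, 6},
        "C": {1, 6, 7},       "D": {2, 3, 4, 5}}
universe = range(1, 8)  # shingle IDs 1..7

# matrix[r][c] = 1 iff shingle row belongs to document c
docs = list(sets)
matrix = [[1 if row in sets[d] else 0 for d in docs] for row in universe]
for row, bits in zip(universe, matrix):
    print(row, bits)  # e.g. 1 [1, 1, 1, 0]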

Min-Hashing vs. general hashing
• General hashing: the sets A, B, C, D are scattered across buckets with no regard to their similarity.
• Min-hashing: sets with high Jaccard similarity (A, B, C) land in the same bucket with high probability, while the dissimilar set D lands elsewhere.

MinHash’s idea
• MinHash: hashes sets with high Jaccard similarity (A, B, C) into the same bucket, and the dissimilar set D into a different bucket.
• Algorithm's idea:
  o Use MinHash to hash each set/document into a hash table.
  o Only compute the Jaccard similarity of pairs of sets that fall into the same bucket.

MinHash
• Goal: find a hash function h such that:
  o If J(S, T) is high, then with high probability h(S) = h(T) (collision).
  o If J(S, T) is low, then with high probability h(S) ≠ h(T) (no collision).
• MinHash for the Jaccard similarity:
  Pr[h(S) = h(T)] = J(S, T)

MinHash's signature
• Permute the rows of the Boolean matrix with a random permutation π.
• MinHash function:
  o h(S) = the index of the first row (in the permuted order π) in which column S has value 1.
  o Equivalently, h(S) = min π(S), the minimum of π over the elements of S.
• Use several independent MinHash functions (i.e. several permutations π) to construct the signature of each set.
(The Boolean matrix shown is the same shingle-document matrix as before.)

Example
• Take the Boolean matrix above and a random permutation π1 = (4, 2, 7, 1, 6, 3, 5), where π1(i) is the position of row i in the permuted order:

π1 | Shingle | A B C D
 4 |    1    | 1 1 1 0
 2 |    2    | 1 1 0 1
 7 |    3    | 0 1 0 1
 1 |    4    | 0 0 0 1
 6 |    5    | 1 0 0 1
 3 |    6    | 1 1 1 0
 5 |    7    | 1 0 1 0

• Scanning the rows in the order given by π1, the first row with a 1 in a column is that column's MinHash value:
  o h1(A) = 2, h1(B) = 2, h1(C) = 3, h1(D) = 1.
  o For column C, the 3rd element of the permutation (row 6) is the first to map to a 1.
• This gives the first row of the signature matrix M: (2, 2, 3, 1).

Example
• Add two more independent permutations, π2 = (2, 3, 7, 6, 1, 5, 4) and π3 = (3, 4, 7, 2, 6, 1, 5):

π2 π3 π1 | Shingle | A B C D
 2  3  4 |    1    | 1 1 1 0
 3  4  2 |    2    | 1 1 0 1
 7  7  7 |    3    | 0 1 0 1
 6  2  1 |    4    | 0 0 0 1
 1  6  6 |    5    | 1 0 0 1
 5  1  3 |    6    | 1 1 1 0
 4  5  5 |    7    | 1 0 1 0

• Each permutation contributes one row of the signature matrix M:

     A B C D
     2 2 3 1   (min of π1 over each set)
     1 1 1 2   (min of π3 over each set)
     1 2 2 1   (min of π2 over each set)
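A short sketch, assuming the sets and the three permutations above, that reproduces the signature matrix; each permutation is stored as a list pi with pi[i - 1] = position of row i in the permuted order:

def minhash(s: set, pi: list) -> int:
    """h_pi(S) = min over elements i of S of pi(i)."""
    return min(pi[i - 1] for i in s)

sets = {"A": {1, 2, 5, 6, 7}, "B": {1, 2, 3, 6},
        "C": {1, 6, 7},       "D": {2, 3, 4, 5}}
pi1 = [4, 2, 7, 1, 6, 3, 5]
pi3 = [3, 4, 7, 2, 6, 1, 5]
pi2 = [2, 3, 7, 6, 1, 5, 4]

for pi in (pi1, pi3, pi2):
    print([minhash(sets[d], pi) for d in "ABCD"])
# [2, 2, 3, 1]
# [1, 1, 1, 2]
# [1, 2, 2, 1]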

MinHash property
• Claim: Pr[h(S) = h(T)] = J(S, T)
• Observation: after the permutation, look down the columns S and T; each row is one of four types:

  type     S T
  a-type   1 1
  b-type   1 0
  c-type   0 1
  d-type   0 0

• Scanning down in the permuted order, d-type rows affect neither h(S) nor h(T), so the first non-d-type row decides the outcome:
  o If it is an a-type row: h(S) = h(T).
  o If it is a b-type or c-type row: h(S) ≠ h(T).
• Hence Pr[h(S) = h(T)] = #a-type / (#a-type + #b-type + #c-type) = J(S, T).

Similarity of signatures
• MinHash property: Pr[h(S) = h(T)] = J(S, T)
• We use d MinHash functions (i.e. d random permutations) to build a signature of size d for each set.
• The similarity of two signatures = the fraction of hash functions on which they agree (the number of collisions / d).
• The Jaccard similarity of two sets (columns) equals the expected similarity of their signatures.
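A one-line estimator for the signature similarity (fraction of agreeing positions), shown on columns A and B of the example signature matrix:

def sig_similarity(sig_s: list, sig_t: list) -> float:
    """Fraction of MinHash functions on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)

# Columns A and B of the example signature matrix M:
print(sig_similarity([2, 1, 1], [2, 1, 2]))  # 0.666...: agrees on 2 of 3 rows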

Example
• Signature matrix M from the previous example:

     A B C D
     2 2 3 1
     1 1 1 2
     1 2 2 1

• Similarities:

          A-B   C-D
  Col/Col 0.5   0
  Sig/Sig 0.67  0

MinHash signatures
• By picking d = 100 random permutations, we can represent each document as a short integer vector:
  o The corresponding column of the signature matrix M.
  o It preserves the pairwise Jaccard similarity.
  o The signature is very small, i.e. ~100 bytes.
• Important implementation note:
  o Permuting the rows even once is very expensive.
  o Use a one-pass implementation instead (reading: Section 3.3.5, Chapter 3 of Mining of Massive Datasets).

One-pass MinHash signatures
• Idea: for each column S and each universal hash function hi, keep a "slot" sig(S)[i] for the min-hash value.
• Algorithm:
  o Initialize all sig(S)[i] = ∞.
  o Scan the rows looking for 1s. Suppose row j has a 1 in column S; then for each hi:
    - If hi(j) < sig(S)[i], then sig(S)[i] ← hi(j).
• How to pick a universal hash function h(x)?
  o h_{a,b}(x) = ((a·x + b) mod p) mod n, where 1 ≤ a < p, 0 ≤ b < p, and p is a large prime > n.
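A compact sketch of the one-pass algorithm; the prime p, the seed, and d = 100 are illustrative choices:

import random

def minhash_signatures(sets: dict, n_rows: int, d: int = 100,
                       p: int = 2_147_483_647, seed: int = 42) -> dict:
    """One-pass MinHash: sig(S)[i] = min over rows j in S of h_i(j)."""
    rng = random.Random(seed)
    # d universal hash functions h_{a,b}(x) = ((a*x + b) mod p) mod n_rows
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    sig = {s: [float("inf")] * d for s in sets}
    for j in range(1, n_rows + 1):           # scan rows looking for 1s
        hs = [((a * j + b) % p) % n_rows for a, b in coeffs]
        for s, members in sets.items():
            if j in members:                 # row j has a 1 in column s
                sig[s] = [min(old, h) for old, h in zip(sig[s], hs)]
    return sig

sets = {"A": {1, 2, 5, 6, 7}, "B": {1, 2, 3, 6},
        "C": {1, 6, 7},       "D": {2, 3, 4, 5}}
sig = minhash_signatures(sets, n_rows=7, d=100)
# Estimated J(A, B): fraction of agreeing positions (true value is 0.5).
print(sum(x == y for x, y in zip(sig["A"], sig["B"])) / 100)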

Exercise (section 3.3.5)
• Compute the MinHash signatures using the two hash functions h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5.

Framework picture

Doc → Shingling → Sets of strings of length k that appear in the doc
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity

Motivation of MinHash/LSH
• n = 1 million documents.
• A naïve solution needs to compute n(n - 1)/2 ≈ 5×10^11 Jaccard similarities.
• If a standard PC computes 10^6 Jaccard similarities per second, we need about 5×10^5 seconds, i.e. approximately 5 days.
• Nowadays, the web has n ≈ 4.5 billion pages.

LSH: First cut
• Input:
  o The MinHash signature matrix M.
• Goal:
  o Find all pairs of documents with Jaccard similarity ≥ s = 0.8.
• LSH idea:
  o The higher the Jaccard similarity, the larger the fraction of identical hash values.
  o A pair of signatures is a candidate (for which we compute the actual Jaccard similarity) if their hash values agree on at least a fraction r of positions.
• Problem:
  o Choose r such that we do not have many false negatives (pairs with J ≥ 0.8 that are not candidates) or false positives (pairs with J < 0.8 that are candidates).

Naïve solution and analysis
• r = 1:
  o J(S, T) = 0.8 (should be retrieved): false negatives
    - Probability that (S, T) is a candidate: 0.8
    - Probability that (S, T) is not a candidate: 1 - 0.8 = 0.2
  o J(S', T') = 0.5 (should not be retrieved): false positives
    - Probability that (S', T') is a candidate: 0.5
    - Probability that (S', T') is not a candidate: 1 - 0.5 = 0.5
• r = 5 (all five values must agree):
  o J(S, T) = 0.8 (should be retrieved): false negatives
    - Probability that (S, T) is a candidate: 0.8^5 = 0.328
    - Probability that (S, T) is not a candidate: 1 - 0.8^5 = 0.672
  o J(S', T') = 0.5 (should not be retrieved): false positives
    - Probability that (S', T') is a candidate: 0.5^5 = 0.031
    - Probability that (S', T') is not a candidate: 1 - 0.5^5 = 0.969

Partition M into b bands
• (Figure: the signature matrix M is divided into b bands with r rows per band; each column is one signature.)
• M contains b bands; each band has r rows.
• For each band, hash its r rows of each column into a hash table (using ordinary hash functions) with very many buckets.
  o The collision probability of two different r-row segments is tiny.
• For each band, candidates are the column pairs that hash to the same bucket.
• Candidate pairs might share a bucket in more than one band.
• Tune b and r to return most similar pairs but few dissimilar pairs.

Partition M into b bands
• (Figure: columns S1 … S7 of M hashed band by band into buckets.)
• Columns S2 and S6 are probably identical (candidate pair).
• Columns S6 and S7 are surely different.

Simplifying assumption
• The number of buckets is sufficiently large that two different columns hash into different buckets.
• Hence, "same bucket" means "identical in that band".
• The assumption simplifies the analysis; it is not needed for the correctness of the algorithm.

Analysis for J(S, T) = 0.8
• Input: divide a matrix M of d = 100 rows into b = 20 bands with r = 5 rows each.
• Goal: (S, T) becomes a candidate, i.e. S and T are hashed into the same bucket for at least one band.
• Analysis:
  o Probability that (S, T) are identical in one band: 0.8^5 = 0.328
  o Probability that (S, T) are not identical in one band: 1 - 0.8^5 = 0.672
  o Probability that (S, T) are not identical in all 20 bands: (1 - 0.8^5)^20 = 0.00035
  o Probability that (S, T) are identical in at least one band: 1 - 0.00035 = 0.99965
• Conclusion:
  o About 1/3000 of the 80%-similar column pairs are false negatives (not returned).
  o We find 99.965% of the truly similar pairs of documents.

Analysis for J(S, T) = 0.3
• Input: divide a matrix M of d = 100 rows into b = 20 bands with r = 5 rows each.
• Goal: (S, T) is not a candidate, i.e. S and T are not hashed into the same bucket for any band.
• Analysis:
  o Probability that (S, T) are identical in one band: 0.3^5 = 0.00243
  o Probability that (S, T) are not identical in one band: 1 - 0.3^5 = 0.99757
  o Probability that (S, T) are not identical in all 20 bands: (1 - 0.3^5)^20 = 0.9525
  o Probability that (S, T) is a candidate: 1 - 0.9525 = 0.0475
• Conclusion:
  o Approximately 4.75% of the 30%-similar column pairs are false positives (candidate pairs).
  o We will not report them, since we compute the actual Jaccard similarity of every candidate.

Tradeoff of LSH
• Parameter settings balance the false negatives/positives:
  o The number of MinHash functions d used to construct the signature matrix M.
  o The number of bands b.
  o The number of rows r per band.
• Exercise: if we set b = 15 and r = 5 rows:
  o What is the probability that an 80%-similar pair is not a candidate?
  o What is the probability that a 30%-similar pair is a candidate?
  o False negatives increase, false positives decrease.
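A minimal sketch of the banding technique; using a Python tuple of the band's values as the bucket key is an illustrative shortcut for hashing the band:

from collections import defaultdict
from itertools import combinations

def lsh_candidates(sigs: dict, b: int, r: int) -> set:
    """Return candidate pairs: columns whose signatures agree in >= 1 band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in sigs.items():
            # Hash the band's r rows; a tuple key stands in for a real hash.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

# Tiny illustration: d = 4, b = 2 bands, r = 2 rows per band.
sigs = {"A": [2, 1, 1, 3], "B": [2, 1, 2, 3], "C": [5, 4, 2, 3]}
print(lsh_candidates(sigs, b=2, r=2))  # {('A', 'B'), ('B', 'C')}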
General analysis for J(S, T) = t
• Parameters: b bands and r rows per band.
• Analysis:
  o Probability that (S, T) are identical in one band: t^r
  o Probability that (S, T) are not identical in one band: 1 - t^r
  o Probability that (S, T) are not identical in any of the b bands, i.e. (S, T) is not a candidate: (1 - t^r)^b
  o Probability that (S, T) is a candidate: 1 - (1 - t^r)^b

The ideal case: what we want
• (Figure: probability of sharing a bucket vs. the similarity t = sim(S, T) of two sets.)
• Ideally the probability is a step function at the similarity threshold s: no chance if t < s, probability 1 if t > s.
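A quick numeric check of 1 - (1 - t^r)^b that reproduces the numbers from the two analyses above (b = 20, r = 5), plus the threshold approximation s ~ (1/b)^(1/r):

def p_candidate(t: float, b: int, r: int) -> float:
    """Probability that a pair with Jaccard similarity t becomes a candidate."""
    return 1 - (1 - t**r) ** b

print(p_candidate(0.8, b=20, r=5))  # 0.99964... (few false negatives)
print(p_candidate(0.3, b=20, r=5))  # 0.0474...  (few false positives)
print((1 / 20) ** (1 / 5))          # ~0.55: the threshold s ~ (1/b)^(1/r)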

b = 1, r = 1
• (Figure: the probability of sharing a bucket grows linearly with the similarity t = sim(S, T) of two sets.)
• Remember: the probability of equal MinHash values equals the similarity, so with one band of one row the curve is the straight line p = t.

General b and r
• Probability of sharing a bucket: 1 - (1 - t^r)^b, as a function of the similarity t = sim(S, T) of two sets.
• Reading the formula from the inside out:
  o t^r: all rows of a band are equal.
  o 1 - t^r: some row of the band is unequal.
  o (1 - t^r)^b: no band is identical.
  o 1 - (1 - t^r)^b: at least one band is identical.
• The threshold of the resulting S-curve is at s ~ (1/b)^(1/r).

General b and r form the S-curve
• Pick r and b to get the best S-curve.
• Example: b = 10, r = 5, d = 50.
• (Figure: probability of sharing a bucket vs. similarity; the blue area is the false negative rate, the green area is the false positive rate.)

LSH summary
• Tune b and r such that:
  o False negatives decrease: we get almost all pairs with similar signatures.
  o False positives decrease: we eliminate most pairs that do not have similar signatures.
• Check in main memory that candidate pairs really do have similar signatures (i.e. estimate their Jaccard similarity).
• Optional: in another pass through the data, compute the actual Jaccard similarity of the candidate pairs to return the final result.

Summary of 3 steps
• Shingling: convert documents to sets.
  o We used hashing to assign each shingle an integer ID.
• Min-Hashing: convert large sets to short signatures, while preserving the Jaccard similarity.
  o We used similarity-preserving hashing to generate signatures with the property Pr[h(S) = h(T)] = J(S, T).
  o We used hashing to get around generating random permutations (the efficient one-pass implementation).
• Locality-Sensitive Hashing: focus on pairs of signatures likely to come from similar documents.
  o We used hashing to find candidate pairs with Jaccard similarity ≥ s.