Finding similar items: Locality-sensitive Hashing
COMPCSI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides from
http://mmds.org
Designed for the textbook Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
Auckland, Aug 4, 2020
Recap: Universal hashing
Given a random choice of a hash function h from a universal family, h: U → {0, 1, …, n−1}:
Our dictionary needs O(n) space and O(1) expected time per operation (search, insert, delete).
The expectation is over the random choice of the hash function, independent of the input set.
Outline
Similarity search applications
Finding similar documents:
Shingling
MinHash
Locality-sensitive hashing (LSH)
Scene completion problem
Problem:
Patch up holes in an image by finding similar image regions from a huge image database on the Web.
Source: http://graphics.cs.cmu.edu/projects/scene-completion/
Scene completion problem
10 nearest neighbors from a collection of 20,000 images
10 nearest neighbors from a collection of 2 million images
Other similarity search applications
Web crawlers need to detect near-duplicate web pages (mirror pages).
Source: https://medium.com/@promptcloud/data-crawling-real-time-bidding-concept-mechanism-807ad8c0e0c7
Collaborative-filtering recommender systems require finding similar users or items.
Source: https://dataaspirant.com/2015/05/25/collaborative-filtering-recommendation-engine-implementation-in-python/
A common metaphor
Many problems can be cast as finding “similar” items, i.e. finding near-neighbors in high dimensions.
More examples in machine learning:
k-nearest neighbor models (classification/regression/outlier detection)
Source: https://www.python-course.eu/k_nearest_neighbor_classifier.php
Clustering
Source: https://data-flair.training/blogs/clustering-in-r-tutorial/
General problem for today’s lecture
Input:
A data set X of n points in a high-dimensional space.
A similarity function sim(x, y) and a similarity threshold s.
Goal:
Find all similar pairs xi, xj in X such that sim(xi, xj) ≥ s.
Solutions:
Naïve solution: O(n^2)
Magic: this can be done in O(n)!
Jaccard similarity
The Jaccard similarity of two sets S and T is the ratio of their intersection size to their union size:
J(S, T) = |S ∩ T| / |S ∪ T|
Example:
4 items in the intersection, 8 items in the union
J(S, T) = 4/8 = 1/2
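As a quick illustration, the definition above translates directly into code. A minimal sketch (the function name `jaccard` is mine, not from the slides):

```python
def jaccard(s, t):
    """Jaccard similarity J(S, T) = |S ∩ T| / |S ∪ T|."""
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0          # convention for two empty sets
    return len(s & t) / len(s | t)

# The example above: 4 items in the intersection, 8 in the union.
print(jaccard({1, 2, 3, 4, 5, 6}, {3, 4, 5, 6, 7, 8}))  # -> 0.5
```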
Finding similar documents
Input:
n documents (e.g. web pages)
The Jaccard similarity function J and a similarity threshold s.
Goal:
Find all near-duplicate pairs S, T such that J(S, T) ≥ s.
Problems:
Many small pieces of a document (e.g. words, phrases…) can appear out of order in another.
n documents are too large to fit into main memory.
Algorithmic challenge: O(n^2) pairs of documents to compare.
Outline
Similarity search applications
Finding similar documents:
Shingling
MinHash
Locality-sensitive hashing (LSH)
3 essential steps
Shingling: convert documents to sets.
MinHash: convert large sets to short signatures, while preserving Jaccard similarity.
Locality-sensitive hashing (LSH): focus on candidate pairs that are likely to come from similar documents, to break the O(n^2) barrier.
Framework picture
Doc
→ Sets: the set of strings of length k that appear in the doc (shingling)
→ Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity (MinHash)
→ Locality-Sensitive Hashing
→ Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity
Documents as high-dimensional data
Motivation:
A: A rose is red, a rose is white.
B: A rose is white, a rose is red.
C: A rose is a rose is a rose.
Simple approaches:
Document = set of words in the document.
Document = set of “important” words in the document.
These do not work well; we must take the order of words into account.
Shingling: Convert documents to sets
A k-shingle is a sequence of k consecutive tokens (e.g. characters, words) in the document.
Document = set of its k-shingles.
Example with k = 3 (word shingles):
A: A rose is red, a rose is white.
A’ = {a rose is, rose is red, is red a, red a rose, rose is white}.
B: A rose is white, a rose is red.
B’ = {a rose is, rose is white, is white a, white a rose, rose is red}.
C: A rose is a rose is a rose.
C’ = {a rose is, rose is a, is a rose}.
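The word-shingling step above can be sketched as follows. The normalisation (lower-casing, stripping punctuation) is my assumption, chosen so that the output matches the slide’s example for document A:

```python
def word_shingles(text, k=3):
    # Lower-case and strip punctuation so "red," and "red" match,
    # mirroring the normalisation implicit in the slide's example.
    words = text.lower().replace(',', ' ').replace('.', ' ').split()
    # Every run of k consecutive words is one shingle; duplicates collapse.
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

a = word_shingles("A rose is red, a rose is white.")
# a == {'a rose is', 'rose is red', 'is red a', 'red a rose', 'rose is white'}
```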
Compressing shingles
To compress long shingles, we can hash them to 4-byte integers.
Represent a document as the set of hash values (integers) of its k-shingles.
Example with k = 3:
A’ = {a rose is, rose is red, is red a, red a rose, rose is white}.
A’ = {1, 3, 8, 5, 7}.
B’ = {a rose is, rose is white, is white a, white a rose, rose is red}.
B’ = {1, 7, 9, 2, 3}.
C’ = {a rose is, rose is a, is a rose}.
C’ = {1, 4, 6}.
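One way to sketch the compression step: any well-mixed hash truncated to 4 bytes works; using MD5 here is my choice, not something the slides prescribe.

```python
import hashlib

def shingle_id(shingle):
    # Map a shingle to a 4-byte integer: take the first 4 bytes of its MD5.
    # (MD5 is just a convenient stand-in for a good string hash.)
    return int.from_bytes(hashlib.md5(shingle.encode('utf-8')).digest()[:4], 'big')

doc = {'a rose is', 'rose is red', 'is red a', 'red a rose', 'rose is white'}
ids = {shingle_id(s) for s in doc}   # the document as a set of integers
```

The same shingle always gets the same ID, so shared shingles across documents become shared integers, preserving the Jaccard similarity (up to rare 32-bit collisions).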
Working assumption
If documents are similar, they will share many common shingles and hence have a high Jaccard similarity.
We should pick k large enough to differentiate documents:
k = 5 for short documents.
k = 10 for long documents.
Framework picture
Doc
→ Sets: the set of strings of length k that appear in the doc (shingling)
→ Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity (MinHash)
→ Locality-Sensitive Hashing
→ Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity
Encoding a set as bit vectors
Encoding for ease of presenting the MinHash idea: encode a set as a binary vector.
Each dimension is an element of the universal set.
Examples with universal set U = {1, 2, 3, 4, 5, 6, 7}:
A = {1, 2, 5, 6, 7} → A = 1100111
B = {1, 2, 3, 6} → B = 1110010
C = {1, 6, 7} → C = 1000011
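This encoding is a one-liner; a small sketch (function name mine) reproducing the examples:

```python
def bit_vector(s, universe):
    # One bit per element of the universal set, in sorted order.
    return ''.join('1' if x in s else '0' for x in sorted(universe))

U = range(1, 8)
print(bit_vector({1, 2, 5, 6, 7}, U))  # -> 1100111
print(bit_vector({1, 2, 3, 6}, U))     # -> 1110010
```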
From sets to a Boolean matrix
A = {1, 2, 5, 6, 7}
B = {1, 2, 3, 6}
C = {1, 6, 7}
D = {2, 3, 4, 5}
Rows = shingles, columns = documents:
     A B C D
1:   1 1 1 0
2:   1 1 0 1
3:   0 1 0 1
4:   0 0 0 1
5:   1 0 0 1
6:   1 1 1 0
7:   1 0 1 0
Min-Hashing vs. general hashing
General hashing: the sets A, B, C, D are scattered over the buckets regardless of their similarity.
Min-hashing: sets with high Jaccard similarity (A, B, C) tend to land in the same bucket, while a set with low Jaccard similarity to them (D) lands elsewhere.
MinHash’s idea
MinHash: sets with high Jaccard similarity (A, B, C) collide in the same bucket; a set with low Jaccard similarity to them (D) does not.
Algorithm’s idea:
Use MinHash to hash each set/document into a hash table.
Only compute the Jaccard similarity of pairs of sets in the same bucket.
MinHash
Find a hash family H such that, for a randomly selected h in H:
If J(S, T) is high, then with high probability h(S) = h(T) (collision).
If J(S, T) is low, then with high probability h(S) ≠ h(T) (no collision).
MinHash achieves exactly this for the Jaccard similarity: Pr[h(S) = h(T)] = J(S, T)
MinHash’s signature
Permute the rows of the Boolean matrix with a random permutation π.
MinHash function:
h(S) = the index of the first row (in the permuted order π) whose value in column S is 1, i.e.
h(S) = min π(S)
Use several independent MinHash functions (i.e. independent permutations π) to construct the signature of each set.
Example
Take the Boolean matrix above and three random permutations π1, π2, π3 of the 7 rows, written as columns (each value is the row’s position in the permuted order):

π1 π2 π3 | row | A B C D
 4  3  2 |  1  | 1 1 1 0
 2  4  3 |  2  | 1 1 0 1
 7  7  7 |  3  | 0 1 0 1
 1  2  6 |  4  | 0 0 0 1
 6  6  1 |  5  | 1 0 0 1
 3  1  5 |  6  | 1 1 1 0
 5  5  4 |  7  | 1 0 1 0

For each permutation π, h(S) is the smallest permuted position among the rows where column S has a 1. For example, under π1 the minimum over the rows of C = {1, 6, 7} is min{4, 3, 5} = 3, attained at row 6: the row mapped to position 3 is the first (in permuted order) that contains a 1 in column C.

Signature matrix M (one row per permutation):
     A B C D
π1:  2 2 3 1
π2:  1 1 1 2
π3:  1 2 2 1
MinHash property
Claim: Pr[h(S) = h(T)] = J(S, T)
Observation after the permutation: look down the columns S and T and classify each row:
a-type: 1 1
b-type: 1 0
c-type: 0 1
d-type: 0 0
If the first non-d-type row we meet is a-type, then h(S) = h(T).
If it is b-type or c-type, then h(S) ≠ h(T).
Since the permutation is uniformly random, every non-d-type row is equally likely to come first, so
Pr[h(S) = h(T)] = #(a-type) / (#(a-type) + #(b-type) + #(c-type)) = J(S, T)
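The claim can be checked empirically by sampling random permutations and counting collisions. A small simulation sketch (names are mine), using sets A and B from the running example, where J(A, B) = 3/6 = 0.5:

```python
import random

def minhash(s, perm):
    # perm maps each element of the universe to its permuted position.
    return min(perm[x] for x in s)

def collision_rate(S, T, universe, trials=20000, seed=7):
    rng = random.Random(seed)
    elems = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(elems)                        # draw a random permutation
        perm = {x: i for i, x in enumerate(elems)}
        hits += (minhash(S, perm) == minhash(T, perm))
    return hits / trials

# A = {1,2,5,6,7}, B = {1,2,3,6}: J(A, B) = |{1,2,6}| / |{1,2,3,5,6,7}| = 0.5
rate = collision_rate({1, 2, 5, 6, 7}, {1, 2, 3, 6}, range(1, 8))
```

With 20,000 trials the empirical collision rate lands very close to 0.5, as the property predicts.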
Similarity of signatures
MinHash property: Pr[h(S) = h(T)] = J(S, T)
We use d MinHash functions (i.e. d random permutations) to build a signature of size d for each set.
The similarity of two signatures = the fraction of hash functions on which they agree (the number of collisions / d).
The expected similarity of two signatures equals the Jaccard similarity of the corresponding sets (columns).
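The signature-similarity estimate is a one-line computation; a minimal sketch (function name mine), using columns A and B of the example signature matrix:

```python
def signature_similarity(sig_s, sig_t):
    # Fraction of MinHash functions on which the two signatures agree.
    agree = sum(a == b for a, b in zip(sig_s, sig_t))
    return agree / len(sig_s)

# Columns A and B of the example matrix (d = 3): they agree on 2 of 3 rows.
print(signature_similarity([2, 1, 1], [2, 1, 2]))
```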
Example
Signature matrix M (columns A, B, C, D):
2 2 3 1
1 1 1 2
1 2 2 1
Similarities:
         A-B    C-D
Col/Col  0.5    0
Sig/Sig  0.67   0
MinHash signatures
Picking d = 100 random permutations, we can represent each document as a short integer vector:
the corresponding column of the signature matrix M.
It preserves the pairwise Jaccard similarity (in expectation).
The signature is very small, i.e. ~400 bytes (100 4-byte integers).
Important implementation note:
Permuting the rows even once is very expensive.
One-pass implementation (see Section 3.3.5, Chapter 3 of Mining of Massive Datasets).
One-pass MinHash signatures
Idea: for each column S and each universal hash function hi, keep a “slot” sig(S)[i] for the min-hash value.
Algorithm:
Initialize all sig(S)[i] = ∞.
Scan the rows looking for 1s:
o Suppose row j has a 1 in column S.
o Then for each hi:
• If hi(j) < sig(S)[i], then sig(S)[i] ← hi(j).
How to pick a universal hash function h(x)?
ha,b(x) = ((a*x + b) mod p) mod n, where 0 ≤ a, b < p and p is a large prime > n.
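The algorithm above can be sketched as follows. The function name, the choice of prime, and the random seeding are my assumptions; the structure follows the slide’s pseudocode:

```python
import random

def minhash_signatures(columns, n, d, seed=0):
    """One-pass MinHash signatures.

    columns: dict mapping column name -> set of row ids in {0, ..., n-1}
    n: number of rows; d: number of hash functions (signature length).
    """
    p = 2_147_483_647                      # a large prime > n (2^31 - 1)
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(d)]
    sig = {name: [float('inf')] * d for name in columns}
    for j in range(n):                     # scan the rows once
        hs = [((a * j + b) % p) % n for a, b in coeffs]   # h_i(j) for every i
        for name, rows in columns.items():
            if j in rows:                  # row j has a 1 in this column
                slot = sig[name]
                for i, h in enumerate(hs):
                    if h < slot[i]:        # keep the minimum hash value seen
                        slot[i] = h
    return sig
```

The fraction of positions on which two signatures agree then estimates the Jaccard similarity of the underlying sets.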
Exercise (Section 3.3.5)
Compute the MinHash signatures using the two hash functions h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5.
Framework picture
Doc
→ Sets: the set of strings of length k that appear in the doc (shingling)
→ Signatures: short integer vectors that represent the sets, and reflect their Jaccard similarity (MinHash)
→ Locality-Sensitive Hashing
→ Candidate pairs: those pairs of signatures for which we need to compute the Jaccard similarity
Motivation of MinHash/LSH
n = 1 million documents.
A naïve solution needs to compute n(n – 1)/2 ≈ 5×10^11 Jaccard similarities.
If a standard PC computes 10^6 Jaccard similarities/sec, we need about 5×10^5 seconds ≈ 5.8 days.
Nowadays, the web has n ≈ 4.5 billion pages.
LSH
Input:
The MinHash signature matrix M, e.g. with columns E, F, G, H:
2 2 3 1
1 1 1 2
1 2 2 1
Goal:
Find all pairs of documents with Jaccard similarity ≥ s = 0.8
LSH: Observation
LSH idea: the higher the Jaccard similarity, the larger the fraction of identical hash values in the signatures.
A pair of signatures is a candidate (for which we then compute the actual Jaccard similarity) if their hash values agree on at least a fraction r of the rows.
Similarities (signature matrix M as above):
       Actual J   Est. J
E-F    0.8        0.67
E-G    0.4        0.33
Problem:
Choose r such that we do not have many false negatives (pairs with J ≥ 0.8 that are not candidates) or false positives (pairs with J < 0.5 that are candidates).
Naïve solution and analysis
r = 1:
J(S, T) = 0.8 (should be a candidate)
o Probability that (S, T) is a candidate: 0.8
o Probability that (S, T) is not a candidate (false negative): 1 – 0.8 = 0.2
J(S’, T’) = 0.5 (should not be a candidate)
o Probability that (S’, T’) is a candidate (false positive): 0.5
o Probability that (S’, T’) is not a candidate: 1 – 0.5 = 0.5
r = 5 (all 5 rows must agree):
J(S, T) = 0.8 (should be a candidate)
o Probability that (S, T) is a candidate: 0.8^5 = 0.328
o Probability that (S, T) is not a candidate (false negative): 1 – 0.8^5 = 0.672
J(S’, T’) = 0.5 (should not be a candidate)
o Probability that (S’, T’) is a candidate (false positive): 0.5^5 = 0.031
o Probability that (S’, T’) is not a candidate: 1 – 0.5^5 = 0.969
Partition M into b bands
Partition the signature matrix M row-wise into b bands of r rows each; each column contributes one small signature of r values per band.
For each band, hash the r rows of each column into a hash table (using a general hash function) with a very large number of buckets, so that the collision probability of two different r-row segments is tiny.
For each band, the candidate pairs are the column pairs that hash to the same bucket.
Candidate pairs might fall into the same bucket in more than one band.
Tune b and r to return most similar pairs but few dissimilar pairs.
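The banding step above can be sketched as follows (names and the toy signatures are mine). With enough buckets, hashing a band is effectively the same as keying on the band’s r values directly, which is what this sketch does:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict name -> signature list of length b*r.
    Returns the set of candidate pairs (columns identical in >= 1 band)."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Bucket key = the band's r signature values for this column.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(name)
        for names in buckets.values():
            # Every pair of columns sharing a bucket becomes a candidate.
            candidates.update(combinations(sorted(names), 2))
    return candidates

sigs = {'X': [1, 2, 3, 4], 'Y': [1, 2, 9, 9], 'Z': [5, 6, 7, 8]}
pairs = lsh_candidate_pairs(sigs, b=2, r=2)   # X and Y agree on the first band
```

Only the pairs in `pairs` need their Jaccard similarity computed, which is the whole point of LSH.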
Partition M into b bands
Example with columns S1, …, S7 and the buckets of one band of r rows: columns S2 and S6 hash to the same bucket, so they are probably identical in that band (a candidate pair); columns S6 and S7 hash to different buckets, so they are surely different in that band.
Simplifying assumption
The number of buckets is sufficiently large that two different columns (in a band) hash into different buckets.
Hence, “same bucket” means “identical in that band”.
This assumption simplifies the analysis; it is not needed for the correctness of the algorithm.
Analysis for J(S, T) = 0.8
Input: divide a matrix M of d = 100 rows into b = 20 bands of r = 5 rows.
Goal: (S, T) is a candidate, i.e. S and T hash into the same bucket in at least 1 band.
Analysis:
Probability that (S, T) are identical in 1 band: 0.8^5 = 0.328
Probability that (S, T) are not identical in 1 band: 1 – 0.8^5 = 0.672
Probability that (S, T) are not identical in all 20 bands: (1 – 0.8^5)^20 = 0.00035
Probability that (S, T) are identical in at least 1 band: 1 – 0.00035 = 0.99965
Conclusion:
About 1/3000 of the 80%-similar column pairs are false negatives (not returned).
We can find 99.965% of the truly similar pairs of documents.
Analysis for J(S, T) = 0.3
Input: divide a matrix M of d = 100 rows into b = 20 bands of r = 5 rows.
Goal: (S, T) is not a candidate, i.e. S and T do not hash into the same bucket in any band.
Analysis:
Probability that (S, T) are identical in 1 band: 0.3^5 = 0.00243
Probability that (S, T) are not identical in 1 band: 1 – 0.3^5 = 0.99757
Probability that (S, T) are not identical in all 20 bands: (1 – 0.3^5)^20 = 0.9525
Probability that (S, T) is a candidate: 1 – 0.9525 = 0.0475
Conclusion:
Approximately 4.75% of the 30%-similar column pairs are false positives (candidate pairs).
We will not report them, since we compute their actual Jaccard similarity afterwards.
Tradeoff of LSH
Parameter settings to balance false negatives/positives:
The number of MinHash functions d used to construct the signature matrix M
The number of bands b
The number of rows r per band
Exercise:
If we set b = 15, r = 5:
o What is the probability that an 80%-similar pair is not a candidate?
o What is the probability that a 30%-similar pair is a candidate?
o Do false negatives increase and false positives decrease?
General analysis for J(S, T) = t
Parameters: b bands and r rows per band.
Analysis:
Probability that (S, T) are identical in 1 band: t^r
Probability that (S, T) are not identical in 1 band: 1 – t^r
Probability that (S, T) are not identical in all b bands, i.e. not a candidate: (1 – t^r)^b
Probability that (S, T) is a candidate: 1 – (1 – t^r)^b
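The general formula above is easy to evaluate directly, which also reproduces the two analyses for b = 20, r = 5 (function name is mine):

```python
def prob_candidate(t, b, r):
    # Probability that a pair with Jaccard similarity t becomes a candidate:
    # 1 - (1 - t^r)^b, i.e. at least one of the b bands is fully identical.
    return 1 - (1 - t ** r) ** b

print(prob_candidate(0.8, 20, 5))  # ~0.9996: almost no false negatives
print(prob_candidate(0.3, 20, 5))  # ~0.0475: few false positives

# Exercise values: compare prob_candidate(0.8, 15, 5) and
# prob_candidate(0.3, 15, 5) with the b = 20 numbers above.
```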
Ideal case – what we want
Plot the probability of sharing a bucket against the similarity t = J(S, T) of two sets: no chance of sharing a bucket if t < s, and probability 1 if t > s, where s is the similarity threshold.
b = 1, r = 1
The probability of sharing a bucket grows linearly with the similarity t = sim(S, T) of two sets.
Remember: the probability of equal hash values = similarity.
General b and r
The probability of sharing a bucket, i.e. of at least one band being identical, is 1 – (1 – t^r)^b, where:
t^r = probability that all rows of a band are equal
1 – t^r = probability that some row of a band is unequal
(1 – t^r)^b = probability that no band is identical
The threshold where the curve rises steeply is s ≈ (1/b)^(1/r).
General b and r form the S-curve
Pick r and b to get the best S-curve.
Example: b = 10, r = 5, d = 50.
In the plot of the probability of sharing a bucket against the similarity, the blue area is the false negative rate and the green area is the false positive rate.
LSH summary
Tune b and r such that we:
Decrease false negatives: get almost all pairs with similar signatures.
Decrease false positives: eliminate most pairs that do not have similar signatures.
Check in main memory that candidate pairs really do have similar signatures (i.e. estimate the Jaccard similarity).
Optional: in another pass through the data, compute the actual Jaccard similarity of the candidate pairs to return the result.
Summary of 3 steps
Shingling: convert documents to sets.
We used hashing to assign each shingle an integer ID.
Min-Hashing: convert large sets to short signatures, while preserving Jaccard similarity.
We used a similarity-preserving hashing to generate signatures with the property Pr[h(S) = h(T)] = J(S, T).
We used hashing to get around generating random permutations (efficient implementation).
Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents.
We used hashing to find candidate pairs with Jaccard similarity ≥ s.