Approximate Near Neighbor Search: Locality-sensitive Hashing
COMPCSI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Auckland, Aug 10, 2020
Outline
Popular similarity/distance measures
Definitions
Applications
Locality-sensitive hashing framework for approximate near neighbor search
LSH definition
LSH framework for near neighbor search
A common metaphor
We can cast many problems as finding “similar” items, i.e. finding near neighbors in high dimensions.
Near neighbor search:
Source: https://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/
Distance/similarity measure
For each application, we need to define a similarity/distance measure to distinguish between similar and dissimilar objects, i.e. between near and far apart objects.
Popular similarity/distance measures:
Jaccard similarity/distance
Cosine similarity/distance
Hamming distance
Euclidean distance (L2)
Manhattan distance (L1)
Jaccard similarity/distance
Jaccard similarity of two sets S and T: J(S, T) = |S ∩ T| / |S ∪ T|
Jaccard distance of two sets S and T: 1 – J(S, T)
Example:
[Figure: two overlapping sets S and T with 4 items in the intersection and 8 items in the union]
J(S, T) = 1/2
1 – J(S, T) = 1/2
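A minimal Python sketch of this computation; the concrete sets below are hypothetical, chosen only to reproduce the slide's counts (4 items in the intersection, 8 in the union):

```python
def jaccard(s: set, t: set) -> float:
    """Jaccard similarity J(S, T) = |S & T| / |S | T|."""
    return len(s & t) / len(s | t)

S = {1, 2, 3, 4, 5, 6}
T = {3, 4, 5, 6, 7, 8}
print(jaccard(S, T))      # 0.5 -> J(S, T) = 1/2
print(1 - jaccard(S, T))  # 0.5 -> Jaccard distance = 1/2
```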
Applications
Texts/Documents:
Represent texts/documents as sets of shingles
Usage: detect plagiarism, mirror pages, articles from the same source
Recommender system data:
Online-purchases: users as sets, items as set elements
Netflix: users as sets, movies rated as set elements
Usage: collaborative filtering recommender systems compute user-user and item-item similarities for accurate recommendations
Applications
Market basket data:
Items as sets, transaction IDs as set elements.
Usage: association rules X → Y if J(X, Y) ≥ s.
Example:
Bread = {1, 2, 4, 5}
Milk = {1, 3, 4, 5}
Diapers = {2, 3, 4, 5}
Beer = {2, 3, 4}
Eggs = {2}
Cola = {3, 5}
J(Diapers, Beer) = 3/4
MinHash [Broder’97]
Represent the sets as a Boolean matrix: rows are shingles, columns are documents.
Permute the rows of the Boolean matrix with a random permutation π.
MinHash function:
h(S) = the index of the first row with value 1 in column S (in the permuted order), i.e. h(S) = min π(S)
Pr[h(S) = h(T)] = J(S, T)
[Figure: a Boolean shingles-by-documents matrix with rows indexed 1–7 and document columns A, B, C, D]
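A minimal sketch of one MinHash function in Python, over a hypothetical universe of 7 shingles; repeating with many random permutations estimates Pr[h(S) = h(T)], which should approach J(S, T):

```python
import random

def minhash(s: set, perm: dict) -> int:
    """h(S) = index of the first (permuted) row that S contains."""
    return min(perm[e] for e in s)

universe = list(range(7))          # hypothetical shingle universe
S, T = {0, 2, 5}, {0, 3, 5}        # J(S, T) = |{0, 5}| / |{0, 2, 3, 5}| = 1/2

trials, hits = 10_000, 0
order = universe[:]
for _ in range(trials):
    random.shuffle(order)                       # random row permutation
    perm = {e: i for i, e in enumerate(order)}  # element -> permuted row index
    hits += minhash(S, perm) == minhash(T, perm)
print(hits / trials)               # close to 0.5 = J(S, T)
```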
Cosine similarity/distance
Given two points x = (x1, … , xd) and y = (y1, … , yd) in a high-dimensional space:
Cosine similarity: cos(θxy) = (Σi=1..d xi yi) / (‖x‖2 ‖y‖2), where 0 ≤ θxy ≤ π
Cosine distance: 1 – cos(θxy)
[Figure: vectors x, y, z with the angles θxy and θxz between them]
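A small sketch of these formulas with NumPy; the two example points are hypothetical:

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """cos(theta_xy) = sum_i x_i * y_i / (||x||_2 * ||y||_2)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])
print(cosine_similarity(x, y))      # cosine similarity
print(1 - cosine_similarity(x, y))  # cosine distance
```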
Applications
Texts/Documents:
Represent texts/documents as high-dimensional points using term frequency-inverse document frequency (tf-idf), with L2 normalization, i.e. normalizing x as (x1, … , xd) / ‖x‖2.
Usage: information retrieval (text search, topic modeling) and most MinHash applications.
A tf-idf definition for a corpus of n documents:
tf(t, d) = (number of times term t occurs in document d) / (total number of terms in document d)
idf(t) = log2( n / (1 + number of documents where t appears) )
tf-idf(t, d) = tf(t, d) × idf(t)
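A small Python sketch of the definition above; the toy corpus is hypothetical, and the "1 +" in the idf denominator follows the smoothed formula as reconstructed here:

```python
import math
from collections import Counter

def tf(t: str, doc: list) -> float:
    return Counter(doc)[t] / len(doc)

def idf(t: str, corpus: list) -> float:
    df = sum(t in doc for doc in corpus)        # documents where t appears
    return math.log2(len(corpus) / (1 + df))

def tf_idf(t: str, doc: list, corpus: list) -> float:
    return tf(t, doc) * idf(t, corpus)

corpus = [["big", "data", "algorithms"],
          ["data", "mining", "data"],
          ["hashing", "for", "big", "data"]]
print(tf_idf("mining", corpus[1], corpus))      # rarer term -> higher weight
```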
Applications
Images:
Represent an image as a high-dimensional vector, e.g. scale-invariant feature transform (SIFT), deep learning (“image2vec”), …
Usage: content-based image retrieval
Source: https://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/
SimHash [Charikar’02]
Generate a random Gaussian vector r = (r1, … , rd) where ri is drawn from N(0, 1)
SimHash function: hr(x) = 1 if ⟨r, x⟩ ≥ 0, and hr(x) = 0 if ⟨r, x⟩ < 0
Prr[hr(x) = hr(y)] = 1 – θxy/π
[Figure: a random hyperplane with normal vector r separating x and y, which subtend angle θxy]
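A minimal sketch of SimHash with NumPy; the dimension and the points are hypothetical, and the empirical collision rate over many random r should approach 1 – θxy/π:

```python
import numpy as np

def simhash(x: np.ndarray, r: np.ndarray) -> int:
    """h_r(x) = 1 if <r, x> >= 0, else 0."""
    return 1 if r @ x >= 0 else 0

d = 16
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)

trials, hits = 20_000, 0
for _ in range(trials):
    r = rng.standard_normal(d)      # r_i ~ N(0, 1); same r for both points
    hits += simhash(x, r) == simhash(y, r)

theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(hits / trials, 1 - theta / np.pi)   # the two values should be close
```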
Outline
Popular similarity/distance measures
Locality-sensitive hashing framework for approximate near neighbor search
r-near neighbor search
r-near neighbor search (rNNS) problem:
Given a data set S ⊂ R^d, a distance function d(x, y), a distance threshold r, and a query q ∈ R^d, return some point x ∈ S where d(x, q) ≤ r.
[Figure: a ball of radius r around the query q]
Other variants:
k-nearest neighbors search
r-near neighbor reporting
Problem relaxation
Randomized c-approximation rNNS:
Given a data set S ⊂ R^d, a distance function d(x, y), parameters r > 0, c > 1, δ > 0, and a query q ∈ R^d, with probability 1 – δ:
o If there exists x ∈ S with d(x, q) ≤ r, return some point x’ ∈ S where d(x’, q) ≤ cr.
o Else, return nothing.
[Figure: two queries, each with balls of radius r and cr around them]
Return one of red or blue points
Return nothing
LSH definition
Definition [Indyk & Motwani’98]:
Given a distance function d: U × U → R and positive values r, c, p1, p2 where p1 > p2, c > 1. A family of functions H is called (r, c, p1, p2)-sensitive if for uniformly chosen h ∈ H and all x, y ∈ U:
If d(x, y) ≤ r then Pr [h(x) = h(y)] ≥ p1; (similar/near points)
If d(x, y) ≥ cr then Pr [h(x) = h(y)] ≤ p2. (dissimilar/far away points)
Exercise
MinHash is locality-sensitive hashing: Pr[h(S) = h(T)] = J(S, T)
What are the values of p1, p2 given r and cr?
SimHash is locality-sensitive hashing: Prr[hr(x) = hr(y)] = 1 – θxy/π
What are the values of p1, p2 given r and cr?
Locality-sensitive hashing framework
Claim:
We can solve the randomized c-approximation r-near neighbor search in
sublinear time using locality-sensitive hashing.
Sublinear time: O(n^ρ), 0 < ρ < 1, for any query and a data set of size n.
Intuition: Candidate points are the points that fall within the same bucket as the query point.
[Figure: a hash table in which the query q and points x1, x2 at small distance fall into the same bucket, while x3 at large distance falls into a different bucket]
Recap: LSH definition
Definition [Indyk & Motwani’98]:
Given a distance function d: U × U → R and positive values r, c, p1, p2 where p1 > p2, c > 1. A family of functions H is called (r, c, p1, p2)-sensitive if for uniformly chosen h ∈ H and all x, y ∈ U:
If d(x, y) ≤ r then Pr [h(x) = h(y)] ≥ p1; (similar/near points)
If d(x, y) ≥ cr then Pr [h(x) = h(y)] ≤ p2. (dissimilar/far away points)
How to make p2 as small as possible, i.e. p2 = 1/n?
Concatenate several hash functions
Define a hash function g(x) = (h1(x), h2(x), … , hk(x)) where hi, 1 ≤ i ≤ k, are chosen independently and uniformly at random from an LSH family H.
If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ (p1)^k (similar/near points)
If d(x, y) ≥ cr then Pr[g(x) = g(y)] ≤ (p2)^k (dissimilar/far away points)
Set k = log(n) / log(1/p2); then (p2)^k = (p2)^(log(n) / log(1/p2)) = 1/n
(p1)^k = (p1)^(log(n) / log(1/p2)) = 1/n^ρ, where ρ = log(1/p1) / log(1/p2) and 0 < ρ < 1
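A small Python sketch of these parameter formulas; the values of n, p1, p2 below are hypothetical (e.g. a MinHash family where "near" means J ≥ 0.9 and "far" means J ≤ 0.5):

```python
import math

def lsh_params(n: int, p1: float, p2: float):
    """k, L, and rho from the formulas above (assumes 0 < p2 < p1 < 1)."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(1 / p1) / math.log(1 / p2)
    L = math.ceil(n ** rho)
    return k, L, rho

print(lsh_params(1_000_000, p1=0.9, p2=0.5))   # -> (20, 9, 0.152...)
```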
LSH framework intuition
Similar points:
If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ 1/n^ρ
If there is a similar point, i.e. d(x, q) ≤ r, then in expectation we need about n^ρ trials to find it in our bucket.
Dissimilar points:
If d(x, y) ≥ cr then Pr [g(x) = g(y)] ≤ 1/n
Since we have n points, in expectation only 1 dissimilar point falls into the same bucket as the query.
LSH framework intuition
Similar points:
If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ 1/n^ρ
If there is a similar point, i.e. d(x, q) ≤ r, then in expectation we need about n^ρ trials to find it in our bucket.
How do we get n^ρ trials?
Dissimilar points:
If d(x, y) ≥ cr then Pr[g(x) = g(y)] ≤ 1/n
Since we have n points, in expectation only 1 dissimilar point falls into the same bucket as the query.
No worries about dissimilar points.
LSH framework intuition
Similar points:
If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ 1/n^ρ
If there is a similar point, i.e. d(x, q) ≤ r, then in expectation we need about n^ρ trials to find it in our bucket.
Use L = n^ρ hash tables.
Dissimilar points:
If d(x, y) ≥ cr then Pr[g(x) = g(y)] ≤ 1/n
Since we have n points, in expectation only 1 dissimilar point falls into the same bucket as the query.
In expectation, at most L = n^ρ dissimilar points collide with the query across all tables.
LSH framework: Hash tables construction
Space usage: O(nL + dn)
g(x) = (h1(x), h2(x), ... , hk(x)) where hi ∈ H
[Figure: each point x of the data set S is hashed by g1(x), g2(x), … , gL(x) into L hash tables]
Prob. of collision for near points: ≥ 1/n^ρ
Prob. of collision for far away points: ≤ 1/n
LSH framework: Hash tables construction
Preprocessing:
1) Choose L hash functions gj, 1 ≤ j ≤ L, by setting gj = (h1,j, h2,j, ... , hk,j), where hi,j, 1 ≤ i ≤ k, are chosen independently and uniformly at random from the LSH family H.
2) Construct L hash tables, where each hash table Tj, 1 ≤ j ≤ L, contains the data set S hashed using the hash function gj.
Complexity:
Space usage: O(nL + dn)
Time: O(dknL) with the assumption that the evaluation of h(x) takes O(d) time.
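A minimal, hedged sketch of this construction in Python; `lsh_family` is an assumed interface (a callable that draws one random hash function h: point → int), not something defined on the slides:

```python
import random
from collections import defaultdict

def build_tables(S, lsh_family, k, L, seed=0):
    """Build L hash tables; each g_j concatenates k random functions from H."""
    rng = random.Random(seed)
    # gs[j] holds the k component functions of g_j = (h_1j, ..., h_kj).
    gs = [[lsh_family(rng) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for x in S:
        for j in range(L):
            key = tuple(h(x) for h in gs[j])   # g_j(x) used as a bucket key
            tables[j][key].append(x)
    return gs, tables
```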
LSH framework: Hash tables lookup
Important:
We interrupt the search after retrieving the first L′ = 2n^ρ points, including duplicates.
g(q) = (h1(q), h2(q), ... , hk(q)) where hi ∈ H
[Figure: q is hashed by g1(q), g2(q), … , gL(q) into hash tables T1, T2, … , TL (filtering); the retrieved candidates x are then checked against d(x, q) ≤ cr (refinement)]
LSH framework: Hash tables lookup
Query for a point q:
For each j = 1, 2, ..., L:
1) Retrieve the points from the bucket gj(q) in the hash table Tj.
2) For each reported point x:
2.1) Compute the distance d(x, q).
2.2) If d(x, q) ≤ cr, return x and stop.
3) Stop as soon as the number of reported points is more than L′.
Complexity:
Time: O(dkL + dL’) with the assumption that the evaluation of h(q) takes O(d) time.
(O(dkL) to compute hash values; O(dL′) to compute the actual distances to L′ points)
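A matching sketch of the lookup, reusing the `gs` and `tables` from the construction sketch above; `dist`, `c`, `r`, and `L_prime` (= 2n^ρ) are the slide's parameters:

```python
def lsh_query(q, gs, tables, dist, c, r, L_prime):
    """Return some point within distance c*r of q, or None (return nothing)."""
    checked = 0
    for g, table in zip(gs, tables):
        key = tuple(h(q) for h in g)            # g_j(q)
        for x in table.get(key, []):            # candidates in q's bucket
            if dist(x, q) <= c * r:
                return x                        # refinement step succeeded
            checked += 1
            if checked >= L_prime:              # interrupt after L' points
                return None
    return None
```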
Analysis
Parameter settings:
Prob. of collision for near points: ≥ 1/n^ρ
Prob. of collision for far away points: ≤ 1/n
k = log(n) / log(1/p2)
ρ = log(1/p1) / log(1/p2), 0 < ρ < 1
L = n^ρ hash tables
Check L′ = 2L = 2n^ρ points when computing the actual distances
Complexity:
Preprocessing:
o Space usage: O(nL + dn)
o Time: O(dkn^(1+ρ)) with the assumption that the evaluation of h(x) takes O(d) time.
Query:
o Sublinear time: O(dkn^ρ + dn^ρ) with the assumption that the evaluation of h(q) takes O(d) time.
Recap: Problem relaxation
Randomized c-approximation rNNS:
Given a data set S ⊂ R^d, a distance function d(x, y), parameters r > 0, c > 1, δ > 0, and a query q ∈ R^d, with probability 1 – δ:
o If there exists x ∈ S with d(x, q) ≤ r, return some point x’ ∈ S where d(x’, q) ≤ cr.
o Else, return nothing.
[Figure: two queries, each with balls of radius r and cr around them]
Return one of red or blue points
Return nothing
Correctness
We need to show that, with probability 1 – δ:
Case 1: If there exists x ∈ S with d(x, q) ≤ r, then there exists a hash table Tj such that gj(x) = gj(q) for some 1 ≤ j ≤ L. This means that we can find x.
Case 2: The total number of collisions from far away points is at most 2L. This means that after checking 2L points, we can be certain that all remaining points are farther than cr from q.
[Figure: Case 1 (some point lies within radius r of q) and Case 2 (all points lie beyond cr)]
Return one of red or blue points
Return nothing
Correctness: Case 1
Observation: If d(x, q) ≤ r, then
Prob. of x colliding with q in one hash table: ≥ 1/n^ρ
Prob. of x not colliding with q in one hash table: ≤ 1 – 1/n^ρ
Prob. of x not colliding with q in any of the L hash tables: ≤ (1 – 1/n^ρ)^L
Prob. of x not colliding with q in any of the L hash tables: ≈ 1/e with L = n^ρ
Prob. of x colliding with q in at least 1 hash table: ≥ 1 – 1/e with L = n^ρ
With prob. ≥ 1 – 1/e, a near point x will collide with q in at least 1 hash table. This means that we can find x.
(1 – 1/x)^x ≈ 1/e
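A quick numeric sanity check of this approximation (a throwaway sketch):

```python
import math

# (1 - 1/x)**x approaches 1/e ~ 0.3679 as x grows.
for x in (10, 100, 10_000):
    print(x, (1 - 1 / x) ** x, 1 / math.e)
```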
Correctness: Case 2
Observation:
Maximum number of far away points: n
[Figure: a probability distribution with its tail probability shaded (Markov's inequality)]
If d(x, q) ≥ cr, then Pr[g(x) = g(q)] ≤ 1/n. Hence, in expectation, we have at most 1 far away point colliding with q in each hash table.
E[number of far away points colliding with q] ≤ L. Using Markov's inequality:
Pr[X ≥ a] ≤ E[X] / a for any a > 0.
Let X be the number of far away points colliding with q across the L tables. Then Pr[X ≥ 2L] ≤ E[X] / (2L) ≤ 1/2.
With prob. ≥ 1/2, after checking the first L′ = 2L points and finding them all far away from q, we can be certain that all points are farther than cr from q. Hence, return nothing.
Correctness
The probability that both Case 1 and Case 2 hold is at least: 1 – ((1 – 1/2) + (1 – (1 – 1/e))) = 1/2 – 1/e
(the first failure term, 1 – 1/2, is the probability that Case 2 fails; the second, 1/e, is the probability that Case 1 fails)
Theorem: There exists a sublinear algorithm to answer the
c-approximate rNNS with probability 1/2 – 1/e.
Boosting the accuracy: By repeating the LSH algorithm O(1/δ) times, we can amplify the probability of success in at least 1 trial to 1 – δ, for any δ > 0.
Practical algorithm
Query for a point q:
For each j = 1, 2, …, L:
1) Retrieve the points from the bucket gj(q) in the hash table Tj.
2) For each reported point x:
2.1) Compute the distance d(x, q).
2.2) If d(x, q) ≤ r, report x.
2.3) Update the k-nearest neighbors with the reported x.
We lose the sublinear theoretical guarantee, but the accuracy is much higher. In practice, it can still run in sublinear time on many data sets.
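One possible Python sketch of this practical variant, reusing `gs` and `tables` from the construction sketch above; the heap-based k-NN bookkeeping is an implementation choice, not something prescribed by the slide:

```python
import heapq
from itertools import count

def knn_query(q, gs, tables, dist, k_neighbors):
    """Scan all of q's buckets and keep the k nearest candidates seen."""
    heap, tie = [], count()      # max-heap via negated distances; tie-breaker
    seen = set()
    for g, table in zip(gs, tables):
        key = tuple(h(q) for h in g)
        for x in table.get(key, []):
            if id(x) in seen:    # skip duplicates retrieved from other tables
                continue
            seen.add(id(x))
            d = dist(x, q)
            if len(heap) < k_neighbors:
                heapq.heappush(heap, (-d, next(tie), x))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, next(tie), x))
    # Return (distance, point) pairs sorted from nearest to farthest.
    return sorted(((-nd, x) for nd, _, x in heap), key=lambda t: t[0])
```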