
Approximate Near Neighbor Search: Locality-sensitive Hashing
COMPSCI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Auckland, Aug 10, 2020

Outline
- Popular similarity/distance measures
  - Definitions
  - Applications
- Locality-sensitive hashing framework for approximate near neighbor search
  - LSH definition
  - LSH framework for near neighbor search

A common metaphor
- We can cast many problems as finding "similar" items, i.e. finding near neighbors in high dimensions.
- Near neighbor search:
Source: https://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/

Distance/similarity measure
- For each application, we need to define a similarity/distance measure to distinguish between:
  - similar and dissimilar objects;
  - near and far-apart objects.
- Popular similarity/distance measures:
  - Jaccard similarity/distance
  - Cosine similarity/distance
  - Hamming distance
  - Euclidean distance (L2)
  - Manhattan distance (L1)

Jaccard similarity/distance
- Jaccard similarity of two sets S and T: J(S, T) = |S ∩ T| / |S ∪ T|
- Jaccard distance of two sets S and T: 1 – J(S, T)
- Example: two sets S and T with 4 items in the intersection and 8 items in the union:
  - J(S, T) = 4/8 = 1/2
  - 1 – J(S, T) = 1/2
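A minimal Python sketch of the two definitions above. The sets S and T below are made up for illustration (any pair with 4 shared items out of 8 total gives the same values):

```python
def jaccard_similarity(s, t):
    """J(S, T) = |S intersect T| / |S union T|."""
    if not s and not t:
        return 1.0  # convention for two empty sets
    return len(s & t) / len(s | t)

def jaccard_distance(s, t):
    """Jaccard distance = 1 - J(S, T)."""
    return 1.0 - jaccard_similarity(s, t)

# Hypothetical sets with 4 items in the intersection and 8 in the union.
S = {1, 2, 3, 4, 5, 6}
T = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(S, T))  # 0.5
print(jaccard_distance(S, T))    # 0.5
```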

Applications
- Texts/Documents:
  - Represent a text/document as a set of shingles.
  - Usage: detecting plagiarism, mirror pages, articles from the same source.
- Recommender system data:
  - Online purchases: users as sets, purchased items as set elements.
  - Netflix: users as sets, rated movies as set elements.
  - Usage: collaborative filtering recommender systems need to compute user-user and item-item similarities for accurate recommendations.

Applications
- Market basket data:
  - Items as sets, transaction IDs as set elements.
  - Usage: association rules X → Y if J(X, Y) ≥ s.
- Example:
  - Bread = {1, 2, 4, 5}
  - Milk = {1, 3, 4, 5}
  - Diapers = {2, 3, 4, 5}
  - Beer = {2, 3, 4}
  - Eggs = {2}
  - Cola = {3, 5}
  - J(Diapers, Beer) = 3/4

MinHash [Broder’97]
- Permute the rows of the Boolean shingles-documents matrix with a random permutation π.
- MinHash function: h(S) = the index of the first row (in the permuted order π) in which S has value 1, i.e. h(S) = min π(S).
- Pr[h(S) = h(T)] = J(S, T)
(Figure: a Boolean matrix with documents A, B, C, D as columns and shingles 1–7 as rows.)
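A small permutation-based MinHash sketch in Python, following the definition above. The sets and the number of permutations are illustrative choices, not part of the slides:

```python
import random

def minhash_signatures(sets, universe, num_perm=100, seed=1):
    """Return {name: [h_1(S), ..., h_num_perm(S)]} using random row permutations."""
    rng = random.Random(seed)
    rows = sorted(universe)
    signatures = {name: [] for name in sets}
    for _ in range(num_perm):
        perm = rows[:]
        rng.shuffle(perm)  # a random permutation pi of the rows
        for name, s in sets.items():
            # h(S) = index of the first row (in permuted order) contained in S
            signatures[name].append(next(i for i, r in enumerate(perm) if r in s))
    return signatures

# Estimate J(S, T) as the fraction of permutations on which the MinHash values agree.
S = {1, 2, 3, 4, 5, 6}
T = {3, 4, 5, 6, 7, 8}
sig = minhash_signatures({"S": S, "T": T}, universe=S | T, num_perm=500)
print(sum(a == b for a, b in zip(sig["S"], sig["T"])) / 500)  # close to J(S, T) = 0.5
```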

Cosine similarity/distance
- Given two points x = (x1, …, xd) and y = (y1, …, yd) in a high-dimensional space, with ||x||2^2 = x1^2 + x2^2 + … + xd^2.
- Cosine similarity: cos(θxy) = (Σ_{i=1}^{d} xi·yi) / (||x||2 · ||y||2), where 0 ≤ θxy ≤ π.
- Cosine distance: 1 – cos(θxy)
(Figure: vectors x, y, z with angles θxy and θxz.)
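A short plain-Python sketch of these two formulas; the example vectors are arbitrary:

```python
import math

def cosine_similarity(x, y):
    """cos(theta_xy) = <x, y> / (||x||_2 * ||y||_2)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]
print(cosine_similarity(x, y))  # ~0.73
```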

Applications
- Texts/Documents:
  - Represent texts/documents as high-dimensional points using term frequency–inverse document frequency (tf-idf), with L2 normalization, i.e. normalizing x as (x1, …, xd)/||x||2.
  - Usage: information retrieval (text search, topic modeling) and most MinHash applications.
- A tf-idf definition for a corpus of n documents:
  - tf(t, d) = (number of times term t occurs in document d) / (total number of terms in document d)
  - idf(t) = log2( n / (1 + number of documents where t appears) )
  - tf-idf(t, d) = tf(t, d) · idf(t)
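A minimal sketch of these exact formulas in Python; the toy corpus is made up for illustration. (Note that with this idf variant, a term appearing in every document gets a slightly negative weight.)

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of tokenized documents. Returns a {term: tf-idf} dict per document."""
    n = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts, total = Counter(doc), len(doc)
        vectors.append({
            t: (c / total) * math.log2(n / (1 + df[t]))  # tf(t, d) * idf(t)
            for t, c in counts.items()
        })
    return vectors

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tf_idf(corpus)[0])
```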

Applications
- Images:
  - Represent an image as a high-dimensional vector, e.g. scale-invariant feature transform (SIFT), deep learning (“image2vec”), …
  - Usage: content-based image retrieval.
Source: https://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/

SimHash [Charikar’02]
- Generate a random Gaussian vector r = (r1, …, rd) where each ri is drawn from N(0, 1).
- SimHash function: hr(x) = sign(r · x), i.e. +1 if r · x ≥ 0 and –1 if r · x < 0.
- Pr_r[hr(x) = hr(y)] = 1 – θxy/π
(Figure: the random hyperplane with normal vector r separating x and y; angle θxy.)
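A minimal SimHash sketch in Python following this definition (a random Gaussian vector, then the sign of the dot product); the dimension and test vectors are illustrative:

```python
import random

def make_simhash(d, seed=None):
    """Return one SimHash function h_r(x) = sign(r . x) with r ~ N(0, I_d)."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return lambda x: 1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else -1

# Empirically check Pr[h_r(x) = h_r(y)] = 1 - theta_xy / pi.
x = [1.0, 0.0]
y = [1.0, 1.0]  # angle theta_xy = pi/4, so the collision probability is 0.75
hashes = [make_simhash(2, seed=i) for i in range(10000)]
print(sum(h(x) == h(y) for h in hashes) / len(hashes))  # close to 0.75
```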
Outline
- Popular similarity/distance measures
- Locality-sensitive hashing framework for approximate near neighbor search

r-near neighbor search
- r-near neighbor search (rNNS) problem:
  - Given a data set S ⊂ R^d, a distance function d(x, y), a distance threshold r, and a query q ∈ R^d, return some point x ∈ S where d(x, q) ≤ r.
- Other variants:
  - k-nearest neighbors search
  - r-near neighbor reporting

Problem relaxation
- Randomized c-approximation rNNS:
  - Given a data set S ⊂ R^d, a distance function d(x, y), parameters r > 0, c > 1, δ > 0, and a query q ∈ R^d, with probability 1 – δ:
    - If there exists x ∈ S with d(x, q) ≤ r, return some point x’ ∈ S where d(x’, q) ≤ cr.
    - Else, return nothing.
(Figure: query balls of radius r and cr; left: a point within r exists, so return one of the red or blue points; right: return nothing.)
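For reference, a naive linear scan solves the exact rNNS problem in O(dn) time per query; this hypothetical baseline (not from the slides) is what the LSH framework below aims to beat:

```python
import math

def linear_scan_rnns(S, q, r, dist):
    """Return some point x in S with dist(x, q) <= r, or None if no such point exists."""
    for x in S:
        if dist(x, q) <= r:
            return x
    return None

# Example with Euclidean distance.
S = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
print(linear_scan_rnns(S, q=(3.5, 4.0), r=1.0, dist=math.dist))  # (3.0, 4.0)
```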

LSH definition
- Definition [Indyk & Motwani’98]:
  - Given a distance function d: U × U → R and positive values r, c, p1, p2 with p1 > p2 and c > 1, a family of functions H is called (r, c, p1, p2)-sensitive if, for a uniformly chosen h ∈ H and all x, y ∈ U:
    - If d(x, y) ≤ r then Pr[h(x) = h(y)] ≥ p1; (similar/near points)
    - If d(x, y) ≥ cr then Pr[h(x) = h(y)] ≤ p2. (dissimilar/far-away points)
(Figure: balls of radius r and cr around a point.)

Exercise
- MinHash is locality-sensitive hashing: Pr[h(S) = h(T)] = J(S, T).
  - What are the values of p1, p2 given r and cr?
- SimHash is locality-sensitive hashing: Pr_r[hr(x) = hr(y)] = 1 – θxy/π.
  - What are the values of p1, p2 given r and cr?

Locality-sensitive hashing framework
- Claim:
  - We can solve the randomized c-approximation r-near neighbor search in sublinear time using locality-sensitive hashing.
  - Sublinear time: O(n^ρ) with 0 < ρ < 1, for any query and a data set of size n.
- Intuition: candidate points are the points that fall into the same hash bucket as the query point.
(Figure: a hash table with a query q and points x1, x2, x3, illustrating small vs. large distances to q.)

Recap: LSH definition
- Definition [Indyk & Motwani’98]:
  - Given a distance function d: U × U → R and positive values r, c, p1, p2 with p1 > p2 and c > 1, a family of functions H is called (r, c, p1, p2)-sensitive if, for a uniformly chosen h ∈ H and all x, y ∈ U:
    - If d(x, y) ≤ r then Pr[h(x) = h(y)] ≥ p1; (similar/near points)
    - If d(x, y) ≥ cr then Pr[h(x) = h(y)] ≤ p2. (dissimilar/far-away points)
- How can we make p2 as small as possible, i.e. p2 = 1/n?

Concatenate several hash functions
- Define a hash function g(x) = (h1(x), h2(x), …, hk(x)) where the hi, 1 ≤ i ≤ k, are chosen independently and uniformly at random from an LSH family H.
  - If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ (p1)^k (similar/near points)
  - If d(x, y) ≥ cr then Pr[g(x) = g(y)] ≤ (p2)^k (dissimilar/far-away points)
- Setting k = log(n) / log(1/p2), we have:
  - (p2)^k = (p2)^(log(n)/log(1/p2)) = 1/n
  - (p1)^k = (p1)^(log(n)/log(1/p2)) = 1/n^ρ, where ρ = log(1/p1)/log(1/p2) and 0 < ρ < 1.
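A quick numeric sanity check of these formulas in Python; the values of p1, p2 and n are hypothetical, chosen only to show the orders of magnitude involved:

```python
import math

p1, p2, n = 0.8, 0.5, 1_000_000  # hypothetical LSH family and data set size
k = math.log(n) / math.log(1 / p2)         # number of concatenated hash functions
rho = math.log(1 / p1) / math.log(1 / p2)  # exponent rho
L = n ** rho                               # number of hash tables (next slides)

print(round(k), round(rho, 2), round(L))  # ~20, ~0.32, ~85
print(p2 ** k)                            # = 1/n = 1e-06
print(p1 ** k, 1 / n ** rho)              # both ~0.0117
```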
LSH framework intuition
- Similar points:
  - If d(x, y) ≤ r then Pr[g(x) = g(y)] ≥ 1/n^ρ.
  - If there is a similar point, i.e. d(x, q) ≤ r, then in expectation we need to check n^ρ points to find a similar point in our bucket. How to get n^ρ points? Use L = n^ρ hash tables.
- Dissimilar points:
  - If d(x, y) ≥ cr then Pr[g(x) = g(y)] ≤ 1/n.
  - Since we have n points, in expectation only 1 dissimilar point is in the same bucket as the query, so no worries on dissimilar points: in expectation, fewer than L = n^ρ of them over the L tables.

LSH framework: Hash tables construction
- g(x) = (h1(x), h2(x), …, hk(x)) where hi ∈ H; the data set S is hashed into L hash tables using g1, g2, …, gL.
- Prob. of collision of near points: 1/n^ρ; prob. of collision of far-away points: 1/n.
- Preprocessing:
  1) Choose L hash functions gj, 1 ≤ j ≤ L, by setting gj = (h1,j, h2,j, …, hk,j), where the hi,j, 1 ≤ i ≤ k, are chosen independently and uniformly at random from the LSH family H.
  2) Construct L hash tables, where each hash table Tj, 1 ≤ j ≤ L, contains the data set S hashed using the hash function gj.
- Complexity:
  - Space usage: O(nL + dn)
  - Time: O(dknL), assuming the evaluation of h(x) takes O(d) time.

LSH framework: Hash tables lookup
- Compute g(q) = (h1(q), h2(q), …, hk(q)) where hi ∈ H; look up the bucket gj(q) in each table Tj (filtering), then compute actual distances and return a point with d(x, q) ≤ cr (refinement).
- Important: we interrupt the search after finding the first L’ = 2n^ρ points, including duplicates.
- Query for a point q: For each j = 1, 2, …, L:
  1) Retrieve the points from the bucket gj(q) in the hash table Tj.
  2) For each of the reported points x:
     2.1) Compute the distance d(x, q).
     2.2) If d(x, q) ≤ cr, return x and stop.
  3) Stop as soon as the number of reported points exceeds L’.
- Complexity:
  - Time: O(dkL + dL’), assuming the evaluation of h(q) takes O(d) time (O(dkL) to compute hash values, O(dL’) to compute actual distances to L’ points).

Analysis
- Parameter settings:
  - k = log(n) / log(1/p2)
  - ρ = log(1/p1) / log(1/p2), with 0 < ρ < 1
  - L = n^ρ hash tables
  - Check L’ = 2L = 2n^ρ points when computing actual distances
  - Prob. of collision of near points: 1/n^ρ; prob. of collision of far-away points: 1/n
- Complexity:
  - Preprocessing:
    - Space usage: O(nL + dn)
    - Time: O(dkn^(1+ρ)), assuming the evaluation of h(x) takes O(d) time.
  - Query:
    - Sublinear time: O(dkn^ρ + dn^ρ), assuming the evaluation of h(q) takes O(d) time.
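A compact Python sketch of the preprocessing and lookup procedures above, written for MinHash over sets (Jaccard distance). The data layout, parameter values and helper names are illustrative assumptions, not part of the slides; in practice k and L would be set as in the analysis (k = log(n)/log(1/p2), L = n^ρ, L’ = 2L):

```python
import random
from collections import defaultdict

def build_tables(S, k, L, universe, seed=0):
    """Preprocessing: L tables, each keyed by g_j(x) = (h_{1,j}(x), ..., h_{k,j}(x))."""
    rng = random.Random(seed)
    # One MinHash "permutation" per (i, j), encoded as random priorities on the rows.
    perms = [[{r: rng.random() for r in universe} for _ in range(k)] for _ in range(L)]
    minhash = lambda x, pi: min(pi[r] for r in x)
    tables = [defaultdict(list) for _ in range(L)]
    for idx, x in enumerate(S):
        for j in range(L):
            g = tuple(minhash(x, perms[j][i]) for i in range(k))
            tables[j][g].append(idx)
    return tables, perms, minhash

def query(q, S, tables, perms, minhash, c_r, L_prime):
    """Lookup: return a point within Jaccard distance c*r, stopping after L' candidates."""
    jaccard_dist = lambda a, b: 1 - len(a & b) / len(a | b)
    seen = 0
    for j, table in enumerate(tables):
        g = tuple(minhash(q, perms[j][i]) for i in range(len(perms[j])))
        for idx in table.get(g, []):            # filtering: same bucket as q
            if jaccard_dist(S[idx], q) <= c_r:  # refinement: actual distance
                return S[idx]
            seen += 1
            if seen >= L_prime:                 # interrupt after L' = 2L candidates
                return None
    return None
```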
Recap: Problem relaxation
- Randomized c-approximation rNNS:
  - Given a data set S ⊂ R^d, a distance function d(x, y), parameters r > 0, c > 1, δ > 0, and a query q ∈ R^d, with probability 1 – δ:
    - If there exists x ∈ S with d(x, q) ≤ r, return some point x’ ∈ S where d(x’, q) ≤ cr.
    - Else, return nothing.
(Figure: query balls of radius r and cr; left: return one of the red or blue points; right: return nothing.)

Correctness
- We need to show that, with probability 1 – δ:
  - Case 1: if there exists x ∈ S with d(x, q) ≤ r, then there is a hash table Tj such that gj(x) = gj(q) for some 1 ≤ j ≤ L. This means that we can find x.
  - Case 2: the total number of collisions from far-away points is at most 2L. This means that, after checking 2L points without success, we can be certain that all points are farther than cr from q.
(Figure: query balls of radius r and cr; left (Case 1): return one of the red or blue points; right (Case 2): return nothing.)

Correctness: Case 1
- Observation: if d(x, q) ≤ r, then:
  - Prob. that x collides with q in a given hash table: ≥ 1/n^ρ
  - Prob. that x does not collide with q in a given hash table: ≤ 1 – 1/n^ρ
  - Prob. that x does not collide with q in all L hash tables: ≤ (1 – 1/n^ρ)^L
  - With L = n^ρ, prob. that x does not collide with q in all L hash tables: ≈ 1/e, since (1 – 1/x)^x ≈ 1/e.
  - Prob. that x collides with q in at least 1 hash table: ≥ 1 – 1/e with L = n^ρ.
- With prob. 1 – 1/e, a near point x collides with q in at least 1 hash table. This means that we can find x.
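A two-line check of the approximation used above; the value of L is arbitrary:

```python
import math
L = 100                   # e.g. L = n^rho hash tables
print((1 - 1 / L) ** L)   # ~0.366
print(1 / math.e)         # ~0.368
```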

Correctness: Case 2
- Observation:
  - Maximum number of far-away points: n.
  - If d(x, q) ≥ cr, then Pr[g(x) = g(q)] ≤ 1/n. Hence, in expectation, at most 1 far-away point collides with q in each hash table.
  - E[number of far-away points colliding with q over the L tables] ≤ L.
- Using Markov’s inequality: Pr[X ≥ a] ≤ E[X]/a for any a > 0.
  - Let X be the number of far-away points that collide with q in the L tables.
  - Pr[X ≥ 2L] ≤ E[X]/(2L) ≤ 1/2.
- With prob. 1/2, after checking the first L’ = 2L points and finding all of them farther than cr from q, we can be certain that all points are far away, and hence return nothing.
(Figure: a probability distribution with its tail probability shaded, illustrating Markov’s inequality.)

Correctness
- The probability that both Case 1 and Case 2 hold is at least:
  1 – ((1 – 1/2) + (1 – (1 – 1/e))) = 1/2 – 1/e,
  where 1 – 1/2 is the probability that Case 2 fails and 1 – (1 – 1/e) is the probability that Case 1 fails.
- Theorem: there exists a sublinear algorithm that answers the c-approximate rNNS with probability 1/2 – 1/e.
- Boosting the accuracy: by repeating the LSH algorithm O(1/δ) times, we can amplify the probability of success in at least 1 trial to 1 – δ, for any δ > 0.

Practical algorithm
- Query for a point q: For each j = 1, 2, …, L:
  1) Retrieve the points from the bucket gj(q) in the hash table Tj.
  2) For each of the reported points x:
     2.1) Compute the distance d(x, q).
     2.2) If d(x, q) ≤ r, report x.
     2.3) Update the current k-nearest neighbors with x.
- We lose the sublinear theoretical guarantee, but the accuracy is much higher. In practice, the algorithm still runs in sublinear time on many data sets.
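A minimal sketch of this practical query loop (no early stopping; r-near points are reported and a heap keeps the k nearest candidates). It reuses the hypothetical tables/perms/minhash helpers and set data from the earlier construction sketch:

```python
import heapq

def practical_query(q, S, tables, perms, minhash, r, k_nn):
    """Scan all L tables; report r-near points and keep the k_nn nearest candidates."""
    jaccard_dist = lambda a, b: 1 - len(a & b) / len(a | b)
    near, heap, visited = [], [], set()
    for j, table in enumerate(tables):
        g = tuple(minhash(q, perms[j][i]) for i in range(len(perms[j])))
        for idx in table.get(g, []):
            if idx in visited:        # skip duplicates found in earlier tables
                continue
            visited.add(idx)
            d = jaccard_dist(S[idx], q)
            if d <= r:
                near.append(S[idx])               # step 2.2: report an r-near point
            heapq.heappush(heap, (-d, idx))       # step 2.3: track the k_nn nearest
            if len(heap) > k_nn:
                heapq.heappop(heap)               # drop the current farthest candidate
    return near, [S[idx] for _, idx in sorted(heap, reverse=True)]
```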