Similarity Search
•Problem Definition:
 • Given a query q and a dataset D, find o ∈ D such that o is similar to q
•Two types of similarity search
 • Range search: dist(o, q) ≤ τ
 • Nearest neighbor search: dist(o*, q) ≤ dist(o, q), ∀o ∈ D
• Top-k version
•Distance/similarity function varies
 • Euclidean, Jaccard, inner product, …
•Classic problem, with mature solutions
High Dimensional Similarity Search
•Applications and relationship to Big Data
• Almost every object can be and has been represented by a high dimensional vector
 • Words, documents
 • Image, audio, video
 • …
• Similarity search is a fundamental process in information retrieval
• E.g., Google search engine, face recognition system, …
•High Dimension makes a huge difference!
• Traditional solutions are no longer feasible
• This lecture is about why and how
• We focus on high dimensional vectors in Euclidean space
Similarity Search in Low Dimensional Space
Similarity Search in One Dimensional Space
•Just numbers, use binary search, binary search tree, B+ Tree…
•The essential idea behind: objects can be sorted
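As a quick, minimal sketch (not from the slides), a 1-D nearest neighbor query over sorted numbers can be answered with Python's bisect module; the function name is illustrative.

    import bisect

    def nn_1d(sorted_data, q):
        # sorted_data must be sorted; binary search gives the insertion point of q
        i = bisect.bisect_left(sorted_data, q)
        # the nearest neighbor is one of the (at most) two values around that point
        candidates = sorted_data[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda o: abs(o - q))

    print(nn_1d([1, 4, 9, 16, 25], 11))  # -> 9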
Similarity Search in Two Dimensional Space
•Why does binary search no longer work?
 • No order!
•Voronoi diagram
[Figure: Voronoi diagrams of the same point set under Euclidean distance and Manhattan distance]
Similarity Search in Two Dimensional Space
•Partition based algorithms
• Partition data into “cells”
• Nearest neighbors are in the same cell with query or adjacent cells
•How many “cells” do we need to probe in 3-dimensional space? (see the sketch below)
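A tiny illustrative sketch (my own example, not from the slides) of grid partitioning: a point's cell and the set of cells to probe. In 3-dimensional space this is 3^3 = 27 cells, i.e., the point's own cell plus its 26 neighbors.

    from itertools import product

    def cell_of(x, width):
        # map a point to the integer coordinates of its grid cell
        return tuple(int(xi // width) for xi in x)

    def cells_to_probe(cell):
        # the cell itself plus all adjacent cells: offsets in {-1, 0, +1} per dimension
        return [tuple(c + o for c, o in zip(cell, offsets))
                for offsets in product((-1, 0, 1), repeat=len(cell))]

    print(len(cells_to_probe(cell_of([0.7, 1.2, 2.4], width=1.0))))  # 27 cells in 3-d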
Similarity Search in Metric Space
•Triangle inequality
 • dist(x, q) ≤ dist(x, y) + dist(y, q)
•Orchard’s Algorithm (a sketch follows after this list)
 • for each x ∈ D, create a list of points in increasing order of distance to x
 • given query q, randomly pick a point x as the initial candidate (i.e., pivot p), and compute dist(p, q)
 • walk along the list of p and compute the distances to q; if some y closer to q than p is found, use y as the new pivot (i.e., p ← y)
 • repeat the procedure, and stop when dist(p, y) > 2 ⋅ dist(p, q)
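A rough Python sketch of Orchard's algorithm as described above; Euclidean distance and all names below are illustrative choices, not part of the slides.

    import math, random

    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def build_orchard_lists(D):
        # for each point x, pre-sort the other points by their distance to x
        return {i: sorted((j for j in range(len(D)) if j != i),
                          key=lambda j: dist(D[i], D[j]))
                for i in range(len(D))}

    def orchard_nn(D, lists, q):
        p = random.randrange(len(D))          # random initial pivot
        d_pq = dist(D[p], q)
        improved = True
        while improved:
            improved = False
            for y in lists[p]:                # walk p's list in increasing dist(p, y)
                if dist(D[p], D[y]) > 2 * d_pq:
                    break                     # stopping rule: no remaining y can beat p
                if dist(D[y], q) < d_pq:
                    p, d_pq = y, dist(D[y], q)   # y becomes the new pivot
                    improved = True
                    break
        return D[p], d_pq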
Similarity Search in Metric Space
•Orchard’s Algorithm stops when dist(p, y) > 2 ⋅ dist(p, q). Why is it safe to stop there?
 • Suppose 2 ⋅ dist(p, q) < dist(p, y). By the triangle inequality, dist(p, y) ≤ dist(p, q) + dist(y, q), so
  2 ⋅ dist(p, q) < dist(p, q) + dist(y, q) ⇔ dist(p, q) < dist(y, q)
•Since the list of p is in increasing order of distance to p, dist(p, y) > 2 ⋅ dist(p, q) holds for all the remaining y’s, so none of them can be closer to q than the current pivot p.
None of the Above Works in
High Dimensional Space!
Curse of Dimensionality
•Refers to various phenomena that arise in high dimensional spaces that do not occur in low dimensional settings.
•Triangle inequality
 • Its pruning power is heavily reduced
•What is the volume of a high dimensional “ring” (i.e., hyperspherical shell)?
 • V_ring(w=1, d=2) / V_ball(r=10, d=2) ≈ 19%
 • V_ring(w=1, d=100) / V_ball(r=10, d=100) ≈ 99.997%
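A two-line check of these numbers, assuming the ratio above is V_ring / V_ball = 1 − ((r − w)/r)^d for a shell of width w just inside a ball of radius r:

    def ring_fraction(r, w, d):
        return 1 - ((r - w) / r) ** d

    print(ring_fraction(10, 1, 2))    # ~0.19:    in low dimensions the shell holds a small fraction
    print(ring_fraction(10, 1, 100))  # ~0.99997: in high dimensions almost all volume is in the shell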
Approximate Nearest Neighbor Search in High Dimensional Space
•There is no sub-linear solution to find the exact result of a nearest neighbor query
• So we relax the condition
•approximate nearest neighbor search (ANNS)
 • allow the returned point to be something other than the true NN of the query
 • Success: returns the true NN
 • use the success rate (i.e., the percentage of successful queries) to evaluate the method
 • Hard to bound the success rate
c-approximate NN Search
•Success: returns o such that
 • dist(o, q) ≤ c ⋅ dist(o*, q), where o* is the true nearest neighbor of q
•Then we can bound the success probability
 • Usually denoted as 1 − δ
•Solution: Locality Sensitive Hashing (LSH)
Locality Sensitive Hashing
•Hash function
 • Index: map data/objects to values (e.g., a hash key)
 • Same data ⇒ same hash key (with 100% probability)
• Different data ⇒ different hash keys (with high probability)
• Retrieval: Easy to retrieve identical objects (as they have the same hash key)
• Applications: hash map, hash join
• Low cost
 • Space: O(n)
 • Time: O(1)
 • Why can’t it be used for nearest neighbor search?
 • Even a minor difference leads to totally different hash keys
Locality Sensitive Hashing
•Index: make the hash functions error tolerant
• Similar data ⇒ same hash key (with high probability)
• Dissimilar data ⇒ different hash keys (with high probability)
• Retrieval:
• Compute the hash key for the query
 • Obtain all the data that has the same key as the query (i.e., the candidates)
• Find the nearest one to the query
• Cost:
• Space: 𝑂(𝑛)
 • Time: O(1) + O(|cand|)
•This is not yet the real Locality Sensitive Hashing!
 • We still have several unsolved issues…
LSH Functions
•Formal definition:
 • Given points o1, o2, distances r1, r2, and probabilities p1, p2
 • An LSH function h(⋅) should satisfy
  • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1
  • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2
•What is h(⋅) for a given distance/similarity function?
 • Jaccard similarity
 • Angular distance
• Euclidean distance
MinHash – LSH Function for Jaccard Similarity
•Each data object is a set
 • Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
•Randomly generate a global order for all the elements in C = S1 ∪ S2 ∪ … ∪ Sn
•Let h(S) be the minimal member of S with respect to the global order
 • For example, S = {b, c, e, h, i}; if we use the inverse alphabetical order, the re-ordered S is {i, h, e, c, b}, hence h(S) = i.
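A minimal MinHash sketch in Python; one hash function corresponds to one random global order over the element universe (the helper names are illustrative).

    import random

    def make_minhash(universe, seed=0):
        order = list(universe)
        random.Random(seed).shuffle(order)              # one random global order
        rank = {e: i for i, e in enumerate(order)}
        return lambda S: min(S, key=lambda e: rank[e])  # minimal member of S under the order

    h = make_minhash("abcdefghij", seed=42)
    S1, S2 = {"b", "c", "e", "h", "i"}, {"b", "c", "e", "i"}
    print(h(S1), h(S2))   # equal exactly when the first element of S1 ∪ S2 is shared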
MinHash
•Now we compute Pr[h(S1) = h(S2)]
•Every element e ∈ S1 ∪ S2 has an equal chance to be the first element of S1 ∪ S2 after re-ordering
•For that first element e:
 • e ∈ S1 ∩ S2 if and only if h(S1) = h(S2)
 • e ∉ S1 ∩ S2 if and only if h(S1) ≠ h(S2)
•Pr[h(S1) = h(S2)] = |S1 ∩ S2| / |S1 ∪ S2| = Jaccard(S1, S2)
SimHash – LSH Function for Angular Distance
•Each data object is a d-dimensional vector
 • θ(x, y) is the angle between x and y
•Randomly generate a normal vector a, where a_i ~ N(0, 1)
•Let h(x; a) = sgn(a^T x)
 • sgn(o) = 1 if o ≥ 0, and −1 if o < 0
 • i.e., on which side of a’s corresponding hyperplane x lies
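A minimal SimHash sketch, assuming NumPy; the seed and dimensions are arbitrary.

    import numpy as np

    def make_simhash(d, seed=0):
        a = np.random.default_rng(seed).standard_normal(d)   # a_i ~ N(0, 1)
        return lambda x: 1 if a @ x >= 0 else -1              # sgn(a^T x)

    h = make_simhash(d=3, seed=1)
    print(h(np.array([1.0, 0.2, -0.5])), h(np.array([-1.0, -0.2, 0.5])))  # opposite vectors -> opposite signs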
SimHash
•Now we compute Pr[h(o1) = h(o2)]
•h(o1) ≠ h(o2) iff o1 and o2 are on different sides of the hyperplane with a as its normal vector
•Pr[h(o1) = h(o2)] = 1 − θ/π, where θ = θ(o1, o2)
[Figure: o1 and o2 separated by angle θ, with a random hyperplane whose normal vector is a]
p-stable LSH – LSH Function for Euclidean Distance
•Each data object is a d-dimensional vector
 • dist(x, y) = (Σ_{i=1}^{d} (x_i − y_i)^2)^{1/2}
•Randomly generate a normal vector a, where a_i ~ N(0, 1)
 • The normal distribution is 2-stable, i.e., if a_i ~ N(0, 1), then Σ_{i=1}^{d} a_i ⋅ x_i = a^T x ~ N(0, ‖x‖_2^2)
•Let h(x; a, b) = ⌊(a^T x + b) / w⌋, where b ~ U(0, w) and w is a user-specified parameter
•Pr[h(o1; a, b) = h(o2; a, b)] = ∫_0^w (1/s) ⋅ f_p(t/s) ⋅ (1 − t/w) dt, where s = dist(o1, o2)
 • f_p(⋅) is the pdf of the absolute value of a standard normal variable
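A minimal p-stable hash sketch following the definition above, assuming NumPy; w and the seed are arbitrary illustrative values.

    import numpy as np

    def make_pstable_hash(d, w, seed=0):
        rng = np.random.default_rng(seed)
        a = rng.standard_normal(d)     # random projection direction, a_i ~ N(0, 1)
        b = rng.uniform(0, w)          # random offset
        return lambda x: int(np.floor((a @ x + b) / w))

    h = make_pstable_hash(d=4, w=4.0, seed=7)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(h(x), h(x + 0.05))           # nearby points usually land in the same bucket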
p-stable LSH
•Intuition of p-stable LSH
• Similar points have higher chance to be hashed together
Pr[h(x) = h(y)] for Different Hash Functions
[Figure: collision probability as a function of similarity/distance for MinHash, SimHash, and p-stable LSH]
Problem of Single Hash Function
•Hard to distinguish if two pairs have distances close to each other
 • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1
 • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2
•We also want to control where the drastic change happens…
 • Close to dist(o*, q)
 • Or within a given range
AND-OR Composition
•Recall that for a single hash function, we have
 • Pr[h(o1) = h(o2)] = p(dist(o1, o2)), denoted as p_{o1,o2}
•Now we consider two scenarios:
 • Combine k hashes together, using the AND operation
  • One must match all k hashes
  • Pr[H_AND(o1) = H_AND(o2)] = (p_{o1,o2})^k
 • Combine l hashes together, using the OR operation
  • One needs to match at least one of the l hashes
  • Pr[H_OR(o1) = H_OR(o2)] = 1 − (1 − p_{o1,o2})^l
  • There is no match only when none of the l hashes match
AND-OR Composition
•Example with MinHash, k = 5, l = 5
[Figure: Pr[h(o1) = h(o2)], Pr[H_AND(o1) = H_AND(o2)], and Pr[H_OR(o1) = H_OR(o2)] plotted against the similarity of o1 and o2]
AND-OR Composition in LSH
•Let h_{i,j} be LSH functions, where i ∈ {1, 2, …, l}, j ∈ {1, 2, …, k}
•Let H_i(o) = [h_{i,1}(o), h_{i,2}(o), …, h_{i,k}(o)]
 • the “super-hash”
•H_i(o1) = H_i(o2) ⇔ ∀j ∈ {1, 2, …, k}, h_{i,j}(o1) = h_{i,j}(o2)
•Consider a query q and any data point o; o is a nearest neighbor candidate of q if
 • ∃i ∈ {1, 2, …, l}, H_i(o) = H_i(q)
•The probability that o is a nearest neighbor candidate of q is (see the sketch below)
 • 1 − (1 − (p_{q,o})^k)^l
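A small sketch of how the super-hashes and the candidate test could be assembled from any base LSH family; the function names are illustrative.

    def make_super_hashes(base_hashes, k, l):
        # base_hashes: at least k*l single LSH functions h_{i,j} (e.g., MinHash or p-stable)
        assert len(base_hashes) >= k * l
        return [lambda o, i=i: tuple(base_hashes[i * k + j](o) for j in range(k))
                for i in range(l)]

    def is_candidate(super_hashes, o, q):
        # OR over the l super-hashes; each super-hash is an AND over its k components,
        # since tuples are equal only when every component matches
        return any(H_i(o) == H_i(q) for H_i in super_hashes)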
The Effectiveness of LSH
•How 1 − (1 − (p_{q,o})^k)^l changes with p_{q,o} (k = 5, l = 5):

  p_{q,o}:                   0.2    0.4    0.6    0.7    0.8    0.9
  1 − (1 − (p_{q,o})^k)^l:   0.002  0.050  0.333  0.601  0.863  0.988
•E.g., we are expected to retrieve 98.8% of the data with Jaccard > 0.9
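The table can be reproduced with a few lines of Python (using k = l = 5):

    k, l = 5, 5
    for p in [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]:
        print(p, round(1 - (1 - p ** k) ** l, 3))
    # 0.2 0.002   0.4 0.05   0.6 0.333   0.7 0.601   0.8 0.863   0.9 0.988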
False Positives and False Negatives
•False Positive:
 • a returned data point with dist(o, q) > r2
•False Negative:
 • a data point that is not returned although dist(o, q) < r1
•They can be controlled by carefully choosing k and l
• It’s a trade-off between space/time and accuracy
The Framework of NNS using LSH
•Pre-processing
 • Generate the LSH functions
  • MinHash: random permutations
  • SimHash: random normal vectors
  • p-stable: random normal vectors and random uniform values
•Index
 • Compute H_i(o) for each data object o, i ∈ {1, …, l}
 • Index o using H_i(o) as the key in the i-th hash table
•Query
 • Compute H_i(q) for the query q, i ∈ {1, …, l}
 • Generate the candidate set {o | ∃i ∈ {1, …, l}, H_i(q) = H_i(o)}
 • Compute the actual distance for all the candidates and return the nearest one to the query (a sketch of the whole pipeline follows below)
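A compact Python sketch of this framework, assuming the super-hash functions H_1, …, H_l and a dist() function from the previous slides are already available.

    from collections import defaultdict

    def build_index(D, H):
        tables = [defaultdict(list) for _ in H]   # one hash table per super-hash H_i
        for idx, o in enumerate(D):
            for table, H_i in zip(tables, H):
                table[H_i(o)].append(idx)         # index o under key H_i(o)
        return tables

    def lsh_query(q, D, H, tables, dist):
        cand = set()
        for table, H_i in zip(tables, H):
            cand.update(table.get(H_i(q), []))    # all o with H_i(o) = H_i(q)
        return min(cand, key=lambda i: dist(D[i], q), default=None)  # nearest candidate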
The Drawback of LSH
•Concatenating k hashes is too “strong”
 • h_{i,j}(o1) ≠ h_{i,j}(o2) for any single j ⇒ H_i(o1) ≠ H_i(o2)
•Not adaptive to the distribution of the distances
• What if not enough candidates?
 • Need to tune w (or build indexes with different w’s) to handle different cases
Multi-Probe LSH
•Observation:
 • If q’s nearest neighbor does not fall into q’s hash bucket, then most likely it falls into a bucket adjacent to q’s
 • Why? Σ_{i=1}^{d} a_i ⋅ x_i − Σ_{i=1}^{d} a_i ⋅ q_i = a^T (x − q) ~ N(0, ‖x − q‖_2^2)
•Idea:
 • Look not only at the hash bucket into which q falls, but also at those adjacent to it
•Problem:
 • How many such buckets? 2k per super-hash (each of the k hash values has a left and a right neighbor)
 • And they are not equally important!
Multi-Probe LSH
•Consider the case when 𝑘 = 2:
•Note that H_i(o) = (h_{i,1}(o), h_{i,2}(o))
•The ideal probe order would be:
 • (h_{i,1}(q), h_{i,2}(q)): 0.315
 • (h_{i,1}(q), h_{i,2}(q) − 1): 0.284
 • (h_{i,1}(q) + 1, h_{i,2}(q)): 0.150
 • (h_{i,1}(q) − 1, h_{i,2}(q)): 0.035
 • (h_{i,1}(q), h_{i,2}(q) + 1): 0.019
•We don’t have to compute the integration; we can instead use the offset between q’s projected value (before flooring) and the bucket boundaries.
Multi-Probe LSH
•Pros:
 • Requires a smaller l
• Because we use hash tables more efficiently
• More robust against the unlucky points
•Cons:
 • Loses the theoretical guarantee about the results
 • Not parallel-friendly
Collision Counting LSH (C2LSH)
•C2LSH (SIGMOD’12 paper)
• Which one is closer to 𝑞?
[Figure: the hash values of q, o1 and o2 under a set of hash functions, together with the number of collisions each point has with q; the point with more collisions is likely the closer one]
 • We will omit the theoretical parts, which leads to a slightly different version from the paper.
• But the essential ideas are the same
• Project 1 is to implement C2LSH using PySpark!
Counting the Collisions
•Collision: match on a single hash function
• Use number of collisions to determine the candidates
 • “Match one of the super-hashes of q” is replaced by “collide with q on at least αm of the m hash values”
•Recall that in LSH, the probability that an o with dist(o, q) ≤ r1 becomes a nearest neighbor candidate of q is 1 − (1 − p1^k)^l
•Now we compute the corresponding probability with collision counting...
The Collision Probability
•∀o with dist(o, q) ≤ r1, we have
 • Pr[#collision(o) ≥ αm] = Σ_{i=αm}^{m} C(m, i) ⋅ p^i ⋅ (1 − p)^{m−i}
 • where p = Pr[h_j(o) = h_j(q)] ≥ p1
•We define m Bernoulli random variables X_i, 1 ≤ i ≤ m:
 • Let X_i = 1 if o does not collide with q on the i-th hash, i.e., Pr[X_i = 1] = 1 − p
 • Let X_i = 0 if o collides with q on the i-th hash, i.e., Pr[X_i = 0] = p
 • Hence E[X_i] = 1 − p
 • Thus E[X̄] = 1 − p, where X̄ = (Σ_{i=1}^{m} X_i) / m
•Let t = p − α > 0; then
 • Pr[X̄ − E[X̄] ≥ t] = Pr[(Σ_{i=1}^{m} X_i)/m − (1 − p) ≥ t] = Pr[Σ_{i=1}^{m} X_i ≥ (1 − α)m]
The Collision Probability
•From Hoeffding’s Inequality, we have
 • Pr[X̄ − E[X̄] ≥ t] = Pr[Σ_{i=1}^{m} X_i ≥ (1 − α)m] ≤ exp(−2(p − α)^2 m^2 / Σ_{i=1}^{m} (1 − 0)^2) = exp(−2(p − α)^2 m) ≤ exp(−2(p1 − α)^2 m)
•Since the event “#collision(o) ≥ αm” is equivalent to the event “o misses the collision with q fewer than (1 − α)m times”,
 • Pr[#collision(o) ≥ αm] = Pr[Σ_{i=1}^{m} X_i < (1 − α)m] ≥ 1 − exp(−2(p1 − α)^2 m)
•Now you can compute the case for o with dist(o, q) ≥ r2 in a similar way...
•Then we can accordingly set α to control false positives and false negatives
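A quick numeric comparison of the exact collision probability against this Hoeffding lower bound, for illustrative values of m, p and α:

    import math

    def collision_prob(m, p, alpha):
        # exact Pr[#collision >= alpha*m] when each of the m hashes collides with probability p
        k0 = math.ceil(alpha * m)
        return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k0, m + 1))

    m, p, alpha = 200, 0.7, 0.6
    print(collision_prob(m, p, alpha))               # exact probability (close to 1)
    print(1 - math.exp(-2 * (p - alpha)**2 * m))     # Hoeffding lower bound (~0.98)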
Virtual Rehashing
•When we are not getting enough candidates...
 • E.g., # of candidates < k for a top-k query
•Observation:
• A close point 𝑜 usually falls into adjacent hash buckets of 𝑞’s if it does not collide with 𝑞
 • Why? Σ_{i=1}^{d} a_i ⋅ x_i − Σ_{i=1}^{d} a_i ⋅ q_i = a^T (x − q) ~ N(0, ‖x − q‖_2^2)
• Idea:
• Include the adjacent hash buckets into
consideration
• So you don’t need to re-hash them again...
Virtual Rehashing
•At first, consider h(o) = h(q)
  q: 1   1   1   1   1   1   1   1   1   1
  o: 1   0   2  −1   1   2   4  −3   0  −1
•Consider h(o) = h(q) ± 1 if there are not enough candidates
•Then h(o) = h(q) ± 2, and so on...
The Framework of NNS using C2LSH
•Pre-processing
• Generate LSH functions
• Random normal vectors and random uniform values
•Index
 • Compute and store h_i(o) for each data object o, i ∈ {1, …, m}
• Query
 • Compute h_i(q) for the query q, i ∈ {1, …, m}
 • Take those o that share at least αm hash values with q as candidates
 • Relax the collision condition (i.e., virtual rehashing) and repeat the above step, until we get enough candidates
Pseudo code of Candidate Generation in C2LSH
candGen(data_hashes, query_hashes, αm, βn):
    offset ← 0
    cand ← ∅
    while true:
        for each (id, hashes) in data_hashes:
            if count(hashes, query_hashes, offset) ≥ αm:
                cand ← cand ∪ {id}
        if |cand| < βn:
            offset ← offset + 1
        else:
            break
    return cand
Pseudo code of Candidate Generation in C2LSH
count(hashes_1, hashes_2, offset):
    counter ← 0
    for each (hash1, hash2) in (hashes_1, hashes_2):
        if |hash1 − hash2| ≤ offset:
            counter ← counter + 1
    return counter
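For reference, a plain (non-Spark) Python version of this candidate generation might look as follows; Project 1 asks for a PySpark implementation, so this sequential sketch is only illustrative (data_hashes is assumed to be a list of (id, hash_list) pairs).

    def count_collisions(hashes_1, hashes_2, offset):
        return sum(1 for h1, h2 in zip(hashes_1, hashes_2) if abs(h1 - h2) <= offset)

    def cand_gen(data_hashes, query_hashes, alpha_m, beta_n):
        offset = 0
        cand = set()
        while True:
            for oid, hashes in data_hashes:
                if count_collisions(hashes, query_hashes, offset) >= alpha_m:
                    cand.add(oid)
            if len(cand) < beta_n:
                offset += 1       # virtual rehashing: relax the collision condition
            else:
                break
        return cand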
43