
COMP9313:
Big Data Management
High Dimensional Similarity Search

Similarity Search
•Problem Definition:
• Given a query 𝑞 and dataset 𝐷, find o ∈ 𝐷, where 𝑜 is similar to 𝑞
•Two types of similarity search
• Range search: 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≤ τ
• Nearest neighbor search: 𝑑𝑖𝑠𝑡(𝑜∗, 𝑞) ≤ 𝑑𝑖𝑠𝑡(𝑜, 𝑞), ∀𝑜 ∈ 𝐷
[Figure: query 𝑞 with range radius τ and its nearest neighbor 𝑜∗]
• Top-k version
•Distance/similarity function varies
• Euclidean, Jaccard, inner product, …
•Classic problem, with mature solutions

High Dimensional Similarity Search
•Applications and relationship to Big Data
• Almost every object can be and has been represented by a high dimensional vector
• Words, documents
• Images, audio, video
• …
• Similarity search is a fundamental process in information retrieval
• E.g., Google search engine, face recognition system, …
•High Dimension makes a huge difference!
• Traditional solutions are no longer feasible
• This lecture is about why and how
• We focus on high dimensional vectors in Euclidean space

Similarity Search in Low Dimensional Space

Similarity Search in One Dimensional Space
•Just numbers, use binary search, binary search tree, B+ Tree…
•The essential idea behind these methods: objects can be sorted
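To make this concrete, here is a minimal sketch (plain Python, hypothetical toy data) of a 1-D nearest neighbor query via binary search:

import bisect

def nn_1d(sorted_data, q):
    # Nearest neighbor of q in a sorted list via binary search: O(log n) per query.
    i = bisect.bisect_left(sorted_data, q)      # first index with value >= q
    candidates = []
    if i < len(sorted_data):
        candidates.append(sorted_data[i])       # closest value on the right
    if i > 0:
        candidates.append(sorted_data[i - 1])   # closest value on the left
    return min(candidates, key=lambda o: abs(o - q))

data = sorted([7.1, 2.5, 9.0, 4.2, 5.8])        # hypothetical 1-D data
print(nn_1d(data, 4.9))                         # -> 4.2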

Similarity Search in Two Dimensional Space
•Why does binary search no longer work?
• No order!
•Voronoi diagram
[Figures: Voronoi diagrams under Euclidean distance and Manhattan distance]

Similarity Search in Two Dimensional Space
•Partition based algorithms
• Partition data into “cells”
• Nearest neighbors are in the same cell with query or adjacent cells
•How many “cells” do we need to probe in 3-dimensional space?

Similarity Search in Metric Space
•Triangle inequality
• 𝑑𝑖𝑠𝑡(𝑥, 𝑞) ≤ 𝑑𝑖𝑠𝑡(𝑥, 𝑦) + 𝑑𝑖𝑠𝑡(𝑦, 𝑞)
•Orchard’s Algorithm
• for each 𝑥 ∈ 𝐷, create a list of points in increasing order of distance to 𝑥
• given query 𝑞, randomly pick a point 𝑥 as the initial candidate (i.e., pivot 𝑝), and compute 𝑑𝑖𝑠𝑡(𝑝, 𝑞)
• walk along the list of 𝑝 and compute the distances to 𝑞; if we find a 𝑦 closer to 𝑞 than 𝑝, then use 𝑦 as the new pivot (i.e., 𝑝 ← 𝑦)
• repeat the procedure, and stop when 𝑑𝑖𝑠𝑡(𝑝, 𝑦) > 2 ⋅ 𝑑𝑖𝑠𝑡(𝑝, 𝑞)
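A minimal sketch of this procedure, assuming Euclidean distance and brute-force preprocessing (names and data are illustrative):

import math

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_orchard_lists(D):
    # For each point x, the other points sorted by increasing distance to x.
    return {i: sorted((j for j in range(len(D)) if j != i), key=lambda j: dist(D[i], D[j]))
            for i in range(len(D))}

def orchard_nn(D, lists, q, start=0):
    p = start                                    # initial pivot (e.g., a random point)
    d_pq = dist(D[p], q)
    improved = True
    while improved:
        improved = False
        for y in lists[p]:                       # walk p's list in increasing distance to p
            if dist(D[p], D[y]) > 2 * d_pq:      # stopping rule: no remaining y can beat p
                break
            if dist(D[y], q) < d_pq:             # found a closer point: it becomes the pivot
                p, d_pq = y, dist(D[y], q)
                improved = True
                break
    return p, d_pq

D = [(0.0, 0.0), (1.0, 1.0), (3.0, 2.0), (5.0, 5.0)]     # hypothetical toy data
lists = build_orchard_lists(D)
print(orchard_nn(D, lists, q=(2.5, 2.0)))                # -> (2, 0.5), i.e., point (3.0, 2.0)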

Similarity Search in Metric Space
• Orchard’s Algorithm, stop when 𝑑𝑖𝑠𝑡(𝑝, 𝑦) > 2 ⋅ 𝑑𝑖𝑠𝑡(𝑝, 𝑞)
• Why is it safe to stop? If 2 ⋅ 𝑑𝑖𝑠𝑡(𝑝, 𝑞) < 𝑑𝑖𝑠𝑡(𝑝, 𝑦), then with the triangle inequality 𝑑𝑖𝑠𝑡(𝑝, 𝑦) ≤ 𝑑𝑖𝑠𝑡(𝑝, 𝑞) + 𝑑𝑖𝑠𝑡(𝑦, 𝑞) we get 2 ⋅ 𝑑𝑖𝑠𝑡(𝑝, 𝑞) < 𝑑𝑖𝑠𝑡(𝑝, 𝑞) + 𝑑𝑖𝑠𝑡(𝑦, 𝑞) ⇔ 𝑑𝑖𝑠𝑡(𝑝, 𝑞) < 𝑑𝑖𝑠𝑡(𝑦, 𝑞)
•Since the list of 𝑝 is in increasing order of distance to 𝑝, 𝑑𝑖𝑠𝑡(𝑝, 𝑦′) > 2 ⋅ 𝑑𝑖𝑠𝑡(𝑝, 𝑞) holds for all the remaining 𝑦′ in the list, so none of them can be closer to 𝑞 than 𝑝.

None of the Above Works in
High Dimensional Space!

Curse of Dimensionality
•Refers to various phenomena that arise in high dimensional spaces that do not occur in low dimensional settings.
•Triangle inequality
• The pruning power reduces heavily
•What is the volume of a high dimensional “ring” (i.e., hyperspherical shell)?
• 𝑉𝑟𝑖𝑛𝑔(𝑤 = 1, 𝑑 = 2) / 𝑉𝑏𝑎𝑙𝑙(𝑟 = 10, 𝑑 = 2) = 19%
• 𝑉𝑟𝑖𝑛𝑔(𝑤 = 1, 𝑑 = 100) / 𝑉𝑏𝑎𝑙𝑙(𝑟 = 10, 𝑑 = 100) ≈ 99.997%
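A quick numerical check of this ratio, assuming the “ring” is the outermost shell of width 𝑤 of a ball of radius 𝑟, so that the volume ratio is 1 − ((𝑟 − 𝑤)/𝑟)^𝑑:

def ring_fraction(r, w, d):
    # Fraction of a d-dimensional ball of radius r taken up by its outer shell of width w.
    return 1.0 - ((r - w) / r) ** d

for d in (2, 10, 100):
    print(d, ring_fraction(10, 1, d))
# d = 100 gives ~0.99997 (99.997%): almost all of the volume lies in the thin outer shell.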

Approximate Nearest Neighbor Search in High Dimensional Space
•There is no sub-linear solution to find the exact result of a nearest neighbor query
• So we relax the condition
•approximate nearest neighbor search (ANNS)
• allow returned points to be not the NN of query
• Success: returns the true NN
• use success rate (e.g., the percentage of successful queries) to evaluate the method
• Hard to bound the success rate

c-approximate NN Search
•Success: returns 𝑜 such that
• 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≤ 𝑐 ⋅ 𝑑𝑖𝑠𝑡(𝑜∗, 𝑞)
[Figure: query 𝑞, its true NN 𝑜∗ at distance 𝑟, and the enlarged radius 𝑐𝑟]
•Then we can bound the success probability
• Usually noted as 1 − δ
•Solution: Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing
•Hash function
• Index: Map data/objects to values (e.g., hash keys)
• Same data ⇒ same hash key (with 100% probability)
• Different data ⇒ different hash keys (with high probability)
• Retrieval: Easy to retrieve identical objects (as they have the same hash key)
• Applications: hash map, hash join
• Low cost
• Space: 𝑂(𝑛)
• Time: 𝑂(1)
• Why can it not be used for nearest neighbor search?
• Even a minor difference leads to totally different hash keys

Locality Sensitive Hashing
•Index: make the hash functions error tolerant
• Similar data ⇒ same hash key (with high probability)
• Dissimilar data ⇒ different hash keys (with high probability)
• Retrieval:
• Compute the hash key for the query
• Obtain all the data that has the same key as the query (i.e., candidates)
• Find the nearest one to the query
• Cost:
• Space: 𝑂(𝑛)
• Time: 𝑂(1) + 𝑂(|𝑐𝑎𝑛𝑑|)
•This is not yet the real Locality Sensitive Hashing!
• We still have several unsolved issues…

LSH Functions
•Formal definition:
• Given points 𝑜1, 𝑜2, distances 𝑟1 < 𝑟2, and probabilities 𝑝1 > 𝑝2
• An LSH function h(⋅) should satisfy
• Pr[h(𝑜1) = h(𝑜2)] ≥ 𝑝1, if 𝑑𝑖𝑠𝑡(𝑜1, 𝑜2) ≤ 𝑟1
• Pr[h(𝑜1) = h(𝑜2)] ≤ 𝑝2, if 𝑑𝑖𝑠𝑡(𝑜1, 𝑜2) > 𝑟2
•What is h(⋅) for a given distance/similarity function?
• Jaccard similarity
• Angular distance
• Euclidean distance

MinHash – LSH Function for Jaccard Similarity
•Each data object is a set
• 𝐽𝑎𝑐𝑐𝑎𝑟𝑑(𝑆1, 𝑆2) = |𝑆1 ∩ 𝑆2| / |𝑆1 ∪ 𝑆2|
•Randomly generate a global order for all the elements in 𝐶 = ⋃𝑖=1..𝑛 𝑆𝑖
•Let h(𝑆) be the minimal member of 𝑆 with respect to the global order
• For example, 𝑆 = {𝑏, 𝑐, 𝑒, h, 𝑖}; if we use the inverse alphabetical order, then the re-ordered 𝑆 = {𝑖, h, 𝑒, 𝑐, 𝑏}, hence h(𝑆) = 𝑖.
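A minimal sketch of a single MinHash function (plain Python; the global order is drawn as a random rank per element, and the universe and sets are hypothetical):

import random

def make_minhash(universe, seed=0):
    # One MinHash function: a random global order over all elements of the universe.
    rng = random.Random(seed)
    order = {e: rng.random() for e in universe}          # random rank for every element
    return lambda S: min(S, key=lambda e: order[e])      # h(S): minimal member w.r.t. the order

universe = set("abcdefghij")                             # hypothetical universe of elements
S1, S2 = set("bcehi"), set("bdehi")                      # Jaccard(S1, S2) = 4/6
hashes = [make_minhash(universe, seed=s) for s in range(200)]
estimate = sum(h(S1) == h(S2) for h in hashes) / len(hashes)
print(estimate, len(S1 & S2) / len(S1 | S2))             # the estimate should be close to 4/6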
MinHash
•Now we compute Pr[h(𝑆1) = h(𝑆2)]
•Every element 𝑒 ∈ 𝑆1 ∪ 𝑆2 has an equal chance to be the first element among 𝑆1 ∪ 𝑆2 after re-ordering
•If that first element 𝑒 ∈ 𝑆1 ∩ 𝑆2, then h(𝑆1) = h(𝑆2)
•If 𝑒 ∉ 𝑆1 ∩ 𝑆2, then h(𝑆1) ≠ h(𝑆2)
•Pr[h(𝑆1) = h(𝑆2)] = |𝑆1 ∩ 𝑆2| / |𝑆1 ∪ 𝑆2| = 𝐽𝑎𝑐𝑐𝑎𝑟𝑑(𝑆1, 𝑆2)

SimHash – LSH Function for Angular Distance
•Each data object is a 𝑑 dimensional vector
• 𝜃(𝑥, 𝑦) is the angle between 𝑥 and 𝑦
•Randomly generate a normal vector 𝑎, where 𝑎𝑖 ~ 𝑁(0, 1)
•Let h(𝑥; 𝑎) = sgn(𝑎ᵀ𝑥)
• sgn(𝑜) = 1 if 𝑜 ≥ 0, and −1 if 𝑜 < 0
• i.e., which side of 𝑎’s corresponding hyperplane 𝑥 lies on

SimHash
•Now we compute Pr[h(𝑜1) = h(𝑜2)]
•h(𝑜1) ≠ h(𝑜2) iff 𝑜1 and 𝑜2 are on different sides of the hyperplane with 𝑎 as its normal vector
•Pr[h(𝑜1) = h(𝑜2)] = 1 − 𝜃/𝜋
[Figure: 𝑜1 and 𝑜2 separated by angle 𝜃, with the random hyperplane defined by normal vector 𝑎]

p-stable LSH – LSH Function for Euclidean Distance
•Each data object is a 𝑑 dimensional vector
• 𝑑𝑖𝑠𝑡(𝑥, 𝑦) = (Σ𝑖=1..𝑑 (𝑥𝑖 − 𝑦𝑖)²)^(1/2)
•Randomly generate a normal vector 𝑎, where 𝑎𝑖 ~ 𝑁(0, 1)
• The normal distribution is 2-stable, i.e., if 𝑎𝑖 ~ 𝑁(0, 1), then Σ𝑖=1..𝑑 𝑎𝑖 ⋅ 𝑥𝑖 ~ 𝑁(0, ‖𝑥‖2²)
•Let h(𝑥; 𝑎, 𝑏) = ⌊(𝑎ᵀ𝑥 + 𝑏) / 𝑤⌋, where 𝑏 ~ 𝑈(0, 𝑤) and 𝑤 is a user specified parameter
• Pr[h(𝑜1; 𝑎, 𝑏) = h(𝑜2; 𝑎, 𝑏)] = ∫0..𝑤 (1/𝑑𝑖𝑠𝑡(𝑜1, 𝑜2)) 𝑓𝑝(𝑡/𝑑𝑖𝑠𝑡(𝑜1, 𝑜2)) (1 − 𝑡/𝑤) d𝑡
• 𝑓𝑝(⋅) is the pdf of the absolute value of the normal variable

p-stable LSH
•Intuition of p-stable LSH
• Similar points have a higher chance of being hashed together
• (a code sketch of the SimHash and p-stable functions is given after the next two slides)

Pr[h(𝑥) = h(𝑦)] for different Hash Functions
[Figure: collision probability curves for MinHash, SimHash, and p-stable LSH]

Problem of Single Hash Function
•Hard to distinguish two pairs whose distances are close to each other
• Pr[h(𝑜1) = h(𝑜2)] ≥ 𝑝1, if 𝑑𝑖𝑠𝑡(𝑜1, 𝑜2) ≤ 𝑟1
• Pr[h(𝑜1) = h(𝑜2)] ≤ 𝑝2, if 𝑑𝑖𝑠𝑡(𝑜1, 𝑜2) > 𝑟2
•We also want to control where the drastic change happens…
• Close to 𝑑𝑖𝑠𝑡(𝑜∗, 𝑞)
• Or within a given range
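A minimal sketch of the SimHash and p-stable hash functions defined in the preceding slides (plain Python, standard library only; 𝑏 ~ 𝑈(0, 𝑤) is assumed for the p-stable case):

import random

def make_simhash(d, seed=0):
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]              # random normal vector a
    return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

def make_pstable(d, w, seed=0):
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]              # 2-stable (Gaussian) projection vector
    b = rng.uniform(0, w)                                # random offset, assumed b ~ U(0, w)
    return lambda x: int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)  # floor((a.x + b)/w)

h_sim = make_simhash(d=4, seed=1)
h_e2 = make_pstable(d=4, w=4.0, seed=2)
x = (1.0, 0.2, -0.5, 2.0)                                # hypothetical data vector
print(h_sim(x), h_e2(x))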

AND-OR Composition
•Recall for a single hash function, we have
• Pr[h(𝑜1) = h(𝑜2)] = 𝑝(𝑑𝑖𝑠𝑡(𝑜1, 𝑜2)), denoted as 𝑝𝑜1,𝑜2
•Now we consider two scenarios:
• Combine 𝑘 hashes together, using the AND operation
• One must match all the hashes
• Pr[𝐻𝐴𝑁𝐷(𝑜1) = 𝐻𝐴𝑁𝐷(𝑜2)] = 𝑝𝑜1,𝑜2^𝑘
• Combine 𝑙 hashes together, using the OR operation
• One needs to match at least one of the hashes
• Pr[𝐻𝑂𝑅(𝑜1) = 𝐻𝑂𝑅(𝑜2)] = 1 − (1 − 𝑝𝑜1,𝑜2)^𝑙
• They fail to match only when none of the hashes match

AND-OR Composition
•Example with minHash, 𝑘 = 5, 𝑙 = 5
[Figure: Pr[h(𝑜1) = h(𝑜2)], Pr[𝐻𝐴𝑁𝐷(𝑜1) = 𝐻𝐴𝑁𝐷(𝑜2)], and Pr[𝐻𝑂𝑅(𝑜1) = 𝐻𝑂𝑅(𝑜2)] plotted against the similarity of 𝑜1 and 𝑜2]

AND-OR Composition in LSH
•Let h𝑖,𝑗 be LSH functions, where 𝑖 ∈ {1, 2, …, 𝑙}, 𝑗 ∈ {1, 2, …, 𝑘}
•Let 𝐻𝑖(𝑜) = [h𝑖,1(𝑜), h𝑖,2(𝑜), …, h𝑖,𝑘(𝑜)]
• super-hash
•𝐻𝑖(𝑜1) = 𝐻𝑖(𝑜2) ⇔ ∀𝑗 ∈ {1, 2, …, 𝑘}, h𝑖,𝑗(𝑜1) = h𝑖,𝑗(𝑜2)
•Consider query 𝑞 and any data point 𝑜; 𝑜 is a nearest neighbor candidate of 𝑞 if
• ∃𝑖 ∈ {1, 2, …, 𝑙}, 𝐻𝑖(𝑜) = 𝐻𝑖(𝑞)
•The probability that 𝑜 is a nearest neighbor candidate of 𝑞 is
• 1 − (1 − 𝑝𝑞,𝑜^𝑘)^𝑙
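A minimal sketch of this AND-OR construction on top of MinHash (plain Python; data, seeds, and parameters are illustrative):

import random
from collections import defaultdict

def make_minhash(universe, seed):
    rng = random.Random(seed)
    order = {e: rng.random() for e in universe}
    return lambda S: min(S, key=lambda e: order[e])

def build_index(data, universe, k=5, l=5):
    # l hash tables; each key is a super-hash of k MinHash values (AND within, OR across tables).
    funcs = [[make_minhash(universe, seed=i * k + j) for j in range(k)] for i in range(l)]
    tables = [defaultdict(list) for _ in range(l)]
    for oid, S in data.items():
        for i in range(l):
            tables[i][tuple(h(S) for h in funcs[i])].append(oid)   # key = H_i(o)
    return funcs, tables

def candidates(q, funcs, tables):
    cand = set()
    for i, table in enumerate(tables):
        cand.update(table.get(tuple(h(q) for h in funcs[i]), []))  # match on any H_i
    return cand

universe = set("abcdefghij")                                       # hypothetical toy data
data = {"o1": set("abcde"), "o2": set("abcdf"), "o3": set("fghij")}
funcs, tables = build_index(data, universe)
print(candidates(set("abcdeg"), funcs, tables))   # typically contains "o1"; "o3" is very unlikely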

The Effectiveness of LSH
• 1 − (1 − 𝑝𝑞,𝑜^𝑘)^𝑙 changes with 𝑝𝑞,𝑜 (𝑘 = 5, 𝑙 = 5):

  𝑝𝑞,𝑜:                 0.2     0.4     0.6     0.7     0.8     0.9
  1 − (1 − 𝑝𝑞,𝑜^𝑘)^𝑙:   0.002   0.050   0.333   0.601   0.863   0.988

•E.g., we are expected to retrieve 98.8% of the data with Jaccard > 0.9
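These values can be reproduced with a few lines (assuming 𝑘 = 5, 𝑙 = 5 as in the earlier example):

k, l = 5, 5
for p in (0.2, 0.4, 0.6, 0.7, 0.8, 0.9):
    print(p, round(1 - (1 - p ** k) ** l, 3))   # reproduces the row of the table above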

False Positives and False Negatives
•False Positive:
• returned data with 𝑑𝑖𝑠𝑡(𝑜, 𝑞) > 𝑟2
•False Negative:
• not returned data with 𝑑𝑖𝑠𝑡(𝑜, 𝑞) < 𝑟1
•They can be controlled by carefully chosen 𝑘 and 𝑙
• It’s a trade-off between space/time and accuracy

The Framework of NNS using LSH
• Pre-processing
• Generate LSH functions
• minHash: random permutations
• simHash: random normal vectors
• p-stable: random normal vectors and random uniform values
• Index
• Compute 𝐻𝑖(𝑜) for each data object 𝑜, 𝑖 ∈ {1, …, 𝑙}
• Index 𝑜 using 𝐻𝑖(𝑜) as key in the 𝑖-th hash table
• Query
• Compute 𝐻𝑖(𝑞) for query 𝑞, 𝑖 ∈ {1, …, 𝑙}
• Generate the candidate set {𝑜 | ∃𝑖 ∈ {1, …, 𝑙}, 𝐻𝑖(𝑞) = 𝐻𝑖(𝑜)}
• Compute the actual distance for all the candidates and return the nearest one to the query

The Drawback of LSH
•Concatenating 𝑘 hashes is too “strong”
• a single mismatch h𝑖,𝑗(𝑞) ≠ h𝑖,𝑗(𝑜) for any 𝑗 already implies 𝐻𝑖(𝑞) ≠ 𝐻𝑖(𝑜)
•Not adaptive to the distribution of the distances
• What if there are not enough candidates?
• Need to tune 𝑤 (or build indexes with different 𝑤’s) to handle different cases

Multi-Probe LSH
• Observation:
• If 𝑞’s nearest neighbor does not fall into 𝑞’s hash bucket, then most likely it will fall into a bucket adjacent to 𝑞’s
• Why? Σ𝑖=1..𝑑 𝑎𝑖 ⋅ 𝑥𝑖 − Σ𝑖=1..𝑑 𝑎𝑖 ⋅ 𝑞𝑖 ~ 𝑁(0, ‖𝑥 − 𝑞‖2²)
•Idea:
• Look not only at the hash bucket where 𝑞 falls into, but also at those adjacent to it
• Problem:
• How many such buckets? 2𝑘
• And they are not equally important!

Multi-Probe LSH
•Consider the case when 𝑘 = 2:
•Note that 𝐻𝑖(𝑜) = (h𝑖,1(𝑜), h𝑖,2(𝑜))
•The ideal probe order would be:
• (h𝑖,1(𝑞), h𝑖,2(𝑞)): 0.315
• (h𝑖,1(𝑞), h𝑖,2(𝑞) − 1): 0.284
• (h𝑖,1(𝑞) + 1, h𝑖,2(𝑞)): 0.150
• (h𝑖,1(𝑞) − 1, h𝑖,2(𝑞)): 0.035
• (h𝑖,1(𝑞), h𝑖,2(𝑞) + 1): 0.019
• We don’t have to compute the integration; we can instead use the offsets between 𝑞’s projections and the bucket boundaries.
[Figure: probe probabilities of the buckets around 𝑞 in the (h𝑖,1, h𝑖,2) grid]

Multi-Probe LSH
• Pros:
• Requires a smaller 𝑙
• Because we use the hash tables more efficiently
• More robust against the unlucky points
• Cons:
• Loses the theoretical guarantee about the results
• Not parallel-friendly

Collision Counting LSH (C2LSH)
•C2LSH (SIGMOD’12 paper)
• Which one is closer to 𝑞?
[Example: the hash values of 𝒒, 𝑜1, and 𝑜2 under the 𝑚 hash functions; the point that collides with 𝑞 on more hash values is likely the closer one]
• We will omit the theoretical parts, which leads to a slightly different version from the paper.
• But the essential ideas are the same
• Project 1 is to implement C2LSH using PySpark!

Counting the Collisions
•Collision: a match on a single hash function
• Use the number of collisions to determine the candidates
• Matching one of the super-hashes with 𝑞 → colliding with 𝑞 on at least 𝛼𝑚 hash values
•Recall in LSH, the probability that 𝑜 with 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≤ 𝑟1 is a nearest neighbor candidate of 𝑞 is 1 − (1 − 𝑝1^𝑘)^𝑙
•Now we compute the case with collision counting...

The Collision Probability
• ∀𝑜 with 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≤ 𝑟1, we have
• Pr[#𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛(𝑜) ≥ 𝛼𝑚] = Σ𝑖=𝛼𝑚..𝑚 C(𝑚, 𝑖) 𝑝^𝑖 (1 − 𝑝)^(𝑚−𝑖)
• where 𝑝 = Pr[h𝑗(𝑜) = h𝑗(𝑞)] ≥ 𝑝1
• We define 𝑚 Bernoulli random variables 𝑋𝑖 with 1 ≤ 𝑖 ≤ 𝑚:
• Let 𝑋𝑖 equal 1 if 𝑜 does not collide with 𝑞 on h𝑖
• i.e., Pr[𝑋𝑖 = 1] = 1 − 𝑝
• Let 𝑋𝑖 equal 0 if 𝑜 collides with 𝑞 on h𝑖
• i.e., Pr[𝑋𝑖 = 0] = 𝑝
• Hence E[𝑋𝑖] = 1 − 𝑝
• Thus 𝐸(𝑋̄) = 1 − 𝑝, where 𝑋̄ = (Σ𝑖=1..𝑚 𝑋𝑖) / 𝑚
• Let 𝑡 = 𝑝 − 𝛼 > 0, we have:
• Pr[𝑋̄ − 𝐸[𝑋̄] ≥ 𝑡] = Pr[(Σ𝑖=1..𝑚 𝑋𝑖)/𝑚 − (1 − 𝑝) ≥ 𝑡] = Pr[Σ𝑋𝑖 ≥ (1 − 𝛼)𝑚]

The Collision Probability
• From Hoeffding’s Inequality, we have
• Pr[𝑋̄ − 𝐸[𝑋̄] ≥ 𝑡] = Pr[Σ𝑋𝑖 ≥ (1 − 𝛼)𝑚] ≤ exp(−2(𝑝 − 𝛼)²𝑚² / Σ𝑖=1..𝑚 (1 − 0)²) = exp(−2(𝑝 − 𝛼)²𝑚) ≤ exp(−2(𝑝1 − 𝛼)²𝑚)
• Since the event “#𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛(𝑜) ≥ 𝛼𝑚” is equivalent to the event “𝑜 misses the collision with 𝑞 less than (1 − 𝛼)𝑚 times”,
• Pr[#𝑐𝑜𝑙𝑙𝑖𝑠𝑖𝑜𝑛(𝑜) ≥ 𝛼𝑚] = Pr[Σ𝑋𝑖 < (1 − 𝛼)𝑚] ≥ 1 − exp(−2(𝑝1 − 𝛼)²𝑚)
• Now you can compute the case for 𝑜 with 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≥ 𝑟2 in a similar way...
• Then we can accordingly set 𝛼 to control false positives and false negatives

Virtual Rehashing
•When we are not getting enough candidates...
• E.g., # of candidates < top-k
• Observation:
• A close point 𝑜 usually falls into adjacent hash buckets of 𝑞’s if it does not collide with 𝑞
• Why? Σ𝑖=1..𝑑 𝑎𝑖 ⋅ 𝑥𝑖 − Σ𝑖=1..𝑑 𝑎𝑖 ⋅ 𝑞𝑖 ~ 𝑁(0, ‖𝑥 − 𝑞‖2²)
• Idea:
• Include the adjacent hash buckets into consideration
• So you don’t need to re-hash the data again...

Virtual Rehashing
•At first, consider h(𝑜) = h(𝑞)
[Example: the hash values of 𝒒 (all 1) and of 𝑜 (1, 1, 0, 2, −1, 1, 2, 4, −3, 0, −1); relaxing the collision condition counts more of 𝑜’s hash values as collisions]
•Consider h(𝑜) = h(𝑞) ± 1 if there are not enough candidates
•Then h(𝑜) = h(𝑞) ± 2, and so on...

The Framework of NNS using C2LSH
• Pre-processing
• Generate LSH functions
• Random normal vectors and random uniform values
• Index
• Compute and store h𝑖(𝑜) for each data object 𝑜, 𝑖 ∈ {1, …, 𝑚}
• Query
• Compute h𝑖(𝑞) for query 𝑞, 𝑖 ∈ {1, …, 𝑚}
• Take those 𝑜 that share at least 𝛼𝑚 hash values with 𝑞 as candidates
• Relax the collision condition (i.e., virtual rehashing) and repeat the above step, until we get enough candidates

Pseudo code of Candidate Generation in C2LSH
candGen(data_hashes, query_hashes, 𝛼𝑚, 𝛽𝑛):
    offset ← 0
    cand ← ∅
    while true:
        for each (id, hashes) in data_hashes:
            if count(hashes, query_hashes, offset) ≥ 𝛼𝑚:
                cand ← cand ∪ {id}
        if |cand| < 𝛽𝑛:
            offset ← offset + 1
        else:
            break
    return cand

Pseudo code of Candidate Generation in C2LSH
count(hashes_1, hashes_2, offset):
    counter ← 0
    for each (hash1, hash2) in (hashes_1, hashes_2):
        if |hash1 − hash2| ≤ offset:
            counter ← counter + 1
    return counter

Project 1
•Spec has been released, deadline: 18 Jul, 2020
• Late Penalty: 10% on day 1 and 30% on each subsequent day.
•Implement a light version of C2LSH (i.e., the one we introduced in the lecture)
•Start working ASAP
•Evaluation: Correctness and Efficiency
•Must use PySpark; some Python modules and PySpark functions are banned.
• E.g., numpy, pandas, collect(), take(), ...
• Use transformations!

Project 1
•There will be a bonus part (max 20 points) to encourage efficient implementations.
• Details in the spec
•Make sure you have valid output
•Make your own test cases; a real dataset would be more desirable
• The toy example in the spec is a real “toy” (e.g., for babies...)
•We won’t accept excuses like “it works on my own computer”
•Don’t violate the Student Conduct!!!

Product Quantization and K-Means Clustering

Recall: NNS in High Dimensional Euclidean Space
•Naïve (but exact) solution:
• Linear scan: compute 𝑑𝑖𝑠𝑡(𝑜, 𝑞) for all 𝑜 ∈ 𝐷
• 𝑑𝑖𝑠𝑡(𝑜, 𝑞) = (Σ𝑖=1..𝑑 (𝑜𝑖 − 𝑞𝑖)²)^(1/2)
• 𝑂(𝑛𝑑) time
• 𝑛 times (𝑑 subtractions + 𝑑 − 1 additions + 𝑑 multiplications)
• Storage is also costly: 𝑂(𝑛𝑑)
• Could be problematic in DBMS and distributed systems
•This motivates the idea of compression

Vector Quantization
•Idea: compressed representation of vectors
• Each vector 𝑜 is represented by a representative
• Denoted as 𝑄𝑍(𝑜)
• We will discuss how to get the representatives later
• We control the total number of representatives for the dataset (denoted as 𝑘)
• One representative represents multiple vectors
• Instead of storing 𝑜, we store its representative id
• 𝑑 floats => 1 integer
• Instead of computing 𝑑𝑖𝑠𝑡(𝑜, 𝑞), we compute 𝑑𝑖𝑠𝑡(𝑄𝑍(𝑜), 𝑞)
• We only need 𝑘 computations of distance!

How to Generate Representatives
•Assigning representatives is essentially a partition problem
• Construct a “good” partition of a database of 𝑛 objects into a set of 𝑘 clusters
•How to measure the “goodness” of a given partitioning scheme?
• Cost of a cluster
• 𝐶𝑜𝑠𝑡(𝐶𝑖) = Σ𝑜𝑗∈𝐶𝑖 ‖𝑜𝑗 − 𝑐𝑒𝑛𝑡𝑒𝑟(𝐶𝑖)‖²
• Cost of 𝑘 clusters: sum of 𝐶𝑜𝑠𝑡(𝐶𝑖)

Partitioning Problem: Basic Concept
•It’s an optimization problem!
•Global optimum:
• NP-hard (for a wide range of cost functions)
• Requires exhaustively enumerating all possible partitions of the 𝑛 objects into 𝑘 groups
• Their number is the Stirling number of the second kind: S(𝑛, 𝑘) ~ 𝑘ⁿ/𝑘! when 𝑛 → ∞
•Heuristic methods:
• k-means
• Many variants

The k-Means Clustering Method
• Given 𝑘, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets (randomly)
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when the assignment does not change
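A minimal sketch of these four steps (plain Python, Euclidean distance, hypothetical 2-D toy data; empty clusters are not handled in this sketch):

import random

def centroid(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(data, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    assign = [i % k for i in range(len(data))]               # Step 1: k nonempty subsets
    rng.shuffle(assign)
    for _ in range(max_iters):
        centers = [centroid([o for o, c in zip(data, assign) if c == j])   # Step 2: centroids
                   for j in range(k)]
        new_assign = [min(range(k), key=lambda j: sq_dist(o, centers[j]))  # Step 3: reassign
                      for o in data]
        if new_assign == assign:                             # Step 4: stop when nothing changes
            break
        assign = new_assign
    return centers, assign

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(data, k=2))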


An Example of k-Means Clustering
[Figure: k-means iterations on a toy data set with K = 2 — the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; update the cluster centroids again; loop if needed]
◼ Repeat
◼ Compute centroid (i.e., mean point) for each partition
◼ Assign each object to the cluster of its nearest centroid
◼ Until no change

Vector Quantization
•Encode the vectors
•Generate a codebook 𝑊 = {𝑐1, …, 𝑐𝑘} via k-means
• Assign 𝑜 to its nearest codeword in 𝑊
• E.g., 𝑄𝑍(𝑜) = 𝑐𝑖, 𝑖 ∈ {1 … 𝑘}, such that 𝑑𝑖𝑠𝑡(𝑜, 𝑐𝑖) ≤ 𝑑𝑖𝑠𝑡(𝑜, 𝑐𝑗) ∀𝑗
• Represent each vector 𝑜 by its assigned codeword
•Assume 𝑑 = 256, 𝑘 = 2^16
• Before: 4 bytes * 256 = 1024 bytes for each vector
• Now:
• data: 16 bits = 2 bytes per vector
• codebook: 4 * 256 * 2^16 bytes

Vector Quantization – Query Processing
•Given query 𝑞, how to find a point close to 𝑞?
• Algorithm:
• Compute 𝑄𝑍(𝑞)
• Candidate set 𝐶 = all data vectors associated with 𝑄𝑍(𝑞)
• Inverted index: a hash table that maps each codeword 𝑐𝑗 to the list of 𝑜𝑖 that are associated with 𝑐𝑗
• Verification: compute the distance between 𝑞 and each 𝑜𝑖 ∈ 𝐶
• Requires loading the vectors in 𝐶
• Any problem/improvement?
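A minimal sketch of this query procedure with an inverted index (plain Python; the codebook shown would normally come from k-means, and all names and data are illustrative):

from collections import defaultdict

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_codeword(o, codebook):
    return min(range(len(codebook)), key=lambda i: sq_dist(o, codebook[i]))

def build_vq_index(data, codebook):
    # Inverted index: codeword id -> list of vector ids assigned to it.
    inv = defaultdict(list)
    for oid, o in data.items():
        inv[nearest_codeword(o, codebook)].append(oid)
    return inv

def vq_query(q, data, codebook, inv):
    cell = nearest_codeword(q, codebook)            # QZ(q)
    cand = inv.get(cell, [])                        # all vectors associated with QZ(q)
    return min(cand, key=lambda oid: sq_dist(q, data[oid]), default=None)  # verification

codebook = [(1.0, 1.0), (8.0, 8.0)]                 # e.g., produced by k-means (k = 2)
data = {"o1": (0.9, 1.2), "o2": (1.1, 0.8), "o3": (7.9, 8.2)}
inv = build_vq_index(data, codebook)
print(vq_query((1.0, 0.9), data, codebook, inv))    # -> "o2"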

Limitations of VQ
•To achieve better accuracy, a fine-grained quantizer with large 𝑘 is needed
•Large 𝑘
• Costly to run k-means
• Computing 𝑄𝑍(𝑞) is expensive: 𝑂(𝑘𝑑)
• May need to look beyond the 𝑄𝑍(𝑞) cell
• Solution:
• Product Quantization

Product Quantization
• Idea
• Partition the dimensions into 𝑚 partitions
• Accordingly, a vector => 𝑚 subvectors
• Use a separate VQ with 𝑘 codewords for each chunk
• Example:
• 8-dim vector decomposed into 𝑚 = 2 subvectors
• Each codebook has 𝑘 = 4 codewords (i.e., 𝑐𝑖,𝑗)
• Total space in bits:
• Data: 𝑛 ⋅ 𝑚 ⋅ log(𝑘)
• Codebook: 𝑚 ⋅ (𝑑/𝑚) ⋅ 𝑘 ⋅ 32

Example of PQ
[Figure: an 8-dim vector such as (2, 4, 6, 5, −2, 6, 4, 1) or (1, 2, 1, 4, 9, −1, 2, 0) is split into two 4-dim subvectors; each subvector is replaced by the id of its nearest codeword among 𝑐1,0 … 𝑐1,3 (first half) and 𝑐2,0 … 𝑐2,3 (second half), with codewords such as (2.1, 3.6, 5.3, 6.6), (1.2, 1.5, 2.4, 3.3), and (3.3, 4.1, 2.7, 1.4); the resulting PQ codes are 2-bit ids per subvector, e.g., 00, 01, 11]
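A minimal sketch of PQ encoding in the spirit of this example (plain Python; the codebooks below are illustrative stand-ins for what k-means would produce on each chunk):

def split(v, m):
    # Split a vector into m equal-length subvectors.
    step = len(v) // m
    return [v[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pq_encode(v, codebooks):
    # Replace each subvector by the id of its nearest codeword in the matching codebook.
    return tuple(min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
                 for sub, cb in zip(split(v, len(codebooks)), codebooks))

# m = 2 codebooks with k = 4 codewords each (illustrative values)
codebooks = [
    [(2.1, 3.6, 5.3, 6.6), (1.0, 2.0, 1.5, 4.0), (6.0, 6.0, 6.0, 6.0), (0.0, 0.0, 0.0, 0.0)],
    [(1.2, 1.5, 2.4, 3.3), (3.3, 4.1, 2.7, 1.4), (-2.0, 6.0, 4.0, 1.0), (9.0, -1.0, 2.0, 0.0)],
]
print(pq_encode((2, 4, 6, 5, -2, 6, 4, 1), codebooks))   # -> (0, 2)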

Distance Estimation
• Euclidean distance between a query point 𝑞 and a data point encoded as 𝑡
• Restore the virtual joint center by looking up each partition of 𝑡 in the corresponding codebooks => 𝑝
• 𝑑²(𝑞, 𝑡) = Σ𝑖=1..𝑑 (𝑞𝑖 − 𝑝𝑖)²
• Known as Asymmetric Distance Computation (ADC)
• 𝑑²(𝑞, 𝑡) = Σ𝑖=1..𝑚 ‖𝑞(𝑖) − 𝑐𝑖,𝑡(𝑖)‖², where 𝑞(𝑖) is the 𝑖-th subvector of 𝑞 and 𝑡(𝑖) is its 𝑖-th codeword id
[Figure: query 𝑞 = (1, 3, 5, 4, −3, 7, 3, 2); a data point encoded as 𝑡 = (01, 00) is restored, by looking up its two codeword ids in the codebooks 𝑐1,0 … 𝑐1,3 and 𝑐2,0 … 𝑐2,3, to the virtual center 𝑝 = (1.2, 1.5, 2.4, 3.3, 3.3, 4.1, 2.7, 1.4)]
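A minimal sketch of ADC (plain Python, illustrative codebooks): the per-chunk distances from 𝑞 to every codeword are computed once, and then each encoded point only costs 𝑚 table lookups.

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def split(v, m):
    step = len(v) // m
    return [v[i * step:(i + 1) * step] for i in range(m)]

def adc_tables(q, codebooks):
    # For each chunk i, the squared distance from q's i-th subvector to every codeword.
    return [[sq_dist(q_sub, c) for c in cb]
            for q_sub, cb in zip(split(q, len(codebooks)), codebooks)]

def adc(code, tables):
    # Asymmetric (squared) distance: sum of one table lookup per chunk.
    return sum(tables[i][code[i]] for i in range(len(code)))

codebooks = [
    [(2.1, 3.6, 5.3, 6.6), (1.0, 2.0, 1.5, 4.0)],      # illustrative codebooks, m = 2
    [(1.2, 1.5, 2.4, 3.3), (3.3, 4.1, 2.7, 1.4)],
]
tables = adc_tables((1, 3, 5, 4, -3, 7, 3, 2), codebooks)
print(adc((0, 1), tables), adc((1, 0), tables))        # ADC of two encoded points t1, t2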

Query Processing
•Compute ADC for every point in the database
• How?
•Candidates = the points with the 𝑙 smallest ADs
•[Optional] Reranking (if 𝑙 > 1):
• Load the data vectors and compute the actual Euclidean distance
• Return the one with the smallest distance

Query Processing
[Figure: query 𝑞 = (1, 3, 5, 4, −3, 7, 3, 2) and two encoded points 𝑡1 = (01, 00) and 𝑡2 = (11, 10); their asymmetric distances are ‖𝑞(1) − 𝑐1,𝑡1(1)‖² + ‖𝑞(2) − 𝑐2,𝑡1(2)‖² and ‖𝑞(1) − 𝑐1,𝑡2(1)‖² + ‖𝑞(2) − 𝑐2,𝑡2(2)‖², obtained by looking up the codebooks 𝑐1,0 … 𝑐1,3 and 𝑐2,0 … 𝑐2,3]

Framework of PQ
• Pre-processing:
• Step 1: partition data vectors
• Step 2: generate codebooks (e.g., via k-means)
• Step 3: encode data
• Query
• Step 1: compute the distances between 𝑞 and the codewords
• Step 2: compute the AD for each point and return the candidates
• Step 3: re-ranking (optional)