Recall: NNS in High Dimensional Euclidean Space
• Naïve (but exact) solution:
• Linear scan: compute 𝑑𝑖𝑠𝑡(𝑜, 𝑞) for all 𝑜 ∈ 𝐷
• dist(o, q) = √( Σ_{i=1}^{d} (o_i − q_i)² )
• 𝑂(𝑛𝑑)
• 𝑛 times (𝑑 subtractions + 𝑑 − 1 additions + 𝑑 multiplications)
• Storage is also costly: 𝑂(𝑛𝑑)
• Could be problematic in DBMS and distributed systems
• This motivates the idea of compression
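A minimal sketch of this linear-scan baseline in NumPy (the data sizes and names are illustrative assumptions, not from the slides):

```python
import numpy as np

def linear_scan_nn(D, q):
    """Exact NN via brute force: O(n*d) time, with D stored as n*d floats."""
    diffs = D - q                           # n*d subtractions
    dists = np.sum(diffs * diffs, axis=1)   # n*d multiplications + additions
    return int(np.argmin(dists)), float(np.sqrt(dists.min()))

# Illustrative usage: n = 10_000 vectors, d = 256 dimensions.
D = np.random.rand(10_000, 256).astype(np.float32)
q = np.random.rand(256).astype(np.float32)
idx, dist = linear_scan_nn(D, q)
```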
Vector Quantization
•Idea: compressed representation of vectors
• Each vector 𝑜 is represented by a representative
  • Denoted as 𝑄𝑍(𝑜)
• We will discuss how to get the representatives later
• We control the total number of representatives for the dataset (denoted as 𝑘)
• One representative represents multiple vectors
• Instead of storing 𝑜, we store the id of its representative
• 𝑑 floats => 1 integer
• Instead of computing 𝑑𝑖𝑠𝑡(𝑜, 𝑞), we compute 𝑑𝑖𝑠𝑡(𝑄𝑍(𝑜), 𝑞)
• We only need 𝑘 distance computations!
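A minimal sketch of this idea, assuming a codebook of 𝑘 representatives and a per-vector id array are already available (all names and sizes are illustrative):

```python
import numpy as np

# Assumed inputs: a codebook of k representatives, and for each data vector
# only the integer id of its representative (1 int instead of d floats).
k, d, n = 256, 128, 10_000
codebook = np.random.rand(k, d).astype(np.float32)
codes = np.random.randint(0, k, size=n)
q = np.random.rand(d).astype(np.float32)

# Only k distance computations: distance from q to every representative...
dist_to_codewords = np.linalg.norm(codebook - q, axis=1)

# ...then dist(QZ(o), q) for any data vector o is a single table lookup.
approx_dists = dist_to_codewords[codes]
```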
How to Generate Representatives
•Assigning representatives is essentially a partition problem
• Construct a “good” partition of a database of 𝑛 objects into a set of 𝑘 clusters
•How to measure the “goodness” of a given partitioning scheme?
• Cost of a cluster
  • Cost(C_i) = Σ_{o_j ∈ C_i} ‖o_j − center(C_i)‖²
• Cost of 𝑘 clusters: Σ_{i=1}^{k} Cost(C_i)
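A minimal sketch of this cost function, with the partition represented as an assignment array (names and sizes are illustrative):

```python
import numpy as np

def clustering_cost(X, assign, centers):
    """Sum over clusters C_i of the squared distances to center(C_i)."""
    total = 0.0
    for i, c in enumerate(centers):
        members = X[assign == i]              # objects o_j in cluster C_i
        total += np.sum((members - c) ** 2)   # Cost(C_i)
    return total

# Illustrative usage: 1000 points in 2-D partitioned into k = 3 clusters.
X = np.random.rand(1000, 2)
assign = np.random.randint(0, 3, size=1000)
centers = np.array([X[assign == i].mean(axis=0) for i in range(3)])
print(clustering_cost(X, assign, centers))
```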
Partitioning Problem: Basic Concept
• It's an optimization problem!
• Global optimum:
  • NP-hard (for a wide range of cost functions)
  • Requires exhaustively enumerating all S(n, k) partitions
    • S(n, k) is the Stirling number of the second kind
    • S(n, k) ~ k^n / k! when n → ∞
• Heuristic methods:
  • k-means
  • Many variants
The k-Means Clustering Method
• Given 𝑘, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets (randomly)
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when the assignment does not change
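A minimal NumPy sketch of these four steps (parameters such as max_iters are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: partition objects into k subsets (randomly).
    assign = rng.integers(0, k, size=n)
    for _ in range(max_iters):
        # Step 2: compute seed points as the centroids (mean points) of the
        # clusters of the current partitioning (re-seed any empty cluster).
        centroids = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                              else X[rng.integers(n)] for i in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # Step 4: stop when the assignment does not change.
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return centroids, assign

# Illustrative usage with k = 2, as in the example on the next slide.
X = np.random.rand(1000, 2)
centroids, assign = kmeans(X, k=2)
```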
An Example of k-Means Clustering (k = 2)
[Figure: starting from the initial data set, arbitrarily partition the objects into k groups and update the cluster centroids; then repeat the following until no assignment changes: compute the centroid (i.e., mean point) of each partition, reassign each object to the cluster of its nearest centroid, and update the cluster centroids.]
Vector Quantization
• Encode the vectors
  • Generate a codebook W = {c_1, …, c_k} via k-means
  • Assign o to its nearest codeword in W
    • E.g., QZ(o) = c_i, i ∈ 1…k, such that dist(o, c_i) ≤ dist(o, c_j) ∀j
  • Represent each vector o by its assigned codeword
• Assume d = 256, k = 2^16
  • Before: 4 bytes * 256 = 1024 bytes for each vector
  • Now:
    • data: 16 bits = 2 bytes per vector
    • codebook: 4 * 256 * 2^16 bytes
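A minimal sketch of the encoding step, here using scikit-learn's KMeans to build the codebook (sizes are kept smaller than the slide's d = 256, k = 2^16 example so it runs quickly; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sizes (the slide's example uses d = 256 and k = 2**16).
n, d, k = 10_000, 32, 256
X = np.random.rand(n, d).astype(np.float32)

# Generate a codebook W = {c_1, ..., c_k} via k-means,
# and assign each vector to its nearest codeword.
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(X)
codebook = km.cluster_centers_            # k x d floats
codes = km.labels_.astype(np.uint8)       # one small integer id per vector

# Storage comparison in bytes, mirroring the slide's arithmetic.
before = n * d * 4                        # n vectors of d float32 values
after_data = n * 1                        # one 8-bit code per vector (k = 256)
after_codebook = k * d * 4                # the codebook itself
```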
Vector Quantization – Query Processing
• Given query 𝑞, how to find a point close to 𝑞?
• Algorithm:
• Compute 𝑄𝑍(𝑞)
• Candidate set 𝐶 = all data vectors associated with 𝑄𝑍(𝑞)
• Verification: compute distance between 𝑞 and 𝑜𝑖 ∈ 𝐶
• Requires loading the vectors in C
• Any problem/improvement?
Inverted index: a hash table that maps c_j to the list of o_i that are associated with c_j
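A minimal sketch of this query path with an inverted index (the codebook here is simply sampled data points rather than k-means centroids; all names and sizes are illustrative):

```python
import numpy as np
from collections import defaultdict

n, d, k = 2_000, 32, 64
X = np.random.rand(n, d).astype(np.float32)
# Codebook: k sampled data points stand in for k-means centroids in this sketch.
codebook = X[np.random.choice(n, k, replace=False)]
codes = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)

# Inverted index: maps codeword id c_j -> list of ids of the o_i assigned to it.
inverted = defaultdict(list)
for i, c in enumerate(codes):
    inverted[int(c)].append(i)

def vq_query(q):
    # Compute QZ(q): the nearest codeword to the query.
    qz = int(np.linalg.norm(codebook - q, axis=1).argmin())
    # Candidate set C = all data vectors associated with QZ(q).
    cand = inverted[qz]
    # Verification: load the candidate vectors and compute exact distances.
    dists = np.linalg.norm(X[cand] - q, axis=1)
    return cand[int(dists.argmin())]

q = np.random.rand(d).astype(np.float32)
best = vq_query(q)
```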
Limitations of VQ
• To achieve better accuracy, a fine-grained quantizer with large 𝑘 is needed
•Large 𝑘
• Costly to run K-means
• Computing 𝑄𝑍(𝑞) is expensive: 𝑂(𝑘𝑑)
• May need to look beyond the 𝑄𝑍(𝑞) cell
• Solution:
• Product Quantization
Product Quantization
• Idea
  • Partition the d dimensions into m partitions
    • Accordingly, a vector => m subvectors
  • Use a separate VQ with k codewords for each chunk
• Example:
  • 8-dim vector decomposed into m = 2 subvectors
  • Each codebook has k = 4 codewords (i.e., c_{i,j})
• Total space in bits:
  • Data: n · m · log₂(k)
  • Codebook: m · (d/m) · k · 32 = d · k · 32
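A minimal sketch of PQ training and encoding, one k-means codebook per chunk, following the slide's m = 2, k = 4 example (scikit-learn's KMeans is an assumed helper; all names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

n, d, m, k = 1_000, 8, 2, 4            # 8-dim vectors, m = 2 chunks, k = 4 codewords
sub = d // m                           # dimensions per subvector
X = np.random.rand(n, d).astype(np.float32)

# One codebook per chunk; each subvector is encoded by its nearest codeword id.
codebooks = []                         # m codebooks c_{i,j}, each of shape k x (d/m)
codes = np.empty((n, m), dtype=np.uint8)
for i in range(m):
    chunk = X[:, i * sub:(i + 1) * sub]
    km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(chunk)
    codebooks.append(km.cluster_centers_)
    codes[:, i] = km.labels_

# Total space in bits, mirroring the slide's formulas.
data_bits = n * m * int(np.log2(k))    # n * m * log2(k)
codebook_bits = d * k * 32             # m * (d/m) * k * 32
```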
Example of PQ
[Figure: each 8-dimensional data vector, e.g., (2, 4, 6, 5, −2, 6, 4, 1) and (1, 2, 1, 4, 9, −1, 2, 0), is split into two 4-dimensional subvectors; each subvector is replaced by the 2-bit id (e.g., 00, 01, 11) of its nearest codeword in the corresponding codebook, with codewords c_{1,0}, …, c_{1,3} for the first half and c_{2,0}, …, c_{2,3} for the second half (codeword values such as (2.1, 3.6, 5.3, 6.6), (1.2, 1.5, 2.4, 3.3), (3.3, 4.1, 2.7, 1.4)).]
Distance Estimation
[Figure: query q = (1, 3, 5, 4, −3, 7, 3, 2) and a data point encoded as t.]
• Euclidean distance between a query point q and a data point encoded as t
• Restore the virtual joint center by looking up each partition of t in the corresponding codebooks => p
• d²(q, t) = Σ_{i=1}^{d} (q_i − p_i)²
• Known as Asymmetric Distance Computation (ADC)
• d²(q, t) = Σ_{i=1}^{m} ‖q_(i) − c_{i, t(i)}‖², where q_(i) is the i-th subvector of q
[Figure: t = (01, 00); looking up each part of t in the codebooks gives c_{1,1} = (1.2, 1.5, 2.4, 3.3) and c_{2,0} = (3.3, 4.1, 2.7, 1.4), so the restored virtual joint center is p = (1.2, 1.5, 2.4, 3.3, 3.3, 4.1, 2.7, 1.4).]
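A minimal sketch of ADC: precompute, per chunk, the squared distances from the query subvector to every codeword, then sum one table lookup per chunk for each encoded point (names and sizes are illustrative):

```python
import numpy as np

n, m, k, sub = 1_000, 2, 4, 4            # 2 chunks, 4 codewords each, 4 dims per chunk
codebooks = [np.random.rand(k, sub).astype(np.float32) for _ in range(m)]
codes = np.random.randint(0, k, size=(n, m))           # PQ codes t for n data points
q = np.random.rand(m * sub).astype(np.float32)

# Lookup tables: tables[i][j] = || q_(i) - c_{i,j} ||^2 for chunk i, codeword j.
tables = np.stack([
    np.sum((codebooks[i] - q[i * sub:(i + 1) * sub]) ** 2, axis=1)
    for i in range(m)
])                                        # shape (m, k)

# ADC: d^2(q, t) = sum over chunks of tables[i][t(i)] -- one lookup per chunk.
adc = tables[np.arange(m), codes].sum(axis=1)          # shape (n,)
```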
Query Processing
•Compute ADC for every point in the database
• How?
• Candidates = the points with the 𝑙 smallest ADs
•[Optional] Reranking (if 𝑙 > 1):
• Load the data vectors and compute the actual Euclidean distances
• Return the one with the smallest distance
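A minimal sketch of the candidate selection and optional reranking steps, assuming the asymmetric distances have already been computed as above (names are illustrative; the adc array here is a random stand-in):

```python
import numpy as np

n, d, l = 1_000, 8, 10
X = np.random.rand(n, d).astype(np.float32)    # original vectors, needed for reranking
q = np.random.rand(d).astype(np.float32)
adc = np.random.rand(n).astype(np.float32)     # stand-in for the per-point ADC values

# Candidates = the l points with the smallest asymmetric distances.
cand = np.argpartition(adc, l)[:l]

# [Optional] Reranking: load the candidate vectors, compute the actual
# Euclidean distances, and return the one with the smallest distance.
exact = np.linalg.norm(X[cand] - q, axis=1)
best = int(cand[exact.argmin()])
```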
Query Processing
[Figure: for query q = (1, 3, 5, 4, −3, 7, 3, 2) and two encoded data points t1 and t2 (two 2-bit codes each, e.g., 01, 00, 11, 10), the asymmetric distances are
‖q_(1) − c_{1, t1(1)}‖² + ‖q_(2) − c_{2, t1(2)}‖² and
‖q_(1) − c_{1, t2(1)}‖² + ‖q_(2) − c_{2, t2(2)}‖²,
with each term looked up in the codebooks c_{1,0}, …, c_{1,3} and c_{2,0}, …, c_{2,3}.]
Framework of PQ
• Pre-processing:
  • Step 1: partition data vectors
  • Step 2: generate codebooks (e.g., k-means)
  • Step 3: encode data
• Query
  • Step 1: compute distance between q and codewords
  • Step 2: compute AD for each point and return the candidates
  • Step 3: re-ranking (optional)