Clustering
Reference book: Bishop: “Pattern Recognition and Machine Learning” Chapter 9.1
Clustering
• Unsupervised learning
• Generating “classes”
• Distance/similarity measures • Agglomerative methods
• Divisive methods
Unsupervised Learning
• No labels/responses. Finding structure in data. • Dimensionality Reduction.
Clustering Subspace Learning 𝑇:R$ → {1,2,…,𝐾} 𝑇:R$ → R-,𝑚 < 𝐷
Uses of Unsupervised Learning • Data compression
RGB
(12,165,73) (11,167,79) :
Labels
3 43
:
Dictionary
1 ~ (10, 160, 70) 2 ~ (40, 240, 20) :
Uses of Unsupervised Learning
• To improve classification/regression (semi-supervised learning)
1. Fromunlabeleddata,learnagoodfeature𝑇:R$→R-. 2. Tolabeleddata,applytransformation𝑇:R$→R-.
𝑇𝒙2 ,𝒚2 ,..., 𝑇𝒙4 ,𝒚4
3. Performclassification/regressionontransformedlow- dimensional data.
What is Clustering?
• Unsupervised learning - no information from teacher
• The process of partitioning a set of data into a set of meaningful
(hopefully) sub-classes, called clusters • Cluster:
• collection of data points that are “similar” to one another and collectively should be treated as group (it is not the group we
learned in linear algebra)
• as a collection, are sufficiently different from other groups
What is Clustering
Clustering Problem.
Input. Training data 𝒮4 = 𝒙2,𝒙7,...,𝒙4 , 𝒙8 ∈ R$. Integer 𝐾
Output. Clusters 𝒞2,𝒞7,...,𝒞; ⊂ 1,2,...,𝑁 such that every data point is in one and only one cluster.
Clusterrepresentatives 𝝁2,...,𝝁;
Some clusters could be empty!
Clusters
How to Specify a Cluster
• By listing all its elements
𝒞2 = 1,3,4,7,8,11 𝒞7 = 2,5,6,9,10
1
8
7
4
3 11
10
6 9
5 2
How to Specify a Cluster
• Using a representative
a. A point in center of cluster (centroid) b. A point in the training data (exemplar)
Each point 𝒙G will be assigned the closest representative.
1
8
7
4
3 11
10
6 9
5 2
1
8
7
4
3 11
10
6 9
5 2
centroid
exemplar
Voronoi Diagram
We can partition all the points in the space into regions, according to their closest representative.
Distance/Similarity Measures
(sometimes called loss functions)
A measure of how close two data points are. Nearby points (i.e. distance is small) are more likely to belong to the same cluster.
• Euclidean Distance
dist𝒙,𝒚=𝒙−𝒚7∶= N𝑥G−𝑦G
8
7 GO2
Distance/Similarity Measures
A measure of how alike two data points are. Similar points (i.e. similarity is large) are more likely to belong to the same cluster.
𝒙T𝒚 𝒙 ‖𝒚‖
• Cosine Similarity cos 𝒙, 𝒚 =
Distance (Similarity) Matrix
• Similarity (Distance) Matrix
• based on the distance or similarity measure we can construct a symmetric matrix of distance (or similarity values)
• (i, j)th entry in the matrix is the distance (similarity) between items i and j
I1 I2 ! In I1
I2 "
In
d =similarity(ordistance)ofD toD ij ij
• d12 ! d1n d21 •!d2n
""•" dn1 dn2 ! •
Note that dij = dji (i.e., the matrix is symmetric). So, we only need the lower triangle part of the matrix.
The diagonal is all 1’s (similarity) or all 0’s (distance)
Example: Term Similarities in Documents
T1
T2
T3
T4
T5
T6
T7
T8
Doc1
0
4
0
0
0
2
1
3
Doc2
3
1
4
3
1
2
0
1
Doc3
3
0
0
0
3
0
3
0
Doc4
0
1
0
3
0
0
2
0
Doc5
2
2
2
3
1
4
0
2
sim(T,T)=å(w ×w ) i j k=1 ik jk
N
Term-Term Similarity Matrix
T1
T2
T3
T5
T7
T2
T3
T4
T5
7
16
15
8
12
18
T4
T6
T6
T7
14
14
9
3
18
6
16
6
18
6
7
6
17
0
8
6
9
9
3
2
T8
16
3
Similarity (Distance) Thresholds
• A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar
T1
T2
T3
T4
T5
T6
T7
T2
7
T3
16
8
T4
15
12
18
T5
14
3
6
6
T6
14
18
16
18
6
T7
9
6
0
6
9
2
T8
7
17
8
9
3
16
3
Using a threshold value of 10 in the previous example
T1
T2
T3
T4
T5
T6
T7
T2
0
T3
1
0
T4
1
1
1
T5
1
0
0
0
T6
1
1
1
1
0
T7
0
0
0
0
0
0
T8
0
1
0
0
0
1
0
Graph Representation
• The similarity matrix can be visualized as an undirected graph • each item is represented by a node, and edges represent the fact
that two items are similar (a 1 in the similarity threshold matrix) T1 T3
T1
T2
T3
T4
T5
T6
T7
T2
0
T3
1
0
T4
1
1
1
T5
1
0
0
0
T6
1
1
1
1
0
T7
0
0
0
0
0
0
T8
0
1
0
0
0
1
0
T5
T4
T2
If no threshold is used, then matrix can be represented as a weighted graph
T6
T7 T8
Training Loss
Sum of squared distances to closest representative.
loss≈11× 1 7 =11
assume length of each edge is about 1
Training Loss
Sum of squared distances to closest representative (cluster center).
loss≈9× 0.1 7 =0.09
assume length of each edge is about 0.1
Training loss
Optimizing both clusters and representatives.
where
4;
L𝑹,𝜧;𝒮8 =NN𝑟8^ 𝒙8−𝝁^ 8O2 ^O2
7
𝒮8 = 𝒙2,𝒙7,...,𝒙4 , 𝒙G∈R$
𝑟8^ ∈ 0,1 denotes which of the 𝐾 clusters data point 𝒙8 is assigned
to.If𝒙8 isassignedtocluster𝑘then𝑟8^ =1,and𝑟8a =0if𝑗≠𝑘. 𝑹 = 𝑟8^ ∈ 0,1 8×^,𝑛 = 1,...,𝑁,𝑘 = 1,...,𝐾
𝜧= 𝝁2,...,𝝁; e
Goal: find values for 𝑹 and 𝜧 so as to minimize L.
Basic Clustering Methodology Two approaches:
Agglomerative: pairs of items/clusters are successively linked to produce larger clusters
Divisive (partitioning): items are initially placed in one cluster and then divided into separate groups
Hierarchical/Agglomerative Methods
• Based on some methods of representing hierarchy of data points
• One idea: hierarchical dendogram (connects points based on similarity)
T1
T2
T3
T4
T5
T6
T7
T2
7
T3
16
8
T4
15
12
18
T5
14
3
6
6
T6
14
18
16
18
6
T7
9
6
0
6
9
2
T8
7
17
8
9
3
16
3
T5 T1 T3 T4 T2 T6 T8 T7
Hierarchical Agglomerative
• Compute distance matrix
• Put each data point in its own cluster
• Find most similar pair of clusters
• merge pairs of clusters (show merger in dendogram) • update similarity matrix
• repeat until all data points are in one cluster
K-Means
Optimization Algorithm Goal. Minimize L 𝑥, 𝑦 .
4;
L𝑹,𝜧;𝒮8 =NN𝑟8^ 𝒙8−𝝁^ 8O2 ^O2
7
Coordinate Descent (Optimization).
Repeat until convergence:
1. Find optimal 𝑥 while holding 𝑦 constant. 2. Find optimal 𝑦 while holding 𝑥 constant.
Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function.
Optimization Algorithm Coordinate Descent (Optimization)
Repeat until convergence:
• Find best clusters given centres • Find best centres given clusters
4;
L𝑹,𝜧;𝒮8 =NN𝑟8^ 𝒙8−𝝁^ 8O2 ^O2
7
Optimization Algorithm
1. Initializecenters𝝁2,...,𝝁;fromthedata.
2. Repeatuntilnofurtherchangeintrainingloss:
a. For each 𝑛 ∈ {1,...,𝑁},
𝑟 =f1,if𝑘=argmin 𝒙8−𝝁a 8^ a
7
0, otherwise.
We assign the 𝑛th data point to the closest cluster centre.
b. Foreach𝑗∈{1,...,𝑘}, ∑8𝑟8^𝒙8
𝝁^= ∑8𝑟8^
We recompute cluster means.
Convergence
• Training loss always decreases in each step (coordinate descent).
• Converges to local minimum, not necessarily global minimum.
Repeat algorithm over many initial points, and pick the configuration with the smallest training loss.
An example – kmeans clustering
• Suppose we have 4 boxes of different sizes and we want to
divide them into 2 classes
• Each box represents one point with two attributes (w,h):
AB
D C
• Initial centers: suppose we choose points A and B as the initial centers, so c1 = (10, 10) and c2 = (20, 10)
• Object - centre distance: calculate the Euclidean distance between cluster centres and the objects. For example, the distance of object C from the first center is:
40−10 7+ 30−10 7 =36.06
• We obtain the following distance matrix:
Centre 1 Centre 2
0
10
36.06
50
10
0
28.28
43.43
• Object clustering: We assign each object to one of the clusters based on the minimum distance from the centre:
Centre 1 Centre 2
• Determine centres: Based on the group membership, we compute the new centers
• 𝑐2 = 10, 10 , 𝑐7 = 7stustvs , 2stwstus = (36.7, 26.7) ww
1
0
0
0
0
1
1
1
• Recompute the object-centre distances: We compute the distances of each data point from the new centres:
Centre 1
Centre 2
• Object clustering: We reassign the objects to the clusters based on the minimum distance from the centre:
Centre 1 Centre 2
• Determine the new centres:
0
10
36.06
50
31.4
23.6
4.7
18.9
1
1
0
0
0
0
1
1
𝑐2 = 10+20,10+10 =(15,10) 22
𝑐7 = 40+50,30+40 =(45,35) 22
• Recompute the object-centres distances:
5
5
32
46.1
43
35.4
7.1
7.1
Centre 1 Centre 2
• Object clustering:
Centre 1 Centre 2
• The cluster membership did not change from one iteration to another and so the k-means computation terminates.
1
1
0
0
0
0
1
1
Discussion
Initialization
• Empty clusters
o Pick data points to initialize clusters • Bad local minima
o Initialize many times and pick solution with smallest training loss
o Pick good starting positions
Initialization
Starting position of centroids
Final position of centroids
Problem. How to choose good starting positions? Solution. Place them far apart with high probability.
Number of Clusters
Number of Clusters
How do we choose 𝑘, the optimal number of clusters?
• Elbow method • Training Loss
• Validation Loss
Elbow
Training Loss
Check your understanding
• Clustering gives you an idea of how the data is distributed.
• You have 1000 data vectors that are very similar to each other (e.g., Eu distance less than 0.0001). You can only divide them into a few clusters.
• When you use kmeans, you usually obtain a global minimum of the loss function
• When you use kmeans, you can never achieve global minimum of the loss function.
• The number of clusters is a parameter that can be optimized by kmeans.
• In K-fold cross validation, K is a hyperparameter.
• Agglomerative clustering as we introduced in this lecture, has a fixed clustering result given distance measurement and the number of clusters desired, and tie breaker.