Clustering
Reference book: Bishop: “Pattern Recognition and Machine
Learning” Chapter 9.1
Clustering
• Unsupervised learning
• Generating “classes”
• Distance/similarity measures
• Agglomerative methods
• Divisive methods
Unsupervised Learning
• No labels/responses. Finding structure in data.
• Dimensionality Reduction.
Clustering: 𝑇: ℝ^D → {1, 2, …, 𝐾}
Subspace Learning: 𝑇: ℝ^D → ℝ^m, 𝑚 < 𝐷
Uses of Unsupervised Learning
• Data compression
RGB values: (12, 165, 73), (11, 167, 79), …
Labels: 3, 43, …
Dictionary: 1 ~ (10, 160, 70), 2 ~ (40, 240, 20), …
Each pixel is stored as the label of its cluster; the dictionary maps labels back to representative RGB values.
Uses of Unsupervised Learning
• To improve classification/regression (semi-supervised learning)
1. From unlabeled data, learn a good feature transform 𝑇: ℝ^D → ℝ^m.
2. To the labeled data, apply the transformation 𝑇: ℝ^D → ℝ^m, giving
(𝑇(𝒙_1), 𝒚_1), …, (𝑇(𝒙_N), 𝒚_N).
3. Perform classification/regression on transformed low-
dimensional data.
What is Clustering?
• Unsupervised learning - no information from teacher
• The process of partitioning a set of data into a set of meaningful
(hopefully) sub-classes, called clusters
• Cluster:
• a collection of data points that are “similar” to one another and should
collectively be treated as a group (not the “group” we learned about in
linear algebra)
• as a collection, sufficiently different from other groups
What is Clustering
Clustering Problem.
Input. Training data 𝒮_N = {𝒙_1, 𝒙_2, …, 𝒙_N}, 𝒙_i ∈ ℝ^D, and an integer 𝐾.
Output. Clusters 𝒞_1, 𝒞_2, …, 𝒞_K ⊂ {1, 2, …, 𝑁} such that
every data point is in one and only one cluster.
Cluster representatives 𝝁_1, …, 𝝁_K.
(Note: some clusters could be empty!)
Clusters
How to Specify a Cluster
• By listing all its elements
𝒞_1 = {1, 3, 4, 7, 8, 11}
𝒞_2 = {2, 5, 6, 9, 10}
[Figure: eleven data points labelled 1–11, split into the two clusters listed above]
How to Specify a Cluster
• Using a representative
a. A point in center of cluster (centroid)
b. A point in the training data (exemplar)
[Figure: the same eleven points shown twice; one panel marks the centroid of each cluster, the other marks an exemplar]
Each point 𝒙_n will be assigned to the closest representative.
We can partition all the points in the space into regions,
according to their closest representative.
Voronoi Diagram
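A minimal NumPy sketch of this assignment rule (the toy points and representative positions below are illustrative, not from the slides):

```python
import numpy as np

# Toy data: six points in R^2 and two representatives (illustrative values).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [3.0, 3.1], [2.9, 2.8], [3.2, 3.0]])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])

# Distance from every point to every representative, shape (N, K).
d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)

# Each point is assigned the index of its closest representative;
# the regions where each index "wins" are exactly the Voronoi cells.
assignment = d.argmin(axis=1)
print(assignment)   # -> [0 0 0 1 1 1]
```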
Distance/Similarity Measures
(sometimes called loss functions)
A measure of how close two data points are.
Nearby points (i.e. distance is small) are
more likely to belong to the same cluster.
• Euclidean Distance
dist(𝒙, 𝒚) = ‖𝒙 − 𝒚‖_2 := √( Σ_{i=1}^{D} (x_i − y_i)² )
Distance/Similarity Measures
A measure of how alike two data points are.
Similar points (i.e. similarity is large) are
more likely to belong to the same cluster.
• Cosine Similarity: cos(𝒙, 𝒚) = 𝒙ᵀ𝒚 / (‖𝒙‖ ‖𝒚‖)
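Both measures follow directly from the definitions above. A minimal NumPy sketch with made-up vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

# Euclidean distance: square root of the sum of squared coordinate differences.
dist = np.sqrt(np.sum((x - y) ** 2))          # equivalently np.linalg.norm(x - y)

# Cosine similarity: inner product divided by the product of the norms.
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(dist, cos_sim)
```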
Distance (Similarity) Matrix
d_ij = similarity (or distance) of D_i to D_j
Note that d_ij = d_ji (i.e., the matrix is
symmetric), so we only need the lower
triangular part of the matrix.
The diagonal is all 1’s (similarity) or all 0’s
(distance)
        I1     I2     ⋯    In
 I1     –    d_12    ⋯   d_1n
 I2    d_21    –     ⋯   d_2n
 ⋮       ⋮     ⋮     ⋱     ⋮
 In    d_n1  d_n2    ⋯     –
• Similarity (Distance) Matrix
• based on the distance or similarity measure, we can construct a
symmetric matrix of distance (or similarity) values
• (i, j)th entry in the matrix is the distance (similarity) between
items i and j
Example: Term Similarities in Documents
T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2
sim(T_i, T_j) = Σ_{k=1}^{N} (w_ik × w_jk)
where w_ik is the weight of term T_i in document k, and N is the number of documents.
T1 T2 T3 T4 T5 T6 T7
T2 7
T3 16 8
T4 15 12 18
T5 14 3 6 6
T6 14 18 16 18 6
T7 9 6 0 6 9 2
T8 7 17 8 9 3 16 3
Term-Term
Similarity Matrix
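The whole matrix is just an inner product of term columns. A small NumPy sketch using the document–term weights from the table above (variable names are ours):

```python
import numpy as np

# Rows = Doc1..Doc5, columns = T1..T8 (the weights from the table above).
W = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# sim(Ti, Tj) = sum_k w_ik * w_jk, i.e. entry (i, j) of W^T W.
S = W.T @ W
print(S[1, 0])   # sim(T2, T1) = 7, matching the matrix above
```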
Similarity (Distance) Thresholds
• A similarity (distance) threshold may be used to mark pairs that
are “sufficiently” similar
T1 T2 T3 T4 T5 T6 T7
T2 7
T3 16 8
T4 15 12 18
T5 14 3 6 6
T6 14 18 16 18 6
T7 9 6 0 6 9 2
T8 7 17 8 9 3 16 3
T1 T2 T3 T4 T5 T6 T7
T2 0
T3 1 0
T4 1 1 1
T5 1 0 0 0
T6 1 1 1 1 0
T7 0 0 0 0 0 0
T8 0 1 0 0 0 1 0
Using a threshold
value of 10 in the
previous example
Graph Representation
• The similarity matrix can be visualized as an undirected graph
• each item is represented by a node, and edges represent the fact
that two items are similar (a 1 in the similarity threshold matrix)
T1 T2 T3 T4 T5 T6 T7
T2 0
T3 1 0
T4 1 1 1
T5 1 0 0 0
T6 1 1 1 1 0
T7 0 0 0 0 0 0
T8 0 1 0 0 0 1 0
[Figure: threshold graph on nodes T1–T8; an edge joins each pair of terms with a 1 in the thresholded matrix]
If no threshold is used, then the
matrix can be represented as
a weighted graph.
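A short sketch of the thresholding step (same toy weights as above, NumPy assumed); with a threshold of 10, the nonzero entries of A reproduce the 1-entries of the thresholded matrix:

```python
import numpy as np

W = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])
S = W.T @ W                   # term-term similarity matrix
np.fill_diagonal(S, 0)        # ignore self-similarity

# Adjacency matrix of the threshold graph: an edge wherever similarity > 10.
A = (S > 10).astype(int)
edges = [(i + 1, j + 1) for i in range(8) for j in range(i + 1, 8) if A[i, j]]
print(edges)                  # the term pairs that are "sufficiently" similar
```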
Training Loss
Sum of squared distances to closest representative.
loss ≈ 11 × 1² = 11
(assume the length of each edge from a point to its closest representative is about 1)
Training Loss
Sum of squared distances to closest representative (cluster
center).
loss ≈ 9 × 0.1² = 0.09
(assume the length of each edge from a point to its closest representative is about 0.1)
Training loss
Optimizing both clusters and representatives.
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖²
where
𝒮_N = {𝒙_1, 𝒙_2, …, 𝒙_N}, 𝒙_i ∈ ℝ^D,
r_nk ∈ {0, 1} denotes which of the 𝐾 clusters data point 𝒙_n is assigned
to: if 𝒙_n is assigned to cluster 𝑘 then r_nk = 1, and r_nj = 0 for 𝑗 ≠ 𝑘,
𝑹 = (r_nk) ∈ {0, 1}^{N×K}, 𝑛 = 1, …, 𝑁, 𝑘 = 1, …, 𝐾,
𝜧 = (𝝁_1, …, 𝝁_K).
Goal: find values for 𝑹 and 𝜧 so as to minimize ℒ.
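The loss is a direct sum over the one-hot assignment matrix. A minimal NumPy sketch (the toy points, assignments, and centres are illustrative):

```python
import numpy as np

def training_loss(X, R, M):
    """L(R, M) = sum_n sum_k r_nk * ||x_n - mu_k||^2."""
    # Squared distance from every point to every centre, shape (N, K).
    sq_dists = np.sum((X[:, None, :] - M[None, :, :]) ** 2, axis=2)
    return np.sum(R * sq_dists)

# Toy example: 3 points, 2 clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
M = np.array([[0.0, 0.5], [4.0, 4.0]])
R = np.array([[1, 0], [1, 0], [0, 1]])   # one-hot assignments
print(training_loss(X, R, M))            # 0.25 + 0.25 + 0.0 = 0.5
```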
Basic Clustering Methodology
Two approaches:
Agglomerative: pairs of items/clusters are successively linked to produce
larger clusters
Divisive (partitioning): items are initially placed in one cluster and then
divided into separate groups
Hierarchical/Agglomerative Methods
T1 T2 T3 T4 T5 T6 T7
T2 7
T3 16 8
T4 15 12 18
T5 14 3 6 6
T6 14 18 16 18 6
T7 9 6 0 6 9 2
T8 7 17 8 9 3 16 3
• Based on some methods of representing hierarchy of
data points
• One idea: a hierarchical dendrogram (connects points based on similarity)
[Figure: dendrogram over the terms, with leaf order T5, T1, T3, T4, T2, T6, T8, T7]
Hierarchical Agglomerative
• Compute distance matrix
• Put each data point in its own cluster
• Find most similar pair of clusters
• merge the pair of clusters (show the merger in the dendrogram)
• update the similarity matrix
• repeat until all data points are in one cluster (see the sketch below)
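A sketch of this bottom-up procedure using SciPy's hierarchical-clustering utilities (assuming SciPy is available; the toy points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: each row is a data point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.8], [9.0, 0.5]])

# Build the merge tree bottom-up; single linkage always merges the closest pair.
Z = linkage(X, method="single", metric="euclidean")

# Cut the tree so that at most 3 clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 3]
```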
K-Means
Optimization Algorithm
Goal. Minimize ℒ(𝑥, 𝑦).
Coordinate Descent (Optimization).
Repeat until convergence:
1. Find optimal 𝑥 while holding 𝑦 constant.
2. Find optimal 𝑦 while holding 𝑥 constant.
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖²
Coordinate descent is an optimization algorithm that
successively minimizes along coordinate directions to find
the minimum of a function.
Optimization Algorithm
Coordinate Descent (Optimization)
Repeat until convergence:
• Find best clusters given centres
• Find best centres given clusters
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖²
Optimization Algorithm
1. Initialize centers 𝝁_1, …, 𝝁_K from the data.
2. Repeat until no further change in training loss:
a. For each 𝑛 ∈ {1, … , 𝑁},
r_nk = 1 if 𝑘 = argmin_j ‖𝒙_n − 𝝁_j‖², and r_nk = 0 otherwise.
We assign the 𝑛th data point to the closest cluster centre.
b. For each 𝑘 ∈ {1, …, 𝐾},
𝝁_k = ( Σ_n r_nk 𝒙_n ) / ( Σ_n r_nk )
We recompute cluster means.
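The two steps translate almost line by line into NumPy. A minimal sketch (function and variable names are ours; it does not handle the empty-cluster case discussed later):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centres with K distinct data points.
    M = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 2a (assignment): r_nk = 1 for the closest centre.
        d2 = np.sum((X[:, None, :] - M[None, :, :]) ** 2, axis=2)
        labels = d2.argmin(axis=1)
        # Step 2b (update): each centre becomes the mean of its assigned points.
        new_M = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_M, M):   # no further change in the centres
            break
        M = new_M
    return labels, M
```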
Convergence
• Training loss always decreases in each step (coordinate
descent).
• Converges to local minimum, not necessarily global
minimum.
Repeat the algorithm from many initial configurations, and pick the configuration with the smallest training loss.
An example – kmeans clustering
• Suppose we have 4 boxes of different sizes and we want to
divide them into 2 classes
• Each box represents one point with two attributes (w,h):
A = (10, 10), B = (20, 10), C = (40, 30), D = (50, 40)
• Initial centers: suppose we choose points A and B as the
initial centers, so c1 = (10, 10) and c2 = (20, 10)
• Object - centre distance: calculate the Euclidean distance
between cluster centres and the objects. For example, the
distance of object C from the first center is:
√((40 − 10)² + (30 − 10)²) ≈ 36.06
• We obtain the following distance matrix:
            A      B      C      D
Centre 1:   0     10     36.06   50
Centre 2:  10      0     28.28   42.43
• Object clustering: We assign each object to one of the
clusters based on the minimum distance from the
centre:
            A   B   C   D
Centre 1:   1   0   0   0
Centre 2:   0   1   1   1
• Determine centres: Based on the group membership, we compute the new centres:
𝑐_1 = (10, 10)
𝑐_2 = ((20 + 40 + 50)/3, (10 + 30 + 40)/3) = (36.7, 26.7)
• Recompute the object–centre distances: We compute the distances of each data point from the new centres:
            A      B      C      D
Centre 1:   0     10     36.06   50
Centre 2:  31.4   23.6    4.7    18.9
• Object clustering: We reassign the objects to the clusters based on the minimum distance from the centre:
            A   B   C   D
Centre 1:   1   1   0   0
Centre 2:   0   0   1   1
• Determine the new centres:
𝑐_1 = ((10 + 20)/2, (10 + 10)/2) = (15, 10)
𝑐_2 = ((40 + 50)/2, (30 + 40)/2) = (45, 35)
• Recompute the object–centre distances:
            A      B      C      D
Centre 1:   5      5     32     46.1
Centre 2:  43     35.4    7.1    7.1
• Object clustering:
            A   B   C   D
Centre 1:   1   1   0   0
Centre 2:   0   0   1   1
• The cluster membership did not change from one iteration to the next, so the k-means computation terminates.
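The whole worked example can be reproduced in a few lines of NumPy (same four points and the same A, B initialization; variable names are ours):

```python
import numpy as np

# The four boxes from the example, as (w, h) points: A, B, C, D.
X = np.array([[10, 10], [20, 10], [40, 30], [50, 40]], dtype=float)
centres = X[[0, 1]].copy()          # initialize with A and B, as in the example

for _ in range(10):
    # Assignment step: index of the nearest centre for each point.
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each centre becomes the mean of its assigned points.
    new_centres = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centres, centres):
        break
    centres = new_centres

print(labels)    # [0 0 1 1]  -> {A, B} and {C, D}
print(centres)   # [[15. 10.] [45. 35.]]
```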
Discussion
Initialization
• Empty clusters
o Pick data points to initialize clusters
• Bad local minima
o Initialize many times and pick solution with smallest training
loss
o Pick good starting positions
Initialization
[Figure: starting positions of the centroids vs. their final positions after k-means]
Problem. How to choose good starting positions?
Solution. Place them far apart with high probability.
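One common way to place the centres far apart with high probability is k-means++-style seeding: sample each new centre with probability proportional to its squared distance from the centres already chosen. A minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np

def init_far_apart(X, K, seed=0):
    """k-means++-style seeding: each new centre is sampled with probability
    proportional to its squared distance from the centres chosen so far."""
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]           # first centre: uniform at random
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)
```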
Number of Clusters
How do we choose 𝑘, the optimal number of clusters?
• Elbow method
• Training Loss
• Validation Loss
[Figure: training loss plotted against the number of clusters; the “elbow” of the curve suggests a good choice of K]
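A sketch of the elbow method using scikit-learn's KMeans (assumed available here); its inertia_ attribute is the training loss ℒ defined earlier:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs in R^2.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

losses = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    losses.append(km.inertia_)        # inertia_ = sum of squared distances to centres

for k, loss in zip(range(1, 9), losses):
    print(k, round(loss, 1))          # the drop flattens after k = 3 (the elbow)
```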
Check your understanding
• Clustering gives you an idea of how the data is distributed.
• You have 1000 data vectors that are very similar to each other
(e.g., Euclidean distance less than 0.0001). You can only divide them
into a few clusters.
• When you use kmeans, you usually obtain a global minimum of
the loss function
• When you use kmeans, you can never achieve the global minimum
of the loss function.
• The number of clusters is a parameter that can be optimized by
kmeans.
• In K-fold cross validation, K is a hyperparameter.
• Agglomerative clustering, as introduced in this lecture, gives a fixed
clustering result once the distance measure, the desired number of
clusters, and the tie-breaking rule are fixed.