Clustering
Reference book: Bishop: “Pattern Recognition and Machine
Learning” Chapter 9.1
Clustering
• Unsupervised learning
• Generating “classes”
• Distance/similarity measures
• Agglomerative methods
• Divisive methods
Unsupervised Learning
• No labels/responses. Finding structure in data.
• Dimensionality Reduction.
Clustering: 𝑇 : ℝ^D → {1, 2, …, K}
Subspace learning: 𝑇 : ℝ^D → ℝ^m, m < D
Uses of Unsupervised Learning
• Data compression
Example: raw RGB pixel values such as (12, 165, 73), (11, 167, 79), … are
stored as cluster labels (3, 43, …) together with a dictionary of
representatives, e.g. 1 ~ (10, 160, 70), 2 ~ (40, 240, 20), …
Uses of Unsupervised Learning
• To improve classification/regression (semi-supervised learning)
1. From unlabeled data, learn a good feature map 𝑇 : ℝ^D → ℝ^m.
2. Apply the transformation 𝑇 to the labeled data:
   (𝑇(𝒙_1), 𝒚_1), …, (𝑇(𝒙_N), 𝒚_N)
3. Perform classification/regression on the transformed low-dimensional
   data (see the sketch below).
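A minimal sketch of this recipe, assuming scikit-learn and using PCA as the unsupervised feature map 𝑇 (the lecture does not fix a particular map); the arrays X_unlabeled, X_labeled and y are made-up placeholders.

# Sketch: learn T from unlabeled data, then classify in the reduced space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 50))   # many unlabeled points in R^D
X_labeled = rng.normal(size=(40, 50))       # a few labeled points
y = rng.integers(0, 2, size=40)             # their labels

T = PCA(n_components=5).fit(X_unlabeled)    # 1. learn T from unlabeled data
Z = T.transform(X_labeled)                  # 2. apply T to the labeled data
clf = LogisticRegression().fit(Z, y)        # 3. classify on the transformed data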
What is Clustering?
• Unsupervised learning - no information from teacher
• The process of partitioning a set of data into a set of meaningful
(hopefully) sub-classes, called clusters
• Cluster:
• collection of data points that are “similar” to one another and
collectively should be treated as a group (this is not the “group” we
learned about in linear algebra)
• as a collection, are sufficiently different from other groups
What is Clustering
Clustering Problem.
Input. Training data 𝒮_N = {𝒙_1, 𝒙_2, …, 𝒙_N}, 𝒙_n ∈ ℝ^D.
       Integer K.
Output. Clusters 𝒞_1, 𝒞_2, …, 𝒞_K ⊂ {1, 2, …, N} such that every data
        point is in one and only one cluster. (Some clusters could be empty!)
        Cluster representatives 𝝁_1, …, 𝝁_K.
Clusters
How to Specify a Cluster
• By listing all its elements
𝒞_1 = {1, 3, 4, 7, 8, 11}
𝒞_2 = {2, 5, 6, 9, 10}
(Figure: eleven labelled data points in the plane, grouped into these two clusters.)
How to Specify a Cluster
• Using a representative
a. A point in center of cluster (centroid)
b. A point in the training data (exemplar)
(Figure: the same eleven points shown twice, once with a centroid
representative per cluster and once with an exemplar representative per
cluster.)
Each point 𝒙_i will be assigned to the closest representative.
We can partition all the points in the space into regions,
according to their closest representative.
Voronoi Diagram
Distance/Similarity Measures
(sometimes called loss functions)
A measure of how close two data points are.
Nearby points (i.e. distance is small) are
more likely to belong to the same cluster.
• Euclidean Distance
dist(𝒙, 𝒚) = ‖𝒙 − 𝒚‖_2 := ( Σ_{i=1}^{D} (x_i − y_i)^2 )^{1/2}
Distance/Similarity Measures
A measure of how alike two data points are.
Similar points (i.e. similarity is large) are
more likely to belong to the same cluster.
• Cosine Similarity: cos(𝒙, 𝒚) = 𝒙^T 𝒚 / (‖𝒙‖ ‖𝒚‖)
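As a small illustration (NumPy assumed; the vectors x and y are made up), both measures can be computed directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 0.0])

# Euclidean distance: square root of the sum of squared coordinate differences.
dist = np.sqrt(np.sum((x - y) ** 2))        # equivalently np.linalg.norm(x - y)

# Cosine similarity: inner product divided by the product of the norms.
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(dist, cos_sim)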
Distance (Similarity) Matrix
d_ij = similarity (or distance) of item D_i to item D_j
Note that dij = dji (i.e., the matrix is
symmetric). So, we only need the lower
triangle part of the matrix.
The diagonal is all 1’s (similarity) or all 0’s
(distance)
         I_1    I_2   ⋯    I_n
  I_1          d_12   ⋯    d_1n
  I_2   d_21          ⋯    d_2n
   ⋮      ⋮      ⋮     ⋱     ⋮
  I_n   d_n1   d_n2   ⋯
• Similarity (Distance) Matrix
• based on the distance or similarity measure we can construct a
symmetric matrix of distance (or similarity values)
• (i, j)th entry in the matrix is the distance (similarity) between
items i and j
Example: Term Similarities in Documents
T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2
sim(T_i, T_j) = Σ_{k=1}^{N} (w_ik × w_jk)

where w_ik is the entry for term T_i in document k of the table above, and
N is the number of documents.
Term-Term Similarity Matrix:

       T1   T2   T3   T4   T5   T6   T7
  T2    7
  T3   16    8
  T4   15   12   18
  T5   14    3    6    6
  T6   14   18   16   18    6
  T7    9    6    0    6    9    2
  T8    7   17    8    9    3   16    3
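The term-term matrix above can be reproduced from the document-term counts; a minimal NumPy sketch (variable names are my own):

import numpy as np

# Rows = Doc1..Doc5, columns = T1..T8, copied from the example table.
W = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# sim(T_i, T_j) = sum_k w_ik * w_jk, i.e. the inner product of columns i and j.
S = W.T @ W
print(S[0, 1])   # sim(T1, T2) = 7, matching the matrix above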
Similarity (Distance) Thresholds
• A similarity (distance) threshold may be used to mark pairs that
are “sufficiently” similar
Similarity matrix:

       T1   T2   T3   T4   T5   T6   T7
  T2    7
  T3   16    8
  T4   15   12   18
  T5   14    3    6    6
  T6   14   18   16   18    6
  T7    9    6    0    6    9    2
  T8    7   17    8    9    3   16    3

Thresholded matrix (using a threshold value of 10 on the matrix above):

       T1   T2   T3   T4   T5   T6   T7
  T2    0
  T3    1    0
  T4    1    1    1
  T5    1    0    0    0
  T6    1    1    1    1    0
  T7    0    0    0    0    0    0
  T8    0    1    0    0    0    1    0
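A sketch of the thresholding step (NumPy assumed); S is the term-term similarity matrix from the example written out in full symmetric form, with the diagonal left at 0:

import numpy as np

S = np.array([
    [ 0,  7, 16, 15, 14, 14,  9,  7],
    [ 7,  0,  8, 12,  3, 18,  6, 17],
    [16,  8,  0, 18,  6, 16,  0,  8],
    [15, 12, 18,  0,  6, 18,  6,  9],
    [14,  3,  6,  6,  0,  6,  9,  3],
    [14, 18, 16, 18,  6,  0,  2, 16],
    [ 9,  6,  0,  6,  9,  2,  0,  3],
    [ 7, 17,  8,  9,  3, 16,  3,  0],
])

A = (S >= 10).astype(int)   # 1 where a pair of terms is "sufficiently" similar
np.fill_diagonal(A, 0)      # ignore self-similarity
print(A)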
Graph Representation
• The similarity matrix can be visualized as an undirected graph
• each item is represented by a node, and edges represent the fact
that two items are similar (a 1 in the similarity threshold matrix)
       T1   T2   T3   T4   T5   T6   T7
  T2    0
  T3    1    0
  T4    1    1    1
  T5    1    0    0    0
  T6    1    1    1    1    0
  T7    0    0    0    0    0    0
  T8    0    1    0    0    0    1    0

(Figure: the corresponding undirected graph on nodes T1–T8, with an edge
wherever the thresholded matrix contains a 1; T7 has no edges.)
If no threshold is used, then the matrix can be represented as a weighted
graph.
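As a sketch of the graph view (NumPy assumed), the thresholded matrix can be read as an adjacency matrix, and a depth-first search then recovers the connected components; the helper names below are my own:

import numpy as np

terms = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
# Symmetric adjacency matrix from the thresholded similarities (1 = edge).
A = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

# Depth-first search to collect connected components of the graph.
seen, components = set(), []
for start in range(len(terms)):
    if start in seen:
        continue
    stack, comp = [start], []
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        comp.append(terms[v])
        stack.extend(u for u in range(len(terms)) if A[v, u] and u not in seen)
    components.append(comp)

print(components)   # T7 ends up in a component of its own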
Training Loss
Sum of squared distances to closest representative.
loss ≈ 11 × 1^2 = 11
(assume the length of each edge is about 1)
Training Loss
Sum of squared distances to closest representative (cluster
center).
loss ≈ 9 × 0.1^2 = 0.09
(assume the length of each edge is about 0.1)
Training loss
Optimizing both clusters and representatives.
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖^2

where
𝒮_N = {𝒙_1, 𝒙_2, …, 𝒙_N}, 𝒙_n ∈ ℝ^D
r_nk ∈ {0, 1} denotes which of the K clusters data point 𝒙_n is assigned
to. If 𝒙_n is assigned to cluster k then r_nk = 1, and r_nj = 0 if j ≠ k.
𝑹 = (r_nk) ∈ {0, 1}^{N×K}, n = 1, …, N, k = 1, …, K
𝜧 = (𝝁_1, …, 𝝁_K)

Goal: find values for 𝑹 and 𝜧 so as to minimize ℒ.
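A minimal NumPy sketch of evaluating ℒ for a given assignment matrix 𝑹 and centres 𝜧 (the small arrays below are made-up examples):

import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])   # N = 3 points in R^2
R = np.array([[1, 0], [1, 0], [0, 1]])               # one-hot assignments, K = 2
Mu = np.array([[0.1, 0.05], [5.0, 5.0]])             # the K cluster centres

# loss = sum_n sum_k r_nk * ||x_n - mu_k||^2
sq_dists = ((X[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)   # N x K
loss = (R * sq_dists).sum()
print(loss)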
Basic Clustering Methodology
Two approaches:
Agglomerative: pairs of items/clusters are successively linked to produce
larger clusters
Divisive (partitioning): items are initially placed in one cluster and then
divided into separate groups
Need to define an update rule for how to combine two clusters together.
• Single-link clustering: the distance between two clusters is the shortest distance between any two points, one from each cluster.
• Complete-link clustering: the distance between two clusters is the longest distance between two points, one from each cluster.
• Average-link clustering: the distance is the average distance over all pairs of points, one from each cluster.
• Centroid clustering: the distance is the distance between the cluster centroids.
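The four update rules can be written as small functions of two point sets; a NumPy sketch (illustrative only, names are my own):

import numpy as np

def pairwise(A, B):
    """All Euclidean distances between points of A and points of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):      # shortest distance between any two points
    return pairwise(A, B).min()

def complete_link(A, B):    # longest distance between two points
    return pairwise(A, B).max()

def average_link(A, B):     # average over all cross-cluster pairs
    return pairwise(A, B).mean()

def centroid_link(A, B):    # distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))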
Hierarchical/Agglomerative Methods
       T1   T2   T3   T4   T5   T6   T7
  T2    7
  T3   16    8
  T4   15   12   18
  T5   14    3    6    6
  T6   14   18   16   18    6
  T7    9    6    0    6    9    2
  T8    7   17    8    9    3   16    3
• Based on some methods of representing hierarchy of
data points
• One idea: a hierarchical dendrogram (connects points based on similarity)
(Figure: dendrogram over the terms, with leaf order T5, T1, T3, T4, T2, T6, T8, T7.)
Hierarchical Agglomerative
• Compute distance matrix
• Put each data point in its own cluster
• Find most similar pair of clusters
• Merge that pair of clusters (show the merger in the dendrogram)
• Update the similarity matrix
• Repeat until all data points are in one cluster
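A short sketch of this procedure using SciPy's hierarchical clustering routines (the library choice is an assumption, not part of the lecture; the data is made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="single")     # merge tree; also "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
dendrogram(Z)                       # draws the dendrogram (requires matplotlib)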
K-Means
Optimization Algorithm
Goal. Minimize ℒ(𝑥, 𝑦).
Coordinate Descent (Optimization).
Repeat until convergence:
1. Find optimal 𝑥 while holding 𝑦 constant.
2. Find optimal 𝑦 while holding 𝑥 constant.
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖^2
Coordinate descent is an optimization algorithm that
successively minimizes along coordinate directions to find
the minimum of a function.
(Figure: coordinate optimisation, alternately minimizing along each coordinate direction.)
Optimization Algorithm
Coordinate Descent (Optimization)
Repeat until convergence:
• Find best clusters given centres
• Find best centres given clusters
ℒ(𝑹, 𝜧; 𝒮_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖𝒙_n − 𝝁_k‖^2
Optimization Algorithm
1. Initialize centers 𝝁_1, …, 𝝁_K from the data.
2. Repeat until no further change in training loss:
   a. For each n ∈ {1, …, N},
      r_nk = 1 if k = argmin_j ‖𝒙_n − 𝝁_j‖^2, and r_nk = 0 otherwise.
      (We assign the n-th data point to the closest cluster centre.)
   b. For each k ∈ {1, …, K},
      𝝁_k = (Σ_n r_nk 𝒙_n) / (Σ_n r_nk)
      (We recompute the cluster means.)
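A direct NumPy sketch of the algorithm above (not an optimized or library implementation; empty clusters simply keep their old centre):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    Mu = X[rng.choice(len(X), size=K, replace=False)]   # step 1: centres from the data
    for _ in range(n_iters):
        # step 2a: assign each point to the closest centre (this defines r_nk)
        d = ((X[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        assign = d.argmin(axis=1)
        # step 2b: recompute each centre as the mean of its assigned points
        new_Mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else Mu[k]
                           for k in range(K)])
        if np.allclose(new_Mu, Mu):     # no further change: stop
            break
        Mu = new_Mu
    d = ((X[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), Mu, d.min(axis=1).sum()    # assignments, centres, loss

Calling kmeans(X, K=3), for instance, returns the assignments, the centres, and the final training loss.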
Convergence
• Training loss always decreases in each step (coordinate
descent).
• Converges to local minimum, not necessarily global
minimum.
Repeat the algorithm over many initial points, and pick the configuration with the smallest training loss.

The training loss decreases after every update step, so k-means must terminate in finite time:
Suppose k-means could run forever. There are only finitely many configurations, at most K^N of them. Then eventually some assignment matrix 𝑹 chosen now would have to be chosen again in the future. But the training loss strictly decreases whenever the configuration changes, so the same configuration cannot recur, and the algorithm must stop.
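A sketch of the multi-restart recipe using scikit-learn (the library choice is an assumption): n_init runs k-means from several random initializations and keeps the run with the smallest training loss (inertia_).

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])   # toy data

km = KMeans(n_clusters=2, n_init=20, random_state=0).fit(X)
print(km.inertia_)    # training loss of the best of the 20 runs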
Discussion
Initialization
• Empty clusters
o Pick data points to initialize clusters
• Bad local minima
o Initialize many times and pick solution with smallest training
loss
o Pick good starting positions
Choose points in your dataset as the representatives 𝝁_k to begin with.
Initialization
(Figure: starting positions of the centroids and their final positions.)
Problem. How to choose good starting positions?
Solution. Place them far apart with high probability.
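One common way to place starting centres far apart with high probability is k-means++-style seeding; the lecture does not prescribe a specific scheme, so the sketch below is an assumption. Each new centre is drawn with probability proportional to its squared distance to the nearest centre chosen so far.

import numpy as np

def far_apart_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]          # first centre: uniform over the data
    for _ in range(K - 1):
        C = np.array(centres)
        # squared distance of every point to its nearest chosen centre
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        p = d2 / d2.sum()                        # far-away points are more likely
        centres.append(X[rng.choice(len(X), p=p)])
    return np.array(centres)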
Number of Clusters
How do we choose the optimal number of clusters K?
• Elbow method
• Training Loss
• Validation Loss
(Figure: training loss plotted against the number of clusters K; the
"elbow" of the curve marks a reasonable choice of K.)
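A sketch of the elbow method on made-up data (scikit-learn assumed): fit k-means for a range of K values and look at where the training loss stops dropping sharply.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])   # three true clusters

for K in range(1, 9):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 1))   # the "elbow" should appear around K = 3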
Check your understanding
• Clustering gives you an idea of how the data is distributed.
• You have 1000 data vectors that are very similar to each other (e.g.,
Euclidean distance less than 0.0001). You can only divide them into a few
clusters.
• When you use k-means, you usually obtain a global minimum of the loss
function.
• When you use k-means, you can never achieve the global minimum of the
loss function.
• The number of clusters is a parameter that can be optimized by k-means.
• In K-fold cross validation, K is a hyperparameter.
• Agglomerative clustering, as introduced in this lecture, gives a fixed
clustering result once the distance measure, the desired number of
clusters, and the tie-breaking rule are fixed.
• Increasing K for k-means always ensures that the optimal solution for
that K has a smaller loss.