Data Mining and Machine Learning
Clustering I
Peter Jančovič
Slide 1
Data Mining
The objective of data mining is to find structure and patterns in large, abstract data sets:
– Is the data homogeneous or does it consist of several separately identifiable subsets?
– Are there patterns in the data?
– If so, do these patterns have an intuitive interpretation?
– Are there correlations in the data?
– Is there redundancy in the data?
Slide 2
Partitioning data into “clusters”
In this lecture we will start to develop tools to understand the structure of data that can be partitioned into (more or less) distinct subsets
Can think of these subsets as arising from distinct “sources”
We will consider three different techniques:
– Clustering
– Multi-modal statistical modelling (Gaussian Mixture Models – GMMs)
– Decision trees
Slide 3
Clustering – Objectives
To explain the motivation for clustering
To introduce the ideas of distance and distortion
To describe agglomerative and divisive clustering
To explain the relationships between clustering and decision trees
Slide 4
What does the data look like?
[Scatter plot of the example two-dimensional data set]
Slide 5
Structure of data
Typical real data is not uniformly distributed
It has structure
Variables might be correlated
The data might be grouped into natural ‘clusters’ – it may have been generated by several different “sources”
The purpose of cluster analysis is to find this underlying structure automatically
Slide 6
Clusters and centroids
Assume clusters are spherical – determined by centres
Cluster centres are called centroids
Questions: How many centroids do we need? Where should we put them?
Slide 7
Distance
A function d(x,y) defined on pairs of points x and y is called a distance or metric if it satisfies:
– d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y
– d(x,y) = d(y,x) for all points x and y (symmetry)
– d(x,z) ≤ d(x,y) + d(y,z) for all points x, y and z (triangle inequality)
Slide 8
Example metrics
The most common metric is the Euclidean metric
If $x = [x_1, x_2, \dots, x_N]$ and $y = [y_1, y_2, \dots, y_N]$ then:
$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_N - y_N)^2}$
This is normal distance in Euclidean space
There are lots of others, but we will focus on this one
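To make this concrete, here is a minimal Python sketch (ours, not from the slides) of the Euclidean metric, with a numerical spot-check of the three axioms from the previous slide; the helper name `euclidean` is an assumption.

```python
import math
import random

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Numerically spot-check the three metric axioms on random points.
random.seed(0)
x, y, z = ([random.uniform(-5, 5) for _ in range(3)] for _ in range(3))

assert euclidean(x, y) >= 0 and euclidean(x, x) == 0         # non-negativity
assert math.isclose(euclidean(x, y), euclidean(y, x))        # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
print(euclidean(x, y))
```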
Slide 9
The Lp Metrics
Euclidean distance is sometimes called the L2-metric
$d_2(x, y) = \left( \sum_{n=1}^{N} (x_n - y_n)^2 \right)^{1/2}$
It is one of a family of metrics called the Lp-metrics
$d_p(x, y) = \left( \sum_{n=1}^{N} |x_n - y_n|^p \right)^{1/p}$
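A direct transcription of the $d_p$ formula into Python, as a minimal sketch; the name `lp_distance` is ours, not from the lecture.

```python
def lp_distance(x, y, p):
    """General L_p metric: (sum_n |x_n - y_n|^p)^(1/p)."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

print(lp_distance([0, 0], [3, 4], 2))  # 5.0 - the Euclidean (L2) case
print(lp_distance([0, 0], [3, 4], 1))  # 7.0 - see the special cases below
```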
Slide 10
Special Lp metrics
p = 1 – the ‘City Block’ metric:
$d_1(x, y) = \sum_{n=1}^{N} |x_n - y_n|$
p = ∞:
$d_\infty(x, y) = \max_{n = 1, \dots, N} |x_n - y_n|$
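Both special cases written out directly, as a sketch (helper names are ours). Note that for p = ∞ the power form above degenerates, so the maximum is taken explicitly:

```python
def city_block(x, y):
    """L1 metric: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def max_metric(x, y):
    """L-infinity metric: largest absolute coordinate difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = [0, 0], [3, 4]
print(city_block(x, y), max_metric(x, y))  # 7 and 4; the L2 distance is 5
# For any pair of points the values order as d_inf <= d_2 <= d_1.
```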
Slide 11
Unit sphere
For a metric d defined on N-dimensional space, the unit sphere is the set of vectors x such that d(x, 0) = 1
What do the unit spheres in 2D look like for these metrics?
Slide 12
Example Unit Spheres (2D)
[Figure: the three unit spheres plotted on axes from -1 to 1]
– L2 unit sphere: x² + y² = 1 (a circle)
– L1 unit sphere: |x| + |y| = 1 (a diamond)
– L∞ unit sphere: max{|x|, |y|} = 1 (a square)
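A small numerical check (ours, not from the slides) that a sample point lies on each of these 2-D unit spheres:

```python
import math

x2 = (math.sqrt(0.5), math.sqrt(0.5))  # on the L2 circle:    x^2 + y^2 = 1
x1 = (0.5, 0.5)                        # on the L1 diamond:   |x| + |y| = 1
xi = (1.0, 0.25)                       # on the L-inf square: max(|x|, |y|) = 1

assert math.isclose(x2[0] ** 2 + x2[1] ** 2, 1.0)
assert math.isclose(abs(x1[0]) + abs(x1[1]), 1.0)
assert math.isclose(max(abs(xi[0]), abs(xi[1])), 1.0)
print("each point lies on its metric's unit sphere")
```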
Slide 13
Distortion
Distortion is a measure of how well a set of centroids models a set of data
Suppose we have:
– data points y_1, y_2, …, y_T
– centroids c_1, …, c_M
For each data point y_t, let c_{i(t)} be its closest centroid
In other words: $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$
Slide 14
Distortion
The distortion for the centroid set C = {c_1, …, c_M} is defined by:
$\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$
In other words, the distortion is the sum of distances between each data point and its nearest centroid
The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
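A minimal Python sketch of this definition; the data points and centroid coordinates below are made-up examples.

```python
import math

def distortion(data, centroids):
    """Dist(C): sum over data points of the Euclidean distance
    to each point's closest centroid."""
    return sum(min(math.dist(y, c) for c in centroids) for y in data)

data = [(6.0, 7.0), (6.5, 7.2), (12.0, 13.0), (12.5, 12.8)]
print(distortion(data, [(6.25, 7.1), (12.25, 12.9)]))  # small: good fit
print(distortion(data, [(9.0, 10.0)]))                 # larger: one centroid for all
```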
Slide 15
Types of Clustering
We will start with two types of cluster analysis:
– Agglomerative clustering, or ‘bottom-up’ hierarchical clustering
– Divisive clustering, or ‘top-down’ clustering
In the next lecture we will focus on a more sophisticated clustering method called k-means clustering
Slide 16
Agglomerative clustering
Agglomerative clustering begins by assuming that each data point belongs to its own, unique, one-point cluster – each point is a centroid
Clusters are then combined until the required number of centroids is obtained
The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
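A minimal sketch of that algorithm in Python. One detail the slides leave open is where the combined centroid should be placed; this sketch assumes the size-weighted mean of the two merged centroids.

```python
import math

def agglomerate(points, n_clusters):
    """Merge the two closest centroids until n_clusters remain."""
    # Each cluster is (centroid, number of original points it represents).
    clusters = [(list(p), 1) for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of centroids with the smallest Euclidean distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Combined centroid: size-weighted mean (an assumption of this sketch).
        merged = ([(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)],
                  ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [c for c, _ in clusters]

pts = [(6, 7), (6.5, 7.2), (12, 13), (12.5, 12.8), (9, 6)]
print(agglomerate(pts, 2))
```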
Slide 17
[Scatter plot: original data (302 points)]
Slide 18
[Scatter plot: 252 centroids]
Slide 19
[Scatter plot: 152 centroids]
Slide 20
[Scatter plot: 52 centroids]
Slide 21
[Scatter plot: 12 centroids]
Slide 22
Optimality of agglomerative clustering
The result of agglomerative clustering is not optimal
Generally it does not result in a set of centroids C such that the distortion Dist(C) is minimised
For example,
– Outliers may be given their own centroids
– Dense clusters may be given too few centroids
Slide 23
Divisive Clustering
Divisive clustering begins by assuming that there is just one centroid – typically in the centre of the set of data points
That point is replaced with 2 new centroids
Then each of these is replaced with 2 new centroids …
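A minimal Python sketch of this top-down scheme. The slides do not specify how the two replacement centroids are positioned; this sketch assumes each centroid is split into two copies perturbed by ±eps, each then re-estimated once as the mean of the points closest to it.

```python
import math

def mean(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def divisive(points, levels, eps=0.1):
    centroids = [mean(points)]          # one centroid at the data centre
    for _ in range(levels):             # each level doubles the centroids
        # Replace every centroid with two perturbed copies.
        split = []
        for c in centroids:
            split += [[x + eps for x in c], [x - eps for x in c]]
        # Re-estimate each new centroid as the mean of its nearest points.
        assigned = [[] for _ in split]
        for p in points:
            k = min(range(len(split)), key=lambda i: math.dist(p, split[i]))
            assigned[k].append(p)
        centroids = [mean(ps) if ps else split[k]
                     for k, ps in enumerate(assigned)]
    return centroids

pts = [(6, 7), (6.5, 7.2), (12, 13), (12.5, 12.8)]
print(divisive(pts, levels=1))  # two centroids, one per natural cluster
```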
Slide 24
[Scatter plot: original data (302 points)]
Slide 25
[Scatter plot: original data (302 points)]
Slide 26
Optimality of divisive clustering
The result of divisive clustering is not optimal
Generally it does not result in a set of centroids C such that the distortion Dist(C) is minimised
Sequential decision making is normally suboptimal:
– Decisions are not reversible
– If a point goes to a particular half of a partition it will never be re-allocated to the other half
– Probably not how a human would do it
Slide 27
Decision tree interpretation
[Diagram: a binary tree of centroids. The root is a single centroid for the whole data set; the leaves hold one centroid per data point. Divisive (‘top-down’) clustering works from the root towards the leaves; agglomerative (‘bottom-up’) clustering works from the leaves towards the root.]
Slide 28
Optimality
An ‘optimal’ set of centroids is one which minimises the distortion
In general, neither method gives optimal sets of centroids
A more principled approach is to treat the distortion as a function of the centroid set and minimise it directly
Slide 29
Notation and method
– N-dimensional space
– T data points X = {x_1, …, x_T}
– K centroids C = {c_1, …, c_K}
Calculate $\partial \mathrm{Dist}(C) / \partial c_{kn}$ for each k and n, set the derivatives to zero, and solve
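A sketch of how that derivation goes, under the simplifying assumption (standard in k-means) that the distortion uses the squared Euclidean distance; the slides' sum-of-distances distortion is harder, and its minimiser is the geometric median rather than the mean.

```latex
\[
  \mathrm{Dist}(C) = \sum_{t=1}^{T} \lVert x_t - c_{i(t)} \rVert^2
                   = \sum_{k=1}^{K} \sum_{t:\, i(t)=k} \sum_{n=1}^{N} (x_{tn} - c_{kn})^2
\]
\[
  \frac{\partial\, \mathrm{Dist}(C)}{\partial c_{kn}}
    = -2 \sum_{t:\, i(t)=k} (x_{tn} - c_{kn}) = 0
  \quad\Longrightarrow\quad
  c_{kn} = \frac{1}{\lvert\{t : i(t)=k\}\rvert} \sum_{t:\, i(t)=k} x_{tn}
\]
```

This holds for a fixed assignment i(t) of points to centroids; alternating re-assignment with this re-estimation is the k-means procedure previewed for the next lecture.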
Slide 30
Summary
Distance metrics and distortion
Agglomerative clustering
Divisive clustering
Decision tree interpretation
Slide 31