
Data Mining and Machine Learning
K-Means Clustering
Peter Jančovič
Slide 1

Objectives
 To explain the need for K-means clustering
 To understand the K-means clustering algorithm
 To understand the relationships between:
– Clustering and statistical modelling using GMMs
– K-means clustering and E-M estimation for GMMs
Slide 2

Clustering so far
 Agglomerative clustering
– Begin by assuming that every data point is a separate centroid
– Combine closest centroids until the desired number of clusters is reached
– See agglom.c on the course Canvas page
 Divisive clustering
– Begin by assuming that there is just one centroid/cluster
– Split clusters until the desired number of clusters is reached
Slide 3

Optimality
 Neither agglomerative clustering nor divisive clustering is optimal
 In other words, the set of centroids which they give is not guaranteed to minimise distortion
Slide 4

Optimality (continued)
 For example:
– In agglomerative clustering, a dense cluster of data points will be combined into a single centroid – but to minimise distortion, several centroids are needed in a region where there are many data points
– A single ‘outlier’ may get its own cluster
 Agglomerative clustering provides a useful starting point, but further refinement is needed
Slide 5

[Figure: scatter plot of the data with 12 centroids; both axes run from 6 to 14]
Slide 6

K-means Clustering
 Suppose that we have decided how many centroids we need – denote this number by K
 Suppose that we have an initial estimate of suitable positions for our K centroids
 K-means clustering is an iterative procedure for moving these centroids to reduce distortion
Slide 7

Derivation of the K-means clustering algorithm
 Based on direct minimisation of distortion
 Given a set of centroids C^0 = \{c_1, \ldots, c_K\} and a set of data Y = \{y_1, \ldots, y_N\}, differentiating \mathrm{Dist}(C^0) with respect to the d-th component of c_k and setting the result to zero gives:

c_{kd} = \frac{1}{|Y(k)|} \sum_{y \in Y(k)} y_d

where Y(k) is the set of data points for which c_k is the closest centroid
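
Spelling the differentiation out (a sketch assuming the distortion is defined, as usual for k-means, as the total squared Euclidean distance from each data point to its closest centroid, with the assignment sets Y(k) held fixed while differentiating):

\mathrm{Dist}(C) = \sum_{k=1}^{K} \sum_{y \in Y(k)} \lVert y - c_k \rVert^2
                 = \sum_{k=1}^{K} \sum_{y \in Y(k)} \sum_{d} (y_d - c_{kd})^2

\frac{\partial\, \mathrm{Dist}(C)}{\partial c_{kd}}
   = -2 \sum_{y \in Y(k)} (y_d - c_{kd}) = 0
   \;\Longrightarrow\;
   c_{kd} = \frac{1}{|Y(k)|} \sum_{y \in Y(k)} y_d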
Slide 8

Derivation of the K-means clustering algorithm (continued)
 The equation

c_{kd} = \frac{1}{|Y(k)|} \sum_{y \in Y(k)} y_d

is not closed – both the LHS and the RHS depend on c_k (the RHS through the assignment set Y(k))
 Although this equation cannot give a direct solution for c_{kd}, it can be used as the basis of an iterative algorithm
Slide 9

K-means clustering – notation
 Suppose there are T data points, denoted by:
Y = \{ y_1, y_2, \ldots, y_t, \ldots, y_T \}
 Suppose that the initial K centroids are denoted by:
C^0 = \{ c^0_1, c^0_2, \ldots, c^0_k, \ldots, c^0_K \}
 One iteration of K-means clustering will produce a new set of centroids
C^1 = \{ c^1_1, c^1_2, \ldots, c^1_k, \ldots, c^1_K \}
such that \mathrm{Dist}(C^1) \le \mathrm{Dist}(C^0)
Slide 10

K-means clustering (1)
 For each data point y_t, let c_{i(t)} be the closest centroid
 In other words: d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)
 Now, for each centroid c^0_k define:
Y^0_k = \{ y_t : i(t) = k \}
 In other words, Y^0_k is the set of data points which are closer to c^0_k than to any other centroid
Slide 11

K-means clustering (2)
 Now define a new k-th centroid c^1_k by:

c^1_k = \frac{1}{|Y^0_k|} \sum_{y_t \in Y^0_k} y_t

where |Y^0_k| is the number of samples in Y^0_k
 In other words, c^1_k is the average value of the samples which were closer to c^0_k than to any other centroid
Slide 12

K-means clustering (3)
 Now repeat the same process starting with the new centroids:
C^1 = \{ c^1_1, c^1_2, \ldots, c^1_k, \ldots, c^1_K \}
to create a new set of centroids:
C^2 = \{ c^2_1, c^2_2, \ldots, c^2_k, \ldots, c^2_K \}
… and so on until the process converges
 Each new set of centroids has distortion no greater than the previous set's
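
The whole procedure is compact enough to sketch in C, in the spirit of the course's k-means.c but not identical to it. This is a minimal, self-contained sketch for 2-D data; the toy dataset, tolerance and iteration cap are illustrative choices, and distortion is accumulated as a sum of Euclidean distances, which is how the worked example later in these slides reports it:

#include <stdio.h>
#include <math.h>

#define T 6                      /* number of data points */
#define K 2                      /* number of centroids   */

static double dist(const double a[2], const double b[2])
{
    double dx = a[0] - b[0], dy = a[1] - b[1];
    return sqrt(dx * dx + dy * dy);
}

/* One k-means iteration (Slides 11-12): assign every point to its
   closest centroid, then move each centroid to the mean of the points
   assigned to it.  Returns the distortion of the centroids it was
   given, as a sum of Euclidean distances. */
static double kmeans_iteration(double y[T][2], double c[K][2])
{
    double sum[K][2] = {{0.0}};
    int count[K] = {0};
    double distortion = 0.0;

    for (int t = 0; t < T; t++) {
        int best = 0;
        for (int k = 1; k < K; k++)
            if (dist(y[t], c[k]) < dist(y[t], c[best]))
                best = k;
        distortion += dist(y[t], c[best]);
        count[best]++;
        sum[best][0] += y[t][0];
        sum[best][1] += y[t][1];
    }
    for (int k = 0; k < K; k++) {
        if (count[k] == 0)       /* leave an empty cluster where it is */
            continue;
        c[k][0] = sum[k][0] / count[k];
        c[k][1] = sum[k][1] / count[k];
    }
    return distortion;
}

int main(void)
{
    /* Toy data: two well-separated groups (illustrative values). */
    double y[T][2] = { {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.3},
                       {4.0, 4.1}, {4.2, 3.9}, {3.9, 4.2} };
    double c[K][2] = { {0.0, 0.0}, {1.0, 1.0} };   /* initial guesses */
    double prev = HUGE_VAL;

    for (int i = 0; i < 100; i++) {                /* Slide 13: repeat */
        double d = kmeans_iteration(y, c);
        printf("iteration %d: distortion %.4f\n", i, d);
        if (prev - d < 1e-9)                       /* converged */
            break;
        prev = d;
    }
    printf("final centroids: (%g, %g) and (%g, %g)\n",
           c[0][0], c[0][1], c[1][0], c[1][1]);
    return 0;
}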
Slide 13

Initialisation
 An outstanding problem is to choose the initial centroid set C0
 Possibilities include:
– Choose C0 randomly (sketched in code below)
– Choose C0 using agglomerative clustering
– Choose C0 using divisive clustering
 Choice of C0 can be important:
– K-means clustering is a “hill-climbing” algorithm
– It finds a local minimum of the distortion function
– This local minimum is determined by C0
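
For the "choose C0 randomly" option, one common scheme (an assumption here, not necessarily what the course code does) is to draw K distinct data points and use them as the initial centroids. This sketch plugs into the program after Slide 13 and reuses its T and K constants:

#include <stdlib.h>              /* rand; caller may seed with srand */

/* Random initialisation: use K distinct, randomly chosen data points
   as the initial centroids. */
static void init_random(double y[T][2], double c[K][2])
{
    int chosen[K];
    for (int k = 0; k < K; k++) {
        int t, used;
        do {                     /* redraw until the point is unused */
            t = rand() % T;
            used = 0;
            for (int j = 0; j < k; j++)
                if (chosen[j] == t)
                    used = 1;
        } while (used);
        chosen[k] = t;
        c[k][0] = y[t][0];
        c[k][1] = y[t][1];
    }
}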
Slide 14

Local optimality

[Figure: distortion plotted against the cluster set: starting from C0, successive cluster sets C1, …, Cn descend from Dist(C0) into a local minimum; the global minimum lies elsewhere]

N.B: I’ve drawn the cluster set space as 1-dimensional for simplicity. In reality it is a very high-dimensional space
Slide 15

Example
[Figure: scatter plot of the example data; both axes run from 6 to 14]
Slide 16

Example – distortion
[Figure: distortion plotted against the number of iterations (0 to 10); distortion falls from about 12 towards 10.8]
Slide 17

C programs on Canvas
 agglom.c
– Agglomerative clustering
agglom dataFile centFile numCent
– Runs agglomerative clustering on the data in dataFile until the number of centroids is numCent. Writes the centroid (x,y) coordinates to centFile
Slide 18

C programs on Canvas
 k-means.c
– K-means clustering
k-means dataFile centFile opFile
– Runs 10 iterations of k-means clustering on the data in dataFile starting with the centroids in centFile.
– After each iteration writes distortion and new centroids to opFile
Slide 19

Relationship with GMMs
 The set of centroids in clustering corresponds to the set of means in a GMM
 Measuring distances using Euclidean distance in clustering corresponds to assuming that the GMM variances are all equal to 1
 k-means clustering corresponds to the mean estimation part of the E-M algorithm, but:
– In k-means samples are allocated 100% to the closest centroid
– In E-M samples are shared between GMM components according to posterior probabilities
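
To make the correspondence concrete, the E-M sharing rule can be written out (standard GMM notation, not copied from the slides). The responsibility of component k for sample y_t is

\gamma_k(y_t) = \frac{\pi_k\, \mathcal{N}(y_t \mid \mu_k, \Sigma_k)}
                     {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(y_t \mid \mu_j, \Sigma_j)}

and with equal weights and unit variances this depends only on Euclidean distance:

\gamma_k(y_t) = \frac{\exp\!\left(-\tfrac{1}{2}\lVert y_t - \mu_k \rVert^2\right)}
                     {\sum_{j=1}^{K} \exp\!\left(-\tfrac{1}{2}\lVert y_t - \mu_j \rVert^2\right)}

K-means is the hard-assignment version: \gamma_k(y_t) is 1 for the closest mean and 0 for all others, after which both algorithms re-estimate each mean as the (weighted) average of its data.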
Slide 20

K-means clustering – example
Data (15 points, one (x, y) pair per entry):
(1.2, 1.7)  (1.0, 1.1)  (1.5, 2.5)  (2.0, 2.1)  (1.3, 3.1)
(1.8, 1.9)  (0.9, 1.5)  (0.2, 1.2)  (2.0, 1.1)  (2.5, 3.7)
(2.4, 4.2)  (3.1, 3.9)  (2.8, 4.5)  (1.6, 2.1)  (0.7, 1.7)

Centroids (0):
c(1) = (0.75, 2.5)   c(2) = (3.0, 1.5)   c(3) = (1.75, 4.0)

[Figure: scatter plot of the data and the three initial centroids; x axis 0 to 3.5, y axis 0 to 5]
Slide 21

First iteration of k-means

Centroids (0): c(1) = (0.75, 2.5), c(2) = (3.0, 1.5), c(3) = (1.75, 4.0)

 n    x    y   d(x(n),c(1))  d(x(n),c(2))  d(x(n),c(3))  closest
 1   1.2  1.7      0.92          1.81          2.36        c(1)
 2   1.0  1.1      1.42          2.04          3.00        c(1)
 3   1.5  2.5      0.75          1.80          1.52        c(1)
 4   2.0  2.1      1.31          1.17          1.92        c(2)
 5   1.3  3.1      0.81          2.33          1.01        c(1)
 6   1.8  1.9      1.21          1.26          2.10        c(1)
 7   0.9  1.5      1.01          2.10          2.64        c(1)
 8   0.2  1.2      1.41          2.82          3.20        c(1)
 9   2.0  1.1      1.88          1.08          2.91        c(2)
10   2.5  3.7      2.12          2.26          0.81        c(3)
11   2.4  4.2      2.37          2.77          0.68        c(3)
12   3.1  3.9      2.74          2.40          1.35        c(3)
13   2.8  4.5      2.86          3.01          1.16        c(3)
14   1.6  2.1      0.94          1.52          1.91        c(1)
15   0.7  1.7      0.80          2.31          2.53        c(1)

Totals: 9 points closest to c(1), 2 to c(2), 4 to c(3)
Distortion(0) = 15.52
Slide 22

First iteration of k-means (continued)

Distances and closest-centroid assignments as on Slide 22; summing the points assigned to each centroid gives the new centroids:

c(1): 9 points, x-sum = 10.2, y-sum = 16.8, so c(1) = (10.2/9, 16.8/9) = (1.1333, 1.8667)
c(2): 2 points, x-sum = 4.0,  y-sum = 3.2,  so c(2) = (4.0/2, 3.2/2) = (2.0, 1.6)
c(3): 4 points, x-sum = 10.8, y-sum = 16.3, so c(3) = (10.8/4, 16.3/4) = (2.7, 4.075)

Centroids (0): (0.75, 2.5), (3.0, 1.5), (1.75, 4.0)
Distortion(0) = 15.52
Slide 23

First iteration of k-means – result

Data: the 15 points from Slide 21

Centroids (0): (0.75, 2.5), (3.0, 1.5), (1.75, 4.0)
Centroids (1): (1.133333, 1.866667), (2.0, 1.6), (2.7, 4.075)

[Figure: scatter plot of the data with Centroids(0) and Centroids(1); x axis 0 to 3.5, y axis 0 to 5]
Slide 24

Second iteration of k-means

Centroids (1): c(1) = (1.133333, 1.866667), c(2) = (2.0, 1.6), c(3) = (2.7, 4.075)

 n    x    y   d(x(n),c(1))  d(x(n),c(2))  d(x(n),c(3))  closest
 1   1.2  1.7      0.18          0.81          2.81        c(1)
 2   1.0  1.1      0.78          1.12          3.43        c(1)
 3   1.5  2.5      0.73          1.03          1.98        c(1)
 4   2.0  2.1      0.90          0.50          2.10        c(2)
 5   1.3  3.1      1.24          1.66          1.71        c(1)
 6   1.8  1.9      0.67          0.36          2.35        c(2)
 7   0.9  1.5      0.43          1.10          3.14        c(1)
 8   0.2  1.2      1.15          1.84          3.81        c(1)
 9   2.0  1.1      1.16          0.50          3.06        c(2)
10   2.5  3.7      2.29          2.16          0.43        c(3)
11   2.4  4.2      2.65          2.63          0.33        c(3)
12   3.1  3.9      2.83          2.55          0.44        c(3)
13   2.8  4.5      3.12          3.01          0.44        c(3)
14   1.6  2.1      0.52          0.64          2.26        c(1)
15   0.7  1.7      0.46          1.30          3.10        c(1)

Totals: 8 points closest to c(1), 3 to c(2), 4 to c(3)
Slide 25

Second iteration of k-means – result

Centroids (1): (1.133333, 1.866667), (2.0, 1.6), (2.7, 4.075)
Centroids (2): (1.05, 1.8625), (1.933333, 1.7), (2.7, 4.075)

[Figure: scatter plot of the data with Centroids(0), Centroids(1) and Centroids(2); plus a bar chart of the distortion after iterations 1 to 3]
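
The worked example is small enough to check by machine. The following self-contained sketch (same assumptions as the program after Slide 13: Euclidean distance, distortion summed over points) runs two iterations from Centroids(0); it should reproduce Distortion(0) = 15.52 and the centroid sets above:

#include <stdio.h>
#include <math.h>

#define T 15                     /* data points from Slide 21 */
#define K 3                      /* centroids */

int main(void)
{
    double y[T][2] = {
        {1.2, 1.7}, {1.0, 1.1}, {1.5, 2.5}, {2.0, 2.1}, {1.3, 3.1},
        {1.8, 1.9}, {0.9, 1.5}, {0.2, 1.2}, {2.0, 1.1}, {2.5, 3.7},
        {2.4, 4.2}, {3.1, 3.9}, {2.8, 4.5}, {1.6, 2.1}, {0.7, 1.7}
    };
    double c[K][2] = { {0.75, 2.5}, {3.0, 1.5}, {1.75, 4.0} };  /* C0 */

    for (int iter = 0; iter < 2; iter++) {
        double sum[K][2] = {{0.0}};
        int count[K] = {0};
        double distortion = 0.0;

        for (int t = 0; t < T; t++) {            /* assignment step */
            int best = 0;
            double dmin = HUGE_VAL;
            for (int k = 0; k < K; k++) {
                double dx = y[t][0] - c[k][0], dy = y[t][1] - c[k][1];
                double d = sqrt(dx * dx + dy * dy);
                if (d < dmin) { dmin = d; best = k; }
            }
            distortion += dmin;
            count[best]++;
            sum[best][0] += y[t][0];
            sum[best][1] += y[t][1];
        }
        printf("Distortion(%d) = %.2f\n", iter, distortion);

        for (int k = 0; k < K; k++) {            /* update step */
            /* every centroid receives points in this example */
            c[k][0] = sum[k][0] / count[k];
            c[k][1] = sum[k][1] / count[k];
            printf("  c(%d) = (%.6f, %.6f)\n", k + 1, c[k][0], c[k][1]);
        }
    }
    return 0;
}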
Slide 26

Examples
 Three example 2-dimensional datasets
 For each data set, and for k=1,…,10:
– Create k centroids using agglomerative clustering
– Run the k-means algorithm for 15 iterations from these initial centroid values
– Plot distortion after 15 iterations of k-means as a function of number of centroids
Slide 27

Example 1
 Gaussian distributed data: single 2D Gaussian, centre (0,0), variance 16 in x and y directions
Slide 28

Example 2
 Five 2D Gaussians, centres (0,0), (1,2), (4,4), (-2,-5) and (-3,1), variance 0.5 in x and y directions
Slide 29

Example 3
 Two 2D Gaussians, centres (2,2), (-2,-2), variance 4 in x and y directions
Slide 30

Summary
 The need for k-means clustering
 The k-means clustering algorithm
 Example of k-means clustering
 Choosing k empirically
Slide 31