STAT318 — Data Mining
Dr
University of Canterbury, Christchurch, New Zealand
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
A Famous Clustering Example
John Snow, a physician, plotted the locations of cholera deaths on a map during an outbreak in London in the 1850s.
The cases were clustered around certain locations, and further investigation showed that the root cause was polluted water sources.
Clustering is still an important statistical tool with real-world applications:
Planning efficient fibre network nodes,
Locating services, e.g. hospitals and supermarkets,
Election zoning, and so on.
Cluster Analysis
Cluster analysis is a broad class of unsupervised learning techniques for discovering unknown subgroups or clusters in data.
There is no associated response variable.
Cluster analysis is more subjective than supervised learning because there is no simple goal for the analysis.
We must define what it means to be similar or different. This is often a domain-specific consideration.
We need a similarity (or dissimilarity) measure.
Similarity Measures (qualitative features)
Similarity measures satisfy (dissimilarity is 1 − similarity):
1. $0 \le \text{sim}(x, z) \le 1$
2. $\text{sim}(x, z) = 1$ iff $x = z$
3. $\text{sim}(x, z) = \text{sim}(z, x)$ (symmetry)
Nominal features:
$$\text{sim}(x, z) = \frac{\text{number of feature matches}}{\text{total number of features}}$$
Ordinal features: replace the M feature values with $(i - 0.5)/M$ for $i = 1, 2, \ldots, M$,
in the prescribed order of their original values. Then treat them as quantitative features.
For example, let
x = (Blue, Male, 82071),
z = (Blue, Female, 82071).
Then sim(x, z) = 2/3.
If we have an ordinal feature with levels low, medium and high, for example, we could code the levels using
low = 1/6, medium = 3/6, high = 5/6,
and treat the feature as a quantitative feature.
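A minimal R sketch of these two codings (the example values come from the slides above; the code itself is illustrative):

# Nominal features: proportion of matching features
x <- c("Blue", "Male", "82071")
z <- c("Blue", "Female", "82071")
mean(x == z)                                                   # 2/3

# Ordinal features: replace the M ordered levels with (i - 0.5)/M
lev <- c("low", "medium", "high")
(match(c("low", "medium", "high"), lev) - 0.5) / length(lev)   # 1/6, 3/6, 5/6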
Gower similarity (for example) can be used to measure the similarity between observations with both qualitative and quantitative features (not covered here).
Similarity Measures (binary features)
Symmetric: the simple matching coefficient (SMC)
$$\text{SMC}(x, z) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}},$$
where $f_{ab}$ is the number of features with $x_i = a$ and $z_i = b$.
Asymmetric: the Jaccard coefficient (JC)
$$\text{JC}(x, z) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
Example: Calculate SMC(x, z) and JC(x, z) for
x = (1,0,0,0,0,0,0,1,0,0)
z = (0,0,1,0,0,0,0,0,0,1)
If 0 and 1 are equally important (e.g. male = 0 and female = 1), then the variables are called symmetric binary variables. If only 1 is important (e.g. item purchased = 1, 0 otherwise), then the variable is called asymmetric.
Example:
$f_{00} = 6$, $f_{01} = 2$, $f_{10} = 2$, $f_{11} = 0$, so SMC(x, z) = 6/10 and
JC(x, z) = 0/4 = 0.
If x and z are asymmetric, then JC gives a meaningful result (they have nothing in common). Otherwise, SMC shows the vectors are similar with six matching zeros.
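A short R sketch that reproduces these values (the helpers smc and jaccard are illustrative, not base R functions):

smc <- function(x, z) mean(x == z)       # matching features / total features
jaccard <- function(x, z) {
  f11 <- sum(x == 1 & z == 1)
  f10 <- sum(x == 1 & z == 0)
  f01 <- sum(x == 0 & z == 1)
  f11 / (f01 + f10 + f11)
}
x <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
z <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
smc(x, z)       # 0.6
jaccard(x, z)   # 0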
Similarity Measures (document term vectors)
A document term vector is a vector containing frequencies of particular words from a document.
Cosine similarity: the cosine of the angle between x and z,
$$\cos(x, z) = \frac{x \cdot z}{\|x\| \, \|z\|}.$$
[Figure: the angle between the vectors x and z.]
Recall:
$$\cos(x, z) = \frac{x \cdot z}{\|x\| \, \|z\|} = \frac{x^T z}{\|x\| \, \|z\|} = \frac{\sum_{i=1}^{p} x_i z_i}{\|x\| \, \|z\|}.$$
The closer cos(x, z) is to one, the more similar the vectors. This similarity measure ignores 0-0 matches (like Jaccard), but it can handle non-binary vectors. For example, let
x = (2,5,0,0,2) and z = (4,1,0,0,3).
Then cos(x, z) ≈ 0.65. If we convert the data to binary data (present/absent), we get
x = (1,1,0,0,1) and z = (1,1,0,0,1),
which has a Jaccard similarity of one (the differences between the vectors are lost).
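A quick R check of this example (the helper cosine_sim is illustrative, not a base R function):

cosine_sim <- function(x, z) sum(x * z) / (sqrt(sum(x^2)) * sqrt(sum(z^2)))
x <- c(2, 5, 0, 0, 2)
z <- c(4, 1, 0, 0, 3)
cosine_sim(x, z)                                   # about 0.65
cosine_sim(as.numeric(x > 0), as.numeric(z > 0))   # 1 for the binary version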
Distance Measures (quantitative features)
Manhattan Distance (1-norm):
$$\|x - z\|_1 = \sum_{j=1}^{p} |x_j - z_j|$$
Euclidean Distance (2-norm):
$$\|x - z\|_2 = \sqrt{\sum_{j=1}^{p} |x_j - z_j|^2}$$
Supremum Distance (∞-norm):
$$\|x - z\|_\infty = \max\{|x_j - z_j| : j = 1, 2, \ldots, p\}$$
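In R, all three distances are available through dist(); a brief sketch with two made-up points:

x <- c(2, 6); z <- c(10, 2)
dist(rbind(x, z), method = "manhattan")   # 1-norm: 12
dist(rbind(x, z), method = "euclidean")   # 2-norm: about 8.94
dist(rbind(x, z), method = "maximum")     # sup-norm: 8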
II x – z II = 8.94 II x – z II = 12 II x – z II = 8 2.1..
10 z 10 z 10 z …
2x2x2x IIIIII
262626 111
…
-1 (0,0) 1 -1 (0,0) 1 -1 (0,0) 1
-1 -1 -1
II x II2 < 1 II x II1 < 1 II x II < 1
8
II
8
II
II
What is a cluster?
Although the definition of a cluster is imprecise, algorithms exist to extract clusters of certain types.
(1) Well-separated (Natural): a well-separated cluster is a set of objects that are more similar to the other objects in the cluster than to objects in any other cluster.
What is a cluster?
(2) Centre-based: a centre-based cluster is a group of objects that are more similar to their centroid than to any other cluster centroid.
What is a cluster?
(3) Contiguity-based: a contiguity-based cluster is a group of connected objects, each of which is closer to some object in its own cluster than to any object in another cluster.
What is a cluster?
(4) Density-based: a density-based cluster is a region of dense objects surrounded by a less dense region.
Different clustering algorithms are used to find clusters of different types. We will consider two popular approaches in this course.
K-means clustering
K-means is a partitional clustering algorithm:
1. Each observation is assigned to a cluster;
2. Each observation is assigned to exactly one cluster (the intersection of any two clusters is empty).
The number of ways to partition n observations into K clusters is enormous:
$$\frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^n.$$
The number of ways to partition n observations into K clusters is called a Stirling number. If n = 100 (small!) and K = 5, there are approximately $10^{67}$ possible clusterings. Hence we need to use algorithms to find a good partitional clustering, rather than considering all possible clusterings and choosing the best one.
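A quick numerical check of this count in R (a direct evaluation of the formula above; it overflows for much larger n):

n <- 100; K <- 5; k <- 1:K
sum((-1)^(K - k) * choose(K, k) * k^n) / factorial(K)   # approx 6.6e67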
K-means clustering
[Figure: K-means clusterings of the same data set with K = 2, K = 3, and K = 4.]
The colouring in this figure is arbitrary and was not used to form the clustering.
• Partitional clustering algorithms (like K-means) will find K clusters in the data, even if there are no natural clusters (i.e. random noise).
K-means Clustering Algorithm (quantitative features)
1. Initialize: choose the number of clusters, K, and randomly assign each observation to one of the K clusters. Let C denote the cluster assignment such that C(i) = k if $x_i$ is in the kth cluster.
2. Iterate until the cluster assignments remain fixed:
(a) For each of the K clusters, compute the cluster mean $\bar{x}_k = (\bar{x}_{1k}, \bar{x}_{2k}, \ldots, \bar{x}_{pk})$.
(b) Assign each observation to the cluster whose mean is closest (in terms of squared Euclidean distance):
$$C(i) = \arg\min_{k} \|x_i - \bar{x}_k\|^2.$$
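A bare-bones R sketch of this algorithm (illustrative only; empty clusters are not handled, and in practice the built-in kmeans() should be used):

simple_kmeans <- function(X, K, max_iter = 100) {
  # X: an n x p numeric matrix (p >= 2 assumed for this sketch)
  C <- sample(1:K, nrow(X), replace = TRUE)          # Step 1: random assignment
  for (iter in 1:max_iter) {
    # Step 2(a): compute the p x K matrix of cluster means
    centers <- sapply(1:K, function(k) colMeans(X[C == k, , drop = FALSE]))
    # Step 2(b): reassign each point to its closest mean (squared Euclidean distance)
    d2 <- sapply(1:K, function(k) colSums((t(X) - centers[, k])^2))
    C_new <- max.col(-d2)
    if (all(C_new == C)) break                       # assignments fixed: converged
    C <- C_new
  }
  list(cluster = C, centers = t(centers))
}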
For example, the mean of (1, 2), (2, 5), and (3, 5) is
$$\bar{x} = \left(\frac{1 + 2 + 3}{3}, \frac{2 + 5 + 5}{3}\right) = (2, 4).$$
Properties of the Algorithm
The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is minimized.
The within-cluster variation in the kth cluster is defined as
$$W(C_k) = \frac{1}{n_k} \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = 2 \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,$$
where $n_k$ is the number of observations in the kth cluster.
Hence, we want to solve
$$\min_{C} \left\{ \sum_{k=1}^{K} 2 \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2 \right\}. \qquad (1)$$
K-means searches for a good centre-based clustering. It is not possible to consider all possible clusterings (far too many!) and choose the one that minimizes (1).
• K-means is a search algorithm that converges to a local minimizer of (1).
• Unfortunately, (1) is not convex so the local minimizer we find may not be a global minimizer of (1).
Properties of the Algorithm
Each step of the K-means algorithm either decreases the value of (1) or leaves it unchanged (the algorithm has converged).
In step 2(a), the cluster assignment is fixed and
$$\sum_{C(i)=k} \|x_i - m\|^2$$
is minimized at $m = \bar{x}_k$.
In step 2(b), the means are fixed. Reallocating the observations cannot increase the value of (1) because $\|x_i - \bar{x}_k\|^2$ is minimized by choosing the closest mean.
Properties of the Algorithm
The K-means algorithm is guaranteed to converge to a local minimum of (1), but we cannot guarantee convergence to the global minimum.
K-means is a stochastic algorithm, so we can run it many times and select the best clustering (the one with the smallest value of (1)). Although this clustering may not be a global minimizer of (1), it will be the best local minimum we are likely to find.
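In R, kmeans() does this automatically via its nstart argument; a brief sketch with simulated data:

set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)
fit <- kmeans(X, centers = 3, nstart = 25)   # 25 random starts, keep the best
fit$tot.withinss   # total within-cluster sum of squares (equal to (1) up to the factor of 2)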
K-means clustering
[Figure panels: Data; Step 1; Iteration 1, Step 2a.]
Step 1: random cluster assignment, with approximately the same number of points in each cluster.
Step 2a: compute the cluster means.
K-means clustering
[Figure panels: Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results.]
Step 2b: assign each point to the cluster whose mean is closest. Step 2a: compute the cluster means.
K-means clustering, different starting values
[Figure: six clusterings from different random initial assignments, with objective values 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9.]
• Different initial assignments have given three different clusterings.
• Each clustering corresponds to a local minimizer of (1), with the least value shown in red.
K-means clustering
Strengths:
Computationally efficient.
Can be applied to any data type with a notion of centre.
Weaknesses:
Can produce empty clusters.
Can struggle to find natural clusters that are non-globular in shape, have varying sizes or different densities (see R example).
Need to choose K (not unique to K-means).
• K-means can also be sensitive to outliers.
• K-medoids is another clustering algorithm that is similar to K-means (not covered here, but useful to know it exists). The only difference is that medoids (a generalization of the median to d dimensions) are used instead of means. This algorithm is less sensitive to outliers and each cluster's centre is an observation (unlike K-means).
• After a high dimensional clustering has been formed, a low dimensional visualisation can be useful for subjectively evaluating the clustering (and possibly choosing K). A popular choice is t-distributed stochastic neighbor embedding (this is not a clustering algorithm, it's a dimensionality reduction technique), but this technique is not covered in this course.
K-means Clustering Application: Essential Services to NZ Land Titles
Fibre internet central nodes: where should they be placed to efficiently service NZ land parcels? We want to minimise the local fibre cable distance from the central node to each land title.
Land titles in New Zealand: there are 2,062,451 land titles (22 June 2019). Data Source: https://data.linz.govt.nz/
Using the Google Maps API and the K-means algorithm in R to separate them into four regions.
Essential Services to NZ Land Titles: Visualisation
Land titles plotted in blue.
High concentration among the four largest cities.
Using latitude and longitude coordinates.
Essential Services to NZ Land Titles: R program code
# Read in the data, load the ggmap package, and fetch a Google map of NZ.
data <- read.csv("aims-address-position.csv")
library("ggmap")
register_google(key = " *** YOUR API KEY *** ")
NZMap <- get_map("New Zealand", zoom = 5)   # location string assumed (omitted on the slide)

# Perform K-means clustering on the coordinate columns, then plot.
clusters <- kmeans(data[, 8:9], 4)
data$Region <- as.factor(clusters$cluster)
cent <- data.frame(clusters$centers)
ggmap(NZMap) +
  geom_point(aes(x = shape_X, y = shape_Y, colour = Region), data = data) +
  geom_point(aes(x = shape_X, y = shape_Y), data = cent, colour = c(5, 4, 2, 1), pch = 8, size = 10) +
  ggtitle("NZ Land Title Location Using K-Means") +
  xlim(165, 180) + ylim(-47, -34)
Essential Services to NZ Land Titles: Results
K-means clustering has clustered the region around four major cities (Auckland, Wellington, Dunedin, and Christchurch).
Is it a good clustering?
Good clustering is very subjective, as the cost of cabling depends on many factors: whether the cable crosses land or water, mountains or plains, etc.
Need to talk to the experts in the field.
Silhouette Coefficient
Let x be an observation in the kth cluster.
Cohesion: the average distance to all other points in the kth cluster containing x:
$$a(x) = \frac{\sum_{C(i)=k} \|x_i - x\|}{n_k - 1}.$$
Separation: the minimum average distance between x and points within a cluster not containing x:
$$b(x) = \min_{j=1,\ldots,K,\; j \ne k} \left\{ \frac{\sum_{C(i)=j} \|x_i - x\|}{n_j} \right\}.$$
The Silhouette coefficient of x is
$$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}.$$
The Silhouette coefficient of x satisfies $-1 \le s(x) \le 1$. Values close to one indicate that x is more similar to points in its own cluster than to points in other clusters.
[Figure: cohesion (distances from x to points in its own cluster) and separation (distances from x to points in another cluster).]
Silhouette Coefficient
We want the Silhouette coefficient to be positive (good separation), with a(x) as close to zero as possible (good cohesion).
To assess the validity of a clustering, we compute the average Silhouette coefficient over the observations.
To choose a suitable k, we can plot the average Silhouette coefficient versus k and choose the k value that corresponds to a maximum (peak) in the plot, giving the most cohesive, well-separated clustering for the data (see R example).
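A sketch of this procedure in R using silhouette() from the cluster package (the simulated data X is illustrative):

library(cluster)
set.seed(1)
X <- rbind(matrix(rnorm(50 * 2), ncol = 2),
           matrix(rnorm(50 * 2, mean = 4), ncol = 2))
avg_sil <- sapply(2:10, function(k) {
  fit <- kmeans(X, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, dist(X))[, "sil_width"])
})
plot(2:10, avg_sil, type = "b", xlab = "k", ylab = "average silhouette")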
[Figure: average silhouette coefficient (silval) versus k (left) and the corresponding K = 2 clustering (right).]
In this example, the maximum average silhouette coefficient is achieved when K = 2 (the most cohesive, well-separated clustering for the data). The right panel shows the K = 2 clustering for the data. This method of choosing K is not fail-safe and can give undesirable clusterings.
Hierarchical Clustering
We will consider agglomerative (bottom-up) hierarchical clustering. Divisive (top down) methods are not as popular because there are too many possible splits to consider at each iteration.
Basic Algorithm
1. Treat each observation as its own cluster.
2. Repeat until all observations are in a single cluster:
(a) Identify the two closest clusters and merge them together.
Hierarchical Clustering
Five points in $\mathbb{R}^2$ to be hierarchically clustered using Euclidean distance.
Hierarchical Clustering
The two closest points are merged together to give four clusters:
{A,C},{B},{D},{E}
Hierarchical Clustering
Points D and E are merged together to give three clusters: {A,C},{B},{D,E}
Hierarchical Clustering
Cluster {A,C} and point B are merged together to give two clusters: {A,B,C},{D,E}
Hierarchical Clustering
The two clusters are merged to give one all inclusive cluster:
{A,B,C,D,E}
Dendrogram
[Figure: the five points and the resulting dendrogram, with leaves ordered A, C, B, D, E.]
Hierarchical clusterings can be displayed in tree structures called Dendrograms. These are useful for seeing when points are merged together for the first time. The distance at which two points are merged into the same cluster for the first time is called a Cophenetic distance.
[Figure: nine points (labelled 1–9) plotted against X1 and X2, and the corresponding dendrogram.]
The cophenetic distance between points 1 and 6 is ≈ 0.4 and the cophenetic distance between points 1 and 2 is ≈ 3.
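In R, cophenetic distances can be read directly from a fitted hclust object (a minimal sketch with simulated data, not the data in the figure):

set.seed(1)
X <- matrix(rnorm(9 * 2), ncol = 2)          # nine random points in R^2
hc <- hclust(dist(X), method = "complete")
plot(hc)                                     # the dendrogram
cophenetic(hc)                               # cophenetic distance for every pair of points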
Hierarchical Clustering
Complete Linkage (Max): the maximum distance (dissimilarity) between any two observations in two different clusters:
$$d(C_k, C_{k'}) = \max_{i,j} \{\|x_i - x_j\| : C(i) = k,\ C(j) = k' \text{ with } k \ne k'\}.$$
• Pro: Will tend to produce compact clusters with relatively small maximum distances (dissimilarities) between points in the same cluster.
• Con: Some points can be closer to points in other clusters than they are to points in their own cluster.
Hierarchical Clustering
Single Linkage (Min): the minimum distance (dissimilarity) between any two observations in two different clusters:
$$d(C_k, C_{k'}) = \min_{i,j} \{\|x_i - x_j\| : C(i) = k,\ C(j) = k' \text{ with } k \ne k'\}.$$
• Con: Can produce clusters that are not compact with relatively large maximum distances (dissimilarities) between points in the same cluster.
• Con: Relatively low threshold for merging (points can be added one-by-one, rather than merging groups together).
• Con: Chaining can occur:
[Figure: two groups of points joined by a chain of intermediate points.]
Hierarchical Clustering
Group Average: the average pair-wise distance (dissimilarity) between all observations in two different clusters:
$$d(C_k, C_{k'}) = \frac{\sum_{C(i)=k} \sum_{C(j)=k'} \|x_i - x_j\|}{N_k N_{k'}} \quad (k \ne k').$$
• A compromise between single linkage and complete linkage.
Hierarchical Clustering (complete linkage)
[Figure: the data to be clustered, plotted against X1 and X2.]
Hierarchical Clustering (complete linkage)
• Left: Dendrogram for a complete linkage clustering using Euclidean distance.
• Centre: The dendrogram is cut at a height of 9 to give two distinct clusters (shown in different colours).
• Right: The dendrogram is cut at a height of 5 to give three distinct clusters (shown in different colours).
Hierarchical Clustering (complete linkage)
Three clusters found using complete linkage hierarchical clustering.
Hierarchical Clustering (complete linkage)
How do we read a dendrogram?
[Figure: nine points plotted against X1 and X2, and the corresponding complete-linkage dendrogram.]
• Points 5 and 7 are about as similar as points 1 and 6 (similar merging distances, ≈ 0.4).
• Point 9 is no more similar to point 2 than it is to points 8, 5 and 7, even though points 9 and 2 are close together on the dendrogram. This is because points 2, 8, 5 and 7 all merge with 9 at the same distance (≈ 1.8).
Hierarchical Clustering
Example: Hierarchically cluster the points
(1, 1), (2, 1), (2, 4), (3, 5) and (6, 3)
and produce a dendrogram using the 1-norm distance measure with:
(a) Single linkage; (b) Complete linkage.
Single Linkage
[Figure: the five points and the single-linkage dendrogram.]
Complete Linkage
[Figure: the five points and the complete-linkage dendrogram.]
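These dendrograms can be checked in R (a small sketch; dist() with method = "manhattan" gives the 1-norm):

pts <- matrix(c(1, 1,
                2, 1,
                2, 4,
                3, 5,
                6, 3), ncol = 2, byrow = TRUE)
d <- dist(pts, method = "manhattan")    # 1-norm distances
plot(hclust(d, method = "single"))      # (a) single linkage
plot(hclust(d, method = "complete"))    # (b) complete linkage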
Hierarchical Clustering
Strengths:
Can handle clusters of arbitrary shape and size.
Hierarchical structure can be useful for the underlying application.
Weaknesses:
Computationally expensive.
Merging decisions are final.
Conclusions
• Unsupervised learning is intrinsically more difficult than supervised learning.
• There is no gold standard (like a response variable) to assess how well we are doing.
• There is usually no single objective (like reducing the test error).
• You should consider different parameter choices and look for patterns that consistently emerge. You can also cluster different samples to see how variable the clusterings are.
Practical Issues
• Should the data be scaled? (mean zero, variance one?)
• Which linkage should be used?
• How many clusters?
...