Unsupervised learning: Clustering
Data Mining
Prof. Dr. Matei Demetrescu
Discover structure
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
The observations within each group are quite similar to each other.
Like PCA, this is unsupervised learning, but
PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.
Clustering looks for homogeneous subgroups among the observations:
Stocks that have similar behavior,
Shoppers characterized by their browsing and purchase histories,
Similar households based on data on median household income, occupation, distance from nearest urban area, …
Today’s outline
Unsupervised learning: Clustering
1 Hierarchical clustering
2 K-means clustering
3 Density-based methods
4 Up next
Hierarchical clustering
Similarity
First: we need to be able to quantify similarity.
Euclidean distance for observations with metric (quantitative) characteristics;
Special distances for binary outcomes; may also add weights;
Note that weights and scaling do make a difference, as does preliminary feature engineering. E.g. simulated data set, n = 500:
[Figure: three scatter plots (x1 vs. x2) of the simulated data under different scalings/transformations of the features; the apparent group structure changes accordingly.]
Measuring similarity is often a domain-specific consideration that must be
made based on knowledge of the data being studied.
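A minimal sketch in R of the scaling point above, with hypothetical data standing in for the simulated set; dist() and scale() are base R.

# Scaling changes Euclidean distances, and therefore the clusters they imply
set.seed(1)
X <- cbind(x1 = rnorm(500), x2 = 100 * rnorm(500))  # x2 lives on a much larger scale
d_raw    <- dist(X)         # Euclidean distances on the raw features: dominated by x2
d_scaled <- dist(scale(X))  # after standardizing each column, both features contribute
summary(as.vector(d_raw))
summary(as.vector(d_scaled))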
Building a hierarchy in a “bottom-up” fashion
Building on such measures of similarity, we start with bottom-up or agglomerative clustering.
This is the most common type of hierarchical clustering, and
Refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.
We do that in a way that reflects closeness of the data points. We may then use the dendrogram as we see fit.
Hierarchical clustering: the idea

[Figure: five points A–E are merged step by step in a “bottom-up” fashion; the closest points (here A and B) are fused first, and clusters are then merged until all points form a single hierarchy.]
Hierarchical clustering in words

The approach in words:
1 Start with each point in its own cluster.
2 Identify the two closest clusters and merge them; this requires a notion of so-called linkage (dissimilarity between clusters).
3 Repeat step 2 until all points are in a single cluster.

The sequence of merges is recorded in a dendrogram.

[Figure: five points A–E in the plane and the corresponding dendrogram.]
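A minimal sketch in R of agglomerative clustering on a few points labelled like those in the figure; the coordinates are made up for illustration, and hclust()/dist() are base R.

# Agglomerative (bottom-up) clustering of five hypothetical points A-E
X <- matrix(c(0.0, 0.1, 1.0, 1.2, 2.5,
              0.0, 0.2, 1.1, 0.9, 2.4), ncol = 2)
rownames(X) <- c("A", "B", "C", "D", "E")
hc <- hclust(dist(X), method = "complete")  # merges the closest clusters step by step
plot(hc)  # dendrogram: built from the leaves (single points) up to the trunk (one cluster)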
Types of Linkage

Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between observations in cluster A and observations in cluster B, and record the largest dissimilarity.
Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between observations in cluster A and observations in cluster B, and record the smallest of them.
Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between observations in cluster A and observations in cluster B, and record the average of these dissimilarities.
Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. (Can be a bit unstable.)
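A small sketch of how the linkage choice changes the merges, using hclust()'s built-in linkage methods on made-up data.

# The same pairwise distances, clustered under different linkage rules
set.seed(2)
Y <- matrix(rnorm(20), ncol = 2)   # 10 hypothetical observations
d <- dist(Y)
for (m in c("complete", "single", "average", "centroid")) {
  hc <- hclust(d, method = m)      # note: "centroid" is usually applied to squared distances
  cat(m, ": height of final merge =", round(max(hc$height), 2), "\n")
}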
An example

[Figure: 45 observations in two-dimensional space (X1 vs. X2); in reality they come from three distinct classes, shown in separate colors.]

45 observations generated in 2-dimensional space; there are three distinct classes, shown in separate colors. We will treat these class labels as unknown: without the color information, can we (approximately) recover the classes?

Let’s cluster with complete linkage and Euclidean distance.
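A sketch of this exercise in R with simulated stand-in data (the original 45 observations are not available here): complete linkage, Euclidean distance, and a cut of the dendrogram into three groups via cutree().

# 45 hypothetical observations from three loose groups in 2-dimensional space
set.seed(3)
X <- rbind(cbind(rnorm(15, 0), rnorm(15, 0)),
           cbind(rnorm(15, 3), rnorm(15, 3)),
           cbind(rnorm(15, 0), rnorm(15, 3)))
truth <- rep(1:3, each = 15)                 # known here, but treated as unknown by the method
hc    <- hclust(dist(X), method = "complete")  # complete linkage, Euclidean distance
found <- cutree(hc, k = 3)                   # cut the dendrogram into three clusters
table(truth, found)                          # how well do we recover the classes?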
… and the result

[Figure: dendrogram(s) obtained by applying hierarchical clustering to the above data.]

Dendrogram obtained from hierarchically clustering the above data. Height serves as an indication of the (dis)similarity (or linkage) between clusters.
Another example

[Figure: an illustration of how to properly interpret a dendrogram: nine observations in two-dimensional space (X1 vs. X2) and the dendrogram generated from these raw data.]

Observations 5 and 7, and 1 and 6, are quite similar to each other.
Observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7: observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approximately 1.8.
Merges in previous example

[Figure: four panels (X1 vs. X2) showing the successive merges for the nine observations; the most similar pairs, such as 5 and 7 and 1 and 6, are fused first, and clusters are then merged step by step.]
Building a hierarchy in a “bottom-up” fashion

Clustering is somewhat similar to classification, but:
In classification, you learn a classifier function from data; the goal is to classify new data.
In clustering, you have no idea how the classes (should there be any) are characterized; the goal is to classify the existing data.

More generally, unsupervised learning
Helps in understanding the variation and grouping structure of a set of unlabeled data;
Can be a useful pre-processor for supervised learning;
Is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like accuracy of predictions/classifications).
K-means clustering
Hierarchical clustering is agnostic about the number of clusters. But we may become more efficient if we know how many clusters there are (or how many we want).

[Figure: three panels of the simulated data set, colored by cluster assignment.]
Recall the simulated data set with 500 observations in 2-dimensional space. Panels show the results of applying K-means clustering with different values of K, the number of clusters.
Details of K-means clustering
The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.
Let $C_1, \dots, C_K$ denote disjoint sets containing the indices of all $n$ observations in each cluster. For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$ (and only $C_k$). Solve
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \mathrm{WCV}(C_k).$$
Typically we use the (squared) Euclidean distance,
$$\mathrm{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,$$
where $|C_k|$ denotes the number of observations in the $k$th cluster. (But we may use any other distance between $X_i$ and $X_{i'}$.)
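A sketch of the objective in R: the within-cluster variation for an arbitrary assignment of hypothetical data, followed by the criterion kmeans() reports for its own solution (an equivalent centroid-based form of the same objective).

set.seed(4)
X  <- matrix(rnorm(100), ncol = 2)           # n = 50 observations, p = 2
cl <- sample(1:3, nrow(X), replace = TRUE)   # an arbitrary assignment into K = 3 clusters
wcv_k <- function(Xk) sum(as.matrix(dist(Xk))^2) / nrow(Xk)  # pairwise squared distances / |C_k|
sum(sapply(split.data.frame(X, cl), wcv_k))  # objective: sum of WCV over the K clusters
# kmeans() searches over assignments to make (an equivalent form of) this criterion small:
kmeans(X, centers = 3)$tot.withinss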
Optimization

The optimization problem that defines K-means clustering is combinatorial in nature and typically very hard.

So we resort to heuristics:
1 Randomly assign an initial cluster, 1 to K, to each observation.
2 Iterate until the cluster assignments stop changing:
For each of the K clusters, compute the cluster centroid (the vector of the p feature means for the observations in the kth cluster).
Assign each observation to the cluster whose centroid is closest (where “closest” is defined using the relevant distance).

This algorithm is guaranteed to decrease the value of the objective at each step. However, it is not guaranteed to give the global minimum.
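A minimal sketch of the heuristic itself in R on hypothetical data; in practice one would simply call kmeans(). Empty clusters are not handled here.

set.seed(5)
X <- matrix(rnorm(200), ncol = 2)
K <- 3
cl <- sample(1:K, nrow(X), replace = TRUE)              # 1. random initial clusters
repeat {
  # 2a. compute the K centroids (p feature means per cluster)
  centroids <- t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
  # 2b. reassign each observation to the closest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
  cl_new <- apply(d, 1, which.min)
  if (all(cl_new == cl)) break                          # stop when assignments no longer change
  cl <- cl_new
}
table(cl)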
Progress of the K-means algorithm

[Figure: panels showing the algorithm on the simulated data: random initialization, then alternating “find centroids” and “change cluster affiliation” steps, ending with the final clusters. At each reassignment, K-means moves every observation to the closest centroid.]
With different starting values

[Figure: three panels of final clusters obtained from different random initializations.]

The target function has many local minima, leading to sometimes similar but not identical clusterings (in this case, the left outcome is the most common across many re-starts). The differences increase with the number of clusters, especially when a smaller number of clusters would have sufficed.
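A quick sketch of why multiple restarts help: kmeans() with nstart keeps the best of several random initializations (hypothetical data, with K deliberately chosen larger than needed).

set.seed(6)
X <- matrix(rnorm(1000), ncol = 2)
km1  <- kmeans(X, centers = 6, nstart = 1)    # a single random start may end in a poor local minimum
km25 <- kmeans(X, centers = 6, nstart = 25)   # keep the best of 25 random starts
c(single_start = km1$tot.withinss, best_of_25 = km25$tot.withinss)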
Density-based methods
(Gaussian) Mixture models

Many data sets arise from merging or pooling heterogeneous groups – which are then painstakingly recovered via clustering. But we have statistical models for dealing with such data!

Mixture distributions indeed result when
pooling heterogeneous populations (e.g. normals with different parameters), with
a given probability of drawing from either (sub)population.

Let’s keep it simple; the setup is easily extended to K components and multivariate X. The density of a 2-component mixture is
$$f_X(x) = \alpha_1 f\left(x; \mu_1, \sigma_1^2\right) + \alpha_2 f\left(x; \mu_2, \sigma_2^2\right),$$
where $f$ is the base density (here the Gaussian density $\phi$) and $\alpha_2 = 1 - \alpha_1$.
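A sketch of the two-component density in R, with made-up parameter values.

alpha1 <- 0.3; mu1 <- -1; s1 <- 0.7   # component 1
mu2 <- 2; s2 <- 1.2                   # component 2; alpha2 = 1 - alpha1
fX <- function(x) alpha1 * dnorm(x, mu1, s1) + (1 - alpha1) * dnorm(x, mu2, s2)
curve(fX, from = -4, to = 6)          # typically bimodal when the components are well separated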
Reformulate this as a problem with missing data

The parameter vector is $\theta = \left(\alpha_1, \alpha_2, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2\right)'$, and we have
$$\ell(\theta) = \sum_{i=1}^{n} \ln\left( \alpha_1 \phi\left(x_i; \mu_1, \sigma_1^2\right) + \alpha_2 \phi\left(x_i; \mu_2, \sigma_2^2\right) \right).$$
If we were only interested in $\theta$, we could stop here. To capture the pooling sampling mechanism,
Define iid Bernoulli random variables $Z_i$ with $P(Z = 1) = \alpha_1$,
Draw independent $\tilde{X}_1 \sim \phi\left(x; \mu_1, \sigma_1^2\right)$ and $\tilde{X}_2 \sim \phi\left(x; \mu_2, \sigma_2^2\right)$, and
Set $X = \tilde{X}_1 I(Z = 1) + \tilde{X}_2 I(Z = 0)$.
(Use the law of total probability to check that the distribution of $X$ is $f_X$.)

In this formulation we also have the latent variables $Z_i$, and estimating them delivers a clustering scheme!
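A sketch of this sampling mechanism in R, with the same made-up parameters as above; the marginal distribution of X is then the mixture $f_X$.

set.seed(7)
n <- 500
Z <- rbinom(n, 1, 0.3)                             # latent labels, P(Z = 1) = alpha1
X <- ifelse(Z == 1, rnorm(n, -1, 0.7), rnorm(n, 2, 1.2))
hist(X, breaks = 40, freq = FALSE)                 # histogram approximates the mixture density f_X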
The principle

Let $p(x, \theta) = P(Z_i = 1 \mid X_i = x)$. We could use the so-called membership weights $p_i = p(X_i, \theta)$ for clustering:
If $p_i > 0.5$, put observation i in cluster 1,
Otherwise put observation i in cluster 2.

In fact, given $X$, the $Z_i$ are independent Bernoulli, but with probability
$$P(Z_i = 1 \mid X_i = x) = \frac{\alpha_1 \phi\left(x; \mu_1, \sigma_1^2\right)}{\alpha_1 \phi\left(x; \mu_1, \sigma_1^2\right) + (1 - \alpha_1) \phi\left(x; \mu_2, \sigma_2^2\right)}$$
(see the Bayes formula). We may use any other density $f$ if $\phi$ is not suitable.

Now we just need $\hat{\theta}$ to compute estimated membership weights $\hat{p}_i$.
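A sketch of the membership weights for a few example values of x, again with made-up parameter values standing in for $\hat{\theta}$.

post1 <- function(x, alpha1, mu1, s1, mu2, s2) {
  num <- alpha1 * dnorm(x, mu1, s1)
  num / (num + (1 - alpha1) * dnorm(x, mu2, s2))   # P(Z_i = 1 | X_i = x) via the Bayes formula
}
x   <- c(-1.5, 0.2, 2.4)                           # three example observations
p_i <- post1(x, alpha1 = 0.3, mu1 = -1, s1 = 0.7, mu2 = 2, s2 = 1.2)
ifelse(p_i > 0.5, 1, 2)                            # cluster 1 if p_i > 0.5, cluster 2 otherwise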
Iterate!

Rather than optimizing $\ell(\theta)$ directly, we build on the K-means idea! So:

Initialize estimated membership weights $\hat{p}_i$ (e.g. with 0/1 outcomes from K-means clustering), and repeat:
1 Compute
$$\hat{\alpha}_1 = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i, \qquad \hat{\mu}_1 = \frac{\sum_{i=1}^{n} \hat{p}_i X_i}{\sum_{i=1}^{n} \hat{p}_i}, \qquad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{n} \hat{p}_i (X_i - \hat{\mu}_1)^2}{\sum_{i=1}^{n} \hat{p}_i}.$$
(Analogously for $\hat{\mu}_2$ and $\hat{\sigma}_2^2$, using the weights $1 - \hat{p}_i$.)
2 Now update the membership weights $\hat{p}_i$ using the $\hat{\theta}$ from step 1.
Stop when changes in $\hat{p}_i$ are below a small threshold.

The method still requires specifying the number of clusters, but having a statistical model in fact allows us to use information criteria to select K!
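A minimal sketch of the iteration for a univariate two-component Gaussian mixture; this is essentially the EM algorithm, and dedicated packages (e.g. mclust) do this, and the model selection, properly. Data and starting values are hypothetical.

set.seed(8)
x <- c(rnorm(150, -1, 0.7), rnorm(350, 2, 1.2))       # pooled sample from two normal populations
p <- as.numeric(kmeans(x, centers = 2)$cluster == 1)  # 0/1 membership weights from K-means
for (it in 1:200) {
  # 1. weighted parameter estimates (component 2 uses the weights 1 - p)
  a1  <- mean(p)
  mu1 <- sum(p * x) / sum(p)
  mu2 <- sum((1 - p) * x) / sum(1 - p)
  s1  <- sqrt(sum(p * (x - mu1)^2) / sum(p))
  s2  <- sqrt(sum((1 - p) * (x - mu2)^2) / sum(1 - p))
  # 2. update the membership weights with the current parameter estimates
  num   <- a1 * dnorm(x, mu1, s1)
  p_new <- num / (num + (1 - a1) * dnorm(x, mu2, s2))
  if (max(abs(p_new - p)) < 1e-8) break                # stop when the weights barely change
  p <- p_new
}
round(c(alpha1 = a1, mu1 = mu1, sigma1 = s1, mu2 = mu2, sigma2 = s2), 2)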
Mixtures can be more informative

[Figure: cluster assignments on the simulated data after iteration 0 and after iteration 30 of the above procedure.]
Cluster to the mode
The above procedure essentially clusters observations around modes of the mixture distribution.
Let’s now say that the density is multimodal, but there’s no specific mixture representation.
Still, we may find the modes by numerical optimization: starting from some initial value, move uphill until we reach a mode.

Starting points that lead to the same mode should be in the same cluster! They lie in the so-called basin of attraction of the respective mode, which means that they are closer to this mode than to any other.

So we take each data point as a starting value and see where it ends up – and we don’t have to fix the number of clusters in advance.
Mean-shift clustering

To make this idea operational, we move along the gradient of an estimate of the unknown density, where the unknown density is estimated locally.

In its most basic form, the resulting clustering algorithm is: for each data point i, set $\tilde{X}_i = X_i$ and iterate as follows:
Find all points $X_j$ for which $\|\tilde{X}_i - X_j\| \le h$;
Set $\tilde{X}_i = \sum_j X_j / \#\{X_j\}$, i.e. find the centroid of all data points within a radius $h$ of the current $\tilde{X}_i$;
Stop when changes in $\tilde{X}_i$ are below a tolerance threshold.
Group all $X_i$ with the same final $\tilde{X}_i$ into the same cluster.

(More complicated versions add weights to enhance the density estimation; see Advanced Statistics III.)
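A minimal sketch of this basic (unweighted, flat-kernel) version in R on hypothetical data; empty neighbourhoods and efficiency are not dealt with, and the bandwidth h is simply fixed by hand.

set.seed(9)
X <- rbind(matrix(rnorm(100, 0, 0.5), ncol = 2),
           matrix(rnorm(100, 3, 0.5), ncol = 2))     # two well-separated groups
h <- 1
shift_one <- function(x0, dat, h, tol = 1e-6) {
  repeat {                                           # move x0 to the centroid of its h-neighbourhood
    nb <- dat[sqrt(rowSums((dat - matrix(x0, nrow(dat), ncol(dat), byrow = TRUE))^2)) <= h, ,
              drop = FALSE]
    x1 <- colMeans(nb)
    if (sqrt(sum((x1 - x0)^2)) < tol) return(x1)     # converged to (an estimate of) a mode
    x0 <- x1
  }
}
modes <- t(apply(X, 1, shift_one, dat = X, h = h))   # start the iteration from every data point
cluster <- cutree(hclust(dist(modes), method = "single"), h = 0.1)  # same limit point -> same cluster
table(cluster)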
Tuning parameters matter
The only tuning parameter is the so-called bandwidth h – and it clearly influences clustering outcomes.
[Figure: mean-shift clustering outcomes on the simulated data for three different bandwidths h.]
The bandwidth may be chosen via a special form of cross-validation; see Advanced Statistics III again.
Up next
Coming up
return()