Unsupervised Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2021
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 1 / 76
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from figures for the book
“Python Data Science Handbook” by J. VanderPlas O’Reilly Media (2017) http://shop.oreilly.com/product/0636920034919.do
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 2 / 76
Aims
Aims
This lecture will develop your understanding of unsupervised learning methods. Following it you should be able to:
• compare supervised with unsupervised learning
• describe the problem of dimensionality reduction
• outline the method of Principal Component Analysis
• describe k-means clustering
• outline the role of the EM algorithm in k-means clustering • describe hierarchical clustering
• discuss inductive bias in clustering
• describe methods of evaluation for unsupervised learning
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 3 / 76
Introduction
Supervised vs. Unsupervised Learning
Supervised learning — classes are known and need a “definition”, in terms of the data. Methods are known as: classification, discriminant analysis, class prediction, supervised pattern recognition.
Unsupervised learning — classes are initially unknown and need to be “discovered” with their definitions from the data. Methods are known as: cluster analysis, class discovery, unsupervised pattern recognition.
So: unsupervised learning methods, such as clustering, address the problem of assigning instances to classes given only observations about the instances, i.e., without being given class “labels” for instances by a “teacher”.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 4 / 76
Introduction
Unsupervised learning
Why do we need unsupervised learning ?
• most of the world’s data is unlabelled • getting a human to label data is often
• difficult (what are the classes ?)
• time-consuming (labelling requires thinking) • expensive (see above)
• error-prone (mistakes, ambiguity)
• in principle, can use any feature as the “label”
• unfortunately, often the class is not a known feature
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 5 / 76
Introduction
Unsupervised learning
What is unsupervised learning good for ?
• simplifying a problem, e.g., by dimensionality reduction
• exploratory data analysis, e.g., with visualization
• data transformation to simplify a classification problem
• to group data instances into subsets
• to discover structure, like hierarchies of subconcepts
• to learn new “features” for later use in classification
• to track “concept drift” over time
• to learn generative models for images, text, video, speech, etc.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 6 / 76
Dimensionality Reduction
Dimensionality Reduction
What is this and why would we need it ?
• each numeric feature in a dataset is a dimension
• in general, no restrictions on the number of dimensions • however, many features could be related
• do we need them all in our dataset ?
• including them is all unlikely to improve models • feature selection may return arbitrary features
• so,whattodo?
• one solution would be to find set of new features
• should be fewer than the original set
• should preserve information in original set (as far as possible)
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 7 / 76
Dimensionality Reduction
Principal Component Analysis (PCA)
Key idea: look for features in a transformed space so that each dimension in the new space captures the most variation in the original data when it is projected onto that dimension.
Any new features should be highly correlated with (some of) the original features, but not with any of the other new features.
This suggests an approach: consider using the variance-covariance matrix we recall from correlation and regression
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 8 / 76
Dimensionality Reduction
PCA Example
PCA looks for linear combinations of the original features. This dataset of 200 points seems to show such a relationship between two feature dimensions.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 9 / 76
Dimensionality Reduction
PCA Example
PCA finds two new features on which the original data can be projected, rotated and scaled. These explain respectively 0.75 and 0.02 of the variance.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 10 / 76
Dimensionality Reduction
PCA Algorithm
This algorithm can be presented in several ways. Here are the basic steps in terms of the variance reduction idea:
1 take the data as an n×m matrix X
2 “centre” the data by subtracting the mean of each column
3 construct covariance matrix C from centred matrix
4 compute eigenvector matrix V (rotation) and eigenvalue matrix S (scaling) such that V−1CV = S, and S is a diagonal m × m matrix
5 sort columns of S in decreasing order (decreasing variance)
6 remove columns of S below some minimum threshold
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 11 / 76
Dimensionality Reduction
PCA Example
By rejecting the second component we reduce the dimensionality by 50% while preserving much of the original variance, seen by plotting the inverse transform of this component along with the original data.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 12 / 76
Dimensionality Reduction
PCA and friends
• PCA typically computed using the Singular Value Decomposition (SVD)
• complexity is cubic in number of original features
• this is not feasible for high-dimensional datasets
• alternatively, approximate the sort of projection found by PCA
• for example, can use Random Projections
• more scalable, but what about quality of components ?
• can be shown to preserve distance relations from the original data
• many other methods use essentially the same matrix decomposition idea, such as finding “topics” in text using Latent Semantic Analysis, finding hidden “sub-groups” for recommender systems, and so on
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 13 / 76
Dimensionality Reduction
Latent Factor Representations for Recommendation
Recommender systems typically based on a large sparse ratings matrix. Every item is rated (or purchased) by a user
Ratings are typically on a scale of 1 to 5
Matrix is m users ×n items
Elements contain rating ru,i of user u for item i
Since most values ru,i are missing, apply matrix decomposition1.
1See, e.g., Koren and Bell (2011).
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 14 / 76
Dimensionality Reduction
Example: Latent Factors for Recommendation I
Consider the following matrix:
1010 0222 0 0 0 1 1 2 3 2 1 0 1 1 0223
Imagine these represent ratings by six different people (in rows), on a scale of 0 to 3, of four different films – say The Shawshank Redemption, The Usual Suspects, The Godfather, and The Big Lebowski, (in columns, from left to right). The Godfather seems to be the most popular of the four with an average rating of 1.5, and The Shawshank Redemption is the least appreciated with an average rating of 0.5.
Can you see any structure in this matrix?
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 15 / 76
Dimensionality Reduction
Example: Latent Factors for Recommendation II
1010 100
0222 010 100 1010 0 0 0 1 0 0 1 1232=110×020×0111 001 0001 1011 101
0223 011
• The right-most matrix associates films (in columns) with genres (in rows): The Shawshank Redemption and The Usual Suspects belong to two different genres, say drama and crime, The Godfather belongs to both, and The Big Lebowski is a crime film and also introduces a new genre (say comedy).
• The tall, 6-by-3 matrix then expresses people’s preferences in terms of genres.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 16 / 76
Dimensionality Reduction
Example: Latent Factors for Recommendation III
• Finally, the middle matrix states that the crime genre is twice as important as the other two genres in terms of determining people’s preferences.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 17 / 76
Dimensionality Reduction
Learning Latent Factor Representations
• latent representations in deep learning known as “embeddings” • embedding learns a “distance function” on the training data
• can be used for recommendation, graph data, and so on
• dimensionality reduction
• based on an “encoder-decoder” architecture
• unsupervised loss function is difference between output and input
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 18 / 76
Dimensionality Reduction
Dimensionality Reduction Summary
• PCA will transform original features to new space
• every new feature is a linear combination of original features
• aim for new dimensions to maximise variance
• order by decreasing variance and remove those below a threshold
• this reduces dimensionality
• algorithm applies matrix operations to translate, rotate and scale
• based on covariance matrix
• can be “kernelised” (KernelPCA)
• new feature space with non-linear axes
• many alternatives, e.g., Random Projections, Independent Component Analysis, Multi-dimensional Scaling, Word2Vec, etc.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 19 / 76
Clustering
Clustering
• is finding groups of items that are similar
• clustering is unsupervised
• the class of any data instance is not known
• success of clustering often measured subjectively • OK for exploratory data analysis (EDA) . . .
• but problematic if you need quantitive results . . . • some visual and statistical approaches
• a dataset for clustering is just like a dataset for classification, but without the class !
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 20 / 76
Clustering
Simple 2D representations of clustering
Clusters form a partition Venn diagram (overlapping clusters)
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 21 / 76
Clustering
Other representations of clustering
Probabilistic assignment Dendrogram
COMP9417 ML & DM Unsupervised Learning
Term 2, 2021 22 / 76
Clustering
Cluster analysis
Clustering algorithms form two broad categories: hierarchical methods and partitioning methods.
Hierarchical algorithms are either agglomerative i.e. bottom-up or divisive i.e. top-down.
In practice, hierarchical agglomerative methods often used – efficient exact algorithms available, but more importantly to users the dendrogram, or tree, can be visualized.
Partitioning methods usually require specification of the number of clusters, then try to construct the clusters and fit objects to them.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 23 / 76
Clustering
Representation
Let N = {e1, . . . , en} be a set of elements, i.e. instances. Let C = (C1,…,Cl) be a partition of N into subsets. Each subset is called a cluster, and C is called a clustering. Input data can have two forms:
1 each element is associated with a real-valued vector of p features e.g. measurement levels for different features
2 pairwise similarity data between elements, e.g. correlation, distance (dissimilarity)
Feature-vectors have more information, but similarity is generic (given the appropriate function). Feature-vector matrix: N × p, similarity matrix
N × N. In general, often N ≫ p.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 24 / 76
Clustering
Clustering framework
• goal of clustering: find a partition of N elements (instances) into homogeneous and well-separated clusters
• elements from same cluster should have high similarity, i.e, form a homogeneous cluster, while elements from different clusters should have low similarity, i.e., be well-separated
• note: homogeneity and separation need to be defined
• in practice, use a distance measure appropriate to the problem
• also note: typically there are interactions between homogeneity and separation – usually, high homogeneity is linked with low separation, and vice versa, unless there is clear structure in the data
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 25 / 76
Clustering
A bad clustering
!”#$%&'($)*+#,-%.#/’0)*$%1/)”%!2/3/-*,*#)4%0,5% 6*70+08/,%7+#,’*$%
630”%5#$)0,&*$%1*)9**,% 7/#,)$%#,%$*70+0)*%&'($)*+$%
:0+-*%5#$)0,&*$%1*)9**,% 7/#,)$%#,%)”*%$03*%&'($)*+%
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 26 / 76
Clustering
A good clustering
!”#$%&'($)*+#,-%$./$0*$%12)”!3242-*,*#)5%.,6% 7*8.+./2,%8+#,’*$%
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 27 / 76
k-means Clustering
k-means Clustering
Simple, but often effective, algorithm to cluster numerical data around k centroids, i.e., means.
Need to choose value for k, and initialise the centroids. Procedure:
Find a “good” clustering by iteratively assigning each data point to the nearest of the k centroids, then updating each centroid to reflect any points that have been re-assigned
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 28 / 76
k-means Clustering
k-means Clustering
P(i) is the cluster assigned to element i, c(j) is the centroid of cluster j, d(v1,v2) the Euclidean distance between feature vectors v1 and v2.
The goal is to find a partition P for which the error (distance) function EP =ni=1d(i,c(P(i))isminimum.
Centroid is the mean or weighted average of the points in the cluster.
k-means is an important clustering method, widely-used in many different areas, that can be viewed in terms of the EM (Expectation-Maximization) algorithm.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 29 / 76
k-means Clustering
k-means Clustering algorithm
Algorithm k-means
/* feature-vector matrix M(ij) is given */
1 Start with an arbitrary partition P of N into k clusters
2 for each element i and cluster j ̸= P (i) let Eij be the
cost of a solution in which i is moved to j:
1 if Ei∗j∗ = minijEij < EP then move i∗ to cluster j∗ and
repeat step 2 else halt.
PP
P
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 30 / 76
k-means Clustering
k-means Clustering
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 31 / 76
k-means Clustering
k-means Clustering
Previous diagram shows three steps to convergence in k-means with k = 3
• means move to minimize squared-error criterion
• approximate method of obtaining maximum-likelihood estimates for means
• each point assumed to be in exactly one cluster
• if clusters “blend”, fuzzy k-means (i.e., overlapping clusters)
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 32 / 76
k-means Clustering
k-Means Clustering – steps in EM algorithm
https://jakevdp.github.io/PythonDataScienceHandbook/05.11- k- means.html.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 33 / 76
k-means Clustering
k-means clustering: initialisation
X1 X2 Centroid 14- 16- 25- 34- 36- 51- 52- 61- 62- 63- 72-
Centroid locations centroid 1: (3.2, 9.8) centroid 2: (9.3, 7.1)
COMP9417 ML & DM
Unsupervised Learning
Term 2, 2021
34 / 76
k-means Clustering
k-means clustering: assign to centroids
X1 X2 Centroid 141 161 251 341 361 512 522 612 622 632 722
Centroid locations centroid 1: (3.2, 9.8) centroid 2: (9.3, 7.1)
COMP9417 ML & DM
Unsupervised Learning
Term 2, 2021
35 / 76
k-means Clustering
k-means clustering: recompute centroids
X1 X2 Centroid 141 161 251 341 361 512 522 612 622 632 722
Centroid locations centroid 1: (2.0, 5.0) centroid 2: (5.8, 1.8)
COMP9417 ML & DM
Unsupervised Learning
Term 2, 2021
36 / 76
k-means Clustering
k-means clustering: solution found
Shown on the 3 previous slides are the initialization and the two main steps of the k-means algorithm on the given dataset.
In this simple example k-means clustering has found a solution (the two centroids) after a single iteration, and the algorithm will not change it on further iterations.
By inspection, we can see the solution is a “good clustering”, in the sense that the two “natural” clusters in the dataset have been identified.
In general, the quality of the solution will depend on
• the distribution of the points in the dataset
• the choice of k
• the choice of the location to initialise the centroids.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 37 / 76
k-means Clustering
k-means Clustering
What about the number of clusters k ?
Next diagrams show convergence in k-means with k = 3 for data with two
clusters not well separated
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 38 / 76
k-means Clustering
k-means Clustering
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 39 / 76
k-means Clustering
k-means Clustering
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 40 / 76
k-means Clustering
Practical k-means
• Algorithm can get trapped in a local minimum • For example:
• place four instances at the vertices of a two-dimensional rectangle
• local minimum: two cluster centers at the midpoints of the rectangle’s
long sides
• Result can vary significantly based on initial choice of seeds
• Simple way to increase chance of finding a global optimum: restart with different random seeds
• can be time-consuming
• Or use the k-means++ algorithm, which initialises k centroids to be
maximally distant from each other
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 41 / 76
The EM Algorithm
Expectation Maximization (EM)
When to use:
• Data is only partially observable
• Unsupervised learning (e.g., clustering – class value “unobservable”) • Supervised learning (e.g., some instance values “unobservable”)
Some uses:
• Missing value estimation
• ”Latent” variables (e.g., clusters in k-means)
• Learning Hidden Markov Models (Baum-Welch algorithm) • Train Bayes Nets
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 42 / 76
The EM Algorithm
Finite mixtures
Each instance x generated by
1 Choosing one of the k Gaussians with uniform probability
2 Generating an instance at random according to that Gaussian
Called finite mixtures because there is only a finite number of generating distributions being represented.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 43 / 76
The EM Algorithm
Generating Data from Mixture of k Gaussians
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 44 / 76
x
p(x)
The EM Algorithm
EM for Estimating Means of k Gaussians
Given:
• Instances from X generated by mixture of k Gaussian distributions • Unknown means ⟨μ1, . . . , μk⟩ of the k Gaussians
• Don’t know which instance xi was generated by which Gaussian
Determine:
• Maximum likelihood estimates of ⟨μ1, . . . , μk⟩
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 45 / 76
The EM Algorithm
EM for Estimating Means of k Gaussians
Think of full description of each instance as yi = ⟨xi, zi1, zi2⟩, where • zij is 1 if xi generated by jth Gaussian, otherwise zero
• xi observable, from instance set x1, x2, . . . , xm
• zij unobservable
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 46 / 76
The EM Algorithm
EM for Estimating Means of k Gaussians
Initialise: Pick random initial h = ⟨μ1, μ2⟩ Iterate:
E-step:
Calculate expected value E[zij] of each hidden variable zij, assuming current hypothesis h = ⟨μ1, μ2⟩ holds:
E[zij]
p(x = xi|μ = μj)
= 2n=1 p(x = xi|μ = μn)
=
− 1 (x −μ )2 e2σ2 i j
2 − 1 (xi−μn)2 n=1 e 2σ2
COMP9417 ML & DM
Unsupervised Learning
Term 2, 2021
47 / 76
The EM Algorithm
EM for Estimating Means of k Gaussians
M-step:
Calculate new maximum likelihood hypothesis h′ = ⟨μ′1, μ′2⟩, assuming value taken on by each hidden variable zij is
the expected value E[zij] calculated before.
Replace h = ⟨μ1, μ2⟩ by h′ = ⟨μ′1, μ′2⟩.
i.e.
mi=1E[zij] xi μj← mi=1E[zij]
1 m
μj ← m
Unsupervised Learning
i=1
E[zij]xi
COMP9417 ML & DM
Term 2, 2021
48 / 76
The EM Algorithm
EM Algorithm
Converges to local maximum likelihood h
and provides estimates of hidden variables zij
In fact, local maximum in E[ln P (Y |h)]
• Y is complete (observable plus unobservable variables) data
• Expected value taken over possible values of unobserved variables in Y
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 49 / 76
Hierarchical Clustering
Hierarchical clustering
• Bottom up: at each step join the two closest clusters (starting with single-instance clusters)
• Design decision: distance between clusters
• E.g. two closest instances in clusters vs. distance between means
• Top down: find two clusters and then proceed recursively for the two subsets
• Can be very fast
• Both methods produce a dendrogram (tree of “clusters”)
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 50 / 76
Hierarchical Clustering
Hierarchical clustering
Algorithm Hierarchical agglomerative /* dissimilarity matrix D(ij) is given */
1 Find minimal entry dij in D and merge clusters i and j
2 Update D by deleting column i and row j, and adding new
row and column i∪j
3 Revise entries using
dk,i∪j = di∪j,k = αidki + αjdkj + γ|dki − dkj|
4 If there is more than one cluster then go to step 1.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 51 / 76
Hierarchical Clustering
Hierarchical clustering
The algorithm relies on a general updating formula. With different operations and coefficients, many different versions of the algorithm can be used to give variant clusterings.
Single linkage dk,i∪j = min(dki,dkj) and αi = αj = 12 and γ = −12. Complete linkage dk,i∪j = max(dki,dkj) and αi = αj = 12 and γ = 12.
Average linkage dk,i∪j = nidki + njdkj and αi = ni ,αj = nj ni +nj ni +nj ni +nj ni +nj
and γ = 0.
Note: dissimilarity computed for every pair of points with one point in the
first cluster and the other in the second.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 52 / 76
Hierarchical Clustering
Hierarchical clustering
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 53 / 76
Hierarchical Clustering
Hierarchical clustering
Represent results of hierarchical clustering with a dendrogram See next diagram
• at level 1 all points in individual clusters
• x6 and x7 are most similar and are merged at level 2
• dendrogram drawn to scale to show similarity between grouped clusters
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 54 / 76
Hierarchical Clustering
Hierarchical clustering
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 55 / 76
Hierarchical Clustering
Hierarchical clustering
An alternative representation of hierarchical clustering based on sets shows hierarchy (by set inclusion), but not distance.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 56 / 76
Hierarchical Clustering
Dendrograms
Two things to beware of:
1 tree structure is not unique for given clustering - for each bottom-up merge the sub-tree to the right or left must be specified - 2n−1 ways to permute the n leaves in a dendrogram
2 hierarchical clustering imposes a bias - the clustering forms a dendrogram despite the possible lack of a implicit hierarchical structuring in the data
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 57 / 76
Hierarchical Clustering
Dendrograms
Next diagram: average-linkage hierarchical clustering of microarray data. For this dataset the class of each instance is shown in each leaf of dendrogram to illustrate how clustering has grouped similar tissue samples coincides with the labelling of samples by cancer subtype.
Followed on next slide by diagram showing:
• average-linkage based on average dissimilarity between groups
• complete-linkage based on dissimilarity of furthest pair between groups
• single-linkage based on dissimilarity of closest pair between groups
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 58 / 76
Hierarchical Clustering
Dendrograms
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 59 / 76
Hierarchical Clustering
Dendrograms
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 60 / 76
Hierarchical Clustering
Dendrograms – hierarchical clustering of data by rows or columns
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 61 / 76
Hierarchical Clustering
Inductive bias of clustering
• k-means: clusters spherical around centroid
• hierarchical agglomerative: clusters form a tree (dendrogram)
These assumptions may not be justified, and can be relaxed, e.g., kernel k-means.
Many other clustering algorithms are available . . .
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 62 / 76
Hierarchical Clustering
DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN)
Defines clusters as dense regions of data instances
Density is based on having more than a minimum number of points (MinPts) within a radius ε
Does not force all points in the dataset to be members of clusters
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 63 / 76
Hierarchical Clustering
DBSCAN
• a core point has more than MinPts neighbours within ε
• a border point has fewer neighbours than MinPts within ε, but is
within ε of a core point
• all other points are noise points
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 64 / 76
Hierarchical Clustering
DBSCAN
First, label all points as core, border or noise
1 make a cluster for each core point, or set of connected core points
• core points are connected if they are within ε of each other
2 assign each border point to the cluster of the corresponding core point
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 65 / 76
Hierarchical Clustering
DBSCAN example in SKLearn
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 66 / 76
Evaluating clustering
Cluster quality visualisation - Silhouette plot
Key idea: compare each object’s separation from other clusters relative to the homogeneity of its cluster.
For each object i, define its silhouette width s(i) ∈ [−1, 1]:
Let a(i) be the average dissimilarity between i and elements of P(i), i.e.,
cluster to which i belongs,
Let d(i,C) be the average dissimilarity of i to elements of some other
cluster C.
Let b(i) = minC d(i, C ). The silhouette width is
b(i) − a(i) max(a(i), b(i))
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 67 / 76
Evaluating clustering
Cluster quality visualisation - Silhouette plot
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 68 / 76
Evaluating clustering
Cluster quality visualisation - Silhouette plot
How can we intepret a Silhouette plot ?
• say for some object i we have b(i) ≫ a(i)
• then it will be well-separated from all the other clusters, and its cluster will probably be homogeneous
• in such cases s(i) will tend to be close to 1 for object i, and we can take it to be “well-classified” by its cluster
• conversely, if s(i) is close to −1 we can view it as “misclassified” by its cluster
• can see which clusters are “good” or otherwise, and estimate number of clusters
• can take average s(i) over different clusterings for comparison
Silhouette plots available in R.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 69 / 76
Evaluating clustering
How many clusters ?
Example: in k-means clustering, trying to minimize a loss function in which the goal of clustering is not met:
• running on microarray data of 6830 × 64 matrix
• total within-cluster “dispersion” (i.e., sum of squareid distances from
each point to the cluster centroid) is reduced for k = 1 to 10 • look for “elbow”, but no obvious “correct” k
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 70 / 76
Evaluating clustering
How many clusters ?
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 71 / 76
Evaluating clustering
How many clusters ?
Many methods of estimating the “correct” number of clusters have been proposed. Here we mention two using some clustering criterion (such as“dispersion”i.e.,sum of squared distances from each point to the cluster centroid).
• compare value for single cluster to sum of values for breaking cluster into two
• if there is a “significant” reduction then keep two clusters2
• formalise “elbow” detection
• Gap statistic3
• compares graph of within-cluster dispersion against k to a random reference value
2“Pattern Recognition” Duda & Hart (1973).
3“Estimating the number of clusters in a data set via the gap statistic” R. Tibshirani et al. J. R. Statist. Soc. B (2001) 63, Part 2, 411–423.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 72 / 76
Evaluating clustering
Clustering Summary
• many techniques available – may not be single “magic bullet” rather different techniques useful for different aspects of data
• hierarchical clustering gives a view of the complete structure found, without restricting the no. of clusters, but can be computationally expensive
• different linkage methods can produce very different dendrograms
• higher nodes can be very heterogeneous
• problem may not have a “real” hierarchical structure
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 73 / 76
Evaluating clustering
Clustering summary
• k-means avoids some of these problems, but also has drawbacks
• cannot extract “intermediate features” e.g., a subset of features in
which a subset of objects is co-expressed
• for all of these methods, can cluster objects or features, but not both together (coupled two-way clustering)
• should all the points be clustered ? modify algorithms to allow points to be discarded
• visualization is important: dendrograms and alternatives like Self-Organizing Maps (SOMs) are good but further improvements would help; see also T-SNE for visualization
• algorithms like DBSCAN avoid some of these issues
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 74 / 76
Evaluating clustering
Clustering summary
• how can the quality of clustering be estimated ?
• if clusters known, measure proportion of disagreements to agreements
• if unknown, measure homogeneity (average similarity between feature
vectors in a cluster and the centroid) and separation (weighted average similarity between cluster centroids) with aim of increasing homogeneity and decreasing separation
• sihouette method, etc.
• clustering is only the first step - mainly exploratory; classification,
modelling, hypothesis formation, etc.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 75 / 76
Summary
Unsupervised Learning Summary
Unsupervised and supervised learning are at different ends of a continuum of “degrees of supervision”. Between these extremes many other approaches are possible.
• semi-supervised learning, e.g.,
• train with small labelled sample then improve with large unlabelled
sample
• train with large unlabelled sample then learn classes with small labelled
sample
• reinforcement learning can be viewed as unsupervised
• “reward” is a signal from the “environment”
• learning is designed to optimize function of reward • active learning
• learning system acts to generate its own examples
Note: unsupervised learing an increasingly active research area, particularly in neural nets, such as Autoencoders. For more see, e.g., Yann LeCun: ”How Could Machines Learn as Efficiently as Animals and Humans?”
http://www.youtube.com/watch?v=uYwH4TSdVYs
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 76 / 76
References
Koren, Y. and Bell, R. (2011). Advances in Collaborative Filtering. In Ricci, F., Rokach, L., Shapira, B., and Kantor, P., editors, Recommender Systems Handbook. Springer.
COMP9417 ML & DM Unsupervised Learning Term 2, 2021 76 / 76