Unsupervised Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will develop your understanding of unsupervised learning methods. Following it, you should be able to:
• describe the problem of unsupervised learning
• describe k-means clustering
• describe Gaussian Mixture Models (GMM)
• outline the Expectation Maximization (EM) algorithm for 𝑘-means, a specific case of EM for Gaussian Mixture Models
• describe hierarchical clustering
• describe evaluation methods for clustering
• describe the problem of dimensionality reduction
• outline the method of Principal Component Analysis
• outline semi-supervised learning
COMP9417 T1, 2021 1
Supervised vs. Unsupervised Learning
Supervised learning — classes are known and need a “definition”, in terms of the data. Methods are known as: classification, discriminant analysis, class prediction, supervised pattern recognition.
Unsupervised learning — classes are initially unknown and need to be “discovered” with their definitions from the data. Methods are known as: cluster analysis, class discovery, unsupervised pattern recognition.
So: unsupervised learning methods, such as clustering, address the problem of assigning instances to classes given only observations about the instances, i.e., without being given class “labels” for instances by a “teacher”.
COMP9417 T1, 2021 2
Unsupervised Learning
Why do we need unsupervised learning ?
§ most of the world’s data is unlabelled
§ getting a human to label data is often
• difficult (what are the classes?)
• time-consuming (labelling requires thinking)
• expensive (see above)
• error-prone (mistakes, ambiguity)
§ in principle, can use any feature as the “label”
§ unfortunately, often the class is not a known feature
COMP9417 T1, 2021 3
Unsupervised Learning
What is unsupervised learning good for ?
§ simplifying a problem, e.g., by dimensionality reduction
§ exploratory data analysis, e.g., with visualization
§ data transformation to simplify a classification problem
§ to group data instances into subsets
§ to discover structure, like hierarchies of subconcepts
§ to learn new “features” for later use in classification
§ to track “concept drift” over time
§ to learn generative models for images, text, video, speech, etc.
COMP9417 T1, 2021 4
Clustering
Finding groups of items that are similar.
Clustering is unsupervised:
§ the class of any data instance is not known
Success of clustering is often measured subjectively:
§ OK for exploratory data analysis (EDA) . . .
§ but problematic if you need quantitative results . . .
§ some visual and statistical approaches
A dataset for clustering is just like a dataset for classification, but without the class labels
COMP9417 T1, 2021 5
Simple 2D representations of clustering
Clusters form a partition
Venn diagram (overlapping clusters)
COMP9417 T1, 2021 6
Other representations of clustering
Probabilistic assignment
Dendrogram
COMP9417 T1, 2021 7
Cluster Analysis
Clustering algorithms form two broad categories:
§ partitioning methods
§ hierarchical methods
• Partitioning methods usually require specification of the number of clusters, then try to construct the clusters and fit objects to them.
• Hierarchical algorithms are either agglomerative i.e., bottom-up or divisive i.e., top-down.
• In practice, hierarchical agglomerative methods are often used – efficient exact algorithms are available, but more importantly to users, the dendrogram, or tree, can be visualized.
COMP9417 T1, 2021 8
Representation
Let X = {x1, . . . , xn} be a set of instances.
Let 𝐶 = (𝐶1, . . . , 𝐶𝑙) be a partition of the n elements into 𝑙 subsets.
Each subset is called a cluster, and 𝐶 is called a clustering.
Input data can have two forms:
§ each element is associated with a real-valued vector of 𝑝 features, e.g., measurement levels for different features
§ pairwise similarity data between elements, e.g., correlation, distance (dissimilarity)
Feature-vectors have more information, but similarity is generic (given the appropriate function). Feature-vector matrix: 𝑛 × 𝑝; similarity matrix: 𝑛 × 𝑛. In general, often 𝑛 ≫ 𝑝.
COMP9417 T1, 2021 9
Clustering Framework
§ Goal of clustering: find a partition of n elements (instances) into homogeneous and well-separated clusters
§ Elements from the same cluster should have high similarity, i.e., form a homogeneous cluster, while elements from different clusters should have low similarity, i.e., be well-separated
§ Note: homogeneity and separation need to be defined
§ In practice, use a distance measure appropriate to the problem
§ Also note: typically, there are interactions between homogeneity and separation – usually, high homogeneity is linked with low separation, and vice versa, unless there is clear structure in the data
COMP9417 T1, 2021 10
A bad clustering
This clustering violates both homogeneity and separation principles
COMP9417 T1, 2021 11
A good clustering
This clustering satisfies both homogeneity and separation principles
COMP9417 T1, 2021 12
𝑘-means Clustering
Set value for 𝑘, the number of clusters (by prior knowledge or via search)
Initialise: choose points for centres (means) of 𝑘 clusters (at random)
Procedure:
1) assign each instance 𝑥 to the closest of the 𝑘 points to form 𝑘 clusters
2) re-assign the 𝑘 points to be the means of each of the 𝑘 clusters
3) repeat 1 and 2 until convergence to a reasonably stable clustering
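Below is a minimal NumPy sketch of this procedure (not the course's reference implementation); the array shapes, convergence tolerance and iteration cap are illustrative assumptions.

```python
# Minimal sketch of the k-means procedure above, assuming X is an (n, p) NumPy array.
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # Initialise: choose k data points at random as the initial centres (means)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each instance to the closest of the k centres
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: re-compute each centre as the mean of its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # Step 3: repeat until the centres stop moving (a stable clustering)
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return centres, labels
```

Typical usage would be `centres, labels = kmeans(X, k=3)`.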
COMP9417 T1, 2021 13
𝑘-means Clustering
𝑃𝑖 is the cluster assigned to element 𝑖 (i.e., to x𝑖), 𝑐𝑗 is the centroid of cluster 𝑗, and 𝑑(𝑣1, 𝑣2) is the Euclidean distance between feature vectors 𝑣1 and 𝑣2.
The goal is to find a partition 𝑃 for which the error (distance) function is minimum:

$$E_P = \sum_{i=1}^{n} d(\mathbf{x}_i, c_{P_i})$$
Centroid is the mean or weighted average of the points in the cluster.
• 𝑘-means minimizes the within-cluster sum of squares.
• 𝑘-means is an important clustering method, widely-used in many different areas, that can be viewed in terms of the EM (Expectation- Maximization) algorithm.
COMP9417 T1, 2021 14
𝑘-means Clustering
Here is another view of the 𝑘-means clustering algorithm:
COMP9417 T1, 2021 15
𝑘-means Clustering
COMP9417 T1, 2021 16
𝑘-means Clustering
The previous diagram shows three steps to convergence in 𝑘-means with 𝑘 = 3
§ means move to minimize squared-error criterion
§ approximate method of obtaining maximum-likelihood estimates for means
§ each point assumed to be in exactly one cluster
§ if clusters “blend”, use fuzzy 𝑘-means (i.e., overlapping clusters)
COMP9417 T1, 2021 17
𝑘-means Clustering: initialisation
COMP9417 T1, 2021 18
𝑘-means Clustering: assign to centroids
COMP9417 T1, 2021 19
𝑘-means Clustering: recompute centroids
COMP9417 T1, 2021 20
𝑘-means Clustering: solution found
Shown on the 3 previous slides are the initialization and the two main steps
of the 𝑘-means algorithm on the given dataset.
In this simple example 𝑘-means clustering has found a solution (the two centroids) after a single iteration, and the algorithm will not change it on further iterations.
By inspection, we can see the solution is a “good clustering”, in the sense that the two “natural” clusters in the dataset have been identified.
In general, the quality of the solution will depend on
§ the distribution of the points in the dataset
§ the choice of 𝑘
§ outliers
§ the choice of the location to initialise the centroids.
COMP9417 T1, 2021 21
𝑘-means Clustering: parameter
What about the number of clusters 𝑘 ?
Next diagrams show convergence in 𝑘-means clustering with 𝑘 = 3 for data with two clusters not well separated.
COMP9417 T1, 2021 22
𝑘-means Clustering: parameter
COMP9417 T1, 2021 23
𝑘-means Clustering: parameter
COMP9417 T1, 2021 24
𝑘-means Clustering: outliers
Undesirable clusters
Desirable clusters
COMP9417 T1, 2021 25
𝑘-means Clustering: outliers
Deal with outliers:
§ Remove some data points that are much further away from the centroids than other data points
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
§ Perform random sampling: by choosing a small subset of the data points, the chance of selecting an outlier is much smaller
– Assign the rest of the data points to the clusters by distance or similarity comparison, or classification
COMP9417 T1, 2021 26
𝑘-means Clustering: initial seeds
COMP9417 T1, 2021 27
In Practice
Algorithm can get trapped in a local minimum, toy example:
§ Place four instances at the vertices of a two-dimensional rectangle
§ Local minimum: two cluster centers at the midpoints of the rectangle’s long sides
Result can vary significantly based on initial choice of seeds
Simple way to increase chance of finding a global optimum: restart with different random seeds
§ can be time-consuming
Or use the 𝑘-means++ algorithm, which chooses initial centroids that are spread out: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, as sketched below
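A small sketch of the 𝑘-means++ seeding step (Arthur & Vassilvitskii, 2007); the function name and shapes are illustrative assumptions, not from the slides.

```python
# Sketch of k-means++ seeding: each new centroid is sampled with probability
# proportional to its squared distance (D^2 weighting) from the nearest
# centroid chosen so far. X is assumed to be an (n, p) NumPy array.
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    rng = np.random.default_rng(rng)
    centres = [X[rng.integers(len(X))]]          # first centre: uniform at random
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        probs = d2 / d2.sum()                    # D^2 weighting
        centres.append(X[rng.choice(len(X), p=probs)])
    return np.array(centres)
```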
COMP9417 T1, 2021 28
Remarks
Despite weaknesses, 𝑘-means is still the most popular algorithm due to its simplicity and efficiency
No clear evidence that any other clustering algorithm performs better in general
Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
COMP9417 T1, 2021 29
Example: Image Segmentation
COMP9417 T1, 2021 30
Example: Image Segmentation
COMP9417 T1, 2021 31
Example: Bag of Words
Feature extraction
COMP9417 T1, 2021 32
Example: Bag of Words
Dictionary learning
COMP9417 T1, 2021 33
Example: Bag of Words
Dictionary learning
COMP9417 T1, 2021 34
Example: Bag of Words
Example visual words
COMP9417 T1, 2021 35
Example: Bag of Words
Image representation
COMP9417 T1, 2021 36
Example: Bag of Words
Application – image retrieval
COMP9417 T1, 2021 37
Expectation Maximization (EM)
When to use:
§ Data is only partially observable
§ Unsupervised learning, e.g., clustering (class value “unobservable”)
§ Supervised learning (some instance attributes unobservable)
Some uses:
§ Train Bayesian Belief Networks
§ Unsupervised clustering (𝑘-means, AUTOCLASS)
§ Learning Hidden Markov Models (Baum-Welch algorithm)
Here we only focus on how to use EM for 𝑘-means which is a specific case of EM for Gaussian Mixture Models.
COMP9417 T1, 2021 38
Finite Mixtures
Each instance x generated by
§ Choosing one of the 𝑘 Gaussians with uniform probability
§ Generating an instance at random according to that Gaussian
Called finite mixtures because there is only a finite number of generating distributions being represented.
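The two-step generative process just described can be simulated directly; the following sketch uses illustrative parameters (two 1-D Gaussians with a common, known standard deviation).

```python
# Sketch of the generative process above: pick one of k Gaussians uniformly at
# random, then draw the instance from that Gaussian. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, n = 2, 500
means = np.array([-2.0, 3.0])               # one mean per Gaussian (1-D example)
sigma = 1.0                                 # common (known) standard deviation

z = rng.integers(k, size=n)                 # which Gaussian generated each instance (hidden)
x = rng.normal(loc=means[z], scale=sigma)   # the observed data
```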
COMP9417 T1, 2021 39
Generate Data from Mixture of Gaussians
COMP9417 T1, 2021 40
Example: One Variable 2-means
COMP9417 T1, 2021 41
EM for Estimating 𝑘 means
Given:
§ Instances from 𝑋 generated by a mixture of 𝑘 Gaussian distributions
§ Unknown means of the 𝑘 Gaussians
§ Assuming identity covariance matrix
§ Don’t know which instance x𝑖 was generated by which Gaussian
Determine maximum likelihood estimates of:
§ the means (μ1, …, μ𝑘)
COMP9417 T1, 2021 42
EM for Estimating 𝑘-means
COMP9417 T1, 2021 43
EM for Estimating 𝑘-means
COMP9417 T1, 2021 44
EM for Estimating 𝑘-means
For 𝑘 = 2, think of the full description of each instance as 𝑦𝑖 = ⟨x𝑖, 𝑧𝑖1, 𝑧𝑖2⟩, where
§ 𝑧𝑖𝑗 is 1 if x𝑖 was generated by the 𝑗th Gaussian, otherwise zero
§ x𝑖 is observable, from the instance set x1, x2, …, x𝑚
§ 𝑧𝑖𝑗 is unobservable
COMP9417 T1, 2021 45
EM for Estimating 𝑘-means
COMP9417 T1, 2021 46
EM for Estimating 𝑘-means
COMP9417 T1, 2021 47
EM for Estimating 𝑘-means
E step: Calculate probabilities for unknown parameters for each instance
M step: Estimate parameters based on the probabilities
In 𝑘-means the probabilities are stored as instance weights.
EM produces soft assignments (probabilities) of data points into clusters.
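A minimal sketch of these E and M steps for the setting above (1-D data, known common sigma, uniform mixing weights, only the means estimated); the function name, iteration count and shapes are illustrative assumptions.

```python
# Sketch of EM for estimating k Gaussian means: E[z_ij] plays the role of the
# instance weights mentioned above. x is a 1-D NumPy array of observations.
import numpy as np

def em_k_means(x, k, sigma=1.0, n_iter=50, rng=None):
    rng = np.random.default_rng(rng)
    mu = rng.choice(x, size=k, replace=False)      # initial guesses for the means
    for _ in range(n_iter):
        # E step: responsibility E[z_ij] = P(x_i came from Gaussian j | current means)
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate each mean as the responsibility-weighted average
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu, resp
```

For example, `mu, resp = em_k_means(x, k=2)` applied to the mixture sample generated earlier recovers estimates of the two means, with `resp` giving the soft assignments.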
COMP9417 T1, 2021 48
EM Algorithm
Converges to local maximum likelihood h and provides estimates of hidden variables 𝑧𝑖𝑗
In fact, local maximum in 𝐸[ ln 𝑃 (𝑌 |h)]
§ Y is complete (observable plus unobservable variables) data
§ Expected value taken over possible values of unobserved variables in Y
COMP9417 T1, 2021 49
𝑘-means Clustering with EM
https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html.
COMP9417 T1, 2021 50
𝑘-means Clustering vs. EM
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
COMP9417 T1, 2021 51
Extending the Mixture Model
§ Using more than two distributions
§ Several attributes: easy if independence assumed
§ Correlated attributes: difficult
• Modeled jointly using a bivariate/multivariate normal distribution with a (symmetric) covariance matrix
• With n attributes this requires estimating n + n(n − 1)/2 parameters
COMP9417 T1, 2021 52
Extending the Mixture Model
§ Nominal attributes: easy if independence assumed
§ Correlated nominal attributes: difficult
• Two correlated attributes result in more parameters
§ Missing values: easy
§ Distributions other than the normal distribution can be used:
• “log-normal” if predetermined minimum is given
• “log-odds” if bounded from above and below
• Poisson for attributes that are integer counts
§ Cross-validation can be used to estimate 𝑘 – time consuming !
COMP9417 T1, 2021 53
General EM Problem
Given:
§ Observed data 𝑋 = {𝑥1, …, 𝑥𝑚}
§ Unobserved data (latent variables) 𝑍 = {𝑧1, …, 𝑧𝑚}
§ Parameterized probability distribution 𝑃(𝑌|h), where
• 𝑌 = {𝑦1, …, 𝑦𝑚} is the full data, with 𝑦𝑖 = 𝑥𝑖 ∪ 𝑧𝑖
• h are the parameters
Determine:
§ h that (locally) maximizes 𝐸 [ ln 𝑃 (𝑌 |h)]
COMP9417 T1, 2021 54
General EM Method
Define the likelihood function 𝑄(h′|h), computed over the full data 𝑌 = 𝑋 ∪ 𝑍, using the observed 𝑋 and the current parameters h to estimate 𝑍
𝑄(h′|h) ← 𝐸[ln𝑃(𝑌|h′)|h,𝑋]
COMP9417 T1, 2021 55
General EM Method
§ Estimation (E) step: Calculate 𝑄(h′ |h) using the current hypothesis h and the observed data 𝑋 to estimate the probability distribution over 𝑌 .
𝑄(h′|h) ← 𝐸[ln𝑃(𝑌|h′)|h,𝑋]
§ Maximization (M) step: Replace hypothesis h by the hypothesis h’ that maximises this Q function.
$h \leftarrow \operatorname{argmax}_{h'} Q(h'|h)$
COMP9417 T1, 2021 56
Example: WSI Analysis
Tumour classification in WSI
COMP9417 T1, 2021 57
Example: WSI Analysis
Discriminative patch-based CNN
Source: C. Zhang et al. Whole slide image classification via iterative patch labelling. ICIP, 2018.
COMP9417 T1, 2021 58
Example: WSI Analysis
Discriminative patch-based CNN
Source: C. Zhang et al. Whole slide image classification via iterative patch labelling. ICIP, 2018.
COMP9417 T1, 2021 59
Hierarchical Clustering
COMP9417 T1, 2021 60
Hierarchical Clustering
§ Bottom up: at each step join the two closest clusters (starting with single-instance clusters)
• Design decision: distance between clusters
E.g., two closest instances in clusters vs. distance between means
§ Top down: find two clusters and then proceed recursively for the two subsets
• Can be very fast
§ Both methods produce a dendrogram (tree of “clusters”)
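A small sketch of bottom-up (agglomerative) clustering using SciPy; the dataset and the choice of `method="average"` are illustrative, and the linkage method is exactly the design decision mentioned above.

```python
# Agglomerative clustering with SciPy: merge the two closest clusters at each
# step and draw the resulting dendrogram (tree of "clusters").
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average")    # "single", "complete", "average", ...
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```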
COMP9417 T1, 2021 61
Hierarchical Clustering
COMP9417 T1, 2021 62
Hierarchical Clustering
The algorithm relies on a general updating formula. With different operations and coefficients, many different versions of the algorithm can be used to give variant clusterings.
Note: dissimilarity computed for every pair of points with one point in the first cluster and the other in the second.
COMP9417 T1, 2021 63
Hierarchical Clustering
Single link
Complete link
Average link
Centroid distance
COMP9417 T1, 2021 64
Hierarchical Clustering
Represent results of hierarchical clustering with a dendrogram
See next diagram:
§ at level 1 all points in individual clusters
§ 𝑥6 and 𝑥7 are most similar and are merged at level 2
§ dendrogram drawn to scale to show similarity between grouped clusters
COMP9417 T1, 2021 65
Hierarchical Clustering
COMP9417 T1, 2021 66
Hierarchical Clustering
An alternative representation of hierarchical clustering based on sets shows hierarchy (by set inclusion), but not distance.
COMP9417 T1, 2021 67
Hierarchical Clustering
Limitation to beware of:
§ hierarchical clustering imposes a bias – the clustering forms a dendrogram despite the possible lack of any implicit hierarchical structure in the data
COMP9417 T1, 2021 68
Dendrograms
Next diagram: average-linkage hierarchical clustering of microarray data. For this dataset the class of each instance is shown at each leaf of the dendrogram, to illustrate how the clustering’s grouping of similar tissue samples coincides with the labelling of samples by cancer subtype.
Followed on next slide by diagram showing:
§ average-linkage based on average dissimilarity between groups
§ complete-linkage based on dissimilarity of the furthest pair between groups
§ single-linkage based on dissimilarity of closest pair between groups
COMP9417 T1, 2021 69
Dendrograms
COMP9417 T1, 2021 70
Dendrograms
COMP9417 T1, 2021 71
Number of Clusters
Many methods of estimating the “correct” number of clusters have been proposed, based on some clustering criterion:
§ Elbow method:
§ measure the within-cluster dispersion (total sum of squared distances from each point to its cluster centroid)
§ compute this for various choices of 𝑘
§ choose the 𝑘 beyond which increasing 𝑘 no longer improves the dispersion much (the “elbow”), as sketched below
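A sketch of the elbow computation using scikit-learn, where `inertia_` is the within-cluster sum of squared distances; `X` is assumed to be the (n, p) data matrix.

```python
# Elbow method sketch: plot within-cluster dispersion (KMeans' inertia_)
# against k and look for the point where further increases help little.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
dispersion = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
              for k in ks]

plt.plot(ks, dispersion, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster dispersion")
plt.show()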
COMP9417 T1, 2021 72
Number of Clusters
COMP9417 T1, 2021 73
Number of Clusters
Another technique – Gap statistics:
§ Cluster the observed data and compute the corresponding total within-cluster variation 𝑊𝑘.
§ Generate B reference data sets with a random uniform distribution. Cluster each of these reference data sets and compute the corresponding total within-cluster variation 𝑊𝑘𝑏.
§ Compute the estimated gap statistic as the deviation of the observed 𝑊𝑘 value from its expected value under the reference distribution: $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b}\log(W_{kb}) - \log(W_k)$. Compute also the standard deviation of the statistics.
§ Choose the number of clusters as the smallest value of 𝑘 such that the gap statistic is within one standard deviation of the gap at 𝑘 + 1:
$\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - s_{k+1}$.
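A sketch of the steps above; taking 𝑊𝑘 to be scikit-learn KMeans’ `inertia_` and drawing the uniform reference data from the bounding box of X are assumptions for illustration.

```python
# Gap statistic sketch: compare log(W_k) on the data against its average over
# B uniform reference datasets, then pick the smallest k with
# Gap(k) >= Gap(k+1) - s_{k+1}.
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, B=10, rng=None):
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)       # bounding box for reference data
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10).fit(X).inertia_)
        log_wkb = np.array([
            np.log(KMeans(n_clusters=k, n_init=10)
                   .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
            for _ in range(B)])
        gaps.append(log_wkb.mean() - log_wk)
        sks.append(log_wkb.std() * np.sqrt(1 + 1.0 / B))
    for k in range(1, k_max):                   # smallest k meeting the criterion
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k, gaps
    return k_max, gaps
```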
COMP9417 T1, 2021 74
Number of Clusters
COMP9417 T1, 2021 75
Cluster quality – Silhouette plot
Key idea: compare each object’s separation from other clusters relative to the homogeneity of its cluster.
For each object 𝑖 (e.g., x𝑖), define its silhouette width 𝑠(𝑖) ∈ [−1, 1]:
Let 𝑎(𝑖) be the average dissimilarity between x𝑖 and the elements of 𝑃𝑖, i.e., the cluster to which x𝑖 belongs.
Let 𝑑(𝑖, 𝐶) be the average dissimilarity of 𝑖 to the elements of some other cluster 𝐶.
Let 𝑏(𝑖) = min𝐶 𝑑(𝑖, 𝐶). The silhouette width (value) is

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
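In practice the silhouette values defined above are usually computed with scikit-learn; here `X` and the KMeans settings are illustrative assumptions.

```python
# Compute s(i) for every object and the average silhouette width for a clustering.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_i = silhouette_samples(X, labels)                    # s(i) for each object
print("average silhouette width:", silhouette_score(X, labels))
```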
COMP9417 T1, 2021 76
Cluster quality – Silhouette plot
https://en.wikipedia.org/wiki/Silhouette_(clustering)
COMP9417 T1, 2021 77
Cluster quality – Silhouette plot
How can we interpret a silhouette plot ?
§ say for some object 𝑖 we have 𝑏(𝑖) ≫ 𝑎(𝑖); then it will be well-separated from all the other clusters, and its cluster will probably be homogeneous
• in such cases 𝑠(𝑖) will tend to be close to 1 for object 𝑖, and we can take it to be “well-classified” by its cluster
§ conversely, if 𝑠(𝑖) is close to −1 we can view it as “misclassified” by its cluster
§ can see which clusters are “good” or otherwise, and estimate number of clusters
§ can take average 𝑠(𝑖) over different clusterings for comparison
COMP9417 T1, 2021 78
Cluster quality – Silhouette plot
COMP9417 T1, 2021 79
Cluster quality – Silhouette plot
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
COMP9417 T1, 2021 80
Cluster quality – Silhouette plot
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
COMP9417 T1, 2021 81
Cluster quality – Silhouette plot
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
COMP9417 T1, 2021 82
Cluster quality – Silhouette plot
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765
COMP9417 T1, 2021 83
Cluster quality – external evaluation
Use another set of data with known labels for evaluation
§ Perform clustering
§ Evaluate classification performance from the clustering outputs
• Rand index
• F-measure
• Jaccard index
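A minimal external-evaluation sketch using scikit-learn’s chance-corrected Rand index; the label vectors here are illustrative.

```python
# External evaluation: compare cluster assignments against known labels.
# The adjusted Rand index is invariant to how the clusters are numbered.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]     # known classes (illustrative)
cluster_ids = [1, 1, 0, 0, 2, 2]     # clustering output (label names don't matter)
print(adjusted_rand_score(true_labels, cluster_ids))   # 1.0: identical partitions
```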
COMP9417 T1, 2021 84
Clustering summary
§ Many techniques available – there may not be a single “magic bullet”; rather, different techniques are useful for different aspects of the data
§ Hierarchical clustering gives a view of the complete structure found, without restricting the number of clusters, but can be computationally expensive
§ Different linkage methods can produce very different dendrograms
§ Problem may not have a “real” hierarchical structure
COMP9417 T1, 2021 85
Clustering summary
§ 𝑘-means avoids some of these problems, but also has drawbacks
§ The value for 𝑘 needs to be set beforehand
§ Cannot extract “intermediate features” e.g., a subset of features in which a subset of objects is co-expressed
§ For all of these methods, can cluster objects or features, but not both together (coupled two-way clustering)
§ Should all the points be clustered ? Algorithms can be modified to allow points to be discarded
§ Visualization is important: dendrograms and alternatives like Self-Organizing Maps (SOMs) are good but further improvements would help
COMP9417 T1, 2021 86
Clustering summary
§ How can the quality of clustering be estimated ?
• if clusters known, measure proportion of disagreements to
agreements
• if unknown, measure homogeneity (e.g., average similarity
between feature vectors in a cluster and the centroid) and separation (e.g., weighted average distances between cluster centroids) with aim of increasing homogeneity and separation
• silhouette method, etc.
§ Clustering is only the first step – mainly exploratory; classification, modelling, hypothesis formation, etc.
COMP9417 T1, 2021 87
Dimensionality Reduction
COMP9417 T1, 2021 88
Dimensionality Reduction
What is this and why would we need it ?
§ each numeric feature in a dataset is a dimension
§ in general, no restrictions on the number of dimensions
§ however, many features could be related
§ do we need them all in our dataset ?
• including them all is unlikely to improve models
• curse of dimensionality
• feature selection may return arbitrary features
§ so, what to do ?
§ one solution would be to find a set of new features
• should be fewer than the original set
• should preserve information in original set (as much as possible)
COMP9417 T1, 2021 89
Principal Component Analysis (PCA)
Key idea: look for features in a transformed space so that each dimension in the new space captures the most variation in the original data when it is projected onto that dimension.
Any new features should be highly correlated with (some of) the original features, but not with any of the other new features.
This suggests an approach: consider using the variance-covariance matrix we recall from correlation and regression
COMP9417 T1, 2021 90
PCA Example
PCA looks for linear combinations of the original features. This dataset of 200 points seems to show such a relationship between two feature dimensions.
COMP9417 T1, 2021 91
PCA Example
PCA finds two new features on which the original data can be projected, rotated and scaled. These explain respectively 0.75 and 0.02 of the variance.
COMP9417 T1, 2021 92
PCA Algorithm
This algorithm can be presented in several ways. Here are the basic steps in terms of the variance reduction idea:
1) take the data as an 𝑛 × 𝑚 matrix 𝑿
2) “centre” the data by subtracting the mean of each column
3) construct the covariance matrix 𝑪 from the centred matrix
4) compute the eigenvector matrix 𝑉 (rotation) and eigenvalue matrix 𝑆 (scaling) such that $V^{-1} C V = S$, where 𝑆 is a diagonal m × m matrix
5) sort the columns of 𝑆 in decreasing order (decreasing variance), and their corresponding eigenvectors
6) remove the columns of 𝑆 and 𝑉 whose eigenvalues are below some minimum threshold
7) transform the original data using the remaining eigenvectors
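A NumPy sketch of steps 1–7; the eigenvalue threshold is an illustrative assumption.

```python
# PCA via the covariance matrix: centre, decompose, sort by decreasing
# eigenvalue, drop small components, then project the data.
import numpy as np

def pca(X, min_eigenvalue=1e-2):
    Xc = X - X.mean(axis=0)                 # 2) centre each column
    C = np.cov(Xc, rowvar=False)            # 3) covariance matrix (m x m)
    S, V = np.linalg.eigh(C)                # 4) eigenvalues S, eigenvectors V (columns)
    order = np.argsort(S)[::-1]             # 5) sort by decreasing variance
    S, V = S[order], V[:, order]
    keep = S > min_eigenvalue               # 6) drop components below the threshold
    return Xc @ V[:, keep], V[:, keep], S   # 7) transformed data, components, variances
```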
COMP9417 T1, 2021 93
PCA algorithm
COMP9417 T1, 2021 94
PCA Example
By rejecting the second component we reduce the dimensionality by 50% while preserving much of the original variance, seen by plotting the inverse transform of this component along with the original data.
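The same reduction can be done with scikit-learn, mirroring the example described above (cf. VanderPlas); `X` is assumed to be the 200-point, 2-D dataset from the example.

```python
# Keep only the first principal component, then map it back to the original
# 2-D space to visualise what information was preserved.
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Z = pca.fit_transform(X)                 # data in the reduced (1-D) space
X_approx = pca.inverse_transform(Z)      # back in the original 2-D space
print(pca.explained_variance_ratio_)     # fraction of variance retained
```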
COMP9417 T1, 2021 95
PCA example: Eigenfaces
§ Application – face recognition
§ Assume that most face images lie on a low-dimensional subspace
determined by the first 𝑘 (𝑘 << 𝑑) directions of maximum variance
§ Use PCA to determine the vectors or “eigenfaces” that span that
subspace
§ Represent all face images in the dataset as linear combinations of eigenfaces
COMP9417 T1, 2021 96
PCA example: Eigenfaces
COMP9417 T1, 2021 97
PCA example: Eigenfaces
COMP9417 T1, 2021 98
PCA example: Eigenfaces
COMP9417 T1, 2021 99
PCA example: Eigenfaces
COMP9417 T1, 2021 100
PCA example: Eigenfaces
COMP9417 T1, 2021 101
PCA and friends
§ PCA complexity is cubic in number of original features
§ this is not feasible for high-dimensional datasets
§ alternatively, approximate the sort of projection found by PCA
§ for example, can use Random Projections
§ more scalable, but what about quality of components ?
§ can be shown to preserve distance relations from the original data
§ many other methods use essentially the same matrix decomposition idea, such as finding “topics” in text using Latent Semantic Analysis, finding hidden “sub-groups” for recommender systems, and so on
COMP9417 T1, 2021 102
Autoencoders
A neural network model – the encoder transforms the data in a way the decoder can then interpret and reconstruct with minimum error.
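A minimal autoencoder sketch in PyTorch; the layer sizes, optimiser settings and `data_loader` (batches of flattened inputs) are illustrative assumptions.

```python
# Encoder maps inputs to a small "bottleneck" code; decoder reconstructs the
# input; training minimises the reconstruction (MSE) error.
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
autoencoder = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction error

for x in data_loader:                        # data_loader: assumed batches of flattened inputs
    x_hat = autoencoder(x)
    loss = loss_fn(x_hat, x)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```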
COMP9417 T1, 2021 103
Autoencoders
COMP9417 T1, 2021 104
Dimensionality reduction summary
§ PCA will transform original features to new space
§ Every new feature is a linear combination of original features
§ Aim for new dimensions to maximise variance
§ Order by decreasing variance and remove those below a threshold
§ Algorithm applies matrix operations to translate, rotate and scale
§ Based on covariance matrix
• can be “kernelised” (KernelPCA)
• new feature space with non-linear axes
§ Many alternatives, e.g., Random Projections, Independent Component Analysis, Multi-dimensional Scaling, Word2Vec, etc.
§ Autoencoders and variants – sparse autoencoders, variational autoencoders, etc
COMP9417 T1, 2021 105
Semi-supervised Learning
§ Learn initial classifier using labelled set
§ Apply classifier to unlabelled set
§ Learn new classifier from now-labelled data
§ The most confident predictions of each classifier on the unlabeled data are retained
§ Repeat until convergence
§ Useful when there are not sufficient training data with ground truth labels
§ Can be generally applied with any supervised learning techniques
COMP9417 T1, 2021 106
Self-training algorithm
Given: labelled data (𝑥, 𝑦) and unlabelled data (𝑥)
Repeat:
    Train classifier h from the labelled data using supervised learning
    Label the unlabelled data using classifier h
Assumes: classifications by h will tend to be correct (especially high-probability ones)
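A sketch of this self-training loop; the base classifier (logistic regression), the confidence threshold and the NumPy-array inputs are illustrative assumptions.

```python
# Self-training: fit on labelled data, label the unlabelled data, keep only
# high-confidence predictions, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        h = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = h.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold          # keep confident predictions
        if not confident.any():
            break
        y_new = h.classes_[proba[confident].argmax(axis=1)] # pseudo-labels
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = X_unlab[~confident]
    return h
```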
COMP9417 T1, 2021 107
Co-training
Blum & Mitchell (1998)
Key idea: two views of an instance, f1 and f2
§ assume f1 and f2 independent and compatible
§ if we have a good attribute set, leverage similarity between attribute values in each view, assuming they predict the class, to classify the unlabelled data
COMP9417 T1, 2021 108
Co-training
Multi-view learning
§ Given two (or more) perspectives on data, e.g., different attribute sets
§ Train separate models for each perspective on a small set of labelled data
§ Use the models to label a subset of the unlabelled data
§ Repeat until no more unlabelled examples
COMP9417 T1, 2021 109
Summary
§ Clustering is a typical unsupervised learning method
§ k-means clustering is one of the most well-known clustering techniques
§ EM algorithm can be used to estimate k-means
§ cannot extract “intermediate features” e.g., a subset of features in which a
subset of objects is co-expressed
§ hierarchical clustering gives a view of the complete structure found, without restricting the no. of clusters, but can be computationally expensive
§ different linkage methods can produce very different dendrograms
§ higher nodes can be very heterogeneous
§ problem may not have a “real” hierarchical structure
§ clustering is only the first step - mainly exploratory; classification, modelling, hypothesis formation, etc.
§ many techniques available – may not be single “magic bullet” rather different techniques useful for different aspects of data
COMP9417 T1, 2021 110
Summary
Unsupervised and supervised learning are at different ends of a continuum of “degrees of supervision”.
Between these extremes many other approaches are possible:
§ Semi-supervised learning, e.g.,
• train with a small labelled sample, then improve with a large unlabelled sample
• train with a large unlabelled sample, then learn classes with a small labelled sample
§ Active learning
• learning system acts to generate its own examples
Note: unsupervised learning is an increasingly active research area, particularly in neural nets, e.g., Yann LeCun: ”How Could Machines Learn as Efficiently as Animals and Humans?” http://www.youtube.com/watch?v=uYwH4TSdVYs
COMP9417 T1, 2021 111
Acknowledgement
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. P. Murphy. MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from figures for the book
“Python Data Science Handbook” by J. VanderPlas O’Reilly Media (2017) http://shop.oreilly.com/product/0636920034919.do
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
http://www.inf.ed.ac.uk/teaching/courses/iaml/2011/slides/pca.pdf
http://vision.stanford.edu/teaching/cs131_fall1415/lectures/lecture17_face_recognition_cs131.pdf
http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture14.pdf
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
http://vision.stanford.edu/teaching/cs131_fall1718/files/14_BoW_bayes.pdf
https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/lec15_16_handout.pdf
COMP9417 T1, 2021 112