代写代考 COMP9417 Machine Learning & Data Mining

Unsupervised Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael

Copyright By PowCoder代写 加微信 powcoder

This lecture will develop your understanding of unsupervised learning methods. Following it, you should be able to:
• describe the problem of unsupervised learning
• describe k-means clustering
• describe Gaussian Mixture Models (GMM)
• Outline Expectation Maximization (EM) algorithm for k-means which is a specific case of EM
• describe hierarchical clustering
• describe evaluation methods for clustering
• describe the problem of dimensionality reduction
• outline the method of Principal Component Analysis
• outline semi-supervised learning
COMP9417 T1, 2022 1

Supervised vs. Unsupervised Learning
Supervised learning — classes are known and need a “definition”, in terms of the data. Methods are known as: classification, discriminant analysis, class prediction, supervised pattern recognition.
Unsupervised learning — classes are initially unknown and need to be “discovered” with their definitions from the data. Methods are known as: cluster analysis, class discovery, unsupervised pattern recognition.
So: unsupervised learning methods, such as clustering, address the problem of assigning instances to classes given only observations about the instances, i.e., without being given class “labels” for instances by a “teacher”.
COMP9417 T1, 2022 2

Unsupervised Learning
Why do we need unsupervised learning ?
§ most of the world’s data is unlabelled
§ getting a human to label data is often
• difficult (what are the classes?)
• time-consuming (labelling requires thinking)
• expensive (see above)
• error-prone (mistakes, ambiguity)
§ in principle, can use any feature as the “label”
§ unfortunately, often the class is not a known feature
COMP9417 T1, 2022 3

Unsupervised Learning
What is unsupervised learning good for ?
§ simplifying a problem, e.g., by dimensionality reduction
§ exploratory data analysis, e.g., with visualization
§ data transformation to simplify a classification problem
§ to group data instances into subsets
§ to discover structure, like hierarchies of subconcepts
§ to learn new “features” for later use in classification
§ to track “concept drift” over time
§ to learn generative models for images, text, video, speech, etc.
COMP9417 T1, 2022 4

Clustering
Finding groups of items that are similar Clustering is unsupervised
§ the class of any data instance is not known
Success of clustering often measured subjectively
§ OK for exploratory data analysis (EDA) . . .
§ but problematic if you need quantitive results . . . § some visual and statistical approaches
A dataset for clustering is just like a dataset for classification, but without the class labels
COMP9417 T1, 2022 5

Simple 2D representations of clustering
Clusters form a partition Venn diagram (overlapping clusters)
COMP9417 T1, 2022 6

Other representations of clustering
Probabilistic assignment Dendrogram
COMP9417 T1, 2022 7

Cluster Analysis
Clustering algorithms form two broad categories:
§ partitioning methods § hierarchical methods
• Partitioning methods usually require specification of the number of clusters, then try to construct the clusters and fit objects to them.
• Hierarchical algorithms are either agglomerative i.e., bottom-up or divisive i.e., top-down.
• In practice, hierarchical agglomerative methods often used – efficient exact algorithms available, but more importantly to users the dendrogram, or tree, can be visualized.
COMP9417 T1, 2022 8

Representation
Let X = {x1, . . . , xn} be a set of instances.
Let 𝐶 = (𝐶1, . . . , 𝐶𝑙) be a partition of n elements into 𝑙 subsets.
Each subset is called a cluster, and 𝐶 is called a clustering. Input data can have two forms:
§ eachelementisassociatedwithareal-valuedvectorof𝑝featurese.g. measurement levels for different features
§ pairwisesimilaritydatabetweenelements,e.g.,correlation,distance (dissimilarity)
Feature-vectors have more information, but similarity is generic (given the appropriate function). Feature-vector matrix: 𝑛 × 𝑝, similarity matrix
𝑛 × 𝑛 .Ingeneral,often𝑛 ≫ 𝑝.
COMP9417 T1, 2022 9

Clustering Framework
§ Goal of clustering: find a partition of n elements (instances) into homogeneous and well-separated clusters
§ Elements from same cluster should have high similarity, i.e, form a homogeneous cluster, while elements from different clusters should have low similarity, i.e., be well-separated
§ Note: homogeneity and separation need to be defined
§ In practice, use a distance measure appropriate to the problem
§ Also note: typically, there are interactions between homogeneity and separation – usually, high homogeneity is linked with low separation, and vice versa, unless there is clear structure in the data
COMP9417 T1, 2022 10

A bad clustering
This clustering violates both homogeneity and separation principles
COMP9417 T1, 2022 11

A good clustering
This clustering satisfies both homogeneity and separation principles
COMP9417 T1, 2022 12

𝑘-means Clustering
Set value for 𝑘, the number of clusters (by prior knowledge or via search)
Initialise: choose points for centres (means) of 𝑘 clusters (at random)
Procedure:
1) assign each instance 𝑥 to the closest of the 𝑘 points to form 𝑘 clusters
2) re-assign the 𝑘 points to be the means of each of the 𝑘 clusters
3) repeat 1 and 2 until convergence to a reasonably stable clustering
COMP9417 T1, 2022 13

𝑘-means Clustering
𝑃! is the cluster assigned to element i (or x!), 𝑐” is the centroid of cluster
j, 𝑑 (𝑣1, 𝑣2) is the Euclidean distance between feature vectors 𝑣1 and 𝑣2. The goal is to find a partition P for which the error (distance) function is
𝐸# =6𝑑(x!,𝑐’7) !$%
Centroid is the mean or weighted average of the points in the cluster.
• 𝑘-means minimizes the within-cluster sum of squares.
• 𝑘-means is an important clustering method, widely-used in many different areas, that can be viewed in terms of the EM (Expectation- Maximization) algorithm.
COMP9417 T1, 2022 14

𝑘-means Clustering
Here is another view of k-means clustering algorithm:
COMP9417 T1, 2022 15

𝑘-means Clustering
COMP9417 T1, 2022 16

𝑘-means Clustering
Previous diagram shows three steps to convergence in 𝑘-means with 𝑘 = 3
§ means move to minimize squared-error criterion
§ approximate method of obtaining maximum-likelihood estimates for
§ each point assumed to be in exactly one cluster
§ if clusters “blend”, fuzzy 𝑘-means (i.e., overlapping clusters)
COMP9417 T1, 2022 17

𝑘-means Clustering: initialisation
COMP9417 T1, 2022 18

𝑘-means Clustering: assign to centroids
COMP9417 T1, 2022 19

𝑘-means Clustering: recompute centroids
COMP9417 T1, 2022 20

𝑘-means Clustering: solution found Shown on the 3 previous slides are the initialization and the two main steps
of the 𝑘-means algorithm on the given dataset.
In this simple example 𝑘-means clustering has found a solution (the two centroids) after a single iteration, and the algorithm will not change it on further iterations.
By inspection, we can see the solution is a “good clustering”, in the sense that the two “natural” clusters in the dataset have been identified.
In general, the quality of the solution will depend on § the distribution of the points in the dataset
§ the choice of 𝑘
§ outliers
§ the choice of the location to initialise the centroids.
COMP9417 T1, 2022 21

𝑘-means Clustering: parameter What about the number of clusters 𝑘 ?
Next diagrams show convergence in 𝑘-means clustering with 𝑘 = 3 for data with two clusters not well separated.
COMP9417 T1, 2022 22

𝑘-means Clustering: parameter
COMP9417 T1, 2022 23

𝑘-means Clustering: parameter
COMP9417 T1, 2022 24

𝑘-means Clustering: outliers
Undesirable clusters
Desirable clusters
COMP9417 T1, 2022 25

𝑘-means Clustering: outliers Deal with outliers:
§ Remove some data points that are much further away from the centroids than other data points
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
§ Perform random sampling: by choosing a small subset of the data points, the chance of selecting an outlier is much smaller
– Assign the rest of the data points to the clusters by distance or similarity comparison, or classification
COMP9417 T1, 2022 26

𝑘-means Clustering: initial seeds
COMP9417 T1, 2022 27

In Practice
Algorithm can get trapped in a local minimum, toy example:
§ Place four instances at the vertices of a two-dimensional rectangle § Local minimum: two cluster centers at the midpoints of the
rectangle’s long sides
Result can vary significantly based on initial choice of seeds
Simple way to increase chance of finding a global optimum: restart with different random seeds
§ can be time-consuming
Or use the 𝑘-means++ algorithm, which initialises 𝑘 centroids to be
maximally distant from each other
COMP9417 T1, 2022 28

Despite weaknesses, 𝑘-means is still the most popular algorithm due to its simplicity and efficiency
No clear evidence that any other clustering algorithm performs better in general
Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
COMP9417 T1, 2022 29

Example: Image Segmentation
COMP9417 T1, 2022 30

Example: Image Segmentation
COMP9417 T1, 2022 31

Example: Bag of Words
Feature extraction
COMP9417 T1, 2022 32

Example: Bag of Words
Dictionary learning
COMP9417 T1, 2022 33

Example: Bag of Words
Dictionary learning
COMP9417 T1, 2022 34

Example: Bag of Words
Example visual words
COMP9417 T1, 2022 35

Example: Bag of Words
Image representation
COMP9417 T1, 2022 36

Example: Bag of Words
Application – image retrieval
COMP9417 T1, 2022 37

Expectation Maximization (EM)
When to use:
§ Data is only partially observable
§ Unsupervised learning, e.g., clustering (class value “unobservable”) § Supervised learning (some instance attributes unobservable)
Some uses:
§ Train Bayesian Belief Networks
§ Unsupervised clustering (𝑘-means, AUTOCLASS)
§ Learning Hidden Markov Models (Baum-Welch algorithm)
Here we only focus on how to use EM for 𝑘-means which is a specific case of EM for Gaussian Mixture Models.
COMP9417 T1, 2022 38

Finite Mixtures
Each instance x generated by
§ Choosing one of the 𝑘 Gaussians with uniform probability
§ Generating an instance at random according to that Gaussian
Called finite mixtures because there is only a finite number of generating distributions being represented.
COMP9417 T1, 2022 39

Generate Data from Mixture of Gaussians
COMP9417 T1, 2022 40

Example: One Variable 2-means
COMP9417 T1, 2022 41

EM for Estimating 𝑘 means
§ Instances from 𝑋 generated by mixture of 𝑘 Gaussian distributions § Unknown means of the 𝑘 Gaussians
§ Assuming identity covariance matrix
§ Don’t know which instance x𝑖 was generated by which Gaussian
Determine Maximum likelihood estimates of: § means (μ1,…,μ𝑘)
COMP9417 T1, 2022 42

EM for Estimating 𝑘-means
COMP9417 T1, 2022 43

EM for Estimating 𝑘-means
COMP9417 T1, 2022 44

EM for Estimating 𝑘-means
For 𝑘 = 2, think of the full description of each instance as 𝑦𝑖 =
§𝑧𝑖𝑗 is 1 if x𝑖 generated by 𝑗th Gaussian, otherwise zero §x𝑖 isobservable,frominstancesetx1,x2,…,x𝑚
§𝑧𝑖𝑗 is unobservable
x𝑖, 𝑧𝑖1, 𝑧𝑖2 ,
COMP9417 T1, 2022 45

EM for Estimating 𝑘-means
COMP9417 T1, 2022 46

EM for Estimating 𝑘-means
COMP9417 T1, 2022 47

EM for Estimating 𝑘-means
E step: Calculate probabilities for unknown parameters for each instance
M step: Estimate parameters based on the probabilities
In 𝑘-means the probabilities are stored as instance weights.
EM produces soft assignments (probabilities) of data points into clusters.
COMP9417 T1, 2022 48

EM Algorithm
Converges to local maximum likelihood h and provides estimates of hidden variables 𝑧𝑖𝑗
In fact, local maximum in 𝐸[ ln 𝑃 (𝑌 |h)]
§ Y is complete (observable plus unobservable variables) data
§ Expected value taken over possible values of unobserved variables in Y
COMP9417 T1, 2022 49

𝑘-means Clustering with EM
https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html.
COMP9417 T1, 2022 50

𝑘-means Clustering vs. EM
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
COMP9417 T1, 2022 51

Extending the Mixture Model
§ Using more than two distributions
§ Several attributes: easy if independence assumed
§ Correlated attributes: difficult
• Modeled jointly using a bivariate/multivariate normal
distribution with a (symmetric) covariance matrix
• With n attributes this requires estimating n + n(n -1)/2
parameters
COMP9417 T1, 2022 52

Extending the Mixture Model
§ Nominal attributes: easy if independence assumed
§ Correlated nominal attributes: difficult
• Two correlated attributes result in more parameters
§ Missing values: easy
§ Distributions other than the normal distribution can be used:
• “log-normal” if predetermined minimum is given
• “log-odds” if bounded from above and below
• Poisson for attributes that are integer counts
§ Cross-validation can be used to estimate 𝑘 – time consuming !
COMP9417 T1, 2022 53

General EM Problem
§Observeddata𝑋 = {𝑥1,…,𝑥𝑚} §Unobserveddata(latentvariables)𝑍 = {𝑧1,…,𝑧𝑚}
§ Parameterized probability distribution 𝑃 (𝑌 |h ), where
• 𝑌 = {𝑦1,…,𝑦𝑚}isthefulldata𝑦𝑖 = 𝑥𝑖 ∪𝑧𝑖 • h are the parameters
Determine:
§ h that (locally) maximizes 𝐸 [ ln 𝑃 (𝑌 |h)]
COMP9417 T1, 2022 54

General EM Method
Define likelihood function 𝑄(h′|h) which calculates 𝑌 = 𝑋 ∪ 𝑍 using observed X and current parameters h to estimate 𝑍
𝑄(h′|h) ← 𝐸[ln𝑃(𝑌|h′)|h,𝑋]
COMP9417 T1, 2022 55

General EM Method
§ Estimation (E) step: Calculate 𝑄(h′ |h) using the current hypothesis h and the observed data 𝑋 to estimate the probability distribution over 𝑌 .
𝑄(h′|h) ← 𝐸[ln𝑃(𝑌|h′)|h,𝑋]
§ Maximization (M) step: Replace hypothesis h by the hypothesis h’ that maximises this Q function.
h ← argmax𝑄(h′|h)
COMP9417 T1, 2022 56

Example: WSI Analysis
Tumour classification in WSI
COMP9417 T1, 2022 57

Example: WSI Analysis
Discriminative patch-based CNN
Source: C. Zhang et al. Whole slide image classification via iterative patch labelling. ICIP, 2018.
COMP9417 T1, 2022 58

Example: WSI Analysis
Discriminative patch-based CNN
Source: C. Zhang et al. Whole slide image classification via iterative patch labelling. ICIP, 2018.
COMP9417 T1, 2022 59

Hierarchical Clustering
COMP9417 T1, 2022 60

Hierarchical Clustering
§ Bottom up: at each step join the two closest clusters (starting with single-instance clusters)
• Design decision: distance between clusters
E.g., two closest instances in clusters vs. distance between means
§ Top down: find two clusters and then proceed recursively for the two subsets
• Can be very fast
§ Both methods produce a dendrogram (tree of “clusters”)
COMP9417 T1, 2022 61

Hierarchical Clustering
COMP9417 T1, 2022 62

Hierarchical Clustering
The algorithm relies on a general updating formula. With different operations andcoefficients,manydifferentversionsofthealgorithmcan beusedtogive variant clusterings.
Note: dissimilarity computed for every pair of points with one point in the first cluster and the other in the second.
COMP9417 T1, 2022 63

Hierarchical Clustering
Single link
Complete link
Average link
Centroid distance 64
COMP9417 T1, 2022

Hierarchical Clustering
Represent results of hierarchical clustering with a dendrogram See next diagram
§ at level 1 all points in individual clusters
§ 𝑥6 and 𝑥7 are most similar and are merged at level 2
§ dendrogram drawn to scale to show similarity between grouped clusters
COMP9417 T1, 2022 65

Hierarchical Clustering
COMP9417 T1, 2022 66

Hierarchical Clustering
An alternative representation of hierarchical clustering based on sets shows hierarchy (by set inclusion), but not distance.
COMP9417 T1, 2022 67

Hierarchical Clustering
Limitation to beware of:
§ hierarchical clustering imposes a bias – the clustering forms a dendrogram despite the possible lack of a implicit hierarchical structuring in the data
COMP9417 T1, 2022 68

Dendrograms
Next diagram: average-linkage hierarchical clustering of microarray data. For this dataset the class of each instance is shown in each leaf of dendrogram to illustrate how clustering has grouped similar tissue samples coincides with the labelling of samples by cancer subtype.
Followed on next slide by diagram showing:
§ average-linkage based on average dissimilarity between groups § complete-linkage based on dissimilarity of furthest pair between
§ single-linkage based on dissimilarity of closest pair between groups
COMP9417 T1, 2022 69

Dendrograms
COMP9417 T1, 2022 70

Dendrograms
COMP9417 T1, 2022 71

Number of Clusters
Many methods of estimating the “correct” number of clusters have been proposed, based on some clustering criterion:
§ Elbow method:
§ measure the within-cluster dispersion (total sum of squared distances
from each point to the cluster centroid)
§ compute this for various 𝑘 choices
§ choose the k that doesn’t improve the dispersion much
COMP9417 T1, 2022 72

Number of Clusters
COMP9417 T1, 2022 73

Number of Clusters
Another technique – Gap statistics:
§ Cluster the observed data and compute the corresponding total within-cluster
variation 𝑊𝑘.
§ Generate B reference data sets with a random uniform distribution. Cluster each of these reference data sets and compute the corresponding total within- cluster variation 𝑊𝑘𝑏.
§ Compute the estimated gap statistic as the deviation of the observed 𝑊𝑘 value from its expected value 𝑊𝑘𝑏 ∶ 𝐺𝑎𝑝(𝑘) = ∑𝑏𝑙𝑜𝑔( 𝑊45) − log(𝑊4). Compute also the standard deviation of the statistics.
§ Choose the number of clusters as the smallest value of 𝑘 such that the gap statistic is within one standard deviation of the gap at 𝑘 + 1:
𝐺𝑎𝑝(𝑘) ≥ 𝐺𝑎𝑝(𝑘 + 1) − 𝑠46%.
COMP9417 T1, 2022 74

Number of Clusters
COMP9417 T1, 2022 75

Cluster quality – Silhouette plot
Key idea: compare each object’s separation from other clusters relative to the homogeneity of its cluster.
For each object 𝑖 (e.g x!), define its silhouette width 𝑠(𝑖) ∈ [−1, 1]:
Let 𝑎(𝑖) be the average dissimilarity between x! and elements of 𝑃!, i.e.,
cluster to which x! belongs,
Let 𝑑(𝑖, 𝐶 ) be the average dissimilarity of 𝑖 to elements of some other
cluster 𝐶.
Let 𝑏(𝑖) = 𝑚𝑖𝑛𝐶 𝑑(𝑖, 𝐶). The silhouette width (value) is
𝑠(𝑖) = c(d)ef(d) ghi(f(d),c(d))
COMP9417 T1, 2022 76

Cluster quality – Silhouette plot
https://en.wikipedia.org/wiki/Silhouette_(clustering)
COMP9417 T1, 2022 77

Cluster quality – Silhouette plot
How can we intepret a Silhouette plot ?
§ say for some object i we have 𝑏(𝑖) ≫ 𝑎 𝑖 then it will be well-separated from all the other clusters, and its cluster will probably be homogeneous
• in such cases 𝑠(𝑖) will tend to be close to 1 for object i, and we can take it to be “well-classified” by its cluster
§ conversely, if 𝑠(𝑖) is close to −1 we can view it as “misclassified” by its cluster
§ can see which clusters are “good” or otherwise, and estimate number of clusters
§ can take average 𝑠(𝑖) over different clusterings for comparison
COMP9417 T1, 2022 78

Cluster quality – Silhouette plot
COMP9417 T1, 2022 79

Cluster quality – Silhouette plot
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
COMP9417 T1, 2022 80

Cluster quality – Silhouette plot
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
COMP9417 T1, 2022 81

Cluster quality – Silhouette plot
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
COMP9417 T1, 2022 82

Cluster quality – Silhouette plot
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765
COMP9417 T1, 2022 83

Cluster quality – external evaluation
Use another set of data with known labels for evaluation
§ Perform clustering
§ Evaluation classification performance from the clustering outputs
• Rand index
• F-measure
• Jaccard index
COMP9417 T1

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com