
IN2064 Machine Learning Lecture 12
Lecture 12: Clustering
Prof. Dr. Günnemann
Data Analytics and Machine Learning Technical University of Munich


Winter term 2020/2021

Unsupervised learning
• Given some unlabeled data {xi} we want to discover latent structure in it.
Last week: Dimensionality reduction
• Given high-dimensional data xi ∈ RD, find a continuous representation zi ∈ RK with K ≪ D, such that the structure is preserved.
This week: Clustering
• Group objects xi into K groups (clusters) based on their similarity.

Clustering: Examples
• Image segmentation
• User profiling
• Gene expression analysis
• Data compression
• Visualization
Sources: https://blogs.sas.com/content/subconsciousmusings/2016/05/26/data-mining-clustering/, https://en.wikipedia.org/wiki/Transcriptomics_technologies#/media/File:Transcriptomes_heatmap_example.svg, and C. Bishop – ”Pattern Recognition and Machine Learning”

Problem definition
Given
• Collection of data points X = {x1, …, xN} ⊂ X.
Goal
• For each object xi find the corresponding assignment
zi ∈ {1,…,K} to one of the K groups (clusters), such that:
– objects within the same group are similar to each other;
– objects between different groups are dissimilar from each other;

K-means Algorithm

Distance-based clustering
• If we want to group similar objects together, we need some notion of similarity to begin with.
• Typically defined as a similarity function s(·, ·) : X × X → R.
• Alternatively: minimize dissimilarity, usually provided by a distance
function d(·,·).
• Typical distance functions include
– Manhattan distance: d(xi, xj) = ∑_d |xid − xjd| = ||xi − xj||_1
– Euclidean distance: d(xi, xj) = √(∑_d (xid − xjd)²) = ||xi − xj||_2
– Mahalanobis distance: d(xi, xj) = √((xi − xj)^T Σ⁻¹ (xi − xj))
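To make these concrete, here is a minimal NumPy sketch of the three distance functions; the function names and example values are illustrative only, not part of the lecture material.

```python
import numpy as np

def manhattan(xi, xj):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    # L2 distance: square root of the summed squared differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def mahalanobis(xi, xj, Sigma):
    # Distance weighted by the inverse covariance matrix Sigma
    diff = xi - xj
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(manhattan(xi, xj), euclidean(xi, xj), mahalanobis(xi, xj, Sigma))
```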

K -means Model
• Consider the standard RD space with Euclidean distance as metric.
• Each of K clusters is defined by its centroid μk ∈ RD .
• Cluster indicator zi ∈ {0,1}K with zik = 1 ⇐⇒ xi belongs to cluster k (i.e. one-hot encoding).
• The K-means objective (also called distortion measure) is defined as
J(X, Z, μ) = ∑_{i=1}^N ∑_{k=1}^K zik ||xi − μk||²
• We need to find the "best" centroids μ and cluster assignments Z
Z*, μ* = argmin_{Z,μ} J(X, Z, μ)
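As a small illustration, a NumPy sketch of evaluating the distortion J for given one-hot assignments and centroids; the function name and array shapes are assumptions made for this example:

```python
import numpy as np

def distortion(X, Z, mu):
    """K-means objective J(X, Z, mu): X is (N, D), Z is (N, K) one-hot, mu is (K, D)."""
    # Squared distance of every point to every centroid, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Only the distance to the assigned centroid contributes to J
    return float((Z * sq_dists).sum())
```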

Minimizing the K-means objective
• The joint optimization problem is difficult to solve (discrete optimization; even if Z is assumed to be continuous it is non-convex)
min_{Z,μ} J(X, Z, μ) = min_{Z,μ} ∑_{i=1}^N ∑_{k=1}^K zik ||xi − μk||²
• However, the two subproblems
min_Z J(X, Z, μ) and min_μ J(X, Z, μ)
have simple closed-form solutions!
• Idea: alternating optimization – update Z and μ in turns!
• This method for solving K-means is called Lloyd’s algorithm.

Lloyd’s algorithm for K-means
1. Initialize the centroids μ = {μ1 , …, μK }.
2. Update cluster indicators (solve min_Z J(X, Z, μ))
zik = 1 if k = argmin_j ||xi − μj||², and zik = 0 otherwise.
3. Update centroids (solve min_μ J(X, Z, μ))
μk = (1/Nk) ∑_{i=1}^N zik xi, where Nk = ∑_{i=1}^N zik.
4. If objective J (X , Z , μ) has not converged, return to step 2.
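A minimal NumPy sketch of Lloyd's algorithm following these four steps; the random initialization and the simple convergence check are assumptions made for this illustration:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z = sq_dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        # Step 4: stop once the centroids (and hence J) no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return z, mu
```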

K-means in action
An interactive demo is available at
• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
Source: Bishop, Figure 9.1

Initialization of the centroids
We saw in the interactive demo that the initial locations of the centroids μk greatly affect the performance.
So, how do we initialize the centroids in a good way?
K-means++ algorithm
1. Choose the first centroid μ1 uniformly at random among the data points.
2. For each point xi compute the distance Di2 = ||xi − μ1||2.
3. Sample the next centroid μk from {xi} with probability proportional to Di2.
4. Recompute the distances Di2 = min{||x i − μ1 ||2 , …, ||x i − μk ||2 }.
5. Continue steps 3 and 4 until K initial centroids have been chosen.
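A possible NumPy sketch of this seeding procedure (the function name and interface are hypothetical):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: sample new centroids proportional to squared distance D_i^2."""
    rng = np.random.default_rng(seed)
    # Step 1: first centroid uniformly at random among the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Steps 2/4: squared distance of every point to its nearest chosen centroid
        sq_dists = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Step 3: sample the next centroid with probability proportional to D_i^2
        centroids.append(X[rng.choice(len(X), p=sq_dists / sq_dists.sum())])
    return np.array(centroids)
```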

Limitations of K-means
It is important to distinguish between the model (distortion J(X, Z, μ) in our case) and the algorithm used to fit it (here Lloyd’s algorithm).
Modeling issues
• Underlying assumptions are not explicit
• Can’t detect clusters with overlapping convex hulls
• Sensitivity to outliers
• No uncertainty measure
Algorithmic issues
• Extreme sensitivity to initialization
How can we address (some of) these issues? Switch to the more principled probabilistic framework!

Gaussian Mixture Model

Probabilistic unsupervised learning
• Goal: Design a probabilistic model for the data, p(x | θ), parametrized by θ which we have to estimate.
• It’s difficult to come up with a useful model p(x | θ) without making any assumptions about the data.
• When doing clustering, we implicitly assume that each point x belongs to some cluster (indicated by z).
⇒ Let’s model p(x , z | θ) instead!
• Let’s recall what tools we developed during our discussion of supervised learning, and see if any of them can help us here.
• For this, assume for now that we know the true cluster indicators z and are only interested in estimating θ.

Parameter estimation
How can we estimate the parameters θ of a given model p(X, Z | θ)?
Supervised learning
• Both X ∈ RN×D and Z ∈ {0,1}N×K are known.
• Simple solution – maximum likelihood estimation
θ* = argmax_θ p(X, Z | θ)
Unsupervised learning
• Only the data X ∈ RN×D is given.
• Is there any way to estimate θ in this case?

Latent variables
How can we estimate θ when Z is unknown?
Since we chose to model p(X , Z | θ), we can compute the joint
probability and obtain p(X | θ) by marginalization
p(X | θ) = ∑_Z p(X, Z | θ) = ∑_Z p(X | Z, θ) · p(Z | θ)
and optimize w.r.t. θ
θ* = argmax_θ p(X | θ)
Since the true Z are never observed and are only our abstraction used to simplify the model, they are called latent variables.1
1Sometimes also called hidden or explanatory variables.

Gaussian mixture model (GMM)
We decompose the joint probability p(x, z | θ) using the product rule2
p(x, z | θ) = p(x | z, θ) · p(z | θ)
In the Gaussian mixture model we assume:
• Cluster prior is categorical p(z | θ) = Cat(π)
• Each cluster has its own multivariate normal distribution
p(x |zk =1,θ)=N(x |μk,Σk)
The parameters consist of the set of parameters of the prior and all
conditional distributions θ = {π, μ, Σ}.
2We assume that samples are drawn i.i.d., so consider only a single (x,z).

Generative process of GMM
We can draw samples from a GMM as follows.
We have K Gaussian components (clusters), each with known {μk , Σk }, and known prior probabilities πk for each cluster k.
Generative process
1. Draw a one-hot cluster indicator z ∼ Cat(π). zk = 1 ⇐⇒ current point belongs to cluster k.
2. Draw the sample x ∼ N(μk,Σk) if zk = 1.
We only observe the sample x, and don’t know which k it came from.
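A short NumPy sketch of this generative process with known parameters; the specific values of π, μ and Σ below are made up purely for illustration:

```python
import numpy as np

def sample_gmm(n, pi, mus, Sigmas, seed=0):
    """Draw n samples from a GMM with known mixing weights and component parameters."""
    rng = np.random.default_rng(seed)
    # Step 1: draw a cluster indicator z ~ Cat(pi) for every sample
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw x ~ N(mu_k, Sigma_k) for the selected component k
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2)] * 3)
X, Z_true = sample_gmm(500, pi, mus, Sigmas)  # in clustering, only X would be observed
```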

GMM likelihood
Using marginalization over z , the probability of a single sample x under a Gaussian mixture model can be computed as
p(x | π, μ, Σ) = ∑_{k=1}^K p(zk = 1 | π) · p(x | zk = 1, μk, Σk)
= ∑_{k=1}^K πk · N(x | μk, Σk)
Hence, for the entire dataset X = {x1, …, xN} the log-likelihood is
log p(X | π, μ, Σ) = ∑_{i=1}^N log ( ∑_{k=1}^K πk · N(xi | μk, Σk) )
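Assuming SciPy is available, the log-likelihood can be evaluated with a log-sum-exp over components for numerical stability; this is a sketch rather than reference code from the lecture:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    # log(pi_k) + log N(x_i | mu_k, Sigma_k) for all i, k; shape (N, K)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))])
    # log-sum-exp over components, then sum over data points
    return logsumexp(log_terms, axis=1).sum()
```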

GMM density
(a) Isocontours of each component p(x | μk , Σk ) with respective mixing coefficients πk .
(b) Isocontours of the GMM distribution
p(x | π, μ, Σ) = ∑_{k=1}^K πk · N(x | μk, Σk)
(c) Surface plot of p(x | π, μ, Σ).
Source: Bishop, Figure 2.23

Learning and inference in GMM
Remember, that we are interested in two things
• We want to determine the "optimal" values of the parameters (e.g., find values that maximize the log-likelihood)
π*, μ*, Σ* = argmax_{π,μ,Σ} log p(X | π, μ, Σ)
• Given the parameters {π, μ, Σ}, we want to assign the points to clusters, i.e. compute the posterior distribution of cluster indicators
p(Z | X,π,μ,Σ) (This is the actual goal of clustering.)

Inference in GMM
By straightforward application of Bayes’ rule, we obtain the posterior
p(zik = 1 | xi, π, μ, Σ) = p(zik = 1 | π) p(xi | zik = 1, μk, Σk) / p(xi | π, μ, Σ)
= πk N(xi | μk, Σk) / ∑_{j=1}^K πj N(xi | μj, Σj)
over the entries of Z. We will call this quantity the responsibility of component k for the observation i.
We also introduce shorthand notation for it
γ(zik) = p(zik = 1 | xi,π,μ,Σ)
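A minimal sketch of computing the responsibilities γ(zik), again assuming SciPy and normalizing in log space to avoid numerical underflow:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """gamma(z_ik) = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)."""
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(len(pi))])
    # Normalize over components in log space; result has shape (N, K)
    return np.exp(log_terms - logsumexp(log_terms, axis=1, keepdims=True))
```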

Inference: Example
(a) Data colored according to the true underlying Z .
(b) Observed data X .
(c) Data colored according to the inferred γ(Z ) (with π, μ, Σ known).
Source: Bishop, Figure 9.5

Learning in GMM
Maximization of the log-likelihood w.r.t. the parameters appears to be more challenging. Recall that the log-likelihood is
log p(X | π, μ, Σ) = ∑_{i=1}^N log ( ∑_{k=1}^K πk · N(xi | μk, Σk) ).
Due to the summation over k, the logarithm doesn’t act directly on the
Gaussian density.
This happens to be a serious problem, as this means that the derivatives
w.r.t. the model parameters π, μ, Σ have no analytical roots.3 Is there anything that we can do in this case?
3i.e. we cannot obtain closed form expressions for the optimal values of π,μ,Σ.

Learning in GMM
Recall from the Linear Classification lecture (Linear Discriminant Analysis) that parameter estimation is rather straightforward if Z is known.
That is, we know how to solve the "easy" problem
π*, μ*, Σ* = argmax_{π,μ,Σ} log p(X, Z | π, μ, Σ)   (*)
Too bad we don’t know anything about Z… Or do we?
Idea: use our initial beliefs γ0(Z ) = p(Z | X , π(0), μ(0), Σ(0)) and solve the easier problem (*).

Learning in GMM: from p(X) to p(X,Z)
Instead of optimizing p(X ) directly, we introduce the latent variables Z
by taking their expectation and formulate the proxy objective
E_{Z∼γt(Z)} [log p(X, Z | π, μ, Σ)].
Now the log acts directly on p(X, Z) and we can optimize the whole expression. Note that γt(Z) and the optimizers π*, μ* and Σ* recursively depend on each other.
This motivates the expectation maximization (EM) algorithm that iterates the following two steps until convergence:
1. Evaluate the posterior distribution γt (Z ) with respect to the current estimate of the parameters {π(t), μ(t), Σ(t)}
2. Obtain the next iteration of the parameters by solving
π(t+1), μ(t+1), Σ(t+1) = argmax_{π,μ,Σ} E_{Z∼γt(Z)} [log p(X, Z | π, μ, Σ)]

EM algorithm for GMM
1. Initialize the model parameters {π(0), μ1(0), …, μK(0), Σ1(0), …, ΣK(0)}.
2. E step. Evaluate the responsibilities
γt(zik) = πk(t) N(xi | μk(t), Σk(t)) / ∑_{j=1}^K πj(t) N(xi | μj(t), Σj(t)).
3. M step. Re-estimate the parameters
μk(t+1) = (1/Nk) ∑_{i=1}^N γt(zik) xi,
Σk(t+1) = (1/Nk) ∑_{i=1}^N γt(zik) (xi − μk(t+1)) (xi − μk(t+1))^T,
πk(t+1) = Nk / N, where Nk = ∑_{i=1}^N γt(zik).
4. Iterate steps 2 & 3 until E_{Z∼γt(Z)} [log p(X, Z | π(t), μ(t), Σ(t))] converges.
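Putting the E and M steps together, a compact and deliberately simplified EM sketch for a GMM; the initialization, the small covariance regularization term and the fixed iteration count are assumptions made for this illustration, not part of the algorithm as stated above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture model."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]              # random data points
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iter):
        # E step: responsibilities gamma_t(z_ik), shape (N, K)
        log_terms = np.column_stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
            for k in range(K)])
        gamma = np.exp(log_terms - logsumexp(log_terms, axis=1, keepdims=True))
        # M step: closed-form updates for mu, Sigma and pi
        Nk = gamma.sum(axis=0)                                  # effective cluster sizes
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mus, Sigmas, gamma
```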

EM algorithm for GMM: Demo
Gif: https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif Source: Computer Vision: Models, Learning, and Inference – S.J.Prince

EM algorithm in general
The Expectation-Maximization algorithm (EM) is applicable to more models than just GMMs.
In its most general form, the two steps are
• E step – evaluate the posterior γt(Z) = p(Z | X,θ(t)).
• M step – maximize the expected joint log-likelihood log p(X , Z | θ) w.r.t. θ under the current beliefs γt (Z )
θ(t+1) = argmax_θ E_{Z∼γt(Z)} [log p(X, Z | θ)]
Note: we do not get a closed-form solution for either γ(Z) or the parameters, as γ(Z) and θ recursively depend on each other.
Therefore, we have to iterate between these two steps until the objective EZ ∼γ(Z ) [log p(X , Z | θ)] converges.

EM algorithm: A justification
Let p(X | θ) be a parametric probabilistic model with latent variables Z and q(Z ) any probability distribution. Then we have
log p(X | θ) = ∑_Z q(Z) log p(X | θ)
= ∑_Z q(Z) log [ p(X, Z | θ) / p(Z | X, θ) ]
= ∑_Z q(Z) log [ (p(X, Z | θ) / q(Z)) · (q(Z) / p(Z | X, θ)) ]
= ∑_Z q(Z) log (p(X, Z | θ) / q(Z)) − ∑_Z q(Z) log (p(Z | X, θ) / q(Z))
= E_{Z∼q} [ log (p(X, Z | θ) / q(Z)) ] + KL(q || p(· | X, θ))
= L(q, θ) + KL(q || p(· | X, θ)),
where L(q, θ) := E_{Z∼q} [ log (p(X, Z | θ) / q(Z)) ].
L(q, θ) is a lower-bound on log p(X | θ) for any q.
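The decomposition can be checked numerically on a toy discrete model; the probability values below are arbitrary and only serve to verify that log p(X | θ) = L(q, θ) + KL(q || p(· | X, θ)) holds for an arbitrary q:

```python
import numpy as np

# Toy model: discrete z in {0, 1, 2}, a single observation x, fixed theta.
p_z = np.array([0.5, 0.3, 0.2])            # p(z | theta)
p_x_given_z = np.array([0.1, 0.7, 0.4])    # p(x | z, theta) for the observed x
p_xz = p_z * p_x_given_z                   # joint p(x, z | theta)
p_x = p_xz.sum()                           # evidence p(x | theta)
posterior = p_xz / p_x                     # p(z | x, theta)

q = np.array([0.2, 0.5, 0.3])              # an arbitrary distribution over z
lower_bound = np.sum(q * np.log(p_xz / q))       # L(q, theta)
kl = np.sum(q * np.log(q / posterior))           # KL(q || p(. | x, theta))

# The decomposition holds exactly, and L(q, theta) <= log p(x | theta)
assert np.isclose(np.log(p_x), lower_bound + kl)
```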

EM algorithm: The E-step
log p(X | θ) = L(q, θ) + KL(q || p(· | X, θ))
The left-hand side is constant with respect to q and the KL divergence is non-negative. Therefore L(q, θ) is maximal when the KL divergence is 0, which happens when q(Z) = γ(Z) = p(Z | X, θ).
This constitutes the E-step and ensures that the lower bound L(q, θ) is tight, i.e.
log p(X | θ) = L(q, θ), where
L(q, θ) = E_{Z∼q} [ log (p(X, Z | θ) / q(Z)) ]
= E_{Z∼q} [ log p(X, Z | θ) ] + H[q(·)].
The first term is our proxy objective from the GMM case; the entropy term H[q(·)] is independent of θ.

EM algorithm: The M-step
logp(X |θ)=L(q,θ)+KL(q||p(·|X,θ))
After the E-step we have log p(X | θ) = L(q, θ). So if we now perform
the M-step and maximize L(q, θ) with respect to θ, either
• L(q, θ) does not change, we are at a local maximum of log p(X | θ)
and EM converges
• or L(q, θ) “pushes” up against log p(X | θ).
In the latter case, we receive parameters θ′ that have a higher data log-likelihood. Furthermore, the same equation that we started with holds again
log p(X | θ′) = L(q, θ′) + KL(q || p(· | X, θ′))
and we can perform another iteration.

EM algorithm: Illustration

EM algorithm: Summary
When should we use EM?
• Intractable to optimize the log-likelihood due to the latent variables Z
log p(X | θ) = log ( ∑_Z p(X | Z, θ) p(Z | θ) )
• Easy to optimize the joint probability
logp(X,Z | θ) = logp(Z | θ)+logp(X | Z,θ)
The main idea of EM
• Use our current beliefs p(Z | X, θ(t)) to "pretend" that we know Z.
• E step – update the responsibilities γt(Z) = p(Z | X, θ(t)).
• M step – update parameters
θ(t+1) = argmax_θ E_{Z∼γt(Z)} [log p(X, Z | θ)]

Choosing number of clusters K: Heuristic methods
• Elbow/knee heuristic:
Plot the within-cluster sum of squared distances for varying K and look for a bend (knee/elbow); see the sketch after this list.
• Gap statistic:
Compare the within-cluster variation to that of uniformly distributed reference data. Choose the smallest K whose gap statistic is within one standard deviation of the gap at K + 1.
• Silhouette:
Per-point difference between the mean distance to points in the closest other cluster and the mean distance to points in its own cluster. Maximize the mean over all points.
Don’t forget cross-validation to prevent overfitting!
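Assuming scikit-learn is available, the elbow and silhouette heuristics can be computed as sketched below; the synthetic data is just a placeholder for your own:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances (y-axis of the elbow plot);
    # silhouette_score is the mean silhouette over all points (to be maximized)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
```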

Choosing number of clusters K: Probabilistic methods
Needs a generative model that defines the data likelihood L̂ = p(X | Ẑ, θ̂) (with optimal parameters Ẑ, θ̂).
• Bayesian information criterion (BIC):
Approximates the model likelihood p(X | model). Balances the number of free parameters K against the likelihood: BIC = K log N − 2 log L̂ (lower is better); see the sketch after this list.
• Akaike information criterion (AIC):
Estimates the information lost by a given model: AIC = 2K − 2 log L̂ (lower is better).
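Assuming scikit-learn is available, its GaussianMixture estimator exposes bic() and aic() directly; a minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # Lower BIC/AIC indicates a better trade-off between fit and model complexity
    print(k, gmm.bic(X), gmm.aic(X))
```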

Hierarchical clustering
• Agglomerative: Bottom-up, merge cluster pairs iteratively
• Divisive: Top-down, split data recursively
From https://en.wikipedia.org/wiki/File:Iris_dendrogram.png

Agglomerative clustering: How to choose pairs?
Cluster pairs are chosen by minimum cluster distance.
The linkage criterion defines the cluster distance in terms of sample distances (see the sketch after this list):
• Complete-linkage clustering: Maximum sample distance
• Single-linkage clustering: Minimum sample distance
• Unweighted average linkage clustering (UPGMA): Mean sample distance
• Centroid linkage clustering (UPGMC): Centroid distance
• Weighted average linkage clustering (WPGMA): Distance defined recursively as average of distances in previous level (before merging)
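Assuming SciPy is available, scipy.cluster.hierarchy implements these linkage criteria; a minimal sketch of agglomerative clustering with average linkage (UPGMA):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))  # placeholder data

# 'single', 'complete', 'average', 'weighted' and 'centroid' correspond to the
# criteria above (UPGMA = 'average', WPGMA = 'weighted', UPGMC = 'centroid')
Z = linkage(X, method='average')
# Cut the dendrogram so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
```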

Summary
• Clustering: Group objects into K groups
• K-means: Alternatingly (1) group objects by the closest centroids and (2) set the centroids to the group means
Note: K-means update equivalent to isotropic GMM with σ → 0! (exercise)
• GMM: Fit set of Gaussians to the data (via EM), group is the Gaussian with highest probability
• EM for GMM: Alternatingly update (1) responsibilities based on parameters and (2) parameters based on responsibilities.
• EM in general: Alternatingly (1) evaluate the expected posterior γt (Z ) (E-step) and (2) update parameters by maximizing L(γt , θ) (M-step).
• Agglomerative clustering: Merge clusters recursively

Reading material
Reading material
• Bishop: chapters 9.1, 9.2, 9.3.0, 9.3.1
• Murphy: chapter 11.4.1, 11.4.2, 11.4.7 (optional)
