CSC 311: Introduction to Machine Learning
Lecture 11 – k-Means and EM Algorithm
University of Toronto, Fall 2021
In the previous lecture, we covered PCA, Autoencoders and Matrix Factorization—all unsupervised learning algorithms.
- Each algorithm can be used to approximate high-dimensional data using some lower-dimensional form.
Those methods made an interesting assumption that data depends on some latent variables that are never observed. Such models are called latent variable models.
- For PCA, these correspond to the code vectors (representation).
Today's lecture:
- First, introduce K-means, a simple algorithm for clustering, i.e. grouping data points into clusters
- Then, we will reformulate clustering as a latent variable model and apply the EM algorithm
Clustering
Sometimes the data form clusters, where samples within a cluster are similar to each other, and samples in different clusters are dissimilar:
Such a distribution is multimodal, since it has multiple modes, or regions of high probability mass.
Grouping data points into clusters, with no observed labels, is called clustering. It is an unsupervised learning technique.
E.g. clustering machine learning papers based on topic (deep learning, Bayesian models, etc.)
- But topics are never observed (unsupervised).
K-means intuition
K-means assumes there are k clusters, and each point is close to its cluster center, or mean (the mean of points in the cluster).
- If we knew the cluster assignments, we could easily compute the centers.
- If we knew the centers, we could easily compute the cluster assignments.
- Chicken and egg problem!
Very simple (and useful) heuristic: start randomly and alternate between the two!
Initialization: randomly initialize cluster centers
The algorithm iteratively alternates between two steps:
- Assignment step: Assign each data point to the closest cluster
- Refitting step: Move each cluster center to the center of gravity of the data assigned to it
[Figure: assignments and refitted means]
Figure from Bishop. Simple demo: http://syskall.com/kmeans.js/
K-means Objective
What is actually being optimized?
K-means Objective:
Find cluster centers m and assignments r to minimize the sum of squared distances of data points {x(n)} to their assigned cluster centers
$$\min_{\{\mathbf{m}\},\{\mathbf{r}\}} J(\{\mathbf{m}\},\{\mathbf{r}\}) = \min_{\{\mathbf{m}\},\{\mathbf{r}\}} \sum_{n=1}^{N}\sum_{k=1}^{K} r_k^{(n)} \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2$$
$$\text{s.t.}\ \sum_{k=1}^{K} r_k^{(n)} = 1\ \forall n, \quad r_k^{(n)} \in \{0,1\}\ \forall k, n$$
where $r_k^{(n)} = 1$ means that $\mathbf{x}^{(n)}$ is assigned to cluster $k$ (with center $\mathbf{m}_k$).
Finding the exact optimum can be shown to be NP-hard.
K-means can be seen as block coordinate descent on this objective (analogous to ALS for matrix completion):
- Assignment step = minimize w.r.t. $\{r_k^{(n)}\}$
- Refitting step = minimize w.r.t. $\{\mathbf{m}_k\}$
How to optimize?: Alternating Minimization
Optimization problem:
$$\min_{\{\mathbf{m}_k\},\{r^{(n)}\}} \sum_{n=1}^{N}\sum_{k=1}^{K} r_k^{(n)} \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2$$
If we fix the centers $\{\mathbf{m}_k\}$, then we can easily find the optimal assignments $\{r^{(n)}\}$ for each sample $n$:
$$\min_{r^{(n)}} \sum_{k=1}^{K} r_k^{(n)} \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2$$
- Assign each point to the cluster with the nearest center:
$$r_k^{(n)} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}^{(n)} - \mathbf{m}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
- E.g. if $\mathbf{x}^{(n)}$ is assigned to cluster $\hat{k}$, then
$$\mathbf{r}^{(n)} = [0, 0, \ldots, 1, \ldots, 0]^\top$$
where only the $\hat{k}$-th entry is 1.
Alternating Minimization
Likewise, if we fix the assignments $\{r^{(n)}\}$, then we can easily find the optimal centers $\{\mathbf{m}_k\}$: setting the derivative to zero,
$$\frac{\partial}{\partial \mathbf{m}_l} \sum_{n}\sum_{k} r_k^{(n)} \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2 = 2\sum_{n} r_l^{(n)} (\mathbf{m}_l - \mathbf{x}^{(n)}) = 0 \;\Longrightarrow\; \mathbf{m}_l = \frac{\sum_n r_l^{(n)} \mathbf{x}^{(n)}}{\sum_n r_l^{(n)}}$$
K-means simply alternates between minimizing w.r.t. assignments and centers. This is an instance of alternating minimization, or block coordinate descent.
The K-means Algorithm
Initialization: Set $K$ cluster means $\mathbf{m}_1, \ldots, \mathbf{m}_K$ to random values.
Repeat until convergence (until assignments do not change):
- Assignment: Optimize $J$ w.r.t. $\{r\}$: each data point $\mathbf{x}^{(n)}$ is assigned to the nearest center,
$$\hat{k}^{(n)} = \arg\min_k \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2,$$
with responsibilities (1-hot or 1-of-$K$ encoding) $r_k^{(n)} = \mathbb{I}[\hat{k}^{(n)} = k]$ for $k = 1, \ldots, K$.
- Refitting: Optimize $J$ w.r.t. $\{\mathbf{m}\}$: each center is set to the mean of the data assigned to it,
$$\mathbf{m}_k = \frac{\sum_n r_k^{(n)} \mathbf{x}^{(n)}}{\sum_n r_k^{(n)}}.$$
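A minimal NumPy sketch of this procedure (my own illustration, not the course's reference code; the function name, initialization scheme, and iteration cap are arbitrary choices):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain k-means: alternate assignment and refitting steps on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: pick K random data points as the initial centers
    centers = X[rng.choice(N, size=K, replace=False)].astype(float)
    assign = None
    for _ in range(max_iters):
        # Assignment step: each point is assigned to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments did not change => converged (to a local minimum)
        assign = new_assign
        # Refitting step: move each center to the mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):  # leave empty clusters where they are
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```

Usage would look like `centers, assign = kmeans(X, K=3)` for a data matrix `X` of shape (N, D).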
Why K-means Converges
The K-means algorithm reduces the cost at each iteration.
- Whenever an assignment is changed, the sum of squared distances J of data points from their assigned cluster centers is reduced.
- Whenever a cluster center is moved, J is reduced.
Test for convergence: If the assignments do not change in the assignment step, we have converged (to at least a local minimum).
This will always happen after a finite number of iterations, since the number of possible cluster assignments is finite.
K-means cost function after each assignment step (blue) and refitting step (red). The algorithm has converged after the third refitting step.
Local Minima
The objective J is non-convex (so coordinate descent on J is not guaranteed to converge to the global minimum)
There is nothing to prevent k-means from getting stuck at local minima.
We could try many random starting points
A bad local optimum
K-means for Vector Quantization
Figure from Bishop
Given image, construct “dataset” of pixels represented by their RGB pixel intensities
Run k-means, replace each pixel by its cluster center
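One way this might look in practice, as a sketch using scikit-learn's KMeans (the file name `photo.jpg` and the choice K = 8 are placeholders, not from the slides):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and flatten it into a "dataset" of RGB pixels
img = np.asarray(Image.open("photo.jpg"), dtype=float) / 255.0  # (H, W, 3)
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                                     # (H*W, 3)

# Run k-means on pixel colours, then replace each pixel by its cluster center
K = 8
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(H, W, 3)    # vector-quantized image
```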
K-means for Image Segmentation
Given image, construct “dataset” of pixels, represented by their RGB pixel intensities and grid locations
Run k-means (with some modifications) to get superpixels
Soft K-means
Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a datapoint and another may have a responsibility of .3.
- Allows a cluster to use more information about the data in the refitting step.
- How do we decide on the soft assignments?
- We already saw this in multi-class classification: 1-of-K encoding vs. softmax assignments.
Soft K-means Algorithm
Initialization: Set K means {mk} to random values
Repeat until convergence (measured by how much J changes):
- Assignment: Each data point $n$ is given a soft "degree of assignment" to each cluster mean $k$, based on responsibilities
$$r_k^{(n)} = \frac{\exp[-\beta \|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2]}{\sum_j \exp[-\beta \|\mathbf{m}_j - \mathbf{x}^{(n)}\|^2]} \;\Longrightarrow\; \mathbf{r}^{(n)} = \operatorname{softmax}\!\big(-\beta \{\|\mathbf{m}_k - \mathbf{x}^{(n)}\|^2\}_{k=1}^{K}\big)$$
- Refitting: Model parameters (the means) are adjusted to match the sample means of the data points they are responsible for:
$$\mathbf{m}_k = \frac{\sum_n r_k^{(n)} \mathbf{x}^{(n)}}{\sum_n r_k^{(n)}}$$
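A minimal sketch of soft k-means in NumPy (my own illustration; the initialization, β, and stopping rule are arbitrary choices, not from the slides):

```python
import numpy as np
from scipy.special import softmax

def soft_kmeans(X, K, beta=1.0, max_iters=100, tol=1e-6, seed=0):
    """Soft k-means: softmax responsibilities, then responsibility-weighted refitting."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev_J = np.inf
    for _ in range(max_iters):
        # Assignment step: r[n, k] = softmax_k(-beta * ||x^(n) - m_k||^2)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        r = softmax(-beta * dists, axis=1)
        # Refitting step: each center is the responsibility-weighted mean of the data
        centers = (r.T @ X) / r.sum(axis=0)[:, None]
        # Stop when the soft cost J stops changing appreciably
        J = (r * dists).sum()
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return centers, r
```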
Questions about Soft K-means
Some remaining issues:
- How to set β?
- Clusters with unequal weight and width?
These aren’t straightforward to address with K-means. Instead, in the sequel, we’ll reformulate clustering using a generative model.
As β → ∞, soft k-Means becomes k-Means! (Exercise)
Gaussian Mixture Models
A Generative View of Clustering
What if the data don't look like spherical blobs?
- elongated clusters
- discrete data
Remainder of this lecture: formulating clustering as a probabilistic model.
- specify assumptions about how the observations relate to latent variables
- use an algorithm called E-M to (approximately) maximize the likelihood of the observations
This lets us generalize clustering to non-spherical centers or to non-Gaussian observation models (as in this week’s tutorial).
This lecture is when probabilistic modeling starts to shine!
Generative Models Recap
Recall generative (Bayes) classifiers:
p(x, t) = p(x | t) p(t)
- We fit p(t) and p(x | t) using labeled data.
If t is never observed, we call it a latent variable, or hidden variable, and generally denote it with z instead.
- The things we can observe (i.e. x) are called observables.
By marginalizing out z, we get a density over the observables:
$$p(\mathbf{x}) = \sum_{z} p(\mathbf{x}, z) = \sum_{z} p(\mathbf{x} \mid z)\, p(z)$$
This is called a latent variable model.
If p(z) is a categorical distribution, this is a mixture model, and different values of z correspond to different components.
Gaussian Mixture Model (GMM)
Most common mixture model: Gaussian mixture model (GMM)
A GMM represents a distribution as
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
with $\pi_k$ the mixing coefficients, where
$$\sum_{k=1}^{K} \pi_k = 1 \quad \text{and} \quad \pi_k \geq 0 \;\; \forall k$$
This defines a density over x, so we can fit the parameters using maximum likelihood. We're trying to match the data density of x as closely as possible.
- This is a hard optimization problem (and the focus of this lecture).
GMMs are universal approximators of densities (analogously to our universality result for MLPs). Even diagonal GMMs are universal approximators.
Gaussian Mixture Model (GMM)
We can also write the model as a generative process: for $i = 1, \ldots, N$:
$$z^{(i)} \sim \text{Categorical}(\boldsymbol{\pi}), \qquad \mathbf{x}^{(i)} \mid z^{(i)} \sim \mathcal{N}(\boldsymbol{\mu}_{z^{(i)}}, \boldsymbol{\Sigma}_{z^{(i)}})$$
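A sketch of sampling from this generative process in NumPy (the specific parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GMM parameters: K = 3 components in 2D
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

N = 500
z = rng.choice(len(pi), size=N, p=pi)  # z^(i) ~ Categorical(pi)
x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x^(i) | z^(i)
```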
The Generative Model
500 points drawn from a mixture of 3 Gaussians.
a) Samples from p(x | z) b) Samples from the marginal p(x) c) Responsibilities p(z | x)
Maximum Likelihood with Latent Variables
How should we choose the parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$?
Maximum likelihood principle: choose parameters to maximize the likelihood of the observed data.
We don't observe the cluster assignments z; we only see the data x.
Given data $\mathcal{D} = \{\mathbf{x}^{(n)}\}_{n=1}^{N}$, choose parameters to maximize:
$$\log p(\mathcal{D}) = \sum_{n=1}^{N} \log p(\mathbf{x}^{(n)})$$
We can find $p(\mathbf{x})$ by marginalizing out $z$:
$$p(\mathbf{x}) = \sum_{k=1}^{K} p(z = k, \mathbf{x}) = \sum_{k=1}^{K} p(z = k)\, p(\mathbf{x} \mid z = k)$$
Visualizing a Mixture of Gaussians – 1D Gaussians
If you fit a Gaussian to data:
Now, we are trying to fit a GMM (with K = 2 in this example):
[Slide credit: K. Kutulakos]
Visualizing a Mixture of Gaussians – 2D Gaussians
Expectation-Maximization (E-M)
Fitting GMMs: Maximum Likelihood
Some shorthand notation: let θ = {πk,μk,Σk} denote the full set of model parameters. Let X = {x(i)} and Z = {z(i)}.
Maximum likelihood objective:
$$\log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
In general, no closed-form solution
Not identifiable: the solution is invariant to permutations of the components.
Challenges in optimizing this using gradient descent:
- Non-convex (due to permutation symmetry)
- Need to enforce the non-negativity constraint on $\pi_k$ and the PSD constraint on $\boldsymbol{\Sigma}_k$
- Derivatives w.r.t. $\boldsymbol{\Sigma}_k$ are expensive/complicated.
We need a different approach!
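For concreteness, here is a sketch of how this objective could be evaluated numerically (the log-sum-exp trick for stability is standard practice, not something stated on the slide; the function name is mine):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X; theta) = sum_i log sum_k pi_k N(x^(i); mu_k, Sigma_k)."""
    N, K = len(X), len(pi)
    log_joint = np.empty((N, K))
    for k in range(K):
        # log pi_k + log N(x^(i); mu_k, Sigma_k) for every data point
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    # Sum over components inside the log, then sum the per-example log-likelihoods
    return logsumexp(log_joint, axis=1).sum()
```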
Fitting GMMs: Maximum Likelihood
Warning: you don’t want the global maximum. You can achieve arbitrarily high training likelihood by placing a small-variance Gaussian component on a training example.
This is known as a singularity.
Latent Variable Models: Inference
If we knew the parameters θ = {πk,μk,Σk}, we could infer which component a data point x(i) probably belongs to by inferring its latent variable z(i).
This is just posterior inference, which we do using Bayes’ Rule:
$$\Pr(z^{(i)} = k \mid \mathbf{x}^{(i)}) = \frac{\Pr(z^{(i)} = k)\, p(\mathbf{x}^{(i)} \mid z^{(i)} = k)}{\sum_{l} \Pr(z^{(i)} = l)\, p(\mathbf{x}^{(i)} \mid z^{(i)} = l)}$$
Just like Naïve Bayes, GDA, etc. at test time.
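A sketch of this inference step in code (my own illustration; the function name is arbitrary). Normalizing the joint $\log \pi_k + \log \mathcal{N}(\mathbf{x}^{(i)}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ over $k$ in log space is just a softmax:

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """r[i, k] = Pr(z^(i) = k | x^(i); theta), via Bayes' Rule."""
    N, K = len(X), len(pi)
    log_joint = np.empty((N, K))
    for k in range(K):
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
    # Dividing by the marginal p(x^(i)) is a softmax over k in log space
    return softmax(log_joint, axis=1)
```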
Latent Variable Models: Learning
If we somehow knew the latent variables for every data point, we could simply maximize the joint log-likelihood.
$$\log p(\mathbf{X}, \mathbf{Z}; \theta) = \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \left[ \log p(z^{(i)}) + \log p(\mathbf{x}^{(i)} \mid z^{(i)}) \right]$$
This is just like GDA at training time. Our formulas from Week 8, written in a suggestive notation:
$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}, \qquad \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_k^{(i)} \mathbf{x}^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}, \qquad \boldsymbol{\Sigma}_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^\top,$$
where $r_k^{(i)} = \mathbb{I}[z^{(i)} = k]$.
Latent Variable Models
But we don’t know the z(i), so we need to marginalize them out. Now the log-likelihood is more awkward.
$$\log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log p(\mathbf{x}^{(i)} \mid \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}=1}^{K} p(\mathbf{x}^{(i)} \mid z^{(i)}; \{\boldsymbol{\mu}_k\}, \{\boldsymbol{\Sigma}_k\})\, p(z^{(i)} \mid \boldsymbol{\pi})$$
Problem: the sum over $z^{(i)}$ is inside the log, so things don't simplify. We have a chicken-and-egg problem, just like with K-Means!
- Given θ, inferring the $z^{(i)}$ is easy.
- Given the $z^{(i)}$, learning θ (with maximum likelihood) is easy.
- Doing both simultaneously is hard.
Can you guess the algorithm?
Intuitively, How Can We Fit a Mixture of Gaussians?
We use the Expectation-Maximization algorithm, which alternates between two steps:
1. Expectation step (E-step): Compute the posterior probability over z given our current model – i.e. how much do we think each Gaussian generates each datapoint.
2. Maximization step (M-step): Assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
Expectation Maximization for GMM Overview
1. E-step:
- Assign the responsibility $r_k^{(i)}$ of component $k$ for data point $i$ using the posterior probability:
$$r_k^{(i)} = \Pr(z^{(i)} = k \mid \mathbf{x}^{(i)}; \theta)$$
2. M-step:
- Apply the maximum likelihood updates, where each component is fit with a weighted dataset. The weights are proportional to the responsibilities.
$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}, \qquad \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_k^{(i)} \mathbf{x}^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}, \qquad \boldsymbol{\Sigma}_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^\top$$
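Putting the E-step and M-step together, a minimal EM sketch for a GMM (my own illustration; the initialization and the small ridge added to each covariance are common safeguards rather than anything specified on the slides):

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import multivariate_normal

def em_gmm(X, K, num_iters=100, seed=0):
    """EM for a Gaussian mixture: alternate E-step and M-step on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Crude initialization: uniform mixing weights, random data points as means
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(num_iters):
        # E-step: responsibilities r[i, k] = Pr(z^(i) = k | x^(i); theta)
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k]) for k in range(K)],
            axis=1)
        r = softmax(log_joint, axis=1)
        # M-step: weighted maximum-likelihood updates
        Nk = r.sum(axis=0)                      # effective number of points per component
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mus, Sigmas, r
```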
Suppose we recorded a bunch of temperatures in March for Toronto and Miami, but forgot to record which was which, and they’re all jumbled together.
Let’s try to separate them out using a mixture of Gaussians and E-M.
[Figures: random initialization, followed by successive E-step / M-step updates of the mixture fit to the temperature data.]
Expectation-Maximization
EM for Multivariate Gaussians:
In tutorial, you will fit a mixture of Bernoullis model.
Relation to k-Means
The K-Means Algorithm:
1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster center to the average of the data assigned to it
The EM Algorithm:
1. E-step: Compute the posterior probability over z given our current model
2. M-step: Maximize the probability that it would generate the data it is currently responsible for.
Can you find the similarities between the soft k-Means algorithm and the EM algorithm with shared covariance $\frac{1}{\beta}\mathbf{I}$?
Both rely on alternating optimization methods and can suffer from bad local optima.
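As a hint toward this comparison (a sketch of my own, not from the slides): with equal mixing weights $\pi_k = 1/K$ and a shared spherical covariance $\epsilon \mathbf{I}$, the E-step responsibilities reduce to

$$r_k^{(i)} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \epsilon \mathbf{I})}{\sum_j \pi_j\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_j, \epsilon \mathbf{I})} = \frac{\exp\!\big(-\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_k\|^2 / 2\epsilon\big)}{\sum_j \exp\!\big(-\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_j\|^2 / 2\epsilon\big)},$$

since the equal mixing weights and the shared Gaussian normalizing constants cancel. This matches the soft k-means responsibilities with $\beta = 1/(2\epsilon)$, and the M-step for the means is the same responsibility-weighted average as the soft k-means refitting step.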
Why EM Works
(the rest of this lecture is optional)
Jensen’s Inequality (optional)
Recall: if a function f is convex, then
$$f\!\left(\sum_i \lambda_i x_i\right) \leq \sum_i \lambda_i f(x_i),$$
where $\{\lambda_i\}$ are such that each $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$.
If we treat the λi as the parameters of a categorical distribution, λi = Pr(X = xi), this can be rewritten as:
f(E[X]) ≤ E[f(X)].
This is known as Jensen's Inequality. It holds for continuous distributions as well.
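A quick numerical check (my own example): take the convex function $f(x) = x^2$ and let $X$ be $1$ or $3$ with probability $\tfrac{1}{2}$ each. Then

$$f(\mathbb{E}[X]) = 2^2 = 4 \;\leq\; \mathbb{E}[f(X)] = \tfrac{1}{2}(1^2) + \tfrac{1}{2}(3^2) = 5.$$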
Jensen’s Inequality (optional)
A function f(x) is concave if −f(x) is convex. In this case, we flip Jensen’s Inequality:
f(E[X]) ≥ E[f(X)].
When would you expect the inequality to be tight?
Where does EM come from? (optional)
Recall: the log-likelihood function is awkward because it has a summation inside the log:
$$\log p(\mathbf{X}; \theta) = \sum_i \log p(\mathbf{x}^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(\mathbf{x}^{(i)}, z^{(i)}; \theta)$$
Introduce a new distribution q(z(i)) (we’ll see what this is shortly):
$$\log p(\mathbf{X}; \theta) = \sum_i \log\!\left( \sum_{z^{(i)}} q(z^{(i)})\, \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right) = \sum_i \log \mathbb{E}_{q(z^{(i)})}\!\left[ \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right]$$
Notice that log is a concave function, so we can use Jensen's Inequality to push the log inwards, obtaining the variational lower bound:
$$\log p(\mathbf{X}; \theta) \geq \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] \triangleq \mathcal{L}(q, \theta)$$
Where does EM come from? (optional)
Just derived a lower bound on the log-likelihood:
$$\log p(\mathbf{X}; \theta) \geq \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right]$$
Simplifying the right-hand side:
$$\mathcal{L}(q, \theta) = \sum_i \Big( \mathbb{E}_{q(z^{(i)})}\big[\log p(\mathbf{x}^{(i)}, z^{(i)}; \theta)\big] - \underbrace{\mathbb{E}_{q(z^{(i)})}\big[\log q(z^{(i)})\big]}_{\text{constant w.r.t. } \theta} \Big)$$
The expected log-probability will turn out to be nice.
Where does EM come from? (optional)
Everything so far holds for any choice of q. But what should we actually pick?
Jensen’s inequality gives a lower bound on the log-likelihood, so the best we can achieve is to make the bound tight (i.e. equality).
Denote the current parameters as θold.
It turns out the posterior probability $p(z^{(i)} \mid \mathbf{x}^{(i)}; \theta^{\text{old}})$ is a very good choice for q. Plugging it into the lower bound:
$$\begin{aligned}
\sum_i \mathbb{E}_{q(z^{(i)})}\!\left[\log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta^{\text{old}})}{q(z^{(i)})}\right]
&= \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[\log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta^{\text{old}})}{p(z^{(i)} \mid \mathbf{x}^{(i)}; \theta^{\text{old}})}\right] \\
&= \sum_i \mathbb{E}_{q(z^{(i)})}\!\big[\log p(\mathbf{x}^{(i)}; \theta^{\text{old}})\big] \\
&= \sum_i \log p(\mathbf{x}^{(i)}; \theta^{\text{old}}) \\
&= \log p(\mathbf{X}; \theta^{\text{old}})
\end{aligned}$$
Equality achieved!
Where does EM come from? (optional)
How could you pick $q(z^{(i)}) = p(z^{(i)} \mid \mathbf{x}^{(i)}; \theta^{\text{old}})$ if you didn't already know the answer?
Observe: if f is strictly concave, then Jensen's inequality becomes an equality exactly when the random variable X is deterministic.
Hence, to solve
$$\log \mathbb{E}_{q(z^{(i)})}\!\left[\frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})}\right] = \mathbb{E}_{q(z^{(i)})}\!\left[\log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})}\right],$$
we should set $q(z^{(i)}) \propto p(\mathbf{x}^{(i)}, z^{(i)}; \theta)$.
Where does EM come from? (optional)
E-step: compute the responsibilities using Bayes' Rule:
$$r_k^{(i)} \triangleq q(z^{(i)} = k) = \Pr(z^{(i)} = k \mid \mathbf{x}^{(i)}; \theta^{\text{old}})$$
Rewriting the variational lower bound in terms of the responsibilities:
$$\mathcal{L}(q, \theta) = \sum_i \sum_k r_k^{(i)} \log \Pr(z^{(i)} = k; \boldsymbol{\pi}) + \sum_i \sum_k r_k^{(i)} \log p(\mathbf{x}^{(i)} \mid z^{(i)} = k; \{\boldsymbol{\mu}_k\}, \{\boldsymbol{\Sigma}_k\}) + \text{const}$$
M-step: maximize L(q,θ) with respect to θ, giving θnew. This can be done analytically, and gives the parameter updates we saw previously.
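For instance, here is a sketch of one of these analytic updates (the standard Lagrange-multiplier argument, spelled out by me rather than taken from the slide): maximizing $\mathcal{L}(q, \theta)$ over $\boldsymbol{\pi}$ subject to $\sum_j \pi_j = 1$,

$$\frac{\partial}{\partial \pi_k}\left[ \sum_i \sum_j r_j^{(i)} \log \pi_j + \lambda\Big(1 - \sum_j \pi_j\Big) \right] = \frac{\sum_i r_k^{(i)}}{\pi_k} - \lambda = 0 \;\Longrightarrow\; \pi_k = \frac{1}{\lambda} \sum_i r_k^{(i)} = \frac{1}{N} \sum_i r_k^{(i)},$$

where the constraint forces $\lambda = \sum_i \sum_k r_k^{(i)} = N$; this recovers the $\pi_k$ update shown earlier.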
The two steps are guaranteed to improve the log-likelihood:
$$\log p(\mathbf{X}; \theta^{\text{new}}) \geq \mathcal{L}(q, \theta^{\text{new}}) \geq \mathcal{L}(q, \theta^{\text{old}}) = \log p(\mathbf{X}; \theta^{\text{old}}).$$
EM: Recap (optional)
Recap of EM derivation:
We’re trying to maximize the log-likelihood log p(X; θ).
The exact log-likelihood is awkward, but we can use Jensen's Inequality to lower bound it with a nicer function $\mathcal{L}(q, \theta)$, the variational lower bound, which depends on a choice of q.
The E-step chooses q to make the bound tight at the current parameters $\theta^{\text{old}}$. Mechanistically, this means computing the responsibilities $r_k^{(i)} = \Pr(z^{(i)} = k \mid \mathbf{x}^{(i)}; \theta^{\text{old}})$.
The M-step maximizes $\mathcal{L}(q, \theta)$ with respect to θ, giving $\theta^{\text{new}}$. For GMMs, this can be done analytically.
The combination of the E-step and M-step is guaranteed to improve the true log-likelihood.
Visualization of the EM Algorithm (optional)
The EM algorithm involves alternately computing a lower bound on the log likelihood for the current parameter values and then maximizing this bound to obtain the new parameter values.
GMM E-Step: Responsibilities (optional)
Let's see how it works for the GMM:
Conditional probability (using Bayes’ rule) of z given x
$$r_k = \Pr(z = k \mid \mathbf{x}) = \frac{\Pr(z = k)\, p(\mathbf{x} \mid z = k)}{p(\mathbf{x})} = \frac{p(z = k)\, p(\mathbf{x} \mid z = k)}{\sum_{j=1}^{K} p(z = j)\, p(\mathbf{x} \mid z = j)}$$