
CSC 311: Introduction to Machine Learning
Lecture 7 – k-Means and EM Algorithm
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews

Unsupervised Learning: Motivating Examples
Some examples of situations where you would use unsupervised learning:
- You want to understand how a scientific field has changed over time. You want to take a large database of papers and model how the distribution of topics changes from year to year. But what are the topics?
- You are a biologist studying animal behaviour, so you want to infer a high-level description of their behaviour from video. You don't know the set of behaviours ahead of time.
- You want to reduce your energy consumption, so you take a time series of your energy usage and try to break it down into separate components (when the refrigerator, washing machine, etc. were operating or not).
Common theme: you have some data, and you want to infer the structure underlying the data.
This structure is latent, which means it is not observed.

Motivating Examples
Determine groups of people in the image above
- based on clothing styles
- gender, age, etc.

Today's lecture:
- K-means, a simple algorithm for clustering, i.e., grouping data points into clusters
- Reformulating clustering as a latent variable model and applying the EM algorithm

Clustering
Sometimes the data form clusters, where samples within a cluster are similar to each other, and samples in different clusters are dissimilar:
Such a distribution is multimodal, since it has multiple modes, or regions of high probability mass.
Grouping data points into clusters, with no observed labels, is called clustering. It is an unsupervised learning technique.
Example: clustering machine learning papers based on topic (deep learning, Bayesian models, etc.)
- But topics are never observed (unsupervised).

Clustering problem
Assume that the data points {x^(1), . . . , x^(N)} live in a Euclidean space, i.e., x^(n) ∈ R^D.
Assume that each data point belongs to one of K clusters.
Assume that data points from the same cluster are similar, i.e., close in Euclidean distance.
How can we identify those clusters and the data points that belong to each cluster?

K-means Objective
Let’s formulate this as an optimization problem
K-means Objective:
Find cluster centres {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned cluster centres.
- Data sample n = 1, ..., N: x^(n) ∈ R^D (observed)
- Cluster centre k = 1, ..., K: m_k ∈ R^D (not observed)
- Responsibilities: cluster assignment for sample n: r^(n) ∈ R^K, a 1-of-K encoding (not observed)
Mathematically:
  min_{{m_k},{r^(n)}} J({m_k},{r^(n)}) = min_{{m_k},{r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||²,
where r_k^(n) = I{x^(n) is assigned to cluster k}, e.g., r^(n) = [0, ..., 1, ..., 0]^T.
Finding an optimal solution is an NP-hard problem!
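To make the objective concrete, here is a minimal NumPy sketch (function and argument names are ours, not from the slides) that evaluates J for a given set of centres and hard assignments; the integer vector `assignments` plays the role of the 1-of-K vectors r^(n):

```python
import numpy as np

def kmeans_cost(X, centres, assignments):
    """Sum of squared distances of each point to its assigned centre.

    X: (N, D) data, centres: (K, D), assignments: (N,) integer cluster indices
    (an integer index is equivalent to the 1-of-K responsibility vector r^(n)).
    """
    diffs = X - centres[assignments]   # ||m_{k(n)} - x^(n)|| for each n, vectorized
    return np.sum(diffs ** 2)
```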

K-means Objective
Optimization problem:
  min_{{m_k},{r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||²
The inner term r_k^(n) ||m_k − x^(n)||² is the squared distance between x^(n) and its assigned cluster centre.
Since r_k^(n) = I{x^(n) is assigned to cluster k} (e.g., r^(n) = [0, ..., 1, ..., 0]^T), the inner sum is over K terms but only one of them is non-zero.
For example, if data point x^(n) is assigned to cluster k = 3, then r^(n) = [0, 0, 1, 0, ...]^T and
  Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||² = ||m_3 − x^(n)||².

How to Optimize? Alternating Minimization
Optimization problem:
  min_{{m_k},{r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||²
The problem is hard when minimizing jointly over the parameters {m_k}, {r^(n)}.
But if we fix one set and minimize over the other, it becomes easy. This doesn't guarantee the same solution as the joint minimization!

Alternating Minimization (Optimizing Assignments)
Optimization problem:
  min_{{m_k},{r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||²
- If we fix the centres {m_k}, we can easily find the optimal assignments {r^(n)}: for each sample n, minimize
    Σ_{k=1}^K r_k^(n) ||m_k − x^(n)||².
- Assign each point to the cluster with the nearest centre:
    r_k^(n) = 1 if k = argmin_j ||x^(n) − m_j||², and 0 otherwise.
- E.g., if x^(n) is assigned to cluster k̂, then r^(n) = [0, 0, ..., 1, ..., 0]^T with only the k̂-th entry equal to 1.

Alternating Minimization (Optimizing Centres)
If we fix the assignments {r^(n)}, then we can easily find the optimal centres {m_k}:
- Setting the derivative of J = Σ_n Σ_k r_k^(n) ||m_k − x^(n)||² with respect to m_l to zero gives, for l = 1, 2, ..., K,
    ∂J/∂m_l = 2 Σ_n r_l^(n) (m_l − x^(n)) = 0  ⟹  m_l = Σ_n r_l^(n) x^(n) / Σ_n r_l^(n).
- That is, set each cluster's centre to the average of its assigned data points.
Let's alternate between minimizing J({m_k}, {r^(n)}) with respect to {m_k} and {r^(n)}.
This is called alternating minimization.

K-means Algorithm
High level overview of algorithm:
Initialization: randomly initialize cluster centres
The algorithm iteratively alternates between two steps:
- Assignment step: Assign each data point to the closest cluster
- Refitting step: Move each cluster centre to the mean of the data assigned to it
[Figure: assignments (left) and refitted means (right)]

Figure from Bishop.
Simple demo: http://syskall.com/kmeans.js/

The K-means Algorithm
Initialization: Set K cluster means m_1, ..., m_K to random values.
Repeat until convergence (until assignments do not change):
- Assignment: Optimize J w.r.t. {r}: each data point x^(n) is assigned to the nearest centre,
    k̂^(n) = argmin_k ||m_k − x^(n)||²,
  with responsibilities (1-hot or 1-of-K encoding) r_k^(n) = I{k̂^(n) = k} for k = 1, ..., K.
- Refitting: Optimize J w.r.t. {m}: each centre is set to the mean of the data assigned to it,
    m_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n).
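A minimal NumPy sketch of this algorithm (initializing the centres at K random data points is one common choice; the slide just says "random values"):

```python
import numpy as np

def kmeans(X, K, max_iters=100, rng=None):
    """Minimal K-means sketch. X: (N, D) array; returns centres (K, D)
    and integer assignments (N,)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    centres = X[rng.choice(N, size=K, replace=False)].astype(float)  # random initialization
    assignments = np.full(N, -1)

    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_assignments = d2.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments unchanged => converged (to a local minimum)
        assignments = new_assignments

        # Refitting step: move each centre to the mean of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:  # guard against empty clusters
                centres[k] = members.mean(axis=0)
    return centres, assignments
```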

K-means for Vector Quantization
Figure from Bishop
Given an image, construct a "dataset" of pixels represented by their RGB intensities.
Run K-means, then replace each pixel by its cluster centre.
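A rough sketch of this colour-quantization idea (not the exact figure from Bishop), using scikit-learn's KMeans; `img` is assumed to be an (H, W, 3) RGB array, fabricated here just so the snippet runs:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))       # stand-in for a real image loaded from disk

K = 8
pixels = img.reshape(-1, 3)                       # "dataset" of pixels
km = KMeans(n_clusters=K, n_init=10).fit(pixels)  # run K-means on RGB values

# Replace each pixel by its cluster centre.
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
```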

K-means for Image Segmentation
Given an image, construct a "dataset" of pixels, represented by their RGB intensities and grid locations.
Run k-means (with some modifications) to get superpixels

Questions about K-means
Why does the update set m_k to the mean of the assigned points? What if we used a different distance measure?
How can we choose the best distance?
How to choose K?
Will it converge?
Hard cases – unequal spreads, non-circular spreads, in-between points

Why K-means Converges
The K-means algorithm reduces the cost at each iteration:
- Whenever an assignment is changed, the sum of squared distances J of the data points from their assigned cluster centres is reduced.
- Whenever a cluster centre is moved, J is reduced.
Test for convergence: If the assignments do not change in the assignment step, we have converged (to at least a local minimum).
This will always happen after a finite number of iterations, since the number of possible cluster assignments is finite.
K-means cost function after each assignment step (blue) and refitting step (red). The algorithm has converged after the third refitting step.

Local Minima
The objective J is non-convex (so coordinate descent on J is not guaranteed to converge to the global minimum)
There is nothing to prevent k-means getting stuck at local minima.
We could try many random starting points
A bad local optimum

Soft K-means
Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of 0.7 for a datapoint and another may have a responsibility of 0.3.
- This allows a cluster to use more information about the data in the refitting step.
- How do we decide on the soft assignments?
- We already saw this in multi-class classification: 1-of-K encoding vs. softmax assignments.

Soft K-means Algorithm
Initialization: Set K means {mk} to random values
Repeat until convergence (measured by how much J changes):
- Assignment: Each data point n is given a soft "degree of assignment" to each cluster mean k, based on responsibilities
    r_k^(n) = exp[−β ||m_k − x^(n)||²] / Σ_{j=1}^K exp[−β ||m_j − x^(n)||²],
  i.e., r^(n) = softmax(−β {||m_k − x^(n)||²}_{k=1}^K).
- Refitting: The model parameters (the means) are adjusted to match the sample means of the data points they are responsible for:
    m_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n).
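A minimal NumPy sketch of these two steps (names and the random-data-point initialization are our choices, not specified by the slides):

```python
import numpy as np

def soft_kmeans(X, K, beta=1.0, max_iters=100, rng=None):
    """Minimal soft K-means sketch with stiffness parameter beta.
    X: (N, D); returns centres (K, D) and responsibilities R (N, K)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    centres = X[rng.choice(N, size=K, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment: responsibilities via a softmax over -beta * squared distance.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)

        # Refitting: responsibility-weighted means.
        centres = (R.T @ X) / R.sum(axis=0)[:, None]
    return centres, R
```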

Questions about Soft K-means
Some remaining issues:
How to set β?
Clusters with unequal weight and width?
These aren’t straightforward to address with K-means. Instead, in the sequel, we’ll reformulate clustering using a generative model.
As β → ∞, soft k-Means becomes k-Means! (Exercise)
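A quick sketch of why (not shown on the slide): write d_j := ||m_j − x^(n)||² and let k* = argmin_j d_j, assuming the closest centre is unique. Dividing the numerator and denominator of the responsibility by exp[−β d_{k*}] gives
  r_k^(n) = exp[−β (d_k − d_{k*})] / Σ_{j=1}^K exp[−β (d_j − d_{k*})].
As β → ∞, every term with d_j > d_{k*} vanishes, so r_k^(n) → I{k = k*}, which is exactly the hard assignment used by K-means.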

A Generative View of Clustering
Next: probabilistic formulation of clustering
We need a sensible measure of what it means to cluster the data well
- This makes it possible to judge different methods
- It may help us decide on the number of clusters
An obvious approach is to imagine that the data was produced by a generative model
- Then we adjust the model parameters using maximum likelihood, i.e., to maximize the probability that the model would produce exactly the data we observed

The Generative Model
We'll be working with the following generative model for the data D.
Assume a data point x is generated as follows:
- Choose a cluster z from {1, ..., K} such that p(z = k) = π_k
- Given z, sample x from a Gaussian distribution N(x | μ_z, I)
This can also be written:
  p(z = k) = π_k,   p(x | z = k) = N(x | μ_k, I)
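A minimal sketch of this sampling process in NumPy (identity covariance, as assumed in this lecture; function and variable names are ours):

```python
import numpy as np

def sample_gmm(n, pis, mus, rng=None):
    """Draw n samples from the model: z ~ Categorical(pi), then x ~ N(mu_z, I).
    pis: (K,) mixing probabilities summing to 1; mus: (K, D) cluster means."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice(len(pis), size=n, p=pis)               # choose a cluster for each sample
    x = mus[z] + rng.standard_normal((n, mus.shape[1]))   # Gaussian noise, identity covariance
    return z, x

# e.g., three well-separated 2-D clusters
z, x = sample_gmm(500, pis=np.array([0.5, 0.3, 0.2]),
                  mus=np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]))
```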

Clusters from Generative Model
This defines a joint distribution p(z, x) = p(z) p(x | z) with parameters {π_k, μ_k}_{k=1}^K.
The marginal of x is given by p(x) = Σ_z p(z, x).
p(z = k | x) can be computed using Bayes' rule:
  p(z = k | x) = p(x | z = k) p(z = k) / p(x).
This tells us the probability that x comes from the k-th cluster.

The Generative Model
500 points drawn from a mixture of 3 Gaussians.
a) Samples from p(x | z) b) Samples from the marginal p(x) c) Responsibilities p(z | x)

Maximum Likelihood with Latent Variables
How should we choose the parameters {π_k, μ_k}_{k=1}^K?
Maximum likelihood principle: choose parameters to maximize the likelihood of the observed data.
We don't observe the cluster assignments z; we only see the data x.
Given data D = {x^(n)}_{n=1}^N, choose parameters to maximize:
  log p(D) = Σ_{n=1}^N log p(x^(n))
We can find p(x) by marginalizing out z:
  p(x) = Σ_{k=1}^K p(z = k, x) = Σ_{k=1}^K p(z = k) p(x | z = k)

Gaussian Mixture Model (GMM)
What is p(x)?
  p(x) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K π_k N(x | μ_k, I)
This distribution is an example of a Gaussian Mixture Model (GMM), and the π_k are known as the mixing coefficients.
In general, we would have a different covariance for each cluster, i.e., p(x | z = k) = N(x | μ_k, Σ_k). For this lecture, we assume Σ_k = I for simplicity.
If we allow arbitrary covariance matrices, GMMs are universal approximators of densities (if you have enough Gaussians). Even diagonal GMMs are universal approximators.
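As an illustrative sketch (ours, under the Σ_k = I assumption of this lecture), here is how log p(x) can be evaluated for a batch of points with a numerically stable log-sum-exp over components:

```python
import numpy as np

def gmm_log_density(X, pis, mus):
    """log p(x) under a GMM with identity covariances.
    X: (N, D), pis: (K,), mus: (K, D)."""
    N, D = X.shape
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)                 # (N, K)
    log_comp = np.log(pis)[None, :] - 0.5 * d2 - 0.5 * D * np.log(2 * np.pi)  # log[pi_k N(x|mu_k, I)]
    m = log_comp.max(axis=1, keepdims=True)                                   # log-sum-exp over k
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()
```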

Visualizing a Mixture of Gaussians – 1D Gaussians
If you fit one Gaussian distribution to data:
Now, we are trying to fit a GMM with K = 2:
[Slide credit: K. Kutulakos]

Visualizing a Mixture of Gaussians – 2D Gaussians

Fitting GMMs: Maximum Likelihood
Maximum likelihood objective:
  log p(D) = Σ_{n=1}^N log p(x^(n)) = Σ_{n=1}^N log ( Σ_{k=1}^K π_k N(x^(n) | μ_k, I) )
How would you optimize this w.r.t. the parameters {π_k, μ_k}?
- No closed-form solution when we set derivatives to 0
- Difficult because of the sum inside the log
One option: gradient ascent. Can we do better? Can we have a closed-form update?

Maximum Likelihood
Observation: if we knew z^(n) for every x^(n) (i.e., our dataset was D_complete = {(z^(n), x^(n))}_{n=1}^N), the maximum likelihood problem is easy:
  log p(D_complete) = Σ_{n=1}^N log p(z^(n), x^(n))
                    = Σ_{n=1}^N [ log p(x^(n) | z^(n)) + log p(z^(n)) ]
                    = Σ_{n=1}^N Σ_{k=1}^K I{z^(n) = k} [ log N(x^(n) | μ_k, I) + log π_k ]

Maximum Likelihood
  log p(D_complete) = Σ_{n=1}^N Σ_{k=1}^K I{z^(n) = k} [ log N(x^(n) | μ_k, I) + log π_k ]
We have been optimizing something similar for Naive Bayes classifiers.
By maximizing log p(D_complete), we would get:
  μ̂_k = Σ_{n=1}^N I{z^(n) = k} x^(n) / Σ_{n=1}^N I{z^(n) = k}   (class means)
  π̂_k = (1/N) Σ_{n=1}^N I{z^(n) = k}   (class proportions)

Maximum Likelihood
We haven't observed the cluster assignments z^(n), but we can compute p(z^(n) | x^(n)) using Bayes' rule.
Conditional probability (using Bayes' rule) of z given x:
  p(z = k | x) = p(z = k) p(x | z = k) / p(x)
              = p(z = k) p(x | z = k) / Σ_{j=1}^K p(z = j) p(x | z = j)
              = π_k N(x | μ_k, I) / Σ_{j=1}^K π_j N(x | μ_j, I)
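A tiny hypothetical 1-D example of this calculation with two components (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical example: pi = (0.7, 0.3), means mu = (0, 4), unit variance, observation x = 3.
pis = np.array([0.7, 0.3])
mus = np.array([0.0, 4.0])
x = 3.0

likelihoods = np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)   # N(x | mu_k, 1)
posterior = pis * likelihoods / np.sum(pis * likelihoods)          # p(z = k | x)
print(posterior)   # the second component gets most of the responsibility (~0.96)
```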

Maximum Likelihood
  log p(D_complete) = Σ_{n=1}^N Σ_{k=1}^K I{z^(n) = k} [ log N(x^(n) | μ_k, I) + log π_k ]
We don't know the cluster assignments I{z^(n) = k} (they are our latent variables), but we know their expectation E[I{z^(n) = k} | x^(n)] = p(z^(n) = k | x^(n)).
If we plug in r_k^(n) = p(z^(n) = k | x^(n)) for I{z^(n) = k}, we get:
  Σ_{n=1}^N Σ_{k=1}^K r_k^(n) [ log N(x^(n) | μ_k, I) + log π_k ]
This is still easy to optimize! Solution is similar to what we have seen:
  μ̂_k = Σ_{n=1}^N r_k^(n) x^(n) / Σ_{n=1}^N r_k^(n),   π̂_k = (Σ_{n=1}^N r_k^(n)) / N
Note: this only works if we treat r_k^(n) = π_k N(x^(n) | μ_k, I) / Σ_{j=1}^K π_j N(x^(n) | μ_j, I) as fixed.

How Can We Fit a Mixture of Gaussians?
This motivates the Expectation-Maximization algorithm, which alternates between two steps:
1. E-step: Compute the posterior probabilities r_k^(n) = p(z^(n) = k | x^(n)) given our current model, i.e., how much we think each cluster is responsible for generating a data point.
2. M-step: Use the equations on the last slide to update the parameters, assuming the r_k^(n) are held fixed: change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.

EM Algorithm for GMM
Initialize the means μ̂_k and mixing coefficients π̂_k.
Iterate until convergence:
- E-step: Evaluate the responsibilities r_k^(n) given the current parameters:
    r_k^(n) = π̂_k N(x^(n) | μ̂_k, I) / Σ_{j=1}^K π̂_j N(x^(n) | μ̂_j, I)
            = π̂_k exp{−½ ||x^(n) − μ̂_k||²} / Σ_{j=1}^K π̂_j exp{−½ ||x^(n) − μ̂_j||²}
- M-step: Re-estimate the parameters given the current responsibilities:
    μ̂_k = (1/N_k) Σ_{n=1}^N r_k^(n) x^(n)
    π̂_k = N_k / N,  with N_k = Σ_{n=1}^N r_k^(n)
- Evaluate the log likelihood and check for convergence:
    log p(D) = Σ_{n=1}^N log ( Σ_{k=1}^K π̂_k N(x^(n) | μ̂_k, I) )
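Putting the E-step, M-step, and convergence check together, here is a minimal NumPy sketch of EM for this identity-covariance GMM (initialization at random data points and the tolerance are our choices, not specified by the slides):

```python
import numpy as np

def em_gmm(X, K, max_iters=100, tol=1e-6, rng=None):
    """Minimal EM sketch for a GMM with identity covariances.
    X: (N, D); returns mixing coefficients (K,), means (K, D), responsibilities (N, K)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)  # initialize means
    pis = np.full(K, 1.0 / K)                                    # initialize mixing coefficients
    prev_ll = -np.inf

    for _ in range(max_iters):
        # E-step: responsibilities r_k^(n) proportional to pi_k * N(x^(n) | mu_k, I).
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)                    # (N, K)
        log_joint = np.log(pis)[None, :] - 0.5 * d2 - 0.5 * D * np.log(2 * np.pi)
        m = log_joint.max(axis=1, keepdims=True)
        log_px = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))        # log p(x^(n))
        R = np.exp(log_joint - log_px)                                               # (N, K)

        # Evaluate log likelihood and check for convergence.
        ll = log_px.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: responsibility-weighted means and mixing proportions.
        Nk = R.sum(axis=0)
        mus = (R.T @ X) / Nk[:, None]
        pis = Nk / N
    return pis, mus, R
```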


What Just Happened: A Review
The maximum likelihood objective Σ_{n=1}^N log p(x^(n)) was hard to optimize.
The complete data likelihood objective was easy to optimize:
  Σ_{n=1}^N log p(z^(n), x^(n)) = Σ_{n=1}^N Σ_{k=1}^K I{z^(n) = k} [ log N(x^(n) | μ_k, I) + log π_k ]
We don't know the z^(n)'s (they are latent), so we replaced I{z^(n) = k} with the responsibilities r_k^(n) = p(z^(n) = k | x^(n)).
That is: we replaced I{z^(n) = k} with its expectation under p(z^(n) | x^(n)) (E-step).

What Just Happened: A Review
We ended up with the expected complete data log-likelihood:
  Σ_{n=1}^N E_{p(z^(n)|x^(n))}[ log p(z^(n), x^(n)) ] = Σ_{n=1}^N Σ_{k=1}^K r_k^(n) [ log N(x^(n) | μ_k, I) + log π_k ],
which we maximized over the parameters {π_k, μ_k}_k (M-step).
The EM algorithm alternates between:
- The E-step: computing the r_k^(n) = p(z^(n) = k | x^(n)) (i.e., the expectations E[I{z^(n) = k} | x^(n)]) given the current model parameters π_k, μ_k
- The M-step: updating the model parameters π_k, μ_k to optimize the expected complete data log-likelihood

Relation to k-Means
The K-Means Algorithm:
1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster centre to the average of the data assigned to it
The EM Algorithm:
1. E-step: Compute the posterior probability over z given our current model
2. M-step: Change each Gaussian's parameters to maximize the probability that it would generate the data it is currently responsible for.
Can you find the similarities between the soft k-Means algorithm and the EM algorithm with shared covariance (1/β) I?
Both rely on alternating optimization methods and can suffer from bad local optima.

Further Discussion
We assumed that the covariance of each Gaussian was I to simplify the math. This assumption can be removed, allowing clusters to have different spatial spreads. The resulting algorithm is still very simple.
Possible problems with maximum likelihood objective:
- Singularities: arbitrarily large likelihood when a Gaussian explains a single point with variance shrinking to zero
- Non-convexity
EM is more general than what was covered in this lecture. Here, the EM algorithm is used to find the optimal parameters of a GMM.

A probabilistic view of clustering. Each cluster corresponds to a different Gaussian.
Model using latent variables.
A general approach: we can replace the Gaussian with other distributions (continuous or discrete).
More generally, mixture models are very powerful models, i.e., universal distribution approximators.
Optimization is done using the EM algorithm.