CSCI 567: Machine Learning
Lecture 10, Nov 10
Administrivia
• Start on your project if you haven’t already!
• Make groups (of 4) by tomorrow (Nov 11), minimum team size is 3.
• Top teams as of Nov 16 can get a bonus!
• HW4 is due in about one week (Nov 16 at 2pm).
• We’ll release another question on Gaussian mixture models tomorrow.
• Today’s plan:
• Clustering
• Gaussian Mixture Models and Expectation Maximization (EM)
• In the discussion, we will go over popular evaluation metrics for
supervised learning
A simplistic taxonomy of ML
• Supervised learning: aim to predict outputs of future datapoints
• Unsupervised learning: aim to discover hidden patterns and explore data
• Reinforcement learning: aim to make sequential decisions
Clustering
Clustering
• Introduction
• Formalizing and solving the objective (alternating minimization)
• k-means algorithm
Clustering: Informal definition
Given: a set of data points (feature vectors), without labels
Output: group the data into some clusters, which means
• assign each point to a specific cluster
• find the center (representative/prototype/...) of each cluster
Clustering: More formal definition
Given: data points $x_1, \dots, x_n \in \mathbb{R}^d$ and the number of clusters $k$ we want
Output: group the data into $k$ clusters, which means
• find assignments $\gamma_{ij} \in \{0, 1\}$ for each data point $i \in [n]$ and cluster $j \in [k]$ s.t. $\sum_{j \in [k]} \gamma_{ij} = 1$ for any fixed $i$
• find the cluster centers $\mu_1, \dots, \mu_k \in \mathbb{R}^d$
Many applications
Clustering is one of the most fundamental ML tasks, with many applications:
• recognize communities in a social network
• group similar customers in market research
• image segmentation
• accelerate other algorithms (e.g. nearest neighbor classification)
Clustering
• Introduction
• Formalizing and solving the objective (alternating minimization)
• k-means algorithm
Formal objective
As with PCA, there is no ground truth to even measure the quality of the answer (no labels are given).
What is the high-level goal here? We want to partition the points into $k$ clusters, such that points within each cluster are close to their cluster center.
We can turn this into an optimization problem: find $\gamma_{ij}$ and $\mu_j$ to minimize
$$\min_{\{\gamma_{ij}\}, \{\mu_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \|x_i - \mu_j\|_2^2$$
i.e. the sum of squared distances of each point to its center. This is the "k-means" objective.
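To make the objective concrete, here is a minimal NumPy sketch (my own illustration, not course code); the function name `kmeans_objective` and the array conventions are assumptions for this example.

```python
import numpy as np

def kmeans_objective(X, assignments, centers):
    """Sum of squared distances of each point to its assigned center.

    X           : (n, d) array of data points
    assignments : (n,) array of cluster indices in {0, ..., k-1}
                  (the one-hot gamma_{ij} collapsed to one index per point)
    centers     : (k, d) array of cluster centers mu_1, ..., mu_k
    """
    diffs = X - centers[assignments]   # residual of each point from its own center
    return np.sum(diffs ** 2)

# Tiny usage example with 4 points and 2 clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
assignments = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [5.0, 0.5]])
print(kmeans_objective(X, assignments, centers))  # 1.0
```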
How to solve this? Alternating minimization
Unfortunately, finding the exact minimizer of the k-means objective is NP-hard!
Therefore, we use a heuristic (alternating minimization) that alternately minimizes over $\{\gamma_{ij}\}$ and $\{\mu_j\}$:
Initialize $\{\mu_j^{(0)} : j \in [k]\}$.
For $t = 1, 2, \dots$:
$$\{\gamma_{ij}^{(t)}\} = \arg\min_{\{\gamma_{ij}\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2$$
$$\{\mu_j^{(t)}\} = \arg\min_{\{\mu_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2$$
Alternating minimization: Closer look
The first step
$$\{\gamma_{ij}^{(t)}\} = \arg\min_{\{\gamma_{ij}\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2$$
is simply to assign each $x_i$ to the closest $\mu_j^{(t-1)}$, i.e.
$$\gamma_{ij}^{(t)} = \mathbb{I}\left[ j = \arg\min_{j' \in [k]} \|x_i - \mu_{j'}^{(t-1)}\|_2^2 \right]$$
for all $j \in [k]$ and $i \in [n]$.
Alternating minimization: Closer look
The second step
$$\{\mu_j^{(t)}\} = \arg\min_{\{\mu_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2$$
is simply to average the points of each cluster (hence the name):
$$\mu_j^{(t)} = \frac{\sum_{i : \gamma_{ij}^{(t)} = 1} x_i}{|\{i : \gamma_{ij}^{(t)} = 1\}|}$$
for each $j \in [k]$.
Clustering
• Introduction
• Formalizing and solving the objective (alternating minimization)
• k-means algorithm
k-means algorithm
Step 0: Initialize $\mu_1, \dots, \mu_k$.
Step 1: With the centers $\mu_1, \dots, \mu_k$ fixed, assign each point to the closest center:
$$\gamma_{ij} = \mathbb{I}\left[ j = \arg\min_{j' \in [k]} \|x_i - \mu_{j'}\|_2^2 \right]$$
Step 2: With the assignments $\{\gamma_{ij}\}$ fixed, update the centers:
$$\mu_j = \frac{\sum_i \gamma_{ij} x_i}{\sum_i \gamma_{ij}}$$
Step 3: Return to Step 1 if not converged (convergence means that all the assignments $\gamma_{ij}$ are unchanged in Step 1).
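A minimal NumPy sketch of these steps (my own illustration, not the official course code); the function name `kmeans`, its arguments, and the naive handling of empty clusters are assumptions.

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """Alternating minimization for the k-means objective.

    X            : (n, d) data matrix
    init_centers : (k, d) initial centers (Step 0)
    Returns final assignments (n,) and centers (k, d).
    """
    centers = np.array(init_centers, dtype=float)
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign each point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
        new_assignments = np.argmin(dists, axis=1)
        # Step 3: converged if no assignment changed.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2: recompute each center as the mean of its assigned points.
        for j in range(centers.shape[0]):
            members = X[assignments == j]
            if len(members) > 0:               # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return assignments, centers
```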
k-means algorithm: Example
k-means algorithm: Convergence
k-means will converge in a finite number of iterations, why?
• The objective strictly decreases at each step if the algorithm has not converged: for $t = 1, 2, \dots$, each of
$$\{\gamma_{ij}^{(t)}\} = \arg\min_{\{\gamma_{ij}\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2 \quad \text{and} \quad \{\mu_j^{(t)}\} = \arg\min_{\{\mu_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2$$
can only decrease the objective.
• The number of possible assignments is finite ($k^n$, exponentially large though).
Therefore, the algorithm must converge in at most $k^n$ steps.
Why? More specifically, why can't the algorithm cycle between different clusterings?
• Suppose the algorithm finds the same clustering at time steps $t_1$ and $t_2$.
• Since the objective function value strictly decreases at every step, this means the same clustering (at time steps $t_1$ and $t_2$) has two different costs, which is not possible.
• Therefore, by contradiction, the algorithm cannot cycle between clusterings.
However,
• it could take exponentially many iterations to converge,
• and it might not converge to the global minimum of the k-means objective.
k-means algorithm: How to initialize?
There are different ways to initialize:
• randomly pick $k$ data points as the initial centers $\mu_1, \dots, \mu_k$
• or randomly assign each point to a cluster, then average to find centers
• or more sophisticated approaches (e.g. k-means++, sketched below)
Initialization matters for convergence.
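The lecture only names k-means++; as a hedged illustration, here is a sketch of its well-known seeding idea (first center uniform over the data, each subsequent center sampled with probability proportional to the squared distance to the nearest chosen center). The function name and RNG handling are my own.

```python
import numpy as np

def kmeans_plusplus_init(X, k, rng=None):
    """k-means++ style seeding: spread-out initial centers."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform over the data
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen center
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()                      # sample proportional to squared distance
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```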
k-means algorithm: Local vs Global minima
Simple example: 4 data points, 2 clusters, 2 different initializations.
k-means converges immediately in both cases, but
• the left initialization has k-means objective $4W^2$
• the right initialization has k-means objective $W^2$, 4 times better than the left!
• in fact, the left is a local minimum, and the right is the global minimum.
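To make this concrete, here is a small demo reusing the `kmeans` and `kmeans_objective` sketches above. The exact coordinates from the lecture figure are not reproduced here, so the points below are a hypothetical choice (corners of a 2W-by-W rectangle, with W = 1) that yields the same $4W^2$ vs $W^2$ gap.

```python
import numpy as np

# Hypothetical coordinates: 4 points at the corners of a 2W-by-W rectangle, W = 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]])

# Initialization A: centers above/below -> converges to the top/bottom split.
a_assign, a_centers = kmeans(X, init_centers=np.array([[1.0, 0.0], [1.0, 1.0]]))
# Initialization B: centers left/right -> converges to the left/right split.
b_assign, b_centers = kmeans(X, init_centers=np.array([[0.0, 0.5], [2.0, 0.5]]))

print(kmeans_objective(X, a_assign, a_centers))  # 4.0 = 4W^2 (local minimum)
print(kmeans_objective(X, b_assign, b_centers))  # 1.0 = W^2 (global minimum)
```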
k-means algorithm: Summary
• Clustering is a fundamental unsupervised learning task.
• k-means is an alternating minimization algorithm for the k-means objective.
• The algorithm always converges, but it can converge to a local minimum.
• Initialization matters a lot for the convergence. There are principled initialization schemes, which have guarantees on the solution they find (e.g. k-means++).
Gaussian mixture
Gaussian Mixture Model
• Introduction
• Learning the parameters
• EM algorithm
• EM for the Gaussian Mixture Model
Gaussian mixture models
The Gaussian mixture model (GMM) is a probabilistic approach to clustering:
• more explanatory than minimizing the k-means objective
• can be seen as a soft version of k-means
To learn a GMM, we will introduce a powerful method for learning probabilistic models: the Expectation Maximization (EM) algorithm.
A generative model
For classification, we discussed the sigmoid model to "explain" how the labels are generated.
Similarly, for clustering, we want to come up with a probabilistic model $p$ to "explain" how the data is generated. That is, each point is an independent sample $x \sim p$.
Why do generative modelling?
• we can generate data from $p$
• we can estimate the probability of seeing any datapoint (useful for many tasks, such as finding outliers/anomalies in the data)
GMM: Intuition
What probabilistic model generates data like this? (Figure from Wikipedia)
A GMM is a natural model to explain such data. Assume there are 3 ground-truth Gaussian components. To generate a point, we
• first randomly pick one of the Gaussian components,
• then draw a point according to this Gaussian.
Hence the name "Gaussian mixture model".
GMM: Formal definition
A GMM has the following density function:
$$p(x) = \sum_{j=1}^{k} \omega_j \, N(x \mid \mu_j, \Sigma_j)$$
where
• $k$: the number of Gaussian components (same as the number of clusters we want)
• $\omega_1, \dots, \omega_k$: mixture weights, a distribution over the $k$ components
• $\mu_j$ and $\Sigma_j$: mean and covariance matrix of the $j$-th Gaussian
• $N$: the density function of a Gaussian
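As a concrete illustration of this density (a sketch of my own, not course code), using `scipy.stats.multivariate_normal` for $N$; the toy parameter values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_j omega_j * N(x | mu_j, Sigma_j)."""
    return sum(
        w * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    )

# A toy 2-component GMM in 2D (all numbers are illustrative).
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```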
Another view
By introducing a latent variable $z \in [k]$, which indicates cluster membership, we can see $p$ as a marginal distribution:
$$p(x) = \sum_{j=1}^{k} p(x, z = j) = \sum_{j=1}^{k} p(z = j)\, p(x \mid z = j) = \sum_{j=1}^{k} \omega_j \, N(x \mid \mu_j, \Sigma_j)$$
$x$ and $z$ are both random variables drawn from the model:
• $x$ is observed
• $z$ is unobserved/latent
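This latent-variable view also tells us how to sample from the model: first draw $z$ from the mixture weights, then draw $x$ from the chosen Gaussian. A minimal NumPy sketch (the function name and toy parameters are my own, reusing the values from the previous sketch):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, rng=None):
    """Generate n points by sampling z ~ weights, then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(rng)
    zs = rng.choice(len(weights), size=n, p=weights)                  # latent memberships
    xs = np.stack([rng.multivariate_normal(means[z], covs[z]) for z in zs])
    return xs, zs

weights = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
X, z = sample_gmm(500, weights, means, covs, rng=0)
```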
An example
The conditional distributions are
$$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1), \quad p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2), \quad p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$$
The marginal distribution is
$$p(x) = p(\text{red})\, N(x \mid \mu_1, \Sigma_1) + p(\text{blue})\, N(x \mid \mu_2, \Sigma_2) + p(\text{green})\, N(x \mid \mu_3, \Sigma_3)$$
Learning GMMs
Learning a GMM means finding all the parameters $\theta = \{\omega_j, \mu_j, \Sigma_j\}_{j=1}^{k}$.
In the process, we will learn the distribution of the latent variable $z_i$ as well:
$$p(z_i = j \mid x_i) := \gamma_{ij} \in [0, 1]$$
i.e. a "soft assignment" of each point to each cluster, as opposed to the "hard assignment" by k-means.
GMM is more explanatory than k-means:
• both learn the cluster centers $\mu_j$'s
• in addition, GMM learns the cluster weights $\omega_j$ and covariances $\Sigma_j$, thus
  • we can predict the probability of seeing a new point
  • we can generate synthetic data
Gaussian Mixture Model
• Introduction
• Learning the parameters
• EM algorithm
• EM for the Gaussian Mixture Model
How do we learn the parameters?
As always, we want to do maximum-likelihood estimation (MLE): find
$$\arg\max_{\theta} \prod_{i=1}^{n} p(x_i ; \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \ln p(x_i ; \theta) =: \arg\max_{\theta} P(\theta)$$
This is called the incomplete log-likelihood (since the $z_i$'s are unobserved). We can still write it down as an optimization problem by marginalizing out the $z_i$'s:
$$\sum_{i=1}^{n} \ln p(x_i ; \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} p(x_i, z_i = j ; \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} p(z_i = j ; \theta)\, p(x_i \mid z_i = j ; \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} \omega_j N(x_i \mid \mu_j, \Sigma_j)$$
This is a non-concave problem, and does not have a closed-form solution.
One solution is to still apply GD/SGD, but a much more effective approach is the Expectation Maximization (EM) algorithm.
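For reference, here is one way (a sketch of my own, not the course notebook) to evaluate the incomplete log-likelihood $P(\theta)$ numerically, using a log-sum-exp over components for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """P(theta) = sum_i ln sum_j omega_j N(x_i | mu_j, Sigma_j)."""
    # log_terms[i, j] = ln omega_j + ln N(x_i | mu_j, Sigma_j)
    log_terms = np.stack(
        [
            np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
            for w, mu, cov in zip(weights, means, covs)
        ],
        axis=1,
    )
    return logsumexp(log_terms, axis=1).sum()
```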
Preview of EM for learning GMMs
Step 0: Initialize $\omega_j, \mu_j, \Sigma_j$ for each $j \in [k]$.
Step 1 (E-Step): update the "soft assignments" (fixing parameters)
$$\gamma_{ij} = p(z_i = j \mid x_i) \propto \omega_j N(x_i \mid \mu_j, \Sigma_j)$$
Step 2 (M-Step): update the model parameters (fixing assignments)
$$\omega_j = \frac{\sum_i \gamma_{ij}}{n}, \qquad \mu_j = \frac{\sum_i \gamma_{ij} x_i}{\sum_i \gamma_{ij}}, \qquad \Sigma_j = \frac{\sum_i \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_i \gamma_{ij}}$$
Step 3: return to Step 1 if not converged.
We will see how this is a special case of EM.
See Colab notebook
Gaussian Mixture Model
• Introduction
• Learning the parameters
• EM algorithm
• EM for the Gaussian Mixture Model
EM algorithm
In general, EM is a heuristic to solve MLE with latent variables (not just for GMMs), i.e. to find the maximizer of
$$P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta) = \sum_{i=1}^{n} \ln \int p(x_i, z_i ; \theta)\, dz_i$$
where
• $\theta$ denotes the parameters of a general probabilistic model
• $x_i$'s are observed random variables
• $z_i$'s are latent variables
Again, directly solving this objective is usually complicated and does not have a closed-form solution.
High-level idea
Keep maximizing a lower bound of $P$ that is more manageable.
The key tool is Jensen's inequality (figure from Wikipedia): for a concave function such as $\ln$, $\ln \mathbb{E}[X] \ge \mathbb{E}[\ln X]$.
A lower bound on the log likelihood
Finding the lower bound of $P$: for any distribution $q(z)$,
$$\ln p(x ; \theta) = \ln \sum_{z} p(x, z ; \theta) = \ln \sum_{z} q(z)\, \frac{p(x, z ; \theta)}{q(z)} = \ln \mathbb{E}_{z \sim q} \left[ \frac{p(x, z ; \theta)}{q(z)} \right] \ge \mathbb{E}_{z \sim q} \left[ \ln \frac{p(x, z ; \theta)}{q(z)} \right]$$
where the last step uses Jensen's inequality (since $\ln$ is concave).
Therefore, our log-likelihood can be written as
$$\sum_{i=1}^{n} \ln p(x_i ; \theta) \ge \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i(z_i)} \left[ \ln \frac{p(x_i, z_i ; \theta)}{q_i(z_i)} \right] =: F(\theta, \{q_i\})$$
Alternately maximizing the lower bound
The lower bound holds for any $\{q_i\}$, so how do we choose? If we have some guess of the parameters $\theta$, we should choose $\{q_i\}$ to try to make the lower bound tight at that value of $\theta$, i.e. make the inequality hold with equality at that value of $\theta$.
Equivalently, this is the same as alternately maximizing $F$ over $\{q_i\}$ and $\theta$ (similar to k-means).
Alternately maximizing the lower bound
Suppose we fix $\theta = \theta^{(t)}$; what should we choose for $\{q_i^{(t)}\}$?
The inequality arises from the step where we used Jensen's inequality. How do we get this step to hold with equality? The function inside the expectation should be a constant, i.e.
$$\frac{p(x_i, z_i ; \theta^{(t)})}{q_i^{(t)}(z_i)} = c_i$$
for some constant $c_i$ which does not depend on the value taken by the random variable $z_i$.
Maximizing over $\{q_i\}$
(continued) Since $\sum_{z_i} q_i^{(t)}(z_i) = 1$, we get
$$q_i^{(t)}(z_i) = \frac{p(x_i, z_i ; \theta^{(t)})}{c_i} = \frac{p(x_i, z_i ; \theta^{(t)})}{\sum_{z_i'} p(x_i, z_i' ; \theta^{(t)})} = \frac{p(x_i, z_i ; \theta^{(t)})}{p(x_i ; \theta^{(t)})} = p(z_i \mid x_i ; \theta^{(t)})$$
i.e., the posterior distribution of $z_i$ given $x_i$ and $\theta^{(t)}$.
With this choice of $\{q_i^{(t)}\}$, we found the tightest lower bound: $F(\theta, \{q_i^{(t)}\}) \le P(\theta)$ for all $\theta$, with $F(\theta^{(t)}, \{q_i^{(t)}\}) = P(\theta^{(t)})$.
Maximizing over $\theta$
Fixing $\{q_i^{(t)}\}$, maximize over $\theta$:
$$\max_{\theta} F(\theta, \{q_i^{(t)}\}) = \max_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}} \left[ \ln \frac{p(x_i, z_i ; \theta)}{q_i^{(t)}(z_i)} \right] = \max_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}} \left[ \ln p(x_i, z_i ; \theta) \right] =: \max_{\theta} Q(\theta ; \theta^{(t)})$$
(the denominator $q_i^{(t)}(z_i)$ does not depend on $\theta$, so it can be dropped; the $\{q_i^{(t)}\}$ are computed via $\theta^{(t)}$).
$Q$ is the (expected) complete likelihood and is usually more tractable,
• versus the incomplete likelihood: $P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta)$.
General EM algorithm
Step 0: Initialize $\theta^{(0)}$ and set $t = 0$.
Step 1 (E-Step): update the posterior of the latent variables $z_i$,
$$q_i^{(t)}(z_i) = p(z_i \mid x_i ; \theta^{(t)})$$
and obtain the Expectation of the complete likelihood
$$Q(\theta ; \theta^{(t)}) = \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}} \left[ \ln p(x_i, z_i ; \theta) \right]$$
Step 2 (M-Step): update the model parameters via Maximization
$$\theta^{(t+1)} \leftarrow \arg\max_{\theta} Q(\theta ; \theta^{(t)})$$
Step 3: $t \leftarrow t + 1$ and return to Step 1 if not converged.
Pictorial explanation
$P(\theta)$ is non-concave, but $Q(\theta ; \theta^{(t)})$ often is concave and easy to maximize.
$$P(\theta^{(t+1)}) \ge F(\theta^{(t+1)}, \{q_i^{(t)}\}) \ge F(\theta^{(t)}, \{q_i^{(t)}\}) = P(\theta^{(t)})$$
So EM always increases the objective value and will converge to some local maximum (similar to k-means).
Gaussian Mixture Model
• Introduction
• Learning the parameters
• EM algorithm
• EM for the Gaussian Mixture Model
Applying EM to learn GMMs: E-Step
$$q_i^{(t)}(z_i = j) = p(z_i = j \mid x_i ; \theta^{(t)}) = \frac{p(x_i, z_i = j ; \theta^{(t)})}{\sum_{j'} p(x_i, z_i = j' ; \theta^{(t)})} \propto p(z_i = j ; \theta^{(t)})\, p(x_i \mid z_i = j ; \theta^{(t)}) = \omega_j^{(t)} N(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})$$
This computes the "soft assignment" $\gamma_{ij} = q_i^{(t)}(z_i = j)$, i.e. the conditional probability of $x_i$ belonging to cluster $j$.
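A minimal sketch of this E-step in NumPy/SciPy (my own illustration): compute the unnormalized terms $\omega_j N(x_i \mid \mu_j, \Sigma_j)$ and normalize over $j$. A real implementation would typically work in log space to avoid underflow.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Return gamma with gamma[i, j] = p(z_i = j | x_i), shape (n, k)."""
    unnorm = np.stack(
        [
            w * multivariate_normal.pdf(X, mean=mu, cov=cov)   # omega_j * N(x_i | mu_j, Sigma_j)
            for w, mu, cov in zip(weights, means, covs)
        ],
        axis=1,
    )
    return unnorm / unnorm.sum(axis=1, keepdims=True)          # normalize over components
```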
Applying EM to learn GMMs: M-Step
$$\arg\max_{\theta} Q(\theta ; \theta^{(t)}) = \arg\max_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}} \left[ \ln p(x_i, z_i ; \theta) \right] = \arg\max_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}} \left[ \ln p(z_i ; \theta) + \ln p(x_i \mid z_i ; \theta) \right]$$
$$= \arg\max_{\{\omega_j, \mu_j, \Sigma_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \left( \ln \omega_j + \ln N(x_i \mid \mu_j, \Sigma_j) \right)$$
To find $\omega_1, \dots, \omega_k$, solve
$$\max_{\omega_1, \dots, \omega_k : \sum_j \omega_j = 1} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \ln \omega_j$$
To find each $\mu_j, \Sigma_j$, solve
$$\max_{\mu_j, \Sigma_j} \sum_{i=1}^{n} \gamma_{ij} \ln N(x_i \mid \mu_j, \Sigma_j)$$
Applying EM to learn GMMs: M-Step
The solutions to the previous two problems are very natural. For each $j$:
$$\omega_j = \frac{\sum_{i=1}^{n} \gamma_{ij}}{n}, \quad \text{i.e. the (weighted) fraction of examples belonging to cluster } j$$
$$\mu_j = \frac{\sum_{i=1}^{n} \gamma_{ij} x_i}{\sum_{i=1}^{n} \gamma_{ij}}, \quad \text{i.e. the (weighted) average of examples belonging to cluster } j$$
$$\Sigma_j = \frac{\sum_{i=1}^{n} \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{n} \gamma_{ij}}, \quad \text{i.e. the (weighted) covariance of examples belonging to cluster } j$$
You will verify some of these in HW4.
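These three updates translate directly into NumPy; the sketch below is my own (not the HW solution), where `gamma` is the $(n, k)$ soft-assignment matrix from the E-step.

```python
import numpy as np

def m_step(X, gamma):
    """Return updated (weights, means, covs) given soft assignments gamma of shape (n, k)."""
    n, k = gamma.shape
    nk = gamma.sum(axis=0)                                   # effective number of points per cluster
    weights = nk / n                                         # omega_j
    means = (gamma.T @ X) / nk[:, None]                      # mu_j: weighted average
    covs = []
    for j in range(k):
        diff = X - means[j]                                  # (n, d)
        covs.append((gamma[:, j, None] * diff).T @ diff / nk[j])   # weighted covariance
    return weights, means, np.array(covs)
```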
Applying EM to learn GMMs: Putting it together
EM for learning GMMs:
Step 0: Initialize $\omega_j, \mu_j, \Sigma_j$ for each $j \in [k]$.
Step 1 (E-Step): update the "soft assignments" (fixing parameters)
$$\gamma_{ij} = p(z_i = j \mid x_i) \propto \omega_j N(x_i \mid \mu_j, \Sigma_j)$$
Step 2 (M-Step): update the model parameters (fixing assignments)
$$\omega_j = \frac{\sum_i \gamma_{ij}}{n}, \qquad \mu_j = \frac{\sum_i \gamma_{ij} x_i}{\sum_i \gamma_{ij}}, \qquad \Sigma_j = \frac{\sum_i \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_i \gamma_{ij}}$$
Step 3: return to Step 1 if not converged.
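Putting the two steps together, a minimal EM loop might look as follows. This is my own sketch reusing the `e_step`, `m_step`, and `gmm_log_likelihood` functions defined above; the random initialization and the stopping criterion are simplistic placeholders, not the course's choices.

```python
import numpy as np

def em_gmm(X, k, n_iter=100, tol=1e-6, rng=None):
    """Fit a k-component GMM with EM; returns (weights, means, covs, gamma)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Step 0: simple random initialization (random points as means, identity covariances).
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, weights, means, covs)              # Step 1 (E-step)
        weights, means, covs = m_step(X, gamma)              # Step 2 (M-step)
        ll = gmm_log_likelihood(X, weights, means, covs)     # monitor the objective
        if ll - prev_ll < tol:                               # Step 3: stop when it stops improving
            break
        prev_ll = ll
    return weights, means, covs, gamma
```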