CSCI 567: Machine Learning

Lecture 10, Nov 10

Administrivia

• Start on your project if you haven’t already!
• Make groups (of 4) by tomorrow (Nov 11), minimum team size is 3.
• Top teams as of Nov 16 can get a bonus!

• HW4 is due in about one week (Nov 16 at 2pm).
• We’ll release another question on Gaussian mixture models tomorrow.

• Today’s plan:
• Clustering
• Gaussian Mixture Models and Expectation Maximization (EM)
• In the discussion, we will go over popular evaluation metrics for supervised learning

A simplistic taxonomy of ML

Supervised learning: aim to predict outputs of future datapoints

Unsupervised learning: aim to discover hidden patterns and explore data

Reinforcement learning: aim to make sequential decisions

Clustering

Clustering

• Introduction

• Formalizing and solving the objective (alternating minimization)

• k-means algorithm

Given: a set of data points (feature vectors), without labels

Output: group the data into some clusters, which means

• assign each point to a specific cluster

• find the center (representative/prototype/…) of each cluster

Clustering: Informal definition

Given: data points $x_1, \ldots, x_n \in \mathbb{R}^d$ and #clusters $k$ we want

Output: group the data into $k$ clusters, which means,

• find an assignment $\gamma_{ij} \in \{0, 1\}$ for each data point $i \in [n]$ and cluster $j \in [k]$ s.t. $\sum_{j \in [k]} \gamma_{ij} = 1$ for any fixed $i$

• find the cluster centers $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$

Clustering: More formal definition

Clustering is one of the most fundamental ML tasks, with many applications:

• recognize communities in a social network

• group similar customers in market research

• image segmentation

• accelerate other algorithms (e.g. nearest neighbor classification)

Many applications

Clustering

• Introduction

• Formalizing and solving the objective (alternating minimization)

• k-means algorithm

As with PCA, no ground-truth to even measure the quality of the answer (no labels given).

What is the high-level goal here?

We want to partition the points into k clusters, such that points within each cluster are close to their
cluster center.

We can turn this into an optimization problem: find $\{\gamma_{ij}\}$ and $\{\mu_j\}$ to minimize

$$\sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \|x_i - \mu_j\|_2^2,$$

i.e. the sum of squared distances of each point to its center. This is the “k-means” objective.

Formal objective
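For concreteness, here is a minimal NumPy sketch of evaluating this objective; the function name kmeans_objective and the use of an index vector in place of the 0/1 matrix $\gamma_{ij}$ are illustrative choices, not from the slides.

```python
import numpy as np

def kmeans_objective(X, centers, assignments):
    """Sum of squared distances from each point to its assigned cluster center.

    X:           (n, d) data matrix
    centers:     (k, d) cluster centers mu_1, ..., mu_k
    assignments: (n,)   index of the cluster each point belongs to
    """
    # centers[assignments] picks out each point's own center, row by row.
    diffs = X - centers[assignments]
    return float(np.sum(diffs ** 2))
```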

Unfortunately, finding the exact minimizer of the k-means objective is NP-hard!

Therefore, we use a heuristic (alternating minimization) that alternatingly minimizes over $\{\gamma_{ij}\}$ and $\{\mu_j\}$:

Initialize $\{\mu_j^{(0)} : j \in [k]\}$

For $t = 1, 2, \ldots$

$$\{\gamma_{ij}^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\gamma_{ij}\}} \sum_{i,j} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2$$

$$\{\mu_j^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\mu_j\}} \sum_{i,j} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2$$

How to solve this? Alternating minimization

The first step,

$$\{\gamma_{ij}^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\gamma_{ij}\}} \sum_{i,j} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2,$$

is simply to assign each $x_i$ to the closest $\mu_j^{(t-1)}$, i.e.

$$\gamma_{ij}^{(t)} = \mathbb{I}\left[\, j = \mathop{\mathrm{argmin}}_{c \in [k]} \|x_i - \mu_c^{(t-1)}\|_2^2 \,\right]$$

for all $j \in [k]$ and $i \in [n]$.

Alternating minimization: Closer look

The second step,

$$\{\mu_j^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\mu_j\}} \sum_{i,j} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2,$$

is simply to average the points of each cluster (hence the name):

$$\mu_j^{(t)} = \frac{\sum_{i : \gamma_{ij}^{(t)} = 1} x_i}{|\{i : \gamma_{ij}^{(t)} = 1\}|}$$

for each $j \in [k]$.

Alternating minimization: Closer look

Clustering

• Introduction

• Formalizing and solving the objective (alternating minimization)

• k-means algorithm

Step 0 Initialize $\mu_1, \ldots, \mu_k$

Step 1 With the centers $\mu_1, \ldots, \mu_k$ fixed, assign each point to the closest center:

$$\gamma_{ij} = \mathbb{I}\left[\, j = \mathop{\mathrm{argmin}}_{c \in [k]} \|x_i - \mu_c\|_2^2 \,\right]$$

Step 2 With the assignments $\{\gamma_{ij}\}$ fixed, update the centers:

$$\mu_j = \frac{\sum_{i} \gamma_{ij} x_i}{\sum_{i} \gamma_{ij}}$$

Step 3 Return to Step 1 if not converged (convergence means that all the assignments $\gamma_{ij}$ are unchanged in Step 1).

k-means algorithm
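The three steps above translate into a short NumPy sketch along these lines; the function name, the random-data-point initialization, and the iteration cap are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Alternating minimization for the k-means objective (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 0: initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    assignments = np.full(n, -1)

    for _ in range(max_iters):
        # Step 1: assign each point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        new_assignments = dists.argmin(axis=1)
        # Step 3 check: converged when no assignment changes.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignments
```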

k-means algorithm: Example

k-means will converge in a finite number of iterations, why?

• objective strictly decreases at each step if the algorithm has not converged.

k-means algorithm: Convergence

Why? For $t = 1, 2, \ldots$, each iteration performs

$$\{\gamma_{ij}^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\gamma_{ij}\}} \sum_{i,j} \gamma_{ij} \|x_i - \mu_j^{(t-1)}\|_2^2 \quad\text{and}\quad \{\mu_j^{(t)}\} = \mathop{\mathrm{argmin}}_{\{\mu_j\}} \sum_{i,j} \gamma_{ij}^{(t)} \|x_i - \mu_j\|_2^2,$$

and each step minimizes the objective over one set of variables with the other fixed, so the objective cannot increase.

k-means will converge in a finite number of iterations, why?

• objective strictly decreases at each step if the algorithm has not converged.

• #possible assignments is finite ($k^n$, exponentially large though)

Therefore, the algorithm must converge in at most $k^n$ steps.

Why? More specifically, why can’t the algorithm cycle between different clusterings?

• Suppose the algorithm finds the same clustering at time steps t1 and t2.

• Since the objective function value decreases at every step, this means the same clustering (at
time steps t1 and t2) has two different costs, which is not possible.

• Therefore, by contradiction, the algorithm cannot cycle between clusterings.

k-means algorithm: Convergence

k-means will converge in a finite number of iterations, why?

• objective strictly decreases at each step if the algorithm has not converged.

• #possible assignments is finite ($k^n$, exponentially large though)

• it could take exponentially many iterations to converge

• and it might not converge to the global minimum of the k-means objective

k-means algorithm: Convergence

There are different ways to initialize:

• randomly pick $k$ points as initial centers $\mu_1, \ldots, \mu_k$

• or randomly assign each point to a cluster, then average to find centers

• or more sophisticated approaches (e.g. k-means++)

Initialization matters for convergence.

k-means algorithm: How to initialize?
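As a reference for the last bullet, here is a hedged sketch of k-means++-style seeding, where each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far; the function name is hypothetical.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """k-means++ seeding: spread out the initial centers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First center: a uniformly random data point.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next center with probability proportional to d2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers, dtype=float)
```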

k-means algorithm: Local vs Global minima

k-means converges immediately in both cases, but

• the left has k-means objective $L^2 = 4W^2$

• the right has k-means objective $W^2$, 4 times better than the left!

• in fact, the left is a local minimum, and the right is the global minimum.

Simple example: 4 data points, 2 clusters, 2 different initializations

k-means algorithm: Summary

• Clustering is a fundamental unsupervised learning task.

• k-means is an alternating minimization algorithm for the k-means objective.

• The algorithm always converges, but it can converge to a local minimum.

• Initialization matters a lot for the convergence. There are principled initialization schemes,
which have guarantees on the solution they find (e.g. k-means++).

Gaussian mixture

Gaussian Mixture Model

• Introduction

• Learning the parameters

• EM algorithm

• EM for the Gaussian Mixture Model

The Gaussian mixture model (GMM) is a probabilistic approach to clustering

• more explanatory than minimizing the k-means objective

• can be seen as a soft version of k-means

To solve GMM, we will introduce a powerful method for learning probabilistic models:
the Expectation Maximization (EM) algorithm.

Gaussian mixture models

For classification, we discussed the sigmoid model to “explain”
how the labels are generated.

Similarly, for clustering, we want to come up with a probabilis-
tic model p to “explain” how the data is generated.

That is, each point is an independent sample $x \sim p$.

Why do generative modelling?

• can generate data from p

• can estimate the probability of seeing any datapoint (useful for many tasks, such as finding outliers/anomalies in the data)

A generative model

What probabilistic model
generates data like this?

GMM is a natural model to explain such data.

Assume there are 3 ground-truth Gaussian models.

To generate a point, we

• first randomly pick one of the Gaussian models,

• then draw a point according to this Gaussian.

Hence the name “Gaussian mixture model”.

GMM: Intuition

Figure from Wikipedia

A GMM has the following density function:

$$p(x) = \sum_{j=1}^{k} \omega_j\, N(x \mid \mu_j, \Sigma_j)$$

• $k$: the number of Gaussian components (same as #clusters we want)

• $\omega_1, \ldots, \omega_k$: mixture weights, a distribution over the $k$ components

• $\mu_j$ and $\Sigma_j$: mean and covariance matrix of the $j$-th Gaussian

• $N$: the density function for a Gaussian

GMM: Formal definition
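A small sketch of evaluating this density, assuming scipy.stats.multivariate_normal for the Gaussian density $N$; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_j omega_j * N(x | mu_j, Sigma_j).

    weights: (k,)      mixture weights summing to 1
    means:   (k, d)    component means
    covs:    (k, d, d) component covariance matrices
    """
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=S)
        for w, m, S in zip(weights, means, covs)
    )
```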

By introducing a latent variable $z \in [k]$, which indicates cluster membership, we can see $p$ as a marginal distribution:

$$p(x) = \sum_{j=1}^{k} p(x, z = j) = \sum_{j=1}^{k} p(z = j)\, p(x \mid z = j) = \sum_{j=1}^{k} \omega_j\, N(x \mid \mu_j, \Sigma_j)$$

x and z are both random variables drawn from the model

• x is observed

• z is unobserved/latent

Another view
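To make this two-stage generative view concrete, here is a minimal sampling sketch (draw $z$ from the mixture weights, then draw $x$ from the chosen Gaussian); the names and seeding are illustrative.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n points: pick a component z ~ weights, then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    k = len(weights)
    # Latent cluster memberships (unobserved in practice).
    z = rng.choice(k, size=n, p=weights)
    # Observed points drawn from the chosen Gaussian components.
    x = np.stack([rng.multivariate_normal(means[j], covs[j]) for j in z])
    return x, z
```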

An example

The conditional distributions are

$$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1), \quad p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2), \quad p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$$

The marginal distribution is

$$p(x) = p(\text{red})\, N(x \mid \mu_1, \Sigma_1) + p(\text{blue})\, N(x \mid \mu_2, \Sigma_2) + p(\text{green})\, N(x \mid \mu_3, \Sigma_3)$$

Learning a GMM means finding all the parameters $\theta = \{\omega_j, \mu_j, \Sigma_j\}_{j=1}^{k}$.

In the process, we will learn the distribution of the latent variable $z_i$ as well:

$$p(z_i = j \mid x_i) := \gamma_{ij} \in [0, 1],$$

i.e. a “soft assignment” of each point to each cluster, as opposed to the “hard assignment” by k-means.

GMM is more explanatory than k-means

• both learn the cluster centers µj’s

• in addition, GMM learns the cluster weights $\omega_j$ and covariances $\Sigma_j$, thus

• we can predict probability of seeing a new point

• we can generate synthetic data

Learning GMMs

Gaussian Mixture Model

• Introduction

• Learning the parameters

• EM algorithm

• EM for the Gaussian Mixture Model

As always, we want to do maximum-likelihood estimation (MLE): find

$$\mathop{\mathrm{argmax}}_{\theta} \prod_{i=1}^{n} p(x_i ; \theta) = \mathop{\mathrm{argmax}}_{\theta} \sum_{i=1}^{n} \ln p(x_i ; \theta) =: \mathop{\mathrm{argmax}}_{\theta} P(\theta)$$

This is called the incomplete log-likelihood (since the $z_i$'s are unobserved). We can still write it down as an optimization problem by marginalizing out the $z_i$'s:

$$P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} p(z_i = j ; \theta)\, p(x_i \mid z_i = j ; \theta) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} \omega_j N(x_i \mid \mu_j, \Sigma_j)$$

This is a non-concave problem, and does not have a closed-form solution.

One solution is to still apply GD/SGD, but a much more effective approach is the Expectation
Maximization (EM) algorithm.

How do we learn the parameters?

Step 0 Initialize $\omega_j, \mu_j, \Sigma_j$ for each $j \in [k]$

Step 1 (E-Step) update the “soft assignments” (fixing the parameters):

$$\gamma_{ij} = p(z_i = j \mid x_i) \propto \omega_j N(x_i \mid \mu_j, \Sigma_j)$$

Step 2 (M-Step) update the model parameters (fixing the assignments):

$$\omega_j = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ij}, \qquad \mu_j = \frac{\sum_{i} \gamma_{ij} x_i}{\sum_{i} \gamma_{ij}}, \qquad \Sigma_j = \frac{\sum_{i} \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\mathsf{T}}}{\sum_{i} \gamma_{ij}}$$

Step 3 return to Step 1 if not converged

We will see how this is a special case of EM.

Preview of EM for learning GMMs

See Colab notebook

Gaussian Mixture Model

• Introduction

• Learning the parameters

• EM algorithm

• EM for the Gaussian Mixture Model

In general, EM is a heuristic to solve MLE with latent variables (not just for GMMs), i.e. to find the maximizer of

$$P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta) = \sum_{i=1}^{n} \ln \int p(x_i, z_i ; \theta)\, dz_i$$

• $\theta$ denotes the parameters of a general probabilistic model

• the $x_i$'s are observed random variables

• the $z_i$'s are latent variables

Again, directly solving the objective is usually complicated and does not have a closed form solution.

EM algorithm

High-level idea

Keep maximizing a lower bound of $P(\theta)$ that is more manageable

Jensen’s inequality: for a concave function $f$ (such as $\ln$), $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$.

Figure from Wikipedia

Finding the lower bound of $P$: for any distribution $q(z)$ over the latent variable,

$$\ln p(x ; \theta) = \ln \sum_{z} p(x, z ; \theta) = \ln \sum_{z} q(z)\, \frac{p(x, z ; \theta)}{q(z)} = \ln \mathbb{E}_{z \sim q}\!\left[ \frac{p(x, z ; \theta)}{q(z)} \right] \geq \mathbb{E}_{z \sim q}\!\left[ \ln \frac{p(x, z ; \theta)}{q(z)} \right],$$

where the last step uses Jensen’s inequality (since $\ln$ is concave).

Therefore, our log-likelihood can be written as

$$P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta) \;\geq\; \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i(z_i)}\!\left[ \ln \frac{p(x_i, z_i ; \theta)}{q_i(z_i)} \right] =: F(\theta, \{q_i\})

A lower bound on the log likelihood

The lower bound holds for any choice of $\{q_i\}$, so how do we choose? If we have some guess of the parameters $\theta$, we should choose $\{q_i\}$ to make the lower bound tight at that value of $\theta$, i.e. make the inequality hold with equality at that value of $\theta$.

Alternately maximizing the lower bound

The lower bound holds for any choice of $\{q_i\}$, so how do we choose? If we have some guess of the parameters $\theta$, we should choose $\{q_i\}$ to make the lower bound tight at that value of $\theta$, i.e. make the inequality hold with equality at that value of $\theta$.

Equivalently, this is the same as alternately maximizing $F$ over $\{q_i\}$ and $\theta$ (similar to k-means).

Suppose we fix $\theta^{(t)}$; what should we choose for $\{q_i^{(t)}\}$?

The inequality arises from the step where we used Jensen’s inequality. How do we get this step to hold with equality? The function inside the expectation should be a constant, i.e.

$$\frac{p(x_i, z_i ; \theta^{(t)})}{q_i(z_i)} = c_i$$

for some constant $c_i$ which does not depend on the value taken by the random variable $z_i$.

Alternately maximizing the lower bound

(continued) Since $\sum_{z_i} q_i(z_i) = 1$, we get

$$q_i^{(t)}(z_i) = \frac{p(x_i, z_i ; \theta^{(t)})}{c_i} = \frac{p(x_i, z_i ; \theta^{(t)})}{\sum_{z_i'} p(x_i, z_i' ; \theta^{(t)})} = \frac{p(x_i, z_i ; \theta^{(t)})}{p(x_i ; \theta^{(t)})} = p(z_i \mid x_i ; \theta^{(t)}),$$

i.e., the posterior distribution of $z_i$ given $x_i$ and $\theta^{(t)}$.

With this choice of $\{q_i^{(t)}\}$, we found the tightest lower bound, satisfying

$$F(\theta, \{q_i^{(t)}\}) \leq P(\theta) \ \text{ for all } \theta, \qquad F(\theta^{(t)}, \{q_i^{(t)}\}) = P(\theta^{(t)})$$

Maximizing over $\{q_i\}$

Having fixed $\{q_i^{(t)}\}$, maximize over $\theta$:

$$\mathop{\mathrm{argmax}}_{\theta} F(\theta, \{q_i^{(t)}\}) = \mathop{\mathrm{argmax}}_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}}\!\left[ \ln \frac{p(x_i, z_i ; \theta)}{q_i^{(t)}(z_i)} \right] = \mathop{\mathrm{argmax}}_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}}\!\left[ \ln p(x_i, z_i ; \theta) \right] =: \mathop{\mathrm{argmax}}_{\theta} Q(\theta ; \theta^{(t)})$$

(the $\{q_i^{(t)}\}$ are computed via $\theta^{(t)}$, and the $\ln q_i^{(t)}(z_i)$ term does not depend on $\theta$, so it can be dropped)

$Q$ is the (expected) complete likelihood and is usually more tractable

• versus the incomplete likelihood: $P(\theta) = \sum_{i=1}^{n} \ln p(x_i ; \theta)$

Maximizing over $\theta$

Step 0 Initialize $\theta^{(0)}$

Step 1 (E-Step) update the posterior of the latent variables $z_i$,

$$q_i^{(t)}(z_i) = p(z_i \mid x_i ; \theta^{(t)}),$$

and obtain the Expectation of the complete likelihood

$$Q(\theta ; \theta^{(t)}) = \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}}\!\left[ \ln p(x_i, z_i ; \theta) \right]$$

Step 2 (M-Step) update the model parameters via Maximization

$$\theta^{(t+1)} \leftarrow \mathop{\mathrm{argmax}}_{\theta} Q(\theta ; \theta^{(t)})$$

Step 3 $t \leftarrow t + 1$ and return to Step 1 if not converged

General EM algorithm

Pictorial explanation

$P(\theta)$ is non-concave, but $Q(\theta ; \theta^{(t)})$ is often concave and easy to maximize.

$$P(\theta^{(t+1)}) \;\geq\; F(\theta^{(t+1)}, \{q_i^{(t)}\}) \;\geq\; F(\theta^{(t)}, \{q_i^{(t)}\}) \;=\; P(\theta^{(t)})$$

So EM always increases the objective value and will converge
to some local maximum (similar to k-means).

Gaussian Mixture Model

• Introduction

• Learning the parameters

• EM algorithm

• EM for the Gaussian Mixture Model

$$q_i^{(t)}(z_i = j) = p\!\left(z_i = j \mid x_i ; \theta^{(t)}\right) = \frac{p(x_i, z_i = j ; \theta^{(t)})}{\sum_{c=1}^{k} p(x_i, z_i = c ; \theta^{(t)})} = \frac{p(z_i = j ; \theta^{(t)})\, p(x_i \mid z_i = j ; \theta^{(t)})}{\sum_{c=1}^{k} p(z_i = c ; \theta^{(t)})\, p(x_i \mid z_i = c ; \theta^{(t)})} = \frac{\omega_j^{(t)} N(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{c=1}^{k} \omega_c^{(t)} N(x_i \mid \mu_c^{(t)}, \Sigma_c^{(t)})}$$

This computes the “soft assignment” $\gamma_{ij} = q_i^{(t)}(z_i = j)$, i.e. the conditional probability of $x_i$ belonging to cluster $j$.

Applying EM to learn GMMs: E-Step
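A NumPy sketch of this E-step, computing the full responsibility matrix $\gamma$ (assuming scipy.stats.multivariate_normal; the function and variable names are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Compute soft assignments gamma[i, j] = p(z_i = j | x_i) for all points."""
    n, k = X.shape[0], len(weights)
    gamma = np.zeros((n, k))
    for j in range(k):
        # Unnormalized posterior: omega_j * N(x_i | mu_j, Sigma_j)
        gamma[:, j] = weights[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
    # Normalize each row so that the soft assignments sum to 1.
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```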

$$\mathop{\mathrm{argmax}}_{\theta} Q(\theta ; \theta^{(t)}) = \mathop{\mathrm{argmax}}_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}}\!\left[ \ln p(x_i, z_i ; \theta) \right] = \mathop{\mathrm{argmax}}_{\theta} \sum_{i=1}^{n} \mathbb{E}_{z_i \sim q_i^{(t)}}\!\left[ \ln p(z_i ; \theta) + \ln p(x_i \mid z_i ; \theta) \right]$$

$$= \mathop{\mathrm{argmax}}_{\{\omega_j, \mu_j, \Sigma_j\}} \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \left( \ln \omega_j + \ln N(x_i \mid \mu_j, \Sigma_j) \right)$$

To find $\omega_1, \ldots, \omega_k$, solve

$$\max_{\omega_1, \ldots, \omega_k \,:\, \omega_j \geq 0,\ \sum_j \omega_j = 1} \; \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \ln \omega_j$$

To find each $\mu_j, \Sigma_j$, solve

$$\max_{\mu_j, \Sigma_j} \; \sum_{i=1}^{n} \gamma_{ij} \ln N(x_i \mid \mu_j, \Sigma_j)$$

Applying EM to learn GMMs: M-Step

Solutions to the previous two problems are very natural: for each $j$,

$$\omega_j = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij},$$

i.e. the (weighted) fraction of examples belonging to cluster $j$;

$$\mu_j = \frac{\sum_{i=1}^{n} \gamma_{ij} x_i}{\sum_{i=1}^{n} \gamma_{ij}},$$

i.e. the (weighted) average of examples belonging to cluster $j$;

$$\Sigma_j = \frac{\sum_{i=1}^{n} \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\mathsf{T}}}{\sum_{i=1}^{n} \gamma_{ij}},$$

i.e. the (weighted) covariance of examples belonging to cluster $j$.

You will verify some of these in HW4.

Applying EM to learn GMMs: M-Step
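These three updates translate directly into an M-step sketch like the following (names are illustrative, and the small ridge added to each covariance for numerical stability is my own addition, not part of the slides).

```python
import numpy as np

def m_step(X, gamma, eps=1e-6):
    """Update (weights, means, covs) from data X (n, d) and soft assignments gamma (n, k)."""
    n, d = X.shape
    nk = gamma.sum(axis=0)                      # effective number of points per cluster
    weights = nk / n                            # (weighted) fraction per cluster
    means = (gamma.T @ X) / nk[:, None]         # (weighted) average per cluster
    covs = np.zeros((len(nk), d, d))
    for j in range(len(nk)):
        diff = X - means[j]                     # (n, d)
        covs[j] = (gamma[:, j, None] * diff).T @ diff / nk[j]   # (weighted) covariance
        covs[j] += eps * np.eye(d)              # small ridge for stability (my addition)
    return weights, means, covs
```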

EM for learning GMMs:

Step 0 Initialize $\omega_j, \mu_j, \Sigma_j$ for each $j \in [k]$

Step 1 (E-Step) update the “soft assignments” (fixing the parameters):

$$\gamma_{ij} = p(z_i = j \mid x_i) \propto \omega_j N(x_i \mid \mu_j, \Sigma_j)$$

Step 2 (M-Step) update the model parameters (fixing the assignments):

$$\omega_j = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ij}, \qquad \mu_j = \frac{\sum_{i} \gamma_{ij} x_i}{\sum_{i} \gamma_{ij}}, \qquad \Sigma_j = \frac{\sum_{i} \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^{\mathsf{T}}}{\sum_{i} \gamma_{ij}}$$

Step 3 return to Step 1 if not converged

Applying EM to learn GMMs: Putting it together
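Putting the two steps together, a minimal EM loop might look like this; it reuses the hypothetical e_step and m_step sketches above, and the initialization and fixed iteration count are illustrative choices rather than the course's reference implementation.

```python
import numpy as np

def em_gmm(X, k, num_iters=100, seed=0):
    """Run EM for a GMM by alternating E-steps and M-steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 0: a simple initialization -- uniform weights, k random data points
    # as means, identity covariances (a k-means-based initialization also works).
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.stack([np.eye(d) for _ in range(k)])

    gamma = None
    for _ in range(num_iters):
        gamma = e_step(X, weights, means, covs)   # Step 1: soft assignments
        weights, means, covs = m_step(X, gamma)   # Step 2: parameter updates
    return weights, means, covs, gamma
```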
