Unsupervised Learning I
Today
• Unsupervised learning
– K-Means clustering
– Gaussian Mixture clustering
Unsupervised Learning I
Clustering
Supervised learning
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ — each example comes with a label $y^{(i)}$
slide credit: Andrew Ng
Unsupervised learning
Training set: $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ — no labels
slide credit: Andrew Ng
Clustering
• Gene analysis
• Social network analysis
• Types of voters
• Trending news
Unsupervised Learning I
K-means Algorithm
[Figure sequence: K-means illustrated on 2-D example data — cluster centroids are initialized, each point is assigned to its closest centroid, centroids are moved to the mean of their assigned points, and the two steps repeat until convergence]
slide credit: Andrew Ng
K-means algorithm
Input:
– $K$ (number of clusters)
– Training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)
slide credit: Andrew Ng
K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Repeat {
  for $i$ = 1 to $m$
    $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
  for $k$ = 1 to $K$
    $\mu_k$ := average (mean) of points assigned to cluster $k$
}
slide credit: Andrew Ng
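A minimal NumPy sketch of the loop above (the function name `kmeans` and all variable names are my own, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means sketch: X is an (m, n) data matrix; returns assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    # Random initialization: pick K distinct training examples as initial centroids
    mu = X[rng.choice(m, size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster assignment step: c[i] := index of centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # shape (m, K)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k := mean of points assigned to cluster k
        for k in range(K):
            if np.any(c == k):                 # keep old centroid if cluster is empty
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```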
K-means Cost Function
$c^{(i)}$ = index of cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned
Optimization cost (“distortion”):
$J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$
slide credit: Andrew Ng
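As a rough illustration, the distortion above can be computed from the assignments and centroids like this (same NumPy array conventions as the earlier sketch; `distortion` is a made-up helper name):

```python
import numpy as np

def distortion(X, c, mu):
    """J = average squared distance of each x^(i) to its assigned centroid mu_{c^(i)}."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))
```

Neither K-means step can increase this value, which makes it a handy sanity check on an implementation.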
Random initialization
Should have $K < m$.
Randomly pick $K$ training examples.
Set $\mu_1, \ldots, \mu_K$ equal to these $K$ examples.
slide credit: Andrew Ng
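A one-function sketch of this initialization rule (assuming the data is an (m, n) NumPy array; `init_centroids` is a hypothetical helper name):

```python
import numpy as np

def init_centroids(X, K, seed=0):
    """Pick K distinct training examples (rows of X) as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=K, replace=False)   # requires K <= m
    return X[idx].copy()
```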
Local Optima
slide credit: Andrew Ng
Avoiding Local Optima with Random Initialization
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$.
  Compute cost function (distortion) $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
}
Pick the clustering that gave the lowest cost $J$
slide credit: Andrew Ng
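One way to sketch this in practice is with scikit-learn, whose `KMeans` estimator runs the restart loop internally via `n_init` and keeps the lowest-cost solution (its default initialization is k-means++ rather than purely random; the data `X` below is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

# n_init=100: run K-means from 100 different initializations,
# keep the run with the lowest total within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=100, random_state=0).fit(X)
print(km.inertia_)           # lowest cost found (m times the distortion J)
print(km.cluster_centers_)   # corresponding centroids
```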
How to choose K?
Elbow method:
[Figure: cost function $J$ plotted against the number of clusters (1–8) for two example datasets — one curve shows a clear “elbow”, the other does not]
slide credit: Andrew Ng
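A sketch of the elbow method with scikit-learn and matplotlib (placeholder data; `inertia_` is scikit-learn's name for the within-cluster sum of squares, i.e. $m$ times the distortion $J$):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

ks = range(1, 9)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K (no. of clusters)")
plt.ylabel("Cost function J")
plt.show()   # look for the "elbow" where the curve flattens out
```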
How to choose K?
Sometimes, you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
E.g. T-shirt sizing: [Figure: two weight vs. height scatter plots of the same customer data, clustered into a smaller vs. a larger number of T-shirt sizes]
slide credit: Andrew Ng
Unsupervised Learning I
Mixtures of Gaussians
Mixtures of Gaussians: Intuition
“Soft” cluster membership
Define a distribution over $x$. To generate each point $x$:
• Choose its cluster component $z$
• Sample $x$ from the Gaussian distribution for that component
[Figure: 2-D data points in the $(x_1, x_2)$ plane generated from a mixture of Gaussians]
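A small sketch of this two-step generative process (the values chosen for $\pi$, $\mu$, and $\Sigma$ are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for K = 2 components
pi = np.array([0.5, 0.5])                        # mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances

def sample_point():
    z = rng.choice(len(pi), p=pi)                  # choose the cluster component
    x = rng.multivariate_normal(mu[z], Sigma[z])   # sample x from that component's Gaussian
    return z, x

samples = [sample_point() for _ in range(5)]
```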
Mixtures of Gaussians: component membership variable $z$
[Figure: two Gaussian components in the $(x_1, x_2)$ plane with means $\mu_1, \mu_2$ and covariance $\Sigma_1$ shown]
• Assume $K$ components; the $k$-th component is a Gaussian with parameters $\mu_k, \Sigma_k$
• Introduce a discrete r.v. $z \in \mathbb{R}^K$ that denotes the component that generates the point
• One element of $z$ is equal to 1 and the others are 0, i.e. “one-hot”: $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$
Mixtures of Gaussians: Data generation example
[Figure: two Gaussian components with means $\mu_1, \mu_2$ and covariance $\Sigma_1$ in the $(x_1, x_2)$ plane; a sampled point $x$ with $z = (0, 1)^T$ is shown]
• Suppose $K = 2$ components; the $k$-th component is a Gaussian with parameters $\mu_k, \Sigma_k$
• To sample the $i$-th data point:
  – Pick component $z_i$ with $p(z_k = 1) = \pi_k$ (parameter); for example, $\pi_k = 0.5$, and we picked $z_i = (0, 1)^T$
  – Pick data point $x_i$ with probability $\mathcal{N}(x;\, \mu_k, \Sigma_k)$
Mixtures of Gaussians
• $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$
• $K$ components, the $k$-th component is a Gaussian with parameters $\mu_k, \Sigma_k$
• Define the joint distribution $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})$ in terms of a marginal distribution $p(\mathbf{z})$ and a conditional distribution $p(\mathbf{x} \mid \mathbf{z})$,
  where $p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}$ and $p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)^{z_k}$
• Substitute and simplify: $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)$
Maximum Likelihood Solution for Mixture of Gaussians
• This distribution is known as a Mixture of Gaussians
• We can estimate the parameters using Maximum Likelihood, i.e. maximize
  $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \ln p(x_1, x_2, \ldots, x_N \mid \pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$
• The iterative algorithm used to do this is called Expectation Maximization (EM)
• Very similar to a soft version of K-Means!
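A sketch of evaluating this log likelihood with SciPy, using the mixture density $p(x) = \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$ from the previous slide (`log_likelihood` is a made-up helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """ln p(X | pi, mu, Sigma) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) )."""
    K = len(pi)
    # densities[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    densities = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    return np.sum(np.log(densities.sum(axis=1)))
```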
Expectation Maximization
• We can estimate the parameters using Maximum Likelihood, i.e. minimize the negative log likelihood $-\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$
• Problem: we don't know the values of the “hidden” (or “latent”) variable $z$; we don't observe it
• Solution: treat the $z_i$ as parameters and use coordinate descent
Coordinate Descent
Gradient descent:
• Minimize w.r.t. all parameters at each step
Coordinate descent:
• Fix some coordinates, minimize w.r.t. the rest
• Alternate
Credit: http://vis.supstat.com/2013/03/gradient-descent-algorithm-with-r/
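A toy illustration of coordinate descent, minimizing $f(x, y) = x^2 + y^2 + xy$ by exactly minimizing over one coordinate at a time (example of my own, not from the slides):

```python
# Toy coordinate descent on f(x, y) = x^2 + y^2 + x*y (minimum at x = y = 0).
# Each step minimizes f exactly over one coordinate while the other is held fixed.
x, y = 3.0, -2.0
for step in range(20):
    x = -y / 2.0          # argmin over x with y fixed (df/dx = 2x + y = 0)
    y = -x / 2.0          # argmin over y with x fixed (df/dy = 2y + x = 0)
print(x, y)               # both approach 0
```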
Credit: Martin Takac
Coordinate descent for Mixtures of Gaussians: Expectation Maximization
[Figure: two Gaussian components in the $(x_1, x_2)$ plane with parameters $\mu_1, \mu_2$, $\Sigma_1, \Sigma_2$, $\pi_1, \pi_2$]
Alternate:
• fix $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$, update the $z_i$
• fix the $z_i$, update $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$
Expectation Maximization Algorithm
• A general technique for finding maximum likelihood estimators in latent variable models
• Initialize and iterate until convergence:
  E-Step: estimate the posterior probability of the latent variables $p(z_k \mid x)$, holding the parameters fixed
  M-Step: maximize the likelihood w.r.t. the parameters (here $\mu_k, \Sigma_k, \pi_k$) using the latent probabilities from the E-step
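A compact sketch of these two steps for a Gaussian mixture, roughly following the update equations in Bishop Ch. 9.2 (the function name, initialization choices, and the small covariance regularizer are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM sketch for a Gaussian mixture: returns responsibilities and parameters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: uniform mixing weights, random data points as means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.eye(d) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(z_k = 1 | x_n), parameters held fixed
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: maximize the likelihood w.r.t. pi, mu, Sigma using the responsibilities
        Nk = gamma.sum(axis=0)                      # effective number of points per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # weighted covariance, plus a tiny ridge for numerical stability
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return gamma, pi, mu, Sigma
```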
EM for Gaussian Mixtures Example
EM for Gaussian Mixtures
see Bishop Ch. 9.2
Gaussian Mixtures
Data: $X = \{x_n\}$    Parameters: $\pi_k, \mu_k, \Sigma_k$
• How many possible solutions for $K$ clusters? $K^N$
• Is the cost function convex? No
Summary
• Unsupervised learning
• Discrete latent variables:
– K-Means clustering
– Gaussian Mixture clustering
• Next time: Continuous latent variables
– Principal components analysis