
Unsupervised Learning I

Today
• Unsupervised learning
– K-Means clustering
– Gaussian Mixture clustering

Unsupervised Learning I
Clustering

Supervised learning
Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)} — each example x⁽ⁱ⁾ comes with a label y⁽ⁱ⁾
slide credit: Andrew Ng

Unsupervised learning
Training set: {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾} — no labels y⁽ⁱ⁾ are given
slide credit: Andrew Ng

Clustering
Example applications:
• Gene analysis
• Social network analysis
• Types of voters
• Trending news

Unsupervised Learning I
K-means Algorithm

[Figure sequence: step-by-step illustration of K-means on 2-D data, showing how the cluster centroids and point assignments change at each iteration.]
slide credit: Andrew Ng

K-means algorithm
Input:
– K (number of clusters)
– Training set {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}, with x⁽ⁱ⁾ ∈ ℝⁿ (drop the x₀ = 1 convention)
slide credit: Andrew Ng

K-means algorithm
Randomly initialize K cluster centroids μ₁, μ₂, …, μ_K ∈ ℝⁿ
Repeat {
  for i = 1 to m
    c⁽ⁱ⁾ := index (from 1 to K) of the cluster centroid closest to x⁽ⁱ⁾
  for k = 1 to K
    μₖ := average (mean) of the points assigned to cluster k
}
slide credit: Andrew Ng
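A minimal NumPy sketch of these two alternating steps (the function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, centroids, n_iters=100):
    """Alternate the two K-means steps: assign points, then move centroids.
    X: (m, n) data matrix; centroids: (K, n) initial cluster centroids."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iters):
        # Assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, K)
        c = np.argmin(dists, axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            if np.any(c == k):                       # guard against empty clusters
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```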

K-means Cost Function
c⁽ⁱ⁾ = index of cluster (1, 2, …, K) to which example x⁽ⁱ⁾ is currently assigned
μₖ = cluster centroid k (μₖ ∈ ℝⁿ)
μ_c⁽ⁱ⁾ = cluster centroid of the cluster to which example x⁽ⁱ⁾ has been assigned
Optimization cost ("distortion"):
J(c⁽¹⁾, …, c⁽ᵐ⁾, μ₁, …, μ_K) = (1/m) Σᵢ ‖x⁽ⁱ⁾ − μ_c⁽ⁱ⁾‖²  (sum over i = 1 to m)
slide credit: Andrew Ng
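A sketch of the distortion computed directly from this definition, assuming the same array layout as the K-means sketch above:

```python
import numpy as np

def distortion(X, c, centroids):
    """J(c, mu) = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2"""
    diffs = X - centroids[c]              # mu_{c^(i)} broadcast per example
    return np.mean(np.sum(diffs ** 2, axis=1))
```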

Random initialization
Should have K < m.
Randomly pick K training examples.
Set μ₁, …, μ_K equal to these K examples.
slide credit: Andrew Ng
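A sketch of this initialization step, assuming the data is stored as an (m, n) NumPy array (the helper name is mine):

```python
import numpy as np

def init_centroids(X, K, rng=np.random.default_rng()):
    """Set mu_1, ..., mu_K equal to K distinct, randomly chosen training examples (requires K < m)."""
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].astype(float).copy()
```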

Local Optima
slide credit: Andrew Ng

Avoiding Local Optima with Random Initialization
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means; get c⁽¹⁾, …, c⁽ᵐ⁾, μ₁, …, μ_K.
  Compute the cost function (distortion) J(c⁽¹⁾, …, c⁽ᵐ⁾, μ₁, …, μ_K).
}
Pick the clustering that gave the lowest cost J.
slide credit: Andrew Ng
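Putting the pieces together, a sketch of this restart loop, reusing the hypothetical init_centroids, kmeans, and distortion helpers from the earlier sketches (X and K are assumed given):

```python
import numpy as np

best_cost, best = np.inf, None
for _ in range(100):
    centroids = init_centroids(X, K)
    c, centroids = kmeans(X, centroids)
    J = distortion(X, c, centroids)
    if J < best_cost:                 # keep the clustering with the lowest distortion
        best_cost, best = J, (c, centroids)
```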

How to choose K?
Elbow method: plot the cost function (distortion) J against the number of clusters K and pick the K at the "elbow" of the curve, where the cost stops decreasing sharply.
[Figure: two panels plotting the cost function J against the no. of clusters K = 1…8.]
slide credit: Andrew Ng
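A sketch of how the elbow plot could be produced, again reusing the hypothetical helpers from the earlier sketches (for a cleaner curve, each K should itself use the best of several random restarts):

```python
import matplotlib.pyplot as plt

costs = []
for K in range(1, 9):
    c, centroids = kmeans(X, init_centroids(X, K))
    costs.append(distortion(X, c, centroids))

plt.plot(range(1, 9), costs, marker="o")
plt.xlabel("K (no. of clusters)")
plt.ylabel("Cost function J (distortion)")
plt.show()
```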

How to choose K?
Sometimes, you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
E.g. T-shirt sizing:
[Figure: two panels of weight vs. height, with customers clustered into different numbers of T-shirt sizes.]
slide credit: Andrew Ng

Unsupervised Learning I
Mixtures of Gaussians

Mixtures of Gaussians: Intuition
"Soft" cluster membership.
Define a distribution over 𝑥. To generate each point 𝑥:
• choose its cluster component z
• sample 𝑥 from the Gaussian distribution for that component
[Figure: 2-D data in the (𝑥1, 𝑥2) plane generated from a mixture of Gaussians.]

Mixtures of Gaussians: component membership variable 𝑧
• Assume 𝐾 components; the 𝑘-th component is a Gaussian with parameters 𝜇𝑘, Σ𝑘
• Introduce a discrete r.v. 𝑧 ∈ ℝ𝐾 that denotes the component that generates the point
• One element of 𝑧 is equal to 1 and the others are 0, i.e. “one-hot”: 𝑧𝑘 ∈ {0,1} and Σ𝑘 𝑧𝑘 = 1
[Figure: two Gaussian components in the (𝑥1, 𝑥2) plane with means 𝜇1, 𝜇2 and covariance Σ1.]

Mixtures of Gaussians: Data generation example
• Suppose 𝐾 = 2 components; the 𝑘-th component is a Gaussian with parameters 𝜇𝑘, Σ𝑘
• To sample the 𝑖-th data point:
– pick component 𝑧𝑖 with p(𝑧𝑘 = 1) = 𝜋𝑘 (parameter); for example, 𝜋𝑘 = 0.5, and we picked 𝑧1 = (0, 1)ᵀ
– pick data point 𝑥𝑖 with probability N(𝑥; 𝜇𝑘, Σ𝑘)
[Figure: the (𝑥1, 𝑥2) plane with component means 𝜇1, 𝜇2 and covariance Σ1; a sampled point is annotated with 𝑧1 = (0, 1)ᵀ.]
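A sketch of this generative process for K = 2; the parameter values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])                                  # mixing coefficients pi_k
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # means mu_k
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]    # covariances Sigma_k

def sample_point():
    k = rng.choice(2, p=pi)                       # pick component: p(z_k = 1) = pi_k
    z = np.eye(2)[k]                              # one-hot z, e.g. (0, 1)^T for k = 1
    x = rng.multivariate_normal(mu[k], Sigma[k])  # sample x ~ N(mu_k, Sigma_k)
    return z, x

X = np.array([sample_point()[1] for _ in range(500)])
```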

Mixtures of Gaussians
• 𝑧𝑘 ∈ {0,1} and Σ𝑘 𝑧𝑘 = 1
• 𝐾 components; the 𝑘-th component is a Gaussian with parameters 𝜇𝑘, Σ𝑘
• Define the joint distribution p(𝐱, 𝒛) in terms of a marginal distribution p(𝒛) and a conditional distribution p(𝐱|𝒛): p(𝐱, 𝒛) = p(𝒛) p(𝐱|𝒛)
• where p(𝑧𝑘 = 1) = 𝜋𝑘 and p(𝐱 | 𝑧𝑘 = 1) = N(𝐱 | 𝜇𝑘, Σ𝑘)
• Substitute and simplify: p(𝐱) = Σ𝒛 p(𝒛) p(𝐱|𝒛) = Σ𝑘 𝜋𝑘 N(𝐱 | 𝜇𝑘, Σ𝑘)
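A sketch of evaluating the resulting mixture density p(x) = Σ𝑘 𝜋𝑘 N(x | 𝜇𝑘, Σ𝑘), assuming SciPy's multivariate normal and the parameter arrays from the sampling sketch above:

```python
from scipy.stats import multivariate_normal

def mixture_density(x, pi, mu, Sigma):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))
```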

Maximum Likelihood Solution for Mixture of Gaussians
• This distribution is known as a Mixture of Gaussians
• We can estimate the parameters using Maximum Likelihood, i.e. maximize
ln p(𝑿 | 𝝅, 𝝁, 𝚺) = ln p(𝑥1, 𝑥2, …, 𝑥𝑁 | 𝜋1, …, 𝜋𝐾, 𝜇1, …, 𝜇𝐾, Σ1, …, Σ𝐾) = Σ𝑛 ln [ Σ𝑘 𝜋𝑘 N(𝑥𝑛 | 𝜇𝑘, Σ𝑘) ]
• The algorithm used to maximize this likelihood is called Expectation Maximization (EM)
• Very similar to a “soft” version of K-Means!

Expectation Maximization
• We can estimate the parameters using Maximum Likelihood, i.e. minimize the negative log likelihood
−ln p(𝑿 | 𝝅, 𝝁, 𝚺) = −Σ𝑛 ln [ Σ𝑘 𝜋𝑘 N(𝑥𝑛 | 𝜇𝑘, Σ𝑘) ]
• Problem: we don’t know the values of the “hidden” (or “latent”) variables 𝑧𝑖; we don’t observe them
• Solution: treat the 𝑧𝑖 as parameters and use coordinate descent

Coordinate Descent
Gradient descent:
• minimize w.r.t. all parameters at each step
Coordinate descent:
• fix some coordinates, minimize w.r.t. the rest
• alternate
Credit: http://vis.supstat.com/2013/03/gradient-descent-algorithm-with-r/
Credit: Martin Takac
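A toy illustration of coordinate descent on a two-variable quadratic (the function is made up for illustration); each step minimizes exactly over one coordinate while the other is held fixed:

```python
# Minimize f(x, y) = x^2 + y^2 + x*y - 3*x by coordinate descent.
# Fixing y, the minimizer in x is x = (3 - y) / 2; fixing x, it is y = -x / 2.
x, y = 0.0, 0.0
for _ in range(50):
    x = (3 - y) / 2.0   # minimize over x with y fixed
    y = -x / 2.0        # minimize over y with x fixed
print(x, y)             # converges to the global minimum (2, -1)
```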

Coordinate descent for Mixtures of Gaussians: Expectation Maximization
Alternate:
• fix 𝝅, 𝝁, 𝚺, update 𝑧𝑖
• fix 𝑧𝑖, update 𝝅, 𝝁, 𝚺
[Figure: a mixture of two Gaussians in the (𝑥1, 𝑥2) plane with parameters 𝜇1, 𝜇2, Σ1, Σ2, 𝜋1, 𝜋2.]

Expectation Maximization Algorithm
• A general technique for finding maximum likelihood estimators in latent variable models
• Initialize and iterate until convergence:
E-Step: estimate the posterior probability of the latent variables, 𝑝(𝑧𝑘|𝑥), holding the parameters fixed
M-Step: maximize the likelihood w.r.t. the parameters (here 𝜇𝑘, Σ𝑘, 𝜋𝑘) using the latent probabilities from the E-step
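A compact NumPy sketch of one EM iteration for a Gaussian mixture; the variable names are mine, and the update formulas follow the standard derivation in Bishop Ch. 9.2:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration. X: (N, d); pi: (K,); mu: (K, d); Sigma: (K, d, d)."""
    N, K = len(X), len(pi)
    # E-step: responsibilities r[n, k] = p(z_k = 1 | x_n)
    r = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                  for k in range(K)], axis=1)           # (N, K), unnormalized
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters using the responsibilities
    Nk = r.sum(axis=0)                                  # effective counts per component
    pi = Nk / N
    mu = (r.T @ X) / Nk[:, None]
    Sigma = np.stack([(r[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
                      for k in range(K)])
    return pi, mu, Sigma
```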

EM for Gaussian Mixtures Example

EM for Gaussian Mixtures
see Bishop Ch. 9.2

Gaussian Mixtures
Data: 𝑋 = {𝑥𝑛}
Parameters: 𝜋𝑘, 𝜇𝑘, Σ𝑘

How many possible solutions for K clusters? 𝐾^𝑁 (one per assignment of the 𝑁 points to clusters).
Is the cost function convex? No.

Summary
• Unsupervised learning
• Discrete latent variables:
– K-Means clustering
– Gaussian Mixture clustering
• Next time: Continuous latent variables
– Principal components analysis