Lecture 18. Gaussian Mixture Model. Expectation Maximization.
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Unsupervised learning
∗ Diversity of problems
• Gaussian mixture model (GMM)
∗ A probabilistic approach to clustering
∗ The GMM model
∗ GMM clustering as an optimisation problem
• The Expectation Maximization (EM) algorithm
Unsupervised Learning
A large branch of ML concerned with learning the structure of data in the absence of labels
Previously: Supervised learning
• Supervised learning: Overarching aim is making predictions from data
• We studied methods such as random forest, ANN and SVM in the context of this aim
• We had instances $\mathbf{x}_i \in \mathbb{R}^m$, $i = 1, \dots, n$, and corresponding labels $y_i$ as inputs, and the aim was to predict labels for new instances
• Can be viewed as a function approximation problem, but with a big caveat: ability to generalise is critical
• Bandits: a setting of partial supervision
Now: Unsupervised learning
• Next few lectures: unsupervised learning methods
• In unsupervised learning, there is no dedicated variable called a “label”
• Instead, we just have a set of points $\mathbf{x}_i \in \mathbb{R}^m$, $i = 1, \dots, n$
• The aim of unsupervised learning is to explore the structure (patterns, regularities) of the data
• The aim of “exploring the structure” is vague
Unsupervised learning tasks
• A diversity of tasks fall into the unsupervised learning category
∗ Clustering (now)
∗ Dimensionality reduction (soon)
∗ Learning parameters of probabilistic models (before/now)
• Applications and related tasks are numerous:
∗ Market basket analysis. E.g., use supermarket transaction logs to find items that are frequently purchased together
∗ Outlier detection. E.g., find potentially fraudulent credit card transactions
∗ Unsupervised tasks often appear inside (supervised) ML pipelines
Refresher: k-means clustering
1. Initialisation: choose 𝑘𝑘 cluster centroids randomly
2. Update:
a) Assign points to the nearest* centroid
b) Compute centroids under the current assignment
3. Termination: if no change then stop
4. Go to Step 2
*Distance is determined by the choice of metric, typically L2. Still one of the most popular data mining algorithms; a code sketch follows below.
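As a concrete reference, the loop above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `kmeans`, the initialisation scheme and the iteration cap are choices made here, not part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, m) array of instances, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialisation: choose k data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2a. Assign each point to the nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # 2b. Recompute centroids under the current assignment
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[assignment == c]
            if len(members) > 0:              # keep the old centroid if a cluster is empty
                new_centroids[c] = members.mean(axis=0)
        # 3. Termination: stop if the centroids did not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignment
```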
Refresher: k-means clustering
Requires specifying the number of clusters in advance
Measures “dissimilarity” using Euclidean distance
Finds “spherical” clusters
An iterative optimization procedure
Data: Old Faithful geyser data (waiting time between eruptions and the duration of eruptions)
Figure: Bishop, Section 9.1
Gaussian Mixture Model
A probabilistic view of clustering
Modelling uncertainty in data clustering
• k-means clustering assigns each point to exactly one cluster
• Similar to k-means, a probabilistic mixture model requires
the user to choose the number of clusters in advance
• Unlike k-means, the probabilistic model gives us the power to express uncertainty about the origin of each point
∗ Each point originates from cluster $c$ with probability $w_c$, $c = 1, \dots, k$
• That is, each point still originates from one particular cluster (aka component), but we are not sure which one
• Next
∗ Individual components modelled as Gaussians
∗ Fitting illustrates the general Expectation Maximization (EM) algorithm
Clustering: probabilistic model
Data points xi are independent and identically distributed (i.i.d.) samples from a mixture of K distributions (components)
Each component in the mixture is what we call a cluster
In principle, we can adopt any probability distribution for the components; however, the normal distribution is a common modelling choice, giving the Gaussian Mixture Model
Figure: data generated from two components, Cluster 1 and Cluster 2
Normal (aka Gaussian) distribution
• Recall that a 1D Gaussian is
$$\mathcal{N}(x|\mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
• And a $d$-dimensional Gaussian is
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \equiv (2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
∗ $\boldsymbol{\Sigma}$ is a PSD symmetric $d \times d$ matrix, the covariance matrix
∗ $|\boldsymbol{\Sigma}|$ denotes its determinant
∗ No need to memorize the full formula.
Figure: Bishop
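For concreteness, the $d$-dimensional density above can be evaluated either with `scipy.stats.multivariate_normal` or directly from the formula; the mean, covariance and query point below are made-up illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                    # mean vector
Sigma = np.array([[1.0, 0.3],                # PSD symmetric covariance matrix
                  [0.3, 2.0]])
x = np.array([0.5, -1.0])

# Library evaluation of N(x | mu, Sigma)
p_lib = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

# Direct evaluation of the d-dimensional formula above
d = len(mu)
diff = x - mu
p_formula = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5) \
            * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(p_lib, p_formula)   # the two values agree
```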
Gaussian mixture model (GMM)
• Gaussian mixture distribution (for one data point):
$$P(\mathbf{x}) \equiv \sum_{j=1}^{k} w_j\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \equiv \sum_{j=1}^{k} P(C_j)\, P(\mathbf{x}|C_j)$$
• $P(\mathbf{x}|C_j) \equiv \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ is the class/component conditional density (modelled as a Gaussian) for class $j$
• Here $P(C_j) \ge 0$ and $\sum_{j=1}^{k} P(C_j) = 1$
• Parameters of the model are $P(C_j)$, $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$, $j = 1, \dots, k$
Figure: 1D example (Bishop); mixture and individual component densities are re-scaled for visualisation purposes
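A minimal sketch of the mixture density above, assuming SciPy is available; the weights, means and covariances are illustrative placeholders, and the function name `gmm_density` is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Mixture density P(x) = sum_j w_j N(x | mu_j, Sigma_j) for a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, covs))

# Illustrative two-component mixture in 2D
weights = [0.6, 0.4]                                    # P(C_j): non-negative, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # mu_j
covs = [np.eye(2), 2.0 * np.eye(2)]                     # Sigma_j
print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```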
Clustering as model estimation
• Given a set of data points, we assume that data points are generated by a GMM
∗ Each point in our dataset originates from the $j$-th normal distribution (component) with probability $w_j = P(C_j)$
• Clustering now amounts to finding parameters of the GMM that “best explain” the observed data
• Call upon our old friend the MLE principle to find parameter values that maximise $p(\mathbf{x}_1, \dots, \mathbf{x}_n)$
Fitting the GMM
• Modelling the data points as independent, the aim is to find $P(C_j)$, $\boldsymbol{\mu}_j$, $\boldsymbol{\Sigma}_j$, $j = 1, \dots, k$, that maximise
$$P(\mathbf{x}_1, \dots, \mathbf{x}_n) = \prod_{i=1}^{n} \sum_{j=1}^{k} P(C_j)\, P(\mathbf{x}_i|C_j)$$
where $P(\mathbf{x}|C_j) \equiv \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$
• Can this be solved analytically? Taking the derivative of this expression is pretty awkward; try the usual log trick
$$\log P(\mathbf{x}_1, \dots, \mathbf{x}_n) = \sum_{i=1}^{n} \log\left(\sum_{j=1}^{k} P(C_j)\, P(\mathbf{x}_i|C_j)\right)$$
• The log cannot be moved inside the inner sum over components, so even the log-likelihood remains awkward to maximise directly; this is where Expectation-Maximisation (EM) comes in
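Although the log-likelihood is awkward to optimise analytically, it is easy to compute numerically; a sketch using `scipy.special.logsumexp` for numerical stability (the function name `gmm_log_likelihood` is just for illustration):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """log P(x_1,...,x_n) = sum_i log sum_j w_j N(x_i | mu_j, Sigma_j)."""
    # (n, k) matrix of log w_j + log N(x_i | mu_j, Sigma_j)
    log_terms = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    ])
    # The sum over components sits inside the log, hence logsumexp per point
    return logsumexp(log_terms, axis=1).sum()
```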
Expectation Maximisation Algorithm
For a moment, let’s put the GMM problem aside; we will come back to it later.
Motivation of EM
• Consider a parametric probabilistic model $p(\mathbf{X}|\boldsymbol{\theta})$, where $\mathbf{X}$ denotes the data and $\boldsymbol{\theta}$ denotes a vector of parameters
• According to MLE, we need to maximise $p(\mathbf{X}|\boldsymbol{\theta})$ as a function of $\boldsymbol{\theta}$
∗ equivalently, maximise $\log p(\mathbf{X}|\boldsymbol{\theta})$
• There can be a couple of issues with this task
1. Sometimes we don’t observe some of the variables needed to compute the log likelihood
∗ Example: GMM cluster membership is not known in advance
2. Sometimes the form of the log likelihood is inconvenient to work with
∗ Example: taking a derivative of the GMM log likelihood results in a cumbersome equation
art: Ebrahim at Wikimedia Commons (CC4)
MLE vs EM
• MLE is a frequentist principle that suggests that given a dataset, the “best” parameters to use are the ones that maximise the probability of the data
∗ MLE is a way to formally pose the problem
• EM is an algorithm
∗ EM is a way to solve the problem posed by MLE
∗ Especially convenient under unobserved latent variables
• MLE can be found by other methods such as gradient descent (but gradient descent is not always the most convenient method)
Expectation-Maximisation (EM) Algorithm
• Initialisation step:
∗ Initialise $K$ clusters $C_1, \dots, C_K$, i.e., $(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ and $P(C_j)$ for each cluster $j$
• Iteration step:
∗ Estimate the cluster of each datum: $p(C_j|\mathbf{x}_i)$  [Expectation]
∗ Re-estimate the cluster parameters $(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), p(C_j)$ for each cluster $j$  [Maximisation]
EM for GMM and generally
• EM is a general approach, goes beyond GMMs
∗ Purpose: implement MLE under latent variables $\mathbf{Z}$ (‘latent’ is fancy for ‘missing’)
• What are the variables and parameters in GMMs?
∗ Variables: point locations $\mathbf{X}$ and cluster assignments $\mathbf{Z}$
  • Let $z_i$ denote the true cluster membership for each point $\mathbf{x}_i$; computing the likelihood with known values $\mathbf{z}$ is simplified (see next section)
∗ Parameters: $\boldsymbol{\theta}$ are cluster locations and scales
• What is EM really doing?
∗ Coordinate ascent on a lower bound on the log-likelihood
  • M-step: ascent in modelled parameters $\boldsymbol{\theta}$
  • E-step: ascent in the marginal likelihood $P(\mathbf{Z})$
∗ Each step moves towards a local optimum
∗ Can get stuck, can need random restarts
Needed tool: Jensen’s inequality
• Compares the effect of averaging before and after applying a convex function:
$$f(\text{Average}(\mathbf{x})) \le \text{Average}(f(\mathbf{x}))$$
• Example:
∗ Let $f$ be some convex function, such as $f(x) = x^2$
∗ Consider $\mathbf{x} = (1,2,3,4,5)'$, then $f(\mathbf{x}) = (1,4,9,16,25)'$
∗ Average of input: $\text{Average}(\mathbf{x}) = 3$
∗ $f(\text{Average}(\mathbf{x})) = 9$
∗ Average of output: $\text{Average}(f(\mathbf{x})) = 11$
• Proof follows from the definition of convexity
∗ Proof by induction
• General statement:
∗ If $\mathbf{X}$ is a random variable and $f$ is a convex function, then $f(\mathbb{E}[\mathbf{X}]) \le \mathbb{E}[f(\mathbf{X})]$
plot: MHz’as at Wikimedia Commons (public domain)
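The toy example above can be checked numerically; a minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = lambda t: t ** 2                       # a convex function

print(f(x.mean()))      # f(Average(x))  = 9.0
print(f(x).mean())      # Average(f(x)) = 11.0 >= 9.0, as Jensen's inequality requires
```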
Putting the latent variables to use
• We want to maximise $\log p(\mathbf{X}|\boldsymbol{\theta})$. We don’t observe $\mathbf{Z}$ (here discrete), but we can introduce it nonetheless
• $\log p(\mathbf{X}|\boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$   (marginalisation; $\sum_{\mathbf{Z}}$ iterates over all possible values of $\mathbf{Z}$)
• $= \log \sum_{\mathbf{Z}} p(\mathbf{Z}) \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}$   (need $p(\mathbf{Z})$ to be non-zero)
• $= \log \mathbb{E}_{\mathbf{Z}}\left[\frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$
• $\ge \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$   (Jensen’s inequality holds since $\log(\cdot)$ is a concave function)
• $= \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
Maximising the lower bound (1/2)
• $\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
• The right hand side (RHS) is a lower bound on the original log likelihood
∗ This holds for any $\boldsymbol{\theta}$ and any non-zero $p(\mathbf{Z})$
• This lower bound is a function of two “variables” $\boldsymbol{\theta}$ and $p(\mathbf{Z})$. We want to maximise the RHS as a function of these two “variables”
• It is hard to optimise with respect to both at the same time, so EM resorts to an iterative procedure
• Intuitively, we want to push the lower bound up
Maximising the lower bound (2/2)
• $\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
• EM is essentially coordinate ascent:
∗ Fix $\boldsymbol{\theta}$ and optimise the lower bound for $p(\mathbf{Z})$
∗ Fix $p(\mathbf{Z})$ and optimise for $\boldsymbol{\theta}$
• For any point $\boldsymbol{\theta}^*$, it can be shown that setting $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^*)$ makes the lower bound tight (we will prove this shortly)
• The convenience of EM comes from the following:
∗ For any $p(\mathbf{Z})$, the second term does not depend on $\boldsymbol{\theta}$
∗ When $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^*)$, the first term can usually be maximised as a function of $\boldsymbol{\theta}$ in closed form
∗ If not, then probably don’t use EM
Example (1/3)
$$\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})] \equiv G(\boldsymbol{\theta}, p(\mathbf{Z}))$$
Figure: $\log p(\mathbf{X}|\boldsymbol{\theta})$ plotted against $\theta$, together with lower bounds $G(\theta, p_1(\mathbf{Z}))$, $G(\theta, p_2(\mathbf{Z}))$ and the tight bound $G(\theta, p(\mathbf{Z}|\mathbf{X}, \theta^{(t)}))$ at the current estimate $\theta^{(t)}$
Example (2/3)
$$\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})] \equiv G(\boldsymbol{\theta}, p(\mathbf{Z}))$$
Figure: maximising the tight lower bound $G(\theta, p(\mathbf{Z}|\mathbf{X}, \theta^{(t)}))$ over $\theta$ moves the estimate from $\theta^{(t)}$ to $\theta^{(t+1)}$
Example (3/3)
$$\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})] \equiv G(\boldsymbol{\theta}, p(\mathbf{Z}))$$
Figure: at $\theta^{(t+1)}$ a new tight lower bound $G(\theta, p(\mathbf{Z}|\mathbf{X}, \theta^{(t+1)}))$ is formed, and the procedure repeats
EM as iterative optimisation
1. Initialisation: choose (random) initial values of $\boldsymbol{\theta}^{(1)}$
2. Update:
∗ E-step: compute $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) \equiv \mathbb{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol{\theta}^{(t)}}\left[\log p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})\right]$
∗ M-step: $\boldsymbol{\theta}^{(t+1)} = \operatorname{argmax}_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$
3. Termination: if no change then stop
4. Go to Step 2
This algorithm will eventually stop (converge), but the resulting estimate may only be a local maximum
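The iteration above can be written as a generic skeleton; the E-step and M-step callables are problem-specific (for the GMM they are spelled out in the coming sections). This is a sketch under the assumption that the parameters are stored in a flat array, and the function names are ours.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iters=100, tol=1e-8):
    """Generic EM skeleton (sketch).
    e_step(X, theta) returns the statistics needed to form Q(theta, theta^(t));
    m_step(X, stats) returns theta^(t+1) = argmax_theta Q(theta, theta^(t))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        stats = e_step(X, theta)                    # E-step
        new_theta = np.asarray(m_step(X, stats))    # M-step
        # Termination: stop once the parameters stop changing
        if np.allclose(new_theta, theta, atol=tol):
            return new_theta
        theta = new_theta
    return theta    # may be only a local maximum
```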
Maximising the lower bound (2/2)
• $\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
• EM is essentially coordinate ascent:
∗ Fix $\boldsymbol{\theta}$ and optimise the lower bound for $p(\mathbf{Z})$
∗ Fix $p(\mathbf{Z})$ and optimise for $\boldsymbol{\theta}$
• For any point $\boldsymbol{\theta}^*$, it can be shown that setting $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^*)$ makes the lower bound tight (we will prove this now)
• The convenience of EM follows from the following:
∗ For any $p(\mathbf{Z})$, the second term does not depend on $\boldsymbol{\theta}$
∗ When $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^*)$, the first term can usually be maximised as a function of $\boldsymbol{\theta}$ in closed form
∗ If not, then probably don’t use EM
Putting the latent variables to use
• We want to maximise $\log p(\mathbf{X}|\boldsymbol{\theta})$. We don’t know $\mathbf{Z}$, but consider an arbitrary non-zero distribution $p(\mathbf{Z})$
• $\log p(\mathbf{X}|\boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})$   (rule of marginal distribution; $\sum_{\mathbf{Z}}$ iterates over all possible values of $\mathbf{Z}$)
• $= \log \sum_{\mathbf{Z}} p(\mathbf{Z}) \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}$
• $= \log \mathbb{E}_{\mathbf{Z}}\left[\frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$
• $\ge \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$   (Jensen’s inequality holds since $\log(\cdot)$ is a concave function)
• $= \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
Setting a tight lower bound (1/2)
• $\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})] - \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{Z})]$
• $= \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$
• $= \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})\, p(\mathbf{X}|\boldsymbol{\theta})}{p(\mathbf{Z})}\right]$   (chain rule of probability)
• $= \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z})} + \log p(\mathbf{X}|\boldsymbol{\theta})\right]$
• $= \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z})}\right] + \mathbb{E}_{\mathbf{Z}}[\log p(\mathbf{X}|\boldsymbol{\theta})]$   (linearity of $\mathbb{E}[\cdot]$)
• $= \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z})}\right] + \log p(\mathbf{X}|\boldsymbol{\theta})$   ($\mathbb{E}[\cdot]$ of a constant)
• Therefore $\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z})}\right] + \log p(\mathbf{X}|\boldsymbol{\theta})$
Setting a tight lower bound (2/2)
The RHS below is a lower bound on $\log p(\mathbf{X}|\boldsymbol{\theta})$, which is what we ultimately want to maximise:
$$\log p(\mathbf{X}|\boldsymbol{\theta}) \ge \mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z})}\right] + \log p(\mathbf{X}|\boldsymbol{\theta})$$
First, note that the first term on the RHS* is $\le 0$
Second, note that if $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})$, then
$$\mathbb{E}_{\mathbf{Z}}\left[\log \frac{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}{p(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta})}\right] = \mathbb{E}_{\mathbf{Z}}[\log 1] = 0$$
For any $\boldsymbol{\theta}^*$, setting $p(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^*)$ maximises the lower bound on $\log p(\mathbf{X}|\boldsymbol{\theta}^*)$ and makes it tight
*This term is the negative Kullback-Leibler divergence between $p(\mathbf{Z})$ and $p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta})$
Estimating Parameters of Gaussian Mixture Model
A classical application of the Expectation-Maximisation algorithm
Latent variables of GMM
• Let $z_1, \dots, z_n$ denote the true origins of the corresponding points $\mathbf{x}_1, \dots, \mathbf{x}_n$. Each $z_i$ is a discrete variable that takes values in $1, \dots, k$, where $k$ is the number of clusters
• Now compare the original log likelihood
$$\log p(\mathbf{x}_1, \dots, \mathbf{x}_n) = \sum_{i=1}^{n} \log\left(\sum_{c=1}^{k} w_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)\right)$$
• With the complete log likelihood (if we knew $\mathbf{z}$)
$$\log p(\mathbf{x}_1, \dots, \mathbf{x}_n, \mathbf{z}) = \sum_{i=1}^{n} \log\left(w_{z_i}\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i})\right)$$
• Recall that taking a log of a normal density function results in a tractable expression
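The tractability of the complete log likelihood is easy to see in code; a sketch assuming SciPy, with z given as an integer array of cluster indices (the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_log_likelihood(X, z, weights, means, covs):
    """log p(x_1,...,x_n, z) = sum_i [ log w_{z_i} + log N(x_i | mu_{z_i}, Sigma_{z_i}) ].
    With z known, the log acts directly on each Gaussian density: no sum inside the log."""
    return sum(np.log(weights[zi]) +
               multivariate_normal.logpdf(xi, mean=means[zi], cov=covs[zi])
               for xi, zi in zip(X, z))
```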
Handling uncertainty about $\mathbf{z}$
• We cannot compute the complete log likelihood because we don’t know $\mathbf{z}$
• The EM algorithm handles this uncertainty by replacing $\log p(\mathbf{X}, \mathbf{z}|\boldsymbol{\theta})$ with the expectation $\mathbb{E}_{\mathbf{z}|\mathbf{X},\boldsymbol{\theta}^{(t)}}\left[\log p(\mathbf{X}, \mathbf{z}|\boldsymbol{\theta})\right]$
• This in turn requires the distribution $p(\mathbf{z}|\mathbf{X}, \boldsymbol{\theta}^{(t)})$ given the current parameter estimates
• Assuming that the $z_i$ are pairwise independent, we need $P(z_i = c|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$
• E.g., suppose $\mathbf{x}_i = (-2, -2)$. What is the probability that this point originated from Cluster 1?
Figure: two Gaussian clusters, Cluster 1 and Cluster 2
E-step: Cluster responsibilities
• Setting the latent variable $z_i$ to be the cluster that point $\mathbf{x}_i$ originates from, Bayes rule yields
$$P(z_i = c|\mathbf{x}_i, \boldsymbol{\theta}^{(t)}) = \frac{w_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)}{\sum_{l=1}^{k} w_l\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}$$
• This probability is called the responsibility that cluster $c$ takes for data point $i$:
$$r_{ic} \equiv P(z_i = c|\mathbf{x}_i, \boldsymbol{\theta}^{(t)})$$
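A sketch of this E-step (responsibilities via Bayes rule), assuming SciPy; the function name `e_step` is ours, and it returns an (n, k) matrix whose rows sum to one.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """r[i, c] = P(z_i = c | x_i, theta^(t)), computed via Bayes rule."""
    n, k = len(X), len(weights)
    r = np.zeros((n, k))
    for c in range(k):
        # numerator: w_c * N(x_i | mu_c, Sigma_c) for every point i
        r[:, c] = weights[c] * multivariate_normal.pdf(X, mean=means[c], cov=covs[c])
    # denominator: sum over all components l
    return r / r.sum(axis=1, keepdims=True)
```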
Expectation step for GMM
To simplify notation, we denote $\mathbf{x}_1, \dots, \mathbf{x}_n$ as $\mathbf{X}$
• $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) \equiv \mathbb{E}_{\mathbf{z}|\mathbf{X},\boldsymbol{\theta}^{(t)}}\left[\log p(\mathbf{X}, \mathbf{z}|\boldsymbol{\theta})\right]$
• $= \sum_{\mathbf{z}} p(\mathbf{z}|\mathbf{X}, \boldsymbol{\theta}^{(t)}) \log p(\mathbf{X}, \mathbf{z}|\boldsymbol{\theta})$
• $= \sum_{\mathbf{z}} p(\mathbf{z}|\mathbf{X}, \boldsymbol{\theta}^{(t)}) \sum_{i=1}^{n} \log\left(w_{z_i}\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i})\right)$
• $= \sum_{i=1}^{n} \sum_{\mathbf{z}} p(\mathbf{z}|\mathbf{X}, \boldsymbol{\theta}^{(t)}) \log\left(w_{z_i}\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i})\right)$
• $= \sum_{i=1}^{n} \sum_{c=1}^{k} r_{ic} \log\left(w_c\, \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)\right)$
• $= \sum_{i=1}^{n} \sum_{c=1}^{k} r_{ic} \log w_c + \sum_{i=1}^{n} \sum_{c=1}^{k} r_{ic} \log \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$
Maximisation step for GMM
• In the maximisation step, take partial derivatives of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ with respect to each of the parameters and set the derivatives to zero to obtain the new parameter estimates
• $w_c^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} r_{ic}$
• $\boldsymbol{\mu}_c^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ic}\, \mathbf{x}_i}{r_c}$
• $\boldsymbol{\Sigma}_c^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ic}\, \mathbf{x}_i \mathbf{x}_i'}{r_c} - \boldsymbol{\mu}_c^{(t+1)} \left(\boldsymbol{\mu}_c^{(t+1)}\right)'$
∗ Here $r_c \equiv \sum_{i=1}^{n} r_{ic}$
• Note that these are the estimates for step $(t + 1)$
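A sketch of these updates in NumPy; the covariance is computed here as the responsibility-weighted scatter around the new mean, which is algebraically equivalent to the formula above. The function name `m_step` is ours.

```python
import numpy as np

def m_step(X, r):
    """New parameter estimates from responsibilities r of shape (n, k)."""
    n, k = r.shape
    r_c = r.sum(axis=0)                      # r_c = sum_i r_ic (effective cluster sizes)
    weights = r_c / n                        # w_c^(t+1)
    means = (r.T @ X) / r_c[:, None]         # mu_c^(t+1) = sum_i r_ic x_i / r_c
    covs = []
    for c in range(k):
        diff = X - means[c]
        # Sigma_c^(t+1): responsibility-weighted covariance around mu_c^(t+1)
        covs.append((r[:, c, None] * diff).T @ diff / r_c[c])
    return weights, means, covs
```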
Example of fitting Gaussian Mixture model
K-means as EM for a restricted GMM
• Consider a GMM in which all components have the same fixed weight $w_c = 1/k$, and each Gaussian has the same fixed covariance matrix $\boldsymbol{\Sigma}_c = \sigma^2 \mathbf{I}$, where $\mathbf{I}$ is the identity matrix
• In such a model, only the component centroids $\boldsymbol{\mu}_c$ need to be estimated
• Next, approximate the probabilistic cluster responsibility $r_{ic} = P(z_i = c|\mathbf{x}_i, \boldsymbol{\mu}^{(t)})$ with a deterministic assignment: $r_{ic} = 1$ if centroid $\boldsymbol{\mu}_c^{(t)}$ is closest to point $\mathbf{x}_i$, and $r_{ic} = 0$ otherwise
• Such a formulation results in an M-step where each $\boldsymbol{\mu}_c$ is set to the centroid of the points assigned to cluster $c$ (see the sketch below)
• In other words, the k-means algorithm is an EM algorithm for the restricted GMM model described above!
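A sketch of the restricted model's two steps; with hard responsibilities and a shared spherical covariance they reduce exactly to the two k-means update steps. The function names are ours.

```python
import numpy as np

def hard_e_step(X, centroids):
    """Deterministic responsibilities: r_ic = 1 for the closest centroid, else 0."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    r = np.zeros((len(X), len(centroids)))
    r[np.arange(len(X)), assignment] = 1.0
    return r

def hard_m_step(X, r):
    """With w_c = 1/k and Sigma_c = sigma^2 I fixed, only the centroids are re-estimated:
    each mu_c becomes the centroid of the points assigned to cluster c."""
    return (r.T @ X) / r.sum(axis=0)[:, None]
```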
This lecture
• Unsupervised learning
∗ Diversity of problems
• Gaussian mixture model (GMM)
∗ A probabilistic approach to clustering
∗ The GMM model
∗ GMM clustering as an optimisation problem
• The Expectation Maximization (EM) algorithm
• Next lecture: more unsupervised learning, with dimensionality reduction