Expectation Maximization Algorithm
CSE, UNSW
April 14, 2021
Motivation
Missing data
Latent variable
Easier optimization
Convex Function
e.g., −log(x)
An important concept in optimization / machine learning.
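For reference, the standard definition: a function $f$ is convex on an interval $I$ if, for all $x, y \in I$ and $\lambda \in [0,1]$,
$$f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y);$$
e.g., $-\log(x)$ satisfies this on $(0, \infty)$.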
Jensen’s Inequality
Let $f$ be a convex function defined on an interval $I$. If $\{x_i\}_{i=1}^n \subseteq I$ and $\{\lambda_i\}_{i=1}^n \ge 0$ with $\sum_{i=1}^n \lambda_i = 1$, then
$$f\Big(\sum_{i=1}^n \lambda_i x_i\Big) \le \sum_{i=1}^n \lambda_i f(x_i).$$
The equality holds iff $x_1 = x_2 = \dots = x_n$ or $f$ is linear.
Corollary: Since $\ln(x)$ is a concave function (i.e., $-\ln(x)$ is a convex function),
$$\ln\Big(\sum_{i=1}^n \lambda_i f(x_i)\Big) \ge \sum_{i=1}^n \lambda_i \ln(f(x_i)).$$
In addition, the equality holds iff $f(x_i)$ is a constant.
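A quick numerical sanity check of the corollary (a minimal sketch; the points and weights are randomly generated, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.5, 5.0, size=10)     # positive values so ln is defined
    lam = rng.dirichlet(np.ones(10))       # weights lambda_i >= 0 summing to 1

    lhs = np.log(np.sum(lam * x))          # ln(sum_i lambda_i x_i)
    rhs = np.sum(lam * np.log(x))          # sum_i lambda_i ln(x_i)
    assert lhs >= rhs                      # Jensen's inequality for concave ln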
Log Likelihood
Define the log-likelihood function $L(\theta) = \ln \Pr\{x \mid \theta\}$. For i.i.d. examples, $L(\theta) = \sum_i L^{(i)}(\theta) = \sum_i \ln \Pr\{x^{(i)} \mid \theta\}$.
Goal: find $\theta^*$ that maximizes the log-likelihood.
What if the model contains a latent variable $z = [z^{(i)}]_i$ (whose value is unknown)?
$$
\begin{aligned}
L^{(i)}(\theta) \stackrel{\text{def}}{=} \ln \Pr\{x^{(i)} \mid \theta\}
&= \ln \sum_{z^{(i)}} \Pr\{x^{(i)}, z^{(i)} \mid \theta\} \\
&= \ln \sum_{z^{(i)}} q(z^{(i)}) \, \frac{\Pr\{x^{(i)}, z^{(i)} \mid \theta\}}{q(z^{(i)})} \qquad (\dagger) \\
&\ge \sum_{z^{(i)}} q(z^{(i)}) \ln \frac{\Pr\{x^{(i)}, z^{(i)} \mid \theta\}}{q(z^{(i)})}
\end{aligned}
$$
If $q(z^{(i)}) = \Pr\{z^{(i)} \mid x^{(i)}, \theta\}$, then the equality holds.
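The bound $(\dagger)$ and its equality condition can be checked on a toy discrete example (a minimal sketch with made-up numbers; p_xz is an assumed joint table for a single observation, not from the slides):

    import numpy as np

    # Assumed toy joint distribution Pr{x, z} for one example with binary z.
    p_xz = np.array([0.15, 0.25])      # Pr{x, z=0}, Pr{x, z=1} at the observed x
    log_px = np.log(p_xz.sum())        # exact log-likelihood ln Pr{x}

    def lower_bound(q):
        # sum_z q(z) * ln( Pr{x, z} / q(z) )
        return np.sum(q * (np.log(p_xz) - np.log(q)))

    q_arbitrary = np.array([0.5, 0.5])
    q_posterior = p_xz / p_xz.sum()    # Pr{z | x}

    assert lower_bound(q_arbitrary) <= log_px + 1e-12
    assert np.isclose(lower_bound(q_posterior), log_px)   # equality at the posterior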
$\theta$ vs $\theta^{(\text{old})}$
Given the current parameter $\theta^{(\text{old})}$, let
$$q^{(\text{old})}(z^{(i)}) \stackrel{\text{def}}{=} \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\}.$$
Then
$$
\begin{aligned}
L^{(i)}(\theta) &= \ln \sum_{z^{(i)}} q^{(\text{old})}(z^{(i)}) \, \frac{\Pr\{x^{(i)}, z^{(i)} \mid \theta\}}{q^{(\text{old})}(z^{(i)})} \\
&\ge \sum_{z^{(i)}} q^{(\text{old})}(z^{(i)}) \ln \frac{\Pr\{x^{(i)}, z^{(i)} \mid \theta\}}{q^{(\text{old})}(z^{(i)})} \qquad (\dagger) \\
&= \underbrace{\sum_{z^{(i)}} \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\} \ln \Pr\{x^{(i)}, z^{(i)} \mid \theta\}}_{\stackrel{\text{def}}{=}\, Q^{(i)}(\theta,\, \theta^{(\text{old})})}
\;\underbrace{- \sum_{z^{(i)}} \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\} \ln \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\}}_{=\, C\text{, a constant in } \theta \text{ (the entropy of } q^{(\text{old})})}
\end{aligned}
$$
That is, $L^{(i)}(\theta) \ge Q^{(i)}(\theta, \theta^{(\text{old})}) + C$.
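The decomposition $L^{(i)}(\theta) \ge Q^{(i)}(\theta, \theta^{(\text{old})}) + C$, with equality at $\theta = \theta^{(\text{old})}$, can also be verified numerically on a made-up one-parameter model (illustrative code; the joint table is an assumption, not from the slides):

    import numpy as np

    def joint(theta):
        # Assumed toy joint Pr{x, z | theta} at one observed x, binary z.
        return np.array([(1 - theta) * 0.7, theta * 0.3])

    theta_old = 0.4
    q_old = joint(theta_old) / joint(theta_old).sum()   # posterior Pr{z | x, theta_old}
    entropy = -np.sum(q_old * np.log(q_old))            # C, constant in theta

    for theta in [0.1, 0.4, 0.9]:
        L = np.log(joint(theta).sum())                  # log-likelihood at theta
        Q = np.sum(q_old * np.log(joint(theta)))        # Q(theta, theta_old)
        assert L >= Q + entropy - 1e-12                 # lower bound holds everywhere
        if theta == theta_old:
            assert np.isclose(L, Q + entropy)           # tight at theta = theta_old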
EM
Hence, the EM algorithm iterates the following two steps:
[E-step]: Compute $q^{(\text{old})}(z^{(i)}) = \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\}$.
[M-step]: Find $\theta$ that maximizes the function $Q(\theta, \theta^{(\text{old})})$ (as derived above; just sum $Q^{(i)}$ over $i$).
Alternative interpretation:
$$
Q^{(i)}(\theta, \theta^{(\text{old})}) \stackrel{\text{def}}{=} \sum_{z^{(i)}} \Pr\{z^{(i)} \mid x^{(i)}, \theta^{(\text{old})}\} \ln \Pr\{x^{(i)}, z^{(i)} \mid \theta\}
= \mathbb{E}_{z^{(i)} \sim q^{(\text{old})}(z^{(i)})}\big[\ln \Pr\{x^{(i)}, z^{(i)} \mid \theta\}\big],
$$
i.e., the expected complete log-likelihood (function):
Sample $z^{(i)}$ from the proposal distribution $q$.
Then it is easy to compute the complete log-likelihood.
Do this for every possible $z^{(i)}$.
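Putting the two steps together, a generic EM loop could be sketched as follows (schematic only; e_step and m_step stand in for the model-specific computations and are not functions defined in the slides):

    def em(x, theta, e_step, m_step, n_iters=100):
        """Generic EM skeleton: e_step(x, theta) returns q(z) = Pr{z | x, theta};
        m_step(x, q) returns the theta maximizing the expected complete log-likelihood Q."""
        for _ in range(n_iters):
            q = e_step(x, theta)      # [E-step]
            theta = m_step(x, q)      # [M-step]
        return theta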
Illustration
[Diagram: the iteration alternates $\theta^{(t)} \xrightarrow{\text{E-step}} q^{(t)}(z^{(t)}) \xrightarrow{\text{M-step}} \theta^{(t+1)} \xrightarrow{\text{E-step}} q^{(t+1)}(z^{(t+1)}) \to \cdots$, with the log-likelihood lower bound tightened and then maximized at each step.]
How EM converges
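A sketch of the usual monotonicity argument, in the notation above (assuming the E-step uses the exact posterior, and writing $C^{(t)}$ for the entropy constant from the earlier derivation):
$$
\begin{aligned}
L(\theta^{(t+1)}) &\ge Q(\theta^{(t+1)}, \theta^{(t)}) + C^{(t)} && \text{(the lower bound holds for every } \theta) \\
&\ge Q(\theta^{(t)}, \theta^{(t)}) + C^{(t)} && \text{(M-step: } \theta^{(t+1)} \text{ maximizes } Q(\cdot, \theta^{(t)})) \\
&= L(\theta^{(t)}) && \text{(E-step makes the bound tight at } \theta^{(t)})
\end{aligned}
$$
Hence the log-likelihood never decreases across EM iterations.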
Example 1: Three Coins
Given three coins: z, a, and b, with head probabilities π, α, and β, respectively.
Generative process: if toss(z) == head, return(toss(a)); else return(toss(b)).
Observed data: $x = [1,1,0,1,0,0,1,0,1,1]$.
Goal: estimate the parameters $\pi$, $\alpha$, and $\beta$.
The usual assumption: all tosses are i.i.d.
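The generative process translates directly into code (a minimal sketch; function and variable names are illustrative, and the parameter values reuse the concrete example later in the slides):

    import numpy as np

    def sample_x(pi, alpha, beta, rng):
        """One draw from the three-coin generative process."""
        z = rng.random() < pi                  # toss coin z (head with probability pi)
        if z:
            return int(rng.random() < alpha)   # head: report a toss of coin a
        return int(rng.random() < beta)        # tail: report a toss of coin b

    rng = np.random.default_rng(0)
    x = [sample_x(0.6, 0.1, 0.8, rng) for _ in range(10)]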
If we know $\{z^{(i)}\}_{i=1}^{10}$:
Observed data:

    z^(i):          1 0 1 1 1 0 0 1 0 0
    coin -> x^(i):  a b a a a b b a b b
    x^(i):          1 1 0 1 0 0 1 0 1 1

$\pi_{\text{MLE}} = \;?\qquad \alpha_{\text{MLE}} = \;?\qquad \beta_{\text{MLE}} = \;?$
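With $z$ known, the MLEs are just per-coin head frequencies; a sketch of that computation on the table above (illustrative code, encoding head as 1):

    import numpy as np

    z = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0])
    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])

    pi_mle = z.mean()               # fraction of tosses where coin z came up head
    alpha_mle = x[z == 1].mean()    # head frequency of coin a (used when z = 1)
    beta_mle = x[z == 0].mean()     # head frequency of coin b (used when z = 0)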
Solution /1
Problem setup:
$\theta$ = ?
Missing data (i.e., $z$) = ?
Complete likelihood (for a single item): $\Pr\{x_i, z_i \mid \theta\}$ (change of notation henceforth)
The E-step: Given the current $\theta_t$, we can determine the distribution $q$:
$$
\mu_{i,t} \stackrel{\text{def}}{=} \Pr\{z_i = 1 \mid x_i, \theta_t\}
= \frac{\Pr\{z_i = 1, x_i \mid \theta_t\}}{\Pr\{x_i \mid \theta_t\}}
= \frac{\pi_t\, \alpha_t^{x_i} (1-\alpha_t)^{1-x_i}}{\pi_t\, \alpha_t^{x_i} (1-\alpha_t)^{1-x_i} + (1-\pi_t)\, \beta_t^{x_i} (1-\beta_t)^{1-x_i}}
$$
Numerator: $\Pr\{z_i = 1, x_i \mid \theta_t\} = \Pr\{z_i = 1 \mid \theta_t\} \cdot \Pr\{x_i \mid z_i = 1, \theta_t\}$; writing the Bernoulli likelihood as $\alpha_t^{x_i}(1-\alpha_t)^{1-x_i}$ is the typical trick to express the piece-wise function for the likelihood.
Denominator: sum over $z_i = 1$ and $z_i = 0$.
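This E-step can be sketched in code as follows (illustrative names, vectorized over the ten observations):

    import numpy as np

    def e_step(x, pi, alpha, beta):
        """Responsibilities mu_i = Pr{z_i = 1 | x_i, theta_t}."""
        num = pi * alpha**x * (1 - alpha)**(1 - x)              # Pr{z_i = 1, x_i | theta_t}
        den = num + (1 - pi) * beta**x * (1 - beta)**(1 - x)    # Pr{x_i | theta_t}
        return num / den

    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
    mu = e_step(x, pi=0.6, alpha=0.1, beta=0.8)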
Solution /2
Compute $Q(\theta, \theta^{(\text{old})})$. First,
$$
\begin{aligned}
\ln \Pr\{x_i, z_i \mid \theta\}
&= \ln\Big( \big[\pi\, \alpha^{x_i}(1-\alpha)^{1-x_i}\big]^{z_i} \cdot \big[(1-\pi)\, \beta^{x_i}(1-\beta)^{1-x_i}\big]^{1-z_i} \Big) \\
&= z_i \cdot \big(\ln\pi + x_i \ln\alpha + (1-x_i)\ln(1-\alpha)\big) + (1-z_i) \cdot \big(\ln(1-\pi) + x_i \ln\beta + (1-x_i)\ln(1-\beta)\big)
\end{aligned}
$$
Then:
$$
Q = \sum_i \sum_{z_i} q(z_i) \ln \Pr\{x_i, z_i \mid \theta\}
  = \sum_i \Big[ \mu_{i,t} \ln \Pr\{x_i, z_i = 1 \mid \theta\} + (1-\mu_{i,t}) \ln \Pr\{x_i, z_i = 0 \mid \theta\} \Big]
$$
The M-step:
$$
\begin{aligned}
\frac{\partial Q(\theta, \theta_t)}{\partial \pi} = 0 &\implies \pi_{t+1} = \frac{1}{n} \sum_i \mu_{i,t} \\
\frac{\partial Q(\theta, \theta_t)}{\partial \alpha} = 0 &\implies \alpha_{t+1} = \frac{\sum_i \mu_{i,t}\, x_i}{\sum_i \mu_{i,t}} \\
\frac{\partial Q(\theta, \theta_t)}{\partial \beta} = 0 &\implies \beta_{t+1} = \frac{\sum_i (1-\mu_{i,t})\, x_i}{\sum_i (1-\mu_{i,t})}
\end{aligned}
$$
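Together with the e_step sketched earlier, one full EM update for this model might look like this (illustrative code continuing the previous snippet):

    def m_step(x, mu):
        """Closed-form maximizers of Q for the three-coin model."""
        pi = mu.mean()                                    # (1/n) sum_i mu_i
        alpha = (mu * x).sum() / mu.sum()                 # weighted head frequency of coin a
        beta = ((1 - mu) * x).sum() / (1 - mu).sum()      # weighted head frequency of coin b
        return pi, alpha, beta

    pi, alpha, beta = 0.6, 0.1, 0.8
    for _ in range(20):                                   # iterate E- and M-steps
        mu = e_step(x, pi, alpha, beta)
        pi, alpha, beta = m_step(x, mu)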
Understanding the Equations
$$
\pi_{t+1} = \frac{1}{n} \sum_i \mu_{i,t}, \qquad
\alpha_{t+1} = \frac{\sum_i \mu_{i,t}\, x_i}{\sum_i \mu_{i,t}}, \qquad
\beta_{t+1} = \frac{\sum_i (1-\mu_{i,t})\, x_i}{\sum_i (1-\mu_{i,t})}
$$
Consider the example on the question page. In that example, we can deem $\mu_{i,t}$ a binary variable, i.e., $\mu_{i,t} = 1$ iff coin $z^{(i)}$ = head, or equivalently, coin $a$ is chosen to determine $x^{(i)}$. Then one can easily verify that the MLE estimates are the same as the update rules in EM. Therefore, these rules can be deemed a "soft" version of MLE: informally, each $x^{(i)}$ contributes $\mu_{i,t}$ to the parameter estimation of coin $a$, and $(1-\mu_{i,t})$ to the parameter estimation of coin $b$, as checked in the sketch below.
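A quick check of this claim, reusing the illustrative m_step above: plugging hard (0/1) responsibilities, taken from the known $z$ of the earlier table, into the EM update recovers exactly the complete-data MLEs.

    z = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0])          # the known latent tosses
    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
    pi_hat, alpha_hat, beta_hat = m_step(x, z.astype(float))   # "hard" mu_i in {0, 1}

    assert pi_hat == z.mean()                              # fraction of z-heads
    assert alpha_hat == x[z == 1].mean()                   # head frequency of coin a
    assert beta_hat == x[z == 0].mean()                    # head frequency of coin b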
Concrete Example
Take $\pi_t = 0.6$, $\alpha_t = 0.1$, $\beta_t = 0.8$.
$$
\begin{aligned}
\mu_{i,t} = p(z_i = 1 \mid x_i = 1, \theta_t)
&= \frac{p(z_i = 1, x_i = 1 \mid \theta_t)}{p(x_i = 1 \mid \theta_t)} \\
&= \frac{p(z_i = 1, x_i = 1 \mid \theta_t)}{p(z_i = 1, x_i = 1 \mid \theta_t) + p(z_i = 0, x_i = 1 \mid \theta_t)} \\
&= \frac{0.6 \cdot 0.1}{0.6 \cdot 0.1 + 0.4 \cdot 0.8} \approx 0.16
\end{aligned}
$$
Similarly,
$$
p(z_i = 1 \mid x_i = 0, \theta_t) = \frac{0.6 \cdot 0.9}{0.6 \cdot 0.9 + 0.4 \cdot 0.2} \approx 0.87
$$
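These two posteriors can be double-checked with the illustrative e_step defined earlier:

    mu_head = e_step(np.array([1]), pi=0.6, alpha=0.1, beta=0.8)   # approx. 0.16
    mu_tail = e_step(np.array([0]), pi=0.6, alpha=0.1, beta=0.8)   # approx. 0.87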
Concrete Example /2
How many different scenarios?

    z_i   x_i   p(z_i | x_i, θ_t)
     0     0        0.13
     0     1        0.84
     1     0        0.87
     1     1        0.16

Observations: 6 ones and 4 zeros.
$$
\begin{aligned}
\pi_{t+1} &= \frac{1}{n} \sum_i \mu_{i,t} = \frac{0.16 \cdot 6 + 0.87 \cdot 4}{10} \approx 0.44 \\
\alpha_{t+1} &= \frac{\sum_i \mu_{i,t}\, x_i}{\sum_i \mu_{i,t}} = \frac{0.16 \cdot 6}{0.16 \cdot 6 + 0.87 \cdot 4} \approx 0.22 \\
\beta_{t+1} &= \frac{\sum_i (1-\mu_{i,t})\, x_i}{\sum_i (1-\mu_{i,t})} = \frac{0.84 \cdot 6}{0.84 \cdot 6 + 0.13 \cdot 4} \approx 0.91
\end{aligned}
$$
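The same one-step update can be reproduced with the illustrative e_step and m_step from earlier (small differences from the hand computation come from rounding $\mu_{i,t}$ to two decimals):

    x = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
    mu = e_step(x, pi=0.6, alpha=0.1, beta=0.8)
    pi_1, alpha_1, beta_1 = m_step(x, mu)       # roughly (0.44, 0.21, 0.91)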