
The Expectation-Maximization (EM) algorithm
Can Yang
Department of Mathematics
The Hong Kong University of Science and Technology
Spring, 2020

Outline
Introduction to the EM algorithm
Finite mixture models
The EM algorithm for random-effects models
The EM algorithm in general
Case Study: Crowd Sourcing


Expectation of a function
• Let p(x|θ) be the probability density of a random variable x, where θ is the parameter of the density function. Then
  p(x|θ) ≥ 0,   ∫ p(x|θ) dx = 1.
• The expectation of x is given by
  E[x] = ∫_{−∞}^{+∞} x p(x|θ) dx.
  Note that E[x] involves θ but not x.
• Let f(x) be a function of x. Similarly, the expectation of f(x) is given by
  E[f] = ∫ f(x) p(x|θ) dx.   (1)
  Note that E[f] involves θ but not x.
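As a quick numeric illustration (not part of the slides): for p(x|θ) = N(x|μ = 1, σ² = 4) and f(x) = x², the exact expectation is E[f] = μ² + σ² = 5, which a simple Monte Carlo average reproduces. The sample size below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # draws from p(x|θ) = N(x|1, 2^2)
print(np.mean(x ** 2))  # Monte Carlo estimate of E[x^2]; exact value is 1^2 + 2^2 = 5
```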

Motivation of the EM algorithm
• The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables.
• We denote the set of all observed data by X, and similarly we denote the set of all latent variables by Z. The set of all model parameters is denoted by θ, and so the log-likelihood function is given by
  ln p(X|θ) = ln { ∑_Z p(X, Z|θ) }.
• A key observation is that the summation over the latent variables appears inside the logarithm. Even if the joint distribution
p(X, Z|θ) belongs to the exponential family, the marginal distribution p(X|θ) typically does not as a result of this summation. The presence of the sum prevents the logarithm from acting directly on the joint distribution, resulting in complicated expressions for the maximum likelihood solution.

Motivation of the EM algorithm
• Now suppose that, for each observation in X, we were told the corresponding value of the latent variable Z.
• We shall call {X, Z} the complete data set, and we shall refer to the actual observed data X as incomplete. The likelihood function for the complete data set simply takes the form ln p(X, Z|θ), and we shall suppose that maximization of this complete-data log-likelihood function is straightforward.
• In practice, however, we are not given the complete data set {X, Z}, but only the incomplete data X. Our state of knowledge of the values of the latent variables in Z is given only by the posterior distribution p(Z|X, θ).
• Because we cannot use the complete-data log-likelihood, we consider instead its expected value under the posterior distribution of the latent variables, which corresponds to the E step of the EM algorithm.
• In the subsequent M step, we maximize this expectation.

An abstract of the EM algorithm
• In the E step, we use the current parameter values θold to find the posterior distribution of the latent variables, given by p(Z|X, θold).
• We then use this posterior distribution to find the expectation of the complete-data log-likelihood evaluated for some general parameter value θ. This expectation, denoted Q(θ, θold), is given by
  Q(θ, θold) = E_{Z|X,θold}[ln p(X, Z|θ)] = ∑_Z p(Z|X, θold) ln p(X, Z|θ).
• In the M step, we determine the revised parameter estimate θnew by maximizing this function:
  θnew = argmax_θ Q(θ, θold).
Note that in the definition of Q(θ, θold), the logarithm acts directly on the joint distribution p(X, Z|θ), and so the corresponding M-step maximization will, by supposition, be tractable. A schematic sketch of this two-step loop is given below.
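The sketch below is only schematic: `e_step`, `m_step`, and `log_likelihood` are hypothetical model-specific callables (concrete versions for the Gaussian mixture and the random-effects model appear later in these notes), and the stopping rule is an illustrative choice.

```python
# Schematic EM loop: the E-step turns (data, θ_old) into the posterior over Z
# (the ingredients of Q(θ, θ_old)); the M-step returns argmax_θ Q(θ, θ_old).
from typing import Any, Callable

def em(data: Any, theta: Any,
       e_step: Callable[[Any, Any], Any],
       m_step: Callable[[Any, Any], Any],
       log_likelihood: Callable[[Any, Any], float],
       max_iter: int = 100, tol: float = 1e-6) -> Any:
    ll_old = float("-inf")
    for _ in range(max_iter):
        posterior = e_step(data, theta)      # E-step: p(Z|X, θ_old)
        theta = m_step(data, posterior)      # M-step: maximize Q(θ, θ_old)
        ll = log_likelihood(data, theta)     # ln p(X|θ); EM never decreases it
        if ll - ll_old < tol:                # stop when the improvement is tiny
            break
        ll_old = ll
    return theta
```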

Outline
Introduction to the EM algorithm
Finite mixture models
The EM algorithm for random-effects models
The EM algorithm in general
Case Study: Crowd Sourcing

Mixture of Gaussian
• Suppose we have a data set of observations X = {x1, …, xN}, and all the observations are drawn independently from the mixture distribution
  p(x) = ∑_{k=1}^K πk N(x|μk, Σk).
• Let θ = {πk, μk, Σk}k=1,…,K be the model parameters. Consider the problem of maximizing the likelihood for the data set X. The log-likelihood function is
  ln p(X|θ) = ∑_{n=1}^N ln { ∑_{k=1}^K πk N(xn|μk, Σk) }.   (2)

[Figure: scatter plot of a two-dimensional data set drawn from a mixture of Gaussians.]

• Let zn = [zn1, …, znK] be the latent variable indicating the membership of xn, i.e., znk = 1 if xn comes from the k-th component and znk = 0 otherwise. Clearly, ∑_k znk = 1.
• Let πk be the proportion of data points from the k-th component:
  πk = p(znk = 1),   0 ≤ πk ≤ 1,   ∑_k πk = 1   ⇒   p(zn) = ∏_{k=1}^K πk^{znk}.
• Similarly, the conditional distribution of xn given a particular value of zn is a Gaussian:
  p(xn|znk = 1) = N(xn|μk, Σk)   ⇒   p(xn|zn) = ∏_{k=1}^K N(xn|μk, Σk)^{znk}.
• A compact representation:
  π1 = p(zn1 = 1),  π2 = p(zn2 = 1),  …,  πK = p(znK = 1),   with ∑_{k=1}^K znk = 1.
• The joint distribution of {xn, zn} is
  p(xn, zn) = p(xn|zn) p(zn|π) = ∏_{k=1}^K (πk N(xn|μk, Σk))^{znk}.
• The marginal of xn is
  p(xn) = ∑_{zn} p(xn, zn) = ∑_{zn} ∏_{k=1}^K (πk N(xn|μk, Σk))^{znk} = ∑_{k=1}^K πk N(xn|μk, Σk).
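A minimal sketch of ancestral sampling from the joint p(xn, zn) = p(zn|π) p(xn|zn): first draw the component label from π, then draw xn from the selected Gaussian. The particular π, μk, Σk values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])                       # mixing proportions p(z)
mu = np.array([[-2.0, 0.0], [3.0, 3.0]])        # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

N = 5
z = rng.choice(len(pi), size=N, p=pi)           # z_n ~ p(z_n | pi)
x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # x_n ~ N(mu_{z_n}, Sigma_{z_n})
print(z)
print(x)
```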

• For all N data points (by the i.i.d. assumption), the complete-data likelihood function takes the form
  p(X, Z|θ) = ∏_{n=1}^N ∏_{k=1}^K (πk N(xn|μk, Σk))^{znk}.
• Then the complete-data log-likelihood function is
  ln p(X, Z|θ) = ∑_{n=1}^N ∑_{k=1}^K znk {ln πk + ln N(xn|μk, Σk)}.   (3)
• Compared with the log-likelihood (2) for the incomplete data, the summation over k and the logarithm have been interchanged in the complete-data log-likelihood (3). The logarithm now acts directly on the Gaussian distribution, which itself is a member of the exponential family.
• Although the complete-data log-likelihood function can be maximized trivially in closed form, the solution involves the latent variables, which are not observed.
• As discussed for the EM algorithm, in the E-step we therefore consider the expectation, with respect to the posterior distribution of the latent variables, of the complete-data log-likelihood:
  E_{Z|X,θold}[ln p(X, Z|θ)] = ∑_{n=1}^N ∑_{k=1}^K E[znk|X, θold] {ln πk + ln N(xn|μk, Σk)}.   (4)

• Note that
  E[znk|X, θold] = p(znk = 1|xn, θold)
                 = p(znk = 1, xn|θold) / p(xn|θold)
                 = p(xn|znk = 1, μk, Σk) p(znk = 1|πk) / ∑_{j=1}^K p(znj = 1, xn|πj, μj, Σj)
                 = πk N(xn|μk, Σk) / ∑_{j=1}^K πj N(xn|μj, Σj).   (5)
• Let us denote E[znk|X, θold] by γ(znk) for convenience.
• Note that
  ∑_{n=1}^N ∑_{k=1}^K γ(znk) = ∑_{n=1}^N ∑_{k=1}^K E[znk|X, θold]
                             = ∑_{n=1}^N ∑_{k=1}^K πk N(xn|μk, Σk) / ∑_{j=1}^K πj N(xn|μj, Σj)
                             = ∑_{n=1}^N ( ∑_{k=1}^K πk N(xn|μk, Σk) ) / ( ∑_{j=1}^K πj N(xn|μj, Σj) )
                             = ∑_{n=1}^N 1 = N.

• In the M-step, we consider maximization of the following function with respect to the parameters θ = {πk, μk, Σk}k=1,…,K:
  E_{Z|X,θold}[ln p(X, Z|θ)] = ∑_{n=1}^N ∑_{k=1}^K γ(znk) {ln πk + ln N(xn|μk, Σk)}.
• In the literature, this function is often called the Q(θ, θold) function.
• Maximizing the Q(θ, θold) function w.r.t. πk (with the constraints ∑_k πk = 1, πk ≥ 0), we introduce a Lagrange multiplier λ and set the derivative to zero:
  ∑_{n=1}^N γ(znk)/πk + λ = 0  ⇒  ∑_{n=1}^N γ(znk) + λ πk = 0  ⇒  ∑_{k=1}^K ( ∑_{n=1}^N γ(znk) + λ πk ) = 0.
  Using ∑_{k=1}^K ∑_{n=1}^N γ(znk) = N and ∑_k πk = 1, we obtain λ = −N, and hence
  πk = ∑_{n=1}^N γ(znk) / N.
• Maximizing the Q(θ, θold) function w.r.t. μk, we have
  μk = ∑_{n=1}^N γ(znk) xn / ∑_n γ(znk).
• Maximizing the Q(θ, θold) function w.r.t. Σk, we have
  Σk = ∑_{n=1}^N γ(znk)(xn − μk)(xn − μk)^T / ∑_n γ(znk).
• Let Nk = ∑_n γ(znk). In fact, Nk can be interpreted as the effective number of points assigned to the k-th component.

EM for the mixture of Gaussian
• Step 1. Initialize θold, i.e., the means μk, covariances Σk and mixing coefficients πk, and evaluate the initial value of the incomplete-data log-likelihood.
• Step 2. E-step: Evaluate
  γ(znk) = E[znk|X, θold] = πk N(xn|μk, Σk) / ∑_{j=1}^K πj N(xn|μj, Σj).
• Step 3. M-step: Update the parameters using
  πk^new = ∑_{n=1}^N γ(znk) / N;
  μk^new = ∑_{n=1}^N γ(znk) xn / ∑_n γ(znk);
  Σk^new = ∑_{n=1}^N γ(znk)(xn − μk^new)(xn − μk^new)^T / ∑_n γ(znk).
• Step 4. Evaluate the incomplete-data log-likelihood
  ln p(X|π, μ, Σ) = ∑_{n=1}^N ln { ∑_{k=1}^K πk N(xn|μk, Σk) },
  and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, go back to Step 2; otherwise stop. A minimal implementation sketch of these four steps is given below.
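The following is a minimal NumPy/SciPy sketch of Steps 1-4 above. The initialization scheme, the small ridge added to the covariances for numerical stability, the convergence tolerance, and the synthetic data in the usage example are illustrative assumptions rather than part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize mixing proportions, means and covariances.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    ll_old = -np.inf
    for _ in range(max_iter):
        # Step 2 (E-step): responsibilities gamma[n, k] = pi_k N(x_n|mu_k, Sigma_k) / sum_j ...
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 3 (M-step): closed-form updates for pi_k, mu_k, Sigma_k.
        Nk = gamma.sum(axis=0)            # effective number of points per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        # Step 4: incomplete-data log-likelihood (at the parameters used in this E-step)
        # and convergence check.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return pi, mu, Sigma, gamma

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(200, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print("mixing proportions:", pi)
print("means:", mu)
```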

Figure 1: Illustration of the EM algorithm using the Old Faithful data set (Bishop, Pattern Recognition and Machine Learning, 2006), where L is the number of EM iterations; the labelled panels show the fit after L = 1, 2, 5 and 20 iterations.

K-means clustering
• Binary indicator variables rnk ∈ {0, 1}: if data point xn is assigned to cluster k, then rnk = 1 and rnj = 0 for j ≠ k.
• The objective function:
  J = ∑_{n=1}^N ∑_{k=1}^K rnk ∥xn − μk∥².
• The algorithm (a minimal code sketch follows below):
  • Step 1: Randomly initialize μk, k = 1, …, K.
  • Step 2: Update rnk as
    rnk = 1 if k = argmin_j ∥xn − μj∥², and rnk = 0 otherwise.
  • Step 3: Update μk as
    μk = ∑_n rnk xn / ∑_n rnk.
  • Iterate Steps 2 and 3 until convergence.
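A minimal NumPy sketch of these three steps; the initialization by sampling K data points, the iteration cap, and the handling of empty clusters are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # Step 1: random initial centers
    for _ in range(max_iter):
        # Step 2: hard assignment r_nk: each point goes to its nearest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        r = d2.argmin(axis=1)
        # Step 3: each center becomes the mean of its assigned points
        # (empty clusters keep their previous center).
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                 # centers have stabilized
            break
        mu = new_mu
    return mu, r
```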

Relation to K-means
• Consider a Gaussian mixture model in which the covariance matrices of the mixture components are given by εI, where ε is a variance parameter shared by all of the components and I is the identity matrix, so that
  p(x|μk, Σk) = (1/(2πε)^{D/2}) exp{ −∥x − μk∥²/(2ε) }.
• The posterior probabilities (responsibilities) for a particular data point xn are given by
  γ(znk) = πk exp{−∥xn − μk∥²/(2ε)} / ∑_j πj exp{−∥xn − μj∥²/(2ε)}.
• Consider the limit case ε → 0. We have γ(znk) → rnk, i.e., the data point is assigned entirely to its nearest mean (checked numerically below).
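A small numeric check of this limit; the single point, the two means, and the mixing weights below are assumed only for illustration.

```python
import numpy as np

x = np.array([0.9, 0.2])                      # a single data point
mu = np.array([[1.0, 0.0], [-1.0, 0.0]])      # two component means
pi = np.array([0.5, 0.5])                     # equal mixing weights

for eps in (1.0, 0.1, 0.01):
    logits = np.log(pi) - np.sum((x - mu) ** 2, axis=1) / (2 * eps)
    gamma = np.exp(logits - logits.max())     # subtract the max for numerical stability
    gamma /= gamma.sum()
    print(f"eps = {eps}: gamma = {gamma.round(4)}")
# gamma tends to [1, 0]: the point is assigned entirely to its nearest mean, as in K-means.
```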

Relation to K-means
• The EM re-estimation equation for μk then becomes the same as the K-means update.
• Finally, in the limit ε → 0, the expected complete-data log-likelihood
  E_{Z|X,θold}[ln p(X, Z|θ)] = ∑_{n=1}^N ∑_{k=1}^K γ(znk){ln πk + ln N(xn|μk, Σk)}
  becomes
  ∑_{n=1}^N ∑_{k=1}^K rnk { −∥xn − μk∥²/(2ε) } + const.   (6)
• Maximizing (6) is equivalent to minimizing
  J = ∑_{n=1}^N ∑_{k=1}^K rnk ∥xn − μk∥².
• To summarize, the K-means algorithm can be viewed as the ε → 0 limit of the EM algorithm for this Gaussian mixture model.

Outline
Introduction to the EM algorithm
Finite mixture models
The EM algorithm for random-effects models
The EM algorithm in general
Case Study: Crowd Sourcing

The EM algorithm for random-effects models
• As a third example of the application of EM, we consider the random-effects model
  y = Wu + e,   u ~ N(0, σu² I),   e ~ N(0, σe² I).   (6)
• y ∈ R^{N×1}: the vector of responses, where N is the sample size.
• W ∈ R^{N×c}: the random-effects design matrix, with rows w1^T, …, wN^T.
• u ∈ R^{c×1}: the vector of coefficients corresponding to the random effects.
• e ∈ R^{N×1}: the error term.
• Let α = 1/σu² and β = 1/σe² be the precisions (the inverse of a variance is called a precision).
• Here we can view u as the latent variable and (α, β) as the model parameters.
• The complete-data log-likelihood function is then given by
  ln p(y, u|α, β) = ln p(y|u, β) + ln p(u|α),
  where
  p(y|u, β) = ∏_{n=1}^N N(yn|u^T wn, β^{−1}),
  p(u|α) = N(u|0, α^{−1} I).

• Therefore, the complete-data log-likelihood function can be further written as
  ln p(y, u|α, β) = (N/2) ln(β/(2π)) − (β/2) ∑_{n=1}^N (yn − u^T wn)² + (c/2) ln(α/(2π)) − (α/2) u^T u.
• Suppose we have the previous parameters θold = [αold, βold]. We can now take the expectation w.r.t. the posterior distribution of u (at θold):
  E_{u|y,W,θold}[ln p(y, u|α, β)] = (N/2) ln(β/(2π)) − (β/2) ∑_{n=1}^N E_{u|y,W,θold}[(yn − u^T wn)²] + (c/2) ln(α/(2π)) − (α/2) E_{u|y,W,θold}[u^T u].
• To evaluate the expectations here, we need the following basic property of the multivariate Gaussian distribution (e.g., see Equation (2.62) in Bishop, Pattern Recognition and Machine Learning): if x ∈ R^D follows the multivariate Gaussian N(μ, Σ), then
  E[x x^T] = μ μ^T + Σ.   (5)
• For this problem, the posterior distribution of u is given by
  u|y, W, α, β ~ N(μr, Σr),
  where
  μr = ((α/β) I + W^T W)^{−1} W^T y,   Σr = (α I + β W^T W)^{−1}.
  The subscript "r" indicates a random-effects model.

• Now we can evaluate E_{u|y,W,θold}[(yn − u^T wn)²] as follows (all expectations below are w.r.t. u|y, W, θold):
  E[(yn − u^T wn)²] = E[Tr[(yn − u^T wn)²]]                           (since (yn − u^T wn)² is a scalar)
                    = E[yn² − 2 yn u^T wn + Tr(wn^T u u^T wn)]
                    = E[yn² − 2 yn u^T wn + Tr(wn wn^T u u^T)]
                    = yn² − 2 yn E[u^T] wn + Tr(wn wn^T E[u u^T])
                    = yn² − 2 yn μr^T wn + Tr[wn wn^T (μr μr^T + Σr)]   (using Equation (5)).
• Similarly, we have
  E[u^T u] = μr^T μr + Tr(Σr).
• In the M-step, setting the derivative w.r.t. α to zero, we obtain
  α = c / E[u^T u] = c / (μr^T μr + Tr(Σr)).
• An analogous result holds for β:
  β = N / ( ∥y − W μr∥² + Tr[W^T W Σr] )
  (a short numeric sketch of these updates is given below).
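A minimal sketch of these E- and M-step updates for the two precisions; the simulated data, the starting values α = β = 1, and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np

def em_random_effects(y, W, alpha=1.0, beta=1.0, n_iter=200):
    N, c = W.shape
    WtW, Wty = W.T @ W, W.T @ y
    for _ in range(n_iter):
        # E-step: posterior of u at the current (alpha, beta).
        Sigma_r = np.linalg.inv(alpha * np.eye(c) + beta * WtW)
        mu_r = beta * Sigma_r @ Wty
        # M-step: closed-form updates obtained by setting the derivatives to zero.
        alpha = c / (mu_r @ mu_r + np.trace(Sigma_r))
        beta = N / (np.sum((y - W @ mu_r) ** 2) + np.trace(WtW @ Sigma_r))
    return alpha, beta, mu_r, Sigma_r

# Example usage on simulated data with known variance components.
rng = np.random.default_rng(0)
N, c = 500, 20
W = rng.normal(size=(N, c))
u = rng.normal(scale=0.5, size=c)           # sigma_u = 0.5, so the true alpha is 4
y = W @ u + rng.normal(scale=1.0, size=N)   # sigma_e = 1.0, so the true beta is 1
alpha_hat, beta_hat, _, _ = em_random_effects(y, W)
print("estimated alpha, beta:", alpha_hat, beta_hat)
```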

Outline
Introduction to the EM algorithm
Finite mixture models
The EM algorithm for random-effects models
The EM algorithm in general
Case Study: Crowd Sourcing

• Consider a probabilistic model in which we collectively denote all of the observed variables by X and all of the hidden variables by Z. The joint distribution p(X, Z|θ) is governed by a set of parameters denoted θ. Our goal is to maximize the likelihood function
  p(X|θ) = ∑_Z p(X, Z|θ),   (6)
  where Z is assumed to be discrete; the summation can be replaced by an integration if Z is continuous.
• Suppose that direct optimization of p(X|θ) is difficult, but that optimization of the complete-data likelihood function p(X, Z|θ) is significantly easier. Next we introduce a distribution q(Z) defined over the latent variables, and we observe that, for any choice of q(Z), the following decomposition holds:
  ln p(X|θ) = L(q, θ) + KL(q||p),   (7)
  where
  L(q, θ) = ∑_Z q(Z) ln { p(X, Z|θ) / q(Z) },
  KL(q||p) = − ∑_Z q(Z) ln { p(Z|X, θ) / q(Z) }.
• It is easy to verify Equation (7) using p(X, Z|θ) = p(X|θ) p(Z|X, θ): write ln{p(X, Z|θ)/q(Z)} = ln p(X|θ) + ln{p(Z|X, θ)/q(Z)} and take the expectation of both sides under q(Z).

Ascent property of EM
• Kullback-Leibler divergence (KL divergence, also known as relative entropy):
  KL(q||p) = − ∫ q(x) ln { p(x)/q(x) } dx,
  where p(x) and q(x) are both probability density functions.
• Recall that Jensen's inequality, f(E[x]) ≤ E[f(x)], holds for any convex function f(x).
• Applying Jensen's inequality to f(x) = − ln(x) in the KL divergence, we have
  KL(q||p) = − ∫ q(x) ln { p(x)/q(x) } dx ≥ − ln ∫ q(x) { p(x)/q(x) } dx = − ln ∫ p(x) dx = 0.
• Using this result, we can see that ln p(X|θ) ≥ L(q, θ), i.e., L(q, θ) is a lower bound of ln p(X|θ).

Ascent property of EM:
• In the E-step, ln p(X|θold) = L(q, θold) if and only if q(Z) = p(Z|X, θold).
• In the M-step, L(q, θold) ≤ L(q, θnew).
• L(q, θnew) ≤ L(q′, θnew) = ln p(X|θnew), where q′(Z) = p(Z|X, θnew) and the inequality holds since L(q, θ) is a lower bound of ln p(X|θ).
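A small numeric check of the decomposition (7) and of the lower-bound property, using a toy two-component mixture with a single observation; the particular weights, means, and the arbitrary q are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import norm

x = 1.3
pi = np.array([0.4, 0.6])              # p(z)
mu = np.array([-1.0, 2.0])             # component means (unit variance)
joint = pi * norm.pdf(x, mu, 1.0)      # p(x, z|theta) for z = 0, 1
log_px = np.log(joint.sum())           # ln p(x|theta)
post = joint / joint.sum()             # p(z|x, theta)

def decompose(q):
    L = np.sum(q * np.log(joint / q))      # lower bound L(q, theta)
    KL = -np.sum(q * np.log(post / q))     # KL(q || p(z|x, theta)) >= 0
    return L, KL

for q in (np.array([0.5, 0.5]), post):
    L, KL = decompose(q)
    print(f"L = {L:.4f}, KL = {KL:.4f}, L + KL = {L + KL:.4f}, ln p(x) = {log_px:.4f}")
# For any q, L + KL equals ln p(x|theta) and L <= ln p(x|theta);
# the bound is tight (KL = 0) exactly when q is the posterior.
```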

• Here the red curve depicts the (incomplete-data) log-likelihood function whose value we wish to maximize.
• We start with some initial parameter value θold, and in the first E-step we evaluate the posterior distribution over the latent variables, which gives rise to a lower bound L(q, θold) whose value equals the log-likelihood at θold, as shown by the blue curve.
• In the M-step, the bound is maximized, giving the value θnew, which gives a larger value of the log-likelihood than θold.
• The subsequent E-step then constructs a bound that is tangent to the log-likelihood at θnew, as shown by the green curve.

Figure 2: The EM algorithm involves alternately computing a lower bound L(q, θ) on the log-likelihood ln p(X|θ) at the current parameter values and then maximizing this bound to obtain the new parameter values (θold → θnew). This is the connection between the EM algorithm and the Minorization-Maximization (MM) algorithm.

Outline
Introduction to the EM algorithm
Finite mixture models
The EM algorithm for random-effects models
The EM algorithm in general
Case Study: Crowd Sourcing

Introduction
• Supervised learning: training data comprises examples of input vectors along with their corresponding target vectors.
• What if the target values are subjectively labeled? Or what if it is infeasible or expensive to obtain reliable labels?
• Motivation: medical images
  • The actual gold standard is obtained by biopsy: expensive, invasive, potentially dangerous.
  • Radiologists visually examine the medical images, which provides a subjective version of the gold standard.
  • Computer-aided diagnosis (CAD): build a classifier to predict whether a suspicious region on a medical image (like an X-ray, CT scan, or MRI) is malignant or benign.

Use of subjective labels
• Crowdsourcing: Amazon Mechanical Turk, Galaxy Zoo, …
• Advantage: cheaper.
• Disadvantage: noisy labels, hard to evaluate without a gold standard.
Figure 3: Left: Middle Age Mechanical Turk. Right: Amazon Mechanical Turk

Goals & Problems
• Main task: jointly learn the classifier and estimate the gold standard from multiple noisy labels.
• Challenges:
  • How to adapt conventional supervised learning algorithms when multiple annotators take the place of the gold standard?
  • How to evaluate the system without a gold standard?
  • How to evaluate the performance of each annotator?

Majority Voting
• Majority Vote:
  ŷi = 1 if (1/R) ∑_{j=1}^R yi^j > 0.5,   ŷi = 0 if (1/R) ∑_{j=1}^R yi^j < 0.5.
• Soft Threshold:
  Pr[yi = 1|yi^1, …, yi^R] = (1/R) ∑_{j=1}^R yi^j.
• Majority Vote fails when there is only one true expert and all the novices give the same incorrect label to a specific instance.

Two-coin Model for Annotators
• Let y^j ∈ {0, 1}, j = 1, …, R, be the label assigned to the instance x by the j-th annotator, and let y be the actual (unobserved) label for this instance.
• If the true label is one, a coin with bias αj is flipped; "one" is annotated to this instance if a head is obtained, otherwise "zero" is annotated:
  sensitivity: αj := Pr[y^j = 1|y = 1].   (8)
• On the other hand, if the true label is zero, a coin with bias βj is flipped; "zero" is annotated if a head is obtained, otherwise "one" is annotated:
  specificity: βj := Pr[y^j = 0|y = 0].   (9)
• Assumption: αj and βj do not depend on the instance x, i.e., the radiologist's performance is consistent across different sub-groups of the data.

Classification Model
• A sigmoid logistic link is suggested for the classifier:
  Pr[y = 1|x, w] = σ(w^T x),   where σ(z) = 1/(1 + e^{−z}).
• Given the data D consisting of N instances with annotations from R annotators, i.e., D = {xi, yi^1, …, yi^R}_{i=1}^N, the task is to estimate the weight vector w together with the sensitivities α = [α1, …, αR] and the specificities β = [β1, …, βR] of the R annotators.
• Another task is to obtain an estimate of the unknown gold standard y1, …, yN.

Maximum Likelihood
• Let θ = {w, α, β} denote the parameters. The marginal likelihood function is
  Pr[D|θ] = ∏_{i=1}^N Pr[yi^1, …, yi^R|xi, θ]
          = ∏_{i=1}^N { Pr[yi^1, …, yi^R|yi = 1, α] Pr[yi = 1|xi, w] + Pr[yi^1, …, yi^R|yi = 0, β] Pr[yi = 0|xi, w] }.   (10)
• Assumption: given the true label yi, the annotations yi^1, …, yi^R are independent, i.e., the annotators make their decisions independently. Hence
  Pr[yi^1, …, yi^R|yi = 1, α] = ∏_{j=1}^R Pr[yi^j|yi = 1, αj] = ∏_{j=1}^R [αj]^{yi^j} [1 − αj]^{1 − yi^j},
  Pr[yi^1, …, yi^R|yi = 0, β] = ∏_{j=1}^R Pr[yi^j|yi = 0, βj] = ∏_{j=1}^R [βj]^{1 − yi^j} [1 − βj]^{yi^j}.
• MLE:
  θ̂_ML = {α̂, β̂, ŵ} = argmax_θ {log Pr[D|θ]}.

The EM Algorithm
• The complete-data likelihood is
  Pr[D, y|θ] = ∏_i Pr[yi^1, …, yi^R, yi|xi; θ]
             = ∏_i Pr[yi^1, …, yi^R|yi; α, β] Pr[yi|xi; w]
             = ∏_i ai^{yi} pi^{yi} (1 − pi)^{1−yi} bi^{1−yi},
  where
  pi := σ(w^T xi),
  ai := ∏_{j=1}^R [αj]^{yi^j} [1 − αj]^{1 − yi^j},
  bi := ∏_{j=1}^R [βj]^{1 − yi^j} [1 − βj]^{yi^j}.   (11)
• The complete-data log-likelihood is
  log Pr[D, y|θ] = ∑_{i=1}^N yi log[pi ai] + (1 − yi) log[(1 − pi) bi].   (12)

EM Algorithm: E-Step
• We first take the expectation with respect to the posterior of yi:
  E{log Pr[D, y|θ]} = ∑_{i=1}^N μi log[pi ai] + (1 − μi) log[(1 − pi) bi],   (13)
  where the expectation is with respect to yi under the currently estimated posterior probability
  μi := Pr[yi = 1|yi^1, …, yi^R, xi, θ]
      = Pr[yi^1, …, yi^R|yi = 1, α] · Pr[yi = 1|xi, w] / Pr[yi^1, …, yi^R|xi; θ]
      = ai pi / (ai pi + bi (1 − pi)).   (14)

EM Algorithm: M-Step
• Based on the μi obtained in the E-step, the model parameters θ are estimated by maximizing the Q function (13). Setting the gradient of the Q function to zero, we have closed-form solutions for αj and βj:
  αj = ∑_{i=1}^N μi yi^j / ∑_{i=1}^N μi,   βj = ∑_{i=1}^N (1 − μi)(1 − yi^j) / ∑_{i=1}^N (1 − μi).   (15)
• Due to the non-linearity of the sigmoid, we do not have a closed-form solution for w, so a Newton-Raphson update is used: w_{t+1} = w_t − η H^{−1} g, where g is the gradient, H is the Hessian matrix, and η is the step size, with
  g = ∑_{i=1}^N [μi − σ(w^T xi)] xi,
  H = − ∑_{i=1}^N σ(w^T xi)[1 − σ(w^T xi)] xi xi^T.   (16)
  A minimal implementation sketch of these E- and M-step updates is given below.
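To make the two-coin EM concrete, here is a minimal sketch of the updates (14)-(16). The initialization of μi by the soft majority vote, the plain Newton steps with a small ridge for numerical stability, and the fixed iteration counts are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_crowds(X, Y, n_iter=50, newton_steps=5):
    """X: N x d feature matrix; Y: N x R matrix of noisy labels in {0, 1}."""
    N, d = X.shape
    mu = Y.mean(axis=1)               # initialize the posterior with the soft majority vote
    w = np.zeros(d)
    for _ in range(n_iter):
        # M-step: closed-form sensitivities/specificities (15), Newton-Raphson for w (16).
        alpha = (mu @ Y) / mu.sum()
        beta = ((1 - mu) @ (1 - Y)) / (1 - mu).sum()
        for _ in range(newton_steps):
            p = sigmoid(X @ w)
            g = X.T @ (mu - p)                              # gradient
            H = -(X * (p * (1 - p))[:, None]).T @ X         # Hessian
            w -= np.linalg.solve(H - 1e-6 * np.eye(d), g)   # Newton step (ridge for stability)
        # E-step: posterior probability mu_i that the true label is one, Equation (14).
        p = sigmoid(X @ w)
        a = np.prod(alpha ** Y * (1 - alpha) ** (1 - Y), axis=1)
        b = np.prod(beta ** (1 - Y) * (1 - beta) ** Y, axis=1)
        mu = a * p / (a * p + b * (1 - p))
    return w, alpha, beta, mu
```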
References
Bishop, C. Pattern Recognition and Machine Learning. Springer, 2006.
McLachlan, G., Krishnan, T. The EM Algorithm and Extensions, 2nd ed. Wiley-Interscience, 2008.
McLachlan, G., Peel, D. Finite Mixture Models. Wiley-Interscience, 2000.