Missing Data and EM
MAST90083 Computational Statistics and Data Mining
School of Mathematics & Statistics The University of Melbourne
Outline
§5.1 Introduction
§5.2 Motivation
§5.3 Expectation-Maximization
§5.4 Derivation of the EM
§5.5 Newton-Raphson and Fisher Scoring
Introduction
Assume a set of observations y = {y1, . . . , yN } representing i.i.d. samples from a random variable y
We aim to model this data set by specifying a parametric probability density model
y ∼ g (y;θ)
The vector θ represents one or more unknown parameters that govern the distribution of the random variable y
Example
If we assume that y has a normal distribution with mean μ and variance σ², then
$$\theta = (\mu, \sigma^2)$$
and
$$g(y;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$
Given the sample y, we aim to find the parameter vector that is most likely the "true" parameter vector of the DGP that generated the sample set y
Maximum Likelihood
The probability density function of the set of observations under the model g (y;θ) is
$$L(y;\theta) = g(y;\theta) = g(y_1,\dots,y_N;\theta) = \prod_{i=1}^N g(y_i;\theta)$$
L(y;θ) defines the likelihood function. It is a function of the unknown parameter θ, with the set of observations y = {y1, . . . , yN} held fixed.
The maximum likelihood method is the most popular technique of parameter estimation. It consists of finding the most likely estimate θ̂ by maximizing L(y;θ):
$$\hat\theta = \arg\max_\theta L(y;\theta)$$
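To make this concrete, here is a minimal numerical sketch (added for illustration, not part of the original slides) of computing the MLE for the normal example by minimizing the negative log-likelihood with SciPy; the simulated data and starting values are arbitrary.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)     # simulated sample from the assumed DGP

def neg_log_likelihood(params, y):
    mu, log_sigma = params                        # work with log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(y,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                          # close to the sample mean and std
```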
Maximum Likelihood
The log-likelihood corresponds to the logarithm of L (y; θ)
$$l(y;\theta) = \sum_{i=1}^N l(y_i;\theta) = \sum_{i=1}^N \log g(y_i;\theta)$$
and l(y_i;θ) = log g(y_i;θ) is called the log-likelihood component
The maximum likelihood estimate is generally obtained by maximizing l(y;θ)
The likelihood function is also used to assess the precision of θˆ
Maximum Likelihood
The score function is defined by
$$\dot{l}(y;\theta) = \sum_{i=1}^N \dot{l}(y_i;\theta), \qquad \text{where} \quad \dot{l}(y;\theta) = \frac{\partial l(y;\theta)}{\partial\theta}$$
At the maximum in the parameter space, $\dot{l}(y;\theta) = 0$
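As a worked example (added here for concreteness), for the normal model with θ = (μ, σ²) the score has components
$$\dot{l}(y;\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^N (y_i - \mu) \\[6pt] -\dfrac{N}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^N (y_i - \mu)^2 \end{pmatrix},$$
and setting this to zero gives $\hat\mu = \bar y$ and $\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^N (y_i - \bar y)^2$.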
Properties of Maximum Likelihood
The information matrix is defined by
$$I(\theta) = -\sum_{i=1}^N \frac{\partial^2 l(y_i;\theta)}{\partial\theta\,\partial\theta^\top}$$
I(θ) evaluated at θ = θ̂ is the observed information, and
$$\widehat{\mathrm{Var}}(\hat\theta) = I(\hat\theta)^{-1}$$
The Fisher information or expected information is
$$i(\theta) = E_\theta\left[I(\theta)\right]$$
Assume θ₀ denotes the true value of θ
Properties of Maximum Likelihood
The sampling distribution of the maximum likelihood estimator is a normal distribution
$$\hat\theta \rightarrow N\!\left(\theta_0,\, i(\theta_0)^{-1}\right)$$
The samples are independently obtained from g(y, θ₀)
This suggests that the sampling distribution of θ̂ may be approximated by $N\!\left(\hat\theta,\, I(\hat\theta)^{-1}\right)$
The corresponding estimates for the standard errors of the θ̂_j are obtained from $\sqrt{\left[I(\hat\theta)^{-1}\right]_{jj}}$
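Continuing the normal example for concreteness (a standard worked case, added here), the Fisher information is
$$i(\theta) = \begin{pmatrix} N/\sigma^2 & 0 \\ 0 & N/(2\sigma^4) \end{pmatrix},$$
so the resulting standard error estimates are $\mathrm{se}(\hat\mu) \approx \hat\sigma/\sqrt{N}$ and $\mathrm{se}(\hat\sigma^2) \approx \hat\sigma^2\sqrt{2/N}$.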
Local likelihood
Any parametric model can be made local if the fitting method accommodates observation weights
Local likelihood allows a relaxation from a globally parametric model to one that is local:
$$l(z, \theta(z_0)) = \sum_{i=1}^N K_h(z_0, z_i)\, l(z_i, \theta(z_0))$$
For example, taking the log-likelihood component $l(z_i, \theta) = -(y_i - x_i^\top\theta)^2$ (a Gaussian log-likelihood up to constants, i.e. local least squares) fits a linear varying coefficient model θ(z) by maximizing the local likelihood
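A minimal sketch of this idea in code (added for illustration; the Gaussian kernel, bandwidth, simulated data and function names are all hypothetical choices): at a target point z0 each observation is weighted by K_h(z0, z_i), and a weighted least-squares fit plays the role of the local likelihood.

```python
import numpy as np

def gaussian_kernel(z0, z, h):
    """Kernel weights K_h(z0, z_i)."""
    return np.exp(-0.5 * ((z - z0) / h) ** 2)

def local_linear_fit(z0, z, y, h):
    """Weighted least squares at z0: maximizes a local (Gaussian) likelihood."""
    w = gaussian_kernel(z0, z, h)
    X = np.column_stack([np.ones_like(z), z - z0])   # local intercept and slope
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                                    # fitted value at z0

rng = np.random.default_rng(1)
z = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * z) + rng.normal(scale=0.2, size=100)
print(local_linear_fit(0.5, z, y, h=0.1))
```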
Likelihood and Kullback-Leibler Divergence
The maximum likelihood estimate is obtained by
$$\hat\theta = \arg\max_\theta\, l(y,\theta) = \arg\max_\theta\, \frac{1}{N}\sum_{i=1}^N \log g(y_i,\theta)$$
Using the empirical density
$$g_N(y) = \frac{1}{N}\sum_{i=1}^N \delta(y - y_i)$$
which puts a mass 1/N at each y_i, we have
$$\frac{1}{N}\sum_{i=1}^N \log g(y_i,\theta) = \int \log g(y,\theta)\, g_N(y)\, dy$$
Likelihood and Kullback-Leibler Divergence
We have
$$\int \log g(y,\theta)\, g_N(y)\, dy = \int \log g(y,\theta)\, dG_N(y)$$
The maximization can be replaced by
$$\hat\theta = \arg\min_\theta \left[\int \log g(y,\theta_0)\, dG_N(y) - \int \log g(y,\theta)\, dG_N(y)\right]$$
Likelihood and Kullback-Leibler Divergence
Using the law of large numbers, as N → ∞,
$$\hat\theta = \arg\min_\theta \left[\int \log g(y,\theta_0)\, dG(y,\theta_0) - \int \log g(y,\theta)\, dG(y,\theta_0)\right]$$
Likelihood and Kullback-Leibler Divergence
and we have
$$\hat\theta = \arg\min_\theta\, \mathrm{KL}\left[g(y,\theta_0),\, g(y,\theta)\right]$$
Therefore the ML estimate is also the one that minimizes the Kullback-Leibler divergence (KLD) between the family of parametrized distributions and the true distribution.
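For reference (added for completeness), the Kullback-Leibler divergence used here is
$$\mathrm{KL}\left[g(y,\theta_0),\, g(y,\theta)\right] = \int g(y,\theta_0)\,\log\frac{g(y,\theta_0)}{g(y,\theta)}\,dy = \int \log g(y,\theta_0)\, dG(y,\theta_0) - \int \log g(y,\theta)\, dG(y,\theta_0),$$
which is exactly the quantity minimized above; since the first term does not depend on θ, minimizing the KLD over θ is equivalent to maximizing $\int \log g(y,\theta)\, dG(y,\theta_0)$.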
Motivation
Simple mixture model for density estimation using maximum likelihood
A single Gaussian density would not be appropriate, because there are two regimes in the data
Motivation
We model y as a mixture of two normal densities, $y_1 \sim N(\mu_1, \sigma_1^2)$ and $y_2 \sim N(\mu_2, \sigma_2^2)$, with
$$y = (1-z)\,y_1 + z\,y_2$$
where z ∈ {0, 1} with p(z = 1) = π is the mixing coefficient
Motivation
The generative representation can be seen as
Generate z ∈ {0, 1} with p(z = 1) = π
Depending on the outcome, deliver y₁ or y₂
Let φ_θ(y) denote the normal density with parameters θ = (μ, σ²). Then the density of y is
$$p(y) = (1-\pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y) = \sum_{i=1}^2 m_i\,\phi_{\theta_i}(y)$$
where m₁ = 1 − π, m₂ = π and $\sum_{i=1}^2 m_i = 1$.
Motivation
Now suppose we are given a data set of size N and we want to fit this model using maximum likelihood to estimate
$$\theta = (\pi, \theta_1, \theta_2) = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$$
The log-likelihood is
$$l(y,\theta) = \sum_{i=1}^N \log\left[(1-\pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$$
Direct maximization of l(y,θ) is difficult; instead we make use of the latent variable z
Illustration
Suppose one of the mixture components, say φ_{θ₂}, has its mean μ₂ exactly equal to one of the observations, so that μ₂ = y_i
This data point will contribute a term to the likelihood function of the form
$$\frac{1}{\sqrt{2\pi}\,\sigma_2}$$
If we consider the limit σ₂ → 0, this term goes to infinity, and so does the likelihood function
Thus maximization of the log-likelihood function is not a well-posed problem, because such singularities will always be present whenever one Gaussian component is centred exactly on an observation
Motivation
It is preferable to work with the joint density p(y,z)
The marginal density of z is specified in terms of the mixing coefficient π, with p(z = 1) = π, so
$$p(z) = \pi^z (1-\pi)^{1-z}$$
Similarly, the conditional density p(y | z = 1) = φ_{θ₂}(y), which can also be written as
$$p(y \mid z) = \phi_{\theta_2}(y)^z\, \phi_{\theta_1}(y)^{1-z}$$
Motivation
The joint density is given by p(y | z) p(z), and the marginal of y is obtained by summing the joint density over all possible states of z to give
$$p(y) = \sum_z p(z)\, p(y \mid z) = \pi\,\phi_{\theta_2}(y) + (1-\pi)\,\phi_{\theta_1}(y)$$
Thus the marginal density of y is the Gaussian mixture
If we have several observations y = {y1, . . . , yN} and because we have represented the marginal distribution in the form $p(y) = \sum_z p(y,z)$, it follows that for every observed data point y_i there is a corresponding latent z_i
Therefore there is an equivalent formulation of the Gaussian mixture involving an explicit latent variable.
Justification for the EM
We are now able to work with the joint density p(y,z) instead of the marginal p(y)
This leads to the introduction of the expectation maximization (EM) algorithm
Another important quantity is the conditional density of z given y.
We use γ(z) to denote p(z = 1 | y)
Justification for the EM
The value of this conditional can be obtained using Bayes' theorem:
$$\gamma(z) = p(z=1 \mid y) = \frac{p(z=1)\, p(y \mid z=1)}{p(y)} = \frac{\pi\,\phi_{\theta_2}(y)}{\pi\,\phi_{\theta_2}(y) + (1-\pi)\,\phi_{\theta_1}(y)}$$
π is the probability of z = 1 while γ(z) is the corresponding probability once we have observed y
γ(z) can be seen as the responsibility that φθ2 takes for explaining the observation y
Justification for the EM
For the two-component mixture, start with initial estimates for the parameters (part 1)
Expectation step:
$$\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{\hat\pi\,\phi_{\hat\theta_2}(y_i) + (1-\hat\pi)\,\phi_{\hat\theta_1}(y_i)}, \qquad i = 1,\dots,N$$
Maximization step:
$$\hat\mu_1 = \frac{\sum_{i=1}^N (1-\hat\gamma_i)\, y_i}{\sum_{i=1}^N (1-\hat\gamma_i)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^N (1-\hat\gamma_i)\,(y_i - \hat\mu_1)^2}{\sum_{i=1}^N (1-\hat\gamma_i)}$$
$$\hat\mu_2 = \frac{\sum_{i=1}^N \hat\gamma_i\, y_i}{\sum_{i=1}^N \hat\gamma_i}, \qquad \hat\sigma_2^2 = \frac{\sum_{i=1}^N \hat\gamma_i\,(y_i - \hat\mu_2)^2}{\sum_{i=1}^N \hat\gamma_i}$$
Justification for the EM
For the two-component mixture, start with initial estimates for the parameters (part 2)
Maximization step:
$$\hat\pi = \frac{1}{N}\sum_{i=1}^N \hat\gamma_i$$
Iterate these two steps until convergence
The underlying complete-data log-likelihood is
$$l(y,z,\theta) = \sum_{i=1}^N \left[(1-z_i)\log\phi_{\theta_1}(y_i) + z_i\log\phi_{\theta_2}(y_i)\right] + \sum_{i=1}^N \left[(1-z_i)\log(1-\pi) + z_i\log\pi\right]$$
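A minimal sketch of these E- and M-steps in code (added for illustration; the simulated data, starting values and iteration count are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=100):
    """EM for the two-component Gaussian mixture described above."""
    # crude starting values
    pi = 0.5
    mu1, mu2 = np.min(y), np.max(y)
    s1 = s2 = np.var(y)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = P(z_i = 1 | y_i, current parameters)
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
        gamma = p2 / (p1 + p2)
        # M-step: weighted means, variances and mixing proportion
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = np.mean(gamma)
    return pi, mu1, s1, mu2, s2

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 0.7, 100)])
print(em_two_gaussians(y))
```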
When to use the EM
The EM algorithm is very useful when:
We have missing values due to the observation process (including unknown cluster memberships).
We introduce hidden (latent) variables to simplify the problem.
EM – how does it work?
We observed Y (observed data) with the pdf g(y|θ).
Y can be either a number, a vector, a matrix, or of a more general form.
We assume that some hidden (latent) variable Z exists.
Let (Y, Z) be the complete data, having the pdf f(y, z|θ).
That is, we assume/specify the joint pdf f(y, z|θ) of (Y, Z).
Reminder
The objective is to maximize ln g(y|θ) w.r.t. θ in order to find the MLE of θ
If it is easy to maximize ln g(y|θ) directly, then there is no need to use the EM.
So suppose maximizing ln g(y|θ) is difficult, but maximizing ln f(y, z|θ) would be relatively easy if (Y, Z) were completely observed.
EM – how does it work?
Now let’s define a new likelihood function L(θ|y,z) = f(y,z|θ)
We call it the complete-data likelihood
What is constant and what is random in this function?
y – the set of observed data, known and fixed
θ – parameter(s) of the DGP, fixed but unknown
z – latent variables, unknown random variables
Therefore, L(θ|y, z) = h_{θ,y}(z), a function of z indexed by θ and y
We need a tool to solve the optimization of the complete-data likelihood.
EM – how does it work?
Since z is unobserved, we work with the expected value of the complete-data log-likelihood.
So, what is the expected value of the complete-data likelihood?
$$Q(\theta, \theta_{i-1}) = E\left[\ln f(y,z|\theta) \mid Y, \theta_{i-1}\right] \qquad (1)$$
Recall that a conditional expectation is computed as
$$E\left[h(Y) \mid X = x\right] = \int_y h(y)\, f(y|x)\, dy$$
so that
$$E\left[\ln f(y,z|\theta) \mid Y, \theta_{i-1}\right] = \int_z \ln f(y,z|\theta)\, k(z|y,\theta_{i-1})\, dz \qquad (2)$$
Note, k(z|y,θ_{i-1}) is the conditional distribution of the unobserved data. It depends on the current value θ_{i-1} and on the observed data y.
EM – how does it work?
For a given value of θ_{i-1} and the observed dataset y we can evaluate Q(θ, θ_{i-1}).
This is the 1st step of the EM algorithm – the E-step.
But Q(θ, θ_{i-1}) will remain a function of θ!
Now we can maximize it with respect to θ:
$$\theta_i = \arg\max_\theta Q(\theta, \theta_{i-1})$$
and update θ_{i-1} by θ_i.
This is the 2nd step of the EM algorithm – the M-step.
EM – how does it work?
In the EM algorithm the two steps are repeated many times.
Each iteration is guaranteed to increase log L. More precisely, the EM algorithm never decreases the observed-data log-likelihood.
Why?
Deriving the EM
EM algorithm:
Start from an appropriate initial value θ(0).
Given the current iterate θ(r), r = 0, 1, 2, …,
E-step: compute $Q(\theta, \theta^{(r)}) = E\left[\ln f(y,z|\theta) \mid y, \theta^{(r)}\right]$.
M-step: maximize Q(θ, θ(r)) as a function of θ to obtain θ(r+1).
The iteration continues until ||θ(r+1) − θ(r)|| or |Q(θ(r+1), θ(r)) − Q(θ(r), θ(r))| is smaller than a prescribed ε > 0 (e.g. ε = 10⁻⁶).
Remark: If the M-step is replaced with
M'-step: find θ(r+1) so that Q(θ(r+1), θ(r)) > Q(θ(r), θ(r)),
the resulting algorithm is called the GEM (generalized EM).
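A generic skeleton of this E-step/M-step iteration in code (added for illustration; the user-supplied functions `e_step` and `m_step` and the tolerance-based stopping rule are hypothetical placeholders, only the control flow is shown):

```python
import numpy as np

def em(theta0, e_step, m_step, eps=1e-6, max_iter=500):
    """Generic EM loop: e_step(theta) returns whatever summarises Q(., theta)
    (e.g. responsibilities); m_step(...) returns the theta maximizing that Q."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expected_stats = e_step(theta)                         # E-step
        theta_new = np.asarray(m_step(expected_stats), dtype=float)  # M-step
        if np.max(np.abs(theta_new - theta)) < eps:            # convergence check
            return theta_new
        theta = theta_new
    return theta
```

The problem-specific work (computing the conditional distribution k(z|y,θ) and maximizing Q) lives entirely in `e_step` and `m_step`, as in the two-component mixture sketch above.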
Properties
Note that the conditional pdf of Z given Y = y, k(z|y, θ), can be written as
$$k(z|y,\theta) = \frac{f(y,z|\theta)}{g(y|\theta)}$$
Then we have
$$\ln f(y,z|\theta) = \ln g(y|\theta) + \ln k(z|y,\theta). \qquad (1)$$
Namely, the complete-data log-likelihood equals the sum of the observed-data log-likelihood and the conditional log-likelihood.
Properties
Taking expectations on both sides of (1) w.r.t. the conditional pdf of Z given Y = y and some value θ′ of θ, we have
$$Q(\theta, \theta') = \ln g(y|\theta) + H(\theta, \theta'), \qquad (2)$$
where
$$Q(\theta,\theta') = E_Z\left[\ln f(y,Z|\theta) \mid y, \theta'\right] = \int k(z|y,\theta')\,\ln f(y,z|\theta)\, dz, \quad \text{and}$$
$$H(\theta,\theta') = E_Z\left[\ln k(Z|y,\theta) \mid y, \theta'\right] = \int k(z|y,\theta')\,\ln k(z|y,\theta)\, dz.$$
Given a value θ′, if we can find θ′′ such that $Q(\theta'', \theta') = \max_\theta Q(\theta, \theta')$, we know
$$\ln g(y|\theta'') \geq \ln g(y|\theta').$$
Deriving the EM
Theorem (Jensen’s inequality)
E[g(X)] ≤ g(E[X]) if g(·) is concave.
By Jensen's inequality,
$$E\left[\ln\frac{k(Z|y,\theta)}{k(Z|y,\theta')} \,\Big|\, y, \theta'\right] \leq \ln E\left[\frac{k(Z|y,\theta)}{k(Z|y,\theta')} \,\Big|\, y, \theta'\right] = \ln \int \frac{k(z|y,\theta)}{k(z|y,\theta')}\, k(z|y,\theta')\, dz = \ln \int k(z|y,\theta)\, dz = \ln 1 = 0.$$
$$\Rightarrow\; E\left[\ln k(Z|y,\theta) \mid y, \theta'\right] - E\left[\ln k(Z|y,\theta') \mid y, \theta'\right] \leq 0$$
This implies that
$$H(\theta, \theta') \leq H(\theta', \theta'). \qquad (3)$$
Deriving the EM
Given a value θ′, if we can find θ′′ such that $Q(\theta'', \theta') = \max_\theta Q(\theta, \theta')$, then by (2) and (3) we know
$$\ln g(y|\theta'') \geq \ln g(y|\theta').$$
(Note that Q(θ′′, θ′) ≥ Q(θ′, θ′) and H(θ′′, θ′) ≤ H(θ′, θ′).)
This suggests the following algorithm for calculating the MLE of θ which maximizes the observed-data log-likelihood ln g(y|θ).
Deriving the EM
Theorem (Dempster, Laird and Rubin, 1977)
Every EM or GEM algorithm increases the observed-data log-likelihood ln g(y|θ) at each iteration, i.e.
ln g(y|θ(r+1)) ≥ ln g(y|θ(r)),
with the equality holding iff Q(θ(r+1),θ(r)) = Q(θ(r),θ(r)).
Proof: ln g(y|θ(r+1)) = Q(θ(r+1), θ(r)) − H(θ(r+1), θ(r)) and ln g(y|θ(r)) = Q(θ(r), θ(r)) − H(θ(r), θ(r)).
Hence ln g(y|θ(r+1)) ≥ ln g(y|θ(r)) because
Q(θ(r+1), θ(r)) ≥ Q(θ(r), θ(r)) and H(θ(r+1), θ(r)) ≤ H(θ(r), θ(r)).
Deriving the EM
Theorem (Wu (1983) and Little and Rubin (1987))
Suppose a sequence of EM iterates θ(r) satisfies
1. $\frac{\partial Q(\theta, \theta^{(r)})}{\partial\theta}\Big|_{\theta = \theta^{(r+1)}} = 0$;
2. θ(r) converges to some value θ₀ as r → ∞, and k(z|y,θ) is "sufficiently smooth".
Then $\frac{\partial \ln g(y|\theta)}{\partial\theta}\Big|_{\theta = \theta_0} = 0$.
This theorem implies that, if θ(r) converges, it will converge to a stationary point of ln g(y|θ), which is the MLE if there is only one such stationary point. If there are multiple such stationary points, the EM may not converge to the global maximum.
Introduction
Newton-Raphson is a method for finding the roots (or zeroes) of a real-valued function.
Newton-Raphson is a more general optimisation algorithm which can also be used to find the MLE.
NR can be faster than EM but is often less stable numerically and less tractable analytically.
Intuition and graphical explanation
Graphical explanation in 2D
The algorithm starts from some guess for the root, x0. Then calculate the tangent line at this point.
Compute the x-intercept of this tangent line, x1
Calculate a new tangent line and the new x-intercept. Repeat this until convergence.
Intuition and graphical explanation
Calculating the tangent line at x₀:
$$y = a + f'(x_0)\, x$$
To find the intercept a, let's substitute the coordinates of our guess point (x₀, f(x₀)):
$$f(x_0) = a + f'(x_0)\, x_0 \;\Rightarrow\; a = f(x_0) - f'(x_0)\, x_0$$
So the tangent line is
$$y = f(x_0) - f'(x_0)\, x_0 + f'(x_0)\, x$$
Intuition and graphical explanation
To find the x-intercept of the tangent line we need to solve
$$0 = f(x_0) - f'(x_0)\, x_0 + f'(x_0)\, x, \qquad \text{i.e.} \qquad 0 = f(x_0) + f'(x_0)(x - x_0)$$
The x that solves this equation is
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$$
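A minimal sketch of this one-dimensional iteration in code (added for illustration; the example function, starting point and tolerance are hypothetical):

```python
def newton_raphson(f, fprime, x0, tol=1e-6, max_iter=100):
    """Find a root of f via x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# example: root of f(x) = x**2 - 2 (i.e. sqrt(2)), starting from x0 = 1
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))
```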
Notations and definitions
Calculation in higher dimensions
$x_n = (x_1, \dots, x_n)^\top$: a random sample of n observations from the pdf f(x|θ).
$\theta = (\theta_1, \dots, \theta_q)^\top$: a q × 1 parameter vector.
Log-likelihood function: $l(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i|\theta)$.
Score function: $U(\theta) = \frac{\partial \ln L(\theta)}{\partial\theta} = \sum_{i=1}^n \frac{\partial \ln f(x_i|\theta)}{\partial\theta}$, a q × 1 vector.
Hessian function: $H(\theta) = \frac{\partial U(\theta)}{\partial\theta^\top} = \frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta^\top} = \sum_{i=1}^n \frac{\partial^2 \ln f(x_i|\theta)}{\partial\theta\,\partial\theta^\top}$, a q × q matrix.
Observed information function: J(θ) = −H(θ).
Newton-Raphson algorithm
The objective of the N-R algorithm is to solve U(θ) = 0. Start with an appropriate initial value θ(0).
Compute θ(k+1), k = 0, 1, 2, …, successively with
$$\theta^{(k+1)} = \theta^{(k)} - H(\theta^{(k)})^{-1}\, U(\theta^{(k)}) = \theta^{(k)} + J(\theta^{(k)})^{-1}\, U(\theta^{(k)}).$$
Continue until {θ(k)} converges, i.e. until |θ(k+1) − θ(k)| or |U(θ(k+1))| or |l(θ(k+1)) − l(θ(k))| is smaller than a small tolerance number (e.g. 10⁻⁶) computationally.
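As a small worked example (added here; the Poisson model is chosen only for illustration, not taken from the slides), for y₁, …, y_n i.i.d. Poisson(θ) the score is U(θ) = Σᵢ yᵢ/θ − n and the Hessian is H(θ) = −Σᵢ yᵢ/θ², so the N-R update can be coded directly:

```python
import numpy as np

def nr_poisson_mle(y, theta0=1.0, tol=1e-6, max_iter=100):
    """Newton-Raphson for the Poisson rate: solves U(theta) = 0."""
    theta = theta0
    for _ in range(max_iter):
        U = np.sum(y) / theta - len(y)          # score
        H = -np.sum(y) / theta**2               # Hessian
        theta_new = theta - U / H               # theta - H(theta)^{-1} U(theta)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

y = np.array([2, 0, 3, 1, 4, 2, 1])
print(nr_poisson_mle(y), y.mean())              # both give the MLE, the sample mean
```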
Newton-Raphson algorithm
It is called the Fisher-scoring algorithm if we instead compute
$$\theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\, U(\theta^{(k)}),$$
where the Fisher information function is I(θ) = E[J(θ)] = −E[H(θ)].
Fisher scoring may be analytically more involved but is statistically more stable.
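For the same hypothetical Poisson example introduced above, the expected information is I(θ) = n/θ (since E[yᵢ] = θ), so Fisher scoring replaces the observed quantity Σᵢ yᵢ/θ² by n/θ in the update:

```python
import numpy as np

def fisher_scoring_poisson(y, theta0=1.0, tol=1e-6, max_iter=100):
    """Fisher scoring for the Poisson rate: theta <- theta + I(theta)^{-1} U(theta)."""
    theta = theta0
    for _ in range(max_iter):
        U = np.sum(y) / theta - len(y)          # score
        I = len(y) / theta                      # expected (Fisher) information
        theta_new = theta + U / I
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

y = np.array([2, 0, 3, 1, 4, 2, 1])
print(fisher_scoring_poisson(y))                # equals the sample mean
```

For this particular model the Fisher-scoring update lands on the MLE ȳ in a single step, since θ + (Σᵢ yᵢ/θ − n)·θ/n = ȳ.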
Explaining N-R algorithm
Suppose that, via a Taylor expansion around the current iterate θ(k), ln L(θ) can be well approximated near the MLE θ̂ by a quadratic function
$$F(\theta) = \ln L(\theta^{(k)}) + \left(\frac{\partial \ln L(\theta)}{\partial\theta}\Big|_{\theta=\theta^{(k)}}\right)^{\!\top}(\theta - \theta^{(k)}) + \frac{1}{2}(\theta - \theta^{(k)})^\top \frac{\partial^2 \ln L(\theta)}{\partial\theta\,\partial\theta^\top}\Big|_{\theta=\theta^{(k)}}(\theta - \theta^{(k)})$$
$$= \ln L(\theta^{(k)}) + U(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{1}{2}(\theta - \theta^{(k)})^\top H(\theta^{(k)})(\theta - \theta^{(k)})$$
Solving $\frac{\partial F(\theta)}{\partial\theta} = 0 \;\Rightarrow\; U(\theta^{(k)}) + H(\theta^{(k)})(\theta - \theta^{(k)}) = 0$, we have
$$\theta^{(k+1)} = \theta^{(k)} - H(\theta^{(k)})^{-1}\, U(\theta^{(k)}).$$
It implies θ̂ ≈ θ(k+1) = argmax_θ F(θ).