STAT 513/413: Lecture 22 EM Algorithm
(A big splash in the maximum likelihood world)
Not everything has to be done numerically
We may sometimes solve the problem in closed form
And sometimes, we may solve by numerics only some part of it
EM Algorithm: a general scheme for reducing (some) more difficult problems into a (sequence of) easier ones
Not a scheme for those just following recipes: needs invention
EM-algorithm, as formulated below, produces a sequence of iterates, with the likelihood increasing along the sequence (the negative log-likelihood decreasing)
(It does not produce the global maximum of the likelihood, the textbook has it wrong)
1
General structure
Suppose that data y are generated by some random mechanism Y (a random variable or vector: possibly many quantities are collapsed here into one letter Y or y)
The probability density of Y is f(y; θ); thus, when y is the observed value for Y, we may estimate θ via maximum likelihood
−log f(y; θ) → min! (over θ)
Suppose we can envision some additional data z, generated by Z – so that we would have a density g(y, z; θ)
Maximum likelihood estimation of θ would then result in the problem

−log g(y, z; θ) → min! (over θ)    (M-step)
If this is an easier problem, then we can use it for solving the original one – provided we know how to calculate
E(− log g(Y, Z; θ)|Y = y; θ) (E-step)
We start with some initial iterate θ1; we do the E-step, obtaining an objective like the one in the M-step; we solve it and obtain θ2; then we repeat the E-step and M-step to obtain θ3; etc.
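A minimal generic sketch of this scheme in Python (the callables e_step and m_step are hypothetical placeholders to be supplied for each concrete model; the stopping rule is just one possible choice):

import numpy as np

def em(theta1, e_step, m_step, max_iter=100, tol=1e-8):
    # Generic EM loop: theta1 is the initial iterate.
    # e_step(theta) returns whatever summary of
    #   E(-log g(Y, Z; theta_new) | Y = y; theta)
    # the M-step needs; m_step(...) returns the minimizing theta_new.
    theta = np.asarray(theta1, dtype=float)
    for _ in range(max_iter):
        expected = e_step(theta)                                # E-step
        theta_new = np.asarray(m_step(expected), dtype=float)   # M-step
        converged = np.max(np.abs(theta_new - theta)) < tol     # iterates settled?
        theta = theta_new
        if converged:
            break
    return theta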
2
Example: genetics
We have three letters, A, B, O – they are called alleles
Each individual gets two of these at random: their configuration, regardless of the order, determines the genotype
Thus, the genotypes are: AA, AB, AO, BB, BO, OO
Hardy-Weinberg law: the probability of genotype is derived from independent sampling of two alleles, with probabilities pA, pB, pO – as A, B, O are the only letters possible, we have pA + pB + pO = 1
Thus, the genotypes have respectively probabilities pA², 2pApB, 2pApO, pB², 2pBpO, pO²
However, we do not observe genotypes but only phenotypes
A for AA and AO, B for BB and BO, AB for AB, and O for OO
with probabilities, respectively
pA² + 2pApO, pB² + 2pBpO, 2pApB, pO²
And we would like to estimate pA, pB and pO (and then we have also the estimates of probabilities of genotypes and phenotypes)
3
The easy problem
It is a “missing data problem” – the original motivation for the EM algorithm, where Z can be interpreted as “missing data” (although in some implementations purely imaginary)
If we observed genotypes, we would observe nAA, nAB, nAO, nBB, nBO, nOO of each genotype
out of a total of n observations; the likelihood is

pAA^nAA · pAB^nAB · pAO^nAO · pBB^nBB · pBO^nBO · pOO^nOO
= (pA²)^nAA (2pApB)^nAB (2pApO)^nAO (pB²)^nBB (2pBpO)^nBO (pO²)^nOO
= pA^(2nAA+nAB+nAO) · pB^(2nBB+nAB+nBO) · pO^(nAO+nBO+2nOO) · 2^(nAB+nAO+nBO)
so we have to minimize, under the condition pA + pB + pO = 1,

−(2nAA + nAB + nAO) log pA − (2nBB + nAB + nBO) log pB − (nAO + nBO + 2nOO) log pO

This yields

p̂A = (2nAA + nAB + nAO)/(2n),  p̂B = (2nBB + nAB + nBO)/(2n),  p̂O = (nAO + nBO + 2nOO)/(2n)

– the M-step is easy
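A minimal sketch of this closed-form M-step in Python (the function and argument names are mine, not from the lecture):

def mstep_genotype_counts(nAA, nAB, nAO, nBB, nBO, nOO):
    # n individuals, hence 2n alleles in total
    n = nAA + nAB + nAO + nBB + nBO + nOO
    pA = (2 * nAA + nAB + nAO) / (2 * n)   # relative frequency of allele A
    pB = (2 * nBB + nAB + nBO) / (2 * n)   # relative frequency of allele B
    pO = (nAO + nBO + 2 * nOO) / (2 * n)   # relative frequency of allele O
    return pA, pB, pO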
4
The harder problem
If we observe only phenotypes, then we observe nA, nB, nAB, nO of each phenotype
out of a total of n observations; the likelihood is

(pA² + 2pApO)^nA (pB² + 2pBpO)^nB (2pApB)^nAB (pO²)^nO

so we have to minimize, under the condition pA + pB + pO = 1,

−nA log(pA² + 2pApO) − nB log(pB² + 2pBpO) − nAB log pA − nAB log pB − 2nO log pO
We could perhaps use the Newton method, but for that we would have to evaluate 3 first and 6 second partial derivatives (or better just 2 and 3, expressing pO as 1 − pA − pB)
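For comparison, the objective of this harder problem is easy to write down as a Python sketch (names are my choosing; it uses the parametrization by (pA, pB) with pO = 1 − pA − pB, as suggested above, so it could be handed to a general-purpose optimizer such as scipy.optimize.minimize):

import numpy as np

def neg_loglik_phenotypes(p, nA, nB, nAB, nO):
    # -log-likelihood of the phenotype counts; p = (pA, pB), pO = 1 - pA - pB
    pA, pB = p
    pO = 1.0 - pA - pB
    return -(nA * np.log(pA**2 + 2 * pA * pO)
             + nB * np.log(pB**2 + 2 * pB * pO)
             + nAB * np.log(2 * pA * pB)
             + nO * np.log(pO**2))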
5
The E-step
The E-step is determined through probabilistic calculations: evaluating conditional expectations; to this end, we need a bit more notation
The random variables corresponding to the numbers of phenotypes in the sample,
NA, NB, NAB, NO, are observed as nA, nB, nAB, nO
The random variables corresponding to the numbers of genotypes
NAA, NAB, NAO, NBB, NBO, NOO
are not observed, but they appear in the minimized -loglikelihood
−(2NAA + NAB + NAO) log pA − (2NBB + NAB + NBO) log pB − (NAO + NBO + 2NOO) log pO
and we have to calculate its conditional expectation for given (fixed, nonrandom, unknown) pA, pB, pO, conditional on
NA = nA, NB = nB, NAB = nAB, NO = nO
The conditional expectation is a linear combination of the conditional expectations of the genotype random variables…
6
… so we can treat them one by one
The last two phenotypes are easy: as the phenotype count NAB coincides with the genotype count NAB, we have
E(NAB|all conditions) = E(NAB|NAB = nAB) = nAB
Analogously,
E(NOO|all conditions) = E(NOO|NO = nO) = nO, as the genotype count NOO coincides with the phenotype count NO
The other two phenotypes take a bit of calculation:

E(NAA|all conditions) = E(NAA|NA = nA)
= nA P[Geno = AA | Pheno = A]

why? because every observed phenotype A is genotype AA with probability equal to that seen above, and for the expected value you sum nA of those; now use the definition of conditional probability

= nA P[Geno = AA & Pheno = A] / P[Pheno = A]
= nA P[Geno = AA] / (P[Geno = AA] + P[Geno = AO])
= nA pA² / (pA² + 2pApO)
7
And then
The other expected values are calculated analogously; eventually (one can call it the use of the Bayes theorem)

E(NAA|all conditions) = nA pA² / (pA² + 2pApO)
E(NAO|all conditions) = nA 2pApO / (pA² + 2pApO)
E(NBB|all conditions) = nB pB² / (pB² + 2pBpO)
E(NBO|all conditions) = nB 2pBpO / (pB² + 2pBpO)
E(NAB|all conditions) = nAB
E(NOO|all conditions) = nO

Note that, as expected,
E(NAA) + E(NAO) = nA and E(NBB) + E(NBO) = nB
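In code, the E-step thus amounts to splitting nA and nB according to these conditional probabilities (a sketch; the function name is mine):

def estep_expected_genotype_counts(pA, pB, pO, nA, nB, nAB, nO):
    # expected genotype counts given the phenotype counts and current pA, pB, pO
    dA = pA**2 + 2 * pA * pO          # P[Pheno = A]
    dB = pB**2 + 2 * pB * pO          # P[Pheno = B]
    eAA = nA * pA**2 / dA             # E(NAA | all conditions)
    eAO = nA * 2 * pA * pO / dA       # E(NAO | all conditions)
    eBB = nB * pB**2 / dB             # E(NBB | all conditions)
    eBO = nB * 2 * pB * pO / dB       # E(NBO | all conditions)
    return eAA, nAB, eAO, eBB, eBO, nO  # E(NAB) = nAB, E(NOO) = nO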
8
So finally
The data are nA, nB, nAB, nO, and also n = nA + nB + nAB + nO
The EM algorithm:
takes some pA, pB, pO, and calculates their new iterates,
using them first to calculate the conditional expectation of the -loglikelihood, by plugging in the conditional expectations calculated above,
thus obtaining an easy maximum likelihood formulation, which can be solved in closed form to give the new iterates
p̂A = (1/(2n)) [ 2nA pA²/(pA² + 2pApO) + nAB + nA 2pApO/(pA² + 2pApO) ] = (1/(2n)) [ nAB + 2nA (pA² + pApO)/(pA² + 2pApO) ]

p̂B = (1/(2n)) [ 2nB pB²/(pB² + 2pBpO) + nAB + nB 2pBpO/(pB² + 2pBpO) ] = (1/(2n)) [ nAB + 2nB (pB² + pBpO)/(pB² + 2pBpO) ]

p̂O = (1/(2n)) [ nA 2pApO/(pA² + 2pApO) + nB 2pBpO/(pB² + 2pBpO) + 2nO ]
And the process repeats with these new pA, pB, pO… …until convergence
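Putting the E-step and the M-step together, a self-contained sketch in Python (the function name, starting values, and stopping rule are my choices, not prescribed by the lecture):

def em_abo(nA, nB, nAB, nO, pA=0.3, pB=0.3, max_iter=200, tol=1e-10):
    # EM for the allele probabilities pA, pB, pO from phenotype counts
    n = nA + nB + nAB + nO
    pO = 1.0 - pA - pB
    for _ in range(max_iter):
        # E-step: expected genotype counts under the current pA, pB, pO
        dA = pA**2 + 2 * pA * pO
        dB = pB**2 + 2 * pB * pO
        eAA, eAO = nA * pA**2 / dA, nA * 2 * pA * pO / dA
        eBB, eBO = nB * pB**2 / dB, nB * 2 * pB * pO / dB
        # M-step: closed-form allele frequencies from the expected counts
        pA_new = (2 * eAA + nAB + eAO) / (2 * n)
        pB_new = (2 * eBB + nAB + eBO) / (2 * n)
        pO_new = (eAO + eBO + 2 * nO) / (2 * n)
        done = max(abs(pA_new - pA), abs(pB_new - pB), abs(pO_new - pO)) < tol
        pA, pB, pO = pA_new, pB_new, pO_new
        if done:
            break
    return pA, pB, pO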
9
One more example: fitting the Cauchy distribution
The system of equations to solve is

Σ_{i=1}^{n} (yi − μ) / (σ² + (yi − μ)²) = 0

and

Σ_{i=1}^{n} (yi − μ)² / (σ² + (yi − μ)²) = n/2
We concentrate on the first one – as the second one may be not that difficult: if we interpret it as an equation for σ > 0, then the left-hand side is monotone in σ
[Figure: plot of 1/(x² + 1) against x, for x from 0 to 3]
For the purpose of finding μ out of the first equation, we may consider in what follows σ known and fixed (we might even consider it equal to 1 for notational simplicity, but that could mislead some)
10
The easy problem?
The similar problem we know would be easy is the same one when the distribution of the yi’s is not Cauchy but normal. In such a case, the first equation becomes

Σ_{i=1}^{n} (yi − μ)/σ = 0

with an easy solution μ̂ = (1/n) Σ_{i=1}^{n} yi
How can the actual problem correspond to this one? Well, every statistician knows that Cauchy is in fact the t distribution with 1 degree of freedom. And the t distribution is obtained as
U / √(Z/k)   where U is N(0,1), Z is χ²(k) and they are independent

so that Cauchy is in particular

U / √Z   where U is N(0,1), Z is χ²(1) and they are independent
So, let us try to make some use of it
11
Hard vs. easy
The actual problem assumes that Y1,Y2,…,Yn are independent
and (Yi − μ)/σ is Cauchy ≡ t with 1 df
That is, (Yi − μ)/σ has the same distribution as Ui/√Zi
whenever Ui is N(0,1), Zi is χ²(1) and they are independent
and thus √Zi (Yi − μ)/σ has the same distribution as Ui
so then, in the actual problem, if Zi is given (known: write zi), then √zi (Yi − μ)/σ is N(0,1)
This is not exactly the easy problem as before – but is still easy (… = fill in the details, if necessary)
the maximum likelihood estimate of μ (for known σ) is

μ̂ = Σ_{i=1}^{n} zi yi / Σ_{i=1}^{n} zi,   the weighted mean (with weights zi)
So, the M-step will again be easy…
12
… and now the remainder of it
We need now to write the full joint likelihood of Yi and Zi – which is the product of the conditional densities of all Yi (each for given Zi) and then the product of all densities of Zi. As the parameters μ and σ appear only in the first part, the conditional densities, we end up, after taking logs (and minus), with the (relevant) negative log-likelihood
Σ_{i=1}^{n} [ Zi (yi − μ)²/(2σ²) + log(σ/√Zi) ]

We can drop the second term if σ is known, so we end up with

Σ_{i=1}^{n} Zi (yi − μ)²/(2σ²)   (… = fill in the details, if necessary)
Now, we need to take the conditional expectation of it, given Yi (and μ and σ). Luckily again, it is linear in Zi, so it boils down to taking the conditional expectation of Zi itself
13
Densities, densities
For taking the conditional expectation of Zi, we eventually need its density: some more educated statisticians know – or can find out – that
the general χ²(k) density is, for x > 0,   (1/2)^(k/2) / Γ(k/2) · x^(k/2−1) e^(−x/2)

and it is a special case of the density of the Gamma distribution Γ(α, β), which is, for x > 0,   β^α / Γ(α) · x^(α−1) e^(−βx)   (isn’t it?)

At the beginning, it seems like we will need only χ²(1), that is, the special case with k = 1, which is actually Γ(1/2, 1/2)…
… if we are good at calculus – otherwise, some trickery we have seen in the Bayesian analysis can help us
14
The magic of ∝ again
Conditional on Zi = zi, the density of Yi is N(μ, σ²/zi)

that is, the density of Yi|Zi = zi is ∝ exp(−zi (yi − μ)²/(2σ²))

The joint density of Yi and Zi is the product of the latter and the density of Γ(1/2, 1/2)

that is, ∝ exp(−zi (yi − μ)²/(2σ²)) zi^(−1/2) exp(−zi/2) = exp(−(1/2)(1 + (yi − μ)²/σ²) zi) zi^(−1/2)

which says that the conditional distribution of Zi|Yi = yi is ∝ the same

that is, it is Γ(α, β) = Γ(1/2, (1/2)(1 + (yi − μ)²/σ²))

and then every statistician can find out (and some calculate, and some perhaps even remember) that its expected value is

α/β = 1/(1 + (yi − μ)²/σ²) = E(Zi|Yi = yi)
15
The finale: EM-algorithm recipe
So the combination of the E-step and the M-step is the weighted mean…
More precisely (σ is still known!): select μ1

Calculate weights   zi = 1 / (1 + (yi − μ1)²/σ²)

Calculate the weighted mean   μ2 = Σ_{i=1}^{n} zi yi / Σ_{i=1}^{n} zi
and repeat…
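A minimal sketch in Python (starting from the sample median and the stopping rule are my choices; note that the weights enter the weighted mean only through their ratios, so any constant multiple of them would give the same iteration):

import numpy as np

def em_cauchy_location(y, sigma, mu1=None, max_iter=500, tol=1e-10):
    # EM for the Cauchy location mu, with sigma known
    y = np.asarray(y, dtype=float)
    mu = np.median(y) if mu1 is None else float(mu1)
    for _ in range(max_iter):
        # E-step: weights zi = 1 / (1 + (yi - mu)^2 / sigma^2)
        z = 1.0 / (1.0 + (y - mu)**2 / sigma**2)
        # M-step: weighted mean
        mu_new = np.sum(z * y) / np.sum(z)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu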
16
Extension: regression
Why did people bother with this if it is so complicated? And moreover, a one-dimensional problem (in μ)??
Because it works exactly the same way in the regression setting: not only when yi = μ + σεi with εi Cauchy errors
but also when yi = xiᵀβ + σεi with εi still Cauchy errors (and σ still known)
The EM-algorithm goes in an analogous way: select β1

Calculate weights   zi = 1 / (1 + (yi − xiᵀβ1)²/σ²)

Calculate β2 as a weighted least squares estimate, solving

Σ_{i=1}^{n} zi (yi − xiᵀβ)² → min! (over β)

and repeat…
So it works via EM… but after all, it is nothing but…
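A sketch of the regression version in Python (each M-step is a weighted least squares fit; starting from the ordinary least squares estimate is my choice, not prescribed here):

import numpy as np

def em_cauchy_regression(X, y, sigma, beta1=None, max_iter=500, tol=1e-10):
    # EM for regression with Cauchy errors and known sigma
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = (np.linalg.lstsq(X, y, rcond=None)[0] if beta1 is None
            else np.asarray(beta1, dtype=float))
    for _ in range(max_iter):
        # E-step: weights from the current residuals
        r = y - X @ beta
        z = 1.0 / (1.0 + r**2 / sigma**2)
        # M-step: weighted least squares, solving (X' diag(z) X) beta = X' diag(z) y
        beta_new = np.linalg.solve(X.T @ (z[:, None] * X), X.T @ (z * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta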
17
Finally: mixtures beware
Good for their problem, but not (entirely) good for ours
18