ANLY-601
Advanced Pattern Recognition
Spring 2018
L12 – Mixture Density Models, EM Algorithm
Mixture Density Models
• Flexible models – able to fit lots of densities
• Fit parameters by maximum likelihood. The nonlinear equations require an iterative fitting procedure; the standard one is Expectation-Maximization (EM).
• “Soft” version of clustering.
The general form is
$$ p(x \mid \Theta) \;=\; \sum_{j=1}^{k} \alpha_j \, p(x \mid j) $$
where
• $p(x \mid j) = p(x \mid \theta_j)$ are component densities with parameter vectors $\theta_j$
• $\alpha_j \ge 0$, $\sum_{j=1}^{k} \alpha_j = 1$, and $\alpha_j$ is the prior probability for mixture component $j$
• $\Theta = (\alpha_1, \ldots, \alpha_k,\ \theta_1, \ldots, \theta_k)$ collects all the parameters
Generative Model
Mixture model form (as on the previous slide):
$$ p(x \mid \Theta) \;=\; \sum_{j=1}^{k} \alpha_j \, p(x \mid \theta_j) $$
with priors $\alpha_j$, component densities $p(x \mid \theta_j)$, and parameters $\Theta = (\alpha_1, \ldots, \alpha_k,\ \theta_1, \ldots, \theta_k)$.
Generating x is a two-fold sampling procedure:
1. Pick a component j with probability α_j
2. Generate a sample x from p(x | θ_j)
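As a concrete illustration, here is a minimal NumPy sketch of this two-stage sampling procedure for a Gaussian mixture; the parameter values (alphas, mus, covs) are made-up examples, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component Gaussian mixture in R^2 (illustrative values only)
alphas = np.array([0.3, 0.7])                        # prior probabilities, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means
covs = [np.eye(2), 0.5 * np.eye(2)]                  # component covariances

def sample_mixture(n_samples):
    """Two-stage sampling: pick component j with probability alpha_j, then x ~ N(mu_j, Sigma_j)."""
    samples = []
    for _ in range(n_samples):
        j = rng.choice(len(alphas), p=alphas)                      # step 1: pick a component
        samples.append(rng.multivariate_normal(mus[j], covs[j]))   # step 2: sample from it
    return np.array(samples)

X = sample_mixture(500)   # 500 draws from the mixture density p(x | Theta)
```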
Mixture Models
The most common example is a mixture of Gaussians.
There is a universal approximation theorem [1] for such mixtures: with enough components, a mixture of Gaussians fit by maximum likelihood can match any density on a compact subset of R^n arbitrarily closely.
[1] Jonathan Li and Andrew Barron. Mixture density estimation. In Solla, Leen, and Mueller (eds.), Advances in Neural Information Processing Systems 12, The MIT Press, 2000.
$$ p(x \mid \Theta) = \sum_{j=1}^{k} \alpha_j \, p(x \mid j) \qquad\text{with}\qquad p(x \mid j) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_j|^{1/2}} \exp\!\Bigl( -\tfrac{1}{2} (x - \mu_j)^{T} \Sigma_j^{-1} (x - \mu_j) \Bigr) $$
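For concreteness, evaluating this Gaussian mixture density is short in SciPy; this sketch reuses the hypothetical alphas, mus, covs lists from the sampling example above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, alphas, mus, covs):
    """p(x | Theta) = sum_j alpha_j * N(x; mu_j, Sigma_j)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, mus, covs))
```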
Gaussian Mixture Model
Flexible — can make lots of shapes!
Fitting Mixture Models
Suppose we have a data set
$$ D = \{\, x_a,\ a = 1, \ldots, N \,\}, \qquad \text{each } x_a \text{ a vector in } \mathbb{R}^n, $$
and we'd like to adjust the parameters
$$ \Theta = (\alpha_1, \ldots, \alpha_k,\ \theta_1, \ldots, \theta_k) $$
so as to maximize the data log-likelihood
$$ L(\Theta) = \ln P(D \mid \Theta) = \sum_{a=1}^{N} \ln \sum_{j=1}^{k} \alpha_j \, p(x_a \mid \theta_j). $$
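In code, the log-likelihood for a Gaussian mixture might be computed as below (a sketch with the same hypothetical parameter lists as before; a production implementation would work in the log domain, e.g. with scipy.special.logsumexp, for numerical stability).

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, alphas, mus, covs):
    """L(Theta) = sum_a ln sum_j alpha_j * p(x_a | theta_j) for an (N, n) data matrix X."""
    # densities[a, j] = p(x_a | theta_j)
    densities = np.column_stack(
        [multivariate_normal.pdf(X, mean=m, cov=c) for m, c in zip(mus, covs)])
    return float(np.sum(np.log(densities @ np.asarray(alphas))))
```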
Fitting Mixture Models
The data log-likelihood
$$ L(\Theta) = \ln P(D \mid \Theta) = \sum_{a=1}^{N} \ln \sum_{j=1}^{k} \alpha_j \, p(x_a \mid \theta_j) $$
cannot be maximized in one step — the maximization equations don't have a closed-form solution. Instead, we use an iterative approach — the EM algorithm.
For the moment, rewrite the log-likelihood as (suppressing the mixture form of p(x | Θ))
$$ L(\Theta) = \ln P(D \mid \Theta) = \sum_{a=1}^{N} \ln p(x_a \mid \Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} p(i_a, x_a \mid \Theta), $$
where i_a is the unknown index of the component responsible for generating x_a.
Fitting Mixture Models
Next, we write a lower bound for L. Introduce an average over any probability distribution Q(i_a) on the unknown indices i_a:
$$ L = \sum_{a=1}^{N} \ln p(x_a \mid \Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} p(i_a, x_a \mid \Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} Q(i_a)\, \frac{p(i_a, x_a \mid \Theta)}{Q(i_a)}. $$
Jensen's inequality gives
$$ L \;\ge\; \sum_{a=1}^{N} \sum_{i_a=1}^{k} Q(i_a)\, \ln \frac{p(i_a, x_a \mid \Theta)}{Q(i_a)} \;=\; \sum_{a=1}^{N} \sum_{i_a=1}^{k} \Bigl[ Q(i_a) \ln p(i_a, x_a \mid \Theta) - Q(i_a) \ln Q(i_a) \Bigr] \;\equiv\; \mathcal{F}(Q, \Theta). $$
The equality holds when Q(i_a) is the posterior distribution on the unknown indices,
$$ Q(i_a) = p(i_a \mid x_a, \Theta). $$
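To see why the bound is tight at the posterior, substitute Q(i_a) = p(i_a | x_a, Θ) and use p(i_a, x_a | Θ) = p(i_a | x_a, Θ) p(x_a | Θ):
$$ \sum_{i_a=1}^{k} p(i_a \mid x_a, \Theta)\, \ln \frac{p(i_a, x_a \mid \Theta)}{p(i_a \mid x_a, \Theta)} = \sum_{i_a=1}^{k} p(i_a \mid x_a, \Theta)\, \ln p(x_a \mid \Theta) = \ln p(x_a \mid \Theta), $$
so every term in the lower bound recovers ln p(x_a | Θ) and 𝓕(Q, Θ) = L(Θ).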
EM Algorithm
Iterative optimization algorithm: Expectation-Maximization (EM) maximizes 𝓕 (which in turn maximizes L). There may be multiple optima; EM only finds a local optimum.
Initialize the algorithm to some choice Θ(0) of the parameters. At the (n+1)-th iteration:
E-Step: With Θ fixed at Θ(n), estimate the index distribution as
$$ Q(i_a) \equiv h_{ia}(n+1) = p\bigl(i_a = i \mid x_a, \Theta(n)\bigr) = \frac{\alpha_i(n)\, p\bigl(x_a \mid \theta_i(n)\bigr)}{\sum_{j=1}^{k} \alpha_j(n)\, p\bigl(x_a \mid \theta_j(n)\bigr)}. $$
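A sketch of this E-step in NumPy/SciPy (same hypothetical parameter lists as in the earlier snippets); the returned array resp[a, i] holds h_{ia}(n+1).

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, covs):
    """E-step: responsibilities resp[a, i] = p(i | x_a, Theta(n))."""
    # weighted[a, i] = alpha_i(n) * p(x_a | theta_i(n))
    weighted = np.column_stack(
        [a * multivariate_normal.pdf(X, mean=m, cov=c)
         for a, m, c in zip(alphas, mus, covs)])
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize over components
```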
EM
M-Step: With Q = h_{ia}(n+1) fixed, maximize 𝓕 with respect to Θ:
$$ \Theta(n+1) = \arg\max_{\Theta} \mathcal{F}\bigl(\Theta, h(n+1)\bigr) = \arg\max_{\Theta} \sum_{a=1}^{N} \sum_{i=1}^{k} h_{ia}(n+1)\, \ln\bigl[ \alpha_i \, p(x_a \mid \theta_i) \bigr]. $$
Maximizing with respect to the α_i, subject to the condition
$$ \sum_{i=1}^{k} \alpha_i = 1, $$
gives
$$ \alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{ia}(n+1) = \frac{1}{N} \sum_{a=1}^{N} p\bigl(i \mid x_a, \Theta(n)\bigr) \qquad \text{for the } \alpha_i,\ i = 1, \ldots, k. $$
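With the responsibility matrix from the hypothetical e_step above, this prior update is just a column average (a minimal sketch).

```python
def m_step_priors(resp):
    """alpha_i(n+1) = (1/N) sum_a h_ia(n+1): the average responsibility for component i."""
    return resp.mean(axis=0)
```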
EM
M-Step (continued): With Q = h_{ia}(n+1) fixed, maximize 𝓕(Θ, h) with respect to the θ_j:
$$ \Theta(n+1) = \arg\max_{\Theta} \mathcal{F}\bigl(\Theta, h(n+1)\bigr) = \arg\max_{\Theta} \sum_{a=1}^{N} \sum_{i=1}^{k} h_{ia}(n+1)\, \ln\bigl[ \alpha_i \, p(x_a \mid \theta_i) \bigr]. $$
Maximize with respect to each θ_j separately (e.g. set ∇_{θ_j} 𝓕 = 0), so the above reduces to
$$ \theta_j(n+1) = \arg\max_{\theta_j} \sum_{a=1}^{N} h_{ja}(n+1)\, \ln p(x_a \mid \theta_j). $$
Example – Mixture of Gaussians
Component densities
$$ p(x \mid \theta_i) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}} \exp\!\Bigl( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \Bigr) $$
E-Step
$$ h_{ia}(n+1) = p\bigl(i \mid x_a, \Theta(n)\bigr) = \frac{\alpha_i(n)\, p\bigl(x_a \mid \theta_i(n)\bigr)}{\sum_{j=1}^{k} \alpha_j(n)\, p\bigl(x_a \mid \theta_j(n)\bigr)} $$
M-Step
$$ \alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{ia}(n+1) $$
$$ \mu_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, x_a}{\sum_{a=1}^{N} h_{ia}(n+1)} $$
$$ \Sigma_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, \bigl(x_a - \mu_i(n+1)\bigr)\bigl(x_a - \mu_i(n+1)\bigr)^{T}}{\sum_{a=1}^{N} h_{ia}(n+1)} $$
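The full Gaussian M-step in code might look like the following sketch (resp is the N-by-k responsibility array from the hypothetical e_step above).

```python
import numpy as np

def m_step(X, resp):
    """M-step for a Gaussian mixture: re-estimate priors, means, and covariances."""
    N_i = resp.sum(axis=0)                      # effective number of points per component
    alphas = N_i / X.shape[0]                   # alpha_i(n+1)
    mus = (resp.T @ X) / N_i[:, None]           # mu_i(n+1): responsibility-weighted means
    covs = []
    for i in range(resp.shape[1]):
        d = X - mus[i]                          # deviations from the new mean
        covs.append((resp[:, i, None] * d).T @ d / N_i[i])   # weighted outer products
    return alphas, mus, covs
```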
Gaussian Mixtures
Let's interpret the M-Step equations.
The new estimate of the prior for the i-th component is the average over the data points of the posteriors for the i-th component:
$$ \alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{ia}(n+1). $$
The new mean for the i-th component is a weighted average of the data points, where the weight is the fraction of data point x_a attributed to component i, h_{ia}(n+1):
$$ \mu_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, x_a}{\sum_{a=1}^{N} h_{ia}(n+1)}. $$
The new covariance is constructed from the weighted outer products:
$$ \Sigma_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, \bigl(x_a - \mu_i(n+1)\bigr)\bigl(x_a - \mu_i(n+1)\bigr)^{T}}{\sum_{a=1}^{N} h_{ia}(n+1)}. $$
Gaussian Mixture Model
Flexible — can make lots of shapes!
EM Summary — Gaussian Mixtures
Initialize the parameters:
• $\alpha_i(0) = 1/k$: all components equally likely
• $\mu_i(0)$: k randomly chosen points from the training data
• $\Sigma_i(0)$: a symmetric, positive definite matrix, e.g. $\sigma^2 I$
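A minimal initialization sketch along these lines (k and sigma2 are assumed hyperparameters, not values from the slides).

```python
import numpy as np

def init_params(X, k, sigma2=1.0, seed=0):
    """Initialize a k-component Gaussian mixture as suggested above."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    alphas = np.full(k, 1.0 / k)                   # all components equally likely
    mus = X[rng.choice(N, size=k, replace=False)]  # k randomly chosen training points
    covs = [sigma2 * np.eye(n) for _ in range(k)]  # symmetric positive definite: sigma^2 * I
    return alphas, mus, covs
```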
EM Summary — Gaussian Mixtures
Iterate:
E-Step (estimate posteriors)
$$ h_{ia}(n+1) = p\bigl(i \mid x_a, \Theta(n)\bigr) = \frac{\alpha_i(n)\, p\bigl(x_a \mid \theta_i(n)\bigr)}{\sum_{j=1}^{k} \alpha_j(n)\, p\bigl(x_a \mid \theta_j(n)\bigr)} $$
M-Step
Re-estimate priors
$$ \alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{ia}(n+1) $$
Re-estimate means
$$ \mu_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, x_a}{\sum_{a=1}^{N} h_{ia}(n+1)} $$
Re-estimate covariances
$$ \Sigma_i(n+1) = \frac{\sum_{a=1}^{N} h_{ia}(n+1)\, \bigl(x_a - \mu_i(n+1)\bigr)\bigl(x_a - \mu_i(n+1)\bigr)^{T}}{\sum_{a=1}^{N} h_{ia}(n+1)} $$
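Tying the pieces together, a compact driver loop might look like this sketch; it reuses the hypothetical init_params, e_step, m_step, and log_likelihood helpers from the earlier snippets, and n_iter and tol are illustrative choices.

```python
import numpy as np

def fit_gmm(X, k, n_iter=100, tol=1e-6):
    """Fit a k-component Gaussian mixture to X by EM, as summarized above."""
    alphas, mus, covs = init_params(X, k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        resp = e_step(X, alphas, mus, covs)         # E-step: posteriors h_ia(n+1)
        alphas, mus, covs = m_step(X, resp)         # M-step: re-estimate parameters
        ll = log_likelihood(X, alphas, mus, covs)   # monitor the data log-likelihood
        if ll - prev_ll < tol:                      # EM never decreases the likelihood
            break
        prev_ll = ll
    return alphas, mus, covs
```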
Caveats
In high dimensions n there are many covariance matrix elements (n(n+1)/2 per component), so the model is likely to overfit.
Fixes – constrain the covariance matrices to have fewer free parameters:
• Diagonal: $\Sigma_i = \mathrm{diag}(\sigma_{i1}^2, \ldots, \sigma_{in}^2)$
• Spherically symmetric: $\Sigma_i = \sigma_i^2 I$, with I the identity matrix
• Some other clever form
Note that any constraints modify the M-step equations for the covariance — can you derive the forms?
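As an illustration of how a constraint changes the covariance update, here is one way the constrained re-estimation can work out in code (a sketch, not the slides' derivation; resp and mus follow the conventions of the earlier snippets).

```python
import numpy as np

def m_step_cov_diagonal(X, resp, mus):
    """Diagonal constraint: keep only responsibility-weighted per-coordinate variances."""
    N_i = resp.sum(axis=0)
    covs = []
    for i in range(resp.shape[1]):
        d = X - mus[i]
        var = (resp[:, i, None] * d**2).sum(axis=0) / N_i[i]   # weighted variance per coordinate
        covs.append(np.diag(var))
    return covs

def m_step_cov_spherical(X, resp, mus):
    """Spherical constraint: Sigma_i = sigma_i^2 * I, a single variance per component."""
    n = X.shape[1]
    N_i = resp.sum(axis=0)
    covs = []
    for i in range(resp.shape[1]):
        d = X - mus[i]
        sigma2 = (resp[:, i] * (d**2).sum(axis=1)).sum() / (N_i[i] * n)  # average over coordinates
        covs.append(sigma2 * np.eye(n))
    return covs
```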
Caveats
There are regions of the parameter space where the likelihood diverges but the resulting model is bad. Consider one Gaussian wrapped around a single data point x': let μ_i = x' and take σ_i² → 0. Then
$$ p(x' \mid \mu_i, \sigma_i^2) \to \infty $$
and the likelihood grows without bound. This is particularly likely in high dimensions, where the average distance between data points becomes large.
The fix is regularization (which has a grounding in Bayesian priors and MAP estimation): after re-estimation, add a small diagonal matrix to the covariance,
$$ \Sigma_i(n+1) \leftarrow \Sigma_i(n+1) + \varepsilon I. $$
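A one-line sketch of this regularization, applied right after the covariance re-estimation (eps is an assumed small constant, e.g. 1e-6).

```python
import numpy as np

def regularize_covs(covs, eps=1e-6):
    """Add a small multiple of the identity to each covariance to keep it well conditioned."""
    return [c + eps * np.eye(c.shape[0]) for c in covs]
```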
References
• Dempster, Laird, and Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. B, 39, 1-39, 1977.
• Redner and Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239, 1984.
• Ormoneit and Tresp. In Advances in Neural Information Processing Systems 8, The MIT Press, 1996.
• Li and Barron. Mixture density estimation. In Solla, Leen, and Mueller (eds.), Advances in Neural Information Processing Systems 12, The MIT Press, 2000.