
ANLY-601
Advanced Pattern Recognition

Spring 2018

L12 – Mixture Density Models, EM Algorithm

Mixture Density Models

• Flexible models – able to fit lots of densities

• Fit parameters by maximum likelihood. The nonlinear equations require an iterative fitting procedure; the standard is Expectation-Maximization (EM).

• “Soft” version of clustering.

General form is

$p(x\,|\,\Theta) = \sum_{j=1}^{k} \alpha_j\, p(x\,|\,j, \theta_j)$

where

• $p(x\,|\,j, \theta_j)$ are component densities with parameter vectors $\theta_j$

• $\Theta = (\alpha_1,\ldots,\alpha_k,\ \theta_1,\ldots,\theta_k)$

• $0 \le \alpha_j \le 1$, $\sum_{j=1}^{k} \alpha_j = 1$; $\alpha_j$ is the prior probability for mixture component $j$

Generative Model

Mixture model form

$p(x\,|\,\Theta) = \sum_{j=1}^{k} \alpha_j\, p(x\,|\,j, \theta_j)$

with the component densities $p(x\,|\,j, \theta_j)$, parameters $\Theta = (\alpha_1,\ldots,\alpha_k,\ \theta_1,\ldots,\theta_k)$, and priors $\alpha_j$ as defined above.
Generating $x$ is a two-step sampling procedure (see the sketch below):

1. Pick a component density $j$ with probability $\alpha_j$
2. Generate a sample $x$ from $p(x\,|\,j, \theta_j)$
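A minimal sketch of this two-step sampler, using Gaussian components as a concrete (illustrative) choice; the parameter values and variable names are mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

alphas = np.array([0.3, 0.7])   # mixing priors alpha_j, sum to 1 (illustrative)
means  = np.array([-2.0, 3.0])  # component parameters theta_j (illustrative)
stds   = np.array([0.5, 1.0])

def sample_mixture(n_samples):
    # Step 1: pick a component index j with probability alpha_j
    j = rng.choice(len(alphas), size=n_samples, p=alphas)
    # Step 2: draw x from the chosen component density p(x | j, theta_j)
    return rng.normal(means[j], stds[j])

x = sample_mixture(1000)
```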

Mixture Models

The most common example is a mixture of Gaussians:

$p(x\,|\,\Theta) = \sum_{j=1}^{k} \alpha_j\, p(x\,|\,j, \theta_j), \quad\text{with}\quad p(x\,|\,j, \theta_j) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_j|^{1/2}} \exp\Big[-\tfrac{1}{2}\,(x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j)\Big]$

There's a universal approximation theorem¹ for such mixtures that states that with enough components, a mixture of Gaussians fit by maximum likelihood can arbitrarily closely match any density on a compact subset of $R^n$.

1. Jonathan Li and Andrew Barron. Mixture Density Estimation, in Solla, Leen, and Mueller (eds.) Advances in Neural Information Processing Systems 12, The MIT Press, 2000.

Gaussian Mixture Model


Flexible — can make lots of shapes!

Fitting Mixture Models

Suppose we have a data set

$D = \{x_a,\ a = 1,\ldots,N\}$, with each $x_a$ a vector in $R^n$.

We'd like to adjust the parameters

$\Theta = (\alpha_1,\ldots,\alpha_k,\ \theta_1,\ldots,\theta_k)$

so as to maximize the data log likelihood

$L(\Theta) = \ln P(D\,|\,\Theta) = \sum_{a=1}^{N} \ln \sum_{j=1}^{k} \alpha_j\, p(x_a\,|\,j, \theta_j)$
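A hedged sketch of evaluating $L(\Theta)$ for a Gaussian mixture on data X of shape (N, n); the names alphas, means, covs are assumptions of this sketch. Sums are done in log space with logsumexp for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, alphas, means, covs):
    # log alpha_j + log p(x_a | j, theta_j), one column per component: shape (N, k)
    log_terms = np.stack(
        [np.log(a) + multivariate_normal.logpdf(X, m, c)
         for a, m, c in zip(alphas, means, covs)],
        axis=1)
    # log-sum-exp over components, then sum over data points
    return logsumexp(log_terms, axis=1).sum()
```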

Fitting Mixture Models

The data log likelihood

$L(\Theta) = \ln P(D\,|\,\Theta) = \sum_{a=1}^{N} \ln \sum_{j=1}^{k} \alpha_j\, p(x_a\,|\,j, \theta_j)$

cannot be maximized in one step — the maximization equations don't have a closed-form solution.

Instead, use an iterative approach — the EM algorithm.

For the moment, rewrite the log-likelihood as (suppressing the mixture form of $p(x\,|\,\Theta)$)

$L(\Theta) = \ln P(D\,|\,\Theta) = \sum_{a=1}^{N} \ln p(x_a\,|\,\Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} p(i_a, x_a\,|\,\Theta)$

where $i_a$ is the unknown index of the component responsible for generating $x_a$.

Fitting Mixture Models

Next, we write a lower bound for $L$. Introduce an average over any probability distribution $Q(i_a)$ on the unknown indices $i_a$:

$L(\Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} p(i_a, x_a\,|\,\Theta) = \sum_{a=1}^{N} \ln \sum_{i_a=1}^{k} Q(i_a)\, \frac{p(i_a, x_a\,|\,\Theta)}{Q(i_a)}$

Jensen's inequality gives

$L(\Theta) \;\ge\; \sum_{a=1}^{N} \sum_{i_a=1}^{k} Q(i_a) \ln \frac{p(i_a, x_a\,|\,\Theta)}{Q(i_a)} \;=\; \sum_{a=1}^{N} \sum_{i_a=1}^{k} \Big[ Q(i_a) \ln p(i_a, x_a\,|\,\Theta) - Q(i_a) \ln Q(i_a) \Big] \;\equiv\; \mathcal{F}(Q, \Theta)$

The equality holds when $Q(i_a)$ is the posterior distribution on the unknown indices

$Q(i_a) = p(i_a\,|\,x_a, \Theta)$

(With this choice, $p(i_a, x_a\,|\,\Theta)/Q(i_a) = p(x_a\,|\,\Theta)$ is independent of $i_a$, so the logarithm of the average equals the average of the logarithm and the bound is tight.)

EM Algorithm

Iterative optimization algorithm: Expectation-Maximization (EM) maximizes $\mathcal{F}$ (which maximizes $L$). There are multiple optima; EM only finds a local optimum.

Initialize the algorithm to some choice of the parameters. At the (n+1)th iteration:

E Step: With $\Theta$ fixed at $\Theta(n)$, estimate the index distribution as

$Q(i_a) = h_{i,a}(n+1) \equiv p\big(i\,|\,x_a, \Theta(n)\big) = \frac{\alpha_i(n)\, p\big(x_a\,|\,i, \theta_i(n)\big)}{\sum_{j=1}^{k} \alpha_j(n)\, p\big(x_a\,|\,j, \theta_j(n)\big)}$
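A minimal sketch of the E step. It assumes a hypothetical helper component_pdf(X, j) (not from the slides) that returns $p(x_a\,|\,j, \theta_j(n))$ for every row of X, and alphas holding the current priors $\alpha_j(n)$.

```python
import numpy as np

def e_step(X, alphas, component_pdf):
    k = len(alphas)
    # unnormalized posteriors alpha_j(n) * p(x_a | j, theta_j(n)), shape (N, k)
    weighted = np.stack([alphas[j] * component_pdf(X, j) for j in range(k)], axis=1)
    # normalize across components to get the responsibilities h_{j,a}(n+1)
    return weighted / weighted.sum(axis=1, keepdims=True)
```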

EM

M Step: With $Q = h_{i,a}(n+1)$ fixed, maximize $\mathcal{F}$ with respect to $\Theta$,

$\Theta(n+1) = \arg\max_{\Theta} \mathcal{F}\big(\Theta, h(n+1)\big) = \arg\max_{\Theta} \sum_{a=1}^{N} \sum_{i=1}^{k} h_{i,a}(n+1)\, \ln \big[\alpha_i\, p(x_a\,|\,i, \theta_i)\big]$

subject to the condition

$\sum_{i=1}^{k} \alpha_i = 1$

This gives

$\alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{i,a}(n+1) = \frac{1}{N} \sum_{a=1}^{N} p\big(i\,|\,x_a, \Theta(n)\big)$

for the $\alpha_i$, $i = 1,\ldots,k$.
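As a quick check (not spelled out on the slide), the prior update follows from maximizing the $\alpha$-dependent part of $\mathcal{F}$ with a Lagrange multiplier $\lambda$ for the constraint:

$\frac{\partial}{\partial \alpha_i}\Big[\sum_{a=1}^{N}\sum_{j=1}^{k} h_{j,a}(n+1) \ln \alpha_j + \lambda\Big(1 - \sum_{j}\alpha_j\Big)\Big] = \frac{1}{\alpha_i}\sum_{a=1}^{N} h_{i,a}(n+1) - \lambda = 0$

so $\alpha_i = \frac{1}{\lambda}\sum_{a} h_{i,a}(n+1)$; summing over $i$ and using $\sum_i h_{i,a} = 1$ gives $\lambda = N$.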

EM

M Step (continued): With $Q = h_{i,a}(n+1)$ fixed, maximize $\mathcal{F}(\Theta, h)$ with respect to the $\theta_j$,

$\Theta(n+1) = \arg\max_{\Theta} \mathcal{F}\big(\Theta, h(n+1)\big) = \arg\max_{\Theta} \sum_{a=1}^{N} \sum_{i=1}^{k} h_{i,a}(n+1)\, \ln \big[\alpha_i\, p(x_a\,|\,i, \theta_i)\big]$

Maximize $\mathcal{F}$ with respect to each $\theta_j$ separately (e.g. set $\nabla_{\theta_j} \mathcal{F} = 0$), so the above reduces to

$\theta_j(n+1) = \arg\max_{\theta_j} \sum_{a=1}^{N} h_{j,a}(n+1)\, \ln p(x_a\,|\,j, \theta_j)$

Example – Mixture of Gaussians

Component densities

$p(x\,|\,i, \theta_i) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\Big[-\tfrac{1}{2}\,(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\Big]$

E-Step

$h_{i,a}(n+1) = p\big(i\,|\,x_a, \Theta(n)\big) = \frac{\alpha_i(n)\, p\big(x_a\,|\,i, \theta_i(n)\big)}{\sum_{j=1}^{k} \alpha_j(n)\, p\big(x_a\,|\,j, \theta_j(n)\big)}$

M-Step

$\alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{i,a}(n+1)$

$\mu_i(n+1) = \frac{\sum_{a=1}^{N} h_{i,a}(n+1)\, x_a}{\sum_{a} h_{i,a}(n+1)}$

$\Sigma_i(n+1) = \frac{\sum_{a=1}^{N} h_{i,a}(n+1)\, \big(x_a - \mu_i(n+1)\big)\big(x_a - \mu_i(n+1)\big)^T}{\sum_{a} h_{i,a}(n+1)}$
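An illustrative sketch of one full EM pass for a Gaussian mixture, following the E-step and M-step equations above; the variable names (alphas, means, covs, H) are mine, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(X, alphas, means, covs):
    N, n = X.shape
    k = len(alphas)

    # E-step: responsibilities h_{i,a}(n+1), shape (N, k)
    H = np.stack([alphas[i] * multivariate_normal.pdf(X, means[i], covs[i])
                  for i in range(k)], axis=1)
    H /= H.sum(axis=1, keepdims=True)

    # M-step: re-estimate priors, means, covariances
    Nk = H.sum(axis=0)                      # effective counts per component
    new_alphas = Nk / N
    new_means = (H.T @ X) / Nk[:, None]
    new_covs = []
    for i in range(k):
        d = X - new_means[i]                # centered data, shape (N, n)
        new_covs.append((H[:, i, None] * d).T @ d / Nk[i])
    return new_alphas, new_means, np.array(new_covs)
```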

Gaussian Mixtures

Let's interpret the equations for the M-Step.

The new estimate of the prior for the ith component is the average over the data points of the posteriors for the ith component:

$\alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{i,a}(n+1)$

The new mean for the ith component is a weighted average of the data points. The weighting is the fraction of the data point $x_a$ attributed to component $i$, $h_{i,a}(n+1)$:

$\mu_i(n+1) = \frac{\sum_{a} h_{i,a}(n+1)\, x_a}{\sum_{a} h_{i,a}(n+1)}$

The new covariance is constructed from a weighted outer product:

$\Sigma_i(n+1) = \frac{\sum_{a} h_{i,a}(n+1)\, \big(x_a - \mu_i(n+1)\big)\big(x_a - \mu_i(n+1)\big)^T}{\sum_{a} h_{i,a}(n+1)}$


EM Summary — Gaussian Mixtures

Initialize parameters:

• $\alpha_i(0) = 1/k$: all components equally likely

• $\mu_i(0)$: $k$ randomly chosen points from the training data

• $\Sigma_i(0)$: a symmetric, positive-definite matrix, e.g. $\sigma^2 I$
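A minimal initialization sketch following the slide's recipe; sigma2 is an assumed scale parameter for the initial covariances.

```python
import numpy as np

def init_params(X, k, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    N, n = X.shape
    alphas = np.full(k, 1.0 / k)                      # all components equally likely
    means = X[rng.choice(N, size=k, replace=False)]   # k randomly chosen training points
    covs = np.stack([sigma2 * np.eye(n)] * k)         # sigma^2 * I for each component
    return alphas, means, covs
```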

EM Summary — Gaussian Mixtures

Iterate:

E-Step (estimate posteriors)

$h_{i,a}(n+1) = p\big(i\,|\,x_a, \Theta(n)\big) = \frac{\alpha_i(n)\, p\big(x_a\,|\,i, \theta_i(n)\big)}{\sum_{j=1}^{k} \alpha_j(n)\, p\big(x_a\,|\,j, \theta_j(n)\big)}$

M-Step

Re-estimate priors:

$\alpha_i(n+1) = \frac{1}{N} \sum_{a=1}^{N} h_{i,a}(n+1)$

Re-estimate means:

$\mu_i(n+1) = \frac{\sum_{a} h_{i,a}(n+1)\, x_a}{\sum_{a} h_{i,a}(n+1)}$

Re-estimate covariances:

$\Sigma_i(n+1) = \frac{\sum_{a} h_{i,a}(n+1)\, \big(x_a - \mu_i(n+1)\big)\big(x_a - \mu_i(n+1)\big)^T}{\sum_{a} h_{i,a}(n+1)}$
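Putting the earlier sketches together (init_params, em_iteration, log_likelihood are the hypothetical helpers defined above), a simple driver loop might look like this; the tolerance and iteration cap are arbitrary choices.

```python
import numpy as np

def fit_gmm(X, k, max_iter=200, tol=1e-6):
    alphas, means, covs = init_params(X, k)
    prev = -np.inf
    for _ in range(max_iter):
        alphas, means, covs = em_iteration(X, alphas, means, covs)
        ll = log_likelihood(X, alphas, means, covs)
        if ll - prev < tol:        # EM never decreases the likelihood
            break
        prev = ll
    return alphas, means, covs
```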

Caveats

In high dimensions $n$, there are many covariance matrix elements ($n(n+1)/2$ per component), so the model is likely to overfit.

Fixes – constrain the covariance matrices to have fewer free parameters:

• Diagonal: $\Sigma_i = \mathrm{diag}\big(\sigma_{i,1}^2, \ldots, \sigma_{i,n}^2\big)$

• Spherically symmetric: $\Sigma_i = \sigma_i^2 I$, with $I$ the identity matrix

• Some other clever form (???)

Note that any constraints modify the M-step equations for the covariance — can you derive the forms? (A diagonal-covariance sketch follows below.)
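A hedged sketch of the modified covariance M-step under the diagonal constraint: only the per-dimension weighted variances are kept and the off-diagonal elements are zero. The argument names are assumptions matching the EM sketch above.

```python
import numpy as np

def m_step_diagonal_cov(X, H, new_means):
    # X: (N, n) data, H: (N, k) responsibilities, new_means: (k, n) updated means
    Nk = H.sum(axis=0)                                     # effective counts per component
    covs = []
    for i in range(len(Nk)):
        d = X - new_means[i]
        var = (H[:, i, None] * d**2).sum(axis=0) / Nk[i]   # weighted variance per dimension
        covs.append(np.diag(var))
    return np.array(covs)
```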

Caveats

There are regions of the parameter space where the likelihood goes through the roof but the resulting model is bad.

Example: wrap one Gaussian around a single point $x'$. Let $\mu_i = x'$ and take $\sigma_i^2 \to 0$. Then $p(x'\,|\,\mu_i, \sigma_i^2) \to \infty$ and the likelihood grows without bound. This is particularly likely in high dimensions, where the average distance between datapoints becomes large.

Regularization (which has a grounding in Bayesian priors and MAP estimation): after re-estimation, add a small diagonal matrix to the covariance

$\Sigma_i(n+1) \leftarrow \Sigma_i(n+1) + \varepsilon I$
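A one-line regularization sketch: after the covariance M-step, add $\varepsilon I$ (eps is a small, user-chosen constant) to keep each covariance well conditioned.

```python
import numpy as np

def regularize_covs(covs, eps=1e-6):
    n = covs.shape[-1]
    return covs + eps * np.eye(n)   # Sigma_i(n+1) <- Sigma_i(n+1) + eps * I
```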

References

• Dempster, Laird, and Rubin. Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statistical Soc. B, 39, 1-39, 1977.

• Redner and Walker. Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26, 195-239, 1984.

• Ormoneit and Tresp. In Advances in Neural Information Processing Systems 8, The MIT Press, 1996.

• Jonathan Li and Andrew Barron. Mixture Density Estimation, in Solla, Leen, and Mueller (eds.) Advances in Neural Information Processing Systems 12, The MIT Press, 2000.