
ANLY-601
Advanced Pattern Recognition

Spring 2018

L10 – Parameter Estimation

Parameter Estimation
Maximum Likelihood and Bayes Estimates

The following lectures expand on our earlier discussion of parameter estimates, introducing some formal grounding. (A good supplemental source for this is chapter 2 of Neural Networks for Pattern Recognition, Chris Bishop.)

We’ll discuss parametric density estimates in more detail, including mixture models for densities.


Parametric Density Models

A model of specific functional form.

A small number of parameters that are estimated from data.

e.g. Normal distribution

$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$

Data – D = {x1, x2, x3, …, xm}

Parameter estimates

$$ \hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i \,, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \hat{\mu})^2 $$

but where did these forms for the estimators come from?
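As a quick illustration (a minimal Python/NumPy sketch of my own, with synthetic data; not part of the original slides), these estimators are just the sample mean and the average squared deviation from it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data D = {x_1, ..., x_m}

m = len(x)
mu_hat = x.sum() / m                            # estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / m      # estimate of the variance (divides by m, not m - 1)

print(mu_hat, sigma2_hat)
```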

Maximum Likelihood Estimation

Question — what’s the probability that the dataset D occurs, given the form of the model density?

We assume each of the x_i is sampled independently from the underlying (normal in this example) distribution, then

$$ p(D \mid \mu, \sigma^2) = p(x_1, \ldots, x_m \mid \mu, \sigma^2) = \prod_{i=1}^{m} p(x_i \mid \mu, \sigma^2) $$

[Figure: data histogram with the model density overlaid.] If the data (histogram) and model look like this, the data will have low probability given the model (data at the tails of the model distribution).

Maximum Likelihood Estimation


[Figure: data histogram with the model density overlaid, mean adjusted.] We can adjust the model mean so that the data has higher probability under the model. But this model still has low tails where there’s plenty of data, because the peak is too sharp.

  2 2 21
1

( | , ) , … | , ( | , )
m

m i

i

p D p x x p x     

  

5

Maximum Likelihood Estimation


[Figure: data histogram with the fitted model density overlaid.] Here’s a Gaussian model that chooses the mean and variance so as to maximize the data likelihood under the model.

  2 2 21
1

( | , ) , … | , ( | , )
m

m i

i

p D p x x p x     

  

6
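To make the picture concrete, here is a small sketch of my own (Python with SciPy, synthetic data; not from the slides) that evaluates the Gaussian log-likelihood for the three situations pictured above: a badly placed, sharply peaked model; a recentered but still too-sharp model; and the maximum-likelihood fit:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=500)     # data from the underlying distribution

def log_likelihood(mu, sigma):
    """Log-likelihood of the data under a Gaussian model N(mu, sigma^2)."""
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

print(log_likelihood(-5.0, 0.5))                 # model far from the data: very low likelihood
print(log_likelihood(x.mean(), 0.5))             # mean adjusted, but the peak is too sharp
print(log_likelihood(x.mean(), x.std()))         # ML choice of mean and variance: highest value
```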

Maximum Likelihood Estimation

So, we adjust the model parameters to maximize the data likelihood. Since the log is monotonic in its arguments, and we often deal with model distributions from the exponential family, it’s convenient to maximize the log-likelihood

$$ L = \ln p(D \mid \mu, \sigma^2) = \sum_{i=1}^{m} \ln p(x_i \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{m}(x_i - \mu)^2 - \frac{m}{2}\ln(2\pi\sigma^2) $$

Setting the derivatives with respect to the parameters to zero gives the estimators:

$$ 0 = \frac{\partial L}{\partial \mu} \;\Rightarrow\; \hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i $$

$$ 0 = \frac{\partial L}{\partial \sigma^2} \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \hat{\mu})^2 $$

Note that $\hat{\sigma}^2$ is biased.
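As a numerical check (again a sketch of my own, not part of the lecture), minimizing the negative log-likelihood with a general-purpose optimizer recovers the closed-form estimates above, and the ML variance divides by m rather than m − 1:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)
m = len(x)

def neg_log_lik(params):
    # Negative log-likelihood of a Gaussian, parameterized by (mu, log sigma^2)
    mu, log_var = params
    var = np.exp(log_var)
    return 0.5 * np.sum((x - mu) ** 2) / var + 0.5 * m * np.log(2 * np.pi * var)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_ml, var_ml = res.x[0], np.exp(res.x[1])

print(mu_ml, x.mean())                          # numerical ML mean vs. closed form
print(var_ml, ((x - x.mean()) ** 2).mean())     # numerical ML variance vs. closed form (divide by m)
print(x.var(ddof=1))                            # the unbiased estimate divides by m - 1 instead
```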

Data Distributions and Cost Functions

Regression – Minimizing mean square error between the data and a regression curve is equivalent to maximizing the data likelihood under the assumption that the fitting error is Gaussian.

The data is the sequence of (x, y) coordinates. The data y values are assumed Gaussian distributed with mean g(x). That is

$$ y = g(x) + e, \qquad e \sim N(0, \sigma^2) $$

The data likelihood is

$$ p(\{y_i\} \mid \{x_i\}; g(x), \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2}\bigl(y_i - g(x_i)\bigr)^2 \right) $$

Data Distributions and Cost Functions
Regression

Maximizing the data log-likelihood L with respect to g(x) is equivalent to minimizing the sum-squared fitting error with respect to g(x).

         2
2

2

1

2
)(

2

1
2log

2

1
),( ; log

ii

m

i

xgyxgxypL 

 
 



 2
2

1

)(
2

1

ii

m

i

xgyE  
 

9
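Here is a minimal sketch (my own example with a synthetic straight-line g(x) and an assumed noise variance; not from the slides) showing that the least-squares fit and the maximum-likelihood fit under Gaussian noise coincide:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 100)
y = 2.0 * x - 1.0 + rng.normal(scale=0.5, size=x.size)   # y = g(x) + e with Gaussian noise

# Least squares: minimizes E = (1/2) sum_i (y_i - g(x_i))^2 for g(x) = a*x + b
a_ls, b_ls = np.polyfit(x, y, deg=1)

# Maximum likelihood with Gaussian errors of known variance sigma^2:
# minimize the negative log-likelihood instead
sigma2 = 0.25
def neg_log_lik(params):
    a, b = params
    resid = y - (a * x + b)
    return np.sum(resid ** 2) / (2 * sigma2) + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

a_ml, b_ml = minimize(neg_log_lik, x0=np.array([0.0, 0.0])).x
print((a_ls, b_ls), (a_ml, b_ml))   # the two fits agree (up to optimizer tolerance)
```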

Data Distributions and Cost Functions
Classification

For a (two-class) classification problem, it’s natural to write the data likelihood as a product of Bernoulli distributions (since the target values are y = 0 or 1 for each example)

$$ L = \prod_{i=1}^{m} p(y_i \mid x_i; \alpha) = \prod_{i=1}^{m} \alpha(x_i)^{y_i}\,\bigl(1 - \alpha(x_i)\bigr)^{(1 - y_i)} $$

where α(x) is the probability that for the feature vector x, the class label is 1 (rather than 0).

Maximizing this data likelihood is equivalent to minimizing its –log, the cross-entropy error

$$ E = -\sum_{i=1}^{m}\left[ y_i \log \alpha(x_i) + (1 - y_i)\log\bigl(1 - \alpha(x_i)\bigr) \right] $$
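A tiny sketch (with hypothetical numbers of my own) of the cross-entropy error computed from predicted class-1 probabilities α(x_i) and binary targets y_i:

```python
import numpy as np

# Hypothetical predicted class-1 probabilities alpha(x_i) and binary targets y_i
alpha = np.array([0.9, 0.2, 0.7, 0.05])
y = np.array([1, 0, 1, 0])

# Cross-entropy error: the negative log of the Bernoulli data likelihood
E = -np.sum(y * np.log(alpha) + (1 - y) * np.log(1 - alpha))
print(E)
```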

Bayesian Estimation and
Parameter Posterior Distributions

Maximum likelihood estimation — there exists an actual value of the parameters Θ0 that we estimate by maximizing the probability of the data conditioned on the parameters

$$ \hat{\Theta}_0 = \arg\max_{\Theta} \, p(D \mid \Theta) $$

An arguably more natural approach is to find the most probable values of the parameters, conditioned on the available data. Thus, parameters are regarded as random variables with their own distribution — the posterior distribution

$$ p(\Theta \mid D) = \frac{p(D \mid \Theta)\, P(\Theta)}{p(D)} $$

where P(Θ) is the prior on Θ.

Maximum A Posteriori Estimation

Maximizing the log of the posterior with respect to the parameters gives the maximum a posteriori (or MAP) estimate

$$ \hat{\Theta} = \arg\max_{\Theta} \left[ \log p(D \mid \Theta) + \log P(\Theta) \right] $$

The prior distribution P(Θ) is something we choose. Notice that if the prior is independent of Θ (a flat prior), the MAP and maximum likelihood estimates are the same.

It is convenient to choose the prior distribution P(Θ) so that the consequent posterior p(Θ | D) has the same functional form as P(Θ). The proper form depends, of course, on the form of the likelihood function p(D | Θ). Such priors are called conjugate priors.
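As an illustration (my own sketch, anticipating the Gaussian-prior example on the next slides; the data and prior parameters are assumptions), the MAP estimate maximizes log p(D | μ) + log P(μ) and is pulled from the sample mean toward the prior mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(4)
sigma = 1.0                                  # known data standard deviation
x = rng.normal(loc=2.0, scale=sigma, size=10)

mu0, sigma_l = 0.0, 0.5                      # Gaussian prior on mu: N(mu0, sigma_l^2)

def neg_log_posterior(mu):
    # -[ log p(D | mu) + log P(mu) ]  (dropping the constant log p(D))
    return -(norm.logpdf(x, loc=mu, scale=sigma).sum()
             + norm.logpdf(mu, loc=mu0, scale=sigma_l))

mu_map = minimize_scalar(neg_log_posterior).x
print(mu_map)      # pulled from the sample mean toward the prior mean mu0
print(x.mean())    # the ML estimate; with a flat prior, MAP coincides with this
```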

Example: Posterior Distribution of
the Gaussian Model’s Mean

Suppose the data is Gaussian, and the variance is known, but we just want to estimate the mean. The conjugate prior for this is a Gaussian. The posterior on the mean is

$$ p(\mu \mid D, \sigma^2) = \frac{p(D \mid \mu, \sigma^2)\, p(\mu)}{p(D)} = \frac{1}{p(D)} \, \frac{1}{(2\pi\sigma^2)^{m/2}} \, \frac{1}{(2\pi\sigma_l^2)^{1/2}} \exp\left( -\frac{1}{2\sigma^2}\sum_{i=1}^{m}(x_i - \mu)^2 \;-\; \frac{1}{2\sigma_l^2}(\mu - \mu_0)^2 \right) $$

where $\sigma_l^2$ and $\mu_0$ are the variance and mean of the prior distribution on $\mu$.

Example: Posterior Distribution of
the Gaussian Model’s Mean

After some algebraic manipulation, we can rewrite the posterior distribution as

$$ p(\mu \mid D, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left( -\frac{1}{2\sigma_m^2}(\mu - \mu_m)^2 \right) $$

with the posterior mean and variance given by (show this!)

$$ \mu_m = \frac{m\sigma_l^2}{m\sigma_l^2 + \sigma^2} \left( \frac{1}{m}\sum_{i=1}^{m} x_i \right) + \frac{\sigma^2}{m\sigma_l^2 + \sigma^2}\, \mu_0 \,, \qquad \sigma_m^2 = \frac{\sigma_l^2\, \sigma^2}{m\sigma_l^2 + \sigma^2} $$

Example: Posterior Distribution of
the Gaussian Model’s Mean

Note that as $m\sigma_l^2 / \sigma^2 \to \infty$ (i.e., as the number of samples m grows), the posterior mean $\mu_m$ approaches the sample mean (the ML estimate), and the posterior variance $\sigma_m^2$ becomes small.

Example: Posterior Distribution of
the Gaussian Model’s Mean

Without data (m = 0), the posterior is just the original prior on μ. As we add samples, the posterior remains Gaussian (that’s the point of a conjugate prior) but its mean and variance change in response to the data.

[Figure: the posterior p(μ | D, σ²) shown with no data, some data, and lots of data.]
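A short numerical sketch of my own (using the posterior mean and variance formulas above, with assumed values for σ, μ0, and σ_l) showing how the posterior starts at the prior and tightens around the sample mean as data accumulates:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0                      # known data standard deviation (assumed)
mu0, sigma_l = 0.0, 2.0          # assumed prior on mu: N(mu0, sigma_l^2)
x_all = rng.normal(loc=3.0, scale=sigma, size=1000)

for m in [0, 5, 50, 1000]:       # no data, some data, lots of data
    x = x_all[:m]
    xbar = x.mean() if m > 0 else 0.0
    # Conjugate-Gaussian update: posterior mean and variance from the slides
    mu_m = (m * sigma_l**2 * xbar + sigma**2 * mu0) / (m * sigma_l**2 + sigma**2)
    var_m = (sigma_l**2 * sigma**2) / (m * sigma_l**2 + sigma**2)
    print(m, mu_m, var_m)        # mu_m -> sample mean and var_m -> 0 as m grows
```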