
ANLY-601
Advanced Pattern Recognition

Spring 2018

L11 – MAP Estimates, Bayesian Inference, Hyperparameter Choice

Continuing with Bayesian Methods

MAP Estimates, Bayesian Inference, and Hyperparameter Choice


Why Use a MAP Estimate When They're Biased?

Consider the expected squared error of any estimator:

$$\mathrm{MSE} = E\big[(\hat\theta - \theta)^2\big] = E\big[(\hat\theta - E[\hat\theta])^2\big] + \big(E[\hat\theta] - \theta\big)^2 = \mathrm{var}(\hat\theta) + \mathrm{bias}^2(\hat\theta)$$

Bias isn't the only consideration – variance is also important. There's usually a trade-off: increase the bias, and the variance drops, and vice-versa.
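A quick numerical check of this decomposition, as a minimal sketch (it assumes NumPy is available, and the "shrunken mean" below is just an illustrative biased estimator): draw many datasets, compute an estimator on each, and verify that the empirical MSE matches the empirical variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, m, trials = 1.0, 2.0, 20, 200_000

# Draw many datasets of size m and form two estimators of the mean:
# the (unbiased) sample mean and a deliberately shrunken (biased) version.
x = rng.normal(mu_true, sigma, size=(trials, m))
est_unbiased = x.mean(axis=1)
est_shrunk = 0.8 * est_unbiased   # illustrative biased estimator

for name, est in [("sample mean", est_unbiased), ("shrunken mean", est_shrunk)]:
    mse = np.mean((est - mu_true) ** 2)
    var = np.var(est)
    bias2 = (np.mean(est) - mu_true) ** 2
    # MSE should equal var + bias^2 up to Monte Carlo noise.
    print(f"{name:14s}  MSE={mse:.4f}  var+bias^2={var + bias2:.4f}")
```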


Bias-Variance Trade-Off and MAP Estimates

Let's go back to our MAP estimate of the mean for Gaussian data:

$$\hat\mu_{\mathrm{MAP}} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/m}\,\frac{1}{m}\sum_{i=1}^{m} x_i \;+\; \frac{\sigma^2/m}{\sigma_0^2 + \sigma^2/m}\,\mu_0$$

The bias and variance are (show these!)

$$\mathrm{bias}^2 = \big(E[\hat\mu] - \mu\big)^2 = \left(\frac{\sigma^2/m}{\sigma_0^2 + \sigma^2/m}\right)^{2} (\mu_0 - \mu)^2$$

$$\mathrm{var}(\hat\mu) = E\big[(\hat\mu - E[\hat\mu])^2\big] = \left(\frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/m}\right)^{2} \frac{\sigma^2}{m}$$

As $m \to \infty$, both go to zero.
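These closed-form bias and variance expressions can be checked by simulation. A minimal sketch, assuming NumPy; the particular values of μ, μ₀, σ², σ₀², and m are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, m = 1.5, 4.0, 10       # true mean, known variance, sample size
mu0, s0_2 = 0.0, 2.0               # prior mean and prior variance (hyperparameters)
trials = 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, m))
xbar = x.mean(axis=1)

w = s0_2 / (s0_2 + sigma2 / m)     # weight on the sample mean
mu_map = w * xbar + (1 - w) * mu0  # MAP estimate for each simulated dataset

# Empirical bias^2 and variance versus the closed-form expressions above.
bias2_emp = (mu_map.mean() - mu) ** 2
var_emp = mu_map.var()
bias2_thy = ((sigma2 / m) / (s0_2 + sigma2 / m)) ** 2 * (mu0 - mu) ** 2
var_thy = w ** 2 * sigma2 / m
print(f"bias^2: empirical={bias2_emp:.4f}  formula={bias2_thy:.4f}")
print(f"var   : empirical={var_emp:.4f}  formula={var_thy:.4f}")
```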


Bias-Variance Trade-Off and MAP Estimates

The curves look like this:

[Figure: bias², variance, and MSE of the MAP estimate plotted as functions of the prior variance σ₀².]

The MSE curve has its minimum at a non-zero value of σ₀². Specifically,

$$\sigma_{0,\mathrm{opt}}^2 = (\mu_0 - \mu)^2$$
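The same trade-off can be traced numerically: evaluate the closed-form bias², variance, and MSE on a grid of σ₀² values and locate the minimum, which should sit near (μ₀ − μ)². A minimal sketch assuming NumPy, with illustrative numbers:

```python
import numpy as np

mu, mu0, sigma2, m = 1.5, 0.0, 4.0, 10
s0_2 = np.linspace(1e-3, 20, 2000)     # grid of prior variances sigma_0^2

bias2 = ((sigma2 / m) / (s0_2 + sigma2 / m)) ** 2 * (mu0 - mu) ** 2
var = (s0_2 / (s0_2 + sigma2 / m)) ** 2 * sigma2 / m
mse = bias2 + var

# The minimizer should agree with (mu0 - mu)^2 = 2.25 for these numbers.
print("argmin of MSE over sigma_0^2:", s0_2[np.argmin(mse)])
print("(mu0 - mu)^2               :", (mu0 - mu) ** 2)
```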


MAP Estimates and Regularizers

The log of the posterior on the parameters is

$$\log p(\theta \mid D) = \log p(D \mid \theta) + \log p(\theta) - \log p(D)$$

where the log-likelihood plays the role of the "bare" cost and the log prior plays the role of the regularizer.

We saw that maximizing the data log-likelihood is equivalent to minimizing some nice cost function, e.g. the mean-squared error. Maximizing the log-posterior is equivalent to minimizing a regularized cost function. The effect of the regularizer is to reduce the parameter variance at the cost of adding parameter bias.


MAP Regression

One can use the MAP estimate of θ and construct the regression function

$$E[t \mid x, D] \approx f(x, \hat\theta)$$

where $\hat\theta$ is the value that maximizes the posterior $p(\theta \mid D)$.

One can also use this MAP value to estimate the target density

$$p(t \mid x, \hat\theta) = \frac{1}{\sqrt{2\pi\hat\sigma^2}}\, \exp\!\left(-\frac{1}{2\hat\sigma^2}\,\big(t - f(x, \hat\theta)\big)^2\right)$$


Example of MAP Regression – Ridge

Ridge regression uses a parameterized regressor f(x, θ), the familiar SSE cost function (a Gaussian likelihood for the targets), and a Gaussian prior on the parameters, typically centered at zero:

$$p(\theta) = \frac{1}{(2\pi\sigma_\theta^2)^{d/2}}\, \exp\!\left(-\frac{|\theta|^2}{2\sigma_\theta^2}\right)$$

The regularized cost function is thus

$$E(\theta) = \sum_{i=1}^{N} \big(t_i - f(x_i, \theta)\big)^2 + \frac{\sigma^2}{\sigma_\theta^2}\,|\theta|^2$$

Note that for linear regression, f(x, θ) is linear in θ, so E is quadratic in θ and the cost function can be minimized in closed form (just like MLE estimation for linear regression).
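To make the closed form concrete: for a linear regressor f(x, θ) = θᵀx the MAP/ridge estimate is θ̂ = (XᵀX + λI)⁻¹Xᵀt with λ = σ²/σ_θ². A minimal sketch, assuming NumPy; the synthetic data, noise variance, and prior variance are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
sigma2, s_theta2 = 0.25, 1.0                   # noise variance, prior variance
t = X @ theta_true + rng.normal(0, np.sqrt(sigma2), size=N)

lam = sigma2 / s_theta2                        # regularization strength from the prior
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
theta_mle = np.linalg.solve(X.T @ X, X.T @ t)  # unregularized least squares
print("MAP/ridge:", theta_map)
print("MLE/OLS  :", theta_mle)
```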

Bayesian Estimation

Let's continue. Suppose we have obtained the posterior on the parameters p(θ | D) and we wish to find the probability of a new data value x. A Bayesian says that you should calculate this from the Bayesian version of the distribution p(x):

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

A Bayesian computes the mean of any function f(x) as

$$E[f \mid D] = \int f(x)\, p(x \mid D)\, dx = \int\!\!\int f(x)\, p(x \mid \theta)\, p(\theta \mid D)\, d\theta\, dx$$
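When the posterior can be sampled, this predictive integral can be approximated by Monte Carlo: draw samples θ⁽ˢ⁾ from p(θ | D) and average p(x | θ⁽ˢ⁾). A minimal sketch for the Gaussian-mean model, where the posterior on μ is Gaussian in closed form and the exact predictive is available for comparison (assumes NumPy and SciPy; the data and hyperparameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma2, mu0, s0_2 = 1.0, 0.0, 4.0
data = rng.normal(2.0, np.sqrt(sigma2), size=15)
m, xbar = len(data), data.mean()

# Closed-form Gaussian posterior on the mean (conjugate model).
post_var = 1.0 / (m / sigma2 + 1.0 / s0_2)
post_mean = post_var * (m * xbar / sigma2 + mu0 / s0_2)

# Monte Carlo approximation of p(x_new | D) = integral of p(x_new | mu) p(mu | D) dmu.
x_new = 1.0
mu_samples = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
p_mc = norm.pdf(x_new, loc=mu_samples, scale=np.sqrt(sigma2)).mean()

# Exact predictive for this conjugate model is N(post_mean, sigma2 + post_var).
p_exact = norm.pdf(x_new, loc=post_mean, scale=np.sqrt(sigma2 + post_var))
print(f"Monte Carlo: {p_mc:.5f}   exact: {p_exact:.5f}")
```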


Bayesian and MAP Estimates
Relation to MAP Estimates: Suppose the posterior is sharply peaked about its maximum value (the MAP estimate). Write a series expansion of p(x | θ) about the maximum and substitute into the integral:

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta = \int \left[\, p(x \mid \hat\theta) + (\theta - \hat\theta)\,\frac{dp(x \mid \theta)}{d\theta}\bigg|_{\hat\theta} + \frac{1}{2}\,(\theta - \hat\theta)^2\,\frac{d^2 p(x \mid \theta)}{d\theta^2}\bigg|_{\hat\theta} + \cdots \right] p(\theta \mid D)\, d\theta$$

$$= p(x \mid \hat\theta) + \big(E[\theta \mid D] - \hat\theta\big)\,\frac{dp(x \mid \theta)}{d\theta}\bigg|_{\hat\theta} + \frac{1}{2}\, E\big[(\theta - \hat\theta)^2 \mid D\big]\,\frac{d^2 p(x \mid \theta)}{d\theta^2}\bigg|_{\hat\theta} + \cdots$$

Bayesian and MAP Estimates

Handwaving argument: As data increases, the posterior becomes more sharply peaked about the MAP value θ̂, the trailing terms become small, and the integral is approximately

$$p(x \mid D) \approx p(x \mid \hat\theta)$$


Recursive Bayesian Estimation

Back to Bayesian estimation of p(x | D):

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}\, d\theta$$

Denote the dataset with n points by $D^n = \{x_1, x_2, \ldots, x_n\}$, and its likelihood by

$$p(D^n \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)$$

Using the last expression, the posterior can be written

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta')\, p(\theta' \mid D^{n-1})\, d\theta'}$$


Recursive Bayesian Estimation

We have written the posterior density for the n-sample data set as

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta')\, p(\theta' \mid D^{n-1})\, d\theta'}$$

Starting with zero data, we take

$$p(\theta \mid D^0) \equiv p(\theta)$$

and generate the sequence $p(\theta \mid D^1), p(\theta \mid D^2), \ldots$, and thus incrementally refine our estimate of the posterior density as more and more data becomes available.
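For the Gaussian-mean model with known σ², each single-point update keeps the posterior Gaussian, so this recursion can be carried out exactly. A minimal sketch, assuming NumPy; the prior, noise variance, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                       # known likelihood variance
mean, var = 0.0, 4.0               # p(mu | D^0) = prior N(mean, var)

data = rng.normal(2.0, np.sqrt(sigma2), size=20)
for n, x_n in enumerate(data, start=1):
    # One conjugate update: p(mu | D^n) is proportional to p(x_n | mu) p(mu | D^{n-1})
    new_var = 1.0 / (1.0 / var + 1.0 / sigma2)
    mean = new_var * (mean / var + x_n / sigma2)
    var = new_var
    if n in (1, 5, 20):
        print(f"after n={n:2d} points: posterior mean={mean:.3f}, var={var:.3f}")
```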


How Does a Bayesian Do Regression?

Get a dataset

$$D = \{\, x_i, t_i \,\},\quad i = 1, \ldots, m$$

Choose a parameterized regression function f(x; θ) to fit to the data.

Choose a model distribution function for the targets, e.g. a Gaussian

$$p(t \mid x, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\frac{1}{2\sigma^2}\,\big(t - f(x; \theta)\big)^2\right)$$

Choose a prior distribution on the parameters p(θ, σ²).

Calculate the data likelihood and the posterior distribution of the parameters

$$p(\theta, \sigma^2 \mid D) = \frac{1}{p(D)}\, p(D \mid \theta, \sigma^2)\, p(\theta, \sigma^2)$$


How Does a Bayesian Do Regression?

Calculate the target density as a function of x by integrating over the posterior distribution of the parameters:

$$p(t \mid x, D) = \int\!\!\int p(t \mid x, \sigma^2, \theta)\, p(\sigma^2, \theta \mid D)\, d\sigma^2\, d\theta$$

From the distribution on t, we can calculate several quantities.

• The conditional mean E[t | x, D] (called the regressor), equal to

$$E[t \mid x, D] = \int t\, p(t \mid x, D)\, dt = \int\!\!\int\!\left(\int t\, p(t \mid x, \theta, \sigma^2)\, dt\right) p(\theta, \sigma^2 \mid D)\, d\theta\, d\sigma^2 = \int f(x; \theta)\, p(\theta \mid D)\, d\theta$$

for our Gaussian model (see the sketch after this list).

• The most likely value(s) of t, $\operatorname{argmax}_t\, p(t \mid x, D)$.

• The target variance var(t | x, D).
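For a linear regressor f(x; θ) = θᵀx with known noise variance σ² and a Gaussian prior on θ, these integrals are Gaussian and available in closed form. A minimal sketch of the posterior on θ and the predictive mean and variance at a new input, assuming NumPy; the data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma2, s_theta2 = 40, 2, 0.25, 2.0
X = rng.normal(size=(N, d))
theta_true = np.array([0.7, -1.2])
t = X @ theta_true + rng.normal(0, np.sqrt(sigma2), size=N)

# Gaussian posterior on theta: covariance S_N and mean m_N (conjugate result).
S_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / s_theta2)
m_N = S_N @ (X.T @ t) / sigma2

x_new = np.array([1.0, 2.0])
pred_mean = x_new @ m_N                    # E[t | x_new, D]
pred_var = sigma2 + x_new @ S_N @ x_new    # var(t | x_new, D)
print(f"predictive mean={pred_mean:.3f}, predictive var={pred_var:.3f}")
```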


Hyperparameters and Model Selection

Our prior on model parameters is itself a parameterized distribution. Recall for our Gaussian density model, we put a prior on the distribution of the mean

$$p(\mu \mid \mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\, \exp\!\left(-\frac{1}{2\sigma_0^2}\,(\mu - \mu_0)^2\right)$$

But how were the hyperparameters μ₀ and σ₀² chosen?


Hyperparameter Selection

• We could calculate the likelihood function for particular values of the hyperparameters,

$$p(D \mid \mu_0, \sigma_0^2) = \int p(D \mid \mu)\, p(\mu \mid \mu_0, \sigma_0^2)\, d\mu\,,$$

and choose the values of the hyperparameters that maximize it (see the sketch after this list).

• We could set up a hyperprior on the hyperparameters and choose maximum a posteriori values for the hyperparameters by maximizing

$$p(\mu_0, \sigma_0^2 \mid D) \propto p(D \mid \mu_0, \sigma_0^2)\, p(\mu_0, \sigma_0^2)$$

(but the hyperprior is going to have its own parameters …).

• Use some sort of empirical technique.
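For the Gaussian-mean model the first option is tractable: marginalizing μ makes the data jointly Gaussian, so p(D | μ₀, σ₀²) has a closed form and the hyperparameters can simply be grid-searched. A minimal sketch, assuming NumPy and SciPy; the data and the grid are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
sigma2 = 1.0
x = rng.normal(1.5, np.sqrt(sigma2), size=30)
m = len(x)

def log_evidence(mu0, s0_2):
    # Marginalizing mu: x ~ N(mu0 * 1, sigma2 * I + s0_2 * 1 1^T)
    cov = sigma2 * np.eye(m) + s0_2 * np.ones((m, m))
    return multivariate_normal.logpdf(x, mean=np.full(m, mu0), cov=cov)

grid_mu0 = np.linspace(-2, 4, 31)
grid_s0 = np.linspace(0.1, 5, 25)
scores = [(log_evidence(a, b), a, b) for a in grid_mu0 for b in grid_s0]
best_score, mu0_best, s0_best = max(scores)
print(f"evidence-maximizing hyperparameters: mu0={mu0_best:.2f}, sigma_0^2={s0_best:.2f}")
```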


Empirical Hyperparameter Selection
Using a ‘validation’ set and MAP estimates.

Divide the data into two pieces, development and evaluation:

[Development | Eval.]

Further divide the development set into fitting and validation:

[Fitting | Valid.]

Sweep the hyperparameters over a range

a<Λ
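A minimal sketch of this sweep for the ridge/MAP regressor from earlier: fit on the fitting split for each value of the regularization hyperparameter, score on the validation split, keep the best value, and report error on the evaluation split (assumes NumPy; the synthetic data, splits, and grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, d = 120, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
t = X @ theta_true + rng.normal(0, 0.5, size=N)

# Development/evaluation split, then fitting/validation split of development.
X_dev, t_dev, X_eval, t_eval = X[:100], t[:100], X[100:], t[100:]
X_fit, t_fit, X_val, t_val = X_dev[:80], t_dev[:80], X_dev[80:], t_dev[80:]

def ridge_fit(A, y, lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Sweep the hyperparameter over a range; keep the value with the lowest
# validation error, then check it on the held-out evaluation set.
val_err, lam_best = min(
    (np.mean((X_val @ ridge_fit(X_fit, t_fit, lam) - t_val) ** 2), lam)
    for lam in np.logspace(-3, 2, 20)
)
theta_best = ridge_fit(X_fit, t_fit, lam_best)
eval_err = np.mean((X_eval @ theta_best - t_eval) ** 2)
print(f"best lambda={lam_best:.3g}  validation MSE={val_err:.3f}  eval MSE={eval_err:.3f}")
```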