ANLY-601
Advanced Pattern Recognition
Spring 2018
L11 – MAP Estimates, Bayesian
Inference, Hyperparameter Choice
Continuing with Bayesian Methods
MAP Estimates, Bayesian Inference,
and Hyperparameter Choice
Why Use a MAP Estimate When They're Biased?
Consider the expected squared error of any
estimator:
\[
\mathrm{MSE} = E\big[(\hat\theta - \theta)^2\big]
= E\big[(\hat\theta - E[\hat\theta])^2\big] + \big(E[\hat\theta] - \theta\big)^2
= \mathrm{var}(\hat\theta) + \mathrm{bias}^2(\hat\theta)
\]
Bias isn't the only consideration; variance is also important. There's usually a trade-off: increasing the bias reduces the variance, and vice versa.
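For reference, the decomposition follows by adding and subtracting \(E[\hat\theta]\) inside the square; the cross term vanishes because \(E[\hat\theta - E[\hat\theta]] = 0\):
\[
\begin{aligned}
E\big[(\hat\theta-\theta)^2\big]
&= E\Big[\big((\hat\theta-E[\hat\theta]) + (E[\hat\theta]-\theta)\big)^2\Big]\\
&= E\big[(\hat\theta-E[\hat\theta])^2\big]
 + 2\,(E[\hat\theta]-\theta)\,E\big[\hat\theta-E[\hat\theta]\big]
 + (E[\hat\theta]-\theta)^2\\
&= \mathrm{var}(\hat\theta) + \mathrm{bias}^2(\hat\theta).
\end{aligned}
\]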
Bias-Variance Trade-Off
and MAP Estimates
Let’s go back to our MAP estimate of the mean for
Gaussian data:
\[
\hat\mu_{\mathrm{MAP}}
= \frac{m\sigma_0^2}{m\sigma_0^2+\sigma^2}\,\Big(\frac{1}{m}\sum_{i=1}^{m} x_i\Big)
+ \frac{\sigma^2}{m\sigma_0^2+\sigma^2}\,\mu_0
\]
where \(\sigma^2\) is the (known) data variance and \(\mu_0,\ \sigma_0^2\) are the mean and variance of the Gaussian prior on \(\mu\).
The bias and variance are (show these!)
\[
\mathrm{bias} = E[\hat\mu_{\mathrm{MAP}}] - \mu
= \frac{\sigma^2}{m\sigma_0^2+\sigma^2}\,(\mu_0-\mu),
\qquad
\mathrm{var} = E\big[(\hat\mu_{\mathrm{MAP}} - E[\hat\mu_{\mathrm{MAP}}])^2\big]
= \frac{m\,\sigma_0^4\,\sigma^2}{(m\sigma_0^2+\sigma^2)^2}.
\]
As \(m \to \infty\), both go to zero.
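A minimal Monte Carlo sketch to check these formulas numerically (parameter values here are illustrative, not from the slides):

import numpy as np

# Monte Carlo check of the bias/variance formulas for the MAP estimate
# of a Gaussian mean (illustrative values; not from the slides).
rng = np.random.default_rng(0)
mu_true, sigma2 = 1.0, 4.0      # true mean and known data variance
mu0, sigma0_2 = 0.0, 0.5        # prior mean and prior variance
m, trials = 10, 200_000         # sample size and number of repetitions

x = rng.normal(mu_true, np.sqrt(sigma2), size=(trials, m))
xbar = x.mean(axis=1)
w = m * sigma0_2 / (m * sigma0_2 + sigma2)        # shrinkage weight
mu_map = w * xbar + (1 - w) * mu0                 # MAP estimate per trial

print("empirical bias  :", mu_map.mean() - mu_true)
print("theoretical bias:", sigma2 / (m * sigma0_2 + sigma2) * (mu0 - mu_true))
print("empirical var   :", mu_map.var())
print("theoretical var :", m * sigma0_2**2 * sigma2 / (m * sigma0_2 + sigma2)**2)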
Bias-Variance Trade-Off
and MAP Estimates
The curves look like this
[Figure: curves of var, bias², and MSE of the MAP estimate plotted against the prior variance σ₀² (x-axis roughly 0 to 20, y-axis 0 to 2).]
The curve of MSE has its minimum at a non-zero value of \(\sigma_0^2\). Specifically,
\[
\sigma^2_{0,\mathrm{opt}} = (\mu_0 - \mu)^2 .
\]
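A short numerical sketch of this trade-off (purely illustrative values), sweeping the prior variance and locating the minimum of the MSE:

import numpy as np

# Sweep the prior variance and evaluate bias^2, variance, and MSE of the
# MAP mean estimate (illustrative values; not taken from the slides).
mu_true, mu0, sigma2, m = 1.0, 0.0, 4.0, 10
s0 = np.linspace(0.01, 20.0, 2000)          # candidate prior variances sigma_0^2

bias2 = (sigma2 / (m * s0 + sigma2))**2 * (mu0 - mu_true)**2
var = m * s0**2 * sigma2 / (m * s0 + sigma2)**2
mse = bias2 + var

print("sigma_0^2 minimizing MSE ~", s0[np.argmin(mse)])
print("(mu0 - mu)^2             =", (mu0 - mu_true)**2)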
MAP Estimates and Regularizers
The log of the posterior on the parameters is
\[
\log p(\theta \mid D) = \log p(D \mid \theta) + \log p(\theta) - \log p(D)
\]
Here \(\log p(D \mid \theta)\) is the log-likelihood (the bare cost) and \(\log p(\theta)\) is the log prior (the regularizer).
We saw that maximizing the data log-likelihood is equivalent
to minimizing some nice cost function — e.g. the mean-squared-error.
Maximizing the log-posterior is equivalent to minimizing a regularized
cost function. The effect of the regularizer is to reduce the
parameter variance at the cost of adding parameter bias.
MAP Regression
One can use the MAP estimate of \(\theta\), denoted \(\hat\theta\), and construct the regression function \(f(x;\hat\theta)\), where \(\hat\theta\) is the value that maximizes the posterior \(p(\theta \mid D)\).

One can also use this MAP value to estimate the target density
\[
p(t \mid x, \hat\theta, \hat\sigma^2)
= \frac{1}{\sqrt{2\pi\hat\sigma^2}}
  \exp\!\Big(-\frac{1}{2\hat\sigma^2}\big(t - f(x;\hat\theta)\big)^2\Big),
\qquad
E[\,t \mid x, D\,] = f(x;\hat\theta).
\]
Example of MAP Regression – Ridge
Ridge regression uses a parameterized regressor \(f(x,\theta)\), the familiar SSE cost function (Gaussian likelihood for the targets), and a Gaussian prior on the parameters, typically centered at zero:
\[
p(\theta) = \frac{1}{\sqrt{2\pi\sigma_\theta^2}}
\exp\!\Big(-\frac{\theta^{T}\theta}{2\sigma_\theta^2}\Big).
\]
The regularized cost function is thus
\[
E(\theta) = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(t_i - f(x_i,\theta)\big)^2
+ \frac{1}{2\sigma_\theta^2}\,\theta^{T}\theta .
\]
Note that for linear regression \(f(x,\theta)\) is linear in \(\theta\), so \(E\) is quadratic in \(\theta\) and the cost function can be minimized in closed form (just like MLE estimation for linear regression).
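A minimal sketch of the closed-form solution for the linear case (the data and the regularization strength below are illustrative, not from the slides; the strength corresponds to σ²/σ_θ²):

import numpy as np

# Closed-form ridge regression for f(x, theta) = Phi(x) @ theta
# (illustrative sketch of the quadratic cost being minimized exactly).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
t = 2.0 * x - 1.0 + rng.normal(0, 0.3, size=50)      # noisy linear targets

Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
lam = 0.1                                            # regularization strength

# Minimize ||t - Phi theta||^2 + lam * theta^T theta in closed form.
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                              Phi.T @ t)
print("ridge estimate of [intercept, slope]:", theta_ridge)

Setting lam = 0 recovers the ordinary least-squares (MLE) solution.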
Bayesian Estimation
Let's continue. Suppose we have obtained the posterior on the parameters \(p(\theta \mid D)\) and we wish to find the probability of a new data value \(x\). A Bayesian says that you should calculate this from the Bayesian version of the distribution \(p(x)\):
\[
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta .
\]
A Bayesian computes the mean of any function \(f(x)\) as
\[
E[\,f \mid D\,] = \int f(x)\, p(x \mid D)\, dx
= \int\!\!\int f(x)\, p(x \mid \theta)\, p(\theta \mid D)\, d\theta\, dx .
\]
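A small numerical sketch of the predictive integral for the Gaussian-mean example, using grid integration over the parameter (all names and values are illustrative):

import numpy as np

# Numerically evaluate p(x|D) = integral of p(x|mu) p(mu|D) dmu for
# Gaussian data with a Gaussian prior on the mean (illustrative sketch).
rng = np.random.default_rng(2)
sigma2, mu0, sigma0_2 = 1.0, 0.0, 2.0
data = rng.normal(1.0, np.sqrt(sigma2), size=20)

mu_grid = np.linspace(-5, 5, 4001)
dmu = mu_grid[1] - mu_grid[0]

def gauss(z, mean, var):
    return np.exp(-(z - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Unnormalized posterior p(mu|D) ~ p(D|mu) p(mu), normalized on the grid.
log_like = np.array([np.sum(np.log(gauss(data, mu, sigma2))) for mu in mu_grid])
post = np.exp(log_like) * gauss(mu_grid, mu0, sigma0_2)
post /= post.sum() * dmu

# Predictive density at a new point x_new.
x_new = 0.5
p_x_new = np.sum(gauss(x_new, mu_grid, sigma2) * post) * dmu
print("p(x_new | D) ~", p_x_new)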
Bayesian and MAP Estimates
Relation to MAP estimates: Suppose the posterior is sharply peaked about its maximum value (the MAP estimate). Write a series expansion of \(p(x \mid \theta)\) about the maximum and substitute into the integral:
\[
\begin{aligned}
p(x \mid D) &= \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta \\
&= \int \Big[\, p(x \mid \hat\theta)
+ (\theta-\hat\theta)\,\frac{dp(x \mid \theta)}{d\theta}\Big|_{\hat\theta}
+ \frac{1}{2}(\theta-\hat\theta)^2\,\frac{d^2 p(x \mid \theta)}{d\theta^2}\Big|_{\hat\theta}
+ \dots \Big]\, p(\theta \mid D)\, d\theta \\
&= p(x \mid \hat\theta)
+ \frac{dp(x \mid \theta)}{d\theta}\Big|_{\hat\theta}\, E\big[(\theta-\hat\theta) \mid D\big]
+ \frac{1}{2}\,\frac{d^2 p(x \mid \theta)}{d\theta^2}\Big|_{\hat\theta}\, E\big[(\theta-\hat\theta)^2 \mid D\big]
+ \dots
\end{aligned}
\]
Bayesian and MAP Estimates
Handwaving argument: as the amount of data increases, the posterior becomes more sharply peaked about the MAP value \(\hat\theta\), the trailing terms become small, and the integral is approximately
\[
p(x \mid D) \approx p(x \mid \hat\theta).
\]
Recursive Bayesian Estimation
Back to Bayesian estimation of \(p(x \mid D)\):
\[
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta ,
\qquad
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'} .
\]
Denote the dataset with \(n\) points by \(D^n = \{x_1, x_2, \dots, x_n\}\), and its likelihood by
\[
p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)
= \prod_{k=1}^{n} p(x_k \mid \theta).
\]
Using the last expression, the posterior can be written
\[
p(\theta \mid D^n)
= \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}
       {\int p(x_n \mid \theta')\, p(\theta' \mid D^{n-1})\, d\theta'} .
\]
Recursive Bayesian Estimation
We have written the posterior density for the \(n\)-sample data set as
\[
p(\theta \mid D^n)
= \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}
       {\int p(x_n \mid \theta')\, p(\theta' \mid D^{n-1})\, d\theta'} .
\]
Starting with zero data, we take
\[
p(\theta \mid D^0) = p(\theta)
\]
and generate the sequence
\[
p(\theta \mid D^1),\ p(\theta \mid D^2),\ \dots
\]
and thus incrementally refine our estimate of the posterior density as more and more data becomes available.
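A minimal sketch of this recursion for a Gaussian mean with a Gaussian prior, where each update has a closed form (values are illustrative):

import numpy as np

# Recursive Bayesian update of p(mu | D^n) for Gaussian data with known
# variance and a Gaussian prior on mu (illustrative sketch).  With
# conjugacy, each posterior is Gaussian, so we only track mean/variance.
rng = np.random.default_rng(3)
sigma2 = 1.0                      # known data variance
post_mean, post_var = 0.0, 2.0    # prior p(mu | D^0) = N(mu0, sigma0^2)

for x_n in rng.normal(1.0, np.sqrt(sigma2), size=25):
    # Multiply current posterior by p(x_n | mu) and renormalize.
    post_var_new = 1.0 / (1.0 / post_var + 1.0 / sigma2)
    post_mean = post_var_new * (post_mean / post_var + x_n / sigma2)
    post_var = post_var_new

print("posterior mean after 25 points:", post_mean)
print("posterior variance after 25 points:", post_var)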
How Does a Bayesian do Regression?
Get a dataset
\[
D = \{\,x_i, t_i\,\}, \quad i = 1, \dots, m .
\]
Choose a parameterized regression function \(f(x;\theta)\) to fit to the data.
Choose a model distribution function for the targets, e.g. a Gaussian
\[
p(t \mid x, \theta, \sigma^2)
= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\Big(-\frac{1}{2\sigma^2}\big(t - f(x;\theta)\big)^2\Big).
\]
Choose a prior distribution on the parameters \(p(\theta, \sigma^2)\).
Calculate the data likelihood and the posterior distribution of the parameters
\[
p(\theta, \sigma^2 \mid D) = \frac{1}{p(D)}\, p(D \mid \theta, \sigma^2)\, p(\theta, \sigma^2).
\]
How Does a Bayesian Do Regression?
Calculate the target density as a function of \(x\) by integrating over the posterior distribution of the parameters
\[
p(t \mid x, D) = \int p(t \mid x, \theta, \sigma^2)\, p(\theta, \sigma^2 \mid D)\, d\theta\, d\sigma^2 .
\]
From the distribution on \(t\), we can calculate several quantities.
• The conditional mean \(E[\,t \mid x, D\,]\) (called the regressor, and equal to
\[
E[\,t \mid x, D\,] = \int t\, p(t \mid x, D)\, dt
= \int\!\!\int f(x;\theta)\, p(\theta, \sigma^2 \mid D)\, d\theta\, d\sigma^2
\]
for our Gaussian model).
• The most likely value(s) of \(t\): \(\arg\max_t\, p(t \mid x, D)\).
• The target variance \(\mathrm{var}(t \mid x, D)\).
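A short sketch of this pipeline for Bayesian linear regression with known noise variance, where the posterior on the weights and the predictive distribution are Gaussian in closed form (all names and values are illustrative):

import numpy as np

# Bayesian linear regression with known noise variance sigma^2 and a
# zero-mean Gaussian prior on theta (illustrative sketch).
rng = np.random.default_rng(4)
sigma2, prior_var = 0.25, 10.0
x = rng.uniform(-1, 1, size=30)
t = 0.5 + 1.5 * x + rng.normal(0, np.sqrt(sigma2), size=30)

Phi = np.column_stack([np.ones_like(x), x])              # features [1, x]
S_inv = np.eye(2) / prior_var + Phi.T @ Phi / sigma2     # posterior precision
S = np.linalg.inv(S_inv)                                 # posterior covariance
m_post = S @ Phi.T @ t / sigma2                          # posterior mean

# Predictive mean and variance of t at a new input x_new.
x_new = np.array([1.0, 0.3])                             # feature vector [1, x]
pred_mean = x_new @ m_post
pred_var = sigma2 + x_new @ S @ x_new
print("E[t | x, D]   =", pred_mean)
print("var(t | x, D) =", pred_var)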
Hyperparameters and Model Selection
Our prior on model parameters is itself a parameterized
distribution. Recall for our Gaussian density model, we
put a prior on the distribution of the mean
\[
p(\mu \mid \mu_0, \sigma_0^2)
= \frac{1}{\sqrt{2\pi\sigma_0^2}}
  \exp\!\Big(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\Big).
\]
But how were the hyperparameters \(\mu_0, \sigma_0^2\) chosen?
Hyperparameter Selection
• We could calculate the likelihood function for particular hyperparameter values,
\[
p(D \mid \mu_0, \sigma_0^2) = \int p(D \mid \mu)\, p(\mu \mid \mu_0, \sigma_0^2)\, d\mu ,
\]
and choose the values of the hyperparameters that maximize it.
• We could set up a hyperprior on the hyperparameters and choose maximum a posteriori values for the hyperparameters by maximizing
\[
p(\mu_0, \sigma_0^2 \mid D) \propto p(D \mid \mu_0, \sigma_0^2)\, p(\mu_0, \sigma_0^2)
\]
(but the hyperprior is going to have its own parameters …).
• Use some sort of empirical technique.
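A small sketch of the first option for the Gaussian-mean model, maximizing the marginal likelihood of the hyperparameters over a grid (values are illustrative; grid integration is used to stay generic even though closed forms exist):

import numpy as np

# Empirical (maximum marginal likelihood) choice of (mu0, sigma0^2) by
# maximizing p(D | mu0, sigma0^2) = integral p(D|mu) p(mu|mu0, sigma0^2) dmu.
rng = np.random.default_rng(5)
sigma2 = 1.0
data = rng.normal(1.5, np.sqrt(sigma2), size=15)

mu_grid = np.linspace(-5, 8, 2001)
dmu = mu_grid[1] - mu_grid[0]
log_like_mu = np.array([-0.5 * np.sum((data - mu)**2) / sigma2 for mu in mu_grid])
log_like_mu -= 0.5 * len(data) * np.log(2 * np.pi * sigma2)

best, best_val = None, -np.inf
for mu0 in np.linspace(-2, 4, 25):
    for s0 in np.linspace(0.05, 5, 25):
        prior = np.exp(-(mu_grid - mu0)**2 / (2 * s0)) / np.sqrt(2 * np.pi * s0)
        marginal = np.sum(np.exp(log_like_mu) * prior) * dmu
        if marginal > best_val:
            best, best_val = (mu0, s0), marginal

print("hyperparameters maximizing the marginal likelihood:", best)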
Empirical Hyperparameter Selection
Using a ‘validation’ set and MAP estimates.
Divide data into two pieces, development and evaluation:
[ Development | Eval. ]
Further divide the development set into fitting and validation:
[ Fitting | Valid. ]
Sweep the hyperparameters over a range
a<Λ