ANLY-601
Advanced Pattern Recognition
Spring 2018
L10 – Parameter Estimation
Parameter Estimation
Maximum Likelihood and Bayes Estimates
The following lectures expand on our earlier discussion of
parameter estimates, introducing some formal grounding.
(A good supplemental source for this is Chapter 2 of Neural Networks for Pattern Recognition by Chris Bishop.)
We’ll discuss parametric density estimates in more detail,
including mixture models for densities.
Parametric Density Models
A model of a specific functional form, with a small number of parameters estimated from the data.
e.g. the Normal distribution

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Data: $D = \{x_1, x_2, x_3, \ldots, x_m\}$

Parameter estimates:

$$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \hat{\mu}\right)^2$$

but where did these forms for the estimators come from?
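As a quick numerical illustration (a minimal NumPy sketch on synthetic data, not part of the original slides), these estimators are just the sample mean and the $1/m$ sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic D = {x_1, ..., x_m}

mu_hat = x.mean()                        # sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # variance with a 1/m (not 1/(m-1)) factor

print(mu_hat, sigma2_hat)  # close to 2.0 and 1.5**2 = 2.25
```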
Maximum Likelihood Estimation
Question: what's the probability that the dataset D occurs, given the form of the model density?
We assume each of the $x_i$ is sampled independently from the underlying (Normal, in this example) distribution; then
[Figure: data histogram with the model density overlaid; x axis from -10 to 10.]

If the data (histogram) and model look like this, the data will have low probability given the model (the data sit at the tails of the model distribution).
$$p(D \mid \mu, \sigma^2) = p(x_1, \ldots, x_m \mid \mu, \sigma^2) = \prod_{i=1}^{m} p(x_i \mid \mu, \sigma^2)$$
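To make this concrete, here is a hedged sketch (synthetic data; the function name is mine) that evaluates the log of this product. Summing logs avoids the numerical underflow that multiplying m small densities would cause:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma2):
    """Log of prod_i p(x_i | mu, sigma2) for a Normal model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)

print(gaussian_log_likelihood(x, mu=0.0, sigma2=1.0))  # well-placed model
print(gaussian_log_likelihood(x, mu=5.0, sigma2=1.0))  # data in the tails: much lower
```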
Maximum Likelihood Estimation
[Figure: the same data histogram, with the model mean shifted toward the data.]
We can adjust the model mean
so that the data has higher
probability under the model.
But this model still has low tails
where there’s plenty of data,
because the peak is too sharp.
Maximum Likelihood Estimation
[Figure: data histogram with the maximum-likelihood Gaussian fit overlaid.]
Here’s a Gaussian model that
chooses the mean and variance
so as to maximize the data
likelihood under the model.
Maximum Likelihood Estimation
So, we adjust the model parameters to maximize the data likelihood.
Since the log is monotonic in its argument, and we often deal with model distributions from the exponential family, it's convenient to maximize the log-likelihood.
$$L = \ln p(D \mid \mu, \sigma^2) = \sum_{i=1}^{m} \ln p(x_i \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(x_i - \mu\right)^2 - \frac{m}{2}\ln\left(2\pi\sigma^2\right)$$

Setting the derivatives to zero gives the estimators:

$$\frac{\partial L}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

$$\frac{\partial L}{\partial \sigma^2} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \hat{\mu}\right)^2$$

Note that $\hat{\sigma}^2$ is biased: $E[\hat{\sigma}^2] = \frac{m-1}{m}\,\sigma^2$.
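The bias is easy to see empirically. The sketch below (my own, with an assumed m = 10) averages $\hat{\sigma}^2$ over many small datasets and lands near $\frac{m-1}{m}\sigma^2$ rather than $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma2 = 10, 1.0

trials = 20000
est = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), size=m)
    est[t] = np.mean((x - x.mean()) ** 2)  # ML variance estimate (1/m factor)

print(est.mean())            # approx 0.9
print((m - 1) / m * sigma2)  # = 0.9
```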
Data Distributions and Cost Functions
Regression – Minimizing mean square error between the data and a
regression curve is equivalent to maximizing the data likelihood
under the assumption that the fitting error is Gaussian.
The data are the sequence of $(x, y)$ coordinates. The data $y$ values are assumed Gaussian distributed with mean $g(x)$. That is,

$$y = g(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
The data likelihood is
$$p(\{y_i\} \mid \{x_i\};\, g(x), \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y_i - g(x_i)\right)^2\right)$$
Data Distributions and Cost Functions
Regression
Maximizing the data log-likelihood $L$ with respect to $g(x)$,

$$L = \log p(\{y_i\} \mid \{x_i\};\, g(x), \sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y_i - g(x_i)\right)^2 - \frac{m}{2}\log\left(2\pi\sigma^2\right)$$

is equivalent to minimizing the sum-squared fitting error

$$E = \frac{1}{2}\sum_{i=1}^{m}\left(y_i - g(x_i)\right)^2$$

with respect to $g(x)$.
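For a concrete instance, the sketch below (mine, assuming a linear g(x) and synthetic Gaussian noise) fits g by ordinary least squares, which under the Gaussian-noise assumption is exactly the ML fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 200)
y = 3.0 * x - 1.0 + rng.normal(0.0, 0.5, size=x.size)  # y = g(x) + eps

# Minimizing E = (1/2) sum (y_i - g(x_i))^2 for linear g(x) = w*x + b
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # close to 3.0 and -1.0
```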
Data Distributions and Cost Functions
Classification
For a (two-class) classification problem, it's natural to write the data likelihood as a product of Bernoulli distributions (since the target values are $y = 0$ or $1$ for each example)

$$L = \prod_{i=1}^{m} p(y_i \mid x_i) = \prod_{i=1}^{m} \alpha(x_i)^{y_i}\left(1 - \alpha(x_i)\right)^{1 - y_i}$$

where $\alpha(x)$ is the probability that, for the feature vector $x$, the class label is 1 (rather than 0).

Maximizing this data likelihood is equivalent to minimizing its $-\log$, the cross-entropy error

$$E = -\sum_{i=1}^{m}\left[y_i \log \alpha(x_i) + (1 - y_i)\log\left(1 - \alpha(x_i)\right)\right]$$
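A small sketch of the cross-entropy error (my own helper, with a clip added for numerical safety): confident, mostly correct predictions give a lower E than uninformative ones:

```python
import numpy as np

def cross_entropy(y, alpha, eps=1e-12):
    """E = -sum[y log(alpha) + (1 - y) log(1 - alpha)]."""
    alpha = np.clip(alpha, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(alpha) + (1.0 - y) * np.log(1.0 - alpha))

y = np.array([1, 0, 1, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # mostly right
uninformed = np.array([0.5, 0.5, 0.5, 0.5, 0.5])  # coin-flip predictions
print(cross_entropy(y, confident), cross_entropy(y, uninformed))
```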
Bayesian Estimation and Parameter Posterior Distributions
Maximum likelihood estimation assumes there exists an actual value of the parameters, $\Theta_0$, which we estimate by maximizing the probability of the data conditioned on the parameters:

$$\hat{\Theta}_0 = \underset{\Theta}{\arg\max}\; p(D \mid \Theta)$$

An arguably more natural approach is to find the most probable values of the parameters, conditioned on the available data. Thus parameters are regarded as random variables with their own distribution, the posterior distribution

$$p(\Theta \mid D) = \frac{p(D \mid \Theta)\, P(\Theta)}{p(D)}$$

where $P(\Theta)$ is the prior on $\Theta$.
Maximum A Posteriori Estimation
Maximizing the log of the posterior with respect to the parameters gives the maximum a posteriori (MAP) estimate

$$\hat{\Theta} = \underset{\Theta}{\arg\max}\;\left[\log p(D \mid \Theta) + \log P(\Theta)\right]$$

The prior distribution $P(\Theta)$ must be chosen. Notice that if the prior is independent of $\Theta$ (a flat prior), the MAP and maximum likelihood estimates are the same.

It is convenient to choose the prior distribution $P(\Theta)$ so that the consequent posterior $p(\Theta \mid D)$ has the same functional form as $P(\Theta)$. The proper form depends, of course, on the form of the likelihood function $p(D \mid \Theta)$. Such priors are called conjugate priors.
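The sketch below (mine; the Gaussian prior and grid search are illustrative choices, not from the slides) compares the ML and MAP estimates of a Gaussian mean with few samples; the prior pulls the MAP estimate toward $\mu_0$:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=5)  # few samples, so the prior matters
sigma2 = 1.0                       # known data variance
mu0, sigma2_l = 0.0, 0.25          # Gaussian prior N(mu0, sigma2_l) on mu

mu_grid = np.linspace(-3.0, 3.0, 2001)
loglik = np.array([-np.sum((x - mu) ** 2) / (2 * sigma2) for mu in mu_grid])
logprior = -(mu_grid - mu0) ** 2 / (2 * sigma2_l)

print(mu_grid[np.argmax(loglik)])             # ML (flat prior)
print(mu_grid[np.argmax(loglik + logprior)])  # MAP: shrunk toward mu0
```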
Example: Posterior Distribution of the Gaussian Model's Mean
Suppose the data is Gaussian, and the variance is known, but
we just want to estimate the mean. The conjugate prior for this is
a Gaussian. The posterior on the mean is
$$p(\mu \mid D, \sigma^2) = \frac{p(D \mid \mu, \sigma^2)\, p(\mu)}{p(D)} \propto \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(x_i - \mu\right)^2 - \frac{1}{2\sigma_l^2}\left(\mu - \mu_0\right)^2\right]$$

where $\sigma_l^2$ and $\mu_0$ are the variance and mean of the prior distribution on $\mu$.
Example: Posterior Distribution of the Gaussian Model's Mean
After some algebraic manipulation, we can rewrite the posterior distribution as

$$p(\mu \mid D, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left[-\frac{1}{2\sigma_m^2}\left(\mu - \mu_m\right)^2\right]$$

with the posterior mean and variance given by (show this!)

$$\mu_m = \frac{\sigma_l^2}{m\sigma_l^2 + \sigma^2} \sum_{i=1}^{m} x_i + \frac{\sigma^2}{m\sigma_l^2 + \sigma^2}\,\mu_0, \qquad \sigma_m^2 = \frac{\sigma_l^2\,\sigma^2}{m\sigma_l^2 + \sigma^2}$$
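As a sanity check (a sketch of my own; the function name is mine), the update below implements these formulas, recovers the prior at m = 0, and sharpens around the sample mean as m grows:

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma2_l):
    """Posterior N(mu_m, sigma2_m) on the mean, given known data
    variance sigma2 and a Gaussian prior N(mu0, sigma2_l)."""
    m = x.size
    denom = m * sigma2_l + sigma2
    mu_m = (sigma2_l / denom) * x.sum() + (sigma2 / denom) * mu0
    sigma2_m = sigma2_l * sigma2 / denom
    return mu_m, sigma2_m

rng = np.random.default_rng(4)
for m in (0, 10, 1000):  # "no data", "some data", "lots of data"
    x = rng.normal(2.0, 1.0, size=m)
    print(m, gaussian_mean_posterior(x, sigma2=1.0, mu0=0.0, sigma2_l=1.0))
```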
Example: Posterior Distribution of the Gaussian Model's Mean
Note that for $m \gg \sigma^2 / \sigma_l^2$, the posterior mean approaches the sample mean (the ML estimate), and the posterior variance becomes small.
Example: Posterior Distribution of
the Gaussian Model’s Mean
Without data ($m = 0$), the posterior is just the original prior on $\mu$. As we add samples, the posterior remains Gaussian (that's the point of a conjugate prior), but its mean and variance change in response to the data.
[Figure: the posterior $p(\mu \mid D, \sigma^2)$ with no data, some data, and lots of data; the curve sharpens and shifts toward the sample mean as data accumulate.]