ANLY-601
Advanced Pattern Recognition
Spring 2018
L8 – Criterion Functions and Bayes Optimality
Criterion Functions and Optimization
Need another tool – Calculus of Variations
We will be faced with the task of minimizing some integrals
with respect to functions that occur in the integrand. Thus
we’re doing minimization over function spaces.
The mathematical tool for this is calculus of variations.
Here’s the motivating problem.
What’s the curve of shortest distance between two points in a
plane?
Calculus of Variations
Write the curve as y(x). The length of the curve is

$$S[y] = \int_{x_a}^{x_b} \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\, dx = \int_{x_a}^{x_b} \sqrt{1 + y'^2}\; dx\,.$$

Next, vary the curve to $y(x) + \lambda g(x)$, where $\lambda$ is small.
The perturbation g(x) is arbitrary except that g(a) = g(b) = 0, so both $y(x)$ and $y(x) + \lambda g(x)$ pass through the points a, b.
If y(x) is the curve of minimum length, then $S[y + \lambda g]$ should be first-order stationary in $\lambda$. (What does that mean?)

[Figure: the curve y(x) and its perturbation y(x) + λg(x), both joining the endpoints a and b in the x–y plane]
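To make “first-order stationary” precise, expand S in the small parameter $\lambda$:

$$S[y + \lambda g] = S[y] + \lambda \left.\frac{d\,S[y+\lambda g]}{d\lambda}\right|_{\lambda=0} + O(\lambda^2)\,,$$

stationarity to first order means the term linear in $\lambda$ vanishes for every admissible perturbation g.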
Calculus of Variations
If y(x) is the curve of minimum length, then $S[y+\lambda g]$ (regarded as a function of $\lambda$) should have zero derivative at $\lambda = 0$:

$$\left.\frac{d\,S[y+\lambda g]}{d\lambda}\right|_{\lambda=0} = \left.\frac{d}{d\lambda}\int_{x_a}^{x_b} \sqrt{1 + \left(y' + \lambda\,\frac{dg}{dx}\right)^{2}}\; dx\,\right|_{\lambda=0} = 0\,.$$

Evaluate this (using integration by parts to take derivatives off g(x)):

$$\left.\frac{d\,S[y+\lambda g]}{d\lambda}\right|_{\lambda=0}
= \left.\int_{x_a}^{x_b} \frac{\left(y' + \lambda\,\frac{dg}{dx}\right)\frac{dg}{dx}}{\sqrt{1+\left(y'+\lambda\,\frac{dg}{dx}\right)^{2}}}\; dx\,\right|_{\lambda=0}
= \int_{x_a}^{x_b} \frac{y'\,\frac{dg}{dx}}{\sqrt{1+y'^2}}\; dx$$

$$= \left[\frac{y'\, g}{\sqrt{1+y'^2}}\right]_{x_a}^{x_b} - \int_{x_a}^{x_b} g(x)\,\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1+y'^2}}\right) dx
= -\int_{x_a}^{x_b} g(x)\,\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1+y'^2}}\right) dx\,,$$

where the boundary term vanishes because g(a) = g(b) = 0.
Calc. of Variations
We have

$$\left.\frac{d\,S[y+\lambda g]}{d\lambda}\right|_{\lambda=0}
= \int_{x_a}^{x_b} \frac{y'\,\frac{dg}{dx}}{\sqrt{1+y'^2}}\; dx
= \left[\frac{y'\, g}{\sqrt{1+y'^2}}\right]_{x_a}^{x_b} - \int_{x_a}^{x_b} g(x)\,\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1+y'^2}}\right) dx$$

$$= -\int_{x_a}^{x_b} g(x)\left\{\frac{y''}{\sqrt{1+y'^2}} - \frac{y'^2\, y''}{(1+y'^2)^{3/2}}\right\} dx = 0\,.$$

Since g(x) is arbitrary, if the last integral is to vanish, the factor in curly braces must vanish at all x.
Calc. of Variations
We have to find y that satisfies

$$\int_{x_a}^{x_b} g(x)\left\{\frac{y''}{\sqrt{1+y'^2}} - \frac{y'^2\, y''}{(1+y'^2)^{3/2}}\right\} dx = 0\,.$$

Two solutions:

$$y'(x) = 0 \;\Rightarrow\; y(x) = c\,,\ \text{a constant}$$
$$y''(x) = 0 \;\Rightarrow\; y(x) = c + m\,x$$

The first is too restrictive (it can’t pass through arbitrary endpoints). So the curve with extremum length is the straight line $y(x) = c + m\,x$.

[Figure: the straight line y(x) = c + m x joining the endpoints a and b in the x–y plane]
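A quick numerical sanity check of this result (a minimal sketch; the endpoints, the particular line, and the perturbation sin(πx) are arbitrary choices for illustration):

import numpy as np

# Polyline approximation of the arc length S[y] = integral of sqrt(1 + y'^2) dx
def arc_length(y, x):
    return np.sum(np.sqrt(np.diff(x)**2 + np.diff(y)**2))

x = np.linspace(0.0, 1.0, 2001)
line = 1.0 + 2.0 * x               # straight line through (0, 1) and (1, 3)
g = np.sin(np.pi * x)              # perturbation with g(a) = g(b) = 0

for lam in [0.0, 0.05, 0.1, 0.2]:
    print(f"lambda = {lam:4.2f}   length = {arc_length(line + lam * g, x):.6f}")
# The length is smallest at lambda = 0 and grows roughly quadratically in lambda,
# consistent with the straight line being first-order stationary (indeed minimal).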
Calc. of Variations
Along the way, we’ve implicitly defined the functional derivative of S. Let S[y] be a functional on y(x) (a map from functions y(x) to real numbers). The functional derivative of S with respect to y(x), denoted $\frac{\delta S}{\delta y(x)}$, is defined by

$$\left.\frac{d}{d\lambda}\, S[\, y(x) + \lambda\, g(x)\,]\right|_{\lambda=0} = \int \frac{\delta S}{\delta y(x)}\; g(x)\; dx\,.$$

To find the extremum of the functional S[y], set the functional derivative equal to zero:

$$\frac{\delta S}{\delta y(x)} = 0\,.$$

If you have a calculus of variations problem and are in doubt, the apparatus above is the place to turn; it often disambiguates calculations.
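For example, reading off the functional derivative of the arc-length functional from the derivation above:

$$\frac{\delta S}{\delta y(x)} = -\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1+y'^2}}\right) = -\frac{y''}{(1+y'^2)^{3/2}}\,,$$

and setting it to zero recovers $y'' = 0$, the straight line.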
Example Statistics Application
Minimum Mean Squared Error Regression
In regression we often ask for functions that minimize mean square error
(MSE) (criterion function!) between data and the model.
Let’s look at regression probabilistically. We have a bunch of data pairs
(x,y) and we want to fit a function (any function) h(x) to them that minimizes
the MSE.
What we really want to do is build a regression function that minimizes the
MSE over the joint distribution p(x,y) = p(y|x) p(x).
Mathematically, the MSE over the joint density is
$$\mathcal{E} = \int \big(y - h(x)\big)^2\, p(y|x)\, p(x)\; dy\, dx\,.$$

[Figure: scatter of data pairs (x, y) with the fitted regression curve h(x)]
Minimum MSE Regression
The variation with respect to h(x) is

$$\left.\frac{d\,\mathcal{E}[h+\lambda g]}{d\lambda}\right|_{\lambda=0}
= \left.\frac{d}{d\lambda}\int\!\!\int \big(y - h(x) - \lambda g(x)\big)^2\, p(y|x)\, p(x)\; dy\, dx\,\right|_{\lambda=0}$$

$$= \left.-2\int\!\!\int \big(y - h(x) - \lambda g(x)\big)\, g(x)\, p(y|x)\, p(x)\; dy\, dx\,\right|_{\lambda=0}$$

$$= -2\int\left[\int \big(y - h(x)\big)\, p(y|x)\; dy\right] g(x)\, p(x)\; dx
= -2\int \big\{\, E[y|x] - h(x) \,\big\}\, g(x)\, p(x)\; dx\,.$$

Since g(x) is arbitrary, if the variation is to vanish, the quantity in { } must vanish identically at all x.

Thus, the best possible regressor is the function h(x) that equals the conditional mean of y at each point x:

$$h(x) = E[\, y \,|\, x\,] = \int y\; p(y|x)\; dy\,.$$
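A small simulation of this result (a sketch; the joint density is an arbitrary synthetic choice, y = x² plus Gaussian noise, so E[y|x] = x²):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200_000)
y = x**2 + rng.normal(0.0, 0.3, x.size)      # E[y|x] = x^2, noise variance 0.09

def mse(h):
    # empirical version of E = integral of (y - h(x))^2 p(y|x) p(x) dy dx
    return np.mean((y - h(x))**2)

print(mse(lambda t: t**2))         # conditional mean: about 0.09 (pure noise)
print(mse(lambda t: t**2 + 0.1))   # shifted:          about 0.10
print(mse(lambda t: np.abs(t)))    # any other h(x):   larger still
# No h(x) does better than the conditional mean h(x) = E[y|x].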
Back to Classification — Cross-Entropy Criterion
An often-used criterion function in classifier design is the cross-entropy. Take target values

$$\tau(x) = \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

and a discriminant function that satisfies $0 \le h(x) \le 1$. For example, we could let h be a logistic function with argument u(x) (an unknown):

$$h(x) = \frac{1}{1 + \exp\big(-u(x)\big)}\,.$$

[Figure: the logistic function h plotted against its argument u]

The cross-entropy criterion is

$$\mathcal{E}_{CE} = -\, E\Big\langle\, \tau(x)\, \ln h(x) + \big(1 - \tau(x)\big)\, \ln\big(1 - h(x)\big) \,\Big\rangle\,.$$
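As a concrete (hypothetical) instance, here is the empirical cross-entropy for a logistic discriminant; the linear form u(x) = w·x + b is just one convenient stand-in for the unknown u(x):

import numpy as np

def h(x, w, b):
    # logistic discriminant, h(x) = 1 / (1 + exp(-u(x))) with u(x) = w*x + b
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def cross_entropy(tau, hx, eps=1e-12):
    # empirical E_CE = -< tau ln h + (1 - tau) ln(1 - h) >
    hx = np.clip(hx, eps, 1.0 - eps)            # keep the logs finite
    return -np.mean(tau * np.log(hx) + (1.0 - tau) * np.log(1.0 - hx))

x   = np.array([-2.0, -1.0, 0.5, 2.0])          # sample inputs
tau = np.array([ 0.0,  0.0, 1.0, 1.0])          # targets: 1 for class 1, 0 for class 2
print(cross_entropy(tau, h(x, w=2.0, b=0.0)))   # small when h matches the targets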
MSE for Classification
Here’s a slightly modified MSE for classification.
Select targets

$$\tau(x) = \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

and construct a discriminant function h(x) to minimize

$$\mathcal{E}_{MSE} = E\Big\langle \big(\tau(x) - h(x)\big)^2 \Big\rangle\,.$$
Criterion Functions and Bayes Posteriors
Let’s minimize these two criterion functions – the MSE and the cross-entropy – over all possible discriminant functions h(x).

MSE:

$$\mathcal{E}_{MSE} = E\Big\langle \big(\tau(x) - h(x)\big)^2 \Big\rangle = \int \sum_{\tau\in\{0,1\}} \big(\tau - h(x)\big)^2\, p(\tau|x)\, p(x)\; dx$$

Set the first-order variation to zero:

$$\left.\frac{d\,\mathcal{E}_{MSE}[h+\lambda g]}{d\lambda}\right|_{\lambda=0} = -2\int \sum_{\tau\in\{0,1\}} \big(\tau - h(x)\big)\, p(\tau|x)\, g(x)\, p(x)\; dx = 0\,.$$

Again, since g(x) is arbitrary, it must be true that

$$\sum_{\tau\in\{0,1\}} \big(\tau - h(x)\big)\, p(\tau|x) = 0$$

for all x.
MSE and Bayes Posteriors
We have

$$h(x) = E[\,\tau(x)\,|\,x\,]$$

with

$$\tau(x) = \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2\,. \end{cases}$$

With this choice of targets, it’s clear that

$$h(x) = E[\,\tau(x)\,|\,x\,] = p(\tau = 1\,|\,x) = p(\omega_1\,|\,x)\,.$$

The discriminant h(x) that minimizes the MSE is the class posterior for $\omega_1$!
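A numerical illustration (a sketch; the two-Gaussian setup and the histogram estimator are arbitrary choices): with 0/1 targets, the unconstrained minimizer of the empirical MSE within each small bin of x is just the bin average of τ, and that average tracks p(ω1|x):

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
omega1 = rng.random(n) < 0.5                        # equal priors
x = np.where(omega1, rng.normal(-1, 1, n), rng.normal(+1, 1, n))
tau = omega1.astype(float)                          # tau = 1 on class 1, 0 on class 2

# unconstrained MSE minimizer per bin = mean of tau in that bin
bins = np.linspace(-4.0, 4.0, 81)
idx = np.digitize(x, bins)
h_hat = np.array([tau[idx == i].mean() for i in range(1, len(bins))])

# true posterior p(omega_1|x) at the bin centers (equal priors, unit variances)
c = 0.5 * (bins[:-1] + bins[1:])
post = 1.0 / (1.0 + np.exp(2.0 * c))
print(np.nanmax(np.abs(h_hat - post)))              # small: h_hat tracks p(omega_1|x)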
Cross Entropy and Bayes Posteriors
Similarly for the cross-entropy, when we minimize over all possible discriminant functions h(x):

$$0 = \left.\frac{d\,\mathcal{E}_{CE}[h+\lambda g]}{d\lambda}\right|_{\lambda=0}
= -\left.\frac{d}{d\lambda}\int \sum_{\tau\in\{0,1\}}\Big[\,\tau\,\ln\big(h(x)+\lambda g(x)\big) + (1-\tau)\,\ln\big(1-h(x)-\lambda g(x)\big)\Big]\, p(\tau|x)\, p(x)\; dx\,\right|_{\lambda=0}$$

$$= -\int \sum_{\tau\in\{0,1\}}\left[\frac{\tau}{h(x)} - \frac{1-\tau}{1-h(x)}\right] p(\tau|x)\, g(x)\, p(x)\; dx
= -\int \frac{E[\,\tau\,|\,x\,] - h(x)}{h(x)\,\big(1-h(x)\big)}\; g(x)\, p(x)\; dx\,.$$

Again, since g(x) is arbitrary, it must be true that $E[\tau|x] - h(x) = 0$ at all x, or

$$h(x) = E[\,\tau(x)\,|\,x\,]\,,$$

which with the targets

$$\tau(x) = \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

gives

$$h(x) = E[\,\tau\,|\,x\,] = p(\omega_1\,|\,x)\,.$$
Criterion Functions and Bayes Posteriors
For both the MSE and the cross-entropy, if we minimize these criteria over all possible discriminant functions, the optimal discriminant function is the Bayes a posteriori probability

$$h(x) = E[\,\tau\,|\,x\,] = p(\omega_1\,|\,x)\,.$$

Both of these derivations assume that we can generate any discriminant function h(x); there were no restrictions.

If an implementation is to produce a machine h(x) whose outputs are optimal, the machine must be based on a family of functions rich enough to model the posteriors.

As an example, we can restrict ourselves to linear classifiers and minimize the MSE or the cross-entropy, but the output will NOT be the posteriors (even if the class-conditional densities are Gaussian with equal covariances).
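A sketch of that last point (assumptions: equal priors, class-conditionals N(-1, 1) and N(+1, 1), and a purely linear machine h(x) = a + b·x fit by least squares to the 0/1 targets):

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
omega1 = rng.random(n) < 0.5
x = np.where(omega1, rng.normal(-1, 1, n), rng.normal(+1, 1, n))
tau = omega1.astype(float)

# least-squares fit of the restricted linear family h(x) = a + b*x
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, tau, rcond=None)

for x0 in (-3.0, 0.0, 3.0):
    posterior = 1.0 / (1.0 + np.exp(2.0 * x0))      # true p(omega_1|x0)
    print(f"x={x0:+.1f}   linear h={a + b * x0:+.3f}   posterior={posterior:.3f}")
# The linear fit even leaves [0, 1] in the tails; it is not the posterior.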