
ANLY-601
Advanced Pattern Recognition

Spring 2018

L8 – Criterion Functions and Bayes Optimality

Criterion Functions and Optimization

Need another tool – Calculus of Variations

We will be faced with the task of minimizing some integrals

with respect to functions that occur in the integrand. Thus

we’re doing minimization over function spaces.

The mathematical tool for this is calculus of variations.

Here’s the motivating problem.

What’s the curve of shortest distance between two points in a

plane?


Calculus of Variations

Write the curve as $y(x)$. The length of the curve is

$$S[y] \;=\; \int_{x_a}^{x_b} \sqrt{dx^2 + dy^2} \;=\; \int_{x_a}^{x_b} \sqrt{1 + y'^2}\; dx\,.$$

Next, vary the curve to $y(x) + \lambda g(x)$, where $\lambda$ is small.

The perturbation $g(x)$ is arbitrary except that $g(x_a) = g(x_b) = 0$, so both $y(x)$ and $y(x) + \lambda g(x)$ pass through the points $a$ and $b$.

If $y(x)$ is the curve of minimum length, then $S[y + \lambda g]$ should be first-order stationary in $\lambda$. (What does that mean?)


[Figure: the curve $y(x)$ and the perturbed curve $y(x) + \lambda g(x)$ joining the endpoints $a$ and $b$ in the $(x, y)$ plane.]

Calculus of Variations

If $y(x)$ is the curve of minimum length, then $S[y + \lambda g]$ (regarded as a function of $\lambda$) should have zero derivative at $\lambda = 0$:

$$\left.\frac{d\,S[y + \lambda g]}{d\lambda}\right|_{\lambda=0} \;=\; \left.\frac{d}{d\lambda}\int_{x_a}^{x_b} \sqrt{1 + \bigl(y' + \lambda g'\bigr)^2}\; dx\,\right|_{\lambda=0} \;=\; 0$$

Evaluate this (using integration by parts to take the derivative off $g(x)$):

$$\left.\frac{d\,S[y + \lambda g]}{d\lambda}\right|_{\lambda=0} \;=\; \int_{x_a}^{x_b} \frac{y'\, g'}{\sqrt{1 + y'^2}}\; dx \;=\; \left[\frac{y'\, g}{\sqrt{1 + y'^2}}\right]_{x_a}^{x_b} \;-\; \int_{x_a}^{x_b} g(x)\, \frac{d}{dx}\!\left(\frac{y'}{\sqrt{1 + y'^2}}\right) dx$$

The boundary term vanishes because $g(x_a) = g(x_b) = 0$.


Calc. of Variations
We have

$$\left.\frac{d\,S[y + \lambda g]}{d\lambda}\right|_{\lambda=0} \;=\; \left[\frac{y'\, g}{\sqrt{1 + y'^2}}\right]_{x_a}^{x_b} \;-\; \int_{x_a}^{x_b} g(x)\, \frac{d}{dx}\!\left(\frac{y'}{\sqrt{1 + y'^2}}\right) dx$$

$$=\; -\int_{x_a}^{x_b} g(x)\, \frac{d}{dx}\!\left(\frac{y'}{\sqrt{1 + y'^2}}\right) dx \;=\; -\int_{x_a}^{x_b} g(x) \left\{\frac{y''}{\sqrt{1 + y'^2}} \;-\; \frac{y'^2\, y''}{(1 + y'^2)^{3/2}}\right\} dx \;=\; 0$$

Since g(x) is arbitrary, if the last integral is to vanish, the factor in curly braces must vanish at all x.


Calc. of Variations

We have to find $y(x)$ satisfying

$$\int_{x_a}^{x_b} g(x) \left\{\frac{y''}{\sqrt{1 + y'^2}} \;-\; \frac{y'^2\, y''}{(1 + y'^2)^{3/2}}\right\} dx \;=\; 0$$

for arbitrary $g(x)$. Two solutions:

$$y'(x) = 0 \;\;\Rightarrow\;\; y(x) = c,\ \text{a constant}$$

$$y''(x) = 0 \;\;\Rightarrow\;\; y(x) = c + m\, x$$

The first is too restrictive (it can't pass through arbitrary endpoints). So the curve of extremal length is a straight line,

$$y(x) \;=\; c + m\, x\,.$$

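As a quick numerical sanity check of this result (not part of the original slides; the endpoints, grid, and perturbation $g(x)$ below are arbitrary illustrative choices), one can verify that adding a perturbation $\lambda g(x)$ with $g(x_a) = g(x_b) = 0$ to the straight line only increases the arc length $S[y]$:

```python
# Sketch (endpoints, grid, and the perturbation g(x) are arbitrary illustrative
# choices): the arc length S[y] of the straight line between two points grows
# whenever we add a perturbation lambda*g(x) with g(x_a) = g(x_b) = 0.
import numpy as np

def arc_length(y, x):
    """Approximate S[y] = integral of sqrt(1 + y'^2) dx on the grid x."""
    dy_dx = np.gradient(y, x)
    return np.trapz(np.sqrt(1.0 + dy_dx**2), x)

x = np.linspace(0.0, 1.0, 2001)          # endpoints (0, 0) and (1, 2)
straight = 2.0 * x                       # y(x) = c + m*x with c = 0, m = 2
g = np.sin(np.pi * x)                    # satisfies g(x_a) = g(x_b) = 0

for lam in [0.0, 0.05, 0.2, 0.5]:
    print(f"lambda = {lam:4.2f}   S[y + lambda*g] = {arc_length(straight + lam * g, x):.6f}")
```

The printed length is smallest at $\lambda = 0$, consistent with the straight line being the minimum-length curve.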

Calc. of Variations

Along the way, we’ve implicitly defined the functional derivative of S. Let

S[y] be a functional on y(x) (a map from functions y(x) to real numbers).

The functional derivative of $S$ with respect to $y(x)$, denoted $\dfrac{\delta S}{\delta y(x)}$, is defined by

$$\left.\frac{d}{d\lambda}\, S[\, y(x) + \lambda\, g(x)\,]\right|_{\lambda=0} \;=\; \int \frac{\delta S}{\delta y(x)}\; g(x)\; dx\,.$$

To find an extremum of the functional $S[y]$, set the functional derivative equal to zero:

$$\frac{\delta S}{\delta y(x)} \;=\; 0\,.$$

If you have a calculus of variations problem, whenever you're in doubt, the apparatus above is the place to turn; it often disambiguates calculations.
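As a quick check on this notation (restating the arc-length example from the preceding slides, not a new result): for

$$S[y] = \int_{x_a}^{x_b} \sqrt{1 + y'^2}\; dx, \qquad \frac{\delta S}{\delta y(x)} \;=\; -\,\frac{d}{dx}\!\left(\frac{y'}{\sqrt{1 + y'^2}}\right) \;=\; -\,\frac{y''}{(1 + y'^2)^{3/2}}\,,$$

and setting $\dfrac{\delta S}{\delta y(x)} = 0$ gives $y''(x) = 0$, i.e. the straight line $y(x) = c + m\,x$ found above.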

Example Statistics Application
Minimum Mean Squared Error Regression

In regression we often ask for functions that minimize mean square error

(MSE) (criterion function!) between data and the model.

Let’s look at regression probabilistically. We have a bunch of data pairs

(x,y) and we want to fit a function (any function) h(x) to them that minimizes

the MSE.

What we really want to do is build a regression function that minimizes the

MSE over the joint distribution p(x,y) = p(y|x) p(x).

Mathematically, the MSE over the joint density is

[Figure: scatter of $(x, y)$ data with a fitted regression curve $h(x)$.]

$$\mathcal{E} \;=\; \int\!\!\int \bigl(y - h(x)\bigr)^{2}\; p_{y|x}(y\,|\,x)\; p_x(x)\; dy\; dx$$


Minimum MSE Regression

The variation with respect to $h(x)$ is

$$\left.\frac{d\,\mathcal{E}[\,h(x) + \lambda g(x)\,]}{d\lambda}\right|_{\lambda=0} \;=\; \left.\frac{d}{d\lambda}\int\!\!\int \bigl(y - h(x) - \lambda g(x)\bigr)^{2}\; p_{y|x}(y\,|\,x)\; p_x(x)\; dy\; dx\,\right|_{\lambda=0}$$

$$=\; -2 \int\!\!\int \bigl(y - h(x)\bigr)\, g(x)\; p_{y|x}(y\,|\,x)\; p_x(x)\; dy\; dx \;=\; -2 \int \left\{ \int \bigl(y - h(x)\bigr)\; p_{y|x}(y\,|\,x)\; dy \right\} g(x)\; p_x(x)\; dx$$

$$=\; -2 \int \bigl\{\, E[\,y\,|\,x\,] - h(x) \,\bigr\}\; g(x)\; p_x(x)\; dx$$

Since $g(x)$ is arbitrary, if this variation is to vanish, the quantity in $\{\,\}$ must vanish identically at all $x$.

Thus, the best possible regressor is the function $h(x)$ that equals the conditional mean of $y$ at each point $x$:

$$h(x) \;=\; E_{y|x}[\,y\,|\,x\,] \;=\; \int y\; p(y\,|\,x)\; dy$$
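A minimal Monte-Carlo sketch of this result (not from the slides; the synthetic distribution, sample size, and candidate regressors below are my own illustrative choices): with $y = \sin(2\pi x)$ plus noise, the conditional mean $E[y|x] = \sin(2\pi x)$ attains a smaller sample MSE than nearby alternatives.

```python
# Sketch (synthetic data, my own choices): y = sin(2*pi*x) + Gaussian noise,
# so the conditional mean is E[y|x] = sin(2*pi*x). Compare its sample MSE
# against a few other candidate regressors h(x).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)                                 # samples from p(x)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, n)        # samples from p(y|x)

candidates = {
    "conditional mean E[y|x]": lambda x: np.sin(2.0 * np.pi * x),
    "scaled by 0.8":           lambda x: 0.8 * np.sin(2.0 * np.pi * x),
    "shifted by +0.1":         lambda x: np.sin(2.0 * np.pi * x) + 0.1,
    "global mean of y":        lambda x: np.full_like(x, y.mean()),
}

for name, h in candidates.items():
    mse = np.mean((y - h(x)) ** 2)
    print(f"{name:25s} MSE = {mse:.4f}")
# The conditional mean attains the smallest MSE (about the noise variance, 0.09).
```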

Back to Classification — Cross-Entropy Criterion

An often used criterion function in classifier design is the cross-entropy. Take target values

$$\varepsilon(x) \;=\; \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

and a discriminant function that satisfies $0 \le h(x) \le 1$. For example, we could let $h$ be a logistic function with argument $u(x)$ (an unknown):

$$h(x) \;=\; \frac{1}{1 + \exp\bigl(-u(x)\bigr)}$$

[Figure: the logistic function $h$ plotted against $u$.]

The cross-entropy criterion is

$$E_{CE} \;=\; -\,E\Bigl[\, \varepsilon(x)\, \ln h(x) \;+\; \bigl(1 - \varepsilon(x)\bigr)\, \ln\bigl(1 - h(x)\bigr) \Bigr]$$

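A small numerical illustration of this criterion (not from the slides; the Gaussian toy data and the particular linear form $u(x) = wx + b$ are my own assumptions), computing a sample estimate of $E_{CE}$ for a logistic discriminant:

```python
# Sketch (toy data and the linear form u(x) = w*x + b are my own choices):
# sample estimate of the cross-entropy criterion for a logistic discriminant
# h(x) = 1 / (1 + exp(-u(x))), with targets eps = 1 (omega_1) and eps = 0 (omega_2).
import numpy as np

def h_logistic(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def cross_entropy(eps, h, tiny=1e-12):
    """Sample estimate of E_CE = -E[ eps*ln(h) + (1 - eps)*ln(1 - h) ]."""
    h = np.clip(h, tiny, 1.0 - tiny)          # guard against log(0)
    return -np.mean(eps * np.log(h) + (1.0 - eps) * np.log(1.0 - h))

rng = np.random.default_rng(1)
x1 = rng.normal(+1.0, 1.0, 5000)              # class omega_1
x0 = rng.normal(-1.0, 1.0, 5000)              # class omega_2
x = np.concatenate([x1, x0])
eps = np.concatenate([np.ones_like(x1), np.zeros_like(x0)])

# u(x) = 2x happens to give the Bayes posterior for this toy data (means +/-1,
# unit variances, equal priors), so it should yield the lower cross-entropy.
print(cross_entropy(eps, h_logistic(x, w=2.0, b=0.0)))
print(cross_entropy(eps, h_logistic(x, w=0.5, b=1.0)))   # a poorer choice of u(x)
```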

MSE for Classification

Here's a slightly modified MSE for classification. Select targets

$$\varepsilon(x) \;=\; \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

and construct a discriminant function $h(x)$ to minimize

$$E_{MSE} \;=\; E\Bigl[\, \bigl(\varepsilon(x) - h(x)\bigr)^{2} \Bigr]$$


Criterion Functions and Bayes Posteriors

Let's minimize these two criterion functions – the MSE and the cross-entropy – over all possible discriminant functions h(x).

MSE

Write the MSE over the joint density as

$$E_{MSE} \;=\; E\Bigl[\, \bigl(\varepsilon(x) - h(x)\bigr)^{2} \Bigr] \;=\; \int\!\!\int \bigl(\varepsilon - h(x)\bigr)^{2}\; p(\varepsilon\,|\,x)\; p(x)\; d\varepsilon\; dx$$

Set the first-order variation to zero:

$$\left.\frac{d\,E_{MSE}[\,h + \lambda g\,]}{d\lambda}\right|_{\lambda=0} \;=\; -2 \int \left\{ \int \bigl(\varepsilon - h(x)\bigr)\; p(\varepsilon\,|\,x)\; d\varepsilon \right\} g(x)\; p(x)\; dx \;=\; 0$$

Again, since $g(x)$ is arbitrary, it must be true that

$$\int \bigl(\varepsilon - h(x)\bigr)\; p(\varepsilon\,|\,x)\; d\varepsilon \;=\; 0$$

for all $x$.

MSE and Bayes Posteriors

We have

$$h(x) \;=\; E[\,\varepsilon\,|\,x\,] \;=\; \sum_{\varepsilon} \varepsilon\; p(\varepsilon\,|\,x)$$

with

$$\varepsilon(x) \;=\; \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

With this choice of targets, it's clear that

$$h(x) \;=\; \sum_{\varepsilon \in \{0,1\}} \varepsilon\; p(\varepsilon\,|\,x) \;=\; p(\varepsilon = 1\,|\,x) \;=\; p(\omega_1\,|\,x)$$

The discriminant $h(x)$ that minimizes the MSE is the class posterior for $\omega_1$!

Cross Entropy and Bayes Posteriors

Similarly for the cross-entropy, when we minimize over all possible

discriminant functions h(x)

$$\left.\frac{d\,E_{CE}[\,h + \lambda g\,]}{d\lambda}\right|_{\lambda=0} \;=\; -\left.\frac{d}{d\lambda}\, E\Bigl[\, \varepsilon(x)\, \ln\bigl(h(x) + \lambda g(x)\bigr) \;+\; \bigl(1 - \varepsilon(x)\bigr)\, \ln\bigl(1 - h(x) - \lambda g(x)\bigr) \Bigr]\right|_{\lambda=0}$$

$$=\; -\int \left\{ \frac{E[\,\varepsilon\,|\,x\,] - h(x)}{h(x)\,\bigl(1 - h(x)\bigr)} \right\} g(x)\; p(x)\; dx \;=\; 0$$

Again, since $g(x)$ is arbitrary, it must be true that

$$h(x) \;=\; E[\,\varepsilon\,|\,x\,] \;=\; \sum_{\varepsilon} \varepsilon\; p(\varepsilon\,|\,x)$$

or, with

$$\varepsilon(x) \;=\; \begin{cases} 1, & x \in \omega_1 \\ 0, & x \in \omega_2 \end{cases}$$

$$h(x) \;=\; E[\,\varepsilon\,|\,x\,] \;=\; p(\omega_1\,|\,x)$$


Criterion Functions and Bayes Posteriors

For both the MSE and the cross-entropy, if we minimize these criteria over all possible discriminant functions, the optimal discriminant function is the Bayes a posteriori probability

$$h(x) \;=\; E[\,\varepsilon\,|\,x\,] \;=\; p(\omega_1\,|\,x)\,.$$

Both of these derivations assume that we can generate any discriminant

function h(x). There were no restrictions.

If an implementation is to produce a machine h(x) whose outputs are optimal, the machine must be based on a family of functions rich enough to model the posteriors.

As an example, we can restrict ourselves to linear classifiers, and minimize the

MSE or the cross-entropy, but the output will NOT be the posteriors (even if the

class-conditional densities are Gaussian with equal covariances).

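A sketch of these last two points (not from the slides; the class means, common variance, priors, histogram-bin estimator, and sample sizes are my own illustrative choices): with 0/1 targets, a flexible h(x) fit by least squares (here, a simple per-bin average of the targets) tracks the posterior $p(\omega_1|x)$, while a linear least-squares h(x) does not, even though the class-conditional densities are Gaussian with equal variances.

```python
# Sketch (toy setting, my own choices): two classes with Gaussian class-conditional
# densities N(+1, 1) and N(-1, 1), equal priors, targets eps = 1 for omega_1 and
# eps = 0 for omega_2. A flexible MSE minimizer (a per-bin average of the targets,
# i.e. the least-squares fit within each bin) approximates p(omega_1 | x);
# a linear least-squares fit does not, even in this equal-covariance Gaussian case.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(+1.0, 1.0, n)                          # class omega_1
x0 = rng.normal(-1.0, 1.0, n)                          # class omega_2
x = np.concatenate([x1, x0])
eps = np.concatenate([np.ones(n), np.zeros(n)])        # 0/1 targets

def true_posterior(x):
    # For equal priors and unit variances, p(omega_1 | x) = 1 / (1 + exp(-2x)).
    return 1.0 / (1.0 + np.exp(-2.0 * x))

# "Flexible" discriminant: average target within each x-bin.
bins = np.linspace(-4.0, 4.0, 81)
idx = np.digitize(x, bins)
binned_h = {int(i): eps[idx == i].mean() for i in np.unique(idx)}

# Linear discriminant fit by least squares: h(x) = a*x + b.
a, b = np.polyfit(x, eps, 1)

for xt in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    i = int(np.digitize([xt], bins)[0])
    flexible = binned_h.get(i, float("nan"))
    print(f"x={xt:+.1f}  p(w1|x)={true_posterior(xt):.3f}  "
          f"flexible h={flexible:.3f}  linear h={a * xt + b:.3f}")
# The flexible fit tracks the posterior; the linear fit does not.
```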