ANLY-601
Advanced Pattern Recognition
Spring 2018
L20 — Neural Nets II
MLP Output

Signal propagation (forward pass, bottom-up):

• inputs $x_i$
• hidden units $y_k = g(\mathrm{net}_k)$, with $\mathrm{net}_k = \sum_i w_{ki}\, x_i$
• outputs $O_l = g(\mathrm{net}_l)$, with $\mathrm{net}_l = \sum_k w_{lk}\, y_k$

where $g(\mathrm{net}) = \mathrm{net}$ for regression and $g(\mathrm{net})$ is a sigmoid for classification.
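As a concrete illustration, here is a minimal NumPy sketch of this forward pass for a single hidden layer. The array shapes, the choice of tanh for the hidden activation, and the identity output (regression case) are assumptions for the example, and bias terms are omitted to match the slide's notation.

```python
import numpy as np

def forward(x, W_hidden, W_out, g_hidden=np.tanh, g_out=lambda u: u):
    """Forward pass of a one-hidden-layer MLP (biases omitted).

    x        : input vector, shape (n_inputs,)
    W_hidden : hidden-layer weights w_ki, shape (n_hidden, n_inputs)
    W_out    : output-layer weights w_lk, shape (n_outputs, n_hidden)
    g_out    : identity for regression; use a sigmoid for classification
    """
    net_k = W_hidden @ x          # net_k = sum_i w_ki x_i
    y = g_hidden(net_k)           # hidden activations y_k = g(net_k)
    net_l = W_out @ y             # net_l = sum_k w_lk y_k
    O = g_out(net_l)              # outputs O_l = g(net_l)
    return O, y, net_k, net_l

# Example: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W_hidden = 0.1 * rng.standard_normal((4, 3))
W_out = 0.1 * rng.standard_normal((2, 4))
O, y, net_k, net_l = forward(np.array([0.5, -1.0, 2.0]), W_hidden, W_out)
```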
Gradient Descent in MLP

Cost function as before:
$$E(w) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 = \frac{1}{2D}\sum_{m=1}^{D}\sum_{l=1}^{N}\bigl(t^m_l - O_l(x^m)\bigr)^2,\qquad N = \text{number of outputs}$$

Learning by gradient descent:
$$\Delta w_{ij} = -\eta\,\frac{\partial E(w)}{\partial w_{ij}}$$

Let's calculate the components of the gradient.
MLP Error Gradient

$$E(w) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2,\qquad \Delta w_{ij} = -\eta\,\frac{\partial E(w)}{\partial w_{ij}}$$

1. Derivative with respect to a weight to the output (weight $w_{lk}$ connects hidden node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}}
= \frac{\partial}{\partial w_{lk}}\,\frac{1}{2D}\sum_{m=1}^{D}\sum_{l'=1}^{N}\bigl(t^m_{l'} - O_{l'}(x^m)\bigr)^2$$
$$= \frac{1}{2D}\sum_{m=1}^{D}\sum_{l'=1}^{N} 2\,\bigl(t^m_{l'} - O_{l'}(x^m)\bigr)\Bigl(-\frac{\partial O_{l'}(x^m)}{\partial w_{lk}}\Bigr)$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t^m_l - O_l(x^m)\bigr)\,\frac{\partial O_l(x^m)}{\partial w_{lk}}$$

(only output $O_l$ depends on $w_{lk}$, so the sum over outputs collapses to the single term $l' = l$).
MLP Error Gradient

Derivative with respect to a weight to the output. We have so far:

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\,\frac{\partial O_l(x^m)}{\partial w_{lk}}$$

Using $O_l = g(\mathrm{net}_l)$ and $\mathrm{net}_l = \sum_i w_{li}\, y_i$:

$$\frac{\partial E(w)}{\partial w_{lk}}
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\,\frac{\partial g(\mathrm{net}_l)}{\partial w_{lk}}
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\,\frac{\partial \mathrm{net}_l}{\partial w_{lk}}$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\,\frac{\partial}{\partial w_{lk}}\sum_i w_{li}\, y_i
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\sum_i \delta_{ik}\, y_i$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\, y_k$$
MLP Error Gradient

1. Derivative with respect to a weight to the output ($w_{lk}$ is the weight from node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\underbrace{\bigl(t_l - O_l(x^m)\bigr)}_{\text{error at node }l}\;\underbrace{g'(\mathrm{net}_l)}_{\substack{\text{slope of activation}\\ \text{function for node }l}}\;\underbrace{y_k}_{\substack{\text{signal from}\\ \text{node }k}}$$
MLP Gradient

2. Derivative with respect to weights to hidden units (weight $w_{ki}$ connects input $x_i$ to hidden node $y_k$; outputs indexed by $n$):

$$\frac{\partial E(w)}{\partial w_{ki}}
= \frac{\partial}{\partial w_{ki}}\,\frac{1}{2D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)^2
= \frac{1}{2D}\sum_{m=1}^{D}\sum_{n=1}^{N} 2\,\bigl(t_n - O_n(x^m)\bigr)\Bigl(-\frac{\partial O_n(x^m)}{\partial w_{ki}}\Bigr)$$

Every output $O_n$ depends on $w_{ki}$ through the hidden unit $y_k$, so by the chain rule

$$= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}} \qquad (1)$$
MLP Gradient

2. Derivative with respect to weights to hidden units. From (1):

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}} \qquad (1)$$

Now look at the two pieces:

$$\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}}
= g'(\mathrm{net}_k)\,\frac{\partial \mathrm{net}_k}{\partial w_{ki}}
= g'(\mathrm{net}_k)\,\frac{\partial}{\partial w_{ki}}\sum_j w_{kj}\, x_j
= g'(\mathrm{net}_k)\, x_i$$

$$\frac{\partial O_n(x)}{\partial y_k}
= \frac{\partial g(\mathrm{net}_n)}{\partial y_k}
= g'(\mathrm{net}_n)\,\frac{\partial \mathrm{net}_n}{\partial y_k}
= g'(\mathrm{net}_n)\,\frac{\partial}{\partial y_k}\sum_{k'} w_{nk'}\, y_{k'}
= g'(\mathrm{net}_n)\, w_{nk}$$

Substitute these into (1).
MLP Gradient

Derivative with respect to weights to hidden units. Substituting the two pieces into (1) gives

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}
\underbrace{\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}}_{\text{pseudo-error at hidden node }k}\;
\underbrace{g'(\mathrm{net}_k)}_{\substack{\text{activation function}\\ \text{slope at node }k}}\;
\underbrace{x_i}_{\substack{\text{signal at}\\ \text{node }i}}$$
MLP Error Gradient

2. Derivative with respect to weights to hidden units:

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\; g'(\mathrm{net}_k)\; x_i$$

• Pseudo-error at hidden node $k$: the error at output node $n$, multiplied by the slope at output node $n$, passed backwards through the weight $w_{nk}$ from node $n$ to node $k$; hence “error back-propagation”.
• $g'(\mathrm{net}_k)$: slope of the activation function at node $k$.
• $x_i$: signal at node $i$.
Summary: MLP Error Gradients

1. Derivative with respect to a weight to an output ($w_{lk}$, node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\, y_k$$

stochastic version:
$$\frac{\partial E(w)}{\partial w_{lk}} = -\bigl(t_l - O_l(x)\bigr)\, g'(\mathrm{net}_l)\, y_k$$

2. Derivative with respect to weights to hidden units ($w_{ki}$, input $i$ to hidden node $k$; $w_{nk}$, hidden node $k$ to output node $n$):

$$\frac{\partial E(w)}{\partial w_{ki}} = -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\, g'(\mathrm{net}_k)\, x_i$$

stochastic version:
$$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{n=1}^{N}\bigl(t_n - O_n(x)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\, g'(\mathrm{net}_k)\, x_i$$
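To make these formulas concrete, here is a minimal NumPy sketch that evaluates both gradient expressions for a single training pair (the stochastic versions above). Tanh hidden units and linear (regression) outputs are assumptions for the example, and biases are omitted to match the slide notation.

```python
import numpy as np

def backprop_gradients(x, t, W_hidden, W_out):
    """Stochastic-version gradients dE/dw_lk and dE/dw_ki for one (x, t) pair.

    Assumes tanh hidden units and linear outputs:
        y_k = tanh(net_k),  net_k = sum_i w_ki x_i
        O_l = net_l,        net_l = sum_k w_lk y_k
    """
    # Forward pass
    net_k = W_hidden @ x
    y = np.tanh(net_k)
    O = W_out @ y                          # linear outputs: g'(net_l) = 1

    # Output-layer gradient: dE/dw_lk = -(t_l - O_l) g'(net_l) y_k
    delta_out = (t - O) * 1.0              # error times output slope
    grad_W_out = -np.outer(delta_out, y)

    # Hidden-layer gradient:
    # dE/dw_ki = -[sum_n (t_n - O_n) g'(net_n) w_nk] g'(net_k) x_i
    pseudo_error = W_out.T @ delta_out     # error back-propagated through w_nk
    delta_hidden = pseudo_error * (1.0 - y**2)   # times tanh slope g'(net_k)
    grad_W_hidden = -np.outer(delta_hidden, x)

    return grad_W_out, grad_W_hidden

# Quick finite-difference sanity check on one hidden weight
rng = np.random.default_rng(1)
x, t = rng.standard_normal(3), rng.standard_normal(2)
W_h, W_o = 0.1 * rng.standard_normal((4, 3)), 0.1 * rng.standard_normal((2, 4))
E = lambda Wh, Wo: 0.5 * np.sum((t - Wo @ np.tanh(Wh @ x))**2)
g_out, g_hid = backprop_gradients(x, t, W_h, W_o)
eps = 1e-6
Wh2 = W_h.copy(); Wh2[0, 0] += eps
print(g_hid[0, 0], (E(Wh2, W_o) - E(W_h, W_o)) / eps)   # should roughly agree
```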
Backpropagation Learning Algorithm

Batch Mode (uses ALL the data at each step)

    choose learning rate η
    initialize w_ij                         % usually to “small” random numbers
    while ( ΔE / E > ε )                    % fractional change ε ~ 10^-4 – 10^-6
        calculate the mean square error
            E(w) = (1/2D) Σ_{m=1}^{D} ‖ t^m − O(x^m) ‖²
        calculate all error derivatives and step downhill
            Δw_ij = −η ∂E(w)/∂w_ij
    endwhile
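A runnable sketch of this batch-mode loop, reusing the gradient routine from the earlier sketch, might look as follows. The learning rate, stopping tolerance, data shapes, and the tanh/linear architecture are illustrative assumptions.

```python
import numpy as np

def mse(X, T, W_h, W_o):
    """E(w) = (1/2D) sum_m || t^m - O(x^m) ||^2 over the whole data set."""
    O = (W_o @ np.tanh(W_h @ X.T)).T
    return 0.5 * np.mean(np.sum((T - O)**2, axis=1))

def train_batch(X, T, n_hidden=8, eta=0.05, tol=1e-5, max_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    D, n_in = X.shape
    n_out = T.shape[1]
    W_h = 0.1 * rng.standard_normal((n_hidden, n_in))   # small random init
    W_o = 0.1 * rng.standard_normal((n_out, n_hidden))

    E_old = mse(X, T, W_h, W_o)
    for _ in range(max_iter):
        # Accumulate gradients over ALL data (batch mode), then step downhill
        g_o = np.zeros_like(W_o)
        g_h = np.zeros_like(W_h)
        for x, t in zip(X, T):
            go, gh = backprop_gradients(x, t, W_h, W_o)  # from the earlier sketch
            g_o += go / D
            g_h += gh / D
        W_o -= eta * g_o
        W_h -= eta * g_h

        E_new = mse(X, T, W_h, W_o)
        if abs(E_old - E_new) / E_old < tol:             # fractional-change test
            break
        E_old = E_new
    return W_h, W_o
```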
Backpropagation Learning Algorithm

Stochastic or On-Line Mode (uses ONE input/target pair for each step)

    choose learning rate η
    initialize w_ij                         % usually to “small” random numbers
    while ( ΔE / E > ε )                    % fractional change ε ~ 10^-4 – 10^-6
        calculate the mean square error E(w) = (1/2D) Σ_{m=1}^{D} ‖ t^m − O(x^m) ‖²
        for μ = 1 … D                       % step through the data, or do D random draws with replacement
            change all weights w_ij as  Δw_ij = −η ∂E_μ(w)/∂w_ij   (per-pattern, stochastic-version gradients)
        end for
    endwhile
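For comparison with the batch loop, a minimal sketch of the on-line update, again reusing the earlier gradient routine with tanh/linear units. A fixed epoch count replaces the fractional-change test for brevity, and the shuffling strategy and learning rate are assumptions.

```python
import numpy as np

def train_online(X, T, n_hidden=8, eta=0.05, n_epochs=200, seed=0):
    """On-line (stochastic) backprop: update after every input/target pair."""
    rng = np.random.default_rng(seed)
    D, n_in = X.shape
    n_out = T.shape[1]
    W_h = 0.1 * rng.standard_normal((n_hidden, n_in))
    W_o = 0.1 * rng.standard_normal((n_out, n_hidden))

    for _ in range(n_epochs):
        for mu in rng.permutation(D):        # step through data in random order
            go, gh = backprop_gradients(X[mu], T[mu], W_h, W_o)  # per-pattern grads
            W_o -= eta * go                  # immediate weight change
            W_h -= eta * gh
    return W_h, W_o
```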
Comments

The cost function may not be convex; it can have local optima, some of which may be quite poor. In practice, this is not a show-stopper.

Usually we initialize with random weights close to zero. Then
$$\mathrm{net}_k = \sum_i w_{ki}\, x_i$$
will be small, and
$$g(\mathrm{net}_k) \approx \mathrm{net}_k$$
So early on, the network output will be nearly linear in the input. Non-linearities are added as training continues.
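As a worked step (assuming an odd, zero-centred activation such as $g = \tanh$, so that $g(0) = 0$ and $g'(0) = 1$; for other sigmoids the same argument holds up to an offset and scale):

$$g(\mathrm{net}) = g(0) + g'(0)\,\mathrm{net} + O(\mathrm{net}^3) = \mathrm{net} + O(\mathrm{net}^3) \approx \mathrm{net} \quad\text{for small } \mathrm{net}.$$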
Comments

Learning algorithms are simply optimization methods: we are trying to find the w that minimizes E(w). Several other optimization methods, both classical and novel, are brought to bear on the problem.

Deep networks (several layers of sigmoidal hidden nodes) can be very slow to train; the gradient with respect to weights near the inputs contains multiple factors of g′, which shrinks the gradient signal. (And the Hessian of E becomes poorly conditioned.)
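To see where the repeated g′ factors come from, extend the hidden-unit derivation above to a weight $w^{(1)}_{ki}$ in the first of several layers; schematically (indices suppressed, one path shown):

$$\frac{\partial E}{\partial w^{(1)}_{ki}} \;\propto\; (t - O)\; g'(\mathrm{net}^{(L)})\, w^{(L)}\; g'(\mathrm{net}^{(L-1)})\, w^{(L-1)} \cdots g'(\mathrm{net}^{(2)})\, w^{(2)}\; g'(\mathrm{net}^{(1)})\, x_i$$

Each layer contributes a factor $g'(\mathrm{net})$, and since $|g'| \le 1$ for the usual sigmoids (and much smaller away from zero), the product can become very small as depth grows.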
Power

Universal approximation theorem
– Any continuous function on a compact subset of the input space (closed and bounded) can be approximated arbitrarily closely by a feedforward network with one layer of sigmoidal hidden units and linear output units. That is, weighted sums of sigmoidal functions of the inputs are universal approximators:
$$O(x) = \sum_{j=1}^{n_h} w_j\, \sigma\Bigl(\sum_i w_{ji}\, x_i\Bigr)$$
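A direct NumPy transcription of this functional form; the logistic σ is one common choice of sigmoid, and bias terms are omitted as in the formula above.

```python
import numpy as np

def sigma(u):
    """Logistic sigmoid, one common choice for the hidden nonlinearity."""
    return 1.0 / (1.0 + np.exp(-u))

def O(x, w_out, W_hidden):
    """O(x) = sum_j w_j * sigma(sum_i w_ji x_i): one sigmoidal hidden layer,
    linear output, i.e. the universal-approximator form from the slide."""
    return w_out @ sigma(W_hidden @ x)

# Example with n_h = 5 hidden units and a 2-dimensional input
rng = np.random.default_rng(0)
W_hidden = rng.standard_normal((5, 2))   # w_ji
w_out = rng.standard_normal(5)           # w_j
print(O(np.array([0.3, -0.7]), w_out, W_hidden))
```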
Power

• Approximation Accuracy
– The magnitude of the approximation error decreases with increasing number $n_h$ of hidden units as
$$E \sim \mathrm{Order}(1/n_h)$$
– Techniques linear in the parameters (fixed basis functions with only their weighting fit),
$$O(x) = \sum_{i=1}^{N} w_i\, \varphi_i(x),$$
only achieve error bounded by
$$\mathrm{Order}\bigl(1/n^{2/d}\bigr)$$
where $d$ is the dimension of the input space: the CURSE OF DIMENSIONALITY.
Inductive Bias

The hypothesis space is the continuous weight space! It is hard to characterize the inductive bias.

Bias can be imposed by adding a regularizer to the cost function
$$E(w, \lambda) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 + \lambda\, F(w)$$
where $F(w)$ carries the desired bias, and $\lambda$ characterizes the strength with which the bias is imposed.
Inductive Bias

Bias can be imposed by adding a regularizer to the cost function
$$E(w, \lambda) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 + \lambda\, F(w)$$
where $F(w)$ carries the desired bias, and $\lambda$ characterizes the strength with which the bias is imposed.

Examples:
– small weights (L2 norm, weight decay):
$$F(w) = \|w\|^2 = \sum_i w_i^2$$
– small curvature:
$$F(w) = \int \Bigl(\frac{\partial^2 O(x)}{\partial x^2}\Bigr)^{2} dx$$
– sparse models (L1 norm):
$$F(w) = \|w\|_1 = \sum_i |w_i|$$
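A small sketch of how the first and third regularizers enter the cost in practice (the curvature penalty needs the model's second derivatives, so it is omitted here); the value of λ and the weight arrays are illustrative.

```python
import numpy as np

def regularized_cost(data_cost, weights, lam, kind="l2"):
    """E(w, lambda) = data_cost + lambda * F(w) for the L2 or L1 penalty."""
    w = np.concatenate([W.ravel() for W in weights])
    if kind == "l2":
        F = np.sum(w**2)          # small weights (weight decay)
    elif kind == "l1":
        F = np.sum(np.abs(w))     # sparse models
    else:
        raise ValueError("kind must be 'l2' or 'l1'")
    return data_cost + lam * F

# Example: combine an MSE of 0.42 with weight decay of strength 0.01
W_h, W_o = np.ones((4, 3)), np.ones((2, 4))
print(regularized_cost(0.42, [W_h, W_o], lam=0.01, kind="l2"))
```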
Generalization

Overfitting / Underfitting
• We've been talking about fitting the network function to the training data $\{x^\mu, t^\mu\}$, $\mu = 1 \ldots D$.
• But we really care about the performance of the network on unseen data.
Generalization

Overfitting / Underfitting
• We can build a network with a very large hidden layer, train it to the bottom of a deep local optimum, and very closely approximate the training data. This may be precisely the wrong thing to do.
• The question is “how well does the model generalize?” What is the average error on infinitely large test sets? (That is, what is the statistical expectation of the error on the population?) This is called the generalization error.
Overfitting

[Figure: the 20 training points (x vs. y) with the six network fits overlaid, labeled by the number of hidden units: 1, 2, 3, 5, 10, 20.]

Regression problem, 20 training data points, six different neural network fits with 1, 2, 3, 5, 10, and 20 hidden units.
MSE_train = [ 0.255 0.146 0.074 0.025 0.019 0.014 ]
Overfitting

[Figure: the same six network fits (x vs. y) overlaid on the test data.]

Here the fits to the training data are overlaid on test or out-of-sample data (i.e. data not used during the fitting).
MSE_test = [ 0.121 0.203 0.505 0.126 0.216 0.213 ]
Fixed Training Set Size, Variable Model Size
Expected Behavior

[Figure: sketch of MSE versus model size (number of hidden units), for models trained to the bottom of a deep local optimum, showing a training error curve and a generalization error curve with underfitting and overfitting regimes marked.]
Fixed Model Size, Variable Training Set Size
Expected Behavior

[Figure: sketch of MSE versus training set size, for a model trained to the bottom of a deep local optimum, showing a training error curve and a generalization error curve.]
Fixed Model and Training Data
Regularized Cost Function
Expected Behavior

[Figure: sketch of MSE versus 1/λ (inverse regularizer strength), showing a training error curve and a generalization error curve with underfitting and overfitting regimes marked.]
Probabilistic Interpretations: Regression

Recall Lecture 8, page 9. The best possible function for minimizing the MSE in a regression problem (curve fitting) is the conditional mean of the target at each point $x$:
$$O(x) \approx h(x) = E[\,t \mid x\,] = \int t\; p_{t|x}(t \mid x)\, dt$$

Since NNs are universal approximators, we expect that if we have
• a large enough network (hidden nodes)
• enough data
our network trained to minimize the MSE will have outputs $O(x)$ that approximate the regressor $E[\,t \mid x\,]$.

Typically, regression networks have linear output nodes (but sigmoidal hidden nodes).
Probabilistic Interpretations: Classification

Consider a classification problem with L classes. (Each sample is a member of one and only one class.) The usual NN for such a problem has L output nodes with targets $t_i$, $i = 1 \ldots L$:
$$t_i(x) = \begin{cases} 1 & \text{if } x \in \omega_i \\ 0 & \text{if } x \notin \omega_i \end{cases}$$
e.g. for a 4-class problem, for an input $x$ from class 2, the target outputs are [ 0 1 0 0 ]. This is called a one-of-many representation.
Probabilistic Interpretations: Classification

A simple extension of the result of Lecture 8, page 13 says that the best possible function for minimizing the MSE of the output for such a representation is to have each output equal the class posterior:
$$O_i(x) = E[\,t_i \mid x\,] = \sum_{t_i \in \{0,1\}} t_i\; p_{t_i|x}(t_i \mid x) = p(\omega_i \mid x)$$

Typically, networks trained for classification use a sigmoidal output function, usually the logistic
$$g(u) = \frac{1}{1 + \exp(-u)}$$
which is naturally bounded to the range [0, 1].
Probabilistic Interpretations: Classification

A similar extension of the result in Lecture 8, page 14 says that the absolute minimum of the cross-entropy error measure
$$E = -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L}\Bigl[\, t_l(x_n)\,\ln\frac{O_l(x_n)}{t_l(x_n)} \;+\; \bigl(1 - t_l(x_n)\bigr)\,\ln\frac{1 - O_l(x_n)}{1 - t_l(x_n)} \,\Bigr]$$
is attained when each network output equals the class posterior.

Networks trained to minimize the cross-entropy typically use a logistic output
$$O_l(u_l) = \frac{1}{1 + \exp(-u_l)}$$

Notice that this setup is also useful when an object can belong to several classes simultaneously (e.g. medical diagnosis).
Probabilistic Interpretations: Classification

Finally, for multiclass problems, a third cost function emerges. It is based on the multinomial distribution for the target values and is exclusively for the case where each object is a member of one and only one class:
$$p\bigl(\{t_1, \ldots, t_L\} \mid x\bigr) = \prod_{l=1}^{L} O_l(x)^{\,t_l(x)}$$

Taking the negative log-likelihood for a set of examples $x_n$, $n = 1 \ldots N$, results in the cost function
$$E = -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L} t_l(x_n)\,\ln O_l(x_n)
= -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L} t_l(x_n)\,\ln\frac{O_l(x_n)}{t_l(x_n)}$$
(the two forms agree because the targets are 0/1, so the extra $t \ln t$ terms vanish).
Probabilistic Interpretations: Classification

Networks trained with this cost function (‘Cross-entropy 2’) typically use the soft-max activation
$$O_l(u_l) = \frac{\exp(u_l)}{\sum_{l'=1}^{L}\exp(u_{l'})}$$
which is naturally bounded in the interval [0, 1].

Note also that $\sum_{l=1}^{L} O_l(u_l) = 1$, as must be the case when each object belongs to one and only one class.
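A short NumPy sketch of the soft-max output and the multinomial (‘Cross-entropy 2’) cost; the max-subtraction is a standard numerical-stability trick added for the example, not something from the slides.

```python
import numpy as np

def softmax(u):
    """O_l = exp(u_l) / sum_l' exp(u_l'); rows of u are samples."""
    z = np.exp(u - np.max(u, axis=-1, keepdims=True))  # stability shift
    return z / np.sum(z, axis=-1, keepdims=True)

def cross_entropy_2(T, O, eps=1e-12):
    """E = -(1/N) sum_n sum_l t_l(x_n) ln O_l(x_n), for one-of-many targets T."""
    return -np.mean(np.sum(T * np.log(O + eps), axis=1))

# Outputs sum to 1 across classes, as required for one-and-only-one-class membership
U = np.array([[2.0, -1.0, 0.5], [0.1, 0.1, 3.0]])
O = softmax(U)
print(O.sum(axis=1))                       # -> [1. 1.]
print(cross_entropy_2(np.array([[1, 0, 0], [0, 0, 1]]), O))
```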
Probabilistic Interpretations: Classification

Since NNs are universal approximators, we expect that for
• large enough networks (hidden nodes)
• enough data
a classifier NN trained to minimize either the MSE or the cross-entropy error measure will have network outputs that approximate the class posteriors.

This is the usual interpretation of the NN classifier outputs, but care is essential, since the prerequisites (large networks and enough data) may not be met.
Weight Estimation – Maximum Likelihood

Training a neural network is an estimation problem, where the parameters being estimated are the weights in the network. Recalling the results in Lecture 10, pages 9 and 10:
– Minimizing the MSE is equivalent to maximum likelihood estimation under the assumption of targets with a Gaussian distribution (as usually assumed for regression problems).
– Minimizing the CROSS-ENTROPY error is equivalent to maximum likelihood estimation under the assumption of targets with a Bernoulli distribution (as usually assumed for classification problems).
Weight Estimation – Maximum Likelihood

– Minimizing CROSS-ENTROPY 2 is equivalent to maximum likelihood estimation under the assumption of targets with a multinomial distribution, as given on p. 30.
Weight Estimation – MAP

Following the earlier discussion of estimation methods, we can introduce a prior over the network weights, p(w), and move from the ML estimate to the MAP estimate.

This change is mirrored by the change from a cost function to a regularized cost function
$$U = E - \ln\bigl(P(w)\bigr)$$

Regularizers help improve generalization by reducing the variance of the estimates of the network weights. They do so at the price of introducing bias into the weight estimates.
Weight Estimation – MAP

An often-used regularizer is weight decay
$$U = E + \lambda \sum_{j=1}^{Q} w_j^2$$
which is equivalent to a Gaussian prior on the weights with mean zero and (spherically symmetric) covariance $\Sigma = \frac{1}{2\lambda} I$.
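As a worked step connecting the two views (a sketch; constants independent of w are dropped): take a zero-mean spherical Gaussian prior $p(w) \propto \exp\!\bigl(-\tfrac{1}{2\sigma^2}\sum_j w_j^2\bigr)$. Then

$$U = E - \ln P(w) = E + \frac{1}{2\sigma^2}\sum_{j=1}^{Q} w_j^2 + \text{const}
\;=\; E + \lambda \sum_{j=1}^{Q} w_j^2 + \text{const},
\qquad \lambda = \frac{1}{2\sigma^2}\;\Longleftrightarrow\;\sigma^2 = \frac{1}{2\lambda}.$$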
Bayesian Inference with Neural Networks

A Bayesian would not pick a single set of network weights to use in a regression or classification model, but would rather compute the posterior distribution on the network weights
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{P(D)}$$
and perform inference by averaging models over this posterior distribution. For example, the predictor for a regression problem takes the form
$$O(x \mid D) = \int O(x; w)\, p(w \mid D)\, dw$$

Needless to say, this is an ambitious program (multimodal posterior, intractable integrals), requiring Monte Carlo techniques or extreme approximations.
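If samples $w_s \sim p(w \mid D)$ were available (e.g. from an MCMC run, which is the hard part and is not shown), the predictive integral would be approximated by a simple average; a minimal sketch under that assumption:

```python
import numpy as np

def bayesian_predictor(x, weight_samples, model):
    """Monte Carlo estimate of O(x|D) = integral of O(x; w) p(w|D) dw.

    weight_samples : list of weight settings w_s drawn from the posterior p(w|D)
                     (obtaining these samples is the expensive, hard part)
    model          : function model(x, w) returning the network output O(x; w)
    """
    return np.mean([model(x, w) for w in weight_samples], axis=0)

# Hypothetical usage with the one-hidden-layer model from the earlier sketches:
# O_hat = bayesian_predictor(x_new, posterior_samples, lambda x, w: forward(x, *w)[0])
```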