ANLY-601
Advanced Pattern Recognition
Spring 2018
L20 — Neural Nets II
MLP Output

Signal propagation (forward pass, bottom-up):

• inputs $x_i$
• hidden units $y_k = g(\mathrm{net}_k)$, with $\mathrm{net}_k = \sum_i w_{ki}\, x_i$
• outputs $O_l = g(\mathrm{net}_l)$, with $\mathrm{net}_l = \sum_k w_{lk}\, y_k$

where $g(\mathrm{net}) = \mathrm{net}$ for regression and $g(\mathrm{net})$ is a sigmoid for classification.
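As a concrete illustration, here is a minimal NumPy sketch of this forward pass for a single hidden layer. The array shapes, the choice of tanh for the hidden activation, and the identity output (regression case) are assumptions for the example, and bias terms are omitted to match the slide's notation.

```python
import numpy as np

def forward(x, W_hidden, W_out, g_hidden=np.tanh, g_out=lambda u: u):
    """Forward pass of a one-hidden-layer MLP (biases omitted).

    x        : input vector, shape (n_inputs,)
    W_hidden : hidden-layer weights w_ki, shape (n_hidden, n_inputs)
    W_out    : output-layer weights w_lk, shape (n_outputs, n_hidden)
    g_out    : identity for regression; use a sigmoid for classification
    """
    net_k = W_hidden @ x          # net_k = sum_i w_ki x_i
    y = g_hidden(net_k)           # hidden activations y_k = g(net_k)
    net_l = W_out @ y             # net_l = sum_k w_lk y_k
    O = g_out(net_l)              # outputs O_l = g(net_l)
    return O, y, net_k, net_l

# Example: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W_hidden = 0.1 * rng.standard_normal((4, 3))
W_out = 0.1 * rng.standard_normal((2, 4))
O, y, net_k, net_l = forward(np.array([0.5, -1.0, 2.0]), W_hidden, W_out)
```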
Gradient Descent in MLP

Cost function as before:
$$E(w) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 = \frac{1}{2D}\sum_{m=1}^{D}\sum_{l=1}^{N}\bigl(t^m_l - O_l(x^m)\bigr)^2,\qquad N = \text{number of outputs}$$

Learning by gradient descent:
$$\Delta w_{ij} = -\eta\,\frac{\partial E(w)}{\partial w_{ij}}$$

Let's calculate the components of the gradient.
MLP Error Gradient

$$E(w) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2,\qquad \Delta w_{ij} = -\eta\,\frac{\partial E(w)}{\partial w_{ij}}$$

1. Derivative with respect to a weight to the output (weight $w_{lk}$ connects hidden node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}}
= \frac{\partial}{\partial w_{lk}}\,\frac{1}{2D}\sum_{m=1}^{D}\sum_{l'=1}^{N}\bigl(t^m_{l'} - O_{l'}(x^m)\bigr)^2$$
$$= \frac{1}{2D}\sum_{m=1}^{D}\sum_{l'=1}^{N} 2\,\bigl(t^m_{l'} - O_{l'}(x^m)\bigr)\Bigl(-\frac{\partial O_{l'}(x^m)}{\partial w_{lk}}\Bigr)$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t^m_l - O_l(x^m)\bigr)\,\frac{\partial O_l(x^m)}{\partial w_{lk}}$$

(only output $O_l$ depends on $w_{lk}$, so the sum over outputs collapses to the single term $l' = l$).
MLP Error Gradient

Derivative with respect to a weight to the output. We have so far:

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\,\frac{\partial O_l(x^m)}{\partial w_{lk}}$$

Using $O_l = g(\mathrm{net}_l)$ and $\mathrm{net}_l = \sum_i w_{li}\, y_i$:

$$\frac{\partial E(w)}{\partial w_{lk}}
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\,\frac{\partial g(\mathrm{net}_l)}{\partial w_{lk}}
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\,\frac{\partial \mathrm{net}_l}{\partial w_{lk}}$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\,\frac{\partial}{\partial w_{lk}}\sum_i w_{li}\, y_i
= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\sum_i \delta_{ik}\, y_i$$
$$= -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\, y_k$$
MLP Error Gradient

1. Derivative with respect to a weight to the output ($w_{lk}$ is the weight from node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\underbrace{\bigl(t_l - O_l(x^m)\bigr)}_{\text{error at node }l}\;\underbrace{g'(\mathrm{net}_l)}_{\substack{\text{slope of activation}\\ \text{function for node }l}}\;\underbrace{y_k}_{\substack{\text{signal from}\\ \text{node }k}}$$
MLP Gradient

2. Derivative with respect to weights to hidden units (weight $w_{ki}$ connects input $x_i$ to hidden node $y_k$; outputs indexed by $n$):

$$\frac{\partial E(w)}{\partial w_{ki}}
= \frac{\partial}{\partial w_{ki}}\,\frac{1}{2D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)^2
= \frac{1}{2D}\sum_{m=1}^{D}\sum_{n=1}^{N} 2\,\bigl(t_n - O_n(x^m)\bigr)\Bigl(-\frac{\partial O_n(x^m)}{\partial w_{ki}}\Bigr)$$

Every output $O_n$ depends on $w_{ki}$ through the hidden unit $y_k$, so by the chain rule

$$= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}} \qquad (1)$$
MLP Gradient

2. Derivative with respect to weights to hidden units. From (1):

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\,\frac{\partial O_n(x^m)}{\partial y_k}\,\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}} \qquad (1)$$

Now look at the two pieces:

$$\frac{\partial y_k(\mathrm{net}_k)}{\partial w_{ki}}
= g'(\mathrm{net}_k)\,\frac{\partial \mathrm{net}_k}{\partial w_{ki}}
= g'(\mathrm{net}_k)\,\frac{\partial}{\partial w_{ki}}\sum_j w_{kj}\, x_j
= g'(\mathrm{net}_k)\, x_i$$

$$\frac{\partial O_n(x)}{\partial y_k}
= \frac{\partial g(\mathrm{net}_n)}{\partial y_k}
= g'(\mathrm{net}_n)\,\frac{\partial \mathrm{net}_n}{\partial y_k}
= g'(\mathrm{net}_n)\,\frac{\partial}{\partial y_k}\sum_{k'} w_{nk'}\, y_{k'}
= g'(\mathrm{net}_n)\, w_{nk}$$

Substitute these into (1).
MLP Gradient

Derivative with respect to weights to hidden units. Substituting the two pieces into (1) gives

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}
\underbrace{\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}}_{\text{pseudo-error at hidden node }k}\;
\underbrace{g'(\mathrm{net}_k)}_{\substack{\text{activation function}\\ \text{slope at node }k}}\;
\underbrace{x_i}_{\substack{\text{signal at}\\ \text{node }i}}$$
MLP Error Gradient

2. Derivative with respect to weights to hidden units:

$$\frac{\partial E(w)}{\partial w_{ki}}
= -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\; g'(\mathrm{net}_k)\; x_i$$

• Pseudo-error at hidden node $k$: the error at output node $n$, multiplied by the slope at output node $n$, passed backwards through the weight $w_{nk}$ from node $n$ to node $k$; hence “error back-propagation”.
• $g'(\mathrm{net}_k)$: slope of the activation function at node $k$.
• $x_i$: signal at node $i$.
Summary: MLP Error Gradients

1. Derivative with respect to a weight to an output ($w_{lk}$, node $k$ to output node $l$):

$$\frac{\partial E(w)}{\partial w_{lk}} = -\frac{1}{D}\sum_{m=1}^{D}\bigl(t_l - O_l(x^m)\bigr)\, g'(\mathrm{net}_l)\, y_k$$

stochastic version:
$$\frac{\partial E(w)}{\partial w_{lk}} = -\bigl(t_l - O_l(x)\bigr)\, g'(\mathrm{net}_l)\, y_k$$

2. Derivative with respect to weights to hidden units ($w_{ki}$, input $i$ to hidden node $k$; $w_{nk}$, hidden node $k$ to output node $n$):

$$\frac{\partial E(w)}{\partial w_{ki}} = -\frac{1}{D}\sum_{m=1}^{D}\sum_{n=1}^{N}\bigl(t_n - O_n(x^m)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\, g'(\mathrm{net}_k)\, x_i$$

stochastic version:
$$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{n=1}^{N}\bigl(t_n - O_n(x)\bigr)\, g'(\mathrm{net}_n)\, w_{nk}\, g'(\mathrm{net}_k)\, x_i$$
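To make these formulas concrete, here is a minimal NumPy sketch that evaluates both gradient expressions for a single training pair (the stochastic versions above). Tanh hidden units and linear (regression) outputs are assumptions for the example, and biases are omitted to match the slide notation.

```python
import numpy as np

def backprop_gradients(x, t, W_hidden, W_out):
    """Stochastic-version gradients dE/dw_lk and dE/dw_ki for one (x, t) pair.

    Assumes tanh hidden units and linear outputs:
        y_k = tanh(net_k),  net_k = sum_i w_ki x_i
        O_l = net_l,        net_l = sum_k w_lk y_k
    """
    # Forward pass
    net_k = W_hidden @ x
    y = np.tanh(net_k)
    O = W_out @ y                          # linear outputs: g'(net_l) = 1

    # Output-layer gradient: dE/dw_lk = -(t_l - O_l) g'(net_l) y_k
    delta_out = (t - O) * 1.0              # error times output slope
    grad_W_out = -np.outer(delta_out, y)

    # Hidden-layer gradient:
    # dE/dw_ki = -[sum_n (t_n - O_n) g'(net_n) w_nk] g'(net_k) x_i
    pseudo_error = W_out.T @ delta_out     # error back-propagated through w_nk
    delta_hidden = pseudo_error * (1.0 - y**2)   # times tanh slope g'(net_k)
    grad_W_hidden = -np.outer(delta_hidden, x)

    return grad_W_out, grad_W_hidden

# Quick finite-difference sanity check on one hidden weight
rng = np.random.default_rng(1)
x, t = rng.standard_normal(3), rng.standard_normal(2)
W_h, W_o = 0.1 * rng.standard_normal((4, 3)), 0.1 * rng.standard_normal((2, 4))
E = lambda Wh, Wo: 0.5 * np.sum((t - Wo @ np.tanh(Wh @ x))**2)
g_out, g_hid = backprop_gradients(x, t, W_h, W_o)
eps = 1e-6
Wh2 = W_h.copy(); Wh2[0, 0] += eps
print(g_hid[0, 0], (E(Wh2, W_o) - E(W_h, W_o)) / eps)   # should roughly agree
```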
Backpropagation Learning Algorithm

Batch Mode (uses ALL the data at each step)

    choose learning rate η
    initialize w_ij                         % usually to “small” random numbers
    while ( ΔE / E > ε )                    % fractional change ε ~ 10^-4 – 10^-6
        calculate the mean square error
            E(w) = (1/2D) Σ_{m=1}^{D} ‖ t^m − O(x^m) ‖²
        calculate all error derivatives and step downhill
            Δw_ij = −η ∂E(w)/∂w_ij
    endwhile
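A runnable sketch of this batch-mode loop, reusing the gradient routine from the earlier sketch, might look as follows. The learning rate, stopping tolerance, data shapes, and the tanh/linear architecture are illustrative assumptions.

```python
import numpy as np

def mse(X, T, W_h, W_o):
    """E(w) = (1/2D) sum_m || t^m - O(x^m) ||^2 over the whole data set."""
    O = (W_o @ np.tanh(W_h @ X.T)).T
    return 0.5 * np.mean(np.sum((T - O)**2, axis=1))

def train_batch(X, T, n_hidden=8, eta=0.05, tol=1e-5, max_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    D, n_in = X.shape
    n_out = T.shape[1]
    W_h = 0.1 * rng.standard_normal((n_hidden, n_in))   # small random init
    W_o = 0.1 * rng.standard_normal((n_out, n_hidden))

    E_old = mse(X, T, W_h, W_o)
    for _ in range(max_iter):
        # Accumulate gradients over ALL data (batch mode), then step downhill
        g_o = np.zeros_like(W_o)
        g_h = np.zeros_like(W_h)
        for x, t in zip(X, T):
            go, gh = backprop_gradients(x, t, W_h, W_o)  # from the earlier sketch
            g_o += go / D
            g_h += gh / D
        W_o -= eta * g_o
        W_h -= eta * g_h

        E_new = mse(X, T, W_h, W_o)
        if abs(E_old - E_new) / E_old < tol:             # fractional-change test
            break
        E_old = E_new
    return W_h, W_o
```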
Backpropagation Learning Algorithm

Stochastic or On-Line Mode (uses ONE input/target pair for each step)

    choose learning rate η
    initialize w_ij                         % usually to “small” random numbers
    while ( ΔE / E > ε )                    % fractional change ε ~ 10^-4 – 10^-6
        calculate the mean square error E(w) = (1/2D) Σ_{m=1}^{D} ‖ t^m − O(x^m) ‖²
        for μ = 1 … D                       % step through the data, or do D random draws with replacement
            change all weights w_ij as  Δw_ij = −η ∂E_μ(w)/∂w_ij   (per-pattern, stochastic-version gradients)
        end for
    endwhile
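For comparison with the batch loop, a minimal sketch of the on-line update, again reusing the earlier gradient routine with tanh/linear units. A fixed epoch count replaces the fractional-change test for brevity, and the shuffling strategy and learning rate are assumptions.

```python
import numpy as np

def train_online(X, T, n_hidden=8, eta=0.05, n_epochs=200, seed=0):
    """On-line (stochastic) backprop: update after every input/target pair."""
    rng = np.random.default_rng(seed)
    D, n_in = X.shape
    n_out = T.shape[1]
    W_h = 0.1 * rng.standard_normal((n_hidden, n_in))
    W_o = 0.1 * rng.standard_normal((n_out, n_hidden))

    for _ in range(n_epochs):
        for mu in rng.permutation(D):        # step through data in random order
            go, gh = backprop_gradients(X[mu], T[mu], W_h, W_o)  # per-pattern grads
            W_o -= eta * go                  # immediate weight change
            W_h -= eta * gh
    return W_h, W_o
```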
Comments

The cost function may not be convex; it can have local optima, some of which may be quite poor. In practice, this is not a show-stopper.

Usually we initialize with random weights close to zero. Then
$$\mathrm{net}_k = \sum_i w_{ki}\, x_i$$
will be small, and
$$g(\mathrm{net}_k) \approx \mathrm{net}_k$$
So early on, the network output will be nearly linear in the input. Non-linearities are added as training continues.
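As a worked step (assuming an odd, zero-centred activation such as $g = \tanh$, so that $g(0) = 0$ and $g'(0) = 1$; for other sigmoids the same argument holds up to an offset and scale):

$$g(\mathrm{net}) = g(0) + g'(0)\,\mathrm{net} + O(\mathrm{net}^3) = \mathrm{net} + O(\mathrm{net}^3) \approx \mathrm{net} \quad\text{for small } \mathrm{net}.$$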
Comments

Learning algorithms are simply optimization methods: we are trying to find the w that minimizes E(w). Several other optimization methods, both classical and novel, are brought to bear on the problem.

Deep networks (several layers of sigmoidal hidden nodes) can be very slow to train; the gradient with respect to weights near the inputs contains multiple factors of g′, which shrinks the gradient signal. (And the Hessian of E becomes poorly conditioned.)
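To see where the repeated g′ factors come from, extend the hidden-unit derivation above to a weight $w^{(1)}_{ki}$ in the first of several layers; schematically (indices suppressed, one path shown):

$$\frac{\partial E}{\partial w^{(1)}_{ki}} \;\propto\; (t - O)\; g'(\mathrm{net}^{(L)})\, w^{(L)}\; g'(\mathrm{net}^{(L-1)})\, w^{(L-1)} \cdots g'(\mathrm{net}^{(2)})\, w^{(2)}\; g'(\mathrm{net}^{(1)})\, x_i$$

Each layer contributes a factor $g'(\mathrm{net})$, and since $|g'| \le 1$ for the usual sigmoids (and much smaller away from zero), the product can become very small as depth grows.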
Power

Universal approximation theorem
– Any continuous function on a compact subset of the input space (closed and bounded) can be approximated arbitrarily closely by a feedforward network with one layer of sigmoidal hidden units and linear output units. That is, weighted sums of sigmoidal functions of the inputs are universal approximators:
$$O(x) = \sum_{j=1}^{n_h} w_j\, \sigma\Bigl(\sum_i w_{ji}\, x_i\Bigr)$$
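A direct NumPy transcription of this functional form; the logistic σ is one common choice of sigmoid, and bias terms are omitted as in the formula above.

```python
import numpy as np

def sigma(u):
    """Logistic sigmoid, one common choice for the hidden nonlinearity."""
    return 1.0 / (1.0 + np.exp(-u))

def O(x, w_out, W_hidden):
    """O(x) = sum_j w_j * sigma(sum_i w_ji x_i): one sigmoidal hidden layer,
    linear output, i.e. the universal-approximator form from the slide."""
    return w_out @ sigma(W_hidden @ x)

# Example with n_h = 5 hidden units and a 2-dimensional input
rng = np.random.default_rng(0)
W_hidden = rng.standard_normal((5, 2))   # w_ji
w_out = rng.standard_normal(5)           # w_j
print(O(np.array([0.3, -0.7]), w_out, W_hidden))
```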
Power

• Approximation Accuracy
– The magnitude of the approximation error decreases with increasing number $n_h$ of hidden units as
$$E \sim \mathrm{Order}(1/n_h)$$
– Techniques linear in the parameters (fixed basis functions with only their weighting fit),
$$O(x) = \sum_{i=1}^{N} w_i\, \varphi_i(x),$$
only achieve error bounded by
$$\mathrm{Order}\bigl(1/n^{2/d}\bigr)$$
where $d$ is the dimension of the input space: the CURSE OF DIMENSIONALITY.
Inductive Bias

The hypothesis space is the continuous weight space! It is hard to characterize the inductive bias.

Bias can be imposed by adding a regularizer to the cost function
$$E(w, \lambda) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 + \lambda\, F(w)$$
where $F(w)$ carries the desired bias, and $\lambda$ characterizes the strength with which the bias is imposed.
Inductive Bias

Bias can be imposed by adding a regularizer to the cost function
$$E(w, \lambda) = \frac{1}{2D}\sum_{m=1}^{D}\bigl\|\mathbf{t}^m - \mathbf{O}(x^m)\bigr\|^2 + \lambda\, F(w)$$
where $F(w)$ carries the desired bias, and $\lambda$ characterizes the strength with which the bias is imposed.

Examples:
– small weights (L2 norm, weight decay):
$$F(w) = \|w\|^2 = \sum_i w_i^2$$
– small curvature:
$$F(w) = \int \Bigl(\frac{\partial^2 O(x)}{\partial x^2}\Bigr)^{2} dx$$
– sparse models (L1 norm):
$$F(w) = \|w\|_1 = \sum_i |w_i|$$
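A small sketch of how the first and third regularizers enter the cost in practice (the curvature penalty needs the model's second derivatives, so it is omitted here); the value of λ and the weight arrays are illustrative.

```python
import numpy as np

def regularized_cost(data_cost, weights, lam, kind="l2"):
    """E(w, lambda) = data_cost + lambda * F(w) for the L2 or L1 penalty."""
    w = np.concatenate([W.ravel() for W in weights])
    if kind == "l2":
        F = np.sum(w**2)          # small weights (weight decay)
    elif kind == "l1":
        F = np.sum(np.abs(w))     # sparse models
    else:
        raise ValueError("kind must be 'l2' or 'l1'")
    return data_cost + lam * F

# Example: combine an MSE of 0.42 with weight decay of strength 0.01
W_h, W_o = np.ones((4, 3)), np.ones((2, 4))
print(regularized_cost(0.42, [W_h, W_o], lam=0.01, kind="l2"))
```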
Generalization

Overfitting / Underfitting
• We've been talking about fitting the network function to the training data $\{x^\mu, t^\mu\}$, $\mu = 1 \ldots D$.
• But we really care about the performance of the network on unseen data.
Generalization

Overfitting / Underfitting
• We can build a network with a very large hidden layer, train it to the bottom of a deep local optimum, and very closely approximate the training data. This may be precisely the wrong thing to do.
• The question is “how well does the model generalize?” What is the average error on infinitely large test sets? (That is, what is the statistical expectation of the error on the population?) This is called the generalization error.
Overfitting

[Figure: the 20 training points (x vs. y) with the six network fits overlaid, labeled by the number of hidden units: 1, 2, 3, 5, 10, 20.]

Regression problem, 20 training data points, six different neural network fits with 1, 2, 3, 5, 10, and 20 hidden units.
MSE_train = [ 0.255 0.146 0.074 0.025 0.019 0.014 ]
Overfitting

[Figure: the same six network fits (x vs. y) overlaid on the test data.]

Here the fits to the training data are overlaid on test or out-of-sample data (i.e. data not used during the fitting).
MSE_test = [ 0.121 0.203 0.505 0.126 0.216 0.213 ]
Fixed Training Set Size, Variable Model Size
Expected Behavior

[Figure: sketch of MSE versus model size (number of hidden units), for models trained to the bottom of a deep local optimum, showing a training error curve and a generalization error curve with underfitting and overfitting regimes marked.]
Fixed Model Size, Variable Training Set Size
Expected Behavior

[Figure: sketch of MSE versus training set size, for a model trained to the bottom of a deep local optimum, showing a training error curve and a generalization error curve.]
Fixed Model and Training Data
Regularized Cost Function
Expected Behavior

[Figure: sketch of MSE versus 1/λ (inverse regularizer strength), showing a training error curve and a generalization error curve with underfitting and overfitting regimes marked.]
Probabilistic Interpretations: Regression

Recall Lecture 8, page 9. The best possible function for minimizing the MSE in a regression problem (curve fitting) is the conditional mean of the target at each point $x$:
$$O(x) \approx h(x) = E[\,t \mid x\,] = \int t\; p_{t|x}(t \mid x)\, dt$$

Since NNs are universal approximators, we expect that if we have
• a large enough network (hidden nodes)
• enough data
our network trained to minimize the MSE will have outputs $O(x)$ that approximate the regressor $E[\,t \mid x\,]$.

Typically, regression networks have linear output nodes (but sigmoidal hidden nodes).
Probabilistic Interpretations: Classification

Consider a classification problem with L classes. (Each sample is a member of one and only one class.) The usual NN for such a problem has L output nodes with targets $t_i$, $i = 1 \ldots L$:
$$t_i(x) = \begin{cases} 1 & \text{if } x \in \omega_i \\ 0 & \text{if } x \notin \omega_i \end{cases}$$
e.g. for a 4-class problem, for an input $x$ from class 2, the target outputs are [ 0 1 0 0 ]. This is called a one-of-many representation.
Probabilistic Interpretations: Classification

A simple extension of the result of Lecture 8, page 13 says that the best possible function for minimizing the MSE of the output for such a representation is to have each output equal the class posterior:
$$O_i(x) = E[\,t_i \mid x\,] = \sum_{t_i \in \{0,1\}} t_i\; p_{t_i|x}(t_i \mid x) = p(\omega_i \mid x)$$

Typically, networks trained for classification use a sigmoidal output function, usually the logistic
$$g(u) = \frac{1}{1 + \exp(-u)}$$
which is naturally bounded to the range [0, 1].
Probabilistic Interpretations: Classification

A similar extension of the result in Lecture 8, page 14 says that the absolute minimum of the cross-entropy error measure
$$E = -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L}\Bigl[\, t_l(x_n)\,\ln\frac{O_l(x_n)}{t_l(x_n)} \;+\; \bigl(1 - t_l(x_n)\bigr)\,\ln\frac{1 - O_l(x_n)}{1 - t_l(x_n)} \,\Bigr]$$
is attained when each network output equals the class posterior.

Networks trained to minimize the cross-entropy typically use a logistic output
$$O_l(u_l) = \frac{1}{1 + \exp(-u_l)}$$

Notice that this setup is also useful when an object can belong to several classes simultaneously (e.g. medical diagnosis).
Probabilistic Interpretations: Classification

Finally, for multiclass problems, a third cost function emerges. It is based on the multinomial distribution for the target values and is exclusively for the case where each object is a member of one and only one class:
$$p\bigl(\{t_1, \ldots, t_L\} \mid x\bigr) = \prod_{l=1}^{L} O_l(x)^{\,t_l(x)}$$

Taking the negative log-likelihood for a set of examples $x_n$, $n = 1 \ldots N$, results in the cost function
$$E = -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L} t_l(x_n)\,\ln O_l(x_n)
= -\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L} t_l(x_n)\,\ln\frac{O_l(x_n)}{t_l(x_n)}$$
(the two forms agree because the targets are 0/1, so the extra $t \ln t$ terms vanish).
Probabilistic Interpretations: Classification

Networks trained with this cost function (‘Cross-entropy 2’) typically use the soft-max activation
$$O_l(u_l) = \frac{\exp(u_l)}{\sum_{l'=1}^{L}\exp(u_{l'})}$$
which is naturally bounded in the interval [0, 1].

Note also that $\sum_{l=1}^{L} O_l(u_l) = 1$, as must be the case when each object belongs to one and only one class.
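A short NumPy sketch of the soft-max output and the multinomial (‘Cross-entropy 2’) cost; the max-subtraction is a standard numerical-stability trick added for the example, not something from the slides.

```python
import numpy as np

def softmax(u):
    """O_l = exp(u_l) / sum_l' exp(u_l'); rows of u are samples."""
    z = np.exp(u - np.max(u, axis=-1, keepdims=True))  # stability shift
    return z / np.sum(z, axis=-1, keepdims=True)

def cross_entropy_2(T, O, eps=1e-12):
    """E = -(1/N) sum_n sum_l t_l(x_n) ln O_l(x_n), for one-of-many targets T."""
    return -np.mean(np.sum(T * np.log(O + eps), axis=1))

# Outputs sum to 1 across classes, as required for one-and-only-one-class membership
U = np.array([[2.0, -1.0, 0.5], [0.1, 0.1, 3.0]])
O = softmax(U)
print(O.sum(axis=1))                       # -> [1. 1.]
print(cross_entropy_2(np.array([[1, 0, 0], [0, 0, 1]]), O))
```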
Probabilistic Interpretations: Classification

Since NNs are universal approximators, we expect that for
• large enough networks (hidden nodes)
• enough data
a classifier NN trained to minimize either the MSE or the cross-entropy error measure will have network outputs that approximate the class posteriors.

This is the usual interpretation of the NN classifier outputs, but care is essential, since the prerequisites (large networks and enough data) may not be met.
Weight Estimation – Maximum Likelihood

Training a neural network is an estimation problem, where the parameters being estimated are the weights in the network. Recalling the results in Lecture 10, pages 9 and 10:
– Minimizing the MSE is equivalent to maximum likelihood estimation under the assumption of targets with a Gaussian distribution (as usually assumed for regression problems).
– Minimizing the CROSS-ENTROPY error is equivalent to maximum likelihood estimation under the assumption of targets with a Bernoulli distribution (as usually assumed for classification problems).
Weight Estimation – Maximum Likelihood

– Minimizing CROSS-ENTROPY 2 is equivalent to maximum likelihood estimation under the assumption of targets with a multinomial distribution, as given on p. 30.
Weight Estimation – MAP

Following the earlier discussion of estimation methods, we can introduce a prior over the network weights, p(w), and move from the ML estimate to the MAP estimate.

This change is mirrored by the change from a cost function to a regularized cost function
$$U = E - \ln\bigl(P(w)\bigr)$$

Regularizers help improve generalization by reducing the variance of the estimates of the network weights. They do so at the price of introducing bias into the weight estimates.
Weight Estimation – MAP

An often-used regularizer is weight decay
$$U = E + \lambda \sum_{j=1}^{Q} w_j^2$$
which is equivalent to a Gaussian prior on the weights with mean zero and (spherically symmetric) covariance $\Sigma = \frac{1}{2\lambda} I$.
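As a worked step connecting the two views (a sketch; constants independent of w are dropped): take a zero-mean spherical Gaussian prior $p(w) \propto \exp\!\bigl(-\tfrac{1}{2\sigma^2}\sum_j w_j^2\bigr)$. Then

$$U = E - \ln P(w) = E + \frac{1}{2\sigma^2}\sum_{j=1}^{Q} w_j^2 + \text{const}
\;=\; E + \lambda \sum_{j=1}^{Q} w_j^2 + \text{const},
\qquad \lambda = \frac{1}{2\sigma^2}\;\Longleftrightarrow\;\sigma^2 = \frac{1}{2\lambda}.$$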
Bayesian Inference with Neural Networks

A Bayesian would not pick a single set of network weights to use in a regression or classification model, but would rather compute the posterior distribution on the network weights
$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{P(D)}$$
and perform inference by averaging models over this posterior distribution. For example, the predictor for a regression problem takes the form
$$O(x \mid D) = \int O(x; w)\, p(w \mid D)\, dw$$

Needless to say, this is an ambitious program (multimodal posterior, intractable integrals), requiring Monte Carlo techniques or extreme approximations.
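If samples $w_s \sim p(w \mid D)$ were available (e.g. from an MCMC run, which is the hard part and is not shown), the predictive integral would be approximated by a simple average; a minimal sketch under that assumption:

```python
import numpy as np

def bayesian_predictor(x, weight_samples, model):
    """Monte Carlo estimate of O(x|D) = integral of O(x; w) p(w|D) dw.

    weight_samples : list of weight settings w_s drawn from the posterior p(w|D)
                     (obtaining these samples is the expensive, hard part)
    model          : function model(x, w) returning the network output O(x; w)
    """
    return np.mean([model(x, w) for w in weight_samples], axis=0)

# Hypothetical usage with the one-hidden-layer model from the earlier sketches:
# O_hat = bayesian_predictor(x_new, posterior_samples, lambda x, w: forward(x, *w)[0])
```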