Machine Learning Lecture:
Multi-Layer ANNs
C.-C. Hung
Slides used in the classroom only
Reference
Simon Haykin, Neural Networks: A Comprehensive Foundation, IEEE Press, 1994
Lecture overview
Recall Perceptron
Multi-Layer ANNs
Backpropagation Neural Networks (BNN)
Training Backpropagation in ANNs
Recap: Can a single neuron learn a task?
In 1958, Frank Rosenblatt introduced a training algorithm that provided the first procedure for training a simple ANN: a perceptron.
The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter.
Two-input perceptron
Multi-Layer Networks with Backpropagation
Multi-Layer Networks
Built from Perceptron Units
Perceptrons not able to learn certain concepts
Can only learn linearly separable functions
But they can be the basis for larger structures
Which can learn more sophisticated concepts
Say that the networks have “perceptron units”
Neurons in Multi-Layer Networks
The learning rule relies on differential calculus
Finding minima by differentiating, etc.
Step functions aren’t differentiable
They are not continuous at the threshold
Alternative threshold function sought
Must be differentiable
Must be similar to step function
i.e., exhibit a threshold so that units can “fire” or not fire
Sigmoid units used for backpropagation
There are other alternatives that are often used
Sigmoid Units
Take in the weighted sum of inputs, S, and output: σ(S) = 1/(1 + e^-S)
Advantages:
Looks very similar to the step function
Is differentiable
Derivative easily expressible in terms of σ itself: σ'(S) = σ(S)(1 - σ(S))
Backpropagation Concept
Backpropagation
Simple Back-Prop Example
Repeated Subexpressions
Example: ANN with Sigmoid Units
Feedforward network
Feed inputs in on the left (input layer), propagate numbers forward
Suppose we have this ANN
With weights set arbitrarily
Input layer
Hidden layer
Output layer
Propagation of Example
Suppose input to ANN is 10, 30, 20
First calculate weighted sums to hidden layer:
SH1 = (0.2*10) + (-0.1*30) + (0.4*20) = 2-3+8 = 7
SH2 = (0.7*10) + (-1.2*30) + (1.2*20) = 7 - 36 + 24 = -5
Next calculate the output from the hidden layer:
Using: σ(S) = 1/(1 + e^-S)
σ(SH1) = 1/(1 + e^-7) = 1/(1 + 0.000912) = 0.999
σ(SH2) = 1/(1 + e^5) = 1/(1 + 148.4) = 0.0067
So, H1 has fired, H2 has not
Propagation of Example
Next calculate the weighted sums into the output layer:
SO1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996
SO2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047
Finally, calculate the output from the output layer (i.e. the ANN)
σ(SO1) = 1/(1 + e^-1.0996) = 1/(1 + 0.333) = 0.750
σ(SO2) = 1/(1 + e^-3.1047) = 1/(1 + 0.045) = 0.957
Output from O2 > output from O1
So, the ANN predicts category associated with O2 for the example input (10, 30, 20)
Backpropagation Learning Algorithm
Same task as in perceptron
Learn a multi-layer ANN to correctly categorise unseen examples (We’ll concentrate on ANNs with one hidden layer)
Overview of the routine
Fix architecture and sigmoid units within architecture
i.e., number of units in the hidden layer; the way the input units represent the example; the way the output units categorise examples
Randomly assign weights to the whole network
Use small values (between –0.5 and 0.5)
Use each example in the set to retrain the weights
Have multiple epochs (iterations through training set)
Until some termination condition is met (not necessarily 100% accuracy)
Weight Training Calculations (Overview)
Use notation wij to specify:
Weight between unit i and unit j (i → j)
Look at the calculation with respect to example E
Going to calculate a value Δij for each wij
And add Δij on to wij
Do this by calculating error terms for each unit
The error term for output units is found
And then this information is used to calculate the error terms for the hidden units
So, the error is propagated back through the ANN
Propagate E through the Network
Feed E through the network (as in example above)
Record the target and observed values for example E
i.e., determine weighted sum from hidden units, do sigmoid calculation
Let ti(E) be the target value for output unit i
Let oi(E) be the observed value for output unit i
Note that for categorisation learning tasks,
Each ti(E) will be 0, except for a single tj(E), which will be 1 (represents the class)
But oi(E) will be a real valued number between 0 and 1
Also record the outputs from the hidden units
Let hi(E) be the output from hidden unit i
Error terms for each unit
The Error Term for output unit k is calculated as: δOk = ok(E)(1 - ok(E))(tk(E) - ok(E))
The Error Term for hidden unit k is: δHk = hk(E)(1 - hk(E)) · Σj (wkj · δOj)
In English:
For hidden unit h, add together all the errors for the output units, multiplied by the appropriate weight.
Then multiply this sum by hk(E)(1 – hk(E))
Note: do not confuse the notations here: δOk is the error term for output unit k, δHk for hidden unit k.
Final Calculations
Choose a learning rate, η (= 0.1 again, perhaps)
For each weight wij between input unit i and hidden unit j, calculate:
Δij = η · δHj · xi
where xi is the input to input unit i for E
For each weight wij between hidden unit i and output unit j, calculate:
Δij = η · δOj · hi(E)
where hi(E) is the output from hidden unit i for E
Finally, add on each Δij on to wij
Back-Propagation Algorithm (1/4):
pages 209-210 in the book Image Texture Analysis by Hung et al.
Back-Propagation Algorithm (2/4)
Back-Propagation Algorithm (3/4)
Back-Propagation Algorithm (4/4)
Worked Backpropagation Example
Start with the previous ANN
We will retrain the weights
In the light of example E = (10, 30, 20)
Stipulate that E should have been categorised as O1
Will use a learning rate of η = 0.1
Previous Calculations
Need the calculations from when we propagated E through the ANN:
t1(E) = 1 and t2(E) = 0 [from categorisation]
o1(E) = 0.750 and o2(E) = 0.957
Error Values for Output Units
t1(E) = 1 and t2(E) = 0 [from categorisation]
o1(E) = 0.750 and o2(E) = 0.957
So:
δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750 * 0.250 * 0.250 = 0.0469
δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957 * 0.043 * (0 - 0.957) = -0.0394
Error Values for Hidden Units
δO1 = 0.0469 and δO2 = -0.0394
h1(E) = 0.999 and h2(E) = 0.0067 (previous slide)
So, for H1, we add together:
(w11*δO1) + (w12*δO2) = (1.1*0.0469) + (3.1*-0.0394) = -0.0706
And multiply by h1(E)(1 - h1(E)) to give us:
-0.0706 * (0.999 * (1 - 0.999)) = -0.0000705 = δH1
For H2, we add together:
(w21*δO1) + (w22*δO2) = (0.1*0.0469) + (1.17*-0.0394) = -0.0414
And multiply by h2(E)(1 - h2(E)) to give us:
-0.0414 * (0.0067 * (1 - 0.0067)) = -0.000276 = δH2
Calculation of Weight Changes
For weights between the input and hidden layer
Δij = η · δHj · xi for each input-to-hidden weight (e.g., for the weight from input 1 to H1: 0.1 * δH1 * 10 = -0.0000705)
Note: multiply by the learning rate and by the input value
Calculation of Weight Changes
For weights between hidden and output layer:
Δij = η · δOj · hi(E), using the hi(E) values from earlier (Slide 28)
Weight changes are not very large
Small differences in weights can make big differences in calculations
But it might be a good idea to increase η
Calculation of Network Error
Could calculate Network error as
Proportion of mis-categorised examples
But there are multiple output units, with numerical output
So we use a more sophisticated measure: E = ΣE Σi (ti(E) - oi(E))²
Not as complicated as it looks
Square the difference between target and observed
Squaring ensures we get a positive number
Add up all the squared differences
For every output unit and every example in training set
Problems with Local Minima
Backpropagation is gradient descent search
Where the height of the hills is determined by error
But there are many dimensions to the space
One for each weight in the network
Therefore backpropagation
Can find its way into local minima
One partial solution:
Random re-start: learn lots of networks
Starting with different random weight settings
Can take best network
Or can set up a “committee” of networks to categorise examples
Another partial solution: Momentum
Adding Momentum
Imagine rolling a ball down a hill
[Figure: two error surfaces. Without momentum the ball gets stuck in a small dip; with momentum it rolls through to the deeper minimum]
Momentum in Backpropagation
For each weight
Remember what was added in the previous epoch
In the current epoch
Add on a small amount of the previous Δ
The amount is determined by the momentum parameter, denoted α
α is taken to be between 0 and 1, giving: Δij(t) = η · δj · xi + α · Δij(t - 1), as sketched below
How Momentum Works
If direction of the weight doesn’t change
Then the movement of search gets bigger
The extra amount compounds from epoch to epoch
May mean that narrow local minima are avoided
May also mean that the convergence rate speeds up
Caution:
May not have enough momentum to get out of local minima
Also, too much momentum might carry search
Back out of the global minimum, into a local minimum
Problems with Overfitting
Plot training example error versus test example error:
Test set error is increasing!
Network is overfitting the data
Learning idiosyncrasies in the data, not general principles
Big problem in Machine Learning (ANNs in particular)
Underfitting: a model too simple for the two features (lightness and width) of sea bass and salmon.
Overfitting: overly complex models for the fish lead to complicated decision boundaries
The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier
Stochastic Gradient Descent (SGD)
Calculates the error and updates the model for each example in the training dataset.
Ref: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
Batch Gradient Descent
Calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
Mini-Batch Gradient Descent
Splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch or to take its average, which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
It is the most common implementation of gradient descent used in the field of deep learning.
Overfitting for Deep Learning
Solutions for overfitting in neural networks:
Regularization
Dropout/DropConnect
Ref: http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
Avoiding Overfitting
Bad idea to use training set accuracy to terminate
One alternative: Use a validation set
Hold back some of the training set during training
Like a miniature test set (not used to train weights at all)
If the validation set error stops decreasing, but the training set error continues decreasing
Then it’s likely that overfitting has started to occur, so stop
Be careful, because the validation set error could get into a local minimum itself
It is worthwhile running the training for longer, to wait and see
Another alternative: use a weight decay factor
Take a small amount off every weight after each epoch
Networks with smaller weights aren't as highly fine-tuned (overfitted)
Suitable Problems for ANNs
Examples and target categorisation
Can be expressed as real values
ANNs are just fancy numerical functions
Predictive accuracy is more important
Than understanding what the machine has learned
Black box non-symbolic approach, not easy to digest
Slow training times are OK
Can take hours and days to train networks
Execution of learned function must be quick
Learned networks can categorise very quickly
Very useful in time critical situations (is that a tank, car or old lady?)
Noise: ANNs are fairly robust to noise in the data
Key concepts
Perceptron
Epoch
Gradient Descent Method
Momentum Term
Back-Propagation Learning Algorithm
Hidden Layers
Recurrent Neural Networks
Questions & Suggestions?
The End
Appendix
Background: The Gradient Notion
Example: f : R² → R, f(x, y) := cos(x/2) · cos(y/2)
Directional Derivatives: first, the one-dimension derivative: f'(x) = lim h→0 [f(x + h) - f(x)] / h
Directional Derivatives: Along the Axes…
∂f/∂x (x, y) and ∂f/∂y (x, y)
Directional Derivatives: In a general direction…
∂f/∂v (x, y), for v ∈ R² with ||v|| = 1
Directional Derivatives
The Gradient: Definition in the plane
∇f(x, y) := (∂f/∂x (x, y), ∂f/∂y (x, y)), for f : R² → R
The Gradient: Definition in general
∇f(x1, …, xn) := (∂f/∂x1, …, ∂f/∂xn), for f : Rⁿ → R
Learning in biological systems
Learning = learning by adaptation
The young animal learns that the green fruits are sour, while the yellowish/reddish ones are sweet. The learning happens by adapting the fruit picking behavior.
At the neural level, learning happens by changing the synaptic strengths, eliminating some synapses, and building new ones.
Learning as optimisation
The objective of adapting the responses on the basis of the information received from the environment is to achieve a better state. E.g., the animal likes to eat many energy rich, juicy fruits that make its stomach full, and makes it feel happy.
In other words, the objective of learning in biological organisms is to optimise the amount of available resources, happiness, or in general to achieve a closer to optimal state.
Learning in biological neural networks
The learning rules of Hebb:
synchronous activation increases the synaptic strength;
asynchronous activation decreases the synaptic strength.
These rules fit with energy minimization principles.
Maintaining synaptic strength needs energy; it should be maintained where it is needed, and not maintained where it isn't.
Learning principle for artificial neural networks
ENERGY MINIMIZATION
We need an appropriate definition of energy for artificial neural networks; with that, we can use mathematical optimisation techniques to find how to change the weights of the synaptic connections between neurons.
ENERGY = measure of task performance error
Neural network mathematics
[Figure: feedforward network diagram, inputs on the left, output on the right]
Neural network mathematics
Neural network: input / output transformation yout = F(x, W)
Layer by layer (for the network shown): y1i = f(x, w1i), i = 1…4; y2i = f(y1, w2i), i = 1…3; yout = f(y2, w31)
W is the matrix of all weight vectors.
MLP neural networks
MLP = multi-layer perceptron
Perceptron: yout = wT·x
MLP neural network (here with layer sizes 3, 2, 1):
y1k = 1/(1 + exp(-w1kT·x - a1k)), k = 1, 2, 3; y1 = (y11, y12, y13)T
y2k = 1/(1 + exp(-w2kT·y1 - a2k)), k = 1, 2; y2 = (y21, y22)T
yout = w3T·y2
RBF neural networks
RBF = radial basis function
Example: Gaussian RBF network with one output:
yout = Σk w2k · exp(-||x - ak||² / 2σk²)
Neural network tasks
control
classification
prediction
approximation
These can be reformulated in general as FUNCTION APPROXIMATION tasks.
Approximation: given a set of values of a function g(x) build a neural network that approximates the g(x) values for any input x.
Neural network approximation
Task specification:
Data: set of value pairs: (xt, yt), yt=g(xt) + zt; zt is random measurement noise.
Objective: find a neural network that represents the input / output transformation (a function) F(x,W) such that
F(x,W) approximates g(x) for every x
Learning to approximate
Error measure: E = (1/N) · Σt=1..N (F(xt; W) - yt)²
Rule for changing the synaptic weights: Δwij = -c · ∂E/∂wij, and wij_new = wij + Δwij
c is the learning parameter (usually a constant)
Learning with a perceptron
Perceptron: yout = wT·x
Data: (x1, y1), (x2, y2), …, (xN, yN)
Error: E(t) = (yout(t) - yt)² = (w(t)T·xt - yt)²
Learning: wi(t + 1) = wi(t) - c · ∂E(t)/∂wi = wi(t) + 2c · (yt - w(t)T·xt) · xt,i
A perceptron is able to learn a linear function.
Learning with RBF neural networks
RBF neural network: yout = Σk=1..M w2k · exp(-||x - ak||² / 2σk²)
Data: (x1, y1), …, (xN, yN)
Error: E(t) = (yout(t) - yt)²
Learning: w2i(t + 1) = w2i(t) - c · ∂E(t)/∂w2i = w2i(t) + 2c · (yt - yout(t)) · exp(-||xt - ai||² / 2σi²)
Only the synaptic weights of the output neuron are modified.
An RBF neural network learns a nonlinear function.
Learning with MLP neural networks
MLP neural network: yout = F(x; W), built from p layers of sigmoidal units (as above)
Data: (x1, y1), …, (xN, yN)
Error: E(t) = (yout(t) - yt)² = (F(xt; W) - yt)²
It is very complicated to calculate the weight changes.
[Figure: MLP with layers 1, 2, …, p-1, p, mapping input x to output yout]
Learning with general optimisation
Synaptic weight change rule for the output neuron: wpi(t + 1) = wpi(t) + 2c · (yt - F(xt; W(t))) · yp-1,i(t)
Synaptic weight change rules for the neurons of the hidden layers: obtained by applying the chain rule to ∂E/∂w, propagating the error backwards layer by layer; this is what the backpropagation algorithm computes
Symbol-to-Symbol Differentiation
Forward-propagate
Back-propagate
Update Parameters
Backpropagation
Ref: http://neuralnetworksanddeeplearning.com/chap2.html
Backpropagation Algorithm
[Figure: two-input perceptron. Inputs x1 and x2 with weights w1 and w2 feed a linear combiner; a hard limiter with threshold θ produces output Y]