
COMP3308/3608, Lecture 8b & 9a
ARTIFICIAL INTELLIGENCE
Multilayer Neural Networks and Backpropagation Algorithm
References: Russell and Norvig, pp. 731-737; Witten, Frank and Hall, pp. 232-241
Irena Koprinska, irena.koprinska@sydney.edu.au COMP3308/3608 AI, week 8b & 9a, 2021

Outline
• Multilayer perceptron NNs
  • XOR problem
  • Neuron model
• Backpropagation algorithm
  • Derivation
  • Example
• Universality – Cybenko's theorems
• Preventing overfitting
• Modifications
  • Momentum
  • Learning rate
• Limitations and capabilities
• Interesting applications

XOR problem – Example
• The XOR problem is not linearly separable and, hence, cannot be solved by a single layer perceptron network
• But it can be solved by a multilayer perceptron network:
• The 2 perceptrons in the input layer identify linearly separable parts, and their outputs are combined by another perceptron to form the final solution
p1 = [0, 0]ᵀ, t1 = 0;  p2 = [0, 1]ᵀ, t2 = 1;  p3 = [1, 0]ᵀ, t3 = 1;  p4 = [1, 1]ᵀ, t4 = 0

• Verify that the network with the weights shown in the diagram solves the XOR problem!
Image from Hagan, Demuth & Beale
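The weights from the slide's diagram are not reproduced in this text, so the sketch below uses one valid choice of weights (my own, not necessarily the ones in the figure): the two first-layer perceptrons compute OR and AND of the inputs, and the second-layer perceptron fires when OR is on but AND is off, which is exactly XOR.

```python
import numpy as np

def hardlim(x):
    """Hard-limit (step) transfer function used by the classic perceptron."""
    return (x >= 0).astype(float)

# Layer 1: two perceptrons that carve out linearly separable parts.
# Hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ AND(x1, x2).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Layer 2: one perceptron combining the two boundaries: fire when OR is on but AND is off.
W2 = np.array([[1.0, -1.0]])
b2 = np.array([-0.5])

def forward(p):
    h = hardlim(W1 @ p + b1)      # first-layer outputs
    return hardlim(W2 @ h + b2)   # combined decision

for p, t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    a = forward(np.array(p, dtype=float))[0]
    print(p, "->", int(a), "target", t)
```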

Boundary Investigation for the XOR Problem
• 1st layer boundaries:
  1st layer, 1st neuron: 1st decision boundary
  1st layer, 2nd neuron: 2nd decision boundary
• 2nd layer combines the two boundaries together:
  2nd layer, 1st neuron: combined boundary
Images from Hagan, Demuth & Beale

Can We Train Such a Network?
• There exists a 2-layer perceptron network capable of solving the XOR task!
• Rosenblatt and Widrow (who proposed the perceptron) were aware of this
• But how can we train such a network to learn from examples?
• Rosenblatt and Widrow were not able to successfully modify the
perceptron’s rule to train these more complex networks
• 1969 – book “Perceptrons” by Minsky and Papert
• Discusses the limitations of the perceptrons (can solve linearly separable problems only)
• “There is no reason to suppose that any of the virtues of perceptrons carry over to the many-layered version”
• => The majority of the scientific community walked away from the field of NNs and funding was suspended

Discovery of the Backpropagation Algorithm
• An algorithm for training multi-layer perceptron networks
• Proposed by Paul Werbos in 1974
  • His thesis presented the algorithm in the context of general networks, with NNs as a special case, and was not disseminated in the NN community
• Rediscovered by David Rumelhart, Geoffrey Hinton and Ronald Williams in 1986; David Parker in 1985; Yann LeCun in 1985
• Book: Parallel Distributed Processing (1986) by Rumelhart and McClelland
Multilayer NNs trained with the backpropagation algorithm – the most widely used NNs
The backpropagation algorithm is also used to train deep NNs

Problem Statement – Supervised Learning:
• Given: A labeled training dataset
• Input vectors associated with target classes
• Task: Build a backpropagation NN that encapsulates this knowledge (i.e. maps the inputs to the target classes) and can be used predictively
• We would like the NN to predict well on new input vectors (unseen during training)
Training data:
  sepal length | sepal width | petal length | petal width | iris type
  5.1          | 3.5         | 1.4          | 0.2         | Iris setosa
  4.9          | 3.0         | 1.4          | 0.2         | Iris setosa
  6.4          | 3.2         | 4.5          | 1.5         | Iris versicolor
  6.9          | 3.1         | 4.9          | 1.5         | Iris versicolor
  6.3          | 3.3         | 6.0          | 2.5         | Iris virginica
  5.8          | 2.7         | 5.1          | 1.9         | Iris virginica

Build the classifier (i.e. train the NN), then get the prediction for new data:

New data:
  sepal length | sepal width | petal length | petal width | iris type
  4.1          | 3.6         | 1.6          | 0.3         | ?
  5.9          | 3.1         | 2.4          | 0.1         | ?
  7.4          | 3.3         | 3.5          | 1.2         | ?

Learning in NN (review)
• The “intelligence” of a NN is in the values of its weights
• The weights are the parameters of a NN
• We start from a random initialization and change them using a
learning rule to learn the input-output mapping as specified by the training data
• How to learn the input-output mapping using the backpropagation algorithm?
• We will formulate an error function (e.g. sum of squared errors between the target and actual output) and use a learning rule that minimizes this error

Neural Network Model
• Computational model consisting of 3 parts:
• 1) Architecture – neurons and connections
• Input, hidden, output neurons
• Fully or partially connected
• Neuron model – computation performed by each neuron, type of
transfer function
• Initialization of the weights
• 2) Learning algorithm
• How are the weights of the connections changed in order to
facilitate learning
• Goal for classification tasks: mapping between the input vectors
and their classes
• 3) Recall technique – once the NN training is completed, how is the information obtained from the trained NN?
• E.g. for classification tasks – how is the class of a new example
determined?

Multi-layer Perceptron Network – Architecture
1) A network with 1 or more hidden layers
  • e.g. a NN for the iris data:
  (Figure: the iris network – inputs at the bottom with 1 input neuron for each attribute, 1 hidden layer of hidden neurons, and output neurons at the top with 1 output neuron for each class; the iris training table is shown alongside)
2) A feedforward network – each neuron receives input only from the neurons in the previous layer
3) A fully connected network (typically) – all neurons in a layer are connected with all neurons in the next layer
4) Its weights are initialized to small random values, e.g. [-1,1]

Multi-layer Perceptron Network – Architecture 2
5) Each neuron (except the input neurons) computes the weighted sum of its inputs, and then applies a differentiable transfer function
a = f(wp+b)
• any differentiable transfer function f can be used, i.e. the derivative should exist for every point;
• most frequently used: sigmoid and tan-sigmoid (hyperbolic tangent sigmoid)
  sigmoid:      a = 1 / (1 + e^(−n))
  tan-sigmoid:  a = (e^n − e^(−n)) / (e^n + e^(−n))
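A minimal sketch of these two transfer functions and of the neuron computation a = f(wp + b) in NumPy; the names logsig/tansig and the example numbers are my own choices.

```python
import numpy as np

def logsig(n):
    """Sigmoid (logistic) transfer function: a = 1 / (1 + e^-n)."""
    return 1.0 / (1.0 + np.exp(-n))

def tansig(n):
    """Tan-sigmoid (hyperbolic tangent): a = (e^n - e^-n) / (e^n + e^-n)."""
    return np.tanh(n)

def neuron_output(w, p, b, f=logsig):
    """a = f(w.p + b) for a single neuron with weight vector w and bias b."""
    return f(np.dot(w, p) + b)

# Both functions are differentiable everywhere, as backpropagation requires.
print(neuron_output(np.array([0.1, -0.2]), np.array([0.6, 0.1]), 0.1))          # sigmoid
print(neuron_output(np.array([0.1, -0.2]), np.array([0.6, 0.1]), 0.1, tansig))  # tan-sigmoid
```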

Architecture – Number of Input Neurons
• Input neurons (=inputs) do not perform any computation (e.g. summation and transfer function); they just transmit data
• Numerical data – 1 input neuron for each attribute
(Figure: the iris network from the previous slide – 4 input neurons, one per attribute; 1 hidden layer; output neurons, one per class – shown next to the iris training table)
• For numerical data, the input examples are directly fed into the network, e.g. 5.1, 3.5, 1.4 and 0.2 to each of the 4 input neurons (no need for encoding, see next slide)


Architecture – Number of Input Neurons (2)
• Categorical data – 1 input neuron for each attribute value
  • How many input neurons for the weather data?
  (Figure: input layer with one neuron per attribute value – sunny, overcast, rainy | hot, mild, cool | high, normal | true, false – for the attributes outlook, temperature, humidity, windy; plus hidden layer(s) and output layer)
• Input examples cannot be directly fed; they need to be encoded
• Encoding of the input examples – typically binary, depending on the value of the attribute (on and off)
  e.g. ex 1:  1 0 0 | 1 0 0 | 1 0 | 0 1
  outlook  | temp. | humidity | windy | play
  sunny    | hot   | high     | false | no
  overcast | hot   | high     | false | yes
  rainy    | mild  | high     | false | yes
  rainy    | cool  | normal   | false | yes
  rainy    | cool  | normal   | true  | no
  overcast | cool  | normal   | true  | yes
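A minimal sketch of this binary (1-of-n per attribute) input encoding; the ordering of the attribute values is assumed to match the figure above.

```python
# Binary input encoding for the weather data: one input neuron per attribute value.
ATTRIBUTE_VALUES = {
    "outlook":     ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity":    ["high", "normal"],
    "windy":       ["true", "false"],
}

def encode_example(example):
    """Return the concatenated 1-of-n encoding of a categorical example."""
    bits = []
    for attr, values in ATTRIBUTE_VALUES.items():
        bits += [1 if example[attr] == v else 0 for v in values]
    return bits

ex1 = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
print(encode_example(ex1))  # [1,0,0, 1,0,0, 1,0, 0,1] as on the slide
```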

Number of Output Neurons
• Typically 1 neuron for each class
  (Figure: the weather-data network; input encoding of ex.1: 1 0 0 | 1 0 0 | 1 0 | 0 1, target class of ex.1: 1 0)
• The targets (i.e. the classes of the examples) need to be encoded – typically binary encoding is used
  e.g. class 1 (no): 1 0,  class 2 (yes): 0 1

For COMP3608 students only
More on Output Encoding
• How to encode the outputs and represent the targets?
• Local encoding
  • 1 output neuron
  • Different output values represent different classes, e.g. <0.2 – class 1, >0.8 – class 2, in between – ambiguous class (class 3)
• Distributed (binary, 1-of-n) encoding – typically used in multi-class problems
  • Number of outputs = number of classes
  • Example: 3 classes, 3 output neurons; class 1 is represented as 1 0 0, class 2 as 0 1 0 and class 3 as 0 0 1
  • Another representation of the targets: use 0.1 instead of 0 and 0.9 instead of 1
• Motivation for choosing distributed over local encoding
  • Provides more degrees of freedom to represent the target function (n times as many weights available)
  • The difference between the highest and the second highest output value can be used as a measure of how confident the prediction is (close values => ambiguous classification)

Number of Hidden Layers and Neurons in Them
• An art! Typically – by trial and error
• The given task (data) constrains the number of inputs and output neurons but
not the number of hidden layers and neurons in them
• Too many hidden layers and neurons (i.e. too many weights) – overfitting
• Too few – underfitting, i.e. the NN is not able to learn the input-output mapping
• A heuristic to start with: 1 hidden layer with n hidden neurons, n=(inputs+output_neurons)/2
(Figure: the weather-data network – 10 inputs (sunny, overcast, rainy | hot, mild, cool | high, normal | true, false), hidden layer(s), 2 output neurons; input encoding of ex.1: 1 0 0 | 1 0 0 | 1 0 | 0 1, target class of ex.1: 1 0)
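As a rough illustration of the heuristic above and of the random initialization mentioned on the architecture slides, here is a minimal NumPy sketch; the rounding rule and the RNG seed are my own choices.

```python
import numpy as np

def initial_architecture(n_inputs, n_outputs, rng=np.random.default_rng(0)):
    """One hidden layer with (inputs + outputs)/2 neurons, weights drawn from [-1, 1]."""
    n_hidden = round((n_inputs + n_outputs) / 2)
    W_hidden = rng.uniform(-1, 1, size=(n_hidden, n_inputs))   # input -> hidden weights
    b_hidden = rng.uniform(-1, 1, size=n_hidden)
    W_output = rng.uniform(-1, 1, size=(n_outputs, n_hidden))  # hidden -> output weights
    b_output = rng.uniform(-1, 1, size=n_outputs)
    return W_hidden, b_hidden, W_output, b_output

# Weather data: 10 encoded inputs, 2 output classes -> 6 hidden neurons to start with.
W1, b1, W2, b2 = initial_architecture(10, 2)
print(W1.shape, W2.shape)  # (6, 10) (2, 6)
```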

Learning in Multi-Layer Perceptron NNs
• Supervised learning – the training data:
• consists of labeled examples (p, d), i.e. the desired output d for them is
given (p – input vector; d – desired output)
• can be viewed as a teacher during the training process
• error – the difference between the desired output d and the actual output a of the network
• Idea of backpropagation learning
For each training example p
• Propagate p through the network and calculate the output a. Compare the desired d with the actual output a and calculate the error.
• Update the weights of the network to reduce the error
Until the error over all examples < threshold
• Why "backpropagation"? Adjusts the weights backwards (from the output to the input neurons) by propagating the weight change
• How to calculate the weight change?  w_pq(new) = w_pq(old) + Δw_pq

Error Function
• Sum of Squared Errors (E) is a classical measure of error
• e.g. E for a single training example over all output neurons i:
  E = ½ Σ_i e_i² = ½ Σ_i (d_i − a_i)²
  • d_i – desired value for the output neuron i
  • a_i – actual NN value (i.e. value predicted by the NN) for the output neuron i
• Thus, backpropagation learning can be viewed as an optimization search in the weight space
  • Goal state – the set of weights for which the performance index (error E) is minimum
  • Search method – hill climbing

Error Landscape in Weight Space
• E is a function of the weights (e.g. E is a function of w1 and w2)
• 1 state = 1 set of weights (w1 and w2 in this case)
• Several local minima and one global minimum
  (Figure: E as a function of w1 and w2)
• How to minimize the error? Take steps downhill
• Will find a local minimum (the one that is closest to the starting position), i.e. not guaranteed to find the global minimum (except when there is only one minimum)
• How to get to the bottom as fast as possible? (i.e. we need to know what direction to move in that will make the largest reduction in error)

Steepest Gradient Descent
• The direction of the steepest descent is called the gradient and can be computed (dE/dw)
  • It gives the slope (gradient) of the error function for one weight; we want to find the weight where the slope (gradient) is 0
• A function decreases most rapidly when the direction of movement is in the direction of the negative of the gradient
• Hence, we want to adjust the weights so that the change moves the NN down the error surface in the direction of the locally steepest descent, given by the negative of the gradient
• η – learning rate, defines the step; typically in the range (0,1)
  (Image from Dunham)

Backpropagation Algorithm – Idea
• The backpropagation algorithm adjusts the weights by working backwards from the output layer to the input layer
• Calculates the error and propagates this error from layer to layer
• 2 approaches:
  • Batch – the weights are adjusted once after all training examples are applied (the total error is calculated)
  • Incremental – the weights are adjusted after each training example is applied
    • Also called stochastic (or approximate) gradient descent
    • This is the preferred method – it is usually faster than batch learning and finds better solutions (can escape local optima)
• (Image from Hertz, Krogh & Palmer: solid lines – forward propagation of signals; dashed lines – backward propagation of error)
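To make the steepest-descent idea concrete, here is a minimal numeric sketch on a toy two-weight error surface; the quadratic surface and the starting point are invented for illustration and are not the NN error function itself.

```python
import numpy as np

# Toy error surface over two weights: E(w1, w2) = (w1 - 1)^2 + 4*(w2 + 2)^2
def E(w):
    return (w[0] - 1) ** 2 + 4 * (w[1] + 2) ** 2

def grad_E(w):
    return np.array([2 * (w[0] - 1), 8 * (w[1] + 2)])

eta = 0.1                      # learning rate: the step size along the negative gradient
w = np.array([3.0, 1.0])       # starting set of weights
for step in range(25):
    w = w - eta * grad_E(w)    # steepest descent update: move against the gradient

print(w, E(w))                 # approaches the (here unique) minimum at (1, -2)
```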
Backpropagation – Notation
• A neural network with one hidden layer
• Indexes for the neurons: k over inputs, j over hidden, i over output neurons
• E – error function that we would like to minimize after each training example (i.e. incremental mode):
  E = ½ Σ_i (d_i − o_i)²
  • d_i – target value of neuron i for the current example p
  • o_i – actual value of neuron i for the current example p
• Let's express E in terms of the weights of the network and the values of the current example!
• Recall that the weights are the parameters of the NN

Expressing E in terms of the NN weights and input example
1. Input for the hidden neuron j for p:
   net_j = Σ_k w_kj·o_k + b_j
2. Activation of neuron j as a function of its input:
   o_j = f(net_j) = f(Σ_k w_kj·o_k + b_j)
3. Input for the output neuron i:
   net_i = Σ_j w_ji·o_j + b_i = Σ_j w_ji·f(Σ_k w_kj·o_k + b_j) + b_i

Expressing E in terms of the NN weights and input example (2)
4. Output for the output neuron i:
   o_i = f(net_i) = f(Σ_j w_ji·o_j + b_i) = f(Σ_j w_ji·f(Σ_k w_kj·o_k + b_j) + b_i)
5. Substituting 4 into E:
   E = ½ Σ_i (d_i − o_i)² = ½ Σ_i (d_i − f(Σ_j w_ji·f(Σ_k w_kj·o_k + b_j) + b_i))²

Backpropagation – Derivation
• Steepest gradient descent: adjust the weights proportionally to the negative of the error gradient
• For a weight w_ji to an output neuron:
  Δw_ji = −η ∂E/∂w_ji = −η (∂E/∂o_i)(∂o_i/∂net_i)(∂net_i/∂w_ji)   <- chain rule
        = −η · 2·(1/2)·(d_i − o_i)·(−1) · f'(net_i) · o_j
        = η (d_i − o_i) f'(net_i) o_j = η δ_i o_j,  where δ_i = (d_i − o_i) f'(net_i)
• For a weight w_kj to a hidden neuron:
  Δw_kj = −η ∂E/∂w_kj = −η (∂E/∂o_j)(∂o_j/∂w_kj) = η Σ_i (d_i − o_i) f'(net_i) w_ji · f'(net_j) o_k
        = η (Σ_i δ_i w_ji) f'(net_j) o_k = η δ_j o_k,  where δ_j = f'(net_j) Σ_i δ_i w_ji

Backpropagation Rule – Summary
• w_pq(t) – weight from node p to node q at time t
  w_pq(t+1) = w_pq(t) + Δw_pq
  Δw_pq = η δ_q o_p   – weight change
• The weight change is proportional to the output activation of neuron p and the error δ of neuron q
• δ is calculated in 2 different ways:
  • q is an output neuron:  δ_q = (d_q − o_q) f'(net_q)
  • q is a hidden neuron:  δ_q = f'(net_q) Σ_i w_qi δ_i
  (f'(net_q) is the derivative of the activation function used in neuron q with respect to the input of q (net_q); i runs over the nodes in the layer above q)

Exercise
Given: f(x) = 1/(1 + e^(−x))
Show that: f'(x) = f(x)(1 − f(x))

Solution
f'(x) = [1'·(1 + e^(−x)) − (1 + e^(−x))'·1] / (1 + e^(−x))² = [0 − (−1)e^(−x)] / (1 + e^(−x))² = e^(−x) / (1 + e^(−x))²   (1)
f(x)(1 − f(x)) = 1/(1 + e^(−x)) · (1 − 1/(1 + e^(−x))) = 1/(1 + e^(−x)) · e^(−x)/(1 + e^(−x)) = e^(−x) / (1 + e^(−x))²   (2)
As (1) = (2)  =>  f'(x) = f(x)(1 − f(x))

Complete Rule for Sigmoid Neurons
• From the formulas for δ => we must be able to calculate the derivative of f. For a sigmoid transfer function:
  f(net_m) = o_m = 1/(1 + e^(−net_m))
  f'(net_m) = ∂o_m/∂net_m = ∂(1/(1 + e^(−net_m)))/∂net_m = e^(−net_m) / (1 + e^(−net_m))² = o_m·(1 − o_m)
  (the same derivation as in the exercise on the previous slide, just different notation)
• Thus, the backpropagation errors for a network with a sigmoid transfer function are:
  • q is an output neuron:  δ_q = (d_q − o_q)·o_q·(1 − o_q)
  • q is a hidden neuron:  δ_q = o_q·(1 − o_q)·Σ_i w_qi δ_i
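A minimal sketch of the two δ formulas for sigmoid neurons just derived; the variable names are my own.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def delta_output(d, o):
    """delta for an output neuron q: (d_q - o_q) * o_q * (1 - o_q)."""
    return (d - o) * o * (1.0 - o)

def delta_hidden(o_q, w_q_to_above, deltas_above):
    """delta for a hidden neuron q: o_q * (1 - o_q) * sum_i w_qi * delta_i."""
    return o_q * (1.0 - o_q) * np.dot(w_q_to_above, deltas_above)

# Numbers taken from the worked example later in the lecture: o6 = 0.53, d6 = 1
print(delta_output(1.0, 0.53))                        # ~0.12
print(delta_hidden(0.53, np.array([-0.4, 0.2]),       # w36, w37
                   np.array([0.12, -0.15])))          # delta6, delta7 -> ~ -0.019
```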

Backpropagation Algorithm – Summary (1)
1. Determine the architecture of the network
   – how many input and output neurons; what output encoding
   – how many hidden neurons and layers
2. Initialize all weights (biases incl.) to small random values, typically in [-1, 1]
3. Repeat until the termination criterion is satisfied:
   (forward pass)
   – Present a training example and propagate it through the network to calculate the actual output of the network
   (backward pass)
   – Compute the error (the δ values for the output neurons)
   – Starting with the output layer, repeat for each layer in the network:
     – propagate the δ values back to the previous layer
     – update the weights between the two layers

Backpropagation Algorithm – Summary (2)
4. The stopping criterion is checked at the end of each epoch (an epoch is 1 pass through the training set). Training stops when 1 of these 2 conditions is satisfied:
  1) The error on the training data is below a threshold
    – This requires all training examples to be propagated again and the total error to be calculated (as in the perceptron)
    – The threshold is set heuristically, e.g. 0.3
    – Different types of errors may be used, e.g. mean square or mean absolute error, or even accuracy (% of correctly classified examples)
  2) The maximum number of epochs is reached
• Instead of 1) or 1)+2) we can use an alternative stopping criterion called early stopping, which uses a validation set to prevent overfitting (see next slides)
• It typically takes hundreds or thousands of epochs for a NN to converge
• Try Matlab's demo nnd11bc! (a minimal code sketch of the whole algorithm is given below)
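The following is a minimal sketch of the algorithm summarized above, in incremental mode for a 1-hidden-layer sigmoid network; the stopping threshold, learning rate and the tiny XOR-style dataset at the end are placeholder choices of mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, D, n_hidden, eta=0.1, err_threshold=0.3, max_epochs=1000, seed=0):
    """Incremental backpropagation for a 1-hidden-layer sigmoid network."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    W1 = rng.uniform(-1, 1, (n_hidden, n_in)); b1 = rng.uniform(-1, 1, n_hidden)
    W2 = rng.uniform(-1, 1, (n_out, n_hidden)); b2 = rng.uniform(-1, 1, n_out)

    for epoch in range(max_epochs):
        for p, d in zip(X, D):
            # forward pass
            o_h = sigmoid(W1 @ p + b1)
            o_o = sigmoid(W2 @ o_h + b2)
            # backward pass: deltas for the output neurons, then the hidden neurons
            delta_o = (d - o_o) * o_o * (1 - o_o)
            delta_h = o_h * (1 - o_h) * (W2.T @ delta_o)
            # weight updates
            W2 += eta * np.outer(delta_o, o_h); b2 += eta * delta_o
            W1 += eta * np.outer(delta_h, p);   b1 += eta * delta_h
        # stopping criterion: total sum-of-squared error over all examples
        total_err = sum(0.5 * np.sum((d - sigmoid(W2 @ sigmoid(W1 @ p + b1) + b2)) ** 2)
                        for p, d in zip(X, D))
        if total_err < err_threshold:
            break
    return W1, b1, W2, b2

# Tiny illustration: learn XOR targets encoded as 1-of-2 outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
train_backprop(X, D, n_hidden=3, eta=0.5, max_epochs=5000)
```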

How to Determine if an Example is Correctly Classified?
• Binary encoding of the target outputs
• Apply each example and get the resulting output activations of the output neurons; the example will belong to the class corresponding to the output neuron with highest activation.
• Example: 3 classes; the outputs for ex. X are 0.3, 0.7, 0.5 => ex. X belongs to class 2
• i.e. each output value is regarded as the probability that the example belongs to the class corresponding to this output
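A sketch of this decision rule:

```python
import numpy as np

def predicted_class(outputs):
    """The example is assigned to the class whose output neuron has the highest activation."""
    return int(np.argmax(outputs)) + 1   # classes numbered from 1, as in the slides

outputs = [0.3, 0.7, 0.5]        # activations of the 3 output neurons for example X
print(predicted_class(outputs))  # 2 -> example X belongs to class 2
```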


Backpropagation – Example
• 2 classes, 2-dim. input data; training set:
  ex.1: 0.6 0.1 | class 1 (banana)
  ex.2: 0.2 0.3 | class 2 (orange)
• Network architecture
  • How many inputs?
  • How many hidden neurons? Heuristic: n = (inputs + output_neurons)/2 (3 is used in this example, not 2)
  • How many output neurons?
  • What encoding of the outputs? e.g. 1 0 for class 1, 0 1 for class 2
• Initial weights and learning rate
  • Let η = 0.1 and the weights be set as shown in the network diagram
• Sigmoid transfer function

Backpropagation – Example (cont. 1)
• 1. Forward pass for ex. 1 – calculate the outputs o6 and o7
  ex.1: o1 = 0.6, o2 = 0.1, target output 1 0, i.e. class 1
• Activations of the hidden neurons:
  net3 = o1*w13 + o2*w23 + b3 = 0.6*0.1 + 0.1*(-0.2) + 0.1 = 0.14,  o3 = 1/(1+e^(-net3)) = 0.53
  net4 = o1*w14 + o2*w24 + b4 = 0.6*0 + 0.1*0.2 + 0.2 = 0.22,  o4 = 1/(1+e^(-net4)) = 0.55
  net5 = o1*w15 + o2*w25 + b5 = 0.6*0.3 + 0.1*(-0.4) + 0.5 = 0.64,  o5 = 1/(1+e^(-net5)) = 0.65
• Activations of the output neurons:
  net6 = o3*w36 + o4*w46 + o5*w56 + b6 = 0.53*(-0.4) + 0.55*0.1 + 0.65*0.6 - 0.1 = 0.13,  o6 = 1/(1+e^(-net6)) = 0.53
  net7 = o3*w37 + o4*w47 + o5*w57 + b7 = 0.53*0.2 + 0.55*(-0.1) + 0.65*(-0.2) + 0.6 = 0.52,  o7 = 1/(1+e^(-net7)) = 0.63

Backpropagation – Example (cont. 2)
• 2. Backward pass for ex. 1
• Calculate the output errors δ6 and δ7 (note that d6 = 1, d7 = 0 for class 1):
  δ6 = (d6 - o6) * o6 * (1 - o6) = (1 - 0.53)*0.53*(1 - 0.53) = 0.12
  δ7 = (d7 - o7) * o7 * (1 - o7) = (0 - 0.63)*0.63*(1 - 0.63) = -0.15
• Calculate the new weights between the hidden and output neurons (η = 0.1):
  Δw36 = η * δ6 * o3 = 0.1*0.12*0.53 = 0.006
  w36new = w36old + Δw36 = -0.4 + 0.006 = -0.394
  Δw37 = η * δ7 * o3 = 0.1*(-0.15)*0.53 = -0.008
  w37new = w37old + Δw37 = 0.2 - 0.008 = 0.192
  Similarly for w46new, w47new, w56new and w57new
• For the biases b6 and b7 (remember: biases are weights with input 1):
  Δb6 = η * δ6 * 1 = 0.1*0.12 = 0.012
  b6new = b6old + Δb6 = -0.1 + 0.012 = -0.088
  Similarly for b7

Backpropagation – Example (cont. 3)
• Calculate the errors of the hidden neurons δ3, δ4 and δ5:
  δ3 = o3 * (1 - o3) * (w36*δ6 + w37*δ7) = 0.53*(1 - 0.53)*(-0.4*0.12 + 0.2*(-0.15)) = -0.019
  Similarly for δ4 and δ5
• Calculate the new weights between the input and hidden neurons (η = 0.1):
  Δw13 = η * δ3 * o1 = 0.1*(-0.019)*0.6 = -0.0011
  w13new = w13old + Δw13 = 0.1 - 0.0011 = 0.0989
  Similarly for w23new, w14new, w24new, w15new and w25new; b3, b4 and b5
3. Repeat the same procedure for the other training examples • Forward pass for ex. 2…backward pass for ex.2…
• Forward pass for ex. 3…backward pass for ex. 3…
•…
• Note: it’s better to apply input examples in random order
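A sketch that reproduces the forward and backward pass for ex. 1 with the initial weights used in the calculations above (the weight values are read off the arithmetic on the previous slides):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
eta = 0.1

# Initial weights and biases, as used in the calculations above
W1 = np.array([[0.1, -0.2],    # w13, w23 -> neuron 3
               [0.0,  0.2],    # w14, w24 -> neuron 4
               [0.3, -0.4]])   # w15, w25 -> neuron 5
b1 = np.array([0.1, 0.2, 0.5])
W2 = np.array([[-0.4,  0.1,  0.6],    # w36, w46, w56 -> neuron 6
               [ 0.2, -0.1, -0.2]])   # w37, w47, w57 -> neuron 7
b2 = np.array([-0.1, 0.6])

p, d = np.array([0.6, 0.1]), np.array([1.0, 0.0])   # ex.1, class 1

# forward pass
o_hidden = sigmoid(W1 @ p + b1)        # ~ [0.53, 0.55, 0.65]
o_out = sigmoid(W2 @ o_hidden + b2)    # ~ [0.53, 0.63]

# backward pass
delta_out = (d - o_out) * o_out * (1 - o_out)               # ~ [0.12, -0.15]
delta_hid = o_hidden * (1 - o_hidden) * (W2.T @ delta_out)  # delta3 ~ -0.019

# weight updates (e.g. w36new ~ -0.394, w13new ~ 0.0989)
W2 += eta * np.outer(delta_out, o_hidden); b2 += eta * delta_out
W1 += eta * np.outer(delta_hid, p);        b1 += eta * delta_hid
print(o_out, delta_out, delta_hid, W2[0, 0], W1[0, 0])
```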

Backpropagation – Example (cont. 4)
4. At the end of the epoch – check if the stopping criterion is satisfied:
• If yes: stop training
• If not, continue training:
• epoch++
• go to step 1

Steepest Gradient Descent
• Does the gradient descent guarantee that after each adjustment the error will be reduced?
• No, if a local minimum is reached.
• Is the gradient descent optimal, i.e. will it find the global minimum?
• No. It is guaranteed to find a minimum but it might be a local minimum not the global minimum.
• However, a local minimum may be a good enough solution!
Backpropagation’s error space: many local minima and 1 global minimum

Universality of Backpropagation
• Boolean functions
• Every boolean function of the inputs can be represented by a network with a single hidden layer
• Continuous functions – universal approximation theorems
• Any continuous function can be approximated with arbitrarily small error by a network with one hidden layer (Cybenko 1989, Hornik et al. 1989)
• Any function (including discontinuous) can be approximated to arbitrarily small error by a network with two hidden layers (Cybenko 1988)
• These are existence theorems – they say that a solution (a NN) exists but don't say how to choose the number of hidden neurons!
• For a given NN it is hard to say exactly which functions can be represented and which cannot


Overfitting
• Occurs when:
  • Training examples are noisy
  • Small dataset, not representative of the whole data
  • The number of free (trainable) parameters is bigger than the number of training examples
    • Which are the trainable parameters in a NN? The weights.
• Preventing overtraining:
  • Use a network that is just large enough to provide an adequate fit
  • Ockham's Razor – don't use a bigger net when a smaller one will work
  • The network should not have more free parameters than the number of training examples
• However, it is difficult to know beforehand how large a network should be for a specific application!

Preventing Overtraining – Validation Set Approach
• Also called an early stopping method
• Available data is divided into 3 subsets
• Training set
• Used for computing the gradient and updating the weights
• Validation set
• The error on the validation set is monitored during the training
• This error will normally decrease during the initial phase of training (as does the training error)
• However, when the network begins to overfit the data, the error on the validation set will typically begin to rise
• Training is stopped when the error on the validation set increases for a pre- specified number of iterations and the weights and biases at the minimum of the validation set error are returned
• Testing set
• Not used during training but to compare different algorithms once training
has completed
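A sketch of early stopping with a validation set; train_one_epoch and validation_error are placeholders for the backpropagation pass and the monitored validation error described above, and the patience value is an arbitrary choice.

```python
import copy

def early_stopping_training(net, train_one_epoch, validation_error,
                            patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` consecutive epochs;
    return the weights from the epoch with the lowest validation error."""
    best_err = float("inf")
    best_net = copy.deepcopy(net)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(net)            # one backpropagation pass over the training set
        err = validation_error(net)     # error on the held-out validation set
        if err < best_err:
            best_err, best_net = err, copy.deepcopy(net)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                   # validation error keeps rising -> overfitting
    return best_net
```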

For COMP3608 students only
Preventing Overtraining – Cross Validation Approach
• Problems with the validation set approach – small data sets
• Not enough data may be available to provide a validation set
• Overfitting is most severe for small data sets
• K-fold cross validation may be used
• Perform k fold cross validation
• Each time determine the number of epochs ep that achieved best performance on the respective test partition
• Calculate ep_mean, the mean of ep
• Final run: train the network on all examples for ep_mean epochs

Error Surface and Convergence – Example
Try nnd12sd1!
(Figure: two gradient-descent trajectories, a and b, on the error surface)
• Problem 1: Path b gets trapped in a local minimum
• What can be done? Try different initializations!
• Problem 2: Path a converges to the optimum solution but is very slow
• What can we do?

Speeding up the Convergence
• Solution 1: Increase the learning rate
• Faster on the flat part but unstable when falling into the steep valley
that contains the minimum point – overshooting the minimum
Try nnd12sd2!

Speeding up the Convergence (2)
• Solution 2: Smooth out the trajectory by averaging the weight update
• How? Make the current update dependent on the previous by introducing an extra term (called momentum term) in the weight adaptation equation
• Before:
  w_pq(t+1) = w_pq(t) + Δw_pq(t),  where Δw_pq(t) = η δ_q o_p
• Now:
  Δw_pq(t) = η δ_q o_p + μ (w_pq(t) − w_pq(t−1))
  (the last term, μ (w_pq(t) − w_pq(t−1)), is the momentum term)
• If previous change was large -> current change will be large and vice versa
• Speeds up convergence along shallow gradient valleys (convergence is slow as
the direction that has to be followed has a very small gradient -> oscillations)
• The use of momentum typically smooths out the oscillations and produces a more stable trajectory
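A sketch of the weight update with momentum; since w_pq(t) − w_pq(t−1) is simply the previous change Δw_pq(t−1), the sketch keeps that matrix around between updates (variable names and the example numbers are my own).

```python
import numpy as np

def momentum_update(W, prev_delta_W, delta, o_prev, eta=0.1, mu=0.8):
    """delta_W(t) = eta * delta_q * o_p + mu * delta_W(t-1); returns (new W, new delta_W)."""
    delta_W = eta * np.outer(delta, o_prev) + mu * prev_delta_W
    return W + delta_W, delta_W

# Usage inside the training loop: carry prev over from the previous update.
W2 = np.array([[-0.4, 0.1, 0.6], [0.2, -0.1, -0.2]])
prev = np.zeros_like(W2)
delta_out = np.array([0.12, -0.15]); o_hidden = np.array([0.53, 0.55, 0.65])
W2, prev = momentum_update(W2, prev, delta_out, o_hidden)
print(W2)
```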

For COMP3608 students only
Momentum
• The theory behind momentum comes from linear filters
• Observation: convergence might be improved by smoothing out the trajectory, i.e. by averaging the updates to the parameters
• First-order filter:  y(k) = γ·y(k−1) + (1 − γ)·w(k),  0 ≤ γ ≤ 1
  • w(k) – input, y(k) – output
  • γ – momentum coefficient
  • e.g. input w(k) = 1 + sin(2πk/16)
• Based on Hagan, Demuth & Beale

For COMP3608 students only
First Order Linear Filter
• Oscillation in the filter output y(k) is less than the oscillation in the filter input w(k)
• As the momentum coefficient increases, the oscillation in the output is reduced
• The average filter output is the same as the average filter input
• Although as the momentum increases the filter output is slow to
respond
• => The filter tends to reduce the amount of oscillation, while still tracking the average value


Backpropagation with Momentum – Example
• Example – the same learning rate and initial position:
  (Figure: squared error vs. iteration, without momentum and with momentum 0.8)
• Smooth and faster convergence
• Stable algorithm
• By the use of momentum we can use a larger learning rate while maintaining the stability of the algorithm
• Try nnd12mo!
• Typical momentum values used in practice: 0.6-0.9

More on the Learning Rate
• Constant throughout training (standard steepest descent)
• The performance is very sensitive to the proper setting of the
learning rate
• Too small – slow convergence
• Too big – oscillation, overshooting of the minimum
• It is not possible to determine the optimum learning rate before training as it changes during training and depends on the error surface
• Variable learning rate
• goal: keep the learning rate as large as possible while keeping learning
stable
• Several algorithms have been proposed

Some Practical Tricks
Y. LeCun, L. Bottou, G.B. Orr and K.-R Mueller, Efficient BackProp,
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
1. Stochastic (incremental) or batch learning
• Use stochastic – it is faster and often finds better solutions
2. Input examples
• Shuffle the training set so that successive training examples never (rarely) belong to the same class
• Present input examples that produce a large error more often than examples that produce a small error
• Normalise the input variables to a standard deviation of 1
• If possible, use de-correlated input variables
3. Transfer function
• Use a tangent sigmoid not the standard one – faster convergence
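A sketch of two of these tricks – shuffling the training set and normalising each input variable to a standard deviation of 1 (subtracting the mean first is an extra step assumed here; it is also recommended in the LeCun et al. paper):

```python
import numpy as np

def prepare_inputs(X, rng=np.random.default_rng(0)):
    """Shuffle the training examples and scale each input variable to mean 0, std 1."""
    X = np.asarray(X, dtype=float)
    order = rng.permutation(len(X))           # random shuffle of the training set
    X = X[order]
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # each input variable: std of 1
    return X, order

X, order = prepare_inputs([[5.1, 3.5, 1.4, 0.2],
                           [4.9, 3.0, 1.4, 0.2],
                           [6.4, 3.2, 4.5, 1.5]])
print(X.std(axis=0))  # ~ [1, 1, 1, 1]
```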

Some Practical Tricks (2)
4. Learning rate – a separate learning rate for each neuron so that all weights converge roughly with the same speed
• Learning rate should be proportional to the square root of the number of inputs to the neuron
• Learning rates in the lower layers should be higher than in the higher layers
5. Training – useful variations and extensions of the gradient descent method – the Levenberg-Marquardt method and conjugate gradient
• If the training set is large (> a few hundred examples) and the task is classification, use stochastic gradient descent with careful tuning or the Levenberg-Marquardt method
• If the training set is not too large or the task is regression, use the conjugate gradient method

Limitations and Capabilities
• Backpropagation NNs are used for supervised learning (classification or regression)
• Theoretically they can
• Form arbitrary decision boundaries (i.e. both linear and non-linear)
• Are universal approximators – can approximate any function arbitrarily well
• In practice:
• May not always find a good solution – can be trapped in a local minimum
• Performance is sensitive to the starting conditions (weights initialization)
• Sensitive to the number of hidden layers and neurons
• Too few neurons – underfitting, unable to learn what you want it to learn
• Too many – overfitting, learns slowly
• => the architecture of a MLP network is not completely constrained by the problem to be solved as the number of hidden layers and neurons are left to the designer


Limitations and Capabilities – cont.
In practice (continuation):
• Sensitive to the value of the learning rate
• Too small – slow learning
• Too big – instability or poor performance
• The proper choice depends on the nature of the examples
• Trial and error
• Refer to the choices that have worked well in similar problems
• => successful application of NNs requires time and experience
Backpropagation algorithm – summary:
• uses steepest Gradient Descent (GD) for minimizing the mean square error
• Standard GD is slow as it requires small learning rate for stable learning
• GD with momentum is faster as it allows higher learning rate while maintaining stability
• There are several variations of the backpropagation algorithm

Interesting Classical NN Applications
• A few examples of the many significant applications of NNs (before deep learning)
• Network design was the result of several months trial and error experimentation
• Moral: NNs are widely applicable but they cannot magically solve problems; wrong choices lead to poor performance
• “NNs are the second best way of doing just about anything” John Denker
• i.e. NNs provide passable performance on many tasks that would be difficult to solve explicitly with other techniques

NETtalk
• Sejnowski and Rosenberg 1987
• Pronunciation of written English
• Fascinating problem in linguistics as English is not a phonetic language
• Task with high commercial profit
• How?
• Task 1: Mapping the text stream to phonemes
e.g. cat -> [kat], believe->[bi’li:v]
• Task 2: Passing the phonemes to a speech generator
• Task for the NN: learning to map the text to phonemes (task 1)
• Good task for a NN as most of the mapping rules are
approximately correct • e.g. cat [k], century [s]

NETtalk -Architecture
• 203 input neurons – 7×29
• 7 – size of the sliding window (the character to be pronounced and
the 3 characters before and after it)
• 29 possible characters (26 letters + blank, period, other punctuation symbols)
• 80 hidden neurons
• 26 output neurons
• 1 for each phoneme

NETtalk – Performance
• Training set
• 1024 pairs of word-phonemes (hand transcribed)
• Accuracy on training set: 90% after 50 epochs
• Why not 100% – possible reasons?
• A few dozen hours of training time + a few months of experimentation with different architectures
• Testing
• Accuracy 78%
• Importance
• A good showpiece for the philosophy of NNs
• The network appears to mimic the speech patterns of young children – incorrect babble at first (as the weights are random), then gradually improving to become understandable

Driving Motor Vehicles
• Pomerleau et al., 1993
• ALVINN (Autonomous Land Vehicle In a Neural Network)
• Learns to drive a van along a single lane on a highway
• Once trained on a particular road, ALVIN can drive at speed > 40 miles per hour
• Chevy van and US Army HMMWV personnel carrier
• computer-controlled steering, acceleration and braking
• sensors: color stereo video camera, radar, positioning system, scanning laser range finders
Pictures from Mitchell

ALVINN – Architecture
• Fully connected backpropagation NN with 1 hidden layer
• 960 input neurons – the signal from the camera is preprocessed to yield
30×32 image intensity grid
• 5 hidden neurons
• 32 output neurons corresponding to directions
• If the output node with the highest activation is
  • the leftmost, then ALVINN turns sharply left
  • the rightmost, then ALVINN turns sharply right
  • a node between them, then ALVINN directs the van in a proportionally intermediate direction
• Smoothing the direction – it is calculated as an average suggested not only by the output node with the highest activation but also by that node's immediate neighbours

ALVINN – Training
• Training examples (image-direction pairs)
• Recording such pairs while a human drives the vehicle
• After collecting 5 min of such data and 10 min of training, ALVINN can drive on its own
• Potential problem: as the human is too good and (typically) does not stray from the lane, there are no training examples that show how to recover when you are misaligned with the road
• Solution: ALVINN corrects this by creating synthetic training examples – it rotates each video image to create additional views of what the road would look like if the van were a little off course to the left or right

ALVINN – Results
• Impressive results
• ALVINN has driven at speeds up to 70 miles per hour for up to 90 miles
on public highways near Pittsburgh
• Also at normal speeds on single lane dirt roads, paved bike paths, and two lane suburban streets
• Limitations
• Unable to drive on a road type for which it hasn’t been trained
• Not very robust to changes in lighting conditions and presence of other vehicles
• Comparison with traditional vision algorithms
• Use image processing to analyse the scene and find the road and then
follow it
• Most of them achieve 3-4 miles per hour

ALVINN – Discussion
• Why is ALVINN so successful?
• Fast computation – once trained, the NN is able to compute a new steering direction 10 times a second => the computed direction can be off by 10% from the ideal as long as the system is able to make a correction in a few tenths of a second
• Learning from examples is very appropriate
• No good theory of driving but it is easy to collect examples
• Motivated the use of learning algorithm (but not necessary NNs)
• Driving is continuous, noisy domain, in which almost all features contribute some information => NNs are better choice than some other learning algorithms (e.g. DTs)

Acknowledgements
• J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley
• T. Mitchell, Machine Learning
• M. Dunham, Data Mining, Pearson Education
• Hagan, Demuth, Beale, Neural Network Design, PWS, Thomson
• Matlab NN toolbox
• Y. LeCun, L. Bottou, G.B. Orr and K.-R Mueller, Efficient BackProp, Neural networks: Tricks of the Trade, pp.9-48, http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf