COMP3308/3608, Lecture 8b & 9a
ARTIFICIAL INTELLIGENCE
Multilayer Neural Networks and Backpropagation Algorithm
Reference: Russell and Norvig, pp. 731-737; Witten, Frank and Hall, pp. 232-241
• Multilayer perceptron NNs
• XOR problem
• Neuron model
• Backpropagation algorithm
• Derivation
• Universality – Cybenko's theorems
• Preventing overfitting
• Modifications
• Momentum
• Learning rate
• Limitations and capabilities
• Interesting applications
XOR problem – Example
• The XOR problem is not linearly separable and, hence, cannot be solved by a single layer perceptron network
• But it can be solved by a multilayer perceptron network:
• The 2 perceptrons in the first layer identify linearly separable parts, and their outputs are combined by another perceptron to form the final solution
$p_1 = \begin{bmatrix}0\\0\end{bmatrix}, t_1 = 0 \quad p_2 = \begin{bmatrix}0\\1\end{bmatrix}, t_2 = 1 \quad p_3 = \begin{bmatrix}1\\0\end{bmatrix}, t_3 = 1 \quad p_4 = \begin{bmatrix}1\\1\end{bmatrix}, t_4 = 0$
• Verify that the network with the weights shown in the diagram solves the XOR problem!
Image from Hagan, Demuth & Beale
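The diagram's weights are not reproduced in this text, so below is a minimal sketch with one possible set of hand-picked weights and hard-limit (step) units; the particular values are illustrative, not necessarily those shown on the slide.

```python
# One possible 2-layer perceptron solving XOR (hand-picked weights, step units).
# Hidden neuron 1 acts like OR, hidden neuron 2 like AND; the output neuron
# fires only when OR is on and AND is off.
def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2 - 0.5)    # 1st layer, 1st neuron (OR boundary)
    h2 = step(1 * x1 + 1 * x2 - 1.5)    # 1st layer, 2nd neuron (AND boundary)
    return step(1 * h1 - 2 * h2 - 0.5)  # 2nd layer combines the two boundaries

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))      # prints 0, 1, 1, 0
```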
Boundary Investigation for the XOR Problem
• 1st layer boundaries:
• 1st layer, 1st neuron → 1st decision boundary
• 1st layer, 2nd neuron → 2nd decision boundary
• 2nd layer combines the two boundaries together:
• 2nd layer, 1st neuron → combined boundary
Images from Hagan, Demuth & Beale
Can We Train Such a Network?
• There exists a 2-layer perceptron network capable of solving the XOR task!
• Rosenblatt and Widrow (who proposed the perceptron) were aware of this
• But how can we train such a network to learn from examples?
• Rosenblatt and Widrow were not able to successfully modify the
perceptron’s rule to train these more complex networks
• 1969 – book “Perceptrons” by Minsky and Papert
• Discusses the limitations of the perceptrons (can solve linearly separable problems only)
• “There is no reason to suppose that any of the virtues of perceptrons carry over to the many-layered version”
• => Most of the scientific community walked away from the field of NNs, and funding was suspended
Discovery of the Backpropagation Algorithm
An algorithm for training multi-layer perceptron networks. Proposed by Paul Werbos in 1974.
• His thesis presented the algorithm in the context of general networks, with NNs as a special case, and was not disseminated in the NN community
Rediscovered by Rumelhart, Hinton and Williams (1986); Parker (1985); Le Cun (1985)
• Book: Parallel Distributed Processing (1986) by Rumelhart and McClelland
• NNs trained with the backpropagation algorithm are the most widely used NNs
The backpropagation algorithm is also used to train deep NNs
Problem Statement – Supervised Learning:
• Given: A labeled training dataset
• Input vectors associated with target classes
• Task: Build a backpropagation NN that encapsulates this knowledge (i.e. maps the inputs to the target classes) and can be used predictively
• We would like the NN to predict well on new input vectors (unseen during training)
Training data:

sepal length  sepal width  petal length  petal width  iris type
5.1           3.5          1.4           0.2          Iris setosa
4.9           3.0          1.4           0.2          Iris setosa
6.4           3.2          4.5           1.5          Iris versicolor
6.9           3.1          4.9           1.5          Iris versicolor
6.3           3.3          6.0           2.5          Iris virginica
5.8           2.7          5.1           1.9          Iris virginica
...

Build the classifier (i.e. train the NN) from the training data, then get the prediction for new examples:

sepal length  sepal width  petal length  petal width  iris type
4.1           3.6          1.6           0.3          ?
5.9           3.1          2.4           0.1          ?
7.4           3.3          3.5           1.2          ?
Learning in NN (review)
• The “intelligence” of a NN is in the values of its weights
• The weights are the parameters of a NN
• We start from a random initialization and change them using a
learning rule to learn the input-output mapping as specified by the training data
• How to learn the input-output mapping using the backpropagation algorithm?
• We will formulate an error function (e.g. sum of squared errors between the target and actual output) and use a learning rule that minimizes this error
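As a small illustration (not from the slides), the sum of squared errors for one example could be computed as:

```python
# Sum of squared errors for one training example over all output neurons
targets = [1.0, 0.0, 0.0]     # desired outputs d_i (e.g. one-hot class "Iris setosa")
actual  = [0.8, 0.3, 0.1]     # actual network outputs a_i

E = 0.5 * sum((d - a) ** 2 for d, a in zip(targets, actual))
print(E)   # 0.5 * (0.04 + 0.09 + 0.01) = 0.07
```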
Neural Network Model
• Computational model consisting of 3 parts:
• 1) Architecture – neurons and connections
• Input, hidden, output neurons
• Fully or partially connected
• Neuron model – computation performed by each neuron, type of
transfer function
• Initialization of the weights
• 2) Learning algorithm
• How are the weights of the connections changed in order to
facilitate learning
• Goal for classification tasks: mapping between the input vectors
and their classes
• 3) Recall technique – once the NN training is completed, how is the information obtained from the trained NN?
• E.g. for classification tasks – how is the class of a new example
determined?
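For classification, the recall step is typically just a forward pass followed by picking the class with the highest output activation; a minimal sketch (class names and output values are made up):

```python
# Recall for classification: pick the class with the highest output activation
class_names = ["Iris setosa", "Iris versicolor", "Iris virginica"]
outputs = [0.12, 0.81, 0.35]            # activations of the 3 output neurons
predicted = class_names[outputs.index(max(outputs))]
print(predicted)                        # Iris versicolor
```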
Multi-layer Perceptron Network – Architecture
1) A network with 1 or more hidden layers
• e.g. a NN for the iris data:
(Diagram: 1 input neuron for each attribute – 4 input neurons; 1 hidden layer of hidden neurons; 1 output neuron for each class – 3 output neurons)
2) A feedforward network – each neuron receives input only from the neurons in the previous layer
3) A fully connected network (typically) – all neurons in a layer are connected with all neurons in the next layer
4) Its weights are initialized to small random values, e.g. [-1,1]
(Iris training data as in the earlier table: sepal length, sepal width, petal length, petal width → iris type)
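A minimal sketch of setting up such an architecture for the iris data, with weights and biases initialised to small random values in [-1, 1] (the hidden-layer size and array layout are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_outputs = 4, 3              # 4 attributes, 3 iris classes
n_hidden = (n_inputs + n_outputs) // 2  # heuristic used later: (inputs + output_neurons) / 2

# Fully connected feedforward network: weights and biases in [-1, 1]
W_hidden = rng.uniform(-1, 1, size=(n_hidden, n_inputs))   # input -> hidden
b_hidden = rng.uniform(-1, 1, size=n_hidden)
W_output = rng.uniform(-1, 1, size=(n_outputs, n_hidden))  # hidden -> output
b_output = rng.uniform(-1, 1, size=n_outputs)

print(W_hidden.shape, W_output.shape)   # (3, 4) (3, 3)
```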
Multi-layer Perceptron Network – Architecture 2
5) Each neuron (except the input neurons) computes the weighted sum of its inputs, and then applies a differentiable transfer function
a = f(wp+b)
• any differentiable transfer function f can be used, i.e. the derivative should exist for every point;
• most frequently used: sigmoid and tan-sigmoid (hyperbolic tangent sigmoid)
$a = \dfrac{e^n - e^{-n}}{e^n + e^{-n}}$ (tan-sigmoid)
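A small sketch of these two transfer functions and their derivatives (which the backpropagation algorithm will need later):

```python
import numpy as np

def sigmoid(n):
    """Log-sigmoid: a = 1 / (1 + e^-n); its derivative is a * (1 - a)."""
    return 1.0 / (1.0 + np.exp(-n))

def tansig(n):
    """Hyperbolic tangent sigmoid: a = (e^n - e^-n) / (e^n + e^-n)."""
    return np.tanh(n)

n = np.array([-2.0, 0.0, 2.0])
print(sigmoid(n), sigmoid(n) * (1 - sigmoid(n)))   # outputs in (0, 1)
print(tansig(n), 1 - tansig(n) ** 2)               # outputs in (-1, 1)
```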
Architecture – Number of Input Neurons
• Input neurons (=inputs) do not perform any computation (e.g. summation and transfer function); they just transmit data
• Numerical data – 1 input neuron for each attribute
(Diagram: the iris network – 4 input neurons labelled sepal length, sepal width, petal length and petal width, i.e. 1 input neuron for each attribute; 1 hidden layer of hidden neurons; 1 output neuron for each class)
• For numerical data, the input examples are directly fed into the network, e.g. 5.1, 3.5, 1.4 and 0.2 to each of the 4 input neurons (no need for encoding, see next slide)
Architecture – Number of Input Neurons (2)
• Categorical data – 1 input neuron for each attribute value
• How many input neurons for the weather data?
(Diagram: input neurons sunny, overcast, rainy | hot, mild, cool | high, normal | true, false for the attributes outlook, temperature, humidity and windy; hidden layer(s); output layer)
Input examples cannot be directly fed; need to be encoded
Encoding of the input examples – typically binary depending on the value of the attribute (on and off), i.e. one-hot encoding
e.g. ex. 1 (sunny, hot, high, false): 1 0 0  1 0 0  1 0  0 1 (see the sketch after the table below)
outlook   temp.  humidity  windy  play
sunny     hot    high      false  no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
...
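A sketch of this one-hot input encoding for the weather data (attribute values taken from the table above; function names are made up):

```python
# One-hot encoding of a categorical example: one input neuron per attribute value
attribute_values = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["true", "false"],
}

def encode(example):
    bits = []
    for attr, values in attribute_values.items():
        bits += [1 if example[attr] == v else 0 for v in values]
    return bits

ex1 = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
print(encode(ex1))   # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1] -> 10 input neurons
```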
Number of Output Neurons
• Typically 1 output neuron for each class
(Diagram: the weather-data network – 10 input neurons (sunny, overcast, rainy, hot, mild, cool, high, normal, true, false), hidden layer(s), 2 output neurons; ex. 1 is encoded as 1 0 0 1 0 0 1 0 0 1 with target class 1 0)
• The targets, i.e. classes of the examples, need to be encoded – typically binary encoding is used (one-hot encoding)
e.g. class1 (no): 1 0, class2 (yes): 0 1
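A matching sketch for encoding the targets (one output neuron per class; names are illustrative):

```python
# One-hot (1-of-n) encoding of the class labels: one output neuron per class
classes = ["no", "yes"]

def encode_target(label):
    return [1 if label == c else 0 for c in classes]

print(encode_target("no"), encode_target("yes"))   # [1, 0] [0, 1]
```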
Not examinable, for interested students only
More on Output Encoding
• How to encode the outputs and represent targets?
• Local encoding
• 1 output neuron
• different output values represent different classes, e.g. <0.2 – class 1, >0.8 – class 2, in between – ambiguous (class 3)
• Distributed (binary, 1-of-n) encoding – typically used in multi-class problems
• Number of outputs = number of classes
• Example: 3 classes, 3 output neurons; class 1 is represented as 1 0 0, class 2 – as 0 1 0 and class 3 – as 0 0 1
• Another representation of the targets: use 0.1 instead of 0 and 0.9
instead of 1
• Motivation for choosing distributed over local encoding
• Provides more degrees of freedom to represent the target function (n times as many weights available)
• The difference between the output with the highest value and the second highest can be used as a measure of how confident the prediction is (close values => ambiguous classification)
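A small sketch of that confidence measure (made-up output values):

```python
# Confidence of a prediction: gap between the highest and second-highest outputs
outputs = [0.05, 0.48, 0.52]
top_two = sorted(outputs, reverse=True)[:2]
confidence = top_two[0] - top_two[1]
print(confidence)   # 0.04 -> the two best classes are close, ambiguous prediction
```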
Number of Hidden Layers and Neurons in Them
• An art! Typically – by trial and error
• The given task (data) constrains the number of inputs and output neurons but
not the number of hidden layers and neurons in them
• Too many hidden layers and neurons (i.e. too many weights) – overfitting
• Too few – underfitting, i.e. the NN is not able to learn the input-output mapping
• A heuristic to start with: 1 hidden layer with n hidden neurons, n=(inputs+output_neurons)/2
Learning in Multi-Layer Perceptron NNs
• Supervised learning – the training data:
• consists of labeled examples (p, d), i.e. the desired output d for them is
given (p – input vector; d – desired output)
• can be viewed as a teacher during the training process
• error – the difference between the desired output d and the actual network output a
• Idea of backpropagation learning:
Repeat
  For each training example p:
• Propagate p through the network and calculate the output a. Compare the desired d with the actual output a and calculate the error.
• Update weights of the network to reduce the error
Until error over all examples < threshold
• Why "backpropagation"? Adjusts the weights backwards (from the output to the input neurons) by propagating the weight change
• How to calculate the weight change? $w_{pq}^{new} = w_{pq}^{old} + \Delta w_{pq}$

Error Function
• Sum of Squared Errors (E) is a classical measure of error
• e.g. E for a single training example, summed over all output neurons i:
$E = \frac{1}{2}\sum_i e_i^2 = \frac{1}{2}\sum_i (d_i - a_i)^2$
• $d_i$ – desired value for the output neuron i
• $a_i$ – actual NN value (i.e. the value predicted by the NN) for the output neuron i
• Thus, backpropagation learning can be viewed as an optimization search in the weight space
• Goal state – the set of weights for which the performance index (error E) is minimum
• Search method – hill climbing

Error Landscape in Weight Space
• E is a function of the weights (e.g. E is a function of w1 and w2)
• 1 state = 1 set of weights (w1 and w2 in this case)
• Several local minima and one global minimum
(Image: E as a function of w1 and w2)
• How to minimize the error? Take steps downhill!
• Will find a local minimum (the one that is closest to the starting position), i.e. not guaranteed to find the global minimum (except when there is only one minimum)
• How to get to the bottom as fast as possible? (i.e. we need to know which direction of movement will make the largest reduction in error)

Steepest Gradient Descent
• The direction of the steepest descent is called the gradient and can be computed ($dE/dw$)
• A function decreases most rapidly when the direction of movement is the direction of the negative of the gradient
• Hence, we want to adjust the weights so that the change moves the NN down the error surface in the direction of the locally steepest descent, given by the negative of the gradient: $\Delta w = -\eta \, \frac{\partial E}{\partial w}$
• $\eta$ – learning rate, defines the step size; typically in the range (0,1)
(Image from Dunham: the derivative gives the slope (gradient) of the error function for one weight; we want to find the weight where the slope (gradient) is 0)

Backpropagation Algorithm – Idea
• The backpropagation algorithm adjusts the weights by working backwards from the output layer to the input layer
• Calculates the error and propagates this error from layer to layer
• 2 approaches:
• Batch – the weights are adjusted once after all training examples are applied (the total error is calculated)
• Incremental – the weights are adjusted after each training example is applied
• Also called stochastic (or approximate) gradient descent
• This is the preferred method – it is usually faster than batch learning and finds better solutions (can escape local optima)
(Image from Hertz, Krogh & Palmer: solid lines – forward propagation of signals; dashed lines – backward propagation of error)

Backpropagation – Notation
• A neural network with one hidden layer
• Indexes for the neurons: k over input, j over hidden, i over output neurons
• E – error function that we would like to minimize after each training example (i.e. incremental mode):
$E = \frac{1}{2}\sum_i (d_i - o_i)^2$
• $d_i$ – target value of neuron i for the current example p
• $o_i$ – actual value of neuron i for the current example p
• Let's express E in terms of the weights of the network and the values of the current example!
• Recall that the weights are the parameters of the NN
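To make the notation concrete, here is a sketch (sizes and weight values are made up) that computes E for one example with a one-hidden-layer sigmoid network; the numbered comments correspond to steps 1-5 on the next slide:

```python
import numpy as np

def f(x):                                   # sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-x))

# One-hidden-layer network; indexes as above: k over inputs, j over hidden, i over outputs
o_k = np.array([1.0, 0.0, 0.0, 1.0])        # current example p (input activations)
d_i = np.array([1.0, 0.0])                  # its target vector

W_kj = np.full((3, 4), 0.1)                 # made-up weights: input k -> hidden j
b_j  = np.zeros(3)
W_ji = np.full((2, 3), 0.1)                 # hidden j -> output i
b_i  = np.zeros(2)

net_j = W_kj @ o_k + b_j                    # 1. input of hidden neuron j
o_j   = f(net_j)                            # 2. activation of hidden neuron j
net_i = W_ji @ o_j + b_i                    # 3. input of output neuron i
o_i   = f(net_i)                            # 4. activation of output neuron i

E = 0.5 * np.sum((d_i - o_i) ** 2)          # 5. error for the current example
print(E)                                    # E depends only on the example and the weights
```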
Expressing E in Terms of the NN Weights and Input Example
1. Input of the hidden neuron j for example p:
$net_j = \sum_k w_{kj}\, o_k + b_j$
2. Activation of neuron j as a function of its input:
$o_j = f(net_j) = f\big(\sum_k w_{kj}\, o_k + b_j\big)$
3. Input of the output neuron i:
$net_i = \sum_j w_{ji}\, o_j + b_i = \sum_j w_{ji}\, f\big(\sum_k w_{kj}\, o_k + b_j\big) + b_i$
4. Output of the output neuron i:
$o_i = f(net_i) = f\big(\sum_j w_{ji}\, o_j + b_i\big) = f\big(\sum_j w_{ji}\, f\big(\sum_k w_{kj}\, o_k + b_j\big) + b_i\big)$
5. Substituting 4 into E:
$E = \frac{1}{2}\sum_i (d_i - o_i)^2 = \frac{1}{2}\sum_i \Big(d_i - f\big(\sum_j w_{ji}\, f\big(\sum_k w_{kj}\, o_k + b_j\big) + b_i\big)\Big)^2$

Backpropagation – Derivation
• Steepest gradient descent: adjust the weights proportionally to the negative of the error gradient
• For a weight $w_{ji}$ to an output neuron (using the chain rule):
$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial net_i}\frac{\partial net_i}{\partial w_{ji}} = -\eta \cdot \frac{1}{2} \cdot 2\,(d_i - o_i)(-1)\, f'(net_i)\, o_j = \eta\,(d_i - o_i)\, f'(net_i)\, o_j = \eta\, \delta_i\, o_j$,
where $\delta_i = (d_i - o_i)\, f'(net_i)$
• For a weight $w_{kj}$ to a hidden neuron:
$\Delta w_{kj} = -\eta \frac{\partial E}{\partial w_{kj}} = -\eta \frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial w_{kj}} = \eta \sum_i (d_i - o_i)\, f'(net_i)\, w_{ji} \cdot f'(net_j)\, o_k = \eta\, \Big(\sum_i \delta_i\, w_{ji}\Big)\, f'(net_j)\, o_k = \eta\, \delta_j\, o_k$,
where $\delta_j = f'(net_j) \sum_i \delta_i\, w_{ji}$

Backpropagation Rule – Summary
• $w_{pq}(t)$ – weight from node p to node q at time t
• $w_{pq}(t+1) = w_{pq}(t) + \Delta w_{pq}$, with weight change $\Delta w_{pq} = \eta\, \delta_q\, o_p$
• The weight change is proportional to the output activation of neuron p and the error $\delta$ of neuron q
• $\delta_q$ is calculated in 2 different ways:
• q is an output neuron: $\delta_q = (d_q - o_q)\, f'(net_q)$
• q is a hidden neuron: $\delta_q = f'(net_q) \sum_i w_{qi}\, \delta_i$
where $f'(net_q)$ is the derivative of the activation function used in neuron q with respect to its input $net_q$, and i runs over the nodes in the layer above q

Exercise
Given $f(x) = \frac{1}{1+e^{-x}}$, show that $f'(x) = f(x)\,(1 - f(x))$.

Solution
$f'(x) = \frac{(1)'\,(1+e^{-x}) - (1+e^{-x})' \cdot 1}{(1+e^{-x})^2} = \frac{0 - (-1)\,e^{-x}}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2} \quad (1)$
$f(x)\,(1 - f(x)) = \frac{1}{1+e^{-x}}\Big(1 - \frac{1}{1+e^{-x}}\Big) = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} \quad (2)$
As (1) = (2), $f'(x) = f(x)\,(1 - f(x))$.
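A quick numerical sanity check of this identity (not part of the slides), comparing the analytic derivative with a finite-difference estimate:

```python
import math

def f(x):                      # sigmoid
    return 1.0 / (1.0 + math.exp(-x))

# Compare the analytic derivative f(x) * (1 - f(x)) with a central-difference estimate
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    analytic = f(x) * (1.0 - f(x))
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(x, analytic, numeric)   # the two values agree to about 6 decimal places
```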
Complete Rule for Sigmoid Neurons
• From the formulas for $\delta$ => we must be able to calculate the derivative of f. For a sigmoid transfer function
$f(net_m) = o_m = \frac{1}{1+e^{-net_m}}$
the derivative is (the same derivation as in the exercise on the previous slide, just with different notation):
$f'(net_m) = \frac{\partial o_m}{\partial net_m} = \frac{\partial}{\partial net_m}\Big(\frac{1}{1+e^{-net_m}}\Big) = \frac{e^{-net_m}}{(1+e^{-net_m})^2} = o_m\,(1 - o_m)$
• Thus, the backpropagation errors for a network with sigmoid transfer functions are:
• q is an output neuron: $\delta_q = (d_q - o_q)\, o_q\,(1 - o_q)$
• q is a hidden neuron: $\delta_q = o_q\,(1 - o_q) \sum_i w_{qi}\, \delta_i$
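A sketch of these two δ formulas in code (the activations, targets and weights are made-up numbers):

```python
import numpy as np

# Made-up activations, targets and hidden->output weights for a 2-2 slice of a network
o_hidden = np.array([0.6, 0.3])          # o_q for two hidden neurons
o_out    = np.array([0.8, 0.1])          # o_q for two output neurons
d        = np.array([1.0, 0.0])          # target vector for the current example
W_ho     = np.array([[0.4, -0.2],        # w_qi: weight from hidden neuron q to output i
                     [0.1,  0.5]])

# Output neurons: delta_q = (d_q - o_q) * o_q * (1 - o_q)
delta_out = (d - o_out) * o_out * (1 - o_out)

# Hidden neurons: delta_q = o_q * (1 - o_q) * sum_i w_qi * delta_i
delta_hidden = o_hidden * (1 - o_hidden) * (W_ho @ delta_out)

print(delta_out, delta_hidden)
```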
Backpropagation Algorithm – Summary (1)
1. Determine the architecture of the network
– how many input and output neurons; what output encoding
– how many hidden neurons and layers
2. Initialize all weights (biases incl.) to small random values, typically [-1,1]
3. Repeat until termination criterion satisfied:
(forward pass)
– Present a training example and propagate it through the network to calculate the actual output of the network
(backward pass)
– Compute the error (the δ values for the output neurons)
– Starting with the output layer, repeat for each layer in the network:
  – propagate the δ values back to the previous layer
  – update the weights between the two layers
Backpropagation Algorithm – Summary (2): Termination Criteria
1) The error on the training data is below a threshold
–This requires all training examples to be propagated again and the total
error to be calculated (as in the perceptron)
–Threshold is set heuristically, e.g. 0.1
– Different types of error may be used, e.g. mean square or mean absolute or
even accuracy (% of correctly classified examples)
2) Maximum number of epochs is reached
Instead of 1) or 1)+2) we can use an alternative stopping criterion called early stopping, which uses a validation set
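Putting the whole algorithm together, here is a minimal sketch of incremental backpropagation with sigmoid units trained on the XOR data from earlier; the network size, learning rate, threshold and epoch limit are illustrative choices, and a different random seed may converge to a different (possibly local) minimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training data
P = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([0.0, 1.0, 1.0, 0.0])

# 2. Initialise weights (biases incl.) to small random values in [-1, 1]
W1 = rng.uniform(-1, 1, size=(2, 2))   # input -> hidden
b1 = rng.uniform(-1, 1, size=2)
W2 = rng.uniform(-1, 1, size=2)        # hidden -> output
b2 = rng.uniform(-1, 1)

eta = 0.5                              # learning rate

for epoch in range(20000):             # 2) stop after a maximum number of epochs
    for p, d in zip(P, D):
        # forward pass
        o_h = sigmoid(W1 @ p + b1)     # hidden activations
        o   = sigmoid(W2 @ o_h + b2)   # output activation

        # backward pass: deltas for sigmoid neurons
        delta_o = (d - o) * o * (1 - o)
        delta_h = o_h * (1 - o_h) * (W2 * delta_o)

        # weight updates: delta_w = eta * delta_q * o_p
        W2 += eta * delta_o * o_h
        b2 += eta * delta_o
        W1 += eta * np.outer(delta_h, p)
        b1 += eta * delta_h

    # 1) stop when the total error over all training examples is below a threshold
    E = 0.5 * sum((d - sigmoid(W2 @ sigmoid(W1 @ p + b1) + b2)) ** 2
                  for p, d in zip(P, D))
    if E < 0.01:
        break

for p in P:
    print(p, round(float(sigmoid(W2 @ sigmoid(W1 @ p + b1) + b2)), 2))
```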