COMP9414: Artificial Intelligence
Lecture 9a: Neural Networks
Wayne Wobcke
UNSW ©W. Wobcke et al. 2019–2021

This Lecture
• Neurons – Biological and Artificial
• Perceptron Learning
• Linear Separability
• Multi-Layer Networks
• Backpropagation
• Application – ALVINN
Brain Regions
Sub-Symbolic Processing
Structure of a Typical Neuron
Brain Functions
Variety of Neuron Types
Biological Neurons
The brain is made up of neurons (nerve cells), which have
• A cell body (soma)
• Dendrites (inputs)
• An axon (outputs)
• Synapses (connections between cells)
Synapses can be excitatory or inhibitory and may change over time
When the inputs reach some threshold, an action potential (electrical pulse) is sent along the axon to the outputs
Artificial Neural Networks
(Artificial) Neural Networks are made up of nodes which have
• Input edges, each with some weight
• Output edges (with weights)
• An activation level (a function of the inputs)
Weights can be positive or negative and may change over time (learning)
The input function is the weighted sum of the activation levels of the inputs
The activation level is a non-linear transfer function g of this input:
activation_i = g(s_i) = g(Σ_j w_ij x_j)
Some nodes are inputs (sensing), some are outputs (action)
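As a small illustration of the formula above (not part of the original slides), a node's activation can be computed as below; the weights, inputs and choice of g here are arbitrary examples:

```python
# Sketch of a single node: activation_i = g(sum_j w_ij * x_j)
def node_activation(weights, inputs, g):
    s = sum(w * x for w, x in zip(weights, inputs))   # weighted sum of inputs
    return g(s)

# Example with a step transfer function and made-up weights/inputs
step = lambda s: 1 if s >= 0 else 0
print(node_activation([0.5, -0.3, 0.1], [1.0, 2.0, -1.0], step))
```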
The Big Picture
• Human brain has 100 billion neurons with an average of 10,000 synapses each
• Latency is about 3–6 milliseconds
• Therefore, at most a few hundred "steps" in any mental computation, but massively parallel

McCulloch & Pitts Model of a Single Neuron

[Figure: inputs x1, x2 with weights w1, w2 and a bias input 1 with weight w0 = −θ, summed to s and passed through the transfer function g to give g(s)]

s = w1 x1 + w2 x2 − θ = w1 x1 + w2 x2 + w0

x1, x2 are inputs
w1, w2 are synaptic weights
θ is a threshold
w0 is a bias weight
g is the transfer function

Transfer Function

Originally, a (discontinuous) step function

g(s) = 1 if s ≥ 0
       0 if s < 0

Later, other transfer functions which are continuous and smooth

Linear Separability

Question: What kind of functions can a perceptron compute?
Answer: Linearly separable functions

Examples
AND  w1 = w2 = 1.0, w0 = −1.5
OR   w1 = w2 = 1.0, w0 = −0.5
NOR  w1 = w2 = −1.0, w0 = 0.5

Question: How can we train a perceptron to learn a new function?

Perceptron Learning Rule

Adjust the weights as each input is presented

Recall s = w1 x1 + w2 x2 + w0

If g(s) = 0 but should be 1:
  wk ← wk + η xk
  w0 ← w0 + η
  so s ← s + η (1 + Σk xk²)

If g(s) = 1 but should be 0:
  wk ← wk − η xk
  w0 ← w0 − η
  so s ← s − η (1 + Σk xk²)

Otherwise, weights are unchanged
(η > 0 is called the learning rate)

Theorem: This will eventually learn to classify the data correctly, as long as they are linearly separable.
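As a concrete illustration of the perceptron learning rule above, here is a minimal Python sketch. The training points and their 0/1 labels are assumptions inferred from the weight updates in the worked example that follows; the learning rate and initial weights match that example.

```python
# Minimal sketch of the perceptron learning rule from these slides.
# Data points and labels are assumed (inferred from the worked example below).

def step(s):
    return 1 if s >= 0 else 0

def train_perceptron(data, eta=0.1, w1=0.2, w2=0.0, w0=-0.1, max_epochs=100):
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in data:
            out = step(w1 * x1 + w2 * x2 + w0)
            if out == 0 and target == 1:        # output 0, should be 1: increase
                w1, w2, w0 = w1 + eta * x1, w2 + eta * x2, w0 + eta
                errors += 1
            elif out == 1 and target == 0:      # output 1, should be 0: decrease
                w1, w2, w0 = w1 - eta * x1, w2 - eta * x2, w0 - eta
                errors += 1
        if errors == 0:                         # all points classified correctly
            break
    return w1, w2, w0

data = [((1, 1), 0), ((2, 1), 1), ((1.5, 0.5), 1), ((2, 2), 0)]
print(train_perceptron(data))
```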

Perceptron Learning Example

[Figure: a two-input perceptron: inputs x1, x2 with weights w1, w2, a bias input 1 with weight w0, summed and passed through a (+/−) threshold]

Output is 1 (+) if w1 x1 + w2 x2 + w0 ≥ 0

learning rate η = 0.1
begin with random weights
w1 = 0.2
w2 = 0.0
w0 = −0.1

Training Step 1

[Figure: the point (1,1) and the current decision line in the (x1, x2) plane]

1st point (1,1): 0.2 x1 + 0.0 x2 − 0.1 ≥ 0, so it is classified 1 but should be 0
w1 ← w1 − η x1 = 0.1
w2 ← w2 − η x2 = −0.1
w0 ← w0 − η = −0.2

Training Step 2

[Figure: the point (2,1) and the current decision line in the (x1, x2) plane]

2nd point (2,1): 0.1 x1 − 0.1 x2 − 0.2 < 0, so it is classified 0 but should be 1
w1 ← w1 + η x1 = 0.3
w2 ← w2 + η x2 = 0.0
w0 ← w0 + η = −0.1

Training Step 3

[Figure: the points (1.5, 0.5) and (2,2) with the decision lines before and after the update]

3rd point (1.5, 0.5): 0.3 x1 + 0.0 x2 − 0.1 ≥ 0
3rd point correctly classified, so no change

4th point (2,2): 0.3 x1 + 0.0 x2 − 0.1 ≥ 0, so it is classified 1 but should be 0
w1 ← w1 − η x1 = 0.1
w2 ← w2 − η x2 = −0.2
w0 ← w0 − η = −0.2

giving 0.1 x1 − 0.2 x2 − 0.2 ≥ 0

Final Outcome

[Figure: the training data in the (x1, x2) plane with the final decision line]

Eventually, all the data will be correctly classified (provided it is linearly separable)

Limitations

Problem: Many useful functions are not linearly separable (e.g. XOR)

[Figure: (a) AND, (b) OR, (c) XOR plotted over inputs I1, I2 — AND and OR can be separated by a line, XOR cannot]

Possible solution
x1 XOR x2 can be written as (x1 AND x2) NOR (x1 NOR x2)
Recall that AND, OR and NOR can be implemented by perceptrons

Multi-Layer Neural Networks

[Figure: a two-layer network computing XOR from an AND unit (weights +1, +1, bias −1.5) and NOR units (weights −1, −1, bias +0.5)]

Question: Given an explicit logical function, we can design a multi-layer neural network by hand to compute that function – but if we are just given a set of training data, can we train a multi-layer network to fit this data?
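To make the hand-designed construction concrete, here is a small sketch (not from the slides) that wires up XOR as (x1 AND x2) NOR (x1 NOR x2), using step-function units with the AND and NOR weights given on the Linear Separability slide:

```python
# Hand-wired two-layer network for XOR, built from step-function perceptrons
# using the AND and NOR weights from the Linear Separability slide.

def step(s):
    return 1 if s >= 0 else 0

def unit(w1, w2, w0, x1, x2):
    return step(w1 * x1 + w2 * x2 + w0)

def xor(x1, x2):
    a = unit(1.0, 1.0, -1.5, x1, x2)     # x1 AND x2
    n = unit(-1.0, -1.0, 0.5, x1, x2)    # x1 NOR x2
    return unit(-1.0, -1.0, 0.5, a, n)   # a NOR n == x1 XOR x2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))
```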

Historical Context

In 1969, Minsky and Papert published a book highlighting the limitations of Perceptrons, and lobbied various funding agencies to redirect funding away from neural network research, preferring instead logic-based methods such as expert systems.

It was known as far back as the 1960s that any given logical function could be implemented in a 2-layer neural network with step function activations. But the question of how to learn the weights of a multi-layer neural network based on training examples remained an open problem. The solution, which we describe in the next section, was found in 1976 by Paul Werbos, but did not become widely known until it was rediscovered in 1986 by Rumelhart, Hinton and Williams.

Training as Cost Minimization

Define an error function E to be (half) the sum over all input patterns of the square of the difference between actual output and desired output

E = 1/2 Σ (z − t)²

If we think of E as "height", this gives an error "landscape" on the weight space. The aim is to find a set of weights for which E is very low.

Local Search in Weight Space

Problem: Because of the step function, the landscape will not be smooth but will instead consist almost entirely of flat local regions and "shoulders", with occasional discontinuous jumps

Key Idea

[Figure: (a) step function, (b) sign function, (c) sigmoid function, (d) hyperbolic tangent]

Replace the (discontinuous) step function with a differentiable function, such as the sigmoid

g(s) = 1 / (1 + e^(−s))

or hyperbolic tangent

g(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s)) = 2 (1 / (1 + e^(−2s))) − 1

Gradient Descent

Recall that the error function E is (half) the sum over all input patterns of the square of the difference between actual output and desired output

E = 1/2 Σ (z − t)²

The aim is to find a set of weights for which E is very low. If the functions are smooth, use multi-variable calculus to define how to adjust the weights so that the error moves in the steepest downhill direction

w ← w − η ∂E/∂w

Parameter η is called the learning rate
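As a minimal illustration of the update w ← w − η ∂E/∂w (a sketch, not from the slides), here is gradient descent on the squared error of a single sigmoid unit with one weight; the input, target, initial weight and learning rate are made-up values:

```python
# Gradient descent on E = 1/2 * (z - t)^2 for a single sigmoid unit z = g(w*x).
# One weight, one training pattern; all numeric values are illustrative only.
import math

def g(s):
    return 1.0 / (1.0 + math.exp(-s))     # sigmoid transfer function

x, t = 1.5, 0.2        # input and desired output (assumed)
w, eta = 0.8, 0.5      # initial weight and learning rate (assumed)

for _ in range(1000):
    z = g(w * x)                          # forward pass
    dE_dw = (z - t) * z * (1 - z) * x     # chain rule: dE/dz * dz/ds * ds/dw
    w = w - eta * dE_dw                   # steepest-descent update
print(w, g(w * x))                        # z should now be close to t = 0.2
```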

Chain Rule

If y = y(u) and u = u(x), then

∂y/∂x = ∂y/∂u · ∂u/∂x

This principle can be used to compute the partial derivatives in an efficient and localized manner. Note that the transfer function must be differentiable (usually sigmoid or tanh).

Note: if z(s) = 1 / (1 + e^(−s)) then z′(s) = z (1 − z)
      if z(s) = tanh(s) then z′(s) = 1 − z²

Forward Pass

[Figure: a two-layer network with inputs x1, x2, hidden units u1/y1, u2/y2 (weights w11, w12, w21, w22, biases b1, b2) and output s/z (weights v1, v2, bias c)]

u1 = b1 + w11 x1 + w12 x2
y1 = g(u1)
s = c + v1 y1 + v2 y2
z = g(s)
E = 1/2 Σ (z − t)²

Backpropagation Partial Derivatives

∂E/∂z = z − t
dz/ds = g′(s) = z (1 − z)
∂s/∂y1 = v1
dy1/du1 = y1 (1 − y1)

Useful notation:
δ_out = ∂E/∂s,  δ1 = ∂E/∂u1,  δ2 = ∂E/∂u2

Then
δ_out = (z − t) z (1 − z)
∂E/∂v1 = δ_out y1
δ1 = δ_out v1 y1 (1 − y1)
∂E/∂w11 = δ1 x1

Partial derivatives can be calculated efficiently by backpropagating deltas through the network (a code sketch of this forward and backward pass is given at the end of these notes)

Neural Networks – Applications

• Autonomous Driving
• Game Playing
• Credit Card Fraud Detection
• Handwriting Recognition
• Financial Prediction

ALVINN

• Autonomous Land Vehicle In a Neural Network
• Later version included a sonar range finder
  ◮ 8×32 range finder input retina
  ◮ 29 hidden units
  ◮ 45 output units
• Supervised Learning, from human actions (Behavioural Cloning)
  ◮ Additional "transformed" training items to cover emergency situations
• Drove autonomously from coast to coast across the US

Training Tips

• Rescale inputs and outputs to be in the range 0 to 1 or −1 to 1
• Initialize weights to very small random values
• On-line or batch learning
• Three different ways to prevent overfitting
  ◮ Limit the number of hidden nodes or connections
  ◮ Limit the training time, using a validation set
  ◮ Weight decay
• Adjust learning rate (and momentum) to suit the particular task

Summary

• Neural networks are biologically inspired
• Multi-layer networks can learn non-linearly separable functions
• Backpropagation is effective and widely used
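The sketch below (an illustration, not the original lecture code) implements the forward pass and the backpropagated deltas exactly as defined on the Forward Pass and Backpropagation slides, for the 2-2-1 sigmoid network and a single training pattern; all numeric values are made-up assumptions.

```python
# Forward and backward pass for the 2-2-1 sigmoid network on the slides.
# Notation follows the slides (u, y, s, z, deltas); numbers are illustrative.
import math

def g(s):
    return 1.0 / (1.0 + math.exp(-s))    # sigmoid: g'(s) = g(s) * (1 - g(s))

w = [[0.1, -0.2], [0.4, 0.3]]   # hidden weights w[i][j]
b = [0.05, -0.05]               # hidden biases b1, b2
v = [0.2, -0.1]                 # output weights v1, v2
c = 0.0                         # output bias
eta = 0.5                       # learning rate

x = [1.0, 0.0]   # input pattern (assumed)
t = 1.0          # target output (assumed)

# Forward pass
u = [b[i] + w[i][0] * x[0] + w[i][1] * x[1] for i in range(2)]
y = [g(u[i]) for i in range(2)]
s = c + v[0] * y[0] + v[1] * y[1]
z = g(s)
E = 0.5 * (z - t) ** 2

# Backward pass: backpropagate the deltas
delta_out = (z - t) * z * (1 - z)                                  # dE/ds
delta = [delta_out * v[i] * y[i] * (1 - y[i]) for i in range(2)]   # dE/du_i

# Gradient descent updates: each weight moves by -eta * (its partial derivative)
c -= eta * delta_out
v = [v[i] - eta * delta_out * y[i] for i in range(2)]
b = [b[i] - eta * delta[i] for i in range(2)]
w = [[w[i][j] - eta * delta[i] * x[j] for j in range(2)] for i in range(2)]

print(E, z)
```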