Neural Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2021
Acknowledgements
Material derived from slides for the book "The Elements of Statistical Learning (2nd Ed.)" by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book "Machine Learning: A Probabilistic Perspective" by K. Murphy. MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book "Machine Learning" by P. Flach. Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book "Bayesian Reasoning and Machine Learning" by D. Barber. Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book "Machine Learning" by T. Mitchell. McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course "Machine Learning" by A. Srinivasan. BITS Pilani, Goa, India (2016)
Aims
This lecture will enable you to describe and reproduce machine learning approaches to the problem of neural (network) learning. Following it you should be able to:
• describe Perceptrons and how to train them
• relate neural learning to optimization in machine learning
• outline the problem of neural learning
• derive the method of gradient descent for linear models
• describe the problem of non-linear models with neural networks
• outline the method of back-propagation training of a multi-layer perceptron neural network
• describe the application of neural learning for classification
• describe some issues arising when training deep networks
Neural Learning
Introduction
• Neural Learning based on Artificial Neural Networks (ANNs)
• "inspired by" Biological Neural Networks (BNNs) . . .
  • but structures and learning methods are different
  • ANNs ≠ BNNs
• ANNs based on simple logical model¹ of biological neuron
  • the "Perceptron"
• ANNs are the basis of Deep Learning (DL)
  • entire course on Neural Networks/Deep Learning (COMP9444)
  • we focus on neural learning in relation to methods in ML
  • DL ≠ ML
  • DL ≠ AI!
¹ See: McCulloch and Pitts (1943).
Neural Learning
Artificial Neural Networks
Main ideas we will cover:
• Logical threshold units – i.e., Perceptrons
• Loss function for a Perceptron (review)
• Convergence theorem for a Perceptron
• Loss function for a Linear Unit (unthresholded Perceptron)
• Gradient descent for a Linear Unit
• Multilayer networks
• Backpropagation: gradient descent for multilayer networks
Neural Learning
Connectionist Models
Consider humans:
• Neuron switching time ≈ 0.001 second
• Number of neurons ≈ 10^10
• Connections per neuron ≈ 10^4–10^5
• Scene recognition time ≈ 0.1 second
• 100 inference steps doesn’t seem like enough
→ much parallel computation
Neural Learning
Connectionist Models
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process of learning
• Emphasis on tuning weights automatically
• ANNs learn distributed representations of target function
Neural Learning
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output can be discrete or real-valued
• Output can be a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Applications:
• Speech recognition (now the standard method)
• Image classification (also now the standard method)
• many others . . .
Perceptrons
What are perceptrons ?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
Perceptrons
What are perceptrons ?
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Image source: http://en.wikipedia.org/w/index.php?curid=47541432
Perceptrons
What are perceptrons ?
Perceptron
Output o is the thresholded sum of products of inputs and their weights:
$$o(x_1, \ldots, x_n) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
Perceptrons
What are perceptrons ?
Perceptron
Or in vector notation:
$$o(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$
Perceptrons
How to train
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier yˆ = sign(w · x).
w ← 0 // Other initialisations of the weight vector are possible
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi w · xi ≤ 0 then // i.e., yˆi ≠ yi
            w ← w + η yi xi
            converged ← false // We changed w so haven't converged yet
        end
    end
end
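A minimal Python sketch of this training loop, assuming NumPy, data already in homogeneous coordinates (a constant 1 prepended to each x), labels in {−1, +1}, and an epoch cap added for safety; the function name and defaults are illustrative, not part of the slides.

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Perceptron training for linear classification.

    X: (n, d) array in homogeneous coordinates (first column all 1s).
    y: (n,) array of labels in {-1, +1}.
    Returns a weight vector w defining the classifier sign(w . x).
    """
    w = np.zeros(X.shape[1])                  # other initialisations are possible
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:       # i.e., predicted label != yi
                w = w + eta * yi * xi         # update on a misclassified example
                converged = False
        if converged:
            break
    return w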
Perceptrons
How to train
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
Perceptrons
How to train
Perceptron Convergence
Assume:
• Dataset D = {(x1, y1), . . . , (xn, yn)}
• At least one example in D is labelled +1, and one is labelled −1
• R = maxi ||xi||2
• A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i yi w∗ · xi ≥ γ

Perceptron Convergence Theorem (Novikoff, 1962)
The number of mistakes made by the perceptron is at most (R/γ)².
γ is typically referred to as the "margin".
Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
(Figure: two datasets of + and − examples in the (x1, x2) plane — (a) a linearly separable arrangement, and (b) an XOR-like arrangement that no single line can separate.)
Represents some useful functions
• What weights represent o(x1, x2) = AND(x1, x2)?
• What weights represent o(x1, x2) = XOR(x1, x2)?
Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
Unfortunately, as a linear classifier perceptrons are limited in expressive power
So some functions not representable
• e.g., Boolean function XOR is not linearly separable
For non-linearly separable data we’ll need something else
Fortunately, with fairly minor modifications many perceptrons can be combined together to form one model
• multilayer perceptrons, the classic “neural network”
Optimization
Studied in many fields such as engineering, science, economics, . . .
A general optimization algorithm:²
1 start with initial point x = x0
2 select a search direction p, usually to decrease f(x)
3 select a step length η
4 set s = ηp
5 set x = x + s
6 go to step 2, unless convergence criteria are met
For example, could minimize a real-valued function f : R^n → R.
Note: convergence criteria will be problem-specific.
² B. Ripley (1996) "Pattern Recognition and Neural Networks", CUP.
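As an illustration only (not from the slides), a minimal Python sketch of this generic scheme, minimising a simple quadratic f : R² → R; the gradient-norm test is one possible problem-specific convergence criterion, and all names are hypothetical.

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=1000):
    """Generic descent: repeat x <- x + s with s = eta * p and p = -grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = -grad_f(x)                   # search direction chosen to decrease f
        if np.linalg.norm(p) < tol:      # convergence criterion (problem-specific)
            break
        x = x + eta * p                  # take a step of length eta along p
    return x

# Example: minimise f(x) = (x0 - 3)^2 + (x1 + 1)^2, whose gradient is 2(x - [3, -1])
x_min = gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0])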
Optimization
Usually, we would like the optimization algorithm to quickly reach an answer that is close to being the right one.
• typically, need to minimize a function
  • e.g., error or loss
  • optimization is known as gradient descent or steepest descent
• sometimes, need to maximize a function
  • e.g., probability or likelihood
  • optimization is known as gradient ascent or steepest ascent
Requires function to be differentiable.
Optimization
Perceptron learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the component-wise weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
Optimization
Perceptron learning
Let
∆wi = η(t − o)xi
Where:
• t = c(x) is target value in {0, 1}
• o is perceptron output in {0, 1}
• η is a small constant called learning rate
• learning rate is a positive number, typically between 0 and 1
• to simplify things we sometimes assume η = 1
• but in practice usually set at less than 0.2, e.g., 0.1
• η can be varied during learning
Unfortunately, the output o is discontinuous, so not differentiable.
Training a Linear Unit by Gradient Descent
Gradient Descent
Consider linear unit, where
o = w0 + w1x1 + · · · + wnxn
Let's learn wi's that minimize the squared error (loss function)
$$E[\mathbf{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where D is the set of training examples
Training a Linear Unit by Gradient Descent
Gradient Descent
(Figure: the error surface E[w] plotted over weights w0 and w1 — a smooth surface with a single global minimum.)
Training a Linear Unit by Gradient Descent
Gradient Descent
Gradient: derivative of E wrt each component of weight vector w
$$\nabla E[\mathbf{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]$$
The gradient vector gives the direction of steepest increase in error E.
The negative of the gradient, i.e., steepest decrease, is what we want.
Training rule:
$$\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$$
i.e., component-wise:
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
Training a Linear Unit by Gradient Descent
Derivation of Gradient Descent for Linear Unit
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_{d} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d) \\
\frac{\partial E}{\partial w_i} &= \sum_{d} (t_d - o_d)(-x_{i,d})
\end{aligned}$$
Training a Linear Unit by Gradient Descent
Gradient-Descent(training examples, η)
Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each wi to some small random value
• Until the termination condition is met, Do
  • Initialize each ∆wi to zero
  • For each ⟨x, t⟩ in training examples, Do
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight wi: ∆wi ← ∆wi + η(t − o)xi
  • For each linear unit weight wi: wi ← wi + ∆wi
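A possible NumPy rendering of this batch procedure (the function name, the fixed epoch count and the bias-as-first-input convention are assumptions for illustration): Δw is accumulated over all examples before the weights are updated.

import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=100, seed=0):
    """Batch gradient descent for a linear unit o = w . x (bias as X[:, 0] = 1)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    for _ in range(epochs):                       # termination: a fixed number of passes
        delta_w = np.zeros_like(w)
        for xi, ti in zip(X, t):
            o = np.dot(w, xi)                     # compute the linear output
            delta_w += eta * (ti - o) * xi        # accumulate the update
        w = w + delta_w                           # update once per pass over D
    return w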
Training a Linear Unit by Gradient Descent
Training Perceptron vs. Linear unit
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent (an optimization method)
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Training a Linear Unit by Gradient Descent
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until termination condition is satisfied
• Compute the gradient ∇ED[w]
• w ← w − η∇ED[w]
Incremental mode (Stochastic) Gradient Descent:
Do until satisfied
• For each training example d in D
  • Compute the gradient ∇Ed[w]
  • w ← w − η∇Ed[w]
Training a Linear Unit by Gradient Descent
Incremental (Stochastic) Gradient Descent
Batch:
$$E_D[\mathbf{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
Incremental:
$$E_d[\mathbf{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$
Incremental or Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if η is made small enough.
Very useful for training large networks (mini-batches), or online learning from data streams.
Stochastic implies examples should be selected at random.
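For contrast with the batch sketch earlier, a minimal incremental variant (again an illustrative sketch, not the course's code): the weights are updated after every single example, visited in random order.

import numpy as np

def train_linear_unit_sgd(X, t, eta=0.05, epochs=100, seed=0):
    """Stochastic gradient descent for a linear unit: per-example updates."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):     # visit examples in random order
            o = np.dot(w, X[i])
            w = w + eta * (t[i] - o) * X[i]   # immediate update on this example
    return w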
The Multi-layer Perceptron
Multilayer Networks of Sigmoid Units
The Multi-layer Perceptron
A Multilayer Perceptron for Speech Recognition: Model
(Figure: a network whose two inputs are the formant frequencies F1 and F2, with one output per vowel class — the vowels in "head", "hid", "who'd", "hood", . . . )
The Multi-layer Perceptron
A Multilayer Perceptron for Speech Recognition: Decision Boundaries
The Multi-layer Perceptron
ALVINN drives 70 mph on highways
The Multi-layer Perceptron
ALVINN
(Figure: the ALVINN network — a 30×32 sensor input retina feeding 4 hidden units and 30 output units covering steering directions from sharp left through straight ahead to sharp right.)
The Multi-layer Perceptron
Sigmoid Unit
Same as a perceptron except that the step function has been replaced by a smoothed version, a sigmoid function.
Note: in practice, particularly for deep networks, sigmoid functions are much less common than other non-linear activation functions that are easier to train.
For example, the default activation function for deep networks is the Rectified Linear Unit (ReLU) or variants.
However, sigmoids are mathematically convenient.
The Multi-layer Perceptron
Sigmoid Unit
Why use the sigmoid function σ(x)?
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Nice property:
$$\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$
We can derive gradient descent rules to train
• One sigmoid unit
• Multi-layer networks of sigmoid units → Backpropagation
We will use this to derive Backpropagation to train a Multi-layer Perceptron (MLP)
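For reference, a trivial NumPy sketch of the sigmoid and its derivative (names are illustrative); the derivative reappears in the backpropagation code later.

import numpy as np

def sigmoid(x):
    """The logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)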
The Multi-layer Perceptron
Notation:
• xji = the ith input to unit j
• wji = weight associated with the ith input to unit j
• netj = Σi wji xji (the weighted sum of inputs for unit j)
• oj = output computed by unit j
• tj = the target output for unit j
• σ = the sigmoid function
• outputs = the set of units in the final layer of the network
• Downstream(j) = the set of units whose immediate inputs include the output of unit j
The Multi-layer Perceptron
Derivation of SGD Training for MLP
Stochastic gradient descent means we need to descend the gradient of the error Ed with respect to each training example d ∈ D.
Update each weight wji by adding to it ∆wji, where
$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$$
Ed is the error on example d, summed over all output units in the network:
$$E_d(\mathbf{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$$
The Multi-layer Perceptron
Derivation of SGD Training for MLP
Weight wji can influence the rest of the network only through netj. Apply the chain rule:
$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$$
What about $\frac{\partial E_d}{\partial net_j}$? Two cases to consider, where:
• unit j is an output node of the network
• unit j is an internal node ("hidden unit") of the network
The Multi-layer Perceptron
Case 1: Training rule for output unit weights
netj can influence the network only through oj, so apply the chain rule again:
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$$
Taking the first term and applying the chain rule:
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = -(t_j - o_j)$$
because we only need to consider output k = j.
The Multi-layer Perceptron
Case 1: Training rule for output unit weights
For the second term, note that oj = σ(netj), and recall that ∂oj/∂netj is the derivative of the sigmoid function, that is, σ(netj)(1 − σ(netj)), so
$$\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j(1 - o_j)$$
Substituting the results for both terms into the original expression:
$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j(1 - o_j)$$
The Multi-layer Perceptron
Case 1: Training rule for output unit weights
We can now implement the weight update as:
$$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j(1 - o_j)\, x_{ji}$$
We will use the notation δi to denote the quantity −∂Ed/∂neti for unit i.
The Multi-layer Perceptron
Case 2: Training rule for hidden unit weights
Internal unit j can influence the output only via paths through Downstream(j), i.e., all nodes whose immediate inputs include the output of unit j.
$$\begin{aligned}
\frac{\partial E_d}{\partial net_j} &= \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j} \\
&= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j(1 - o_j)
\end{aligned}$$
The Multi-layer Perceptron
Case 2: Training rule for hidden unit weights
Rearranging terms and using δj to denote −∂Ed/∂netj:
$$\delta_j = o_j(1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$$
and the weight update
$$\Delta w_{ji} = \eta\, \delta_j\, x_{ji}$$
The Multi-layer Perceptron
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until termination condition satisfied, Do
  For each training example ⟨x, t⟩, Do
    1. Input training example ⟨x, t⟩ to the network and compute the network outputs
    2. For each output unit k: δk ← ok(1 − ok)(tk − ok)
    3. For each hidden unit h: δh ← oh(1 − oh) Σk∈outputs wkh δk
    4. Update each network weight wji: wji ← wji + ∆wji, where ∆wji = η δj xji
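A compact NumPy sketch of this algorithm for a single hidden layer of sigmoid units, following the δ equations derived above. The array shapes, variable names, fixed epoch count, and the trick of folding biases in as an extra constant input of 1 per layer are assumptions for illustration, not the course's reference implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, T, n_hidden=3, eta=0.1, epochs=1000, seed=0):
    """Stochastic backpropagation for a one-hidden-layer MLP of sigmoid units.

    X: (n, d) inputs; T: (n, k) targets in [0, 1].
    Returns (W_h, W_o): hidden- and output-layer weight matrices
    (each row holds one unit's weights, first column is its bias weight).
    """
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], T.shape[1]
    W_h = rng.normal(scale=0.05, size=(n_hidden, d + 1))   # small random initial weights
    W_o = rng.normal(scale=0.05, size=(k, n_hidden + 1))

    for _ in range(epochs):
        for x, t in zip(X, T):
            # Forward pass
            x1 = np.append(1.0, x)                  # constant 1 input for the bias
            o_h = sigmoid(W_h @ x1)                 # hidden unit outputs
            o_h1 = np.append(1.0, o_h)              # bias input for the output layer
            o_k = sigmoid(W_o @ o_h1)               # network outputs

            # Backward pass: delta terms
            delta_k = o_k * (1 - o_k) * (t - o_k)                  # output units
            delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)   # hidden units

            # Weight updates: Delta w_ji = eta * delta_j * x_ji
            W_o += eta * np.outer(delta_k, o_h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o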
The Multi-layer Perceptron
More on Backpropagation
A solution for learning highly complex models . . .
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Can learn probabilistic models by maximising likelihood
Minimizes error over all training examples
• Training can take thousands of iterations → slow!
• Using network after training is very fast
The Multi-layer Perceptron
More on Backpropagation
Will converge to a local, not necessarily global, error minimum
• May be many such local minima
• In practice, often works well (can run multiple times)
• Often include weight momentum α (sketched in code after this list):
$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n - 1)$$
• Stochastic gradient descent using "mini-batches"
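A tiny sketch of the momentum idea (names and defaults are illustrative): the previous update is remembered and blended into the current one.

def momentum_step(W, grad_term, velocity, eta=0.1, alpha=0.9):
    """One momentum update: Delta w(n) = eta * grad_term + alpha * Delta w(n-1)."""
    velocity = eta * grad_term + alpha * velocity   # blend in the previous update
    return W + velocity, velocity                   # carry velocity to the next step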
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions possible as training progresses
The Multi-layer Perceptron
More on Backpropagation
Models can be very complex
• Will network generalize well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularize network, making it less likely to overfit
• Add term to error that increases with magnitude of weight vector (see the sketch after this list):
$$E(\mathbf{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$
• Other ways to penalize large weights, e.g., weight decay
• Using ”tied” or shared set of weights, e.g., by setting all weights to their mean after computing the weight updates
• Many other ways . . .
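One way such a weight penalty shows up in code (a sketch, with `gamma` standing in for γ above): the penalty's gradient adds 2γ wji to each weight's gradient, which amounts to shrinking the weights slightly at every update.

def regularised_update(W, grad_E, eta=0.1, gamma=1e-4):
    """Gradient step on the penalised error: data-term gradient plus 2 * gamma * W."""
    return W - eta * (grad_E + 2 * gamma * W)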
The Multi-layer Perceptron
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by network with single hidden layer
• but might require exponential (in number of inputs) hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Being able to approximate any function is one thing, being able to learn it is another …
The Multi-layer Perceptron
How complex should the model be ?
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
John von Neumann
The Multi-layer Perceptron
“Goodness of fit” in ANNs
Can neural networks overfit/underfit ?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note difference between training set and off-training set (validation set) error on both tasks !
Note also that on second task validation set error continues to decrease after an initial increase — any regularisation (network simplification, or weight reduction) strategies need to avoid early stopping (underfitting).
The Multi-layer Perceptron
Overfitting in ANNs
(Figure: error versus number of weight updates, example 1 — training set error keeps decreasing while validation set error eventually begins to rise.)
The Multi-layer Perceptron
Underfitting in ANNs
(Figure: error versus number of weight updates, example 2 — validation set error rises at first but then continues to decrease, so stopping too early would underfit.)
The Multi-layer Perceptron
“Goodness of fit” in ANNs
Moral of the story:
with such complex networks, need to take care in choosing:
• training/validation datasets
• network architecture
• loss function, training rule (optimization)
• regularization method(s)
• hyperparameters
The Multi-layer Perceptron
Neural networks for classification
Sigmoid unit computes output o(x) = σ(w · x). Output ranges from 0 to 1.
Example: binary classification
• predict class 1 if o(x) ≥ 0.5
• predict class 0 otherwise
Questions:
• what error (loss) function should be used?
• how can we train such a classifier?
The Multi-layer Perceptron
Neural networks for classification
Minimizing squared error (as before) does not work so well for classification.
If we take the output o(x) as the probability of the class of x being 1, the preferred loss function is the cross-entropy
$$-\sum_{d \in D} t_d \log o_d + (1 - t_d) \log(1 - o_d)$$
where:
td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, can use gradient ascent with a similar weight update rule as that used to train neural networks by gradient descent – this will yield the maximum likelihood
solution.
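A minimal sketch of this setup for a single sigmoid unit (illustrative names; bias handled as a constant first input). For the cross-entropy, the gradient with respect to the weights reduces to (t − o)x, so the maximum-likelihood update below looks like the delta rule but with the probabilistic interpretation above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(t, o, eps=1e-12):
    """Cross-entropy of labels t in {0, 1} against predicted probabilities o."""
    o = np.clip(o, eps, 1 - eps)              # avoid log(0)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

def train_sigmoid_classifier(X, t, eta=0.1, epochs=200):
    """Maximum-likelihood training of one sigmoid unit (bias as X[:, 0] = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)                    # predicted probabilities
        w = w + eta * (X.T @ (t - o))         # gradient ascent on the log-likelihood
    return w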
The Multi-layer Perceptron
A practical application: Face Recognition
Dataset: 624 images of faces of 20 different people.
• image size 120×128 pixels
• grey-scale, 0–255 pixel value range
• different poses
• different expressions
• wearing sunglasses or not
Raw images compressed to 30×32 pixels, each is mean of 4×4 pixels. MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes.
The Multi-layer Perceptron
Neural Nets for Face Recognition – Task
(Figure: example face images for the four poses: left, straight, right, up.)
Four pose classes: looking left, straight ahead, right or upwards.
Use a 1-of-n encoding: more parameters; can give confidence of prediction. Selected single hidden layer with 3 nodes by experimentation.
The Multi-layer Perceptron
Neural Nets for Face Recognition – after 1 epoch
(Figure: hidden and output unit weights visualised after 1 epoch, for the four outputs: left, straight, right, up.)
The Multi-layer Perceptron
Neural Nets for Face Recognition – after 100 epochs
(Figure: hidden and output unit weights visualised after 100 epochs, for the four outputs: left, straight, right, up.)
The Multi-layer Perceptron
Neural Nets for Face Recognition – Results
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight
Weights from each of 30×32 image pixels into each hidden unit are plotted in position of corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned ?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
The Multi-layer Perceptron
Deep Learning
Y. LeCun et al. (2015). Nature, 521: 436–444.
The Multi-layer Perceptron
Deep Learning
Deep learning is a vast area that has exploded in the last 15 years. Beyond scope of this course to cover in detail.
See: “Deep Learning” I. Goodfellow et al. (2017) – there is an online copy freely available.
Course COMP9444 Neural Networks / Deep Learning
The Multi-layer Perceptron
Deep Learning
Question: How much of what we have seen carries over to deep networks ? Answer: Most of the basic concepts.
We mention some important issues that differ in deep networks.
The Multi-layer Perceptron
Deep Learning: Architectures
Most successful deep networks do not use the fully connected network architecture we outlined above.
Instead, they use more specialised architectures for the application of interest (inductive bias).
Example: Convolutional neural nets (CNNs) have an alternating layer-wise architecture inspired by the brain’s visual cortex. Works well for image processing tasks, but also for applications like text processing.
Example: Long short-term memory (LSTM) networks have recurrent network structure designed to capture long-range dependencies in sequential data, as found, e.g., in natural language (although now often superseded by transformer architectures).
Example: Autoencoders are a kind of unsupervised learning method. They learn a mapping from input examples to the same examples as output via a compressed (lower-dimension) hidden layer, or layers.
The Multi-layer Perceptron
Deep convolutional networks learn features
H. Lee et al. (2009) ICML.
The Multi-layer Perceptron
Deep convolutional networks learn features
M. Zeiler & R. Fergus (2014) ECCV.
The Multi-layer Perceptron
Deep Learning: Activation Functions
Problem: in very large networks, sigmoid activation functions can saturate, i.e., can be driven close to 0 or 1 and then the gradient becomes almost 0 – effectively halts updates and hence learning for those units.
Solution: use activation functions that are non-saturating, e.g., the "Rectified Linear Unit" or ReLU, defined as f(x) = max(0, x).
Problem: sigmoid activation functions are not zero-centred, which can cause gradients, and hence weight updates, to become "non-smooth".
Solution: use zero-centred activation function, e.g., tanh, with range [−1, +1]. Note that tanh is essentially a re-scaled sigmoid.
The derivative of a ReLU is simply
$$\frac{\partial f}{\partial x} = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{otherwise} \end{cases}$$
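A small sketch of these activations and their derivatives in NumPy (tanh itself is built in; the helper names are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_deriv(x):
    return (x > 0).astype(float)      # 0 for x <= 0, 1 otherwise

def tanh_deriv(x):
    return 1.0 - np.tanh(x) ** 2      # derivative of the zero-centred tanh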
The Multi-layer Perceptron
Deep Learning: Regularization
Deep networks can have millions or billions of parameters. Hard to train, prone to overfit.
What techniques can help ?
Example: dropout (a code sketch follows the list below)
• for each unit u in the network, with probability p, “drop” it, i.e., ignore it and its adjacent edges during training
• this will simplify the network and prevent overfitting
• can take longer to converge
• but will be quicker to update on each epoch
• also forces exploration of different sub-networks formed by removing p of the units on any training run
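A sketch of how dropout is often applied to a layer's activations during training; the "inverted dropout" rescaling by 1/(1 − p) is a common variant assumed here, not something specified in the slides.

import numpy as np

def dropout(activations, p=0.5, rng=np.random.default_rng(0), training=True):
    """Randomly zero each unit with probability p during training."""
    if not training:
        return activations                        # no dropout at test time
    mask = rng.random(activations.shape) >= p     # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)         # rescale so the expected value is unchanged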
The Multi-layer Perceptron
Back-propagation and computational graphs
Most deep learning models do not rely on manual derivation of training rules as we did, but rely on automatic differentiation based on computational graphs. See, e.g., Strang (2019).
Summary
• ANNs since 1940s; popular in 1980s, 1990s; recently a revival
• Complex function fitting. Generalise core techniques from machine learning and statistics based on linear models for regression and classification.
• Learning is typically stochastic gradient descent. Networks are too complex to fit otherwise.
• Many open problems remain. How are these networks actually learning? How can they be improved? What are the limits to neural learning?
References
McCulloch, W. and Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5:115–133.
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.