Neural Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2020
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by K. P. Murphy, MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Aims
This lecture will enable you to describe and reproduce machine learning approaches to the problem of neural (network) learning. Following it you should be able to:
• describe Perceptrons and how to train them
• relate neural learning to optimization in machine learning
• outline the problem of neural learning
• derive the method of gradient descent for linear models
• describe the problem of non-linear models with neural networks
• outline the method of back-propagation training of a multi-layer perceptron neural network
• describe the application of neural learning for classification
• outline back-propagation in terms of computational graphs
• describe some issues arising when training deep neural networks
Perceptrons
What are perceptrons?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
(Image source: http://en.wikipedia.org/w/index.php?curid=47541432)
Output o is a thresholded sum of the products of the inputs and their weights:

o(x1, . . . , xn) = +1 if w0 + w1x1 + · · · + wnxn > 0
                    −1 otherwise
Or in vector notation:

o(x) = +1 if w · x > 0
       −1 otherwise
Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
[Figure: two scatter plots of + and − examples in the (x1, x2) plane: (a) a linearly separable set, which a perceptron decision surface can separate; (b) a set (XOR-like) that is not linearly separable.]
Represents some useful functions
• What weights represent o(x1, x2) = AND(x1, x2)?
• What weights represent o(x1, x2) = XOR(x1, x2)? (see the sketch below)
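A minimal sketch of one possible answer (the weights are assumed for illustration, not taken from the slides): with inputs in {0, 1}, the weights w0 = −1.5, w1 = w2 = 1 implement AND, while no choice of weights implements XOR, because its positive and negative examples are not linearly separable.

def perceptron(w0, w1, w2, x1, x2):
    # o(x1, x2) = +1 if w0 + w1*x1 + w2*x2 > 0, else -1 (w0 is the threshold weight)
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# AND: fires only when both inputs are 1, since only then 1 + 1 > 1.5
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(-1.5, 1.0, 1.0, x1, x2))
# prints -1, -1, -1, +1 for the four input pairs, i.e., AND (with -1 read as "false")

# XOR: no single (w0, w1, w2) gives +1 for (0,1) and (1,0) but -1 for (0,0) and (1,1).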
Perceptrons
How to train
Perceptron learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example; then we have yi = +1 and w · xi < t, where t is the decision threshold. We would therefore like to find new weights w′ such that w′ · xi > w · xi, moving the decision boundary towards, and hopefully past, xi.
• This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume it is set to 1). We then have w′ · xi = w · xi + η xi · xi > w · xi, as required.
• Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > t. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′ · xj = w · xj − η xj · xj < w · xj.
• The two cases can be combined in a single update rule: w′ = w + η yi xi
• Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear classification
• The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification

Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).

w ← 0                            // other initialisations of the weight vector are possible
converged ← false
while converged = false do
    converged ← true
    for i = 1 to |D| do
        if yi w · xi ≤ 0 then    // i.e., ŷi ≠ yi
            w ← w + η yi xi
            converged ← false    // we changed w, so we haven't converged yet
        end
    end
end
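The following is a minimal Python sketch of the algorithm above, on an assumed toy data set (not from the slides); a maximum epoch count is added so the loop also terminates on non-separable data.

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # homogeneous coordinates: prepend x0 = 1
    w = np.zeros(X.shape[1])                        # other initialisations are possible
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:             # i.e., example i is misclassified
                w = w + eta * yi * xi
                converged = False
        if converged:
            break
    return w

# Usage on a linearly separable toy set with labels in {-1, +1}:
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(np.hstack([np.ones((len(X), 1)), X]) @ w))   # should match y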
Perceptron training – varying learning rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update. (middle) Increasing the learning rate to η = 0.5 leads in this case to rapid convergence. (right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
The starting point in all three cases was the basic linear classifier.
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
Assume:
Dataset D = {(x1,y1),...,(xn,yn)}
At least one example in D is labelled +1, and one is labelled -1.
R = maxi ||xi||2
A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i yiw∗ · xi ≥ γ
Perceptron Convergence Theorem (Novikoff, 1962)
The number of mistakes made by the perceptron is at most (R/γ)².
γ is typically referred to as the “margin”.
Decision Surface of a Perceptron
Unfortunately, as linear classifiers, perceptrons are limited in expressive power
So some functions are not representable
• e.g., functions whose classes are not linearly separable
For non-linearly separable data we’ll need something else
However, with a fairly minor modification many perceptrons can be combined together to form one model
• multilayer perceptrons, the classic “neural network”
Optimization
Studied in many fields such as engineering, science, economics, . . .

A general optimization algorithm:¹
1 start with initial point x = x0
2 select a search direction p, usually to decrease f(x)
3 select a step length η
4 set s = ηp
5 set x = x + s
6 go to step 2, unless convergence criteria are met

For example, we could minimize a real-valued function f : Rⁿ → R.
Note: convergence criteria will be problem-specific.

¹ B. Ripley (1996) “Pattern Recognition and Neural Networks”, CUP.
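As a concrete (assumed) instance of the six steps, the following sketch minimizes the toy objective f(x) = (x − 3)² by taking the search direction p = −f′(x), which is exactly gradient descent.

def f(x):
    return (x - 3.0) ** 2

def grad_f(x):
    return 2.0 * (x - 3.0)

x = 0.0                      # step 1: initial point x0
eta = 0.1                    # step 3: a fixed step length
for _ in range(100):
    p = -grad_f(x)           # step 2: search direction that decreases f
    s = eta * p              # step 4
    x = x + s                # step 5
    if abs(p) < 1e-6:        # step 6: a simple, problem-specific convergence criterion
        break
print(x)                     # close to 3.0, the minimizer of f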
Usually, we would like the optimization algorithm to quickly reach an answer that is close to being the right one.
• typically, need to minimize a function
  • e.g., error or loss
  • optimization is known as gradient descent or steepest descent
• sometimes, need to maximize a function
  • e.g., probability or likelihood
  • optimization is known as gradient ascent or steepest ascent
Neural Learning
Connectionist Models
Consider humans:
• Neuron switching time ≈ .001 second
• Number of neurons ≈ 10¹⁰
• Connections per neuron ≈ 10⁴–10⁵
• Scene recognition time ≈ .1 second
• 100 inference steps doesn’t seem like enough
→ much parallel computation
Properties of artificial neural nets (ANN’s):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output can be discrete or real-valued
• Output can be a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech recognition (now the standard method)
• Image classification (also now the standard method)
• many others . . .
Training a Linear Unit by Gradient Descent
Gradient Descent
To understand gradient descent, consider a simpler linear unit, where

o = w0 + w1x1 + · · · + wnxn

Let's learn the wi's that minimize the squared error

E[w] ≡ (1/2) Σd∈D (td − od)²

where D is the set of training examples.
[Figure: the error surface E[w] plotted as a function of the weights w0 and w1: a smooth surface with a single global minimum.]
Gradient

∇E[w] ≡ [ ∂E/∂w0 , ∂E/∂w1 , · · · , ∂E/∂wn ]

The gradient vector gives the direction of steepest increase in the error E; the negative of the gradient, i.e., steepest decrease, is what we want.

Training rule:

∆w = −η∇E[w]

i.e.,

∆wi = −η ∂E/∂wi
∂E/∂wi = ∂/∂wi (1/2) Σd (td − od)²
       = (1/2) Σd ∂/∂wi (td − od)²
       = (1/2) Σd 2 (td − od) ∂/∂wi (td − od)
       = Σd (td − od) ∂/∂wi (td − w · xd)

∂E/∂wi = Σd (td − od)(−xi,d)
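A quick numerical sanity check of the final formula, on assumed toy numbers: the analytic gradient Σd (td − od)(−xi,d) should agree with a finite-difference estimate of ∂E/∂wi.

import numpy as np

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # assumed inputs, one row per example d
t = np.array([1.0, -1.0, 0.5])                        # assumed targets td
w = np.array([0.1, -0.2])

def E(w):
    o = X @ w                                         # linear unit outputs od = w . xd
    return 0.5 * np.sum((t - o) ** 2)

grad = -(t - X @ w) @ X                               # sum over d of (td - od)(-xi,d)

eps = 1e-6                                            # finite-difference check of dE/dw0
fd = (E(w + np.array([eps, 0.0])) - E(w - np.array([eps, 0.0]))) / (2 * eps)
print(grad[0], fd)                                    # the two values agree closely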
Gradient-Descent(training examples, η)

Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).

Initialize each wi to some small random value
Until the termination condition is met, Do
    Initialize each ∆wi to zero
    For each ⟨x, t⟩ in training examples, Do
        Input the instance x to the unit and compute the output o
        For each linear unit weight wi
            ∆wi ← ∆wi + η(t − o)xi
    For each linear unit weight wi
        wi ← wi + ∆wi
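A minimal Python sketch of this batch procedure for a linear unit, on assumed toy data (a smaller learning rate than the 0.05 suggested above is used here to keep this particular run stable, since the updates are summed over all examples):

import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])     # small random initial weights
    for _ in range(epochs):                         # termination: a fixed epoch budget
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                        # linear unit output
            delta_w += eta * (target - o) * x       # accumulate the update over the batch
        w += delta_w                                # apply once per pass over the data
    return w

# Usage: recover known weights from noiseless targets (x0 = 1 is the input for the bias weight w0)
rng = np.random.default_rng(1)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
t = X @ np.array([1.0, 2.0, -3.0])
print(gradient_descent(X, t))                       # approaches [1, 2, -3]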
Training Perceptron vs. Linear unit
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent: Do until satisfied
• Compute the gradient ∇ED[w]
• w ← w − η∇ED[w]

Incremental mode Gradient Descent: Do until satisfied
• For each training example d in D
  • Compute the gradient ∇Ed[w]
  • w ← w − η∇Ed[w]
Batch:         ED[w] ≡ (1/2) Σd∈D (td − od)²

Incremental:   Ed[w] ≡ (1/2) (td − od)²
Incremental or Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if η is made small enough.
Very useful for training large networks, or for online learning from data streams.
“Stochastic” implies examples should be selected at random.
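A minimal sketch of the incremental variant, reusing the same kind of assumed toy data as the batch example above: the only change is that the weights are updated after every example, visited in random order.

import numpy as np

def sgd(X, t, eta=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):           # "stochastic": random example order
            o = np.dot(w, X[i])
            w += eta * (t[i] - o) * X[i]            # update from this single example
    return w

rng = np.random.default_rng(1)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
t = X @ np.array([1.0, 2.0, -3.0])
print(sgd(X, t))                                    # also approaches [1, 2, -3]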
The Multi-layer Perceptron
Multilayer Networks of Sigmoid Units
ALVINN drives 70 mph on highways
ALVINN
[Figure: the ALVINN network: a 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units encoding the steering direction, from Sharp Left through Straight Ahead to Sharp Right.]
A Multilayer Perceptron for Speech Recognition: Model
[Figure: a network whose inputs are the two formant frequencies F1 and F2 of a spoken vowel and whose outputs are vowel classes, e.g., the vowels in “head”, “hid”, “who’d”, “hood”, . . .]
A Multilayer Perceptron for Speech Recognition: Decision Boundaries
Sigmoid Unit
Same as a perceptron except that the step function has been replaced by a smoothed version, a sigmoid function.
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
Why use the sigmoid function σ(x) = 1/(1 + e⁻ˣ)?

Nice property: dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation

Start by assuming we want to minimize the squared error (1/2) Σd∈D (td − od)² over a set of training examples D.
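A quick numerical check of the derivative property, on assumed toy values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite differences
analytic = sigmoid(x) * (1 - sigmoid(x))                      # σ(x)(1 − σ(x))
print(np.max(np.abs(numeric - analytic)))                     # very small: the two agree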
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi (1/2) Σd∈D (td − od)²
       = (1/2) Σd ∂/∂wi (td − od)²
       = (1/2) Σd 2 (td − od) ∂/∂wi (td − od)
       = Σd (td − od) ( −∂od/∂wi )
       = − Σd (td − od) (∂od/∂netd)(∂netd/∂wi)
But we know:

∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)

∂netd/∂wi = ∂(w · xd)/∂wi = xi,d

So:

∂E/∂wi = − Σd∈D (td − od) od(1 − od) xi,d
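A minimal sketch of gradient descent for a single sigmoid unit using this formula, on an assumed, linearly separable toy problem with targets in {0, 1}:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])   # x0 = 1 for the bias weight
t = (X[:, 1] + X[:, 2] > 0).astype(float)                       # assumed targets in {0, 1}

w = np.zeros(3)
eta = 0.05
for _ in range(2000):
    o = sigmoid(X @ w)
    grad = -((t - o) * o * (1 - o)) @ X      # the formula above, summed over d
    w -= eta * grad                          # step against the gradient
print(np.mean((sigmoid(X @ w) > 0.5) == (t > 0.5)))   # training accuracy, close to 1.0 here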
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
    For each training example, Do
        Input the training example to the network and compute the network outputs
        For each output unit k
            δk ← ok(1 − ok)(tk − ok)
        For each hidden unit h
            δh ← oh(1 − oh) Σk∈outputs wkh δk
        Update each network weight wji:
            wji ← wji + ∆wji  where  ∆wji = η δj xji
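A minimal Python sketch of the algorithm for one hidden layer of sigmoid units and a single sigmoid output, trained on XOR (an assumed toy task, with assumed initialisation and learning rate; as discussed below, backpropagation can occasionally stall in a local minimum, in which case a different seed helps).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1
W_hid = rng.normal(scale=0.5, size=(n_hidden, n_in + 1))      # extra column for the bias input
W_out = rng.normal(scale=0.5, size=(n_out, n_hidden + 1))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets
eta = 0.5

for _ in range(10000):
    for x, t in zip(X, T):
        # forward pass
        x1 = np.append(x, 1.0)                                # hidden-layer input, with bias term
        o_hid = sigmoid(W_hid @ x1)
        x2 = np.append(o_hid, 1.0)                            # output-layer input, with bias term
        o_out = sigmoid(W_out @ x2)
        # backward pass: the delta terms of the algorithm above
        delta_out = o_out * (1 - o_out) * (t - o_out)
        delta_hid = o_hid * (1 - o_hid) * (W_out[:, :n_hidden].T @ delta_out)
        # weight updates: wji <- wji + eta * delta_j * xji
        W_out += eta * np.outer(delta_out, x2)
        W_hid += eta * np.outer(delta_hid, x1)

for x in X:
    h = sigmoid(W_hid @ np.append(x, 1.0))
    print(x, sigmoid(W_out @ np.append(h, 1.0)))              # outputs approach 0, 1, 1, 0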
More on Backpropagation
A solution for learning highly complex models . . .
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Can learn probabilistic models by maximising likelihood
Minimizes error over all training examples
• Training can take thousands of iterations → slow!
• Using the network after training is very fast
Will converge to a local, not necessarily global, error minimum
• May be many such local minima
• In practice, often works well (can run multiple times)
• Often include weight momentum α (sketched below):
      ∆wji(n) = η δj xji + α ∆wji(n − 1)
• Stochastic gradient descent using “mini-batches”
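A minimal sketch of the momentum update on a toy one-dimensional quadratic error (assumed example; in backpropagation the gradient would come from the δ terms above):

def grad(w):
    return w                         # gradient of the toy error 0.5 * w**2

w, velocity = 5.0, 0.0
eta, alpha = 0.1, 0.9                # learning rate and momentum term
for _ in range(300):
    velocity = -eta * grad(w) + alpha * velocity    # ∆w(n) = −η∇E + α ∆w(n−1)
    w = w + velocity
print(w)                             # close to 0, the minimum of the toy error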
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions possible as training progresses
Models can be very complex
• Will network generalize well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularize network, making it less likely to overfit
• Add a term to the error that increases with the magnitude of the weight vector (see the sketch after this list):

      E(w) ≡ (1/2) Σd∈D Σk∈outputs (tkd − okd)² + γ Σi,j wji²
• Other ways to penalize large weights, e.g., weight decay
• Using a “tied” or shared set of weights, e.g., by setting all weights to their mean after computing the weight updates
• Many other ways . . .
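A minimal sketch of the effect of the γ Σ wji² penalty (assumed names and values): its contribution to the gradient is 2γw, which shrinks every weight a little at each update, which is why it is closely related to weight decay.

import numpy as np

def penalised_update(w, grad_E, eta=0.1, gamma=0.01):
    # gradient of E(w) + gamma * sum(w**2) is grad_E + 2 * gamma * w
    return w - eta * (grad_E + 2.0 * gamma * w)

w = np.array([2.0, -3.0])
print(penalised_update(w, grad_E=np.zeros(2)))   # even with zero error gradient, weights shrink towards 0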
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by network with single hidden layer
• but might require exponential (in number of inputs) hidden units

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Being able to approximate any function is one thing, being able to learn it is another ...
How complex should the model be?
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
John von Neumann
“Goodness of fit” in ANNs
Can neural networks overfit/underfit?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks!
Note also that on the second task the validation set error continues to decrease after an initial increase; any regularisation strategy (network simplification, weight reduction, or early stopping) needs to avoid halting learning too soon (underfitting).
Overfitting in ANNs
[Figure: “Error versus weight updates (example 1)”: training set error and validation set error plotted against the number of weight updates; the validation error eventually begins to rise (overfitting).]
Underfitting in ANNs
[Figure: “Error versus weight updates (example 2)”: training set error and validation set error plotted against the number of weight updates; the validation error rises briefly and then continues to fall.]
Neural networks for classification
Sigmoid unit computes output o(x) = σ(w · x)
The output ranges from 0 to 1
Example: binary classification: predict class 1 if o(x) ≥ 0.5, and predict class 0 otherwise.
Questions:
• what error (loss) function should be used?
• how can we train such a classifier?
Minimizing squared error (as before) does not work so well for classification.

If we take the output o(x) as the probability of the class of x being 1, the preferred loss function is the cross-entropy

− Σd∈D [ td log od + (1 − td) log(1 − od) ]
where:
td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, we can use gradient ascent with a weight update rule similar to the one used to train neural networks by gradient descent; this will yield the maximum likelihood solution.
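A minimal sketch (assumed toy values) of the cross-entropy for a sigmoid unit. For this loss the gradient with respect to the weights works out to −Σd (td − od) xi,d, so gradient descent on it (equivalently, gradient ascent on the log-likelihood) uses an update of the same form as before but without the od(1 − od) factor.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(t, o):
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # assumed inputs; x0 = 1 acts as the bias input
t = np.array([1.0, 0.0, 1.0])                         # class labels td in {0, 1}
w = np.array([0.1, 0.3])

o = sigmoid(X @ w)
print(cross_entropy(t, o))          # the loss at the current weights
print(-(t - o) @ X)                 # its gradient w.r.t. w; step against this to train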
A practical application: Face Recognition
Dataset: 624 images of faces of 20 different people.
• image size 120x128 pixels
• grey-scale, 0–255 pixel value range
• different poses
• different expressions
• wearing sunglasses or not
Raw images compressed to 30x32 pixels, each the mean of a 4x4 pixel block.
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes.
Neural Nets for Face Recognition - Task
Four pose classes: looking left, straight ahead, right or upwards.
Use a 1-of-n output encoding: more parameters, but can give the confidence of a prediction.
A single hidden layer with 3 nodes was selected by experimentation.
Neural Nets for Face Recognition - after 1 epoch
[Figure: the learned network weights after 1 epoch of training, with example images labelled left, straight, right, up.]
Neural Nets for Face Recognition - after 100 epochs
[Figure: the learned network weights after 100 epochs of training, with example images labelled left, straight, right, up.]
Neural Nets for Face Recognition - Results
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight
Weights from each of 30x32 image pixels into each hidden unit are plotted in position of corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
Deep Learning
Y. LeCun et al. (2015) Nature 521, 436–444.
Deep learning is a vast area that has exploded in the last 15 years; it is beyond the scope of this course to cover it in detail.
See “Deep Learning” by I. Goodfellow et al. (2017); an online copy is freely available.
Course COMP9444 Neural Networks (next semester).
Question: how much of what we have seen carries over to deep networks?
Answer: most of the basic concepts.
We mention some important issues that differ in deep networks.
Deep Learning: Architectures
Most successful deep networks do not use the fully connected network architecture we outlined above.
Instead, they use more specialised architectures for the application of interest (inductive bias).
Example: Convolutional neural nets (CNNs) have an alternating layer-wise architecture inspired by the brain’s visual cortex. Works well for image processing tasks, but also for applications like text processing.
Example: Long short-term memory (LSTM) networks have recurrent network structure designed to capture long-range dependencies in sequential data, as found, e.g., in natural language (although now often superseded by transformer architectures).
Example: Autoencoders are a kind of unsupervised learning method. They learn a mapping from input examples to the same examples as output, via a compressed (lower-dimensional) hidden layer, or layers.
Deep convolutional networks learn features
H. Lee et al. (2009) ICML.
M. Zeiler & R. Fergus (2014) ECCV.
Deep Learning: Activation Functions
Problem: in very large networks, sigmoid activation functions can saturate, i.e., they can be driven close to 0 or 1, where the gradient becomes almost 0; this effectively halts updates, and hence learning, for those units.
Solution: use activation functions that are non-saturating, e.g., the “Rectified Linear Unit” or ReLU, defined as f(x) = max(0, x).
Problem: sigmoid activation functions are not zero-centred, which can cause gradients, and hence weight updates, to become “non-smooth”.
Solution: use zero-centred activation function, e.g., tanh, with range [−1, +1]. Note that tanh is essentially a re-scaled sigmoid.
The derivative of a ReLU is simply

∂f/∂x = 0 if x ≤ 0
        1 otherwise
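A minimal sketch of the ReLU and its piecewise derivative, vectorised with NumPy (assumed toy values):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)       # 0 for x <= 0, 1 otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                         # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))                    # [0. 0. 0. 1. 1.]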
Deep Learning: Regularization
Deep networks can have millions or billions of parameters. They are hard to train and prone to overfit.
What techniques can help?
Example: dropout (see the sketch after this list)
• for each unit u in the network, with probability p, “drop” it, i.e., ignore it and its adjacent edges during training
• this will simplify the network and prevent overfitting
• can take longer to converge
• but will be quicker to update on each epoch
• also forces exploration of different sub-networks formed by removing p of the units on any training run
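A minimal sketch of one common variant (“inverted” dropout, an assumption beyond the slide): each unit's activation is dropped with probability p during training, and the survivors are rescaled by 1/(1 − p) so that nothing needs to change at test time.

import numpy as np

def dropout(activations, p, rng):
    # drop each unit with probability p; rescale survivors by 1/(1 - p)
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask

rng = np.random.default_rng(0)
h = np.ones(10)                        # stand-in for a layer's hidden activations
print(dropout(h, p=0.5, rng=rng))      # roughly half the units zeroed, the rest scaled to 2.0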
Back-propagation and computational graphs
See the accompanying hand-written notes, based on Strang (2019).
Summary
• ANNs since 1940s; popular in 1980s, 1990s; recently a revival
• Complex function fitting: generalise core techniques from machine learning and statistics based on linear models for regression and classification
• Learning is typically stochastic gradient descent; networks are too complex to fit otherwise
• Many open problems remain: how are these networks actually learning? How can they be improved? What are the limits to neural learning?
References
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley - Cambridge Press.