Neural Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain
Aims
This lecture will develop your understanding of Neural Network Learning and extend it to Deep Learning:
– describe Perceptrons and how to train them
– relate neural learning to optimization in machine learning
– outline the problem of neural learning
– derive the Gradient Descent for linear models
– describe the problem of non-linear models with neural networks
– outline the method of back-propagation training of a multi-layer network
– understand Convolutional Neural Networks (CNNs)
– understand the main differences between CNNs and regular NNs
– know the basics of training CNNs
COMP9417 T1, 2021 1
Artificial Neural Networks
Artificial Neural Networks are inspired by the human nervous system
NNs are composed of a large number of interconnected processing elements known as neurons
They use supervised error-correcting rules with back-propagation to learn a specific task
http://statsmaths.github.io/stat665/lectures/lec12/lecture12.pdf
COMP9417 T1, 2021 2
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron – a simplified neuron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
COMP9417 T1, 2021 3
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 T1, 2021 4
Perceptron
Each neuron has multiple dendrites and a single axon. The neuron receives its inputs from its dendrites and transmits its output through its axon. Both inputs and outputs take the form of electrical impulses. The neuron sums up its inputs, and if the total electrical impulse strength exceeds the neuron’s firing threshold, the neuron fires off a new impulse along its single axon. The axon, in turn, distributes the signal along its branching synapses which collectively reach thousands of neighboring neurons.
https://towardsdatascience.com/from-fiction-to-reality-a-beginners-guide-to-artificial-neural-networks-d0411777571b
COMP9417 T1, 2021 5
Perceptron
Output o is the thresholded sum of products of inputs and their weights:
o(x1, ..., xn) = 1 if w0 + w1x1 + ⋯ + wnxn > 0, and −1 otherwise
COMP9417 T1, 2021 6
Perceptron
Or in vector notation:
o(x) = sgn(w·x), where sgn(y) = 1 if y > 0 and −1 otherwise
COMP9417 T1, 2021 7
Decision Surface of a Perceptron
• Perceptron is able to represent some useful functions which are linearly separable (a)
• But functions that are not linearly separable are not representable (e.g. (b) XOR)
COMP9417 T1, 2021 8
Perceptron Learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
COMP9417 T1, 2021 9
Perceptron Learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
§ Let xi be a misclassified positive example, then we have yi = +1 and w·xi < 0. We therefore want to find w' such that w'·xi > w·xi, which moves the decision boundary towards and hopefully past xi.
§ This can be achieved by calculating the new weight vector as w' = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w'·xi = w·xi + ηxi·xi > w·xi, as required.
§ Similarly, if xj is a misclassified negative example, then we have yj = −1 and w·xj > 0. In this case we calculate the new weight vector as w' = w − ηxj, and thus w'·xj = w·xj − ηxj·xj < w·xj.
COMP9417 T1, 2021 10
Perceptron Learning
§ The two cases can be combined in a single update rule: w' = w + η yi xi
§ Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
§ This is the basis of the perceptron training algorithm for linear classification
§ The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
§ If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
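As a concrete illustration (a sketch, not the lecture's code), the update rule above can be implemented in a few lines of Python; labels are assumed to be in {−1, +1} and the bias is folded into the weight vector:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Perceptron training: X is (n, d), y contains labels in {-1, +1}."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # fold the bias into the weights
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # misclassified example
                w += eta * yi * xi             # w' = w + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                      # converged (linearly separable data)
            break
    return w

# Toy example: the AND function with labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w = perceptron_train(X, y)
print(w, np.sign(X @ w[:2] + w[2]))   # all predictions match y at convergence
```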
COMP9417 T1, 2021 11
Training Perceptron
COMP9417 T1, 2021 12
Perceptron Learning Rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update.
(middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence.
(right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
COMP9417 T1, 2021 13
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
COMP9417 T1, 2021 14
Perceptron Convergence
Dataset D = {(x1, y1), . . . , (xn, yn)}
At least one example in D is labelled +1, and one is labelled −1. A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i: yi w∗·xi ≥ γ
γ is typically referred to as the “margin”
COMP9417 T1, 2021 15
Perceptron Convergence
Perceptron Convergence Theorem (Novikoff, 1962)
Let R = maxi ║xi║. The number of mistakes made by the perceptron is at most (R/γ)².
COMP9417 T1, 2021 16
Decision Surface of a Perceptron
§ Unfortunately, as a linear classifier perceptrons are limited in expressive power
§ So some functions not representable, e.g., not linearly separable
§ For non-linearly separable data we’ll need something else
§ However, with a relatively minor modification many perceptrons can be combined together to form one model
§ multilayer perceptrons, the classic “neural network”
COMP9417 T1, 2021 17
Optimisation
Studied in many fields such as engineering, science, economics, . . .
A general optimisation algorithm:¹
1) start with initial point x = x0
2) select a search direction p, usually to decrease f (x)
3) select a step length η
4) set s = ηp
5) set x = x + s
6) go to step 2, unless convergence criteria are met
For example, we could minimise a real-valued function f.
Note: convergence criteria will be problem-specific.
¹ B. Ripley (1996) “Pattern Recognition and Neural Networks”, CUP.
COMP9417 T1, 2021 18
Optimisation
Usually, we would like the optimisation algorithm to quickly reach an answer that is close to being the right one.
§ typically, we need to minimise a function
§ e.g., error or loss
§ optimisation is known as gradient descent or steepest descent
§ sometimes, we need to maximise a function
§ e.g., probability or likelihood
§ optimisation is known as gradient ascent or steepest ascent
COMP9417 T1, 2021 19
Gradient Descent
To understand, consider the simple linear unit, where the output is
o = w0 + w1x1 + ⋯ + wnxn
Let's learn wi that minimise the squared error
E[w] = 1/2 Σd∈D (td − od)²
where D is the set of training examples, td is the target output and od is the unit's output for training example d
COMP9417 T1, 2021 20
Gradient Descent
COMP9417 T1, 2021 21
Gradient Descent
Gradient:
∇E[w] = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
Gradient vector gives direction of steepest increase in error E. Negative of the gradient, i.e., steepest decrease, is what we want.
Training rule: ∆w = −η∇E[w]
i.e., ∆wi = −η ∂E/∂wi = η Σd∈D (td − od) xid
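As an illustration (not the lecture's code), a minimal batch gradient descent loop for a linear unit under squared error, implementing the update rule above; the toy data and learning rate are my own choices:

```python
import numpy as np

def gradient_descent_linear(X, t, eta=0.01, epochs=500):
    """Batch gradient descent for a linear unit minimising squared error."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias weight w0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for all training examples
        grad = -(t - o) @ X            # dE/dw for E = 0.5 * sum((t - o)^2)
        w -= eta * grad                # w <- w - eta * gradient
    return w

# Fit a noisy line t = 2x + 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
t = 2 * X[:, 0] + 1 + 0.05 * rng.standard_normal(100)
print(gradient_descent_linear(X, t))   # approximately [2.0, 1.0]
```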
COMP9417 T1, 2021 22
Gradient Descent
COMP9417 T1, 2021 23
Gradient Descent
COMP9417 T1, 2021 24
Perceptron vs. Linear Unit
Perceptron training rule guaranteed to succeed if
§ Training examples are linearly separable
§ Sufficiently small learning rate η
Linear unit training rule uses gradient descent
§ Guaranteed to converge to hypothesis with minimum squared error
§ Given sufficiently small learning rate η
§ Even when training data contains noise
§ Even when training data not separable by H
COMP9417 T1, 2021 25
Stochastic (Incremental) Gradient Descent
Batch mode Gradient Descent: compute the gradient over the entire training set D before updating the weights:
ED[w] = 1/2 Σd∈D (td − od)², w ← w − η∇ED[w]
Stochastic (incremental) mode Gradient Descent: update the weights after each training example d:
Ed[w] = 1/2 (td − od)², w ← w − η∇Ed[w]
COMP9417 T1, 2021 26
Stochastic Gradient Descent
§ Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if η is made small enough
§ Very useful for training large networks, or online learning from data streams
§ Stochastic implies examples should be selected at random
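A stochastic variant of the previous sketch, performing one update per randomly chosen training example (illustrative only; data and step size are my own choices):

```python
import numpy as np

def sgd_linear(X, t, eta=0.05, epochs=50, seed=0):
    """Stochastic (incremental) gradient descent for a linear unit."""
    rng = np.random.default_rng(seed)
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # bias folded into the weights
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit examples in random order
            o = X[i] @ w
            w += eta * (t[i] - o) * X[i]           # update from this single example only
    return w

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
t = 3 * X[:, 0] - 0.5 + 0.1 * rng.standard_normal(200)
print(sgd_linear(X, t))    # close to [3.0, -0.5]
```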
COMP9417 T1, 2021 27
Multilayer Networks
COMP9417 T1, 2021 28
Multilayer Networks
§ Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult
§ A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next (with activation feeding forward)
§ The weights determine the function computed.
§ Given an arbitrary number of hidden units, any boolean function can be
computed with a single hidden layer
COMP9417 T1, 2021 29
General Structure of ANN
COMP9417 T1, 2021 30
General Structure of ANN
Properties of Artificial Neural Networks (ANNs):
§ Many neuron-like threshold switching units
§ Many weighted interconnections among units
§ Highly parallel, distributed processing
§ Emphasis on tuning weights automatically
Artificial Neural Network (Source: VIASAT)
COMP9417 T1, 2021 31
When to Consider Neural Networks
§ Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
§ Output can be discrete or real-valued
§ Output can be a vector of values
§ Possibly noisy data
§ Form of target function is unknown
§ Interpretability of result is not important
Examples:
§ Speech recognition
§ Image classification
§ many others . . .
COMP9417 T1, 2021 32
ALVINN drives 70 mph on highways
COMP9417 T1, 2021 33
ALVINN
COMP9417 T1, 2021 34
MLP Speech Recognition
Decision Boundaries
COMP9417 T1, 2021 35
Sigmoid Unit
COMP9417 T1, 2021 36
Sigmoid Unit
§ Same as a perceptron except that the step function has been replaced by a nonlinear sigmoid function.
§ Nonlinearity makes it easier for the model to generalise or adapt to a variety of data and to differentiate between the outputs.
COMP9417 T1, 2021 37
Sigmoid Unit
COMP9417 T1, 2021 38
Sigmoid Unit
Why use the sigmoid function σ(x) = 1/(1 + e^(−x)) ?
Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
We can derive gradient descent rules to train
•One sigmoid unit
•Multilayer networks of sigmoid units → Backpropagation
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
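A quick numerical check of this derivative property, as an illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # dσ/dx = σ(x)(1 − σ(x))

# Compare the analytic derivative with a finite-difference estimate
x = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid_grad(x), atol=1e-6))   # True
```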
COMP9417 T1, 2021 39
Error Gradient of Sigmoid Unit
Start by assuming we want to minimise the squared error E[w] = 1/2 Σd∈D (td − od)² over a set of training examples D.
COMP9417 T1, 2021 40
Error Gradient of Sigmoid Unit
We know: od = σ(netd) where netd = w·xd, and dσ(netd)/dnetd = od(1 − od)
So: ∂E/∂wi = −Σd∈D (td − od) od (1 − od) xid
COMP9417 T1, 2021 41
Backpropagation Algorithm
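The algorithm itself appeared as pseudocode on the original slide. As an illustration only (not the slide's pseudocode), here is a minimal numpy sketch of backpropagation for a network with one hidden layer of sigmoid units, trained by stochastic gradient descent on squared error; names such as backprop_train and n_hidden are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.1, epochs=5000, seed=0):
    """One-hidden-layer network of sigmoid units trained with backpropagation."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.normal(0, 0.1, size=(n_in + 1, n_hidden))   # +1 row for bias weights
    W2 = rng.normal(0, 0.1, size=(n_hidden + 1, n_out))
    for _ in range(epochs):
        for x, t in zip(X, T):
            # forward pass
            x1 = np.append(x, 1.0)
            h = sigmoid(x1 @ W1)
            h1 = np.append(h, 1.0)
            o = sigmoid(h1 @ W2)
            # backward pass: "error" terms (deltas) for output and hidden units
            delta_o = o * (1 - o) * (t - o)
            delta_h = h * (1 - h) * (W2[:-1] @ delta_o)
            # gradient-descent weight updates
            W2 += eta * np.outer(h1, delta_o)
            W1 += eta * np.outer(x1, delta_h)
    return W1, W2

# XOR: not linearly separable, but learnable with a hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)
W1, W2 = backprop_train(X, T)
```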
COMP9417 T1, 2021 42
More on Backpropagation
A solution for learning highly complex models . . .
§ Gradient descent over entire network weight vector
§ Easily generalised to arbitrary directed graphs
§ Can learn probabilistic models by maximising likelihood
Minimises error over all training examples
§ Training can take thousands of iterations → slow!
§ Using network after training is very fast
COMP9417 T1, 2021 43
More on Backpropagation
Will converge to a local, not necessarily global, error minimum
§ There might exist many such local minima
§ In practice, often works well (can run multiple times)
§ Often include weight momentum α
∆wji(n) = η δj xji + α ∆wji(n − 1)
§ Stochastic gradient descent using “mini-batches”
Nature of convergence
§ Initialise weights near zero
§ Therefore, initial networks near-linear
§ Increasingly non-linear functions become possible as training progresses
COMP9417 T1, 2021 44
More on Backpropagation
Models can be very complex
§ Will network generalise well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularise network, making it less likely to overfit
§ Add term to error that increases with magnitude of weight vector
§ Other ways to penalise large weights, e.g., weight decay
§ Using "tied" or shared set of weights, e.g., by setting all weights to
their mean after computing the weight updates
§ Many other ways . . .
COMP9417 T1, 2021 45
Expressive Capabilities of ANNs
Boolean functions:
§ Every Boolean function can be represented by a network with single
hidden layer
§ but might require exponential (in number of inputs) hidden units
Continuous functions:
§ Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
§ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Being able to approximate any function is one thing, being able to learn it is another . . .
COMP9417 T1, 2021 46
How complex should the model be?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
John von Neumann
COMP9417 T1, 2021 47
“Goodness of fit” in ANNs
Can neural networks overfit/underfit ?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks!
Note also that on the second task the validation set error continues to decrease after an initial increase, so any regularisation (network simplification, or weight reduction) strategy needs to avoid stopping training too early (underfitting).
COMP9417 T1, 2021 48
Overfitting in ANNs
COMP9417 T1, 2021 49
Underfitting in ANNs
COMP9417 T1, 2021 50
Neural Networks for Classification
Sigmoid unit computes output o(x) = σ(w·x)
Output ranges from 0 to 1
Example: binary classification
Questions:
§ what error (loss) function should be used?
§ how can we train such a classifier?
COMP9417 T1, 2021 51
Neural Networks for Classification
Minimizing squared error (as before) does not work so well for classification
If we take the output o(x) as the probability of the class of x being 1, the preferred loss function is the cross-entropy:
E = − Σd∈D [ td ln(od) + (1 − td) ln(1 − od) ]
where:
td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, one can use gradient descent and backpropagation algorithm – this will yield the maximum likelihood solution.
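As an illustrative sketch (not the lecture's code), the cross-entropy and its gradient for a single sigmoid unit can be computed as below; a convenient fact is that the gradient with respect to the weights reduces to X^T(o − t):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(o, t, eps=1e-12):
    """E = -sum(t*ln(o) + (1-t)*ln(1-o)); eps avoids log(0)."""
    o = np.clip(o, eps, 1 - eps)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

def grad_cross_entropy(X, t, w):
    """Gradient of the cross-entropy w.r.t. w for a sigmoid unit o = sigmoid(X @ w)."""
    o = sigmoid(X @ w)
    return X.T @ (o - t)

# One gradient-descent step on a toy binary problem
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
t = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
w -= 0.1 * grad_cross_entropy(X, t, w)
print(cross_entropy(sigmoid(X @ w), t))
```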
COMP9417 T1, 2021 52
Application: Face Pose Recognition
Dataset: 624 images of faces of 20 different people
§ image size 120x128 pixels
§ grey-scale, 0-255 intensity value range
§ different poses
§ different expressions
§ wearing sunglasses or not
Raw images compressed to 30x32 pixels
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes
COMP9417 T1, 2021 53
Application: Face Pose Recognition
left straight right up
Four pose classes: looking left, straight ahead, right or upwards
Use a 1-of-n encoding: more parameters; can give confidence of prediction
Selected single hidden layer with 3 nodes by experimentation
COMP9417 T1, 2021 54
Application: Face Pose Recognition After 1 epoch
left straight right up
COMP9417 T1, 2021 55
Application: Face Pose Recognition After 100 epochs
left straight right up
COMP9417 T1, 2021 56
Application: Face Recognition
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight
Weights from each of the 30x32 image pixels into each hidden unit are plotted in the position of the corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
COMP9417 T1, 2021 57
Deep Learning: Convolutional Neural Networks
COMP9417 T1, 2021 58
A Bit of History
§ Earliest studies about visual mechanics of animals emphasised the importance of edge detection for solving Computer Vision problems
§ Early processing in cat visual cortex looks like it is performing convolutions that are looking for oriented edges and blobs
§ Certain cells are looking for edges with a particular orientation at a particular spatial location in the visual field
§ This inspired convolutional neural networks but was limited by the lack of computational power
§ Hinton et al. reinvigorated research into deep learning and proposed a greedy layer-wise training technique
§ In 2012, Alex Krizhevsky et al. won the ImageNet challenge and proposed the well-recognised AlexNet Convolutional Neural Network
COMP9417 T1, 2021 59
A Bit of History
• LeNet-5
• One of the earliest CNNs
• Handwritten digit recognition
COMP9417 T1, 2021 60
A Bit of History
§ AlexNet (2012)
§ ImageNet classification
§ Images showing 1000 object categories
COMP9417 T1, 2021 62
Deep Learning
§ Deep learning is a collection of artificial neural network techniques that are widely used at present
§ Predominantly, deep learning techniques rely on large amounts of data and deeper learning architectures
§ Some well-known paradigms:
§ Convolutional Neural Networks (CNNs)
§ Recurrent Neural Networks
§ Auto-encoders
§ Restricted Boltzmann Machines
COMP9417 T1, 2021 64
CNNs
§ CNNs are very similar to regular Neural Networks
§ Made up of neurons with learnable weights
§ CNN architecture assumes that inputs are images
§ So that we have local features
§ Which allows us to
§ encode certain properties in the architecture that makes the forward pass more efficient and
§ significantly reduces the number of parameters needed for the network
COMP9417 T1, 2021 65
Why CNNs?
The problem with regular NNs is that they do not scale well with dimensions (i.e. larger images)
§ Eg: 32x32 image with 3 channels (RGB) – a neuron in first hidden layer would have 32x32x3 = 3,072 weights : manageable.
§ Eg: 200x200 image with 3 channels – a neuron in first hidden layer would have 200x200x3 = 120,000 weights and we need at least several of these neurons which makes the weights explode.
COMP9417 T1, 2021 66
What is different?
§ In contrast, CNNs consider 3-D volumes of neurons and propose a parameter sharing scheme that minimises the number of parameters required by the network.
§ CNN neurons are arranged in 3 dimensions: Width, Height and Depth.
§ Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected)
COMP9417 T1, 2021 67
What is different?
NN
CNN
COMP9417 T1, 2021 68
CNN architecture
Main layers:
§ Convolutional
§ Pooling
§ ReLU
§ Fully-connected
§ Drop-out
§ Output layers
COMP9417 T1, 2021 69
Convolutional Layer
Suppose we want to classify an image as a bird, sunset, dog, cat, etc.
If we can identify features such as feather, eye, or beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.
We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.
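To make the weight-sharing idea concrete, here is a naive sketch of what one convolutional filter computes when slid over a single-channel image (real implementations are vectorised and batched; the example filter is my own choice):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide one filter over a 2-D image (no padding); same weights at every position."""
    H, W = image.shape
    F = kernel.shape[0]                      # assume a square F x F filter
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * kernel)   # dot product of filter and local patch
    return out

# A 3x3 vertical-edge filter applied to a tiny image with a vertical boundary
image = np.zeros((6, 6)); image[:, 3:] = 1.0
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], float)
print(conv2d(image, kernel))
```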
COMP9417 T1, 2021 70
Convolutional Layer
COMP9417 T1, 2021 71
Convolutional Layer
Convolution
COMP9417 T1, 2021 72
Convolutional Layer
COMP9417 T1, 2021 73
Convolutional Layer
Original link: https://i.stack.imgur.com/nOLCe.gif
COMP9417 T1, 2021 74
Convolutional Layer
§ The output of the Conv layer can be interpreted as holding neurons arranged in a 3D volume.
§ The Conv layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
§ During the forward pass, each filter is slid (convolved) across the width and height of the input volume, producing a 2-dimensional activation map of that filter.
§ Network will learn filters (via backpropagation) that activate when they see some specific type of feature at some spatial position in the input.
COMP9417 T1, 2021 75
Convolutional Layer
§ Stacking these activation maps for all filters along the depth dimension forms the full output volume
§ E.g., with 6 filters, we get 6 activation maps
COMP9417 T1, 2021 76
Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Depth controls the number of neurons in the Conv layer that connect to the same region of the input volume
COMP9417 T1, 2021 77
Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Stride is the distance that the filter is moved by in spatial dimensions
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
COMP9417 T1, 2021 78
Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Zero-padding is padding of the input with zeros spatially on the border of the input volume
https://deeplizard.com/learn/video/qSTv_m-KFk0
COMP9417 T1, 2021 79
Convolutional Layer
§ We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border:
(W−F+2P)/S+1
§ If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way.
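A small helper applying this formula, with the AlexNet numbers from the next slide as a check (illustrative only):

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a conv layer: (W - F + 2P)/S + 1 (must be an integer)."""
    size = (W - F + 2 * P) / S + 1
    if not size.is_integer():
        raise ValueError("filter size/stride/padding do not tile the input evenly")
    return int(size)

print(conv_output_size(W=227, F=11, S=4, P=0))   # 55, as in the AlexNet example
```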
COMP9417 T1, 2021 80
Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is
(W−F+2P)/S+1 =(227−11+0)/4+1=55
There are 96 filters in this layer, the output volume of this layer is thus 55 × 55 × 96
COMP9417 T1, 2021 81
Example: AlexNet
COMP9417 T1, 2021 82
Convolutional Layer
§ Main property – local connectivity:
§ Each neuron only connects to a local region of the input volume.
§ The spatial extent of this connectivity is a hyperparameter called receptive field of the neuron.
§ The extent of the connectivity along the depth axis is always equal to the depth of the input volume.
COMP9417 T1, 2021 83
Convolutional Layer
§ Examples:
§ Eg1: Suppose that the input volume has size [32x32x3]. If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights. Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
§ Eg2: Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
COMP9417 T1, 2021 84
Convolutional Layer
§ Main property – parameter sharing:
§ Parameter sharing scheme used in Convolutional Layers to control the
number of parameters
§ In other words, denoting a single 2-D slice as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias
§ This is exactly what we do with spatial filters for signals/images!
COMP9417 T1, 2021 85
Convolutional Layer
§ Main property – parameter sharing:
§ Motivation of parameter sharing
§ If one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2).
COMP9417 T1, 2021 86
Convolutional Layer
§ Example:
§ In AlexNet, without parameter sharing, there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias.
§ Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.
§ With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).
§ Alternatively, it can be viewed as all 55*55 neurons in each depth slice will now be using the same parameters.
COMP9417 T1, 2021 87
Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is
(W−F+2P)/S+1 =(227−11+0)/4+1=55
There are 96 filters in this layer. Compute the number of:
weights per neuron? neurons? connections?
independent parameters?
COMP9417 T1, 2021 88
Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is
(W−F+2P)/S+1 = (227−11+0)/4+1 = 55
There are 96 filters in this layer. Compute the number of:
weights per neuron: 1 + 11×11×3 = 364
neurons: 55 × 55 × 96 = 290,400
connections: 55 × 55 × 96 × 364 = 105,705,600
independent parameters: 96 × 364 = 34,944
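The same counts can be reproduced programmatically; this sketch assumes square filters and one bias per filter, matching the slide's numbers:

```python
def conv_layer_counts(out_h, out_w, n_filters, F, depth_in):
    """Counts for one conv layer, with and without parameter sharing."""
    weights_per_neuron = F * F * depth_in + 1          # +1 for the bias
    neurons = out_h * out_w * n_filters
    connections = neurons * weights_per_neuron
    shared_params = n_filters * weights_per_neuron     # one filter per depth slice
    return weights_per_neuron, neurons, connections, shared_params

print(conv_layer_counts(55, 55, 96, F=11, depth_in=3))
# (364, 290400, 105705600, 34944)
```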
COMP9417 T1, 2021
89
Pooling Layer
§ The function of the pooling layer is
§ to progressively reduce the spatial size of the representation, to reduce the number of parameters and computation in the network, and
§ hence to also control overfitting
§ The Pooling Layer operates
§ independently on every depth slice of the input and resizes it spatially, typically using the MAX operation (i.e. max pooling)
§ The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations
COMP9417 T1, 2021 90
Pooling Layer
COMP9417 T1, 2021 91
Pooling Layer
Max pooling
COMP9417 T1, 2021 92
Pooling Layer
§ If the previous layer is J × K, and max pooling is applied with width F and stride S, the size of the output will be
(1 + (J − F)/S) × (1 + (K − F)/S)
§ If max pooling with width 3 and stride 2 is applied to the 55 × 55 feature map of the first convolutional layer of AlexNet, what is the output size after pooling?
Answer: 1 + (55 − 3)/2 = 27, i.e. 27 × 27 per depth slice.
§ How many independent parameters does this add to the model?
Answer: None! (no weights to be learned, just computing the max)
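An illustrative max-pooling routine for one depth slice, assuming the window and stride tile the input exactly:

```python
import numpy as np

def max_pool(feature_map, F=2, S=2):
    """Max pooling on a single 2-D depth slice with window F and stride S."""
    J, K = feature_map.shape
    out_j = 1 + (J - F) // S
    out_k = 1 + (K - F) // S
    out = np.empty((out_j, out_k))
    for i in range(out_j):
        for j in range(out_k):
            out[i, j] = feature_map[i*S:i*S+F, j*S:j*S+F].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x, F=2, S=2))   # 2x2 output; no parameters are learned here
```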
COMP9417 T1, 2021 93
ReLU Layer
§ Although ReLU (Rectified Linear Unit) is considered as a layer, it is
really an activation function:
§ f(x) = max(0, x)
§ This is favoured in deep learning as opposed to the traditional
activation functions like Sigmoid or Tanh
§ It accelerates the convergence of stochastic gradient descent
§ It is computationally inexpensive compared to traditional activations
§ However, ReLU units can be fragile during training and can ‘die’. Leaky ReLUs were proposed to handle this problem.
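Both activations in a few lines; the 0.01 slope used for the leaky variant is a common default, assumed here:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small non-zero slope keeps "dead" units trainable

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x))
```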
COMP9417 T1, 2021 94
Fully-connected Layer
§ Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
COMP9417 T1, 2021 95
Dropout Layer
§ Problem with overfitting – model performs well on training data but generalises poorly to testing data
§ Dropout is a simple and effective method to reduce overfitting
§ In each forward pass, randomly set some neurons to zero
§ Probability of dropping is a hyperparameter, such as 0.5
COMP9417 T1, 2021 96
Dropout Layer
§ Makes the training process noisy
§ Forcing nodes within a layer to probabilistically take on more or less
responsibility for the inputs
§ Prevents co-adaptation of features and simulates a sparse activation
§ Analogous to training a large ensemble of models but with much higher efficiency
COMP9417 T1, 2021 97
Dropout Layer
• During test time, direct application would make the output random
• A simple approach: multiply the activation by dropout probability (e.g. 0.5)
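A minimal sketch of the scheme described above, written in terms of a keep probability to avoid ambiguity (keep_prob = 0.5 corresponds to the slide's dropout probability of 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, keep_prob=0.5):
    """Training-time dropout: randomly zero units, keeping each with probability keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask

def dropout_test(activations, keep_prob=0.5):
    """Test-time: no units are dropped; scale activations so expected values match training."""
    return activations * keep_prob

h = np.array([0.2, 1.5, -0.7, 3.0])
print(dropout_train(h), dropout_test(h))
```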
COMP9417 T1, 2021 98
Output Layer
§ The output layer produces the probability of each class given the input image
§ This is the last layer containing the same number of neurons as the number of classes in the dataset
§ The output of this layer passes through a Softmax activation function to normalize the outputs to a sum of one:
softmax(z)i = exp(zi) / Σj exp(zj)
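A numerically stable implementation of this function (subtracting max(z) is a standard trick, not part of the slide):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())   # class probabilities, summing to 1
```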
COMP9417 T1, 2021 99
Loss Function
§ A loss function is used to quantify how well the model’s outputs match the target labels
§ Most commonly used: categorical cross-entropy loss function
§ The training objective is to minimise this loss
§ The loss guides the backpropagation process to train the CNN
model
§ Stochastic gradient descent and the Adam optimiser are commonly used algorithms for optimisation
COMP9417 T1, 2021
100
Training
• Backpropagation in general:
1. Initialise the network.
2. Input the first observation.
3. Forward-propagation. From left to right the neurons are activated and the output value is produced.
4. Calculate the error in the outputs (loss function).
5. From right to left the generated error is back-propagated and the weight updates (partial derivatives) are accumulated.
6. Repeat steps 2-5 and adjust the weights after a batch of observations.
7. When the whole training set has passed through the network, one epoch is complete. Repeat for more epochs.
https://www.superdatascience.com/blogs/artificial-neural-networks-backpropagation
COMP9417 T1, 2021
101
Pre-processing
Pre-processing: image scaling, zero mean, and normalisation
COMP9417 T1, 2021
102
Data Augmentation
§ Essential for increasing the dataset size and avoiding over-fitting
§ More data augmentation often leads to better performance but also
longer training time
§ Commonly used techniques include:
§ Horizontal / vertical flipping
§ Random cropping and scaling
§ Rotation
§ Gaussian filtering
§ During testing, average the results from multiple augmented input images
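Two of the simplest augmentations from the list above, as an illustrative numpy sketch (deep learning frameworks provide these as built-in transforms):

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(image):
    """Flip an (H, W, C) image left-to-right."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w):
    """Take a random crop of size (crop_h, crop_w) from an (H, W, C) image."""
    H, W, _ = image.shape
    top = rng.integers(0, H - crop_h + 1)
    left = rng.integers(0, W - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

image = rng.random((32, 32, 3))
print(horizontal_flip(image).shape, random_crop(image, 28, 28).shape)
```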
COMP9417 T1, 2021
103
Data Augmentation
Augmentation techniques need evaluation => not all of them are useful for a given task
https://blog.insightdatascience.com/automl-for-data-augmentation-e87cf692c366
COMP9417 T1, 2021
104
Initialisation
§ Weight initialisation
§ Cannot be all 0’s => Need to ensure diversity in the filter weights
§ Use small random numbers => might aggravate the diminishing gradients problem
§ With calibration
§ Sparse initialisation
§ More advanced techniques
§ Use ImageNet pretrained models => not always possible
COMP9417 T1, 2021
105
Balancing Data
§ Balanced training data
§ Important to have similar numbers of training images for different
classes, so the optimisation would not be biased by one class
§ Use random sampling to achieve this effect during each epoch of training
§ Assign different weights in the loss function
COMP9417 T1, 2021
106
Testing
§ Forward passes of the network throughout the layers give the prediction output of the input data
COMP9417 T1, 2021
107
Transfer Learning
§ CNN models trained on ImageNet can be applied to other types of images
§ It is possible to finetune only the last FC layers to better fit the model to the specific set of images
§ Especially useful for small datasets
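A hedged PyTorch-style sketch of this idea: load a model pretrained on ImageNet, freeze its convolutional layers, and train only a new final fully-connected layer. The use of torchvision's ResNet-18 and its fc attribute is an assumption made for illustration; adapt to whichever backbone is actually used:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Assumed backbone: a ResNet-18 pretrained on ImageNet, whose last layer is `fc`
model = models.resnet18(pretrained=True)

for param in model.parameters():        # freeze all pretrained weights
    param.requires_grad = False

num_classes = 4                         # e.g. a small dataset with 4 classes (assumption)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new trainable output layer

# Only the new layer's parameters are passed to the optimiser
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```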
COMP9417 T1, 2021
108
VGGNet
COMP9417 T1, 2021
109
VGGNet
COMP9417 T1, 2021
110
Well-known Models
§ Object Recognition
§ AlexNet (2012)
§ GoogLeNet
§ VGGNet
§ ResNet
§ Inception v3/v4
§ DenseNets are among the current state-of-the-art
§ Semantic Segmentation
§ Multi-scale CNN (2012)
§ FCN
§ U-net / V-net
§ U-net / V-net with skip/dense connections
§ Many other variations
COMP9417 T1, 2021
111
Summary
Artificial Neural Networks
Complex function fitting. ANNs generalise core techniques from machine learning and statistics based on linear models for regression and classification.
Learning is typically stochastic gradient descent. Networks are too complex to fit otherwise.
COMP9417 T1, 2021 112
Summary
• A brief introduction to CNNs – the most commonly used model in deep learning
• Widely used in computer vision studies
• In-depth knowledge in COMP9444
COMP9417 T1, 2021 113
Acknowledgement
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Slides from CISC 4631/6930 Data Mining https://slideplayer.com/slide/13508539/
COMP9417 T1, 2021 114