
Neural Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain

Aims
This lecture will develop your understanding of Neural Network Learning & will extend that to Deep Learning
– describe Perceptrons and how to train them
– relate neural learning to optimization in machine learning
– outline the problem of neural learning
– derive the Gradient Descent for linear models
– describe the problem of learning non-linear models with neural networks
– outline the method of back-propagation training of a multi-layer network
– understand Convolutional Neural Networks (CNNs)
– understand the main difference between CNNs and regular NNs
– know the basics of training CNNs
COMP9417 T1, 2021 1

Artificial Neural Networks
Artificial Neural Networks are inspired by the human nervous system
NNs are composed of a large number of interconnected processing elements known as neurons
They use supervised, error-correcting rules with back-propagation to learn a specific task
http://statsmaths.github.io/stat665/lectures/lec12/lecture12.pdf
COMP9417 T1, 2021 2

Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron – a simplified neuron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
COMP9417 T1, 2021 3

Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 T1, 2021 4

Perceptron
Each neuron has multiple dendrites and a single axon. The neuron receives its inputs from its dendrites and transmits its output through its axon. Both inputs and outputs take the form of electrical impulses. The neuron sums up its inputs, and if the total electrical impulse strength exceeds the neuron’s firing threshold, the neuron fires off a new impulse along its single axon. The axon, in turn, distributes the signal along its branching synapses which collectively reach thousands of neighboring neurons.
https://towardsdatascience.com/from-fiction-to-reality-a-beginners-guide-to-artificial-neural-networks-d0411777571b
COMP9417 T1, 2021 5

Perceptron
Output o is a thresholded sum of the products of the inputs and their weights:
o(x1, . . . , xn) = 1 if w0 + w1x1 + · · · + wnxn > 0, and −1 otherwise
COMP9417 T1, 2021 6

Perceptron
Or in vector notation:
o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and −1 otherwise, and x includes a constant input x0 = 1 so that w0 acts as the threshold
COMP9417 T1, 2021 7
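As a minimal illustration of the thresholded output above, here is a sketch in Python/NumPy (the choice of language and the function name are mine, not part of the course material):

    import numpy as np

    def perceptron_output(w, x):
        # x is augmented with a leading 1 so that w[0] acts as the bias/threshold weight
        return 1 if np.dot(w, x) > 0 else -1

    x = np.array([1.0, 2.0, -1.0])   # first component is the constant 1
    w = np.array([0.5, 0.3, 0.8])
    print(perceptron_output(w, x))   # 1, since w.x = 0.5 + 0.6 - 0.8 = 0.3 > 0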

Decision Surface of a Perceptron
• The perceptron is able to represent some useful functions, namely those which are linearly separable, as in (a)
• But functions that are not linearly separable are not representable, e.g. XOR in (b)
COMP9417 T1, 2021 8

Perceptron Learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
COMP9417 T1, 2021 9

Perceptron Learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
§ Let xi be a misclassified positive example, then we have yi = +1 and w · xi < 0. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and hopefully past xi.
§ This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
§ Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > 0. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
COMP9417 T1, 2021 10

Perceptron Learning
§ The two cases can be combined in a single update rule:
w′ = w + η yi xi
§ Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
§ This is the basis of the perceptron training algorithm for linear classification
§ The algorithm just iterates over the training examples, applying the weight update rule until all the examples are correctly classified
§ If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
COMP9417 T1, 2021 11

Training Perceptron
COMP9417 T1, 2021 12

Perceptron Learning Rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update.
(middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence.
(right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
COMP9417 T1, 2021 13

Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
COMP9417 T1, 2021 14

Perceptron Convergence
Dataset D = {(x1, y1), . . . , (xn, yn)}
At least one example in D is labelled +1, and one is labelled −1.
A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i yi w∗ · xi ≥ γ
γ is typically referred to as the “margin”
COMP9417 T1, 2021 15

Perceptron Convergence
Perceptron Convergence Theorem (Novikoff, 1962)
Let R = maxi ||xi||. The number of mistakes made by the perceptron is at most (R/γ)².
COMP9417 T1, 2021 16
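To make the update rule w′ = w + η yi xi concrete, here is a minimal perceptron training sketch in Python/NumPy. It is an illustration only (the function and variable names are my own); it assumes labels in {−1, +1} and inputs augmented with a constant 1 so that the first weight plays the role of the threshold.

    import numpy as np

    def train_perceptron(X, y, eta=1.0, max_epochs=100):
        """X: (n, d) array with a leading column of 1s; y: labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:        # xi is misclassified
                    w = w + eta * yi * xi          # the perceptron update rule
                    mistakes += 1
            if mistakes == 0:                      # all examples correctly classified
                break
        return w

    # Tiny linearly separable example (AND-like data)
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1])
    w = train_perceptron(X, y)
    print(w, np.sign(X @ w))

On linearly separable data such as this toy example the loop stops once an epoch produces no mistakes, in line with the convergence result above.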
Decision Surface of a Perceptron
§ Unfortunately, as a linear classifier perceptrons are limited in expressive power
§ So some functions are not representable, e.g., those that are not linearly separable
§ For non-linearly separable data we’ll need something else
§ However, with a relatively minor modification many perceptrons can be combined together to form one model
§ multilayer perceptrons, the classic “neural network”
COMP9417 T1, 2021 17

Optimisation
Studied in many fields such as engineering, science, economics, . . .
A general optimisation algorithm [1]:
1) start with initial point x = x0
2) select a search direction p, usually to decrease f(x)
3) select a step length η
4) set s = ηp
5) set x = x + s
6) go to step 2, unless convergence criteria are met
For example, we could minimise a real-valued function f.
Note: convergence criteria will be problem-specific.
[1] B. Ripley (1996) “Pattern Recognition and Neural Networks”, CUP.
COMP9417 T1, 2021 18

Optimisation
Usually, we would like the optimisation algorithm to quickly reach an answer that is close to being the right one.
§ typically, we need to minimise a function
§ e.g., error or loss
§ this optimisation is known as gradient descent or steepest descent
§ sometimes, we need to maximise a function
§ e.g., probability or likelihood
§ this optimisation is known as gradient ascent or steepest ascent
COMP9417 T1, 2021 19

Gradient Descent
To understand, consider the simple linear unit, where
o = w0 + w1x1 + · · · + wnxn
Let’s learn wi that minimise the squared error
E[w] = 1/2 Σd∈D (td − od)²
where D is the set of training samples
COMP9417 T1, 2021 20

Gradient Descent
COMP9417 T1, 2021 21

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w0, ∂E/∂w1, . . . , ∂E/∂wn]
Gradient vector gives direction of steepest increase in error E
Negative of the gradient, i.e., steepest decrease, is what we want
Training rule: ∆w = −η ∇E[w]
i.e., ∆wi = −η ∂E/∂wi
COMP9417 T1, 2021 22

Gradient Descent
COMP9417 T1, 2021 23

Gradient Descent
COMP9417 T1, 2021 24

Perceptron vs. Linear Unit
Perceptron training rule guaranteed to succeed if
§ Training examples are linearly separable
§ Sufficiently small learning rate η
Linear unit training rule uses gradient descent
§ Guaranteed to converge to hypothesis with minimum squared error
§ Given sufficiently small learning rate η
§ Even when training data contains noise
§ Even when training data not separable by H
COMP9417 T1, 2021 25

Stochastic (Incremental) Gradient Descent
Batch mode Gradient Descent: compute the gradient over the entire training set D before updating, ∆w = −η ∇ED[w]
Stochastic (incremental) mode Gradient Descent: update after each training example d, ∆w = −η ∇Ed[w], where Ed[w] = 1/2 (td − od)²
COMP9417 T1, 2021 26

Stochastic Gradient Descent
§ Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if η is made small enough
§ Very useful for training large networks, or online learning from data streams
§ Stochastic implies examples should be selected at random
COMP9417 T1, 2021 27
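The batch and stochastic update rules above can be written in a few lines of NumPy. This is a sketch, assuming the squared error E[w] = 1/2 Σd (td − od)² for a linear unit o = w · x; the function names are illustrative.

    import numpy as np

    def batch_gradient_descent(X, t, eta=0.01, epochs=500):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = X @ w                      # outputs of the linear unit
            grad = -(X.T @ (t - o))        # dE/dw for E = 1/2 * sum (t - o)^2
            w -= eta * grad                # step against the gradient
        return w

    def stochastic_gradient_descent(X, t, eta=0.01, epochs=500):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xd, td in zip(X, t):       # one update per training example
                od = xd @ w
                w += eta * (td - od) * xd  # delta rule / LMS update
        return w

    X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])   # leading 1 = bias input
    t = np.array([1.0, 3.0, 5.0, 7.0])                       # targets follow t = 1 + 2x
    print(batch_gradient_descent(X, t))        # approx [1, 2]
    print(stochastic_gradient_descent(X, t))   # approx [1, 2]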
Multilayer Networks
COMP9417 T1, 2021 28

Multilayer Networks
§ Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult
§ A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next (with activation feeding forward)
§ The weights determine the function computed.
§ Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer
COMP9417 T1, 2021 29

General Structure of ANN
COMP9417 T1, 2021 30

General Structure of ANN
Properties of Artificial Neural Networks (ANNs):
§ Many neuron-like threshold switching units
§ Many weighted interconnections among units
§ Highly parallel, distributed process
§ Emphasis on tuning weights automatically
Artificial Neural Network (Source: VIASAT)
COMP9417 T1, 2021 31

When to Consider Neural Networks
§ Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
§ Output can be discrete or real-valued
§ Output can be a vector of values
§ Possibly noisy data
§ Form of target function is unknown
§ Interpretability of result is not important
Examples:
§ Speech recognition
§ Image classification
§ many others . . .
COMP9417 T1, 2021 32

ALVINN drives 70 mph on highways
COMP9417 T1, 2021 33

ALVINN
COMP9417 T1, 2021 34

MLP Speech Recognition Decision Boundaries
COMP9417 T1, 2021 35

Sigmoid Unit
COMP9417 T1, 2021 36

Sigmoid Unit
§ Same as a perceptron except that the step function has been replaced by a nonlinear sigmoid function.
§ Nonlinearity makes it easy for the model to generalise or adapt to a variety of data and to differentiate between the outputs.
COMP9417 T1, 2021 37

Sigmoid Unit
COMP9417 T1, 2021 38

Sigmoid Unit
Why use the sigmoid function σ(x)?
Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
COMP9417 T1, 2021 39

Error Gradient of Sigmoid Unit
Start by assuming we want to minimise the squared error E[w] = 1/2 Σd∈D (td − od)² over a set of training examples D.
COMP9417 T1, 2021 40

Error Gradient of Sigmoid Unit
We know: ∂od/∂netd = od (1 − od), where netd = w · xd
So: ∂E/∂wi = − Σd∈D (td − od) od (1 − od) xid
COMP9417 T1, 2021 41

Backpropagation Algorithm
COMP9417 T1, 2021 42

More on Backpropagation
A solution for learning highly complex models . . .
§ Gradient descent over entire network weight vector
§ Easily generalised to arbitrary directed graphs
§ Can learn probabilistic models by maximising likelihood
Minimises error over all training examples
§ Training can take thousands of iterations → slow!
§ Using network after training is very fast
COMP9417 T1, 2021 43

More on Backpropagation
Will converge to a local, not necessarily global, error minimum
§ Might exist many such local minima
§ In practice, often works well (can run multiple times)
§ Often include weight momentum α:
∆wji(n) = η δj xji + α ∆wji(n − 1)
§ Stochastic gradient descent using “mini-batches”
Nature of convergence
§ Initialise weights near zero
§ Therefore, initial networks near-linear
§ Increasingly non-linear functions become possible as training progresses
COMP9417 T1, 2021 44
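A minimal sketch of backpropagation for a single-hidden-layer network of sigmoid units, trained with squared error and the momentum term α ∆wji(n − 1) shown above. It is an illustration in NumPy (the structure and names are my own, not a reference implementation from the course), applied to XOR, which a single perceptron cannot represent.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def add_bias(A):
        # append a constant-1 column so the last row of the next weight matrix acts as a bias
        return np.hstack([A, np.ones((A.shape[0], 1))])

    def train_mlp(X, T, n_hidden=4, eta=0.5, alpha=0.9, epochs=10000, seed=0):
        rng = np.random.default_rng(seed)
        Xb = add_bias(X)
        W1 = rng.normal(0, 0.1, (Xb.shape[1], n_hidden))      # input  -> hidden weights
        W2 = rng.normal(0, 0.1, (n_hidden + 1, T.shape[1]))   # hidden -> output weights
        dW1_prev, dW2_prev = np.zeros_like(W1), np.zeros_like(W2)
        for _ in range(epochs):
            # forward pass
            Hb = add_bias(sigmoid(Xb @ W1))
            O = sigmoid(Hb @ W2)
            # backward pass for squared error E = 1/2 * sum (T - O)^2
            delta_out = (T - O) * O * (1 - O)
            delta_hid = (delta_out @ W2.T) * Hb * (1 - Hb)     # bias column contributes 0
            # weight changes with momentum: dW(n) = eta * gradient term + alpha * dW(n-1)
            dW2 = eta * (Hb.T @ delta_out) + alpha * dW2_prev
            dW1 = eta * (Xb.T @ delta_hid[:, :-1]) + alpha * dW1_prev
            W1, W2 = W1 + dW1, W2 + dW2
            dW1_prev, dW2_prev = dW1, dW2
        return W1, W2

    # XOR: not linearly separable, but learnable with one hidden layer of sigmoid units
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_mlp(X, T)
    O = sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2)
    print(np.round(O, 2))    # should approach [0, 1, 1, 0]

Because the weights start small, the early network is close to linear, matching the note above on the nature of convergence.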
More on Backpropagation
Models can be very complex
§ Will the network generalise well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularise the network, making it less likely to overfit
§ Add a term to the error that increases with the magnitude of the weight vector
§ Other ways to penalise large weights, e.g., weight decay
§ Using “tied” or shared sets of weights, e.g., by setting all weights to their mean after computing the weight updates
§ Many other ways . . .
COMP9417 T1, 2021 45

Expressive Capabilities of ANNs
Boolean functions:
§ Every Boolean function can be represented by a network with a single hidden layer
§ but might require exponential (in number of inputs) hidden units
Continuous functions:
§ Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
§ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Being able to approximate any function is one thing, being able to learn it is another . . .
COMP9417 T1, 2021 46

How complex should the model be?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
John von Neumann
COMP9417 T1, 2021 47

“Goodness of fit” in ANNs
Can neural networks overfit/underfit?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks!
Note also that on the second task the validation set error continues to decrease after an initial increase – any regularisation (network simplification, or weight reduction) strategy needs to avoid stopping too early (underfitting).
COMP9417 T1, 2021 48

Overfitting in ANNs
COMP9417 T1, 2021 49

Underfitting in ANNs
COMP9417 T1, 2021 50

Neural Networks for Classification
Sigmoid unit computes output o(x) = σ(w · x)
Output ranges from 0 to 1
Example: binary classification
Questions:
§ what error (loss) function should be used?
§ how can we train such a classifier?
COMP9417 T1, 2021 51

Neural Networks for Classification
Minimising squared error (as before) does not work so well for classification
If we take the output o(x) as the probability of the class of x being 1, the preferred loss function is the cross-entropy
E = − Σd∈D [ td log od + (1 − td) log(1 − od) ]
where: td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, one can use gradient descent and the backpropagation algorithm – this will yield the maximum likelihood solution.
COMP9417 T1, 2021 52
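A small sketch of the cross-entropy loss above and a gradient descent step for a single sigmoid unit; with this loss the gradient takes the simple form −Σd (td − od) xd, which is one reason it pairs naturally with sigmoid outputs. The names and toy data are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(t, o, eps=1e-12):
        # E = - sum_d [ t_d log o_d + (1 - t_d) log(1 - o_d) ]
        o = np.clip(o, eps, 1 - eps)
        return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

    def gradient_step(w, X, t, eta=0.1):
        o = sigmoid(X @ w)
        grad = -X.T @ (t - o)       # dE/dw for the sigmoid + cross-entropy combination
        return w - eta * grad

    X = np.array([[1, 0.5], [1, -1.0], [1, 2.0]])   # leading 1 is the bias input
    t = np.array([1.0, 0.0, 1.0])
    w = np.zeros(2)
    for _ in range(200):
        w = gradient_step(w, X, t)
    print(w, sigmoid(X @ w), cross_entropy(t, sigmoid(X @ w)))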
Application: Face Pose Recognition
Dataset: 624 images of faces of 20 different people
§ image size 120x128 pixels
§ grey-scale, 0-255 intensity value range
§ different poses
§ different expressions
§ wearing sunglasses or not
Raw images compressed to 30x32 pixels
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes
COMP9417 T1, 2021 53

Application: Face Pose Recognition
left straight right up
Four pose classes: looking left, straight ahead, right or upwards
Use a 1-of-n encoding: more parameters; can give confidence of prediction
Selected single hidden layer with 3 nodes by experimentation
COMP9417 T1, 2021 54

Application: Face Pose Recognition
After 1 epoch
left straight right up
COMP9417 T1, 2021 55

Application: Face Pose Recognition
After 100 epochs
left straight right up
COMP9417 T1, 2021 56

Application: Face Recognition
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks. Leftmost block corresponds to the bias (threshold) weight
Weights from each of the 30x32 image pixels into each hidden unit are plotted in the position of the corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
COMP9417 T1, 2021 57

Deep Learning: Convolutional Neural Networks
COMP9417 T1, 2021 58

A Bit of History
§ Earliest studies about the visual mechanics of animals emphasised the importance of edge detection for solving Computer Vision problems
§ Early processing in the cat visual cortex looks like it is performing convolutions that are looking for oriented edges and blobs
§ Certain cells are looking for edges with a particular orientation at a particular spatial location in the visual field
§ This inspired convolutional neural networks, but progress was limited by the lack of computational power
§ Hinton et al. reinvigorated research into deep learning and proposed a greedy layer-wise training technique
§ In 2012, Alex Krizhevsky et al. won the ImageNet challenge and proposed the well-recognised AlexNet Convolutional Neural Network
COMP9417 T1, 2021 59

A Bit of History
• LeNet-5
• The very first CNN
• Handwritten digit recognition
COMP9417 T1, 2021 60

A Bit of History
• LeNet-5
• The very first CNN
• Handwritten digit recognition
COMP9417 T1, 2021 61

A Bit of History
§ AlexNet (2012)
§ ImageNet classification
§ Images showing 1000 object categories
COMP9417 T1, 2021 62

A Bit of History
§ AlexNet (2012)
§ ImageNet classification
§ Images showing 1000 object categories
COMP9417 T1, 2021 63

Deep Learning
§ Deep learning is a collection of artificial neural network techniques that are widely used at present
§ Predominantly, deep learning techniques rely on large amounts of data and deeper learning architectures
§ Some well-known paradigms:
§ Convolutional Neural Networks (CNNs)
§ Recurrent Neural Networks
§ Auto-encoders
§ Restricted Boltzmann Machines
COMP9417 T1, 2021 64

CNNs
§ CNNs are very similar to regular Neural Networks
§ Made up of neurons with learnable weights
§ CNN architecture assumes that inputs are images
§ So that we have local features
§ Which allows us to
§ encode certain properties in the architecture that make the forward pass more efficient and
§ significantly reduce the number of parameters needed for the network
COMP9417 T1, 2021 65

Why CNNs?
The problem with regular NNs is that they do not scale well with dimensions (i.e. larger images)
§ E.g.: 32x32 image with 3 channels (RGB) – a neuron in the first hidden layer would have 32x32x3 = 3,072 weights: manageable.
§ E.g.: 200x200 image with 3 channels – a neuron in the first hidden layer would have 200x200x3 = 120,000 weights, and we need at least several of these neurons, which makes the number of weights explode.
COMP9417 T1, 2021 66

What is different?
§ In contrast, CNNs consider 3-D volumes of neurons and propose a parameter sharing scheme that minimises the number of parameters required by the network.
§ CNN neurons are arranged in 3 dimensions: Width, Height and Depth.
§ Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected)
COMP9417 T1, 2021 67

What is different?
NN CNN
COMP9417 T1, 2021 68

CNN architecture
Main layers:
§ Convolutional
§ Pooling
§ ReLU
§ Fully-connected
§ Drop-out
§ Output layers
COMP9417 T1, 2021 69

Convolutional Layer
Suppose we want to classify an image as a bird, sunset, dog, cat, etc.
If we can identify features such as a feather, an eye, or a beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.
We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.
COMP9417 T1, 2021 70

Convolutional Layer
COMP9417 T1, 2021 71

Convolutional Layer
Convolution
COMP9417 T1, 2021 72

Convolutional Layer
COMP9417 T1, 2021 73

Convolutional Layer
Original link: https://i.stack.imgur.com/nOLCe.gif
COMP9417 T1, 2021 74

Convolutional Layer
§ The output of the Conv layer can be interpreted as holding neurons arranged in a 3D volume.
§ The Conv layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
§ During the forward pass, each filter is slid (convolved) across the width and height of the input volume, producing a 2-dimensional activation map of that filter.
§ The network will learn filters (via backpropagation) that activate when they see some specific type of feature at some spatial position in the input.
COMP9417 T1, 2021 75
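To illustrate what sliding a filter across the input means, here is a naive single-channel 2-D convolution in NumPy (strictly a cross-correlation, as in most CNN libraries). It is a sketch with stride 1 and no padding; a real Conv layer also sums over the depth/channel dimension and applies many filters.

    import numpy as np

    def conv2d(image, kernel, stride=1):
        """Naive valid cross-correlation of a 2-D image with a 2-D kernel."""
        H, W = image.shape
        F_h, F_w = kernel.shape
        out_h = (H - F_h) // stride + 1
        out_w = (W - F_w) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+F_h, j*stride:j*stride+F_w]
                out[i, j] = np.sum(patch * kernel)   # the same weights at every position
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)    # responds to vertical edges
    print(conv2d(image, edge_filter))                 # a 3x3 activation map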
Convolutional Layer
§ Stacking these activation maps for all filters along the depth dimension forms the full output volume
§ E.g., with 6 filters, we get 6 activation maps
COMP9417 T1, 2021 76

Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Depth controls the number of neurons in the Conv layer that connect to the same region of the input volume
COMP9417 T1, 2021 77

Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Stride is the distance that the filter is moved by in spatial dimensions
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
COMP9417 T1, 2021 78

Convolutional Layer
§ Three hyperparameters control the size of the output volume: the depth, stride and zero-padding
§ Zero-padding is padding of the input with zeros spatially on the border of the input volume
https://deeplizard.com/learn/video/qSTv_m-KFk0
COMP9417 T1, 2021 79

Convolutional Layer
§ We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border:
(W − F + 2P)/S + 1
§ If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they “fit” across the input volume neatly, in a symmetric way.
COMP9417 T1, 2021 80

Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is (W − F + 2P)/S + 1 = (227 − 11 + 0)/4 + 1 = 55
There are 96 filters in this layer, so the output volume of this layer is 55 × 55 × 96
COMP9417 T1, 2021 81

Example: AlexNet
COMP9417 T1, 2021 82

Convolutional Layer
§ Main property – local connectivity:
§ Each neuron only connects to a local region of the input volume.
§ The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron.
§ The extent of the connectivity along the depth axis is always equal to the depth of the input volume.
COMP9417 T1, 2021 83

Convolutional Layer
§ Examples:
§ Eg1: Suppose that the input volume has size [32x32x3]. If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights. Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
§ Eg2: Suppose an input volume had size [16x16x20], i.e. 16×16 spatially with a depth of 20. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
COMP9417 T1, 2021 84
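The output-size formula (W − F + 2P)/S + 1 is easy to wrap in a small helper; the AlexNet numbers from the slide serve as a check. The function name is illustrative.

    def conv_output_size(W, F, P, S):
        """Spatial output size of a conv layer: (W - F + 2P)/S + 1 (must be an integer)."""
        size = (W - F + 2 * P) / S + 1
        if size != int(size):
            raise ValueError("filter does not tile the input neatly; adjust stride/padding")
        return int(size)

    # First conv layer of AlexNet: 227x227 input, 11x11 filters, no padding, stride 4
    print(conv_output_size(227, 11, 0, 4))   # 55, so with 96 filters: 55 x 55 x 96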
Convolutional Layer
§ Main property – parameter sharing:
§ A parameter sharing scheme is used in Convolutional Layers to control the number of parameters
§ In other words, denoting a single 2-D slice as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias
§ This is exactly what we do with spatial filters for signals/images!
COMP9417 T1, 2021 85

Convolutional Layer
§ Main property – parameter sharing:
§ Motivation of parameter sharing:
§ If one patch feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2).
COMP9417 T1, 2021 86

Convolutional Layer
§ Example:
§ In AlexNet, without parameter sharing, there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias.
§ Together, this adds up to 290,400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.
§ With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases).
§ Alternatively, it can be viewed as all 55*55 neurons in each depth slice now using the same parameters.
COMP9417 T1, 2021 87

Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is (W − F + 2P)/S + 1 = (227 − 11 + 0)/4 + 1 = 55
There are 96 filters in this layer. Compute the number of:
weights per neuron?
neurons?
connections?
independent parameters?
COMP9417 T1, 2021 88

Example: AlexNet
For example, in the first convolutional layer of AlexNet, W = 227, F = 11, P = 0, S = 4.
The width of the output is (W − F + 2P)/S + 1 = (227 − 11 + 0)/4 + 1 = 55
There are 96 filters in this layer. Compute the number of:
weights per neuron? 1 + 11×11×3 = 364
neurons? 55 × 55 × 96 = 290,400
connections? 55 × 55 × 96 × 364 = 105,705,600
independent parameters? 96 × 364 = 34,944
COMP9417 T1, 2021 89

Pooling Layer
§ The function of the pooling layer is
§ to progressively reduce the spatial size of the representation, to reduce the number of parameters and computation in the network, and
§ hence to also control overfitting
§ The Pooling Layer operates
§ independently on every depth slice of the input and resizes it spatially, typically using the MAX operation (i.e. max pooling)
§ The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations
COMP9417 T1, 2021 90

Pooling Layer
COMP9417 T1, 2021 91

Pooling Layer
Max pooling
COMP9417 T1, 2021 92

Pooling Layer
§ If the previous layer is J × K, and max pooling is applied with width F and stride S, the size of the output will be
(1 + (J − F)/S) × (1 + (K − F)/S)
§ If max pooling with width 3 and stride 2 is applied to the feature map of size 55 × 55 in the first convolutional layer of AlexNet, what is the output size after pooling? Answer: 1 + (55 − 3)/2 = 27.
§ How many independent parameters does this add to the model? Answer: None! (no weights to be learned, just computing max)
COMP9417 T1, 2021 93
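A minimal max-pooling sketch matching the description above (window width F, stride S, applied to a single depth slice); note that it adds no learnable parameters.

    import numpy as np

    def max_pool2d(fmap, F=2, S=2):
        """Max pooling of a single 2-D feature map with window F and stride S."""
        J, K = fmap.shape
        out_h = (J - F) // S + 1
        out_w = (K - F) // S + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.max(fmap[i*S:i*S+F, j*S:j*S+F])
        return out

    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2d(fmap))                                  # 2x2 output; 75% of activations discarded
    print(max_pool2d(np.ones((55, 55)), F=3, S=2).shape)     # (27, 27), as in the AlexNet example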
ReLU Layer
§ Although ReLU (Rectified Linear Unit) is considered as a layer, it is really an activation function:
§ f(x) = max(0, x)
§ This is favoured in deep learning as opposed to traditional activation functions like Sigmoid or Tanh
§ To accelerate the convergence of stochastic gradient descent
§ Being computationally inexpensive compared to traditional ones
§ However, ReLU units can be fragile during training and ‘die’. Leaky ReLUs were proposed to handle this problem.
COMP9417 T1, 2021 94

Fully-connected Layer
§ Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
COMP9417 T1, 2021 95

Dropout Layer
§ Problem with overfitting – the model performs well on training data but generalises poorly to testing data
§ Dropout is a simple and effective method to reduce overfitting
§ In each forward pass, randomly set some neurons to zero
§ Probability of dropping is a hyperparameter, such as 0.5
COMP9417 T1, 2021 96

Dropout Layer
§ Makes the training process noisy
§ Forces nodes within a layer to probabilistically take on more or less responsibility for the inputs
§ Prevents co-adaptation of features and simulates a sparse activation
§ Analogous to training a large ensemble of models but with much higher efficiency
COMP9417 T1, 2021 97

Dropout Layer
• During test time, direct application would make the output random
• A simple approach: multiply the activation by the dropout probability (e.g. 0.5)
COMP9417 T1, 2021 98

Output Layer
§ The output layer produces the probability of each class given the input image
§ This is the last layer, containing the same number of neurons as the number of classes in the dataset
§ The output of this layer passes through a Softmax activation function to normalize the outputs to a sum of one:
softmax(z)i = exp(zi) / Σj exp(zj)
COMP9417 T1, 2021 99

Loss Function
§ A loss function is used to compute the model’s prediction accuracy from the outputs
§ Most commonly used: the categorical cross-entropy loss function
§ The training objective is to minimise this loss
§ The loss guides the backpropagation process to train the CNN model
§ Stochastic gradient descent and the Adam optimiser are commonly used algorithms for optimisation
COMP9417 T1, 2021 100

Training
• Backpropagation in general:
1. Initialise the network.
2. Input the first observation.
3. Forward-propagation. From left to right the neurons are activated and the output value is produced.
4. Calculate the error in the outputs (loss function).
5. From right to left the generated error is back-propagated and the weight updates (partial derivatives) are accumulated.
6. Repeat steps 2-5 and adjust the weights after a batch of observations.
7. When the whole training set has passed through the network, that makes an epoch. Redo more epochs.
https://www.superdatascience.com/blogs/artificial-neural-networks-backpropagation
COMP9417 T1, 2021 101
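A sketch of the softmax output and the categorical cross-entropy loss referred to above, in NumPy; subtracting the maximum before exponentiating is a standard trick for numerical stability. The names are illustrative.

    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability; the outputs sum to one
        e = np.exp(z - np.max(z, axis=-1, keepdims=True))
        return e / np.sum(e, axis=-1, keepdims=True)

    def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
        # y_true: one-hot labels, y_pred: softmax outputs
        return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

    logits = np.array([[2.0, 1.0, 0.1]])          # raw scores for 3 classes
    probs = softmax(logits)
    y_true = np.array([[1.0, 0.0, 0.0]])
    print(probs, categorical_cross_entropy(y_true, probs))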
Pre-processing
Pre-processing: image scaling, zero mean, and normalisation
COMP9417 T1, 2021 102

Data Augmentation
§ Essential for increasing the dataset size and avoiding over-fitting
§ More data augmentation often leads to better performance but also longer training time
§ Commonly used techniques include:
§ Horizontal / vertical flipping
§ Random cropping and scaling
§ Rotation
§ Gaussian filtering
§ During testing, average the results from multiple augmented input images
COMP9417 T1, 2021 103

Data Augmentation
Need evaluation => not all techniques are useful
https://blog.insightdatascience.com/automl-for-data-augmentation-e87cf692c366
COMP9417 T1, 2021
104
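Two of the augmentation techniques listed above, horizontal flipping and random cropping, take only a few lines of NumPy. This is an illustrative sketch for a single H × W × C image array.

    import numpy as np

    def random_horizontal_flip(img, rng, p=0.5):
        # flip the width axis with probability p
        return img[:, ::-1, :] if rng.random() < p else img

    def random_crop(img, crop_h, crop_w, rng):
        H, W, _ = img.shape
        top = rng.integers(0, H - crop_h + 1)
        left = rng.integers(0, W - crop_w + 1)
        return img[top:top + crop_h, left:left + crop_w, :]

    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))                 # a dummy RGB image
    aug = random_crop(random_horizontal_flip(img, rng), 28, 28, rng)
    print(aug.shape)                              # (28, 28, 3)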

Initialisation
§ Weight initialisation
§ Cannot be all 0’s => Need to ensure diversity in the filter weights
§ Use small random numbers => might aggravate the vanishing (diminishing) gradients problem
§ With calibration
§ Sparse initialisation
§ More advanced techniques
§ Use ImageNet pretrained models => not always possible
COMP9417 T1, 2021
105
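A sketch of the initialisation options listed above: small random numbers, and versions calibrated by the number of inputs to each neuron (the Xavier/He-style heuristics; the slide only says “with calibration”, so treating it as these heuristics is my assumption).

    import numpy as np

    rng = np.random.default_rng(0)
    fan_in, fan_out = 256, 128

    # small random numbers (can aggravate vanishing gradients in deep nets)
    W_small = rng.normal(0.0, 0.01, size=(fan_in, fan_out))

    # calibrated by fan-in: activation variance roughly preserved across layers
    W_xavier = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
    W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))  # common with ReLU

    print(W_small.std(), W_xavier.std(), W_he.std())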

Balancing Data
§ Balanced training data
§ Important to have similar numbers of training images for different classes, so the optimisation would not be biased by one class
§ Use random sampling to achieve this effect during each epoch of training
§ Assign different weights in the loss function
COMP9417 T1, 2021
106
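One way to implement “different weights in the loss function” is to weight each class inversely to its frequency. The particular scheme below is a common convention rather than something the slide specifies.

    import numpy as np

    def class_weights(labels, n_classes):
        counts = np.bincount(labels, minlength=n_classes)
        # inverse-frequency weights, normalised so a perfectly balanced set gives weight 1
        return counts.sum() / (n_classes * np.maximum(counts, 1))

    labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])   # imbalanced toy label set
    w = class_weights(labels, 3)
    print(w)             # rare classes get larger weights
    print(w[labels])     # per-example weights that could multiply the loss terms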

Testing
§ A forward pass of the input data through the network’s layers gives the prediction output
COMP9417 T1, 2021
107

Transfer Learning
§ CNN models trained on ImageNet can be applied to other types of images
§ It is possible to finetune only the last FC layers to better fit the model to the specific set of images
§ Especially useful for small datasets
COMP9417 T1, 2021
108
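A hedged sketch of this fine-tuning recipe using PyTorch/torchvision (an assumption; the course does not mandate a framework): load an ImageNet-pretrained model, freeze the convolutional layers, and replace and retrain only the final fully-connected layer.

    # Assumes torch and torchvision are installed; API details may differ between versions.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)      # ImageNet-pretrained backbone

    for param in model.parameters():              # freeze all pretrained weights
        param.requires_grad = False

    n_classes = 5                                 # hypothetical target dataset
    model.fc = nn.Linear(model.fc.in_features, n_classes)   # new, trainable FC head

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # a training loop over the small target dataset would go here, e.g.:
    # logits = model(images); loss = criterion(logits, labels); loss.backward(); optimizer.step()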

VGGNet
COMP9417 T1, 2021
109

VGGNet
COMP9417 T1, 2021
110

Well-known Models
§ Object Recognition
§ AlexNet (2012)
§ GoogLeNet
§ VGGNet
§ ResNet
§ Inception v3/v4
§ DenseNets is the current state-of-the-art
§ Semantic Segmentation
§ Multi-scale CNN (2012)
§ FCN
§ U-net / V-net
§ U-net / V-net with skip/dense connections
§ Many other variations
COMP9417 T1, 2021
111

Summary
Artificial Neural Networks
Artificial neural networks perform complex function fitting. They generalise core techniques from machine learning and statistics based on linear models for regression and classification.
Learning is typically by stochastic gradient descent, since the networks are too complex to fit otherwise.
COMP9417 T1, 2021 112

Summary
• A brief introduction to CNNs – the most commonly used deep learning models
• Widely used in computer vision studies
• In-depth coverage in COMP9444
COMP9417 T1, 2021 113

Acknowledgement
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Slides from CISC 4631/6930 Data Mining https://slideplayer.com/slide/13508539/
COMP9417 T1, 2021 114