Neural Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael
This lecture will develop your understanding of Neural Network Learning & will extend that to Deep Learning
– describe Perceptrons and how to train them
– relate neural learning to optimization in machine learning
– outline the problem of neural learning
– derive the Gradient Descent training rule for linear models
– describe the problem of non-linear models with neural networks
– outline the method of back-propagation training of a multi-layer network
– understand Convolutional Neural Networks (CNNs)
– understand the main differences between CNNs and regular NNs
– know the basics of training CNNs
Artificial Neural Networks
Artificial Neural Networks are inspired by the human nervous system
NNs are composed of a large number of interconnected processing elements known as neurons
They use supervised error-correcting rules with back-propagation to learn a specific task
http://statsmaths.github.io/stat665/lectures/lec12/lecture12.pdf
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron – a simplified neuron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
Perceptron
Each neuron has multiple dendrites and a single axon. The neuron receives its inputs from its dendrites and transmits its output through its axon. Both inputs and outputs take the form of electrical impulses. The neuron sums up its inputs, and if the total electrical impulse strength exceeds the neuron’s firing threshold, the neuron fires off a new impulse along its single axon. The axon, in turn, distributes the signal along its branching synapses which collectively reach thousands of neighboring neurons.
https://towardsdatascience.com/from-fiction-to-reality-a-beginners-guide-to-artificial-neural-networks-d0411777571b
Perceptron
Output o is thresholded sum of products of inputs and their weights:
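(Reconstructing the standard formula referred to here, with a bias weight $w_0$ and output in $\{-1, +1\}$:)

$$o(x_1, \ldots, x_n) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$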
Perceptron
Or in vector notation:
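(In the usual vector notation, with a constant input $x_0 = 1$ so the bias is folded into $\mathbf{w}$:)

$$o(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x})$$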
Decision Surface of a Perceptron
• Perceptron is able to represent some useful functions which are linearly separable (a)
• But functions that are not linearly separable are not representable (e.g. (b) XOR)
Perceptron Learning
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ←wi+∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
Perceptron Learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
§ Let xi be a misclassified positive example, then we have yi = +1 and w · xi < 0. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and hopefully past xi.
§ This can be achieved by calculating the new weight vector as
w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
§ Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > 0. In this case we calculate the new weight vector as
w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
Perceptron Learning
§ The two cases can be combined in a single update rule: w' = w + η yi xi
§ Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
§ This is the basis of the perceptron training algorithm for linear classification
§ The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
§ If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
Training Perceptron
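The algorithm itself is not reproduced in the text; a minimal Python sketch of the training loop described above (function and variable names are illustrative, labels are assumed to be in {−1, +1}, and a constant 1 is appended to each example for the bias) might look like:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron training: iterate over examples, updating the weights
    whenever an example is misclassified. Assumes y[i] in {-1, +1} and
    that a constant 1 has been appended to each row of X for the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (or on the boundary)
                w = w + eta * yi * xi        # w' = w + eta * y_i * x_i
                converged = False
        if converged:                        # no mistakes in a full pass
            break
    return w

# Example: learn the OR function (linearly separable)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))   # predictions should match y
```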
Perceptron Learning Rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update.
(middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence.
(right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
Perceptron Convergence
Dataset D = {(x1, y1), . . . , (xn, yn)}
At least one example in D is labelled +1, and one is labelled -1.
A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i, yi (w∗ · xi) ≥ γ
γ is typically referred to as the “margin”
Perceptron Convergence
Perceptron Convergence Theorem (Novikoff, 1962)
Let R = maxi ||xi||2
The number of mistakes made by the perceptron is at most (R/γ)^2
Decision Surface of a Perceptron
§ Unfortunately, as linear classifiers, perceptrons are limited in expressive power
§ So some functions are not representable, e.g., those that are not linearly separable
§ For non-linearly separable data we’ll need something else
§ However, with a relatively minor modification many perceptrons can be combined to form one model
§ multilayer perceptrons, the classic “neural network”
Optimisation
Studied in many fields such as engineering, science, economics, . . .
For example, we could minimise a real-valued function f.
A general optimisation algorithm:¹
1) start with initial point x = x0
2) select a search direction p, usually to decrease f(x)
3) select a step length η
4) set s = ηp
5) set x = x + s
6) go to step 2, unless convergence criteria are met
Note: convergence criteria will be problem-specific.
¹ B. Ripley (1996) “Pattern Recognition and Neural Networks”, CUP.
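As an illustration only (not from the slides), the loop above can be written directly; here f is a toy one-dimensional quadratic and the search direction is the negative gradient, i.e., steepest descent:

```python
def minimise(f, grad, x0, eta=0.1, tol=1e-8, max_iter=1000):
    """Generic descent loop: pick a direction p that decreases f,
    take a step of length eta, repeat until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        p = -grad(x)          # search direction: steepest descent
        s = eta * p           # step
        x = x + s
        if abs(s) < tol:      # convergence criterion (problem-specific)
            break
    return x

# Toy example: minimise f(x) = (x - 3)^2, minimum at x = 3
f = lambda x: (x - 3) ** 2
grad = lambda x: 2 * (x - 3)
print(minimise(f, grad, x0=0.0))   # approximately 3.0
```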
Optimisation
Usually, we would like the optimisation algorithm to quickly reach an answer that is close to being the right one.
§ typically, we need to minimise a function
§ e.g., error or loss
§ optimisation is known as gradient descent or steepest descent
§ sometimes, we need to maximise a function
§ e.g., probability or likelihood
§ optimisation is known as gradient ascent or steepest ascent
Gradient Descent
To understand, consider the simple linear unit, where
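(Reconstructing the standard formula: the linear unit is a perceptron without the threshold,)

$$o = w_0 + w_1 x_1 + \cdots + w_n x_n = \mathbf{w} \cdot \mathbf{x}$$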
Let’s learn 𝑤𝑖 that minimise the squared error
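(Again the standard form,)

$$E(\mathbf{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

with $t_d$ the target output and $o_d$ the unit's output for training example $d$.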
Where D is the set of training samples
Gradient Descent
Gradient Descent
Gradient vector gives direction of steepest increase in error E
Negative of the gradient, i.e., steepest decrease, is what we want
Training rule:
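(The standard forms, in vector notation and then componentwise:)

$$\Delta \mathbf{w} = -\eta \nabla E(\mathbf{w}), \qquad \nabla E(\mathbf{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

i.e., componentwise,

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$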
Gradient Descent
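(The derivation shown on this slide is standard; for the linear unit and squared error defined above,)

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)\, \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d) = \sum_{d \in D} (t_d - o_d)(-x_{i,d})$$

so the weight update becomes

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$$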
Gradient Descent
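(A minimal numpy sketch of batch gradient descent for a linear unit, using the update just derived; the toy data and names are illustrative, not from the slides:)

```python
import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.01, epochs=1000):
    """Batch gradient descent for a linear unit o = w . x with squared error.
    X: (n_examples, n_features) with a leading column of 1s for the bias w0.
    t: (n_examples,) target outputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for all training examples
        delta_w = eta * X.T @ (t - o)  # eta * sum_d (t_d - o_d) * x_d
        w = w + delta_w
    return w

# Toy usage: fit t = 1 + 2*x from noisy samples
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
t = 1 + 2 * x + 0.1 * rng.standard_normal(50)
print(gradient_descent_linear_unit(X, t))   # roughly [1.0, 2.0]
```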
Perceptron vs. Linear Unit
Perceptron training rule guaranteed to succeed if
§ Training examples are linearly separable
§ Sufficiently small learning rate η
Linear unit training rule uses gradient descent
§ Guaranteed to converge to hypothesis with minimum squared error
§ Given sufficiently small learning rate η
§ Even when training data contains noise
§ Even when training data not separable by H
Stochastic (Incremental) Gradient Descent
Batch mode Gradient Descent:
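(Standard form of the missing update: compute the gradient over the entire training set D, then update:)

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_D(\mathbf{w})$$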
Stochastic (incremental) mode Gradient Descent:
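(Similarly: compute the gradient and update after each individual training example d:)

$$E_d(\mathbf{w}) = \frac{1}{2} (t_d - o_d)^2, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_d(\mathbf{w})$$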
Stochastic Gradient Descent
§ Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if 𝜂 made small enough
§ Very useful for training large networks, or online learning from data streams
§ Stochastic implies examples should be selected at random
Multilayer Networks
Multilayer Networks
§ Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult
§ A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next (with activation feeding forward)
§ The weights determine the function computed.
§ Given an arbitrary number of hidden units, any Boolean function can be computed with a single hidden layer
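For instance (an illustration, not from the slides), XOR, which a single perceptron cannot represent, needs only two hidden threshold units, one computing OR and one computing NAND, whose conjunction is taken by the output unit:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)   # threshold activation

def xor_net(x1, x2):
    """Hand-chosen weights for a 2-2-1 threshold network computing XOR."""
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)       # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)      # output unit: AND(h1, h2) = XOR(x1, x2)

x1, x2 = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(xor_net(x1, x2))              # [0. 1. 1. 0.]
```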
General Structure of ANN
General Structure of ANN
Properties of Artificial Neural Networks (ANNs):
§ Many neuron-like threshold switching units
§ Many weighted interconnections among units
§ Highly parallel, distributed process
§ Emphasis on tuning weights automatically
Artificial Neural Network (Source: VIASAT)
When to Consider Neural Networks
§ Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
§ Output can be discrete or real-valued
§ Output can be a vector of values
§ Possibly noisy data
§ Form of target function is unknown
§ Interpretability of result is not important
§ Speech recognition
§ Image classification
§ many others . . .
ALVINN drives 70 mph on highways
MLP Speech Recognition
Decision Boundaries
Sigmoid Unit
Sigmoid Unit
§ Same as a perceptron except that the step function has been replaced by a nonlinear sigmoid function.
§ The nonlinearity makes it easier for the model to generalise or adapt to a variety of data and to differentiate between outputs.
Sigmoid Unit
Sigmoid Unit
Why use the sigmoid function σ (x) ?
Nice property:
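(With $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, the property referred to is the sigmoid's simple derivative:)

$$\frac{d\sigma(x)}{dx} = \sigma(x)\,\big(1 - \sigma(x)\big)$$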
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
Error Gradient of Sigmoid Unit
Start by assuming we want to minimise the squared error E(w), as defined earlier, over a set of training examples D.
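(The derivation itself is not reproduced here; applying the chain rule with $o_d = \sigma(\mathbf{w} \cdot \mathbf{x}_d)$ gives the standard result)

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}$$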
Error Gradient of Sigmoid Unit
Backpropagation Algorithm
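(The algorithm on this slide is not reproduced in the text; a minimal numpy sketch of stochastic backpropagation for a network with one hidden layer of sigmoid units and squared error, following the standard update rules, is given below. Names and the toy XOR data are illustrative, not from the slides.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for a 1-hidden-layer sigmoid network.
    X: (n, n_in) inputs, T: (n, n_out) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # hidden weights (+bias)
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))  # output weights (+bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            # forward pass
            x1 = np.append(x, 1.0)                 # add constant bias input
            h = sigmoid(W_h @ x1)
            h1 = np.append(h, 1.0)
            o = sigmoid(W_o @ h1)
            # backward pass: error terms for output and hidden units
            delta_o = o * (1 - o) * (t - o)
            delta_h = h * (1 - h) * (W_o[:, :-1].T @ delta_o)
            # weight updates: w_ji <- w_ji + eta * delta_j * x_ji
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

# Toy usage: train on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = backprop_train(X, T)
```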
More on Backpropagation
A solution for learning highly complex models . . .
§ Gradient descent over entire network weight vector
§ Easily generalised to arbitrary directed graphs
§ Can learn probabilistic models by maximising likelihood
Minimises error over all training examples
§ Training can take thousands of iterations → slow!
§ Using network after training is very fast
More on Backpropagation
Will converge to a local, not necessarily global, error minimum
§ Might exist many such local minima
§ In practice, often works well (can run multiple times)
§ Often include weight momentum α
∆𝑤𝑗𝑖(𝑛) = 𝜂𝛿𝑗𝑥𝑗𝑖 + 𝛼 ∆𝑤𝑗𝑖(𝑛 − 1)
§ Stochastic gradient descent using “mini-batches”
Nature of convergence
§ Initialise weights near zero
§ Therefore, initial networks near-linear
§ Increasingly non-linear functions become possible as training progresses
More on Backpropagation
Models can be very complex
§ Will network generalise well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularise network, making it less likely to overfit
§ Add term to error that increases with magnitude of weight vector
§ Other ways to penalise large weights, e.g., weight decay
§ Using “tied” or shared set of weights, e.g., by setting all weights to their mean after computing the weight updates
§ Many other ways . . .
Expressive Capabilities of ANNs
Boolean functions:
§ Every Boolean function can be represented by a network with single hidden layer
§ but might require exponential (in number of inputs) hidden units
Continuous functions:
§ Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
§ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Being able to approximate any function is one thing, being able to learn it is another . . .
How complex should the model be?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
John von Neumann
“Goodness of fit” in ANNs
Can neural networks overfit/underfit ?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks !
Note also that on the second task the validation set error continues to decrease after an initial increase, so any regularisation (network simplification, or weight reduction) strategy needs to avoid stopping too early (underfitting).
Overfitting in ANNs
Underfitting in ANNs
Neural Networks for Classification
Sigmoid unit computes output o(x) = σ(w · x)
Output ranges from 0 to 1
Example: binary classification
Questions:
§ what error (loss) function should be used?
§ how can we train such a classifier?
Neural Networks for Classification
Minimizing square error (as before) does not work so well for classification
If we take the output o(x) as the probability of the class of x being 1, the preferred
loss function is the cross-entropy
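(The standard cross-entropy loss over the training set D:)

$$E = -\sum_{d \in D} \Big( t_d \log o_d + (1 - t_d) \log (1 - o_d) \Big)$$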
td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, one can use gradient descent and backpropagation algorithm – this will yield the maximum likelihood solution.
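(Not on the slide, but a useful detail: for a single sigmoid unit with cross-entropy loss the $o_d(1 - o_d)$ factor from the sigmoid derivative cancels, leaving the simple gradient)

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, x_{i,d}$$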
Application: Face Pose Recognition
Dataset: 624 images of faces of 20 different people
§ image size 120x128 pixels
§ grey-scale, 0-255 intensity value range
§ different poses
§ different expressions
§ wearing sunglasses or not
Raw images compressed to 30x32 pixels
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes
Application: Face Pose Recognition
left straight right up
Four pose classes: looking left, straight ahead, right or upwards
Use a 1-of-n encoding: more parameters; can give confidence of prediction
Selected single hidden layer with 3 nodes by experimentation
Application: Face Pose Recognition
After 1 epoch
left straight right up
Application: Face Pose Recognition
After 100 epochs
left straight right up
Application: Face Recognition
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight
Weights from each of 30x32 image pixels into each hidden unit are plotted in position of corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
Deep Learning: Convolutional Neural Networks
A Bit of History
§ Earliest studies about the visual mechanisms of animals emphasised the importance of edge detection for solving Computer Vision problems
§ Early processing in cat visual cortex looks like it is performing convolutions that are looking for oriented edges and blobs
§ Certain cells are looking for edges with a particular orientation at a particular spatial location in the visual field
§ This inspired convolutional neural networks but was limited by the lack of computational power
§ Hinton et al. reinvigorated research into deep learning and proposed a greedy layer-wise training technique
§ In 2012, Krizhevsky et al. won the ImageNet challenge and proposed the well-recognised AlexNet Convolutional Neural Network
A Bit of History
• The very first CNN
• Handwritten digit recognition
A Bit of History
§ AlexNet (2012)
§ ImageNet classification
§ Images showing 1000 object categories
Deep Learning
§ Deep learning is a collection of artificial neural network techniques that are widely used at present
§ Predominantly, deep learning techniques rely on large amounts of data and deeper learning architectures
§ Some well-known paradigms:
§ Convolutional Neural Networks (CNNs)
§ Recurrent Neural Networks
§ Auto-encoders
§ Restricted Boltzmann Machines
§ CNNs are very similar to regular Neural Networks
§ Made up of neurons with learnable weights
§ CNN architecture assumes that inputs are images
§ So that we have local features
§ Which allows us to
§ encode certain properties in the architecture that makes the forward pass more efficient and
§ significantly reduces the number of parameters needed for the network
The problem with regular NNs is that they do not scale well with dimensions (i.e. larger images)
§ E.g.: a 32x32 image with 3 channels (RGB) – a neuron in the first hidden layer would have 32x32x3 = 3,072 weights: manageable.
§ E.g.: a 200x200 image with 3 channels – a neuron in the first hidden layer would have 200x200x3 = 120,000 weights, and we need at least several of these neurons, which makes the number of weights explode.
What is different?
§ In contrast, CNNs consider 3-D volumes of neurons and propose a parameter sharing scheme that minimises the number of parameters required by the network.
§ CNN neurons are arranged in 3 dimensions: Width, Height and Depth.
§ Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected)
What is different?
CNN architecture
Main layers:
§ Convolutional
§ Fully-connected
§ Drop-out
§ Output layers
Convolutional Layer
Suppose we want to classify an image as a bird, sunset, dog, cat, etc.
If we can identify features such as feather, eye, or beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.
We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.
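(As an illustration, not from the slides: a single convolution filter sliding over a grey-scale image can be written directly in numpy; the same 3x3 weights are applied at every image location, which is the parameter sharing described above.)

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as in most CNNs):
    slide the same kernel weights over every position of the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Example: a vertical-edge detector applied to a toy 6x6 image
image = np.zeros((6, 6))
image[:, 3:] = 1.0                  # dark left half, bright right half
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
print(conv2d(image, edge_kernel))   # large responses along the vertical edge
```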
Convolutional Layer
Convolutional Layer
Convolution
Convolutional Layer
Convolutional Layer
Original link: https://i.stack.imgur.com/nOLCe.gif
Convolutional Layer
§ The output of the Conv layer can be interpreted as holding neurons arranged in a 3D volume.
§ The Conv layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
§ During the forward pass, each filter is slid (convolved) across the width and height of the input volume, producing a 2-dimensional activation map of that filter.