Machine Learning: Neural Networks
Department of Computer Science, University College London
Lecture Overview
1 Lecture Overview
2 Introduction
3 Evaluation
4 Optimisation
5 Overfitting
By the end of this lecture you should:
1 Be familiar with Neural Networks and be aware of some of their beneficial properties and empirical successes
2 Understand the backpropagation approach to optimisation of neural networks
3 Be aware of the high capacity of neural networks, their tendency to overfit and some practical remedies for this problem
This lecture is based in large part on ’s excellent ‘Introduction to Neural Networks’ notes
Introduction
Neural Networks have exhibited great empirical success and attracted great popular and commercial attention in recent years, for example:
Object Recognition
Neural Networks achieved superhuman performance in 2011; they are now used for tasks which were impossible only a few years ago
Speech Recognition
All major speech recognition systems are now based on neural networks: Alexa, Siri, Cortana…
The Moorfields Eye Hospital / DeepMind partnership built an AI system that matches world-leading experts in diagnosing a range of eye conditions, recommending the correct referral decision for over 50 eye diseases with 94% accuracy ( et al., 2018).
The Biological Neuron
Neural Networks started life as an attempt to model what goes on in the brain
The neuron is the basic unit of processing in the brain
It receives chemical signals from other neurons at junctions: the synapses
It converts these signals into electrical signals, combines them, and if the combination is strong enough, generates an action potential: an electrical signal that travels along the neuron’s axon
The action potential is binary: either on or off
This signal then causes the release of chemical signals at synapses with other neurons…
The Artificial Neuron
An artificial neuron can be viewed as a model of a biological neuron(!)
It takes in inputs, x ∈ Rm, then:
Combines them into a weighted sum (and adds a bias term): z = Σ_j w_j x_j + b
Passes this sum through an activation function, φ
Then outputs a = φ(z)
[Figure: an artificial neuron — inputs x₁, …, x_m are multiplied by weights w₁, …, w_m, summed (Σ), and passed through the activation function to produce the output a]
What form does φ take?
The Heaviside function would look most biological…
But a common (better behaved) approximation is the logistic sigmoid:
φ(z) = 1 / (1 + e^{−z})
(Although there are many other forms…)
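As a concrete sketch, a single sigmoid unit is only a few lines of NumPy (the names `neuron`, `w`, `b` here are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: a smooth, better-behaved stand-in for the Heaviside step
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum plus bias, z = sum_j w_j x_j + b, then a = phi(z)
    z = np.dot(w, x) + b
    return sigmoid(z)

# Example with two inputs: returns an activation in (0, 1)
a = neuron(np.array([0.5, -1.0]), np.array([0.8, 0.3]), 0.1)
```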
Artificial Neural Network
An Artificial Neural Network (ANN or NN) is a combination of many of these units
The units may vary across the network – weights and activation functions might be different
How the units are connected may be complex and is characterised by the architecture of the NN
Two broad categories are:
Feed Forward Neural Networks (FFN)
Where the units are arranged into a graph without any cycles
Recurrent Neural Networks (RNN)
Where the graph can have cycles so that the outputs of a unit can loop back to become one of its own inputs
Architecture
The architecture of a NN can encourage it to ‘perceive’ different features
For example a Convolutional Neural Network (CNN) exploits the local structure of input data
It is particularly effective when applied to datasets such as images, where local structure – edges, sub-images, etc. – is manifest
Similarly a Recurrent Neural Network (RNN) is well placed to perceive time series structure
How many architectural models are there?
But in this lecture we will focus on the Multilayer Perceptron (MLP), a particular form of FFN in which each layer contains some number of identical sigmoid units
The network is fully connected
Every unit in one layer is connected to every unit in the next
The first layer is the input layer
It takes its values from the input attributes, xi
The units in such a layer are known as input units
The last layer is the output layer
It defines values for the output variable, y
A slight nuance is that for multiclassification we have a separate yi for each class i
The units in such a layer are known as output units
All the layers in between are the hidden layers
The units in such a layer are known as hidden units
The number of layers is the depth of the MLP
The number of units in a layer is known as the width of the MLP
[Figure: a fully connected MLP with an input layer, two hidden layers, and an output layer]
In general:
z_i^{[1]} = φ(Σ_j w_{ij}^{[1]} x_j + b_i^{[1]})
z_i^{[2]} = φ(Σ_j w_{ij}^{[2]} z_j^{[1]} + b_i^{[2]})
o_i = φ(Σ_j w_{ij}^{[3]} z_j^{[2]} + b_i^{[3]})
and, for an arbitrary layer h:
z_i^{[h]} = φ(Σ_j w_{ij}^{[h]} z_j^{[h−1]} + b_i^{[h]})
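These layer equations translate directly into a forward-pass sketch (a minimal illustration, assuming one weight matrix and bias vector per layer and a sigmoid φ throughout):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # z[h] = phi(W[h] @ z[h-1] + b[h]), with z[0] = x;
    # the final layer's activations are the outputs o
    z = x
    for W, b in zip(weights, biases):
        z = sigmoid(W @ z + b)
    return z

# Example shapes (illustrative): 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
o = forward(np.array([1.0, 0.5, -0.5]), weights, biases)
```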
Feature Learning
Why do we select this representation? For a network with one hidden layer:
o_i = φ(Σ_j w_{ij}^{[2]} φ(Σ_k w_{jk}^{[1]} x_k + b_j^{[1]}) + b_i^{[2]})
(But in general o_i is more complex)
So o_i : (x, w) → ℝ
Here w is the set of all weights and bias parameters
So a NN is a nonlinear mapping from a set of input variables, x, to a set of output variables, {oi }, with the functional forms controlled by the weights, w, which we are to learn
The number of weights (and basis functions) is chosen in advance via our architecture
c.f. Gaussian Processes, or kernel methods more generally
But these basis functions are adaptive (via the weights)
It turns out that such a structure can be very expressive
Layers of the NN are said to learn features implicitly
And deeper layers learn more and more complex features…without any guidance!
Feature Learning: Example
‘Visualizing and Understanding Convolutional Networks’, Zeiler & Fergus [2013]
Universality
Why should MLPs be so expressive?
Non-linearity of activation functions
Linear activation functions can learn little
Even a deep set of layers of linear units will just result in a linear unit
But we can prove that a shallow MLP with just one hidden layer and non-linear activation functions can approximate any continuous non-linear function of the input to arbitrary accuracy, given enough hidden units
We say that an MLP is a universal function approximator
They are very versatile…
…But not necessarily compact
Sometimes exponentially large NNs are required to achieve a good approximation
This is one reason that deep learning is interesting:
One can often approximate the same function with a narrow, deep network
Much more compactly than with a wide, shallow one
The XOR Problem
Let’s try to get some intuition for why the MLP acts as a Universal Function Approximator
Consider a Boolean function:
any such function can be represented as a disjunction of conjunctions…
…and a disjunction of conjunctions can be implemented by an MLP with 1 hidden layer
Each conjunction is implemented by a hidden unit
The disjunction is implemented by the output unit
A perceptron can implement both AND and OR functions:
For example, with x = [1, x₁, x₂]^T and w = [−1.5, 1, 1]^T we have the ingredients of an AND function:
[Figure: a perceptron with inputs x₀ = +1, x₁, x₂ and weights −1.5, +1, +1]
Which yields the following truth table:
x₁  x₂  w·x   o
0   0   −1.5  0
0   1   −0.5  0
1   0   −0.5  0
1   1    0.5  1
…But a perceptron can’t implement an XOR function…
However two perceptrons in parallel can implement the two ANDs and another perceptron afterwards can OR these results together:
[Figure: a two-layer network for XOR — two hidden units z₁, z₂ (weights ±1, biases −0.5) feeding an output unit that ORs them (weights +1, +1, bias −0.5)]
The first layer maps from x to z…
…And in this space the data is linearly separable – because (x₁ = 0, x₂ = 0) and (x₁ = 1, x₂ = 1) are both mapped to (z₁ = 0, z₂ = 0)
Which yields the following truth table:
x₁  x₂  z₁  z₂  o
0   0   0   0   0
0   1   0   1   1
1   0   1   0   1
1   1   0   0   0
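As a quick check, here is a minimal sketch wiring up this XOR network with hard-threshold units (the weight values follow the ±1 / −0.5 pattern in the diagram above; the helper names are illustrative):

```python
import numpy as np

def step(z):
    # Heaviside threshold activation
    return (z > 0).astype(float)

# Hidden layer: detects (x1 AND NOT x2) and (NOT x1 AND x2)
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([-0.5, -0.5])
# Output layer: ORs the two hidden units together
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-0.5])

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2], dtype=float)
        z = step(W1 @ x + b1)
        o = step(W2 @ z + b2)
        print(x1, x2, int(o[0]))  # reproduces the XOR truth table
```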
Boolean Functions and the MLP
In general, for Boolean functions with 2 inputs:
For each input combination where the output is 1, we define a hidden unit that checks for that particular conjunction of the inputs
The next layer then performs a disjunction between these
But what if there are d inputs?
Up to 2^d hidden units may be necessary
We can still approximate the function with a two-layer MLP… But we end up with a cumbersome lookup table
This architecture is not compact and not good at generalisation
Continuous Functions and the MLP
How can we approximate continuous functions? We’ll focus on the 2-hidden-layer case:
For each input case / region, we can delimit that region by hyperplanes on all sides using hidden weights in the first hidden layer
A hidden unit in the next layer then ANDs them together to bound the region
Then the weight of the connection from that hidden unit to the output unit is set equal to the desired function value
The result is a piecewise constant approximation
Its accuracy may be increased by increasing the number of hidden units which is equivalent to placing a finer grid over the inputs
Evaluation
So how do we learn all the weights in the MLP architecture and attempt to learn any function?
First we need to define an evaluation function
NNs are very flexible, and can accommodate a variety of evaluation functions – as usual we need to pick one for the task at hand
Let’s take a look at three canonical cases, for a network with H hidden layers:
Regression
Here we have one output, o^{(i)}
And we set this output to be a linear function of units in the last hidden layer:
o^{(i)} = Σ_j w_j^{[H+1]} z_j^{[H](i)} + b^{[H+1]}
Then set the evaluation function to be the squared loss:
L = Σ_i (o^{(i)} − y^{(i)})²
Binary Classification
Here we have one output, o^{(i)}
And we set this output to be a logistic sigmoid function of units in the last hidden layer:
o^{(i)} = 1 / (1 + exp(−(Σ_j w_j^{[H+1]} z_j^{[H](i)} + b^{[H+1]})))
Then set the evaluation function to be the cross entropy:
L = −Σ_i [y^{(i)} ln(o^{(i)}) + (1 − y^{(i)}) ln(1 − o^{(i)})]
Multi-Classification
Here we have K ‘initial’ outputs, {õ_k^{(i)}}_{k=1}^K, from the last hidden layer:
õ_k^{(i)} = Σ_j w_{kj}^{[H+1]} z_j^{[H](i)} + b_k^{[H+1]}
And we push these through a softmax function to generate the final output units:
o_k^{(i)} = exp(õ_k^{(i)}) / Σ_{j=1}^K exp(õ_j^{(i)})
Then set the evaluation function to be the cross entropy:
L = −Σ_i Σ_k y_k^{(i)} ln(o_k^{(i)})
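A minimal sketch of these three evaluation functions in NumPy (vectorised over a training set; the function names are illustrative):

```python
import numpy as np

def squared_loss(o, y):
    # Regression: sum of squared errors
    return np.sum((o - y) ** 2)

def binary_cross_entropy(o, y):
    # Binary classification: o are sigmoid outputs in (0, 1)
    return -np.sum(y * np.log(o) + (1 - y) * np.log(1 - o))

def softmax(a):
    # Shifting by the max does not change the result but avoids overflow
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def multiclass_cross_entropy(o, y):
    # Multi-classification: y one-hot encoded, o softmax outputs
    return -np.sum(y * np.log(o))
```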
Forward Propagation
In order to calculate each of these evaluation functions we clearly need to evaluate o^{(i)} for each of our training examples
o^{(i)} is calculated as a forward pass through the network:
x^{(i)} is passed through the first hidden layer, where {z^{[1](i)}} is calculated
The set {z^{[1](i)}} is passed through the second hidden layer, where {z^{[2](i)}} is calculated
And so on until the output layer
This process of evaluation is often termed the forward propagation of information through the network
Optimisation
In order to set values on the weights we need to optimise our loss function L
In common with, for example, logistic regression, in general we cannot optimise L for a NN analytically
So let’s apply a numerical technique, gradient descent:
Recall that for a general representation of the weights, w, gradient descent looks like:
w^{(t+1)} ← w^{(t)} − α ∇_w L|_{w=w^{(t)}}
Here α is the learning rate
And w is just a shorthand for all the weights and biases in the network
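In code, a single update is one line (a sketch; `grad_L` is an assumed function returning ∇_w L at the current w):

```python
import numpy as np

def gradient_descent_step(w, grad_L, alpha):
    # w(t+1) <- w(t) - alpha * grad_w L, evaluated at w(t)
    return w - alpha * grad_L(w)

# Example on L(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v, alpha=0.1)
# w is now close to the minimiser [0, 0]
```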
So what we really need to calculate is:
∂L/∂w_{ij}^{[h]} and ∂L/∂b_i^{[h]} ∀ i, j, h
This amounts to a large number of derivatives to calculate – how should we proceed?
Finite Difference is one approach…But it’s expensive and would require many passes through the network to approach optimality
Is there a more efficient approach?
We could try to apply the Chain Rule. For f(u) where u = u(v):
∂f/∂v = (∂f/∂u)(∂u/∂v)
This is the foundation for a key optimisation technique for NNs: the backpropagation algorithm
Since L = Σ_{i=1}^n L^{(i)}, where L^{(i)} is the loss associated with a single example…
…Then let’s begin by calculating the derivative of L^{(i)}…
…We can then sum these later to calculate the aggregate derivative
So, for a network with H hidden layers, let’s start by calculating the derivative of L^{(i)} with respect to the weights in some layer h
Let us recall that, for a hidden layer, the activations are given by:
z_i^{[h]} = φ(a_i^{[h]})   (1)
a_i^{[h]} = Σ_j w_{ij}^{[h]} z_j^{[h−1]} + b_i^{[h]}   (2)
Then, using the chain rule:
∂L/∂w_{ij}^{[h]} = (∂L/∂a_i^{[h]}) (∂a_i^{[h]}/∂w_{ij}^{[h]}) = δ_i^{[h]} ∂a_i^{[h]}/∂w_{ij}^{[h]}
where the δ_i^{[h]} are known as the errors:
δ_i^{[h]} = ∂L/∂a_i^{[h]}   (3)
(Henceforth we drop the superscript (i) to avoid cluttered notation. Furthermore, we concentrate on taking derivatives w.r.t. w_{ij}^{[h]}; a similar analysis obtains for b_i^{[h]}.)
Now, from expression (2), ∂a_i^{[h]}/∂w_{ij}^{[h]} = z_j^{[h−1]}, so:
∂L/∂w_{ij}^{[h]} = δ_i^{[h]} z_j^{[h−1]}   (4)
In other words the derivative we seek is equivalent to:
(value of δ for the unit at the output end of the weight) ×
(value of z for the unit at the input end of the weight)
Thus we need only calculate δ_i^{[h]} for each hidden and output unit in the network
Backprop: Calculating δ_i^{[h]}
For the output units for the loss functions in which we are interested (for example the squared loss or the cross entropy), we have:
δ_i^{[H+1]} = o_i − y_i   (5)
Why?… Recall the output definition for each of the loss functions:
Regression: o = a^{[H+1]}
Binary Classification: o = 1 / (1 + e^{−a^{[H+1]}})
Multi-Classification: o_k = exp(a_k^{[H+1]}) / Σ_{j=1}^K exp(a_j^{[H+1]})
For the hidden units we use the chain rule:
δ_i^{[h]} = ∂L/∂a_i^{[h]} = Σ_k (∂L/∂a_k^{[h+1]}) (∂a_k^{[h+1]}/∂a_i^{[h]}) = Σ_k δ_k^{[h+1]} ∂a_k^{[h+1]}/∂a_i^{[h]}   (6)
where k runs over all the units to which unit i sends connections
Now we need to find an expression for ∂a_k^{[h+1]}/∂a_i^{[h]}
Using expressions (1) and (2) and the chain rule:
∂a_k^{[h+1]}/∂a_i^{[h]} = (∂a_k^{[h+1]}/∂z_i^{[h]}) (∂z_i^{[h]}/∂a_i^{[h]}) = w_{ki}^{[h+1]} φ′(a_i^{[h]})
Substituting this into expression (6) gives:
δ_i^{[h]} = φ′(a_i^{[h]}) Σ_k w_{ki}^{[h+1]} δ_k^{[h+1]}   (7)
In other words, the value of δ for a particular hidden unit can be obtained by propagating the δ’s backwards from higher up the network:
[Figure: unit i in layer h, with errors δ_k^{[h+1]} flowing backwards through the weights w_{ki}^{[h+1]}]
If we recursively apply expression (7) we can generate δ’s for the entire NN
Algorithm 1: Backprop
1: initialise: {w_{ij}^{[h]}, b_i^{[h]}} ∀ i, j, h
2: apply: input vector x^{(i)} to the network and forward propagate to generate o^{(i)} and z^{[h](i)} for every layer h, using:
a_i^{[h](i)} = Σ_j w_{ij}^{[h]} z_j^{[h−1](i)} + b_i^{[h]},   z_i^{[h](i)} = φ(a_i^{[h](i)})
3: evaluate: δ_i^{[H+1](i)} for all output units, using:
δ_i^{[H+1](i)} = o_i^{(i)} − y_i^{(i)}
4: backpropagate: the δ’s to obtain δ_i^{[h](i)} for each hidden unit in the network, using:
δ_i^{[h](i)} = φ′(a_i^{[h](i)}) Σ_k w_{ki}^{[h+1]} δ_k^{[h+1](i)}
5: evaluate: all derivatives, using:
∂L^{(i)}/∂w_{ij}^{[h]} = δ_i^{[h](i)} z_j^{[h−1](i)}
6: return: ∂L/∂w_{ij}^{[h]} = Σ_{i=1}^n ∂L^{(i)}/∂w_{ij}^{[h]}
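A minimal NumPy sketch of Algorithm 1 for a fully sigmoid network with a cross-entropy loss, so that δ^{[H+1]} = o − y as in expression (5) (names and layer layout are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    # Step 2: forward propagate, caching activations z[h] (z[0] = x)
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    # Step 3: output errors, delta = o - y (expression (5))
    delta = acts[-1] - y
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for h in reversed(range(len(weights))):
        # Step 5 / expression (4): dL/dw_ij = delta_i * z_j at the layer below
        grads_W[h] = np.outer(delta, acts[h])
        grads_b[h] = delta
        if h > 0:
            # Step 4 / expression (7): propagate errors backwards;
            # for the sigmoid, phi'(a) = z * (1 - z)
            z = acts[h]
            delta = (weights[h].T @ delta) * z * (1 - z)
    return grads_W, grads_b
```

Summing these per-example gradients over the training set gives the aggregate derivative of step 6.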
Backprop: Flexibility
Note that Backprop is a generic algorithm that works regardless of the activation function (subject to smoothness) or loss function (subject to structure)
It can accommodate a variety of activation functions within the same network…
…Things just get more fiddly!
Why Backprop?
Efficiency:
1 forward pass takes O(|w|)
1 backward pass takes O(|w|)
Compare with finite difference:
Each forward pass takes O(|w|), and we need to perform |w| of these – one for each weight we perturb
So the total cost is O(|w|²)
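For comparison, a finite-difference sketch makes the O(|w|²) cost explicit — it is still handy as a slow numerical check of backprop’s gradients (names illustrative):

```python
import numpy as np

def finite_difference_grad(loss, w, eps=1e-6):
    # One pair of forward passes per weight: |w| passes, each O(|w|),
    # giving O(|w|^2) total work
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return grad
```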
Batch or Stochastic Gradient Descent?
Training NNs can be expensive, even using Backprop
Added to this, the capacity of NN’s necessitates training on large datasets
For these reasons we often prefer to use SGD (recall our comparisons of these methods in the Linear Regression Lecture)…
…Here we apply our updates on an example-by-example basis, rather than across the batch
This makes for a faster, though locally less stable, convergence
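A sketch of the example-by-example update loop (assuming a `grad_single` function that returns the gradient of the loss on one example):

```python
import numpy as np

def sgd(w, data, grad_single, alpha, epochs):
    # data is a list of (x, y) pairs; update after every example,
    # visiting them in a fresh random order each epoch
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x, y = data[i]
            w = w - alpha * grad_single(w, x, y)
    return w
```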
Recall the problems associated with numerical optimisation from the Linear Regression Lecture
We have all of these in NN training…and more…
Much of the early work in this iteration of NNs’ popularity has surrounded identifying, and developing workarounds for, these problems
Let’s examine some of them:
Non-Convexity
This is the elephant in the room: because NNs sum activation functions (whether convex or not) with both positive and negative weights, the resulting loss functions are not convex
So Gradient Descent has no guarantee of converging to the global optimum
The function which we are attempting to optimise is, in general, riddled with multiple local minima and saddle points
How can we resolve this?
We can’t – without enumerating all minima we can’t know that we’re at a globally optimal point rather than just a local one
One workaround is to re-run Gradient Descent several times from different random initialisations of w and select the best optimum from amongst them… But it’s just a heuristic
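A sketch of that restart heuristic (assuming a `train` function that runs gradient descent from a seeded random initialisation and returns the final weights and loss):

```python
def best_of_restarts(train, n_restarts):
    # Re-run training from several random initialisations,
    # keeping the best local optimum found
    best_w, best_loss = None, float("inf")
    for seed in range(n_restarts):
        w, loss = train(seed)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```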
Symmetries
How do we initialise the weights when performing Gradient Descent?
What if we initialise w = 0?
Then all hidden activations will be the same…
…And all weights feeding into a hidden unit will have identical derivatives
So the weights will have identical values in the next step
The complexity reduces to that of one neuron!
We can resolve this problem by symmetry breaking via random initialisation
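A common sketch of such an initialisation (the 0.01 scale is an illustrative choice, not prescribed by the lecture):

```python
import numpy as np

def init_params(layer_sizes, scale=0.01, seed=0):
    # Small random Gaussian weights break the symmetry between units;
    # zero biases are fine because the weights already differ
    rng = np.random.default_rng(seed)
    weights = [scale * rng.standard_normal((n_out, n_in))
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
    return weights, biases
```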
Learning Rate Size
Recall from the Linear Regression Lecture:
For α too small, convergence is slow
For α too large, we get divergence
The solution is to treat α as a hyperparameter and then tune it
(Géron, ‘Hands-On Machine Learning with Scikit-Learn & TensorFlow’ [2017])
Fluctuations
A feature of stochastic gradient descent is that the cost function won’t necessarily monotonically decrease
The stochastic gradients are noisy and sometimes push us in the wrong direction
This is more manifest as we approach an optimal point – further away we generally take big enough steps to swamp this fluctuation noise
A solution is learning rate decay: we set α = α(t) = α(0) e^{−t/τ}
τ is the decay scale
This gives an α that shrinks as training proceeds, so the steps become small enough near the optimum to damp the fluctuations
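The schedule itself is a one-liner (a sketch of the decay rule above):

```python
import numpy as np

def learning_rate(t, alpha0, tau):
    # alpha(t) = alpha(0) * exp(-t / tau): steps shrink over time,
    # damping the SGD fluctuations near the optimum
    return alpha0 * np.exp(-t / tau)
```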
Dead Units
If we have moved to a region where the activation of a unit changes little, i.e. φ′(a_i^{[h]}) ≈ 0, then by expression (7):
δ_i^{[h]} ≈ 0 ⟹ ∂L/∂w_{ij}^{[h]} ≈ 0 ∀ j
This means that we have become stuck on a kind of plateau on the loss surface which it will take a long time to escape from
Learning will slow down
If we look at a h