What is a Deep Neural Network?
A deep network is a neural network with >3 layers
● typically >>3 layers
A shallow network is a neural network with ≤3 layers
● e.g. the types of network described in the preceding 2 weeks
The term deep learning refers to the process of training a deep neural network
(note: the above counts the linear input layer in the number of layers)
Why Build Deep Neural Networks?
Three-layer Neural Networks are able to approximate any continuous non-linear function arbitrarily well.
● Universal function approximators
● Can solve any pattern recognition task, in theory
So what is the point of building deeper networks?
● What is a Deep Neural Network?
● Why Build Deep Neural Networks?
● Difficulties Training Deep Neural Networks
– solutions
● Convolutional Neural Networks (CNNs)
– convolutional layers
– pooling and fully-connected layers
● Issues with CNNs/deep networks
Why Build Deep Neural Networks?
To solve complex tasks (i.e. to perform complex mappings) we need larger networks (i.e. more parameters).
1) could build wide network
● large increase in parameters needed to achieve a certain level of performance on a task
● tends to memorise training inputs, fails to generalise
Why Build Deep Neural Networks?
To solve complex tasks (i.e. to perform complex mappings) we need larger networks (i.e. more parameters).
2) could build deep network
● smaller increase in parameters needed to achieve a certain level of performance on a task
● aims to capture the natural “hierarchy” of the task to improve generalisation
Why Build Deep Neural Networks?
Deep networks aim to provide a hierarchy of representations with increasing level of abstraction.
Natural for dealing with many tasks, e.g.:
Image recognition
● pixel → edge → texton/contour → part → object
Text recognition
● character → word → clause → sentence → story
Speech recognition
● sound → phone → phoneme → word → clause → sentence
Why Build Deep Neural Networks?
The need to build deep neural networks also arises from recurrent networks.
Recurrent networks process temporal information (i.e. where the values of x change over time).
In a simple recurrent network, connections take the output of every hidden neuron and feed it in as additional input at the next iteration.
[figure: recurrent connection for one hidden neuron]
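A minimal sketch (in Python/NumPy) of a simple recurrent layer of the kind described above: the hidden outputs from the previous iteration are fed back in as additional inputs to the next one. The sizes, weight values and function names here are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Sketch of a simple (Elman-style) recurrent layer: the previous hidden outputs
# are fed back in as additional input at the next iteration.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W_in = rng.normal(0, 0.1, (n_hidden, n_in))       # input -> hidden weights
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (recurrent) weights

def step(x_t, h_prev):
    """One iteration: combine the current input with the previous hidden output."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev)

h = np.zeros(n_hidden)                  # initial hidden state
for x_t in rng.normal(size=(5, n_in)):  # a sequence of 5 input vectors
    h = step(x_t, h)
print(h.shape)                          # (4,)
```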
Why Build Deep Neural Networks?
The need to build deep neural networks also arises from recurrent networks.
Simplified representation (each layer shown as one neuron):
[figure: simplified recurrent network with hidden units h1, h2, h3, h4]
What is a Deep Neural Network?
“Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones”. [ ]
Difficulties Training Deep NNs
● What is a Deep Neural Network?
● Why Build Deep Neural Networks?
● Difficulties Training Deep Neural Networks
– solutions
● Convolutional Neural Networks (CNNs)
– convolutional layers
– pooling and fully-connected layers
● Issues with CNNs/deep networks
Why Build Deep Neural Networks?
The need to build deep neural networks also arises from recurrent networks.
Training a recurrent network is achieved by unfolding the network and using backpropagation on the unfolded network
There is one layer in the unfolded network for each time-step, so recurrent networks for processing long sequences are equivalent to very deep networks
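A hedged sketch of the unfolding idea: running a recurrent layer (like the one sketched earlier) over T time-steps is just a T-layer feedforward pass in which every "layer" shares the same weights, so long sequences behave like very deep networks. The sequence length and sizes are illustrative.

```python
import numpy as np

# Unfold a recurrent layer over T time-steps: one "layer" of the unfolded
# network per time-step, with the same weight matrices reused at every step.
rng = np.random.default_rng(1)
n_in, n_hidden, T = 3, 4, 50
W_in = rng.normal(0, 0.1, (n_hidden, n_in))
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))
xs = rng.normal(size=(T, n_in))   # a sequence of length T

h = np.zeros(n_hidden)
for t in range(T):                # equivalent to a T-layer network with shared weights
    h = np.tanh(W_in @ xs[t] + W_rec @ h)
print("depth of unfolded network:", T)
```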
Difficulties Training Deep Neural Networks: Vanishing gradient problem
Consider a deep network with one neuron per layer:
[figure: chain of five neurons y1 → y2 → y3 → y4 → y5 connected by weights w21, w32, w43, w54]
Each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji
Applying backpropagation:
Δw54 = η (t − y5) φ'(w54 y4) y4
Δw43 = η (t − y5) φ'(w54 y4) w54 φ'(w43 y3) y3
Δw32 = η (t − y5) φ'(w54 y4) w54 φ'(w43 y3) w43 φ'(w32 y2) y2
Difficulties Training Deep Neural Networks: Vanishing gradient problem
Each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji
When using a standard approach to weight initialisation, such as choosing weights from a distribution with a mean of zero, this term may also be small.
Difficulties Training Deep Neural Networks: Vanishing gradient problem
If each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji ≪ 1, the weight updates shrink, e.g.:
Δw21 ≈ 0.004, Δw32 ≈ 0.016, Δw43 ≈ 0.063, Δw54 ≈ 0.25
Difficulties Training Deep Neural Networks: Vanishing gradient problem
Each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji
When using a standard activation function, like tanh or sigmoid, the derivative φ' is small for most values of w·x.
Difficulties Training Deep Neural Networks: Exploding gradient problem
Each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji
Note that if weights are initialised to, or learn, large values, then
φ'(wji xi) wji > 1
Difficulties Training Deep Neural Networks: Exploding gradient problem
If each time the error is propagated further backwards it is multiplied by a factor of the form φ'(wji xi) wji > 1, the weight updates grow, e.g.:
Δw21 ≈ 31.25, Δw32 ≈ 6.25, Δw43 ≈ 1.25, Δw54 ≈ 0.25
Difficulties Training Deep Neural Networks: Vanishing gradient problem
Δw21 ≈ 0.004, Δw32 ≈ 0.016, Δw43 ≈ 0.063, Δw54 ≈ 0.25
If we are not careful, the gradient tends to get smaller as we move backwards through the hidden layers. This means:
● neurons in the earlier layers learn much more slowly than neurons in later layers,
● early layers contribute nothing to solving the task (they keep their initial random weights), and hence:
– making the network deeper does not improve performance
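A small sketch reproducing the arithmetic used on these slides: each backward step multiplies the error by a factor of the form φ'(wji xi) wji, so a constant factor of 0.25 per layer (the maximum possible value of the sigmoid derivative) gives the vanishing sequence of updates above, while a constant factor of 5 gives the exploding one. Treating the per-layer factor as a constant is an illustrative assumption.

```python
# Cumulative backward factor through a chain of single-neuron layers, assuming
# each layer contributes the same factor phi'(w*x)*w.
def weight_updates(factor, n_layers=4, last_update=0.25):
    """Update magnitudes from the last hidden layer back to the first."""
    return [last_update * factor**k for k in range(n_layers)]

# factor 0.25 (sigmoid' <= 0.25, modest weights): updates shrink geometrically
print(weight_updates(0.25))  # [0.25, 0.0625, 0.015625, 0.00390625]  (Dw54 ... Dw21)
# factor 5 (large weights in the unsaturated region): updates grow geometrically
print(weight_updates(5.0))   # [0.25, 1.25, 6.25, 31.25]             (Dw54 ... Dw21)
```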
Difficulties Training Deep Neural Networks
Backpropagation is inherently unstable (gradients can easily vanish or explode).
To train deep networks it is necessary to mitigate this issue, by using:
● activation functions with non-vanishing derivatives
● better ways to initialise weights
● adaptive variations on standard backpropagation
● batch normalisation
● skip connections
Another issue is that deep networks have many parameters; it is therefore necessary to have:
● very large labelled datasets
● large computational resources
Overcoming the Difficulties with Training Deep NNs
Difficulties Training Deep Neural Networks: Exploding gradient problem
Δw21 ≈ 31.25, Δw32 ≈ 6.25, Δw43 ≈ 1.25, Δw54 ≈ 0.25
If we are not careful, the gradient tends to get larger as we move backwards through the hidden layers. This means:
● neurons in the earlier layers make large, often random, changes to their weights,
● later layers cannot learn due to the constantly changing outputs of earlier layers, and hence:
– making the network deeper makes performance worse
Activation Functions with Non-Vanishing Derivatives
Rectified Linear Unit (ReLU):
φ(netj) = netj if netj ≥ 0, 0 if netj < 0
Leaky Rectified Linear Unit (LReLU):
φ(netj) = netj if netj ≥ 0, a × netj if netj < 0
where a is fixed and the same for all neurons
Parametric Rectified Linear Unit (PReLU):
φ(netj) = netj if netj ≥ 0, aj × netj if netj < 0
where aj is learnt for each neuron
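Minimal NumPy sketches of the three activations defined above. The leak value a = 0.01 for LReLU and the per-neuron values a_j for PReLU are illustrative choices, not values given on the slides.

```python
import numpy as np

def relu(net):
    return np.where(net >= 0, net, 0.0)

def leaky_relu(net, a=0.01):           # a is fixed and the same for all neurons
    return np.where(net >= 0, net, a * net)

def prelu(net, a_j):                   # a_j is learnt (by backpropagation) per neuron
    return np.where(net >= 0, net, a_j * net)

net = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(net))                                         # [0.  0.  0.  1.5]
print(leaky_relu(net))                                   # [-0.02  -0.005  0.  1.5]
print(prelu(net, a_j=np.array([0.2, 0.1, 0.3, 0.25])))   # [-0.4  -0.05  0.  1.5]
```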
● What is a Deep Neural Network?
● Why Build Deep Neural Networks?
● Difficulties Training Deep Neural Networks
– solutions
● Convolutional Neural Networks (CNNs)
– convolutional layers
– pooling and fully-connected layers
● Issues with CNNs/deep networks
Better Ways to Initialise Weights
For weights connecting m inputs to n outputs, choose weights from one of the following distributions:
Xavier / Glorot (for sigmoid / tanh):
● uniform distribution with range (−√(6/(m+n)), √(6/(m+n)))
● or normal distribution with mean 0 and standard deviation √(2/(m+n))
He (for ReLU / LReLU / PReLU):
● uniform distribution with range (−√(6/m), √(6/m))
● or normal distribution with mean 0 and standard deviation √(2/m)
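A sketch of the uniform variants of the two initialisation schemes listed above, for a weight matrix connecting m inputs to n outputs; the function names and random generator are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(m, n):              # for sigmoid / tanh layers
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

def he_uniform(m, n):                  # for ReLU / LReLU / PReLU layers
    limit = np.sqrt(6.0 / m)
    return rng.uniform(-limit, limit, size=(n, m))

W1 = xavier_uniform(m=256, n=128)
W2 = he_uniform(m=128, n=64)
print(W1.std(), W2.std())              # roughly sqrt(2/(m+n)) and sqrt(2/m)
```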
Adaptive Versions of Backpropagation
Backpropagation struggles to deal with gradients in the cost function J(w) that are too small, or too large.
Recall, backpropagation performs gradient descent to find the parameters that minimise a cost function J(w) (e.g. the number of misclassified samples).
Activation Functions with Non-Vanishing Derivatives
Rectified Linear Unit (ReLU):
φ'(netj) = 1 if netj ≥ 0, 0 if netj < 0
Leaky Rectified Linear Unit (LReLU):
φ'(netj) = 1 if netj ≥ 0, a if netj < 0
Parametric Rectified Linear Unit (PReLU):
φ'(netj) = 1 if netj ≥ 0, aj if netj < 0
Adaptive Versions of Backpropagation
Backpropagation struggles to deal with gradients in the cost function J(w) that are too small, or too large:
● gradient too low: many iterations for little gain in performance
● gradient just right: optimal parameters found with few iterations
● gradient too large: fails to find optimal parameters, oscillates
Adaptive Versions of Backpropagation
Backpropagation struggles to deal with gradients in the cost function J(w) that are too small, or too large.
An analogous issue occurs when trying to select an appropriate learning rate:
● rate too low: many iterations for little gain in performance
● rate just right: optimal parameters found with few iterations
● rate too large: fails to find optimal parameters, oscillates
Adaptive Versions of Backpropagation
Backpropagation struggles to deal with gradients in the cost function J(w) that are too small, or too large.
Variation in the magnitude of the gradient may occur between:
● different layers (due to vanishing and exploding gradients)
● different parts of the cost function J(w) for a single neuron
● different directions for a multi-dimensional cost function
Adaptive Versions of Backpropagation
momentum: adds moving average of previous gradient to current gradient
● Increases the step size when weight changes are consistently in the same direction (helps with plateaus and local minima)
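A sketch of a momentum update along the lines described above: an exponentially decaying sum of past gradients is added to the current gradient, so steps grow when successive gradients point in a consistent direction. The toy cost J(w) = ||w||² and the value β = 0.9 are illustrative assumptions.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, eta=0.1, beta=0.9):
    velocity = beta * velocity + grad   # decaying sum of past gradients plus current gradient
    w = w - eta * velocity              # step uses the accumulated direction
    return w, velocity

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3):
    grad = 2 * w                        # gradient of the toy cost J(w) = ||w||^2
    w, v = sgd_momentum_step(w, grad, v)
print(w)
```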
Adaptive Versions of Backpropagation
adaptive learning rate: vary the learning rate (for individual parameters) during training
● increasing the learning rate if the cost is decreasing
● decreasing the learning rate if the cost is increasing
Adaptive Versions of Backpropagation
Backpropagation struggles to deal with gradients in the cost function J(w) that are too small, or too large.
Contrary to the previous illustrations, the step size is proportional to the gradient, which makes the problem even worse.
Batch Normalisation
Learning in one layer of a network will change the distribution of inputs received by the subsequent layer of the network, e.g.:
● at some time during training
weight of 2nd neuron will have been trained to be appropriate for outputs produced by 1st neuron
distribution of outputs produced by 1st neuron with current weights
Batch Normalisation
Learning in one layer of a network will change the distribution of inputs received by the subsequent layer of the network, e.g.:
● at some later time during training
weight of 2nd neuron will have to be re-trained to be appropriate for new outputs produced by 1st neuron
distribution of outputs produced by 1st neuron with updated weights
Consequently, learning is slow as later layers are always having to compensate for changes made to earlier layers.
Adaptive Versions of Backpropagation
adaptive learning rate: vary the learning rate (for individual parameters) during training
● increasing the learning rate if the cost is decreasing
● decreasing the learning rate if the cost is increasing
Backpropagation algorithms with adaptive learning:
● AdaGrad
● RMSProp
Backpropagation algorithms with adaptive learning and momentum:
● ADAM
● Nadam
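A sketch of a single ADAM step, which combines a per-parameter adaptive learning rate (in the spirit of AdaGrad/RMSProp) with momentum. The hyperparameter values are common defaults, not values given on the slides.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                  # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * grad**2               # moving average of squared gradients
    m_hat = m / (1 - b1**t)                       # bias correction
    v_hat = v / (1 - b2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    grad = 2 * w                                  # gradient of the toy cost J(w) = ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```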
Batch Normalisation
Batch normalisation attempts to solve both these issues by scaling the output of each individual neuron so that it has a mean close to 0 and a standard deviation close to 1.
BN(x) = β + γ (x − E(x)) / √(Var(x) + ε)
● β and γ are parameters learnt by backpropagation
● ε is a constant used to prevent division-by-zero errors
● E(x) is the mean of x
● Var(x) is the variance (the square of the standard deviation) of x
E(x) and Var(x) can be calculated using the values of x in the current batch or using all the training data presented so far
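A sketch of the BN(x) formula above applied to one neuron's outputs over a batch; here β and γ are fixed for illustration, whereas in a real network they would be learnt by backpropagation, and the statistics are taken from the current batch.

```python
import numpy as np

def batch_norm(x, beta=0.0, gamma=1.0, eps=1e-5):
    mean, var = x.mean(), x.var()                # E(x) and Var(x) over the batch
    return beta + gamma * (x - mean) / np.sqrt(var + eps)

x = np.array([0.2, 0.7, 1.3, 1.8, 0.9])          # one neuron's outputs for a batch of 5
y = batch_norm(x)
print(y.mean(), y.std())                         # close to 0 and 1
```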
Batch Normalisation
Batch normalisation can be applied before or after the activation function:
● before: yj = φ(BN(wji xi))
● after: yj = BN(φ(wji xi))
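A small sketch contrasting the two placements listed above, using a batch_norm helper like the one sketched earlier; φ and the example net values are illustrative.

```python
import numpy as np

def batch_norm(x, beta=0.0, gamma=1.0, eps=1e-5):
    return beta + gamma * (x - x.mean()) / np.sqrt(x.var() + eps)

phi = np.tanh                              # activation function
net = np.array([-1.0, 0.5, 2.0, 3.5])      # weighted sums w_ji * x_i over a batch

y_before = phi(batch_norm(net))            # BN applied before the activation
y_after = batch_norm(phi(net))             # BN applied after the activation
print(y_before, y_after)
```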
Batch Normalisation
Different inputs to one neuron can have very different scales, e.g.:
[figure: distributions of two inputs, x1 and x2, with very different scales]
Inputs with smaller scales will tend to have less influence than ones with larger scales (even if the smaller-scale inputs are more discriminatory).
Skip Connections
Connections that skip one or more layers of the network:
[figure: chain of neurons y1 to y5 in which skip connections add the output of earlier layers back in further along the chain]
This arrangement is sometimes called a residual module.
Skip connections let gradients by-pass parts of the network where the gradient has vanished
● Network effectively becomes shallower, but this may be temporary
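A sketch of a residual module in the spirit of the figure: the module's input is added back to its output, so the identity path lets gradients by-pass the weighted layers. The two-layer body and the shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W1 = rng.normal(0, 0.1, (n, n))
W2 = rng.normal(0, 0.1, (n, n))
relu = lambda z: np.maximum(z, 0.0)

def residual_block(x):
    h = relu(W1 @ x)          # first weighted layer inside the module
    h = W2 @ h                # second weighted layer
    return relu(h + x)        # skip connection: add the input back in

x = rng.normal(size=n)
print(residual_block(x).shape)   # (8,)
```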
Batch Normalisation
● Enables saturating activation functions to be used as it limits activations to the range where the gradients are non-zero
● Makes weight initialisation less critical
● Generally stabilises learning (gradients are less likely to vanish or explode)
Convolutional Neural Networks (CNNs)
The most popular type of deep neural network. A CNN is:
● Any neural network in which at least one layer has a transfer function implemented using convolution/cross-correlation (as opposed to vector multiplication).
Motivated by the desire to recognise patterns with tolerance to location.
Convolutional Neural Networks (CNNs)
Motivated by the desire to recognise patterns with tolerance to location, e.g. spatial location:
● What is a Deep Neural Network?
● Why Build Deep Neural Networks?
● Difficulties Training Deep Neural Networks
– solutions
● Convolutional Neural Networks (CNNs)
– convolutional layers
– pooling and fully-connected layers
● Issues with CNNs/deep networks
Convolutional Neural Networks (CNNs)
Tolerance to location could be achieved by learning to recognise the pattern at each location independently.
sub-network recognising dog at position 1
sub-network recognising dog at position 2
Convolutional Neural Networks (CNNs)
Tolerance to location could be achieved by learning to recognise the pattern at each location independently.
Computationally more efficient to share the weights between the sub-networks (i.e. to have multiple copies of the same sub-network processing different locations).
Convolutional Neural Networks (CNNs)
Motivated by the desire to recognise patterns with tolerance to location, e.g. temporal location:
CNNs: Convolution Layers
Weight sharing is achieved by using cross-correlation as the transfer function.
Cross-correlation is implemented as follows.
For each location in the input array in turn:
1. Multiply each weight by the corresponding value in the input
2. Sum these products and write the answer in the corresponding location in the output array
Input array:
-1    1   0.5   1
 0   -2  -1     0
 0.5 -1  -1     1
-1    1   0    -2
Output value for the first (top-left) location:
(−1×−1) + (1×1) + (0×0) + (−2×−2) = 6
CNNs: Convolution Layers
Weight sharing is achieved by using cross-correlation as the transfer function.
Cross-correlation is implemented as follows.
For each location in the input array in turn:
1. Multiply each weight by the corresponding value in the input
2. Sum these products and write the answer in the corresponding location in the output array
Output value for the second location:
(1×−1) + (0.5×1) + (−2×0) + (−1×−2) = 1.5
CNNs: Convolution Layers
Weight sharing is achieved by using cross-correlation as the transfer function.
A neuron's weights are defined as an array, e.g.:
-1  1
 0 -2
called a “mask”, or “filter”, or “kernel”
The values in this array, the weights, will be learnt using backpropagation.
The input is also an array (an image or the output of the preceding layer of the network), e.g.:
-1    1   0.5   1
 0   -2  -1     0
 0.5 -1  -1     1
-1    1   0    -2
(note: can use 1D kernels for a 1D input, e.g. an audio recording, or use 3D kernels for a 3D input, e.g. a video.)
CNNs: Convolution Layers
Weight sharing is achieved by using cross-correlation as the transfer function.
Cross-correlation is implemented as follows.
For each location in the input array in turn:
1. Multiply each weight by the corresponding value in the input
2. Sum these products and write the answer in the corresponding location in the output array
Output array so far (first row): 6, 1.5, 0.5
Output value for the next location:
(0×−1) + (−2×1) + (0.5×0) + (−1×−2) = 0
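A sketch reproducing the worked example above in NumPy, assuming stride 1 and no padding (so the 4×4 input and 2×2 kernel give a 3×3 output). The first computed values 6, 1.5, 0.5 and 0 match the slides; the remaining entries follow from the same rule.

```python
import numpy as np

kernel = np.array([[-1.0,  1.0],
                   [ 0.0, -2.0]])
image = np.array([[-1.0,  1.0,  0.5,  1.0],
                  [ 0.0, -2.0, -1.0,  0.0],
                  [ 0.5, -1.0, -1.0,  1.0],
                  [-1.0,  1.0,  0.0, -2.0]])

kh, kw = kernel.shape
out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
for r in range(out.shape[0]):
    for c in range(out.shape[1]):
        # multiply each weight by the corresponding input value and sum
        out[r, c] = np.sum(kernel * image[r:r + kh, c:c + kw])

print(out)
# [[ 6.   1.5  0.5]
#  [ 0.   3.  -1. ]
#  [-3.5  0.   6. ]]
```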
CNNs: Convolution Layers
Weight sharing is achieved by using cross-correlation as the transfer function.
Cross-correlation is implemented as follows.
For each location in the input array in turn:
1. Multiply each weight by the corresponding value in the input
2. Sum these products and write the answer in the corresponding location in the output array