[06-30213][06-30241][06-25024]
Computer Vision and Imaging &
Robot Vision
Dr Hyung Jin Chang Dr Yixing Gao
h.j.chang@bham.ac.uk y.gao.8@bham.ac.uk
School of Computer Science
DEEP LEARNING II
Why (convolutional) neural networks?
State of the art performance on many problems
Most (all?) papers in recent vision conferences use deep neural networks
Razavian et al., CVPR 2014 Workshops
Neural network definition
• Nonlinear classifier
• Can approximate any continuous function to arbitrary
accuracy given sufficiently many hidden units
Figure from Christopher Bishop
Neural network definition
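• Activations: each hidden unit computes a weighted sum of its inputs plus a bias
• Nonlinear activation function h (e.g. sigmoid, ReLU), applied to each activation
Figure from Christopher Bishop
Neural network definition
• Layer 2
• Layer 3 (final)
• Outputs (e.g. sigmoid/softmax): sigmoid (binary), softmax (multiclass)
• Putting everything together: the network output is the composition of these layer functions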
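The equations on these slides did not survive extraction. As a hedged sketch only, here is the standard form for a network with one hidden layer, written in the notation of Bishop's textbook figure. The symbols $D$ (number of inputs), $M$ (number of hidden units) and the layer superscripts are assumptions, and the slides' "Layer 2 / Layer 3" may index the layers differently.

```latex
\begin{align}
a_j &= \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, & z_j &= h(a_j) \\
a_k &= \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \\
y_k &= \sigma(a_k) = \frac{1}{1 + e^{-a_k}} & &\text{(binary)} \\
y_k &= \frac{\exp(a_k)}{\sum_{l} \exp(a_l)} & &\text{(multiclass)}
\end{align}
```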
Nonlinear activation functions
• Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
Andrej Karpathy
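The activation functions listed on this slide can be sketched as scalar Python functions. This is a minimal illustration, not a library implementation: the ELU scale `alpha=1.0` is an assumed default, the leaky slope 0.1 is taken from the slide, and Maxout is omitted because it needs its own weights.

```python
import math

def sigmoid(x: float) -> float:
    """Squash x into (0, 1). Not hardened against overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Squash x into (-1, 1)."""
    return math.tanh(x)

def relu(x: float) -> float:
    """max(0, x): pass positive values through, zero out negative ones."""
    return max(0.0, x)

def leaky_relu(x: float, slope: float = 0.1) -> float:
    """max(slope * x, x): like ReLU, but negative inputs keep a small slope."""
    return max(slope * x, x)

def elu(x: float, alpha: float = 1.0) -> float:
    """x for positive inputs, alpha * (exp(x) - 1) otherwise (alpha is an assumed default)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```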
Multilayer networks
• Cascade neurons together
• Output from one layer is the input to the next
• Each layer has its own sets of weights
HKUST
Feed-forward networks
• Predictions are fed forward through the network to classify
HKUST
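A feed-forward pass can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the slides' network: the layer sizes, the random weights and the choice of ReLU on the hidden layers are all hypothetical.

```python
import numpy as np

def relu(a):
    """Element-wise max(0, a)."""
    return np.maximum(0.0, a)

def forward(x, layers):
    """Feed x through a list of (W, b) pairs.

    The output of one layer is the input to the next. Hidden layers use ReLU;
    the final layer returns raw scores.
    """
    z = x
    for i, (W, b) in enumerate(layers):
        a = W @ z + b
        z = relu(a) if i < len(layers) - 1 else a
    return z

# Hypothetical network: 4 inputs -> 5 hidden units -> 3 output scores.
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
layers = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
scores = forward(rng.standard_normal(4), layers)
```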
Deep neural networks
• Lots of hidden layers
• Depth = power (usually)
Figure from http://neuralnetworksanddeeplearning.com/chap5.html
Weights to learn!
How do we train them?
• The goal is to iteratively find a set of weights that allow the activations/outputs to match the desired output
• For this, we will minimize a loss function
• The loss function quantifies the agreement
between the predicted scores and GT labels
• First, let’s simplify and assume we have a single layer of weights in the network
Classification goal
Example dataset: CIFAR-10
10 labels
50,000 training images, each image is 32x32x3
10,000 test images
Andrej Karpathy
Classification scores
f(x, W), computed from the image x and the parameters W (plus a bias b)
Input: image [32x32x3], an array of numbers 0…1 (3072 numbers total)
Output: 10 numbers, indicating class scores
Andrej Karpathy
Linear classifier
f(x, W) = Wx + b
• x: 3072×1, the image [32x32x3] as an array of numbers 0…1
• W: 10×3072, the parameters, or “weights”
• b: 10×1
• f(x, W): 10×1, i.e. 10 numbers, indicating class scores
Andrej Karpathy
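The shapes of the linear classifier can be checked with a short NumPy sketch. The image and the weights below are random stand-ins (assumptions), not CIFAR-10 data or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(32 * 32 * 3)                   # stand-in image, flattened to 3072 numbers in [0, 1)
W = 0.01 * rng.standard_normal((10, 3072))    # one row of weights per class (random, untrained)
b = np.zeros(10)                              # one bias per class

def linear_scores(x, W, b):
    """f(x, W) = Wx + b: one score per class."""
    return W @ x + b

scores = linear_scores(x, W, b)               # 10 numbers, indicating class scores
predicted_class = int(np.argmax(scores))      # index of the highest-scoring class
```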
Linear classifier
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)
Andrej Karpathy
Linear classifier
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
[Figure: table of 10 class scores for each of three example images]
Andrej Karpathy
Linear classifier
Suppose: 3 training examples, 3 classes. With some W the scores are:

         cat image   car image   frog image
cat         3.2         1.3         2.2
car         5.1         4.9         2.5
frog       -1.7         2.0        -3.1

Adapted from Andrej Karpathy
Linear classifier: Hinge loss
Hinge loss:
Given an example $(x_i, y_i)$, where $x_i$ is the image and $y_i$ is the (integer) label, and using the shorthand $s = f(x_i, W)$ for the scores vector, the loss has the form:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
Want: $s_{y_i} \geq s_j + 1$, i.e. $s_j - s_{y_i} + 1 \leq 0$
If true, loss is 0
If false, loss is magnitude of violation
Adapted from Andrej Karpathy
Linear classifier: Hinge loss
First example (correct class: cat; scores: cat 3.2, car 5.1, frog -1.7):
$L_1$ = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
= max(0, 2.9) + max(0, -3.9)
= 2.9 + 0
= 2.9
Adapted from Andrej Karpathy
Linear classifier: Hinge loss
Second example (correct class: car; scores: cat 1.3, car 4.9, frog 2.0):
$L_2$ = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
= max(0, -2.6) + max(0, -1.9)
= 0 + 0
= 0
Adapted from Andrej Karpathy
Linear classifier: Hinge loss
Third example (correct class: frog; scores: cat 2.2, car 2.5, frog -3.1):
$L_3$ = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
= max(0, 5.3 + 1) + max(0, 5.6 + 1)
= 6.3 + 6.6
= 12.9
Adapted from Andrej Karpathy
Linear classifier: Hinge loss
The full training loss is the mean over all examples in the training data:
$L = \frac{1}{N} \sum_{i=1}^{N} L_i$
L = (2.9 + 0 + 12.9)/3 = 15.8/3 ≈ 5.27
Loss per example: cat image 2.9, car image 0, frog image 12.9
Adapted from Andrej Karpathy
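The worked hinge-loss numbers can be reproduced with a short sketch. The function name `hinge_loss` and the 0/1/2 class indexing (cat, car, frog) are choices made here for illustration; the margin of 1 is the one used on the slides.

```python
def hinge_loss(scores, correct, margin=1.0):
    """Multiclass hinge loss for one example: sum over the wrong classes of
    max(0, s_j - s_correct + margin)."""
    return sum(max(0.0, s - scores[correct] + margin)
               for j, s in enumerate(scores) if j != correct)

# (scores for [cat, car, frog], index of the correct class)
examples = [
    ([3.2, 5.1, -1.7], 0),   # cat image
    ([1.3, 4.9, 2.0], 1),    # car image
    ([2.2, 2.5, -3.1], 2),   # frog image
]
losses = [hinge_loss(s, y) for s, y in examples]
mean_loss = sum(losses) / len(losses)   # full training loss: mean over the examples
```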
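Linear classifier: Hinge loss
Weight Regularization
$L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)$
λ = regularization strength (hyperparameter)
In common use:
• L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
• L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
• Dropout (will see later)
Adapted from Andrej Karpathy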
Another loss: Softmax (cross-entropy)
scores = unnormalized log probabilities of the classes:
$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$, where $s = f(x_i, W)$
Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:
$L_i = -\log P(Y = y_i \mid X = x_i)$
Example scores: cat 3.2, car 5.1, frog -1.7
Andrej Karpathy
Another loss: Softmax (cross-entropy)

class   unnormalized log probabilities   exp: unnormalized probabilities   normalize: probabilities
cat          3.2                              24.5                              0.13
car          5.1                             164.0                              0.87
frog        -1.7                               0.18                             0.00

For the correct class (cat): $L_i = -\ln(0.13) \approx 2.04$ (with a base-10 logarithm the value would be 0.89)
Adapted from Andrej Karpathy
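The exp, normalize and negative-log steps can be sketched as below. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not on the slide; it leaves the probabilities unchanged. The natural logarithm is used, matching the exponential.

```python
import math

def softmax(scores):
    """Turn unnormalized log probabilities into probabilities that sum to 1."""
    shift = max(scores)                           # stability trick (an addition, not from the slide)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, correct):
    """Negative log likelihood (natural log) of the correct class."""
    return -math.log(softmax(scores)[correct])

scores = [3.2, 5.1, -1.7]            # cat, car, frog
probs = softmax(scores)              # roughly 0.13, 0.87, 0.00
loss = cross_entropy(scores, 0)      # correct class: cat
```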
How to minimize the loss function?
In 1-dimension, the derivative of a function:
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
In multiple dimensions, the gradient is the vector of partial derivatives.
Andrej Karpathy
Evaluating the gradient numerically, one dimension at a time (h = 0.0001):
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, ...], loss 1.25322, so the first entry of dW is (1.25322 - 1.25347)/0.0001 = -2.5
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, ...], loss 1.25353, so the second entry of dW is (1.25353 - 1.25347)/0.0001 = 0.6
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, ...], loss 1.25347, so the third entry of dW is (1.25347 - 1.25347)/0.0001 = 0
gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...]
Andrej Karpathy
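The one-dimension-at-a-time procedure is a finite-difference approximation of the gradient. A minimal sketch follows; `toy_loss` is a hypothetical stand-in for the real loss, chosen because its exact gradient (2w) is known.

```python
def numerical_gradient(f, w, h=1e-4):
    """Approximate each partial derivative as (f(w + h in one dim) - f(w)) / h."""
    base = f(w)
    grad = []
    for i in range(len(w)):
        bumped = list(w)       # copy, so only dimension i is changed
        bumped[i] += h
        grad.append((f(bumped) - base) / h)
    return grad

def toy_loss(w):
    """Hypothetical loss: sum of squares, with exact gradient 2 * w."""
    return sum(wi ** 2 for wi in w)

w = [0.34, -1.11, 0.78]
grad = numerical_gradient(toy_loss, w)   # close to [0.68, -2.22, 1.56]
```

This needs one extra loss evaluation per weight, which is why the slides go on to call the approach silly for networks with many weights.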
This is silly. The loss is just a function of W, so what we want is its gradient $\nabla_W L$.
Use calculus to derive the gradient analytically instead of approximating it one dimension at a time.
Andrej Karpathy
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347
dW = ... (some function of the data and W)
gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]
Andrej Karpathy
Loss gradients
• Denoted as (different notations): $\frac{\partial L}{\partial W}$, $\nabla_W L$
• i.e. how the loss changes as a function of the weights
• We want to change the weights in such a way that makes the loss decrease as fast as possible
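Gradient descent
• We’ll update weights iteratively
• Move in direction opposite to gradient:
$W^{(t+1)} = W^{(t)} - \eta \, \frac{\partial L}{\partial W^{(t)}}$, where $t$ is the time (iteration) and $\eta$ is the learning rate
[Figure: loss function landscape over the weights W_1 and W_2, showing the original W and the negative gradient direction]
Figure from Andrej Karpathy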
Gradient descent
• Iteratively subtract the gradient with respect to the model parameters (w)
• i.e. we’re moving in a direction opposite to the gradient of the loss
• i.e. we’re moving towards smaller loss
• The procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent
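The evaluate-then-update loop can be sketched as follows. The quadratic toy loss, the learning rate of 0.1 and the 100 steps are assumptions chosen so that the example converges; they are not values from the slides.

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, steps=100):
    """Repeatedly evaluate the gradient and step in the opposite direction."""
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

def toy_gradient(w):
    """Gradient of the hypothetical loss sum((w_i - 3)^2), which is minimal at w_i = 3."""
    return [2.0 * (wi - 3.0) for wi in w]

w_final = gradient_descent(toy_gradient, [0.0, 10.0])   # both weights approach 3
```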
Mini-batch gradient descent
• In classic gradient descent, we compute the gradient from the loss for all training examples (can be slow)
• So, use only some of the data for each gradient update
• We cycle through all the training examples multiple times
• Each time we’ve cycled through all of them once is called an ‘epoch’
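A minimal mini-batch sketch, assuming a noise-free linear regression problem with a mean-squared-error loss. The data, batch size, learning rate and epoch count are all hypothetical; the point is the structure: shuffle, take one gradient step per mini-batch, and count one epoch per full pass over the data.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=16, learning_rate=0.1, epochs=50, seed=0):
    """Fit weights w so that X @ w is close to y, one mini-batch update at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                        # one epoch = one pass over all examples
        order = rng.permutation(len(X))            # visit the examples in a new random order
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]  # indices of this mini-batch
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the batch's mean squared error
            w -= learning_rate * grad
    return w

# Hypothetical data generated from known weights, so the result can be checked.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = minibatch_gd(X, y)
```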
Learning rate selection
The effects of step size (or “learning rate”)
Andrej Karpathy
Gradient descent in multi-layer nets
• We’ll update weights
• Move in direction opposite to gradient: $W^{(t+1)} = W^{(t)} - \eta \, \frac{\partial L}{\partial W^{(t)}}$ ($\eta$ is the learning rate)
• How to update the weights at all layers?
• Answer: backpropagation of loss from higher layers to lower layers
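Backpropagation: Graphic example
First calculate error of output units and use this to change the top layer of weights.
[Figure: network with input units i, hidden units j and output units k; weights w(1) connect input to hidden and weights w(2) connect hidden to output. Annotation: “Update weights into j”]
Adapted from Ray Mooney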
Backpropagation: Graphic example
Next calculate error for hidden units based on errors on the output units they feed into.
[Figure: the same network, with input units i, hidden units j and output units k]
Adapted from Ray Mooney
Backpropagation: Graphic example
Finally update bottom layer of weights based on errors calculated for hidden units.
[Figure: the same network, with input units i, hidden units j and output units k. Annotation: “Update weights into i”]
Adapted from Ray Mooney
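The three steps (output errors, hidden errors, weight gradients) can be sketched for a tiny network. The architecture is an assumption made for illustration: sigmoid hidden units, linear outputs, no biases and a squared-error loss. The slides do not fix these choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_and_grads(x, t, W1, W2):
    """Forward pass, then backpropagate the error from the output layer down."""
    z = sigmoid(W1 @ x)                    # hidden activations
    y = W2 @ z                             # linear outputs
    loss = 0.5 * np.sum((y - t) ** 2)      # squared error against the target t

    delta_out = y - t                                  # 1. error of the output units
    grad_W2 = np.outer(delta_out, z)                   #    gradient for the top layer of weights
    delta_hidden = (W2.T @ delta_out) * z * (1 - z)    # 2. hidden error, from the outputs each unit feeds
    grad_W1 = np.outer(delta_hidden, x)                # 3. gradient for the bottom layer of weights
    return loss, grad_W1, grad_W2

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x, t = rng.standard_normal(3), rng.standard_normal(2)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
loss, grad_W1, grad_W2 = loss_and_grads(x, t, W1, W2)
```

A weight update would then subtract a learning rate times each gradient from the corresponding weight matrix.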
Backpropagation
• Easier if we use computational graphs, especially when we have complicated functions typical in deep neural networks
Figure from Karpathy
Backpropagation: a simple example
e.g. x = -2, y = 5, z = -4
Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$
Chain rule: the gradient with respect to a node’s input is the upstream gradient multiplied by the local gradient
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 4, 13 Jan 2016
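The function itself was in a figure that did not survive extraction. The sketch below assumes f(x, y, z) = (x + y)z with the intermediate node q = x + y, which is the example used in the cited CS231n Lecture 4; treat that as an assumption about this slide.

```python
# Assumed function: f(x, y, z) = (x + y) * z, with intermediate node q = x + y.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute the value at each node of the graph.
q = x + y            # 3.0
f = q * z            # -12.0

# Backward pass: each gradient = upstream gradient * local gradient.
df_df = 1.0                  # gradient of the output with respect to itself
df_dq = z * df_df            # local gradient of q*z with respect to q is z
df_dz = q * df_df            # local gradient of q*z with respect to z is q
df_dx = 1.0 * df_dq          # local gradient of x+y with respect to x is 1
df_dy = 1.0 * df_dq          # local gradient of x+y with respect to y is 1
```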
[Figure: a single node f in a computational graph. Activations flow forward through f; gradients flow backward, and the gradient passed to each input is the incoming gradient multiplied by the node’s “local gradient”.]
Andrej Karpathy
Backpropagation: another example
Andrej Karpathy
Convolutional Neural Networks (CNN)
• Neural network with specialized connectivity structure
• Stack multiple stages of feature extractors
• Higher stages compute more global, more invariant, more abstract features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document
recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Adapted from Rob Fergus
Convolutional Neural Networks (CNN)
• Feed-forward feature extraction:
1. Convolve input with learned filters
2. Apply non-linearity
3. Spatial pooling (downsample)
• Supervised training of convolutional filters by back-propagating classification error
[Figure: pipeline from Input Image through Convolution (Learned), Non-linearity and Spatial pooling, ..., to Output (class probs)]
Adapted from Lana Lazebnik
Convolutions: More detail
32x32x3 image (height 32, width 32, depth 3)
5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Andrej Karpathy
Convolutions: More detail
Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve (slide) over all spatial locations to get a 28x28x1 activation map
Consider a second, green filter: it gives a second 28x28x1 activation map
Andrej Karpathy
Convolutions: More detail
Convolution Layer
For example, if we had 6 5x5x3 filters, we’ll get 6 separate 28x28 activation maps.
We stack these up to get a “new image” of size 28x28x6!
Andrej Karpathy
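The sliding dot product can be sketched with explicit loops. This is a deliberately naive NumPy illustration (no padding, no flipping of the filter, random stand-in data), not how a deep learning library implements the layer.

```python
import numpy as np

def conv_layer(image, filters, biases, stride=1):
    """Slide each F x F x C filter over an H x W x C image.

    Every output number is the dot product of the filter with one image chunk,
    plus that filter's bias. Returns one activation map per filter, stacked
    along the last axis.
    """
    H, W, C = image.shape
    K, F, _, _ = filters.shape                 # K filters, each F x F x C
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    maps = np.zeros((out_h, out_w, K))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                chunk = image[r:r + F, c:c + F, :]
                maps[i, j, k] = np.sum(chunk * filters[k]) + biases[k]
    return maps

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                # stand-in 32x32x3 image
filters = rng.standard_normal((6, 5, 5, 3))    # six random 5x5x3 filters
activation_maps = conv_layer(image, filters, np.zeros(6))   # shape (28, 28, 6)
```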
Convolutions: More detail
one filter =>
one activation map
example 5×5 filters (32 total)
We call the layer convolutional because it is related to convolution of two signals:
Element-wise multiplication and sum of a filter and the signal (image)
Adapted from Andrej Karpathy, Kristen Grauman
Convolutions: More detail
Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions
32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations => 28x28x1 activation map
7×7 input (spatially), assume 3×3 filter => 5×5 output
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7×7 input (spatially), assume 3×3 filter applied with stride 2 => 3×3 output!
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7×7 input (spatially), assume 3×3 filter applied with stride 3?
Doesn’t fit! Cannot apply 3×3 filter on 7×7 input with stride 3.
Andrej Karpathy
Convolutions: More detail
Output size for an N×N input and an F×F filter:
(N – F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 – 3)/1 + 1 = 5
stride 2 => (7 – 3)/2 + 1 = 3
stride 3 => (7 – 3)/3 + 1 = 2.33 :\
Andrej Karpathy
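The output-size formula can be wrapped in a small helper. The name `conv_output_size` and the choice to raise an error when the filter does not fit are assumptions for illustration; padding is not modelled.

```python
def conv_output_size(n, f, stride=1):
    """Spatial output size (N - F) / stride + 1 for an N x N input and F x F filter.

    Raises ValueError when the stride does not tile the input exactly.
    """
    if (n - f) % stride != 0:
        raise ValueError(f"{f}x{f} filter with stride {stride} does not fit a {n}x{n} input")
    return (n - f) // stride + 1

size_stride_1 = conv_output_size(7, 3, stride=1)   # 5
size_stride_2 = conv_output_size(7, 3, stride=2)   # 3
```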
Convolutions: More detail preview:
Andrej Karpathy
Spatial Pooling
A Common Architecture: AlexNet
Figure from http://www.mdpi.com/2072-4292/7/11/14680/htm
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Only 3×3 CONV stride 1, pad 1 and 2×2 MAX POOL stride 2
Best model: 11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error
Andrej Karpathy
Case Study: GoogLeNet
[Szegedy et al., 2014]
Andrej Karpathy
Inception module
ILSVRC 2014 winner (6.7% top 5 error)
Case Study: ResNet
[He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
2-3 weeks of training on 8 GPU machine
Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w
Andrej Karpathy
Practical matters
Comments on training algorithm
• Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many large networks on real data.
• Thousands of epochs (epoch = network sees all training data once) may be required, hours or days to train.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts), and take results of trial with lowest training set error.
• It may be hard to set the learning rate and to select the number of hidden units and layers.
• Neural networks had fallen out of fashion in the 90s and early 2000s; they are back with a new name and significantly improved performance (deep networks trained with dropout and lots of data).
Ray Mooney, Carlos Guestrin, Dhruv Batra
Over-training prevention
• Running too many epochs can result in over-fitting.
[Figure: error versus # training epochs, with one curve for error on training data and one for error on test data]
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
Adapted from Ray Mooney
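The stopping rule can be sketched as a function over per-epoch validation errors. The `patience` parameter is an assumption added here; `patience=1` corresponds to stopping as soon as an extra epoch fails to improve the validation error.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the epoch whose weights should be kept.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation error seen so far.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break                      # validation error stopped improving
    return best_epoch

# Hypothetical validation errors, one per epoch.
errors = [0.50, 0.40, 0.30, 0.35, 0.20]
stop_at = early_stopping_epoch(errors)                # stops after epoch 3, keeps epoch 2
patient = early_stopping_epoch(errors, patience=2)    # tolerates one bad epoch, keeps epoch 4
```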
Training: Best practices
• Use mini-batch
• Use regularization
• Use cross-validation for your parameters
• Use RELU or leaky RELU, don’t use sigmoid
• Center (subtract mean from) your data
• Learning rate: too high? too low?
Regularization: Dropout
• Randomly turn off some neurons
• Allows individual neurons to independently be responsible for performance
Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]
Adapted from Jia-bin Huang
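Randomly turning off neurons can be sketched with a random mask. This version is "inverted dropout", a common variant that rescales the surviving activations during training so that nothing needs to change at test time; the slide does not say which variant is meant, and `keep_prob=0.5` is an assumed default.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None, training=True):
    """Zero each activation with probability 1 - keep_prob (training only)."""
    if not training:
        return activations                                # test time: use all neurons
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) < keep_prob      # True where the neuron stays on
    return activations * mask / keep_prob                 # rescale the survivors

a = np.ones((4, 5))
dropped = dropout(a, keep_prob=0.5, rng=np.random.default_rng(0))
```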
Data Augmentation (Jittering)
Create virtual training samples
• Horizontal flip
• Random crop
• Color casting
• Geometric distortion
Jia-bin Huang
Deep Image [Wu et al. 2015]
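Two of the listed augmentations, horizontal flip and random crop, can be sketched on an image stored as a height x width x channels array. The crop size of 28 and the stand-in image are assumptions; color casting and geometric distortion are not shown.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror the image left to right (reverse the width axis)."""
    return image[:, ::-1, :]

def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    h, w, _ = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size, :]

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3).reshape(32, 32, 3)   # stand-in 32x32x3 image
flipped = horizontal_flip(image)
crop = random_crop(image, 28, rng)
```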
Transfer Learning
“You need a lot of data if you want to train/use CNNs”
Andrej Karpathy
Transfer Learning with CNNs
• The more weights you need to learn, the more data you need
• That’s why with a deeper network, you need more data for training than for a shallower network
• One possible solution:
Set these to the already learned weights from another network
Learn these on your own task
Transfer Learning with CNNs
Source: classification on ImageNet
Target: some other task/data
1. Train on ImageNet
2. Small dataset: freeze the pretrained layers (“Freeze these”) and train only the final layer (“Train this”)
3. Medium dataset: finetuning. More data = retrain more of the network (or all of it); freeze the remaining layers (“Freeze these”) and train the rest (“Train this”)
Adapted from Andrej Karpathy
Summary
• We use deep neural networks because of their strong performance in practice
• Convolutional neural networks (CNN)
  • Convolution, nonlinearity, max pooling
• Training deep neural nets
  • We need an objective function that measures and guides us towards good performance
  • We need a way to minimize the loss function: stochastic gradient descent
  • We need backpropagation to propagate error through all layers and change their weights
• Practices for preventing overfitting
  • Dropout; data augmentation; transfer learning