
[06-30213][06-30241][06-25024]
Computer Vision and Imaging &
Robot Vision
Dr Hyung Jin Chang (h.j.chang@bham.ac.uk)
Dr Yixing Gao (y.gao.8@bham.ac.uk)
School of Computer Science

DEEP LEARNING II

Why (convolutional) neural networks?
State of the art performance on many problems
Most (all?) papers in recent vision conferences use deep neural networks
Razavian et al., CVPR 2014 Workshops

Neural network definition
• Nonlinear classifier
• Can approximate any continuous function to arbitrary
accuracy given sufficiently many hidden units
Figure from Christopher Bishop

Neural network definition
• Activations:
• Nonlinear activation function h (e.g. sigmoid, ReLU):
Figure from Christopher Bishop

Neural network definition
• Layer 2
• Layer 3 (final)
• Outputs (e.g. sigmoid (binary) / softmax (multiclass))
• Putting everything together:

Nonlinear activation functions
• Sigmoid
• tanh: tanh(x)
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout
• ELU
Andrej Karpathy
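
The activation functions listed on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's code: the sigmoid is written in its standard form 1/(1 + e^(-x)), the leaky slope 0.1 is taken from the slide, and Maxout and ELU are left out because the slide gives no formula for them.

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1).
    return math.tanh(x)

def relu(x):
    # max(0, x): passes positive inputs, zeroes negative ones.
    return max(0.0, x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x): like ReLU, but negative inputs keep a small slope.
    return max(slope * x, x)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```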

Multilayer networks
• Cascade neurons together
• Output from one layer is the input to the next
• Each layer has its own sets of weights
HKUST

Feed-forward networks
• Predictions are fed forward through the network to classify
HKUST
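
A feed-forward pass simply applies each layer in turn, with the output of one layer becoming the input to the next. The sketch below assumes sigmoid activations everywhere and uses a hypothetical 2-3-1 network with hand-picked weights; none of these values come from the slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, weights, biases):
    # Each row of `weights` holds the incoming weights of one neuron.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def feed_forward(x, layers):
    # The output of one layer is the input to the next.
    out = x
    for weights, biases in layers:
        out = layer_forward(out, weights, biases)
    return out

# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5]),
    ([[1.0, -2.0, 0.5]], [0.1]),
]
y = feed_forward([1.0, 2.0], layers)
print(y)
```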

Deep neural networks
• Lots of hidden layers
• Depth = power (usually)
Figure from http://neuralnetworksanddeeplearning.com/chap5.html
Weights to learn!

How do we train them?
• The goal is to iteratively find a set of weights that allows the activations/outputs to match the desired output
• For this, we will minimize a loss function
• The loss function quantifies the agreement between the predicted scores and the ground-truth (GT) labels
• First, let’s simplify and assume we have a single layer of weights in the network

Classification goal
Example dataset: CIFAR-10
10 labels
50,000 training images, each image is 32x32x3
10,000 test images
Andrej Karpathy

Classification scores
image x: [32x32x3] array of numbers 0…1 (3072 numbers total)
parameters W
f(x, W) (+b)
=> 10 numbers, indicating class scores
Andrej Karpathy

Linear classifier
$f(x, W) = Wx + b$
• x: image [32x32x3], array of numbers 0…1, as a 3072×1 vector
• W: parameters, or “weights”, 10×3072
• b: 10×1
• f(x, W): 10×1, i.e. 10 numbers, indicating class scores
Andrej Karpathy

Linear classifier
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)
Andrej Karpathy
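
The 4-pixel, 3-class case can be written out directly as a matrix-vector product plus a bias. The weight, bias and pixel values below are illustrative numbers chosen for this sketch; the text above does not give them.

```python
def linear_scores(W, x, b):
    # f(x, W) = Wx + b: one score per class (one row of W per class).
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

classes = ["cat", "dog", "ship"]
# Illustrative values only: 3 classes x 4 pixels.
W = [[0.2, -0.5, 0.1, 2.0],
     [1.5, 1.3, 2.1, 0.0],
     [0.0, 0.25, 0.2, -0.3]]
b = [1.1, 3.2, -1.2]
x = [56.0, 231.0, 24.0, 2.0]   # the image, flattened to 4 pixel values

scores = linear_scores(W, x, b)
predicted = classes[scores.index(max(scores))]
print(scores, predicted)
```

With these particular numbers the highest score goes to "dog", which shows that an arbitrary W produces scores with no guarantee of being correct; that is what the loss function has to measure.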

Linear classifier
[Figure: class scores for three example images under some W]
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
Andrej Karpathy

Linear classifier
Suppose: 3 training examples, 3 classes. With some W the scores are:

        example 1   example 2   example 3
        (cat)       (car)       (frog)
cat       3.2         1.3         2.2
car       5.1         4.9         2.5
frog     -1.7         2.0        -3.1

Adapted from Andrej Karpathy

Linear classifier: Hinge loss
Suppose: 3 training examples, 3 classes. With some W the scores are:

        example 1   example 2   example 3
        (cat)       (car)       (frog)
cat       3.2         1.3         2.2
car       5.1         4.9         2.5
frog     -1.7         2.0        -3.1

Hinge loss:
Given an example $(x_i, y_i)$ where $x_i$ is the image and where $y_i$ is the (integer) label, and using the shorthand for the scores vector $s = f(x_i, W)$, the loss has the form:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
Want: $s_{y_i} \geq s_j + 1$, i.e. $s_j - s_{y_i} + 1 \leq 0$
If true, loss is 0
If false, loss is magnitude of violation

Example 1 (cat):
= max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
= max(0, 2.9) + max(0, -3.9)
= 2.9 + 0
= 2.9

Example 2 (car):
= max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
= max(0, -2.6) + max(0, -1.9)
= 0 + 0
= 0

Example 3 (frog):
= max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
= max(0, 5.3 + 1) + max(0, 5.6 + 1)
= 6.3 + 6.6
= 12.9

Loss:     2.9         0          12.9
Adapted from Andrej Karpathy
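
The three hinge-loss computations can be checked with a short script. This is a minimal sketch using the scores from the slides; the function name and the `margin` parameter are choices made for this illustration.

```python
def hinge_loss(scores, correct, margin=1.0):
    # Sum over the wrong classes j of max(0, s_j - s_correct + margin).
    s_correct = scores[correct]
    return sum(max(0.0, s - s_correct + margin)
               for j, s in enumerate(scores) if j != correct)

# Scores are ordered (cat, car, frog); the second item is the correct class index.
examples = [
    ([3.2, 5.1, -1.7], 0),   # cat image
    ([1.3, 4.9, 2.0], 1),    # car image
    ([2.2, 2.5, -3.1], 2),   # frog image
]
losses = [hinge_loss(s, y) for s, y in examples]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```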
With the per-example losses 2.9, 0 and 12.9, the full training loss is the mean over all examples in the training data:
$L = \frac{1}{N} \sum_{i=1}^{N} L_i$
L = (2.9 + 0 + 12.9)/3 = 15.8 / 3 = 5.3
Adapted from Andrej Karpathy

Linear classifier: Hinge loss
Weight Regularization
$L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)$
λ = regularization strength (hyperparameter)
In common use:
• L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
• L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
• Dropout (will see later)
Adapted from Andrej Karpathy

Another loss: Softmax (cross-entropy)
scores = unnormalized log probabilities of the classes:
$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$ where $s = f(x_i, W)$
Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:
$L_i = -\log P(Y = y_i \mid X = x_i)$

        unnormalized          unnormalized
        log probabilities     probabilities (exp)    probabilities (normalize)
cat       3.2                   24.5                   0.13
car       5.1                  164.0                   0.87
frog     -1.7                    0.18                  0.00

L_i = -log(0.13) = 0.89 (this value uses the base-10 logarithm; with the natural logarithm, -ln(0.13) = 2.04)
Adapted from Andrej Karpathy

How to minimize the loss function?
In 1-dimension, the derivative of a function:
$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
In multiple dimensions, the gradient is the vector of (partial derivatives).
Andrej Karpathy
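
As a sketch of both ideas, the snippet below computes the softmax cross-entropy loss for the cat example and then estimates its gradient with the finite-difference formula, one dimension at a time. It differentiates with respect to the scores rather than W to keep the example small, uses the natural logarithm (so the loss comes out near 2.04 rather than 0.89), and the step h = 0.0001 is an illustrative choice.

```python
import math

def softmax(scores):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, correct):
    # Negative log likelihood of the correct class (natural log).
    return -math.log(softmax(scores)[correct])

def numerical_gradient(f, w, h=1e-4):
    # (f(w + h along dimension i) - f(w)) / h for each dimension i.
    base = f(w)
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grad.append((f(bumped) - base) / h)
    return grad

scores = [3.2, 5.1, -1.7]          # cat, car, frog; correct class is cat
probs = softmax(scores)
loss = cross_entropy(scores, 0)
grad = numerical_gradient(lambda s: cross_entropy(s, 0), scores)
print(probs, loss, grad)
```

For this loss the exact gradient with respect to the scores is the probability vector minus a one-hot vector for the correct class, so the numerical estimate can be compared against `probs[0] - 1`, `probs[1]` and `probs[2]`.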
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,...], loss 1.25347
W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,...], loss 1.25322
(1.25322 - 1.25347)/0.0001 = -2.5
W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,...], loss 1.25353
(1.25353 - 1.25347)/0.0001 = 0.6
W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,...], loss 1.25347
(1.25347 - 1.25347)/0.0001 = 0
gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?,...]
Andrej Karpathy

This is silly. The loss is just a function of W: want $\nabla_W L$
Use Calculus!
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33,...], loss 1.25347
dW = ... (some function of data and W)
gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1,...]
Andrej Karpathy

Loss gradients
• Denoted as (diff notations): $\frac{\partial L}{\partial W}$, $\nabla_W L$
• i.e. how the loss changes as a function of the weights
• We want to change the weights in such a way that makes the loss decrease as fast as possible

Gradient descent
• We’ll update weights iteratively
• Move in direction opposite to gradient:
$W^{(t+1)} = W^{(t)} - \eta \frac{\partial L}{\partial W^{(t)}}$ (t: time, η: learning rate)
[Figure: loss function landscape over W_1 and W_2, showing the original W and the negative gradient direction]
Figure from Andrej Karpathy
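
A minimal sketch of the gradient descent update follows. The quadratic loss, its closed-form gradient, the learning rate and the step count are all hypothetical choices for illustration, picked so that the minimum (3, -1) is known in advance.

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, steps=100):
    # Repeatedly evaluate the gradient and step in the opposite direction.
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical loss with its minimum at w = (3, -1).
def loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w_start = [0.0, 0.0]
w_final = gradient_descent(grad, w_start)
print(w_final, loss(w_start), loss(w_final))
```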
Gradient descent
• Iteratively subtract the gradient with respect to the model parameters (w)
• i.e. we’re moving in a direction opposite to the gradient of the loss
• i.e. we’re moving towards smaller loss
• The procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent

Mini-batch gradient descent
• In classic gradient descent, we compute the gradient from the loss for all training examples (can be slow)
• So, use only some of the data for each gradient update
• We cycle through all the training examples multiple times
• Each time we’ve cycled through all of them once is called an ‘epoch’

Learning rate selection
The effects of step size (or “learning rate”)
Andrej Karpathy

Gradient descent in multi-layer nets
• We’ll update weights
• Move in direction opposite to gradient:
$W^{(t+1)} = W^{(t)} - \eta \frac{\partial L}{\partial W^{(t)}}$ (η: learning rate)
• How to update the weights at all layers?
• Answer: backpropagation of loss from higher layers to lower layers

Backpropagation: Graphic example
[Figure: network with input i, hidden j and output k, connected by weights w(1) and w(2)]
1. First calculate error of output units and use this to change the top layer of weights. (Update weights into j)
2. Next calculate error for hidden units based on errors on the output units it feeds into.
3. Finally update bottom layer of weights based on errors calculated for hidden units. (Update weights into i)
Adapted from Ray Mooney

Backpropagation
• Easier if we use computational graphs, especially when we have complicated functions typical in deep neural networks
Figure from Karpathy
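
To make the computational-graph view concrete, here is a sketch of a forward and backward pass through a two-node graph. It uses the values x = -2, y = 5, z = -4 from the simple example in this lecture; the function f(x, y, z) = (x + y) z is an assumption, taken from the CS231n lecture these slides are adapted from, because the extracted text does not state it.

```python
def forward_backward(x, y, z):
    # Forward pass: compute and keep the intermediate value q.
    q = x + y          # add node
    f = q * z          # multiply node

    # Backward pass: multiply each local gradient by the upstream gradient.
    df_df = 1.0
    df_dq = z * df_df      # local gradient of q*z with respect to q is z
    df_dz = q * df_df      # local gradient of q*z with respect to z is q
    df_dx = 1.0 * df_dq    # local gradient of x+y with respect to x is 1
    df_dy = 1.0 * df_dq    # local gradient of x+y with respect to y is 1
    return f, (df_dx, df_dy, df_dz)

f, (dx, dy, dz) = forward_backward(-2.0, 5.0, -4.0)
print(f, dx, dy, dz)
```

Each node only needs its own local gradient and the gradient arriving from above, which is why the same procedure scales to deep networks.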
Backpropagation: a simple example
e.g. x = -2, y = 5, z = -4
Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$
Chain rule: gradient with respect to an input = upstream gradient × local gradient
[Figure: a gate f in the computational graph, with its input activations, its “local gradient”, and the gradients passed backward]
Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 4, 13 Jan 2016

Backpropagation: another example
Andrej Karpathy

Convolutional Neural Networks (CNN)
• Neural network with specialized connectivity structure
• Stack multiple stages of feature extractors
• Higher stages compute more global, more invariant, more abstract features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Adapted from Rob Fergus

Convolutional Neural Networks (CNN)
• Feed-forward feature extraction:
1. Convolve input with learned filters
2. Apply non-linearity
3. Spatial pooling (downsample)
• Supervised training of convolutional filters by back-propagating classification error
Input Image => Convolution (Learned) => Non-linearity => Spatial pooling => ... => Output (class probs)
Adapted from Lana Lazebnik

Convolutions: More detail
Convolution Layer
32x32x3 image (width 32, height 32, depth 3)
5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
convolve (slide) over all spatial locations => 28x28x1 activation map
consider a second, green filter => a second 28x28x1 activation map
For example, if we had 6 5x5x3 filters, we’ll get 6 separate activation maps.
We stack these up to get a “new image” of size 28x28x6!
Andrej Karpathy

Convolutions: More detail
one filter => one activation map
example 5×5 filters (32 total)
We call the layer convolutional because it is related to convolution of two signals:
Element-wise multiplication and sum of a filter and the signal (image)
Adapted from Andrej Karpathy, Kristen Grauman
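
The "element-wise multiplication and sum" can be sketched with nested loops over one filter. This is an unoptimized illustration with images stored as height × width × depth nested lists (an assumption of this sketch); like most deep learning code it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def conv_single_filter(image, kernel, bias=0.0, stride=1):
    # image: H x W x D nested lists; kernel: F x F x D nested lists.
    height, width = len(image), len(image[0])
    f = len(kernel)
    depth = len(kernel[0][0])
    activation_map = []
    for top in range(0, height - f + 1, stride):
        row = []
        for left in range(0, width - f + 1, stride):
            # Dot product between the filter and one F x F x D chunk, plus bias.
            total = bias
            for a in range(f):
                for b in range(f):
                    for c in range(depth):
                        total += image[top + a][left + b][c] * kernel[a][b][c]
            row.append(total)
        activation_map.append(row)
    return activation_map

# A 32x32x3 image and a 5x5x3 filter, both filled with ones for easy checking.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
kernel = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
amap = conv_single_filter(image, kernel)
print(len(amap), len(amap[0]), amap[0][0])
```

Running six such filters and stacking their activation maps would give the 28x28x6 "new image" described above.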

Convolutions: More detail
Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions
32x32x3 => CONV, ReLU (e.g. 6 5x5x3 filters) => 28x28x6 => CONV, ReLU (e.g. 10 5x5x6 filters) => 24x24x10 => CONV, ReLU => ….
Andrej Karpathy

Convolutions: More detail
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations => 28x28x1 activation map
Andrej Karpathy

Convolutions: More detail
A closer look at spatial dimensions:
• 7×7 input (spatially), assume 3×3 filter => 5×5 output
• 7×7 input (spatially), assume 3×3 filter applied with stride 2 => 3×3 output!
• 7×7 input (spatially), assume 3×3 filter applied with stride 3? doesn’t fit! cannot apply 3×3 filter on 7×7 input with stride 3.
Andrej Karpathy

Convolutions: More detail
N×N input, F×F filter
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
Andrej Karpathy
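
The output-size formula is easy to wrap in a helper that also rejects strides that do not fit. This is a small sketch; the function name and the choice to raise `ValueError` are illustrative, not from the slides.

```python
def conv_output_size(n, f, stride=1):
    # (N - F) / stride + 1, valid only when the stride divides N - F exactly.
    span = n - f
    if span < 0 or span % stride != 0:
        raise ValueError(
            f"cannot apply {f}x{f} filter on {n}x{n} input with stride {stride}")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))
print(conv_output_size(7, 3, stride=2))
print(conv_output_size(32, 5, stride=1))

try:
    conv_output_size(7, 3, stride=3)
    stride_3_fits = True
except ValueError:
    stride_3_fits = False
print(stride_3_fits)
```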

Convolutions: More detail preview:
Andrej Karpathy

Spatial Pooling

A Common Architecture: AlexNet
Figure from http://www.mdpi.com/2072-4292/7/11/14680/htm

Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Only 3×3 CONV stride 1, pad 1 and 2×2 MAX POOL stride 2
best model: 11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error
Andrej Karpathy

Case Study: GoogLeNet
[Szegedy et al., 2014]
Andrej Karpathy
Inception module
ILSVRC 2014 winner (6.7% top 5 error)

Case Study: ResNet
[He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w
Andrej Karpathy

Case Study: ResNet
2-3 weeks of training on 8 GPU machine
Andrej Karpathy
(slide from Kaiming He’s recent presentation)

Practical matters

Comments on training algorithm
• Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many large networks on real data.
• Thousands of epochs (epoch = network sees all training data once) may be required; training can take hours or days.
• To avoid local-minima problems, run several trials starting with different random weights (random restarts), and take the results of the trial with the lowest training set error.
• It may be hard to set the learning rate and to select the number of hidden units and layers.
• Neural networks had fallen out of fashion in the 90s and early 2000s; they are back with a new name and significantly improved performance (deep networks trained with dropout and lots of data).
Ray Mooney, Carlos Guestrin, Dhruv Batra

Over-training prevention
• Running too many epochs can result in over-fitting.
[Figure: error vs. # training epochs, with one curve on training data and one on test data]
• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
Adapted from Ray Mooney
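
The stopping rule can be sketched as a scan over per-epoch validation errors. The error values below are made up for illustration, and the `patience` parameter (how many non-improving epochs to tolerate) is an addition of this sketch rather than something the slide specifies.

```python
def early_stopping_epoch(validation_errors, patience=1):
    # Return (best_epoch, best_error), stopping once validation error
    # has failed to improve for `patience` consecutive epochs.
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error

# Hypothetical validation errors: improving at first, then over-fitting.
errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.26, 0.30]
best = early_stopping_epoch(errors)
print(best)
```

In a real training loop the weights saved at the best epoch would be the ones kept.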

Training: Best practices
• Use mini-batch
• Use regularization
• Use cross-validation for your parameters
• Use RELU or leaky RELU, don’t use sigmoid
• Center (subtract mean from) your data
• Learning rate: too high? too low?
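
Two of these practices, centering the data and training on mini-batches, are combined in the sketch below. Everything here is an illustrative assumption: a one-dimensional linear model fitted with a squared-error loss on noiseless toy data, with a batch size, learning rate and epoch count chosen so that training is stable.

```python
import random

def center(values):
    # Subtract the mean so the data is centered on zero.
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean

def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                 # one epoch = one pass over all data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * gw
            b -= learning_rate * gb
    return w, b

raw_x = [float(v) for v in range(8)]        # toy inputs 0..7
xs, x_mean = center(raw_x)
ys = [3.0 * x + 1.0 for x in xs]            # toy targets from a known line
w, b = minibatch_sgd(xs, ys)
print(x_mean, w, b)
```

A learning rate that is too high makes the updates overshoot and diverge, while one that is too low makes progress very slow; rerunning this sketch with different `learning_rate` values is a quick way to see both effects.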

Regularization: Dropout
• Randomly turn off some neurons
• Allows individual neurons to independently be responsible for performance
Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]
Adapted from Jia-bin Huang
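
A minimal sketch of dropout applied to a list of activations follows. It assumes the common "inverted dropout" variant, where surviving activations are scaled by 1/keep_prob during training so that nothing needs to change at test time; the slide itself only says that neurons are randomly turned off.

```python
import random

def dropout(activations, keep_prob=0.5, training=True, rng=random):
    # At test time, every neuron is kept and nothing is rescaled.
    if not training:
        return list(activations)
    # During training, keep each neuron with probability keep_prob
    # and scale the survivors so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 1000, keep_prob=0.5, training=True, rng=rng)
test_out = dropout([1.0, 2.0], training=False)
dropped = sum(1 for v in train_out if v == 0.0)
print(dropped, test_out)
```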

Data Augmentation (Jittering)
Create virtual training samples
• Horizontal flip
• Random crop
• Color casting
• Geometric distortion
Jia-bin Huang
Deep Image [Wu et al. 2015]
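
Horizontal flips and random crops are simple to sketch on an image stored as a nested list of rows. The tiny 4×4 single-channel image below is a stand-in for illustration; color casting and geometric distortion are not covered here.

```python
import random

def horizontal_flip(image):
    # Mirror each row left to right.
    return [list(reversed(row)) for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    # Pick a random top-left corner so the crop stays inside the image.
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
flipped = horizontal_flip(image)
crop = random_crop(image, 2, 2, rng=random.Random(0))
print(flipped[0], crop)
```

Each transformed copy keeps the original label, which is what makes these "virtual" training samples.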

Transfer Learning
“You need a lot of data if you want to train/use CNNs”
Andrej Karpathy

Transfer Learning with CNNs
• The more weights you need to learn, the more data you need
• That’s why with a deeper network, you need more data for training than for a shallower network
• One possible solution:
Set these to the already learned weights from another network
Learn these on your own task

Transfer Learning with CNNs
Source: classification on ImageNet
Target: some other task/data
1. Train on ImageNet
2. Small dataset: freeze the pretrained layers (“Freeze these”) and train only the final layer (“Train this”)
3. Medium dataset: finetuning. Freeze the earlier layers (“Freeze these”) and train the later ones (“Train this”); more data = retrain more of the network (or all of it)
Adapted from Andrej Karpathy
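
Freezing can be sketched as a flag that the update step checks before touching a layer's weights. The layer names, weights and gradients below are hypothetical placeholders; a real network would load the frozen weights from a model trained on ImageNet.

```python
def sgd_step(layers, grads, learning_rate=0.01):
    # Update only the layers that are not frozen.
    for layer, grad in zip(layers, grads):
        if layer["frozen"]:
            continue
        layer["weights"] = [w - learning_rate * g
                            for w, g in zip(layer["weights"], grad)]

# Hypothetical network: two pretrained layers kept fixed, one new classifier.
layers = [
    {"name": "conv1", "weights": [0.5, -0.2], "frozen": True},
    {"name": "conv2", "weights": [0.1, 0.3], "frozen": True},
    {"name": "classifier", "weights": [0.0, 0.0], "frozen": False},
]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
sgd_step(layers, grads, learning_rate=0.1)
print([layer["weights"] for layer in layers])
```

With more target data, finetuning amounts to setting `frozen` to `False` on more of the later layers.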

Summary

• We use deep neural networks because of their strong performance in practice
• Convolutional neural networks (CNN)
  • Convolution, nonlinearity, max pooling
• Training deep neural nets
  • We need an objective function that measures and guides us towards good performance
  • We need a way to minimize the loss function: stochastic gradient descent
  • We need backpropagation to propagate error through all layers and change their weights
• Practices for preventing overfitting
  • Dropout; data augmentation; transfer learning