Deep Learning – COSC2779 – Vision Application & CNN Architectures
Dr. Ruwan Tennakoon
Aug 16, 2021
Lecture 5 (Part 1) Deep Learning – COSC2779 Aug 16, 2021 1 / 51
Outline
1 Image Classification
AlexNet
VGGNet
GoogLeNet
ResNet
2 Object Detection & Segmentation
Last Week: Pooling & Convolutions
[Figure: input tensor (H × W × C1) → Conv 3×3 (Ch: C2) → Activation + Pool → Conv 3×3 (Ch: C3)]
Convolutions can be combined with pooling to construct a chain of layers.
Feature extraction usually happens locally – sparse connectivity.
In feature extraction the same operation is applied at different locations – parameter
sharing.
Pooling helps reduce redundant information and provides some level of invariance to
translations.
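To make sparse connectivity and parameter sharing concrete, the sketch below compares the number of parameters in a 3×3 convolution with a fully connected layer that produces an output of the same size. The input size (32×32×3) and the 16 output channels are assumptions chosen only for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a 2D convolution with square kernels."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Hypothetical layer: 32x32x3 input, 16 output channels, same spatial size.
h, w, c_in, c_out = 32, 32, 3, 16
conv = conv_params(3, c_in, c_out)
dense = dense_params(h * w * c_in, h * w * c_out)
print("conv parameters: ", conv)
print("dense parameters:", dense)
```

The convolution reuses the same small set of weights at every location, which is why its parameter count does not depend on H or W.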
Last Week: LeNet Architecture
“LeNet is a classic example of convolutional neural network to successfully predict
handwritten digits.” [LeNet]
https://ieeexplore.ieee.org/abstract/document/726791
How Can We Design the Architecture?
There are many hyperparameters to choose in a CNN:
Number of convolutional layers, filter size, number of filters, stride,
initialization . . .
Pooling size, number of pooling layers . . .
Number of FC layers, units, . . .
Optimizer type, learning rate, . . .
. . .
Use classic networks like LeNet, AlexNet, VGG-16, VGG-19, ResNet
etc. as inspiration (follow the trend used in those architectures).
Objectives of the Lecture
Identify how deep networks are developed via case study: Image
classification (IMAGENET).
Understand the main trends in classic architectures and why they work.
Identify the classic network architectures used for common computer
vision problems:
Image Classification
Object Detection
Image Segmentation
Image Classification
The ImageNet Large Scale Visual
Recognition Challenge (ILSVRC)
was an annual competition held
between 2010 and 2017.
The datasets comprised
approximately 1 million images
and 1,000 object classes.
The annual challenge focuses on
multiple tasks for image
classification.
Image source: ImageNet
Alex Krizhevsky et al., in “ImageNet Classification with Deep Convolutional Neural Networks”,
developed a convolutional neural network that achieved top results on the ILSVRC-2010 and
ILSVRC-2012 image classification tasks.
CNN Architecture Evolution: Image Classification
[Figure: two bar charts over AlexNet, ZFNet, VGGNet, GoogLeNet and ResNet.
ImageNet Error (%): AlexNet 16.4, ZFNet 11.7, VGGNet 7.3, GoogLeNet 6.7, ResNet 3.6.
# Layers: AlexNet 8, ZFNet 8, VGGNet 19, GoogLeNet 22, ResNet 152.]
Supervised learning based image classification.
AlexNet [Krizhevsky et al. 2012]
Image: ImageNet Classification with Deep Convolutional Neural Networks
Input: 227× 227× 3
Layer 1: 2D convolution with 96 filters of size [11 × 11], stride 4, ‘ReLU’ activation.
Output shape: (W − F + 2P)/S + 1 = (227 − 11 + 2 × 0)/4 + 1 = 55 → [?, 55, 55, 96] (? is the batch dimension)
Parameters: 11 × 11 × 3 × 96 + 96 = 34,944
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
AlexNet [Krizhevsky et al. 2012]
Image: ImageNet Classification with Deep Convolutional Neural Networks
Input: 227× 227× 3→ After Conv1: 55× 55× 96
Layer 2: Max pooling with a [3 × 3] window, stride 2.
Output shape: (W − F + 2P)/S + 1 = (55 − 3 + 2 × 0)/2 + 1 = 27 → [?, 27, 27, 96]
Parameters: 0 (pooling has no learnable weights)
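The output-shape formula and the parameter count for the first two AlexNet layers can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and integer (floor) division is an assumption about how a non-integer result would be rounded.

```python
def output_size(w, f, s, p=0):
    """Spatial output size of a conv or pooling layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter for an F x F convolution."""
    return f * f * c_in * c_out + c_out


conv1 = output_size(227, 11, 4)      # Layer 1: 11x11 filters, stride 4, no padding
pool1 = output_size(conv1, 3, 2)     # Layer 2: 3x3 max pooling, stride 2
params1 = conv_params(11, 3, 96)     # 96 filters over 3 input channels
print(conv1, pool1, params1)
```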
AlexNet [Krizhevsky et al. 2012]
Image: Krizhevsky et al. 2012
Output shape   Layer
227×227×3      Input
55×55×96       Conv1: 96 11×11 filters at stride 4, pad 0
27×27×96       Max-pool1: 3×3 filters at stride 2
27×27×96       Norm1: Normalization layer
27×27×256      Conv2: 256 5×5 filters at stride 1, pad 2
13×13×256      Max-pool2: 3×3 filters at stride 2
13×13×256      Norm2: Normalization layer
13×13×384      Conv3: 384 3×3 filters at stride 1, pad 1
13×13×384      Conv4: 384 3×3 filters at stride 1, pad 1
13×13×256      Conv5: 256 3×3 filters at stride 1, pad 1
6×6×256        Max-pool3: 3×3 filters at stride 2
4096           FC6: 4096 neurons
4096           FC7: 4096 neurons
1000           FC8: 1000 neurons (class scores)
AlexNet [Krizhevsky et al. 2012]
Image: Krizhevsky et al. 2012
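What happens if we change the height and width of the input tensor
to a conv layer?
[Figure: input tensor (H × W × C1) → Conv 3×3 → output with Ch: C2]
The conv layer still works; only the output height and width change.
Can we change the input size of AlexNet?
No. This would change the dimensions of the input
tensor to the FC layers.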
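The sketch below traces the spatial size through the AlexNet conv and pooling layers (filter size, stride and padding as listed for AlexNet above) and reports how many values reach FC6. The 291×291 input is a hypothetical alternative chosen for illustration, and floor division is an assumed rounding rule.

```python
def output_size(w, f, s, p=0):
    """Spatial output size: (W - F + 2P) / S + 1, rounded down."""
    return (w - f + 2 * p) // s + 1


# (filter size, stride, padding) of each AlexNet conv / pooling layer.
ALEXNET_SPATIAL = [
    (11, 4, 0),  # Conv1
    (3, 2, 0),   # Max-pool1
    (5, 1, 2),   # Conv2
    (3, 2, 0),   # Max-pool2
    (3, 1, 1),   # Conv3
    (3, 1, 1),   # Conv4
    (3, 1, 1),   # Conv5
    (3, 2, 0),   # Max-pool3
]


def fc6_inputs(input_size, channels=256):
    """Number of values in the flattened tensor that feeds FC6."""
    size = input_size
    for f, s, p in ALEXNET_SPATIAL:
        size = output_size(size, f, s, p)
    return size * size * channels


print(fc6_inputs(227))  # the original input size
print(fc6_inputs(291))  # hypothetical larger input
```

The FC6 weight matrix is sized for one specific number of inputs, so a different flattened size no longer fits it.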
AlexNet [Krizhevsky et al. 2012]
Image: Krizhevsky et al. 2012
Number of Parameters: Overall, AlexNet
has about 61M parameters.
Use of ReLU
Norm layers – not common anymore
Data augmentation
Dropout 0.5 in FC layers.
Batch size 128
SGD Momentum 0.9
Initial learning rate 1e-2, reduced by 10x
manually when val accuracy plateaus.
Regularization: L2 weight decay 5e-4
Ensemble of 7 CNNs: error reduced from 18.2% to 15.4%
ZFNet [Zeiler and Fergus, 2013]
Image: Visualizing and Understanding Convolutional Networks
Similar to AlexNet, except:
Conv1: Change from 11×11 stride 4 to 7×7 stride 2.
Conv3,4,5: Change from 384, 384, 256 filters to 512, 1024, 512.
ImageNet top-5 error improved from 16.4% to 11.7%.
https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
VGGNet [Simonyan and Zisserman, 2014]
AlexNet: Input → 96-Conv 11×11 → Pool → 256-Conv 5×5 → Pool → 384-Conv 3×3 → 384-Conv 3×3 → 256-Conv 3×3 → Pool → FC 4096 → FC 4096 → FC 1000
VGG-16: Input → 2 × (64-Conv 3×3) → Pool → 2 × (128-Conv 3×3) → Pool → 3 × (256-Conv 3×3) → Pool → 3 × (512-Conv 3×3) → Pool → 3 × (512-Conv 3×3) → Pool → FC 4096 → FC 4096 → FC 1000
VGG-19: Input → 2 × (64-Conv 3×3) → Pool → 2 × (128-Conv 3×3) → Pool → 4 × (256-Conv 3×3) → Pool → 4 × (512-Conv 3×3) → Pool → 4 × (512-Conv 3×3) → Pool → FC 4096 → FC 4096 → FC 1000
Convolution layers: 3×3, stride 1, pad 1.
Pooling layers: 2×2 max-pool stride 2.
Images: from Unsplash
What will happen when objects are at different scales?
A stack of three 3×3 convolution (stride 1) layers
has the same effective receptive field as one 7×7
convolution layer.
The deeper structure allows more non-linearities.
The stack still has fewer parameters: 3 × (3 × 3 × Ci × Co) vs.
7 × 7 × Ci × Co (27 vs. 49 weights per channel pair when Ci = Co).
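Both claims can be verified numerically. The sketch below is illustrative only: the channel count C = 256 (with Ci = Co = C) is an assumption, and biases are ignored as in the slide's expression.

```python
def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1  # each stride-1 layer widens the field by (kernel - 1)
    return rf


def conv_weights(kernel, c_in, c_out):
    """Number of weights (biases ignored) in one convolution layer."""
    return kernel * kernel * c_in * c_out


c = 256  # hypothetical channel count, Ci = Co = C
rf = stacked_receptive_field(3, 3)
stack = 3 * conv_weights(3, c, c)
single = conv_weights(7, c, c)
print(rf, stack, single)
```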
VGGNet [Simonyan and Zisserman, 2014]
Memory: about 96 MB per image (forward pass)
Number of parameters: 138M
Similar training procedure to AlexNet.
Use ensembles for best results.
FC7 features generalize well to other tasks.
Most of the parameters (about 90%) are in the last FC
layers.
Main idea: smaller filters and deeper
networks.
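The parameter figures on this slide can be checked by counting weights and biases layer by layer. The sketch below uses the VGG-16 configuration listed earlier; the 224×224×3 input size is an assumption (it is not stated on the slide), and the helper name is hypothetical.

```python
# VGG-16 configuration: numbers are conv output channels, "P" is a 2x2 max-pool.
VGG16 = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
         512, 512, 512, "P", 512, 512, 512, "P"]
FC_UNITS = [4096, 4096, 1000]


def vgg16_param_counts(input_size=224, in_channels=3):
    """Return (conv parameters, FC parameters), biases included."""
    conv_total, size, channels = 0, input_size, in_channels
    for item in VGG16:
        if item == "P":
            size //= 2  # 2x2 max-pool with stride 2 halves height and width
        else:
            conv_total += 3 * 3 * channels * item + item
            channels = item
    fc_total, n_in = 0, size * size * channels
    for units in FC_UNITS:
        fc_total += n_in * units + units
        n_in = units
    return conv_total, fc_total


conv_total, fc_total = vgg16_param_counts()
total = conv_total + fc_total
print("total parameters:", total)
print("FC share (%):", round(100 * fc_total / total, 1))
```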
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
Computationally efficient deeper network:
Number of parameters: about 5M (12× fewer than AlexNet, 27× fewer than VGG-16)
No fully connected layers at the end. Global average pooling is used instead (each channel is averaged over its spatial dimensions).
22 layers (with efficient “Inception module”)
https://arxiv.org/pdf/1409.4842.pdf
GoogLeNet [Szegedy et al., 2014]
Images: from Unsplash
Inception module:
Design a good local network topology
Apply filters with different receptive field sizes to the input from the previous layer (1×1, 3×3,
5×5 convolutions and 3×3 pooling). Then concatenate all filter outputs along the channel dimension.
‘ReLU’ activation on the convolution modules.
Why use 1×1 convolutions?
GoogLeNet [Szegedy et al., 2014]
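[Figure: Inception module. The previous layer feeds four parallel branches (1×1 convolution; 1×1 convolution → 3×3 convolution; 1×1 convolution → 5×5 convolution; 3×3 max pooling → 1×1 convolution), whose outputs are joined by filter concatenation.]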
GoogLeNet [Szegedy et al., 2014]
[Figure: naive Inception module. Input: 28×28×256. Branch outputs: 128, 1×1 convolution → 28×28×128; 192, 3×3 convolution → 28×28×192; 96, 5×5 convolution → 28×28×96; 3×3 pooling → 28×28×256. Filter concatenation: 28×28×672.]
Inception module:
The channel dimension grows with depth of the network.
Very expensive to compute.
Path [3×3] has 3 × 3 × 192 × 256 = 442,368 parameters.
GoogLeNet [Szegedy et al., 2014]
[Figure: a 1×1 convolution with weights w1 maps the Ci input channels at each pixel to Co output channels.]
We can reduce (or increase) the channel dimension with 1×1 convolutions.
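A 1×1 convolution is a per-pixel linear map across channels. The sketch below shows this with plain nested lists; the function name, the toy feature map and the weights are all illustrative assumptions, and bias and activation are omitted.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias).

    feature_map: H x W x C_in nested lists.
    weights: C_in x C_out nested lists.
    Returns an H x W x C_out nested list.
    """
    c_out = len(weights[0])
    return [
        [
            [sum(pixel[i] * weights[i][o] for i in range(len(pixel)))
             for o in range(c_out)]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy example: a 2x2 feature map with 3 channels reduced to 1 channel.
x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [1, 0, 1]]]
w1 = [[1], [0], [-1]]  # hypothetical weights: channel 0 minus channel 2
y = conv1x1(x, w1)
print(y)
```

The spatial size (2×2) is unchanged; only the channel count goes from 3 to 1.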
GoogLeNet [Szegedy et al., 2014]
[Figure: Inception module with 1×1 bottlenecks. Input: 28×28×256. Branches: 128, 1×1 convolution → 28×28×128; 64, 1×1 convolution → 28×28×64 → 192, 3×3 convolution → 28×28×192; 64, 1×1 convolution → 28×28×64 → 96, 5×5 convolution → 28×28×96; 3×3 max pooling → 28×28×256 → 64, 1×1 convolution → 28×28×64. Filter concatenation: 28×28×480.]
“Bottleneck” modules [1×1] reduce computational cost and output shape.
Path [3×3] originally had 3 × 3 × 192 × 256 = 442,368 parameters.
Path [3×3] now has 1 × 1 × 64 × 256 + 3 × 3 × 192 × 64 = 126,976 parameters.
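The saving from the 1×1 bottleneck can be reproduced directly. This is a small sketch with biases ignored, using the filter counts from the figure (256 input channels, a 64-filter bottleneck, 192 3×3 filters and 96 5×5 filters); the helper name is hypothetical.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights (biases ignored) in one convolution layer."""
    return kernel * kernel * c_in * c_out


c_in = 256  # channels coming from the previous layer

# 3x3 path: directly on 256 channels vs. after a 64-filter 1x1 bottleneck.
naive_3x3 = conv_weights(3, c_in, 192)
bottleneck_3x3 = conv_weights(1, c_in, 64) + conv_weights(3, 64, 192)

# The same comparison for the 5x5 path.
naive_5x5 = conv_weights(5, c_in, 96)
bottleneck_5x5 = conv_weights(1, c_in, 64) + conv_weights(5, 64, 96)

print(naive_3x3, bottleneck_3x3)
print(naive_5x5, bottleneck_5x5)
```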
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
Some convolution + max pooling at start.
Stack “Inception module” on top of each other.
Max-pooling is inserted between some of the stacked modules.
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
No FC layers at the end. The output of the last Inception module is passed through global
average pooling and then the final “softmax” layer with 1000 classes.
Global Average Pooling (GAP)
Image: Global Average Pooling Layers for Object Localization
Normal pooling processes each channel independently with 2D masks.
In GAP, all the pixels in each channel are averaged (or the maximum is taken, in global
max-pooling) independently to produce a vector.
If the input to GAP is [B, H, W, C] the output will be [B, 1, 1, C].
Allows different input shapes at train and test times.
First paper to use GAP: Network In Network
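A minimal sketch of GAP on a single feature map (batch dimension omitted) shows why the input height and width no longer matter: the output length depends only on the number of channels. The function name and the toy inputs are assumptions for illustration.

```python
def global_average_pool(feature_map):
    """Average each channel over all spatial positions.

    feature_map: H x W x C nested lists. Returns a list of length C.
    """
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [
        sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]


small = [[[1.0, 10.0], [3.0, 30.0]]]              # 1 x 2 spatial grid, 2 channels
large = [[[1.0, 0.0]] * 3 for _ in range(4)]      # 4 x 3 spatial grid, 2 channels
print(global_average_pool(small))
print(global_average_pool(large))
```

Both inputs produce a vector of length 2, so the layer that follows GAP sees the same input size in both cases.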
https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/
https://arxiv.org/pdf/1312.4400.pdf
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
Deep networks suffer from vanishing gradients.
Auxiliary classification outputs inject additional gradient at lower layers.
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
Stochastic gradient descent with 0.9 momentum.
Fixed learning rate schedule: decreasing the learning rate by 4% every 8 epochs.
Data Augmentation and dropout (last layer) for preventing over-fitting.
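The fixed schedule above can be written as a simple function of the epoch number. This is a sketch only: the initial learning rate of 0.01 is a hypothetical value (the slide does not give one), and the function name is illustrative.

```python
def step_decay_lr(epoch, initial_lr=0.01, drop=0.04, every=8):
    """Learning rate at `epoch`, reduced by the fraction `drop` every `every` epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // every)


for epoch in (0, 7, 8, 16, 80):
    print(epoch, step_decay_lr(epoch))
```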
GoogLeNet [Szegedy et al., 2014]
Image: Going deeper with convolutions
Main ideas:
Efficient “Inception” module integrates information at multiple receptive fields.
No FC at the end (use GAP instead) which will reduce the number of parameters
significantly.
Auxiliary classification outputs to inject additional gradient at lower layers.
ILSVRC 2014 classification winner with 6.7% top 5 error.
CNN Architecture Evolution: Image Classification
[Figure: two bar charts over AlexNet, ZFNet, VGGNet, GoogLeNet and ResNet.
ImageNet Error (%): AlexNet 16.4, ZFNet 11.7, VGGNet 7.3, GoogLeNet 6.7, ResNet 3.6.
# Layers: AlexNet 8, ZFNet 8, VGGNet 19, GoogLeNet 22, ResNet 152.]
Should we keep making the network deeper to increase performance?
ResNet [He et al., 2015]
What happens if we increase the depth?
Image: Deep Residual Learning for Image Recognition
The deeper model performs worse than the shallow model. Maybe over-fitting?
The training error of the deeper model is also worse, so it is not over-fitting.
Hypothesis: training (optimizing) deeper models is harder.
https://arxiv.org/pdf/1512.03385.pdf
ResNet [He et al., 2015]
Learn a residual mapping at each layer, instead of trying to learn the underlying mapping.
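[Figure: plain block vs. residual block.
Plain block: X → 3×3 convolution → ReLU → 3×3 convolution → ReLU → H(X).
Residual block: X → 3×3 convolution → ReLU → 3×3 convolution → F(X); a skip connection adds X to F(X) to give H(X), followed by ReLU.]
H(X) = F(X) + X
F(X) = H(X) − X (the residual)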
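The residual formulation can be sketched with toy vectors. In the code below a matrix-vector product stands in for each 3×3 convolution (an assumption made to keep the example self-contained), and all names are illustrative. With all weights at zero, F(X) = 0 and the block passes its non-negative input through unchanged.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Toy stand-in for a convolution: a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """H(x) = ReLU(F(x) + x), with F(x) = W2 . ReLU(W1 . x)."""
    f = linear(relu(linear(x, w1)), w2)
    return relu([fi + xi for fi, xi in zip(f, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
out = residual_block(x, zeros, zeros)
print(out)
```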
ResNet [He et al., 2015]
[Figure: residual block. X → 3×3 convolution → ReLU → 3×3 convolution → F(X); X is added to F(X) to give H(X), followed by ReLU.]
Full ResNet architecture:
Stack residual blocks. Every residual block has two 3×3 conv layers.
Periodically, double the number of filters and downsample spatially using stride 2
(halving each spatial dimension).
ResNet [He et al., 2015]
For very deep networks use “bottleneck” blocks to
reduce computations.
If F(X) = 0, the block reduces to the identity: H(X) = X.
Combined with weight decay, some layers can become
identity mappings.
Skip connections also provide a direct path for gradients
to the bottom layers (close to the input).
[Figure: bottleneck residual block. X (256 channels) → 64, 1×1 convolution → 64, 3×3 convolution → 256, 1×1 convolution → added to X → H(X).]
ResNet [He et al., 2015]
[Figure: residual block. X → convolution → ReLU → convolution → F(X); X is added to F(X) to give H(X).]
ILSVRC 2015 classification winner (3.6% top 5 error)
Many layers: 152 layers on ImageNet, 1202 on CIFAR.
Additional conv layer at the beginning.
No FC layers at the end (only FC 1000 to output classes)
Training ResNet in practice:
Batch Normalization after every Convolution layer.
He normal initialization from He et al.
SGD + Momentum (0.9)
Learning rate: 0.1, divided by 10 when validation error
plateaus
Mini-batch size 256
Weight decay of 1e-4
No dropout used
Improvements/Modifications to ResNet
Improve residual block: Identity Mappings in Deep Residual Networks
Wide residual blocks not only deep: Wide Residual Networks
Residual blocks that share multi-scale convolutions from GoogLeNet:
Aggregated Residual Transformations for Deep Neural Networks
(ResNeXt)
Randomly dropping residual blocks during training (dropout applied to whole layers):
Deep Networks with Stochastic Depth
All convolutions in a block also get skip connections: Densely
Connected Convolutional Networks.
https://arxiv.org/pdf/1603.05027.pdf
https://arxiv.org/pdf/1605.07146.pdf
https://arxiv.org/pdf/1611.05431.pdf
https://arxiv.org/pdf/1603.09382.pdf
https://arxiv.org/pdf/1608.06993.pdf
Comparing Different CNN Architectures for Image Classification
Image: An Analysis of Deep Neural Network Models for Practical Applications.
The size of the blobs is proportional to the number of network parameters
AlexNet: low accuracy, high computational cost.
VGG: highest number of parameters.
ResNet: best accuracy, moderate complexity.
https://arxiv.org/pdf/1605.07678.pdf
Efficient Networks for Mobile Applications
MobileNet: MobileNets: Efficient Convolutional Neural Networks for
Mobile Vision Applications
SqueezeNet: SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5MB model size
ShuffleNet: ShuffleNet: An Extremely Efficient Convolutional Neural
Network for Mobile Devices
https://arxiv.org/pdf/1704.04861.pdf
https://arxiv.org/pdf/1602.07360.pdf
https://arxiv.org/pdf/1707.01083.pdf
Image Segmentation
Image: SegNet: Road Scene Segmentation
Predict a category label (class) for each pixel of the image.
Crop a patch around each pixel and classify its center pixel using a classification CNN?
Very expensive at inference time. Not feasible.
https://youtu.be/CxanE_W46ts
Image Segmentation
[Figure: input (H × W × Ci) → Conv → Conv → output (H × W × Co) → SoftMax]
Design a network with only convolutional layers without downsampling operators to make
predictions for pixels all at once. Co is the number of classes in the classification problem.
Very expensive, large memory requirement.
Image Segmentation
Image: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Encoder-decoder architecture for segmentation.
Encoder: the first part of the network, with convolution and pooling (or strided convolutions).
Decoder: the second part, with convolution and upsampling (or transpose convolutions).
https://arxiv.org/pdf/1511.00561.pdf
Image Segmentation
Fully Convolutional Networks for Semantic Segmentation
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation
Learning Deconvolution Network for Semantic Segmentation
U-Net: Convolutional Networks for Biomedical Image Segmentation
https://arxiv.org/pdf/1411.4038.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf
https://arxiv.org/pdf/1505.04597.pdf
Object Detection
Need to predict the location as well as the object class. The location is usually
defined by a bounding box.
What if there is only one object?
Single Object Detection
[Figure: AlexNet-style network (Image → 96-Conv 11x11 → Pool → 256-Conv 5x5 → Pool → 384-Conv 3x3 → 384-Conv 3x3 → 256-Conv 3x3 → Pool → FC 4096 → FC 4096) with two output heads: FC 1000 (Class) and FC 4 (Box).]
The problem decomposes into two sub-problems:
Predict the object class
Predict the object location (quantified by [Bx, By, H, W])
We can use any CNN (discussed in the classification section)
and change the last layer to accommodate the two
predictions.
Multi Object Detection
Image: RCNN
Use a Region proposal algorithm
(in traditional CV) to get initial
bounding box proposals.
Resize each proposed region to
fixed size.
Treat each proposal as a single
object detection problem and
follow the procedure in last slide.
The region proposal stage is subject to errors. Overlapping
objects cause technical difficulties, so
non-maximum suppression is needed.
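Non-maximum suppression keeps the highest-scoring box and discards other boxes that overlap it too much. The sketch below is a simple greedy version; the (x1, y1, x2, y2) box format, the 0.5 IoU threshold and the example boxes are assumptions for illustration, and boxes are assumed to have non-zero area.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In practice this is usually applied per class, so overlapping objects of different classes are not suppressed against each other.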
https://arxiv.org/pdf/1311.2524.pdf
Multi Object Detection
Instead of a region proposal algorithm
from traditional CV, use a CNN-based
region proposal algorithm to get the
initial bounding box proposals.
Treat each proposal as a single object
detection problem and follow the
procedure in the last slide.
This gives an end-to-end network for object
detection.
Multi Object Detection
Use a backbone network for CNN-based region proposal algorithms.
Multi Object Detection
Faster RCNN
SSD: Single Shot MultiBox Detector
You Only Look Once: Unified, Real-Time Object Detection (YOLO)
Speed/accuracy trade-offs for modern convolutional object detectors
https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
https://arxiv.org/pdf/1512.02325.pdf
https://arxiv.org/pdf/1506.02640.pdf
https://arxiv.org/pdf/1611.10012.pdf
Summary
Famous networks for image classification, segmentation and object detection.
AlexNet: showed that you can use CNNs to train Computer Vision
models.
VGG: showed that deeper networks with smaller convolution filters work better.
GoogLeNet: Focus on efficiency using 1x1 bottleneck convolutions and
global avg pool instead of FC layers.
ResNet: showed us how to train extremely deep networks.
After ResNet: CNNs outperformed the human benchmark and the focus
shifted to efficient networks:
Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet
ResNet is currently a good default to use.