Deep Learning – COSC2779 – Vision Application & CNN Architectures

Dr. Ruwan Tennakoon

Aug 16, 2021

Lecture 5 (Part 1) Deep Learning – COSC2779 Aug 16, 2021 1 / 51

Outline

1 Image Classification

AlexNet

VGGNet

GoogLeNet

ResNet
2 Object Detection & Segmentation

Last Week: Pooling & Convolutions

[Figure: input tensor (H × W × C1) → Conv 3×3 (Ch: C2) → Activation + Pool → Conv 3×3 (Ch: C3)]

Convolutions can be combined with pooling to construct a chain of layers.

Feature extraction usually happens locally – sparse connectivity.
In feature extraction the same operation is applied at different locations – parameter
sharing.
Pooling helps reduce redundant information and provides some level of invariance to
translations.
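
To make sparse connectivity and parameter sharing concrete, the sketch below compares the parameter count of a 3×3 convolution with that of a fully connected layer that produces an output of the same spatial size. The input size (32×32×3) and the number of output channels (16) are illustrative assumptions, not values from the lecture.

```python
def conv_params(kernel, c_in, c_out):
    """Weights + biases of a conv layer; independent of image height/width."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(h, w, c_in, c_out):
    """Weights + biases of a fully connected layer mapping an H x W x C_in
    input to an H x W x C_out output (every input connected to every output)."""
    n_in = h * w * c_in
    n_out = h * w * c_out
    return n_in * n_out + n_out


# Hypothetical example sizes (assumed for illustration only).
H, W, C1, C2 = 32, 32, 3, 16

print("conv 3x3 parameters:", conv_params(3, C1, C2))
print("dense parameters   :", dense_params(H, W, C1, C2))
```

The convolution count does not depend on H or W, because the same small filter is reused at every location.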

Last Week: LeNet Architecture

“LeNet is a classic example of convolutional neural network to successfully predict
handwritten digits.” [LeNet]

https://ieeexplore.ieee.org/abstract/document/726791

How Can We Design the Architecture?

There are many hyperparameters to choose when designing a CNN:
Number of convolutional layers, filter size, number of filters, stride,
initialization . . .
Pooling size, number of pooling layers . . .
Number of FC layers, units . . .
Optimizer type, learning rate . . .
. . .

Use classic networks like LeNet, AlexNet, VGG-16, VGG-19, ResNet
etc. as inspiration (follow the trend used in those architectures).

Objectives of the Lecture

Identify how deep networks are developed via a case study: image
classification (ImageNet).
Understand the main trends in classic architectures and why they work.
Identify the classic network architectures used for common computer
vision problems:

Image Classification
Object Detection
Image Segmentation

Image Classification

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition held
between 2010 and 2017.

The datasets comprised approximately 1 million images and 1,000 object classes.

The annual challenge focused on multiple tasks, including image classification.

Image source: ImageNet

Alex Krizhevsky et al., in “ImageNet Classification with Deep Convolutional Neural Networks”,
developed a convolutional neural network that achieved top results on the ILSVRC-2010 and
ILSVRC-2012 image classification tasks.

CNN Architecture Evolution: Image Classification

Network      ImageNet top-5 error (%)    # Layers
AlexNet      16.4                        8
ZFNet        11.7                        8
VGGNet       7.3                         19
GoogLeNet    6.7                         22
ResNet       3.6                         152

Supervised learning based image classification.

AlexNet [Krizhevsky et al. 2012]

Image: ImageNet Classification with Deep Convolutional Neural Networks

Input: 227 × 227 × 3

Layer 1: 2D convolution with 96 filters of size 11 × 11 and stride 4. ‘ReLU’ activation.
Output shape: (W − F + 2P)/S + 1 = (227 − 11 + 2 × 0)/4 + 1 = 55 → [?, 55, 55, 96]
(W: input width, F: filter size, P: padding, S: stride, ?: batch size)
Parameters: 11 × 11 × 3 × 96 + 96 = 34,944
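
The output-shape rule above can be checked with a few lines of Python. This is a minimal sketch: the function name `output_size` is made up for illustration, and the floor division mirrors what most frameworks do when the filter does not tile the input evenly (an assumption, since the slide only shows an exact case).

```python
def output_size(w, f, s=1, p=0):
    """Spatial output size of a conv or pooling layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


# AlexNet Conv1: 227 wide input, 11x11 filters, stride 4, no padding.
conv1_size = output_size(227, 11, s=4, p=0)

# Weights (11 x 11 x 3 per filter, 96 filters) plus one bias per filter.
conv1_params = 11 * 11 * 3 * 96 + 96

print("Conv1 output:", conv1_size, "x", conv1_size, "x 96")
print("Conv1 parameters:", conv1_params)
```

The same function applies to pooling layers, which use the same size rule but have no parameters.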

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

AlexNet [Krizhevsky et al. 2012]

Image: ImageNet Classification with Deep Convolutional Neural Networks

Input: 227 × 227 × 3 → After Conv1: 55 × 55 × 96

Layer 2: Max pooling with a 3 × 3 window and stride 2.
Output shape: (W − F + 2P)/S + 1 = (55 − 3 + 2 × 0)/2 + 1 = 27 → [?, 27, 27, 96]
Parameters: 0 (pooling has no learnable weights)

AlexNet [Krizhevsky et al. 2012]

Image: Krizhevsky et al. 2012

Output size    Layer
227×227×3      Input
55×55×96       Conv1: 96 11×11 filters at stride 4, pad 0
27×27×96       Max-pool1: 3×3 filters at stride 2
27×27×96       Norm1: Normalization layer
27×27×256      Conv2: 256 5×5 filters at stride 1, pad 2
13×13×256      Max-pool2: 3×3 filters at stride 2
13×13×256      Norm2: Normalization layer
13×13×384      Conv3: 384 3×3 filters at stride 1, pad 1
13×13×384      Conv4: 384 3×3 filters at stride 1, pad 1
13×13×256      Conv5: 256 3×3 filters at stride 1, pad 1
6×6×256        Max-pool3: 3×3 filters at stride 2
4096           FC6: 4096 neurons
4096           FC7: 4096 neurons
1000           FC8: 1000 neurons (class scores)
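
The layer listing above can be traced programmatically. The sketch below walks the conv and pooling layers, records each output shape and sums the parameters. It makes two simplifying assumptions: the normalization layers are skipped (they change neither the shape nor the parameter count), and every convolution sees all input channels, ignoring the two-GPU split of the original network. Under that second assumption the total comes out slightly above the usually quoted figure of about 61M.

```python
def out_size(w, f, s, p):
    """(W - F + 2P) / S + 1, floored."""
    return (w - f + 2 * p) // s + 1


# (name, kind, filters, kernel, stride, pad), following the layer listing.
LAYERS = [
    ("Conv1", "conv", 96, 11, 4, 0),
    ("Max-pool1", "pool", None, 3, 2, 0),
    ("Conv2", "conv", 256, 5, 1, 2),
    ("Max-pool2", "pool", None, 3, 2, 0),
    ("Conv3", "conv", 384, 3, 1, 1),
    ("Conv4", "conv", 384, 3, 1, 1),
    ("Conv5", "conv", 256, 3, 1, 1),
    ("Max-pool3", "pool", None, 3, 2, 0),
]
FC_UNITS = [4096, 4096, 1000]


def trace(size=227, channels=3):
    """Return the per-layer output shapes and the total parameter count."""
    total = 0
    shapes = []
    for name, kind, filters, k, s, p in LAYERS:
        size = out_size(size, k, s, p)
        if kind == "conv":
            total += k * k * channels * filters + filters  # weights + biases
            channels = filters
        shapes.append((name, size, size, channels))
    features = size * size * channels  # flattened input to FC6
    for units in FC_UNITS:
        total += features * units + units
        features = units
    return shapes, total


shapes, total = trace()
for shape in shapes:
    print(shape)
print("total parameters (ungrouped convolutions):", total)
```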

AlexNet [Krizhevsky et al. 2012]

Image: Krizhevsky et al. 2012

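What happens if we change the height and width of the input tensor to a conv layer?
The convolution still applies; only the output height and width change.

[Figure: input tensor (H × W × C1) → Conv 3×3 (Ch: C2)]

Can we change the input size of AlexNet?
No. This would change the dimensions of the input
tensor to the FC layers.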

AlexNet [Krizhevsky et al. 2012]

Image: Krizhevsky et al. 2012

Number of parameters: overall, AlexNet
has about 61M parameters.
Use of ReLU.
Norm layers (not common anymore).
Data augmentation.
Dropout 0.5 in FC layers.
Batch size 128.
SGD with momentum 0.9.
Initial learning rate 1e-2, reduced by 10x
manually when validation accuracy plateaus.
Regularization: L2 weight decay 5e-4.
Ensemble of 7 CNNs: error improved from 18.2% to 15.4%.

ZFNet [Zeiler and Fergus, 2013]

Image: Visualizing and Understanding Convolutional Networks

Similar to AlexNet:
Conv1: Change from 11×11 stride 4 to 7×7 stride 2.
Conv3,4,5: Change from 384, 384, 256 filters to 512, 1024, 512.
ImageNet top-5 error improved from 16.4% to 11.7%.

https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

VGGNet [Simonyan and Zisserman, 2014]

AlexNet: Input → 96-Conv 11×11 → Pool → 256-Conv 5×5 → Pool → 384-Conv 3×3 → 384-Conv 3×3 → 256-Conv 3×3 → Pool → FC 4096 → FC 4096 → FC 1000

VGG-16: Input → 2 × [64-Conv 3×3] → Pool → 2 × [128-Conv 3×3] → Pool → 3 × [256-Conv 3×3] → Pool → 3 × [512-Conv 3×3] → Pool → 3 × [512-Conv 3×3] → Pool → FC 4096 → FC 4096 → FC 1000

VGG-19: Input → 2 × [64-Conv 3×3] → Pool → 2 × [128-Conv 3×3] → Pool → 4 × [256-Conv 3×3] → Pool → 4 × [512-Conv 3×3] → Pool → 4 × [512-Conv 3×3] → Pool → FC 4096 → FC 4096 → FC 1000

Convolution layers: 3×3, stride 1, pad 1.
Pooling layers: 2×2 max-pool stride 2.

What will happen when objects are at different scales?

A stack of three 3×3 convolution layers (stride 1)
has the same effective receptive field as one 7×7
convolution layer.

The deeper structure allows more non-linearities.

The stack still has fewer parameters: 3 × (3 × 3 × C × C) = 27C² vs.
7 × 7 × C × C = 49C², assuming C channels in and out of every layer.
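
The receptive-field and parameter claims can be verified numerically. This is a small sketch: the helper names are made up, biases are ignored as on the slide, and the channel count C = 256 is an arbitrary assumption used only to get concrete numbers.

```python
def receptive_field(kernels, strides=None):
    """Effective receptive field of a stack of conv layers."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= s             # strides multiply the step between outputs
    return rf


def stack_weights(kernels, channels):
    """Weights of a conv stack with `channels` in and out of every layer."""
    return sum(k * k * channels * channels for k in kernels)


C = 256  # hypothetical channel count

print("receptive field, three 3x3:", receptive_field([3, 3, 3]))
print("receptive field, one 7x7  :", receptive_field([7]))
print("weights, three 3x3:", stack_weights([3, 3, 3], C))
print("weights, one 7x7  :", stack_weights([7], C))
```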

VGGNet [Simonyan and Zisserman, 2014]

Memory: 96MB per image (forward pass)

Number of parameters: 138M

Similar training procedure to AlexNet.
Use ensembles for best results.
FC7 features generalize well to other tasks.
Most of the parameters sit in the last FC
layers (about 89%).

Main idea: smaller filters and deeper
networks.
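
The parameter figures for VGG-16 can be reproduced by walking the layer configuration. The sketch below assumes a 224×224×3 input (the size used in the VGG paper, not stated on the slide) and counts weights and biases for every 3×3 convolution and FC layer. Under these assumptions the FC layers hold roughly 89% of all parameters.

```python
# Number of filters per 3x3 conv layer; "P" marks a 2x2 max-pool with stride 2.
VGG16_CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
             512, 512, 512, "P", 512, 512, 512, "P"]
FC_UNITS = [4096, 4096, 1000]


def count_vgg_params(cfg, fc_units, size=224, channels=3):
    """Return (conv parameters, FC parameters), biases included."""
    conv_total = 0
    for item in cfg:
        if item == "P":
            size //= 2  # pooling halves height and width
        else:
            conv_total += 3 * 3 * channels * item + item
            channels = item
    fc_total = 0
    features = size * size * channels  # flattened input to the first FC layer
    for units in fc_units:
        fc_total += features * units + units
        features = units
    return conv_total, fc_total


conv_total, fc_total = count_vgg_params(VGG16_CFG, FC_UNITS)
total = conv_total + fc_total
print("conv parameters:", conv_total)
print("FC parameters  :", fc_total)
print("total          :", total)
print("FC share       : %.1f%%" % (100.0 * fc_total / total))
```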

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

Computationally efficient deeper network:
Number of parameters: 5M (12x fewer than AlexNet, 27x fewer than VGG-16).
No fully connected layers at the end; each channel is averaged over its spatial positions instead.
22 layers (with the efficient “Inception module”).

https://arxiv.org/pdf/1409.4842.pdf

GoogLeNet [Szegedy et al., 2014]

[Figure: Inception module. The previous layer feeds four parallel branches: 1×1 convolution; 1×1 convolution → 3×3 convolution; 1×1 convolution → 5×5 convolution; 3×3 max pooling → 1×1 convolution. All branch outputs go into a filter concatenation.]

Inception module:
Design a good local network topology.
Apply filters with different receptive field sizes to the input from the previous layer (1×1, 3×3,
5×5, 3×3 pooling). Then concatenate all filter outputs together along the channel dimension.
‘ReLU’ activation on the convolution modules.
Why use 1×1 convolutions?

GoogLeNet [Szegedy et al., 2014]

[Figure: naive Inception module. Input: 28×28×256. Parallel branches: 128 1×1 convolutions → 28×28×128; 192 3×3 convolutions → 28×28×192; 96 5×5 convolutions → 28×28×96; 3×3 pooling → 28×28×256. Filter concatenation → 28×28×672.]

Inception module:
The channel dimension grows with the depth of the network.
Very expensive to compute.
Path [3×3] has 3 × 3 × 192 × 256 = 442,368 parameters.

GoogLeNet [Szegedy et al., 2014]

[Figure: a 1×1 convolution with weights w1 maps an input with Ci channels to an output with Co channels at every spatial position]

We can reduce (or increase) the channel dimension with 1×1 convolutions.
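
A 1×1 convolution is the same small matrix multiplication applied to the channel vector of every pixel. The sketch below implements it with plain nested lists to keep it dependency-free; the tiny 2×2 input and the weights are made-up values, and the bias term is omitted for brevity.

```python
def conv1x1(x, w):
    """Apply a 1x1 convolution.

    x: H x W x C_in nested lists, w: C_in x C_out nested lists.
    Returns an H x W x C_out nested list (no bias).
    """
    c_out = len(w[0])
    return [
        [
            [sum(pixel[i] * w[i][o] for i in range(len(pixel)))
             for o in range(c_out)]
            for pixel in row
        ]
        for row in x
    ]


# Hypothetical 2 x 2 input with 3 channels, reduced to 1 channel.
x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [1, 0, 1]]]
w = [[1], [0], [-1]]  # C_in = 3, C_out = 1

y = conv1x1(x, w)
print(y)
```

Height and width are untouched; only the number of channels changes, here from 3 to 1.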

GoogLeNet [Szegedy et al., 2014]

[Figure: Inception module with 1×1 bottlenecks. Input: 28×28×256. Branches: 128 1×1 convolutions → 28×28×128; 64 1×1 convolutions (28×28×64) → 192 3×3 convolutions → 28×28×192; 64 1×1 convolutions (28×28×64) → 96 5×5 convolutions → 28×28×96; 3×3 max pooling (28×28×256) → 64 1×1 convolutions → 28×28×64. Filter concatenation → 28×28×480.]

“Bottleneck” modules [1×1] reduce computational cost and output shape.
Path [3×3] originally had 3 × 3 × 192 × 256 = 442,368 parameters.
Path [3×3] now has 1 × 1 × 64 × 256 + 3 × 3 × 192 × 64 = 126,976 parameters.
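
The savings from the 1×1 bottleneck can be recomputed directly from the channel counts on the slide. As on the slide, this sketch counts weights only (no biases); the helper name `conv_weights` is made up for illustration.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


C_IN = 256  # channels of the 28 x 28 x 256 input

# 3x3 path without and with a 64-channel 1x1 bottleneck in front of it.
naive_3x3 = conv_weights(3, C_IN, 192)
bottleneck_3x3 = conv_weights(1, C_IN, 64) + conv_weights(3, 64, 192)

# Output channels after concatenating all four branches.
naive_out = 128 + 192 + 96 + C_IN    # the pooling branch keeps all 256 channels
bottleneck_out = 128 + 192 + 96 + 64  # the pooling branch is reduced to 64

print("3x3 path weights:", naive_3x3, "->", bottleneck_3x3)
print("output channels :", naive_out, "->", bottleneck_out)
```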

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

Some convolution + max pooling layers at the start.
Stack “Inception modules” on top of each other.
Max-pooling is inserted between some of the stacked modules.

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

No FC layers at the end. The output of the last Inception module is passed through global
average pooling and then the final “softmax” layer with 1000 classes.

Global Average Pooling (GAP)

Image: Global Average Pooling Layers for Object Localization

Normal pooling processes each channel
independently with 2D windows.
In GAP, all the pixels in each channel
are averaged (or the maximum is taken, in global
max-pooling) independently to
produce a vector.
If the input to GAP is [B, H, W, C],
the output will be [B, 1, 1, C].
Allows different input shapes at train
and test times.
First paper to use GAP: Network In
Network
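
The reason GAP tolerates different input sizes is that its output length depends only on the number of channels. The sketch below shows this for a single image stored as nested lists; the feature-map sizes (6×6 and 8×8 with 4 channels) and the dummy values are assumptions made purely for illustration.

```python
def global_average_pool(x):
    """x: H x W x C nested lists for one image. Returns a list of C averages."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    return [
        sum(x[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]


def make_feature_map(h, w, c):
    """Hypothetical dummy feature map with deterministic values."""
    return [[[float(i + j + k) for k in range(c)] for j in range(w)]
            for i in range(h)]


small = make_feature_map(6, 6, 4)
large = make_feature_map(8, 8, 4)

# GAP output length is C in both cases ...
print(len(global_average_pool(small)), len(global_average_pool(large)))
# ... while flattening gives a length that depends on H and W.
print(6 * 6 * 4, 8 * 8 * 4)
```

A layer placed after GAP therefore always sees C inputs, whereas an FC layer placed after flattening would need a different weight matrix for each input size.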

https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/
https://arxiv.org/pdf/1312.4400.pdf

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

Deep networks suffer from vanishing gradients.
Auxiliary classification outputs inject additional gradient at lower layers.

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

Stochastic gradient descent with 0.9 momentum.
Fixed learning rate schedule: decreasing the learning rate by 4% every 8 epochs.
Data Augmentation and dropout (last layer) for preventing over-fitting.
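
The fixed schedule described above (a 4% decrease every 8 epochs) can be written as a one-line step decay. This is only a sketch: the function name and the base learning rate of 0.01 are assumptions, since the slide does not give the starting value.

```python
def step_decay_lr(epoch, base_lr, drop=0.04, every=8):
    """Learning rate after decreasing by `drop` (4%) every `every` (8) epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)


BASE_LR = 0.01  # hypothetical starting learning rate

for epoch in (0, 7, 8, 16, 80):
    print(epoch, step_decay_lr(epoch, BASE_LR))
```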

GoogLeNet [Szegedy et al., 2014]

Image: Going deeper with convolutions

Main ideas:
Efficient “Inception” module integrates information at multiple receptive fields.
No FC at the end (use GAP instead) which will reduce the number of parameters
significantly.
Auxiliary classification outputs to inject additional gradient at lower layers.
ILSVRC 2014 classification winner with 6.7% top 5 error.

CNN Architecture Evolution: Image Classification

[Figure: ImageNet error (%) and number of layers per network. AlexNet: 16.4%, 8 layers; ZFNet: 11.7%, 8 layers; VGGNet: 7.3%, 19 layers; GoogLeNet: 6.7%, 22 layers; ResNet: 3.6%, 152 layers.]

Should we keep making the network deeper and deeper to increase performance?

ResNet [He et al., 2015]

What happens if we increase the depth?

Image: Deep Residual Learning for Image Recognition

The deeper model performs worse than the shallow model. Maybe over-fitting?

The training error of the deeper model is also worse, so this is not over-fitting.
Hypothesis: training (optimizing) deeper models is harder.

https://arxiv.org/pdf/1512.03385.pdf

ResNet [He et al., 2015]

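Learn a residual mapping at each layer, instead of trying to learn the underlying mapping.

[Figure: left, a plain block: X (previous layer) → 3×3 convolution → ReLU → 3×3 convolution → ReLU → H(X). Right, a residual block: X (previous layer) → 3×3 convolution → ReLU → 3×3 convolution → F(X); a skip connection adds X to F(X), followed by ReLU, giving H(X).]

H(X) = F(X) + X

F(X) = H(X) − X is the residual.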
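
The relation H(X) = F(X) + X can be sketched in a few lines. To stay short and dependency-free, the example below uses small dense layers on a vector in place of the 3×3 convolutions; that substitution, and all names and weight values, are assumptions made for illustration only.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Dense layer without bias; stands in for a 3x3 convolution here."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    f = linear(relu(linear(x, w1)), w2)            # F(X): layer, ReLU, layer
    return relu([fi + xi for fi, xi in zip(f, x)])  # H(X) = F(X) + X, then ReLU


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights F(X) = 0, so the block passes X through unchanged.
print(residual_block(x, zeros, zeros))
# With identity weights F(X) = X (for non-negative X), so H(X) = 2X.
print(residual_block(x, identity, identity))
```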

ResNet [He et al., 2015]

Full ResNet architecture:
Stack residual blocks. Every residual block has two 3×3 conv layers.
Periodically, double the number of filters and downsample spatially using stride 2
(/2 in each dimension).

ResNet [He et al., 2015]

For very deep networks use “bottleneck” blocks to
reduce computations.

If F(X) = 0 then the block is the identity: H(X) = X.

Combined with weight decay, some layers can be pushed
towards the identity.

Skip connections also provide a direct path for gradients
to the bottom layers (close to the input).

[Figure: bottleneck block. X (256 channels) → 64 1×1 convolutions → 64 3×3 convolutions → 256 1×1 convolutions → F(X); a skip connection adds X, giving H(X).]
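
The saving from the bottleneck design can be estimated by counting weights. The sketch below compares the 256 → 64 → 64 → 256 bottleneck from the slide with a plain residual block of two 3×3 convolutions on 256 channels; that comparison block is an assumption for illustration, and biases are ignored.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


# Bottleneck block: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256 channels.
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# Plain residual block: two 3x3 convolutions that keep 256 channels.
plain = 2 * conv_weights(3, 256, 256)

print("bottleneck block weights:", bottleneck)
print("plain block weights     :", plain)
print("ratio: %.1fx" % (plain / bottleneck))
```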

ResNet [He et al., 2015]

ILSVRC 2015 classification winner (3.6% top-5 error).
Many layers: 152 layers on ImageNet, 1202 on CIFAR.
Additional conv layer at the beginning.
No FC layers at the end (only FC 1000 to output classes).

Training ResNet in practice:
Batch Normalization after every convolution layer.
He normal initialization from He et al.
SGD + momentum (0.9).
Learning rate: 0.1, divided by 10 when validation error
plateaus.
Mini-batch size 256.
Weight decay of 1e-4.
No dropout used.

Improvements/Modifications to ResNet

Improved residual block: Identity Mappings in Deep Residual Networks
Wide residual blocks, not only deep ones: Wide Residual Networks
Residual blocks with parallel multi-branch convolutions, as in GoogLeNet:
Aggregated Residual Transformations for Deep Neural Networks
(ResNeXt)
Randomly dropping residual blocks during training (similar to dropout): Deep Networks with
Stochastic Depth
Every convolution in a block also receives skip connections from the earlier layers: Densely
Connected Convolutional Networks.

https://arxiv.org/pdf/1603.05027.pdf
https://arxiv.org/pdf/1605.07146.pdf
https://arxiv.org/pdf/1611.05431.pdf
https://arxiv.org/pdf/1603.09382.pdf
https://arxiv.org/pdf/1608.06993.pdf

Comparing Different CNN Architectures for Image Classification

Image: An Analysis of Deep Neural Network Models for Practical Applications.

The size of the blobs is proportional to the number of network parameters.

AlexNet: low accuracy, high computational cost.
VGG: highest number of parameters.
ResNet: best accuracy, moderate complexity.

https://arxiv.org/pdf/1605.07678.pdf

Efficient Networks for Mobile Applications

MobileNets: MobileNets: Efficient Convolutional Neural Networks for
Mobile Vision Applications.
SqueezeNet: SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5MB model size.
ShuffleNet: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices.

https://arxiv.org/pdf/1704.04861.pdf
https://arxiv.org/pdf/1602.07360.pdf
https://arxiv.org/pdf/1707.01083.pdf

Image Segmentation

Image: SegNet: Road Scene Segmentation

Predict a category label (class) for each pixel of the image.
Crop a patch around each pixel and classify its center pixel using a classification CNN?
Very expensive to do inference. Not feasible.

https://youtu.be/CxanE_W46ts

Image Segmentation

[Figure: input (H × W × Ci) → Conv → Conv → output (H × W × Co) → SoftMax]

Design a network with only convolutional layers and no downsampling operators, to make
predictions for all pixels at once.
Co is the number of classes in the classification problem.
Very expensive, large memory requirement.

Image Segmentation

Image: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Encoder-decoder architecture for segmentation.
Encoder: first part of the network, with convolution and pooling (or strided convolutions).
Decoder: second part, with convolution and upsampling (or transpose convolutions).

https://arxiv.org/pdf/1511.00561.pdf

Image Segmentation

Fully Convolutional Networks for Semantic Segmentation
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Learning Deconvolution Network for Semantic Segmentation
U-Net: Convolutional Networks for Biomedical Image Segmentation

https://arxiv.org/pdf/1411.4038.pdf
https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf
https://arxiv.org/pdf/1505.04597.pdf

Object Detection

Need to predict the location as well as the object class.
The location is usually defined by a bounding box.
What if there is only one object?

Single Object Detection

[Figure: Image → 96-Conv 11×11 → Pool → 256-Conv 5×5 → Pool → 384-Conv 3×3 → 384-Conv 3×3 → 256-Conv 3×3 → Pool → FC 4096 → FC 4096 → two output heads: FC 1000 (Class) and FC 4 (Box)]

The problem decomposes into two sub-problems:
Predict the object class.
Predict the object location (quantified by [Bx, By, H, W]).
Can use any CNN (discussed in the classification section) and change the last layer to
accommodate the two predictions.

Multi Object Detection

Image: RCNN

Use a region proposal algorithm (from traditional CV) to get initial bounding box proposals.
Resize each proposed region to a fixed size.
Treat each proposal as a single object detection problem and follow the procedure in the last slide.
The initial proposal step is subject to errors.
Technical difficulties when objects overlap. Needs non-maximum suppression.
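
The suppression step mentioned for overlapping detections is commonly implemented as greedy non-maximum suppression (NMS) based on intersection over union (IoU). The sketch below is a minimal version under stated assumptions: boxes are given as corner coordinates (x1, y1, x2, y2) rather than the [Bx, By, H, W] form used earlier, the 0.5 threshold is a typical but arbitrary choice, and the example boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

print(nms(boxes, scores))
```

In practice NMS is usually run separately for each class, so that overlapping objects of different classes are not suppressed.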
https://arxiv.org/pdf/1311.2524.pdf

Multi Object Detection

Instead of a region proposal algorithm from traditional CV, use a CNN-based region proposal
algorithm to get initial bounding box proposals.
Treat each proposal as a single object detection problem and follow the procedure in the last slide.
End-to-end network for object detection.

Multi Object Detection

Use a backbone network for CNN-based region proposal algorithms.

Multi Object Detection

Faster RCNN
SSD: Single Shot MultiBox Detector
You Only Look Once: Unified, Real-Time Object Detection (YOLO)
Speed/accuracy trade-offs for modern convolutional object detectors

https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
https://arxiv.org/pdf/1512.02325.pdf
https://arxiv.org/pdf/1506.02640.pdf
https://arxiv.org/pdf/1611.10012.pdf

Summary

Famous networks for image classification, segmentation and object detection.
AlexNet: showed that you can use CNNs to train computer vision models.
VGG: showed that bigger networks with smaller conv filters work better.
GoogLeNet: focused on efficiency, using 1×1 bottleneck convolutions and global average pooling instead of FC layers.
ResNet: showed us how to train extremely deep networks.
After ResNet: CNNs were better than the human metric and the focus shifted to efficient networks: lots of tiny networks aimed at mobile devices, such as MobileNet and ShuffleNet.
ResNet is currently a good default to use.