14-cnn-architectures
Qiuhong Ke
CNN Architectures
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
Outline
• LeNet5
• AlexNet
• VGG
• GoogleNet
• ResNet
Architecture
LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.
LeNet5 AlexNet VGG GoogleNet ResNet
1st layer
32x32x1
Convolutional layer
No.filters: 6
Filter size: 5×5
Padding: 0
Stride:1
28x28x6
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
(32 – 5+1)/1= 28
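The no-padding rule above can be checked with a few lines of Python. This is a minimal sketch; the function name `conv_output_size` is an illustrative choice, not something from the slides.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1):
    """Output width/height of a conv or pooling layer with no padding."""
    return math.ceil((input_size - kernel_size + 1) / stride)

# LeNet5 1st layer: 32x32 input, 5x5 filter, stride 1
print(conv_output_size(32, 5, 1))  # 28
```

The same function applies to the subsampling and pooling layers later in the slides, since only the window size and stride matter.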
2nd layer
Subsampling layer
Filter size: 2×2
Padding: 0
Stride:2
28x28x6
• Take the sum of all units in the 2×2 window
• output of each patch = sum*coefficient (trainable) + bias (trainable)
14x14x6
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
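The subsampling rule (sum of the window, times a trainable coefficient, plus a trainable bias) might be sketched as follows for a single channel. The function name, the toy input, and the coefficient and bias values are assumptions for illustration only.

```python
def subsample(feature_map, coefficient, bias, size=2, stride=2):
    """Sum each size x size window, then scale and shift the sum."""
    height, width = len(feature_map), len(feature_map[0])
    output = []
    for i in range(0, height - size + 1, stride):
        row = []
        for j in range(0, width - size + 1, stride):
            window_sum = sum(feature_map[i + di][j + dj]
                             for di in range(size) for dj in range(size))
            row.append(window_sum * coefficient + bias)
        output.append(row)
    return output

toy_map = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
print(subsample(toy_map, coefficient=0.25, bias=0.0))
# [[3.5, 5.5], [11.5, 13.5]]
```

With a coefficient of 0.25 and zero bias this reduces to average pooling; in the layer described above, each channel has its own trainable coefficient and bias.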
3rd layer
Convolutional layer
No.filters: 16
Filter size: 5×5
Padding: 0
Stride:1
14x14x6
10x10x16
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
(14-5+1)/1=10
Convolution on Multiple-channel input
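Input channels: R, G, B
Kernel: same number of channels (depth) as the input
Each channel is convolved with the matching kernel slice: R * K(Depth 1), G * K(Depth 2), B * K(Depth 3)
The element-wise sum of the three results gives one output channel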
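A small pure-Python sketch of convolution on a multi-channel input is given below. The helper names, the 3x3 toy image and the 2x2 kernel slices are all made up for illustration; the operation is the cross-correlation usually called convolution in CNNs.

```python
def conv2d_single(channel, kernel):
    """No-padding, stride-1 convolution of one channel with one 2D kernel slice."""
    k = len(kernel)
    rows = len(channel) - k + 1
    cols = len(channel[0]) - k + 1
    return [[sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(cols)] for i in range(rows)]

def conv2d_multichannel(image, kernel):
    """Convolve each channel with its kernel slice, then sum element-wise."""
    per_channel = [conv2d_single(c, k) for c, k in zip(image, kernel)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(p[i][j] for p in per_channel) for j in range(cols)]
            for i in range(rows)]

r = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
g = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
kernel = [[[1, 0], [0, 1]],   # K(Depth 1), applied to R
          [[1, 1], [1, 1]],   # K(Depth 2), applied to G
          [[0, 1], [1, 0]]]   # K(Depth 3), applied to B
print(conv2d_multichannel([r, g, b], kernel))  # [[12, 8], [8, 12]]
```

One kernel with three slices produces a single output channel; a layer with N filters repeats this N times to produce N channels.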
3rd layer: non-complete connection scheme (each of the 16 feature maps is connected to only a subset of the 6 input feature maps)
Convolutional layer
No.filters: 16
Filter size: 5×5
Padding: 0
Stride:1
4th layer
Subsampling layer
Filter size: 2×2
Padding: 0
Stride:2
10x10x16
5x5x16
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
ceiling((10-2+1)/2) = ceiling(4.5) = 5
5th layer
5x5x16
Convolutional layer
No.filters: 120
Filter size: 5×5
Padding: 0
Stride:1
1x1x120
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
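To see how the five LeNet5 layers fit together, the following sketch traces the feature-map shape through the network using the no-padding rule. The layer list is transcribed from the slides; the variable names and the tuple layout are illustrative assumptions.

```python
import math

def out_size(n, kernel_size, stride):
    """No-padding output size."""
    return math.ceil((n - kernel_size + 1) / stride)

# (kind, kernel size, stride, number of filters or None for subsampling)
lenet5_layers = [
    ("conv", 5, 1, 6),
    ("subsample", 2, 2, None),
    ("conv", 5, 1, 16),
    ("subsample", 2, 2, None),
    ("conv", 5, 1, 120),
]

size, channels = 32, 1
shapes = [(size, size, channels)]
for kind, kernel_size, stride, filters in lenet5_layers:
    size = out_size(size, kernel_size, stride)
    if filters is not None:      # subsampling keeps the channel count
        channels = filters
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
# (32, 32, 1), (28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16), (1, 1, 120)
```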
Following: Fully connected layers
120
FC
84
FC
10
Architecture
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
1st layer
Convolutional layer
No.filters: 96
Filter size: 11×11
Padding: 0
Stride: 4
227x227x3
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
How many parameters?
11x11x3x96=34,848
55x55x96
55x55x48
55x55x48
ceiling((227-11+1)/4) = ceiling(54.25) = 55
96/2 = 48 (the 96 filters are split evenly across two GPUs)
2nd layer
Max-pooling
Pool size: 3×3
Padding: 0
Stride: 2
27x27x48
No padding: output_size=ceiling( (input_size-kernel_size+1)/stride )
How many parameters?
0
55x55x48
55x55x48 27x27x48
ceiling((55-3+1)/2) = ceiling(26.5) = 27
Max-pooling simply takes the maximum value in each window, so it has no trainable parameters.
3rd layer
With padding chosen to preserve the size ("same" padding): output_size=ceiling( input_size/stride )
Convolutional layer
No.filters: 256
Filter size: 5×5
Padding: 2
Stride: 1
27x27x48
27x27x128
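As a cross-check on the padded case, the sketch below uses the general rule floor((n + 2p - k)/s) + 1, which is standard but not stated on the slides, and compares it with the ceiling(input_size/stride) shortcut. Function names are illustrative.

```python
import math

def conv_output_size_padded(input_size, kernel_size, stride=1, padding=0):
    """General rule: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def same_padding_output_size(input_size, stride=1):
    """Shortcut used on the slide for size-preserving padding."""
    return math.ceil(input_size / stride)

# AlexNet 3rd layer: 27x27 input, 5x5 filter, padding 2, stride 1
print(conv_output_size_padded(27, 5, stride=1, padding=2))  # 27
print(same_padding_output_size(27, stride=1))               # 27
```

With padding 0 the general rule gives the same results as the no-padding formula used for the earlier layers, for example 55 for the first AlexNet layer.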
How many parameters?
5x5x48x256=307,200 (each filter sees only the 48 channels on its own GPU)
27x27x48
27x27x128
256/2=128 filters per GPU
4th layer
27x27x128
Max-pooling
Pool size: 3×3
Padding: 0
Stride: 2
13x13x128
output_size=ceiling( (input_size-kernel_size+1)/stride )
27x27x128
13x13x128
How many parameters?
0
5th layer
Convolution layer
No.filters: 384
Filter size: 3×3
Padding: 1
Stride: 1
13x13x128
13x13x128
13x13x192
13x13x192
How many parameters?
3x3x256x384=884,736
The input depth is 256 (not 128) because this layer is cross-connected to the feature maps on both GPUs.
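The convolutional parameter counts so far all follow kernel height × kernel width × input channels × number of filters. A minimal sketch follows; the function name is an illustrative choice, and biases are left out to match the counts on the slides.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Number of weights in a conv layer with square kernels (biases excluded)."""
    return kernel_size * kernel_size * in_channels * num_filters

print(conv_params(11, 3, 96))    # 1st layer: 34848
print(conv_params(5, 48, 256))   # 3rd layer, 48 input channels per GPU: 307200
print(conv_params(3, 256, 384))  # 5th layer, cross-connected to both GPUs: 884736
```

Including biases would add one parameter per filter, for example 96 more for the first layer.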
Following convolutional layers
13x13x192
Convolution layer
No.filters: 384
Filter size: 3×3
Padding: 1
Stride: 1
Convolution layer
No.filters: 256
Filter size: 3×3
Padding: 1
Stride: 1
13x13x192 13x13x192 13x13x192
13x13x128 13x13x128
How many parameters?
3x3x192x384=663,552
How many parameters?
3x3x192x256=442,368
Max-pooling and flatten
Flatten
13x13x128
Max-pooling
Filter size: 3×3
Padding: 0
Stride: 2
4608
6x6x128
13x13x128 6x6x128
4608
6x6x128 = 4608
Following fully connected layers
FC
2048
FC
1000
4608
4608
FC
2048
FC
2048
FC
2048
How many parameters?
9216×4096=37,748,736
4096×4096=16,777,216
4096×1000=4,096,000
9216 = 4608×2 and 4096 = 2048×2, because each fully connected layer takes the outputs of both GPUs as input.
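The fully connected counts are simply inputs × outputs. The sketch below reproduces them and adds up the three layers; biases are excluded to match the slides, and the names are illustrative.

```python
def fc_params(num_inputs, num_outputs):
    """Number of weights in a fully connected layer (biases excluded)."""
    return num_inputs * num_outputs

fc_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]
counts = [fc_params(n_in, n_out) for n_in, n_out in fc_layers]
print(counts)       # [37748736, 16777216, 4096000]
print(sum(counts))  # 58621952
```

The three fully connected layers hold far more weights than any of the convolutional layers counted earlier.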
Architecture
224 × 224×3
conv3-64
conv3-64
Maxpool
Conv3-128
Conv3-128
Maxpool
Conv3-256
Conv3-256
Conv3-256
Maxpool
Conv3-512
Conv3-512
Conv3-512
Maxpool
Conv3-512
Conv3-512
Conv3-512
Maxpool
FC4096
FC4096
FC1000
• VGG16: 16 weight layers
112×112×64 56×56×128 28×28×256 14×14×512 7×7×512
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
Conv layer: kernel size 3×3, pad 1, stride 1
Maxpooling layer: 2×2,stride 2
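The VGG16 feature-map sizes can be reproduced by walking the layer list: each 3×3 conv with pad 1 and stride 1 keeps the spatial size, and each 2×2 max-pool with stride 2 halves it. The list encoding below ("M" for max-pool) is an illustrative convention, not notation from the slides.

```python
# Number of filters per conv layer, "M" marks a 2x2 max-pool with stride 2
vgg16_config = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]

size, channels = 224, 3
after_pool = []
weight_layers = 0
for item in vgg16_config:
    if item == "M":
        size //= 2                      # pooling halves width and height
        after_pool.append((size, size, channels))
    else:
        channels = item                 # conv keeps the size, sets the depth
        weight_layers += 1
weight_layers += 3                      # FC4096, FC4096, FC1000

print(after_pool)
# [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
print(weight_layers)  # 16
```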
Architecture
224 × 224×3
conv3-64
conv3-64
Maxpool
Conv3-128
Conv3-128
Maxpool
Conv3-256
Conv3-256
Conv3-256
Conv3-256
Maxpool
Conv3-512
Conv3-512
Conv3-512
Conv3-512
Maxpool
Conv3-512
Conv3-512
Conv3-512
Conv3-512
Maxpool
FC4096
FC4096
FC1000
• VGG19: 19 weight layers
Conv layer: kernel size 3×3, pad 1, stride 1
Maxpooling layer: 2×2,stride 2
Stacking multiple 3×3 conv layers
• a stack of two 3×3 conv. layers has an effective receptive field of 5×5
• a stack of three 3×3 conv. layers has an effective receptive field of 7×7
More layers: larger receptive field (a larger window of the input is seen)
Conv
Conv
If you add a convolutional layer with kernel size K (stride 1), the receptive field increases by (K-1)
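The (K-1) rule can be turned into a short calculation. This sketch assumes every layer has stride 1, as in the VGG conv layers; the function name is illustrative.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 conv layers: start at 1, add K-1 per layer."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7
```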
Why stack multiple 3×3 conv layers instead of using one large filter?
• Fewer parameters (weights per input-output channel pair):
• 5×5=25 vs. 3x3x2=18
• 7×7=49 vs. 3x3x3=27
• Each conv layer uses ReLU as its activation function, so more layers means more non-linear rectification layers and a more powerful network
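The parameter comparison can be written out as below. The `channels` argument is an added assumption that input and output depths are equal, which scales both sides by the same factor and leaves the ratio unchanged.

```python
def stacked_conv_params(kernel_size, num_layers, channels=1):
    """Weights in a stack of conv layers with equal input/output depth (biases excluded)."""
    return num_layers * kernel_size * kernel_size * channels * channels

print(stacked_conv_params(5, 1), stacked_conv_params(3, 2))  # 25 18
print(stacked_conv_params(7, 1), stacked_conv_params(3, 3))  # 49 27
```

For the same receptive field, two 3×3 layers use 18/25 = 72% of the weights of one 5×5 layer, and three 3×3 layers use 27/49 (about 55%) of the weights of one 7×7 layer.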
Architecture
Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015
Inception module (Naive version):
Channel-wise
concatenation
Different scales of data require different convolutional filter sizes
Inception module
28x28x192
64 128 32
28x28x(64+128+32+192) = 28x28x416
Channel-wise
concatenation
(Naive version)
Params: 12K (1×1 conv), 221K (3×3 conv), 153K (5×5 conv), 0 (max-pooling)
Total parameters:
~386K
How many parameters? input channels × kernel size (height × width) × no. filters (output channels)
Branch outputs: 28x28x64 (1×1 conv), 28x28x128 (3×3 conv), 28x28x32 (5×5 conv), 28x28x192 (max-pooling)
12K = 192×1×1×64
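The naive Inception module counts can be reproduced with the sketch below. Function names and the dictionary layout are illustrative; biases are excluded. Exact sums can differ by about 1K from totals obtained by adding rounded per-branch figures.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weights in a conv layer with square kernels (biases excluded)."""
    return kernel_size * kernel_size * in_channels * num_filters

def naive_inception(in_channels, n1x1, n3x3, n5x5):
    """Per-branch parameter counts and output depth of a naive Inception module."""
    params = {
        "1x1": conv_params(1, in_channels, n1x1),
        "3x3": conv_params(3, in_channels, n3x3),
        "5x5": conv_params(5, in_channels, n5x5),
        "pool": 0,  # max-pooling has no parameters
    }
    out_channels = n1x1 + n3x3 + n5x5 + in_channels  # pooling keeps the input depth
    return params, out_channels

params, out_channels = naive_inception(192, 64, 128, 32)
print(params)                # {'1x1': 12288, '3x3': 221184, '5x5': 153600, 'pool': 0}
print(sum(params.values()))  # 387072
print(out_channels)          # 416
```

Because the pooling branch passes all input channels through, the output depth grows with every module, which is what drives the parameter count up in the next module.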
Inception module
28x28x416
128 192 96
28x28x(128+192+96+416) = 28x28x832
Channel-wise
concatenation
(Naive version)
Params: 53K (1×1 conv), 718K (3×3 conv), 998K (5×5 conv), 0 (max-pooling)
Total parameters:
~1.7M
More modules
More parameters
More computations
Inception module with dimensionality reduction
Use 1×1 convolution to reduce channels
28x28x192
96 16
32
64
128 32
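Params: 12K (1×1 conv, 64 filters)
Params: 18K (1×1 conv, 96 filters, before the 3×3 conv)
Params: 110K (3×3 conv, 128 filters)
Params: 3K (1×1 conv, 16 filters, before the 5×5 conv)
Params: 12K (5×5 conv, 32 filters)
Params: 0 (max-pooling)
Params: 6K (1×1 conv, 32 filters, after max-pooling)
28x28x(64+128+32+32) = 28x28x256
Total parameters: ~161K (naive version: ~386K)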
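The effect of the 1×1 reductions can be checked with the following sketch. The function and argument names are illustrative assumptions; biases are excluded. Exact sums come out slightly above totals obtained by adding rounded per-layer figures.

```python
def reduced_inception(in_channels, n1x1, reduce3, n3x3, reduce5, n5x5, pool_proj):
    """Total weights and output depth of an Inception module with 1x1 reductions."""
    params = [
        in_channels * n1x1,       # 1x1 conv branch
        in_channels * reduce3,    # 1x1 reduction before the 3x3 conv
        3 * 3 * reduce3 * n3x3,   # 3x3 conv on the reduced depth
        in_channels * reduce5,    # 1x1 reduction before the 5x5 conv
        5 * 5 * reduce5 * n5x5,   # 5x5 conv on the reduced depth
        in_channels * pool_proj,  # 1x1 conv after max-pooling
    ]
    out_channels = n1x1 + n3x3 + n5x5 + pool_proj
    return sum(params), out_channels

print(reduced_inception(192, 64, 96, 128, 16, 32, 32))    # (163328, 256)
print(reduced_inception(256, 128, 128, 192, 32, 96, 64))  # (388096, 480)
```

The 1×1 conv after pooling also caps the depth contributed by the pooling branch, so the output depth no longer grows by the full input depth at every module.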
Inception module with dimensionality reduction
Use 1×1 convolution to reduce channels
28x28x256
128 32
64
128
192 96
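Params: 32K (1×1 conv, 128 filters)
Params: 32K (1×1 conv, 128 filters, before the 3×3 conv)
Params: 221K (3×3 conv, 192 filters)
Params: 8K (1×1 conv, 32 filters, before the 5×5 conv)
Params: 76K (5×5 conv, 96 filters)
Params: 0 (max-pooling)
Params: 16K (1×1 conv, 64 filters, after max-pooling)
28x28x(128+192+96+64) = 28x28x480
Total parameters: ~385K (naive version: ~1.7M)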
Large Scale Visual Recognition Challenge
Comparison of different architectures
• Top-5 error: the proportion of images whose ground-truth category is not among the model's top-5 predicted categories.
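Top-5 error could be computed as in the sketch below. The scores and labels are made-up toy values with 6 classes, and the function name is illustrative.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)

scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 4 has the lowest score
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 2 has the highest score
]
labels = [4, 2]
print(top_k_error(scores, labels, k=5))  # 0.5
```

Setting k=1 gives the ordinary (top-1) classification error.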
He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error
Residual network: more layers & better performance
More layers?
Residual network
Hypothesis: residual mapping is easier to optimise
Residual learning:
let these layers fit a residual mapping F(x) = H(x) - x, where H(x) is the desired underlying mapping, so that the block outputs F(x) + x
[Figure: x → weight layer → relu → weight layer → relu]
Unreferenced mapping:
directly fit the desired underlying mapping H(x)
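A toy residual block with two fully connected weight layers might look like the sketch below. Real ResNet blocks use convolutions and batch normalisation; the plain-list representation, the helper names and the weight values here are assumptions made to keep the example self-contained.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]

def linear(weights, vector):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    """Output relu(F(x) + x), where F(x) = w2 * relu(w1 * x) is the residual."""
    residual = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(residual, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights the residual is 0 and the block passes x through
print(residual_block([1.0, 2.0], zeros, zeros))         # [1.0, 2.0]
print(residual_block([1.0, -2.0], identity, identity))  # [2.0, 0.0]
```

The first call illustrates the hypothesis above: when the best mapping is close to the identity, the layers only need to push the residual towards zero rather than learn the identity from scratch.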
Residual network (34 layers)
[Figure: 34-layer residual network. Input size: 224x224x3. Residual blocks are repeated x3, x4 (128-d), x6 (256-d) and x3 (512-d), with 1×1 conv layers (128, 256 and 512 filters) marked in the diagram.]
Residual network
Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2
Large Scale Visual Recognition Challenge
Comparison of different architectures
• Top-5 error: the proportion of images whose ground-truth category is not among the model's top-5 predicted categories.
Use the pretrained CNN model as feature extractor
Train a new classifier for output
If you have quite a lot of data: fine-tuning
Slightly train a few more top layers
Train
Conv layers
Train
Classifier
Frozen
Summary
• How to calculate the number of parameters and the size of the output feature map?
• Differences between the architectures
• Key idea of VGG: how to increase receptive field?
• Key idea of GoogleNet: how to reduce parameters?
• Key idea of ResNet: how to increase layers with better performance?