
Qiuhong Ke

CNN Architectures
COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

Outline

• LeNet5

• AlexNet

• VGG

• GoogleNet

• ResNet

LeNet5: Architecture

LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.


LeNet5 AlexNet VGG GoogleNet ResNet

1st layer

Convolutional layer
No. filters: 6
Filter size: 5×5
Padding: 0
Stride: 1

Input: 32x32x1
Output: 28x28x6

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: (32 - 5 + 1)/1 = 28
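The output-size rule on this slide can be checked with a few lines of code. This is a minimal sketch; the function names `output_size_no_padding` and `output_size_with_padding` are illustrative and not part of the lecture material. The second function covers the padded case used later in the deck, where the output size is ceiling(input_size / stride).

```python
import math


def output_size_no_padding(input_size: int, kernel_size: int, stride: int = 1) -> int:
    """Spatial output size of a conv or pooling layer with padding 0."""
    return math.ceil((input_size - kernel_size + 1) / stride)


def output_size_with_padding(input_size: int, stride: int = 1) -> int:
    """Spatial output size when padding makes the output ceiling(input / stride)."""
    return math.ceil(input_size / stride)


# LeNet5 1st layer: 32x32 input, 5x5 filter, stride 1
lenet_layer1 = output_size_no_padding(32, 5, 1)
```

The same function applies to pooling layers, since only the window size and stride matter for the spatial size.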

2nd layer

Subsampling layer
Filter size: 2×2
Padding: 0
Stride: 2

Input: 28x28x6
Output: 14x14x6

• Take the sum of all units in the 2×2 window

• Output of each patch = sum * coefficient (trainable) + bias (trainable)

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: ceiling((28 - 2 + 1)/2) = ceiling(13.5) = 14
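The subsampling rule (sum of the 2×2 window, times a trainable coefficient, plus a trainable bias) can be sketched in plain Python. The function name `lenet_subsample` and the example values are assumptions for illustration. The coefficient and bias are passed in as fixed numbers here, whereas in the network they are learned per feature map.

```python
def lenet_subsample(feature_map, coefficient, bias):
    """2x2 window, stride 2: sum the four units, scale by coefficient, add bias."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window_sum = (feature_map[r][c] + feature_map[r][c + 1]
                          + feature_map[r + 1][c] + feature_map[r + 1][c + 1])
            out_row.append(window_sum * coefficient + bias)
        out.append(out_row)
    return out


# Toy 4x4 feature map (illustrative values only)
fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
result = lenet_subsample(fm, coefficient=0.25, bias=0.0)
```

With coefficient 0.25 and bias 0 the layer reduces to average pooling, which is one way to read the trainable version: a learned rescaling of the window sum.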

3rd layer

Convolutional layer
No. filters: 16
Filter size: 5×5
Padding: 0
Stride: 1

Input: 14x14x6
Output: 10x10x16

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: (14 - 5 + 1)/1 = 10

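Convolution on multiple-channel input

• The kernel has the same number of channels (depth) as the input

• Each input channel is convolved with the matching kernel slice: R * K(Depth 1), G * K(Depth 2), B * K(Depth 3)

• The three results are summed element-wise to give one output channel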
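A small sketch of the multi-channel case may help: each channel is filtered by its own kernel slice and the per-channel results are added element-wise into a single output channel. The function names and toy values below are illustrative assumptions. As is usual in deep learning, the "convolution" is implemented as a sliding-window product without flipping the kernel, with no padding and stride 1.

```python
def conv2d_single_channel(image, kernel):
    """Sliding-window product on one channel (no padding, stride 1, no kernel flip)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


def conv2d_multi_channel(channels, kernel_slices):
    """Filter each channel with its kernel slice, then sum element-wise."""
    per_channel = [conv2d_single_channel(img, ker)
                   for img, ker in zip(channels, kernel_slices)]
    out_h, out_w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(m[i][j] for m in per_channel) for j in range(out_w)]
            for i in range(out_h)]


# Toy 3x3 "RGB" input and one 2x2 kernel of depth 3 (illustrative values)
R = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
G = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
B = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
K = [[[1, 0], [0, 1]],   # slice applied to R
     [[1, 1], [1, 1]],   # slice applied to G
     [[5, 5], [5, 5]]]   # slice applied to B
one_channel = conv2d_multi_channel([R, G, B], K)
```

One kernel of depth 3 yields one output channel; a layer with N filters repeats this N times and stacks the results.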

3rd layer: Non-complete connection scheme (each of the 16 filters is connected to only a subset of the 6 input feature maps)

4th layer

Subsampling layer
Filter size: 2×2
Padding: 0
Stride: 2

Input: 10x10x16 (output of the 3rd-layer convolution: 16 filters, 5×5, padding 0, stride 1)
Output: 5x5x16

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: ceiling((10 - 2 + 1)/2) = ceiling(4.5) = 5

5th layer

Convolutional layer
No. filters: 120
Filter size: 5×5
Padding: 0
Stride: 1

Input: 5x5x16
Output: 1x1x120

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: (5 - 5 + 1)/1 = 1

Following: Fully connected layers

120 → FC → 84 → FC → 10

AlexNet: Architecture

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.


1st layer

Convolutional layer
No. filters: 96
Filter size: 11×11
Padding: 0
Stride: 4

Input: 227x227x3
Output: 55x55x96, split into two halves of 55x55x48 (96/2 = 48)

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: ceiling((227 - 11 + 1)/4) = ceiling(54.25) = 55

How many parameters? (weights only, biases not counted)
11x11x3x96 = 34,848


2nd layer

Max-pooling
Pool size: 3×3
Padding: 0
Stride: 2

Input: 55x55x48 (per half)
Output: 27x27x48 (per half)

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: ceiling((55 - 3 + 1)/2) = ceiling(26.5) = 27

How many parameters?
0. Max-pooling only takes the maximum value in each window, so it has no parameters.

3rd layer

Convolutional layer
No. filters: 256
Filter size: 5×5
Padding: 2
Stride: 1

Input: 27x27x48 (per half)
Output: 27x27x128 (per half; 256/2 = 128)

With padding: output_size = ceiling(input_size / stride)

Here: ceiling(27/1) = 27

How many parameters?
5x5x48x256 = 307,200

4th layer

Max-pooling
Pool size: 3×3
Padding: 0
Stride: 2

Input: 27x27x128 (per half)
Output: 13x13x128 (per half)

No padding: output_size = ceiling((input_size - kernel_size + 1) / stride)

Here: ceiling((27 - 3 + 1)/2) = ceiling(12.5) = 13

How many parameters?
0

5th layer

Convolution layer
No. filters: 384
Filter size: 3×3
Padding: 1
Stride: 1

Input: 13x13x128 (per half)
Output: 13x13x192 (per half)

How many parameters?
3x3x256x384 = 884,736, since the two halves are cross-connected at this layer: every filter sees all 2 × 128 = 256 input channels.

Following convolutional layers

Convolution layer
No. filters: 384
Filter size: 3×3
Padding: 1
Stride: 1

Input: 13x13x192 (per half)
Output: 13x13x192 (per half)

How many parameters?
3x3x192x384 = 663,552

Convolution layer
No. filters: 256
Filter size: 3×3
Padding: 1
Stride: 1

Input: 13x13x192 (per half)
Output: 13x13x128 (per half)

How many parameters?
3x3x192x256 = 442,368

Max-pooling and flatten

Max-pooling
Filter size: 3×3
Padding: 0
Stride: 2

Input: 13x13x128 (per half)
Output: 6x6x128 (per half)

Flatten: 6 x 6 x 128 = 4608 (per half)

Following fully connected layers

Flattened input: 4608 per half, 9216 in total (= 4608*2)
FC: 2048 units per half, 4096 in total (= 2048*2)
FC: 2048 units per half, 4096 in total
FC: 1000 units

How many parameters?
9216×4096 = 37,748,736
4096×4096 = 16,777,216
4096×1000 = 4,096,000
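The AlexNet parameter counts quoted on these slides follow one rule for convolutions (kernel height x kernel width x input channels seen by each filter x number of filters) and one for fully connected layers (inputs x outputs). The sketch below reproduces them; the helper names `conv_weights` and `fc_weights` are illustrative, and biases are left out, as on the slides. For the 2nd, 4th, and 5th conv layers each filter only sees the channels of its own half (48 or 192), while the 3rd conv layer is cross-connected and sees all 256.

```python
def conv_weights(kernel_size, in_channels_per_filter, num_filters):
    """Weights of a conv layer with square kernels (biases not counted)."""
    return kernel_size * kernel_size * in_channels_per_filter * num_filters


def fc_weights(num_inputs, num_outputs):
    """Weights of a fully connected layer (biases not counted)."""
    return num_inputs * num_outputs


alexnet_conv = {
    "conv1": conv_weights(11, 3, 96),
    "conv2": conv_weights(5, 48, 256),    # each filter sees one half: 48 channels
    "conv3": conv_weights(3, 256, 384),   # cross-connected: all 256 channels
    "conv4": conv_weights(3, 192, 384),
    "conv5": conv_weights(3, 192, 256),
}
alexnet_fc = {
    "fc1": fc_weights(9216, 4096),
    "fc2": fc_weights(4096, 4096),
    "fc3": fc_weights(4096, 1000),
}
total_conv = sum(alexnet_conv.values())
total_fc = sum(alexnet_fc.values())
```

Comparing `total_conv` with `total_fc` shows that the fully connected layers hold the large majority of the weights.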

VGG: Architecture

• VGG16: 16 weight layers

Conv layer: kernel size 3×3, pad 1, stride 1
Maxpooling layer: 2×2, stride 2

Input: 224×224×3
conv3-64, conv3-64, Maxpool → 112×112×64
conv3-128, conv3-128, Maxpool → 56×56×128
conv3-256, conv3-256, conv3-256, Maxpool → 28×28×256
conv3-512, conv3-512, conv3-512, Maxpool → 14×14×512
conv3-512, conv3-512, conv3-512, Maxpool → 7×7×512
FC4096
FC4096
FC1000

Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
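The VGG16 feature-map sizes and the count of 16 weight layers can be traced with a short script. This is a sketch under the layer settings stated on the slide (3×3 conv with pad 1 and stride 1 keeps the spatial size; 2×2 max-pooling with stride 2 halves it). The names `VGG16_BLOCKS`, `NUM_FC_LAYERS`, and `trace_vgg` are illustrative.

```python
# (number of conv layers, number of filters) for each block; a max-pool follows each block
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
NUM_FC_LAYERS = 3  # FC4096, FC4096, FC1000


def trace_vgg(blocks, num_fc_layers, input_size=224):
    """Return the feature-map shape after each max-pool and the number of weight layers."""
    size = input_size
    shapes = []
    weight_layers = 0
    for num_convs, num_filters in blocks:
        weight_layers += num_convs   # 3x3, pad 1, stride 1: spatial size unchanged
        size //= 2                   # 2x2 max-pool, stride 2: spatial size halved
        shapes.append((size, size, num_filters))
    weight_layers += num_fc_layers
    return shapes, weight_layers


shapes, weight_layers = trace_vgg(VGG16_BLOCKS, NUM_FC_LAYERS)
```

Changing the block list to `[(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]` gives the VGG19 configuration with 19 weight layers.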

VGG: Architecture

• VGG19: 19 weight layers

Conv layer: kernel size 3×3, pad 1, stride 1
Maxpooling layer: 2×2, stride 2

Input: 224×224×3
conv3-64, conv3-64, Maxpool
conv3-128, conv3-128, Maxpool
conv3-256, conv3-256, conv3-256, conv3-256, Maxpool
conv3-512, conv3-512, conv3-512, conv3-512, Maxpool
conv3-512, conv3-512, conv3-512, conv3-512, Maxpool
FC4096
FC4096
FC1000

Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

Stacking multiple 3×3 conv layers
• A stack of two 3×3 conv. layers has an effective receptive field of 5×5
• A stack of three 3×3 conv. layers has an effective receptive field of 7×7

More layers: larger receptive field (a larger window of the input is seen)

If you add an additional convolutional layer with kernel size K (stride 1), the receptive field is increased by (K-1)

Why stack multiple 3×3 conv layers instead of using a large filter size?
• Fewer parameters (weights per input-output channel pair):

• 5×5=25, 3x3x2=18

• 7×7=49, 3x3x3=27

• Each conv layer uses ReLU as the activation function. More layers means more non-linear rectification layers, and so a more powerful network.
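Both arguments for stacking 3×3 layers can be checked numerically. The sketch below assumes stride-1 convolutions, so each extra layer with kernel size K grows the receptive field by K-1, and counts weights per input-output channel pair. The function names are illustrative.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 conv layers."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer with kernel size K adds K-1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Total kernel weights for one input-output channel pair across the stack."""
    return sum(k * k for k in kernel_sizes)


two_3x3 = (stacked_receptive_field([3, 3]), weights_per_channel_pair([3, 3]))
one_5x5 = (stacked_receptive_field([5]), weights_per_channel_pair([5]))
three_3x3 = (stacked_receptive_field([3, 3, 3]), weights_per_channel_pair([3, 3, 3]))
one_7x7 = (stacked_receptive_field([7]), weights_per_channel_pair([7]))
```

Each tuple holds (receptive field, weights), so the stacked and single-layer variants can be compared side by side.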

GoogleNet: Architecture

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015

Inception module (naive version): channel-wise concatenation of the branch outputs

Different scales of data require different convolutional filter sizes


Inception module (naive version)

Input: 28x28x192

• 1×1 conv, 64 filters: output 28x28x64, Params: 12K (= 192*1*1*64)
• 3×3 conv, 128 filters: output 28x28x128, Params: 221K
• 5×5 conv, 32 filters: output 28x28x32, Params: 153K
• Max-pooling: output 28x28x192, Params: 0

Channel-wise concatenation: 28x28x(64+128+32+192) = 28x28x416

How many parameters?
Input_channel x Kernel_size x No. filters (output_channel)

Total parameters: ~386K

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015
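The per-branch counts for the naive Inception module follow from input channels x kernel area x number of filters. A minimal sketch, with the illustrative names `conv_weights` and `naive_inception` and biases ignored as on the slide:

```python
def conv_weights(in_channels, kernel_size, num_filters):
    """Input_channel x Kernel_size (area) x No. filters, biases not counted."""
    return in_channels * kernel_size * kernel_size * num_filters


def naive_inception(in_channels, n1x1, n3x3, n5x5):
    """Per-branch weights and output channel count of a naive Inception module."""
    params = {
        "1x1": conv_weights(in_channels, 1, n1x1),
        "3x3": conv_weights(in_channels, 3, n3x3),
        "5x5": conv_weights(in_channels, 5, n5x5),
        "pool": 0,  # max-pooling has no parameters
    }
    # The pooling branch keeps all input channels, so they are added to the output
    out_channels = n1x1 + n3x3 + n5x5 + in_channels
    return params, out_channels


params, out_channels = naive_inception(192, 64, 128, 32)
total = sum(params.values())
```

The exact total is slightly above the slide's figure because the slide sums per-branch values that were already shortened to thousands. Note how the pooling branch makes the output channel count grow with every stacked module.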

Inception module (naive version), next module

Input: 28x28x416

• 1×1 conv, 128 filters: Params: 53K
• 3×3 conv, 192 filters: Params: 718K
• 5×5 conv, 96 filters: Params: 998K
• Max-pooling: output 28x28x416, Params: 0

Channel-wise concatenation: 28x28x(128+192+96+416) = 28x28x832

Total parameters: ~1.7M

More modules: more parameters, more computations

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015

Inception module with dimensionality reduction

Use 1×1 convolution to reduce channels

Input: 28x28x192

• 1×1 conv, 64 filters: Params: 12K
• 1×1 conv, 96 filters (Params: 18K), then 3×3 conv, 128 filters (Params: 110K)
• 1×1 conv, 16 filters (Params: 3K), then 5×5 conv, 32 filters (Params: 12K)
• Max-pooling (Params: 0), then 1×1 conv, 32 filters (Params: 6K)

Channel-wise concatenation: 28x28x(64+128+32+32) = 28x28x256

Total parameters: 161K (naive version: ~386K)

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015
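The saving from the 1×1 reduction layers can be reproduced with the same counting rule (input channels x kernel area x number of filters, biases ignored). In this sketch the function name `reduced_inception` and its argument names are illustrative assumptions; the filter counts are the ones shown for the module with a 28x28x192 input.

```python
def reduced_inception(in_channels, n1x1, r3x3, n3x3, r5x5, n5x5, pool_proj):
    """Per-layer weights and output channels of an Inception module with 1x1 reductions."""
    params = {
        "1x1": in_channels * n1x1,
        "3x3_reduce": in_channels * r3x3,   # 1x1 conv before the 3x3 conv
        "3x3": r3x3 * 3 * 3 * n3x3,         # 3x3 conv now sees only r3x3 channels
        "5x5_reduce": in_channels * r5x5,   # 1x1 conv before the 5x5 conv
        "5x5": r5x5 * 5 * 5 * n5x5,         # 5x5 conv now sees only r5x5 channels
        "pool": 0,
        "pool_proj": in_channels * pool_proj,  # 1x1 conv after max-pooling
    }
    out_channels = n1x1 + n3x3 + n5x5 + pool_proj
    return params, out_channels


params, out_channels = reduced_inception(192, 64, 96, 128, 16, 32, 32)
total = sum(params.values())
```

The exact total comes out a little above the slide's figure, which adds per-layer values already shortened to thousands. The output depth is also fixed by the filter counts alone, because the pooling branch is projected down by its own 1×1 convolution.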

Inception module with dimensionality reduction, next module

Use 1×1 convolution to reduce channels

Input: 28x28x256

• 1×1 conv, 128 filters: Params: 32K
• 1×1 conv, 128 filters (Params: 32K), then 3×3 conv, 192 filters (Params: 221K)
• 1×1 conv, 32 filters (Params: 8K), then 5×5 conv, 96 filters (Params: 76K)
• Max-pooling (Params: 0), then 1×1 conv, 64 filters (Params: 16K)

Channel-wise concatenation: 28x28x(128+192+96+64) = 28x28x480

Total parameters: 385K (naive version: ~1.7M)

Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015

Large Scale Visual Recognition Challenge
Comparison of different architectures

• Top-5 error: the proportion of images for which the ground-truth category is outside the model's top-5 predicted categories.
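The top-5 error definition translates directly into code. This is a minimal sketch with an illustrative function name, `top_k_error`, and made-up scores for two images over six classes; it is not tied to any particular library.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Two images, six classes (illustrative scores)
scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 4 has the lowest score
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 has the highest score
]
labels = [4, 0]
top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

Setting `k=1` gives the ordinary (top-1) classification error with the same function.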

He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.


More layers?

[Figure: training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error.]

Residual network: more layers & better performance

Residual network
Hypothesis: residual mapping is easier to optimise

Unreferenced mapping: directly fit the desired underlying mapping H(x)

Residual learning: let these layers fit a residual mapping F(x) = H(x) - x, so that the block outputs F(x) + x

[Figure: building block in which the input x passes through weight layer, relu, weight layer, relu]

He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
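The idea of residual learning can be sketched in a few lines. Following the notation of He et al., the stacked layers compute F(x) and the block outputs relu(F(x) + x), so the layers only need to fit the residual between the desired mapping and the input. To keep the example self-contained, the two weight layers are plain fully connected layers without biases rather than convolutions; that simplification, the function names, and the toy weights are assumptions of this sketch.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Matrix-vector product: one weight layer without bias."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """relu(F(x) + x), where F is weight layer -> relu -> weight layer."""
    f = linear(w2, relu(linear(w1, x)))            # residual mapping F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # shortcut adds the input x


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

passthrough = residual_block(x, zeros, zeros)     # F(x) = 0, so the block returns x
doubled = residual_block(x, identity, identity)   # F(x) = x, so the block returns 2x
```

With all weights at zero the block passes a non-negative input through unchanged, which illustrates why adding such blocks should not make an already good network harder to fit.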

Residual network (34 layers)

Input size: 224x224x3

[Figure: 34-layer residual network. Labels recovered from the figure: block groups repeated x3, x4 (128-d), x6 (256-d), and x3 (512-d); conv 1×1,128; conv 1×1,256; conv 1×1,512.]

Residual network

Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2

He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.


Use the pretrained CNN model as feature extractor
Train a new classifier for output


If you have quite a lot of data: fine-tuning
Slightly train a few more top layers

[Figure: the earlier layers are frozen; the top conv layers and the classifier are trained.]

Summary

• How to calculate the number of parameters and the size of the output feature map?

• Differences between the architectures

• Key idea of VGG: how to increase the receptive field?

• Key idea of GoogleNet: how to reduce parameters?

• Key idea of ResNet: how to add more layers and get better performance?