The University of Sydney Page 1
Dr Chang Xu
School of Computer Science
Neural Network
Architectures
The University of Sydney Page 2
ILSVRC
q Image Classification
q one of the core problems in computer vision
q many other tasks (such as object detection, segmentation) can be
reduced to image classification
q ImageNet Large Scale Visual Recognition Challenge
q 2010 – 2017
q main tasks: classification and detection
q a large-scale benchmark for image classification methods
q 1000 classes
q 1.2 million training images
q 50 thousand validation images
q 150 thousand test images
* Some slides are borrowed from cs231n.stanford.edu
The University of Sydney Page 3
ILSVRC
The University of Sydney Page 4
ILSVRC
[Chart: top-5 error of ILSVRC winning entries over the years – SVM-based methods (2010, 2011), then AlexNet (0.154), ZFNet (0.112), VGGNet (0.067), ResNet (0.0357); human-level performance: 5.1%]
Progress has been driven by:
– Big data
– GPUs
– Algorithm improvements
The University of Sydney Page 5
Winning CNN Architectures
Model                 AlexNet   ZF Net   GoogLeNet   ResNet
Year                  2012      2013     2014        2015
#Layers               8         8        22          152
Top-5 error           15.4%     11.2%    6.7%        3.57%
Data augmentation     ✓         ✓        ✓           ✓
Dropout               ✓         ✓
Batch normalization                                  ✓
The University of Sydney Page 6
CNN Architecture: Part I
q LeNet [LeCun et al., 1998]
– Architecture is [CONV-POOL-CONV-POOL-FC-FC]
– CONV: 5×5 filter, stride=1
– POOL: 2×2 filter, stride=2
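Below is a minimal sketch of this layer pattern, assuming PyTorch; the channel counts (6 and 16) follow LeNet-5, while the tanh activations and the 32×32 input are assumptions for illustration.

import torch
import torch.nn as nn

# [CONV-POOL-CONV-POOL-FC-FC] with 5x5 convs (stride 1) and 2x2 pooling (stride 2)
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.Tanh(), nn.AvgPool2d(2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(), nn.AvgPool2d(2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 10),
)
print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])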
The University of Sydney Page 7
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
Top-5 error on ILSVRC:
ILSVRC 2010 winner: 28.2%
ILSVRC 2011 winner: 25.8%
ILSVRC 2012 (AlexNet): 16.4% (runner-up: 26.2%)
The University of Sydney Page 8
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
– Activation function: ReLU
– Data Augmentation
– Dropout (drop rate=0.5)
– Local Response Normalization
– Overlapping Pooling
– 11×11 filters in the first convolutional layer
– 5 conv layers, 3 fully connected layers
The University of Sydney Page 9
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012] – 5 conv layers, 3 fully connected layers
– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling
Pros:
– Cheap to compute in the forward pass
– Simple gradient in backpropagation
– Helps prevent saturation
– Produces sparse activations
Cons:
– Dead ReLU (units whose input is always negative stop updating)
The University of Sydney Page 10
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
– Randomly Rotate Images
– Flip Images
– Shift Images
– Contrast Stretching
– Adaptive Equalization
– etc.
– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling
The University of Sydney Page 11
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling
Dropout is a technique that deals with overfitting by (approximately) combining the
predictions of many different large neural networks at test time.
In practice, dropout trains 2^n thinned networks (n – number of units).
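A minimal sketch of inverted dropout, assuming PyTorch tensors; p is the drop rate, and the 1/(1−p) rescaling at training time is what lets the network run unchanged at test time (torch.nn.Dropout(p) provides the same behaviour).

import torch

def dropout_train(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Inverted dropout (training time): zero units with probability p, rescale the rest."""
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)

def dropout_test(x: torch.Tensor) -> torch.Tensor:
    """At test time nothing is dropped; rescaling was already done during training."""
    return x

x = torch.randn(4, 10)
print(dropout_train(x, p=0.5))   # roughly half of the activations are zeroed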
The University of Sydney Page 12
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling
The University of Sydney Page 13
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling
Feature map (4×4):
-1  4  1  2
 0  1  3 -2
 1  5 -2  6
 3 -1 -2 -2

Max pooling (2×2, stride 2):
4  3
5  6

Average pooling (2×2, stride 2):
1  1
2  0

Standard pooling: pooling stride = pooling kernel size
Overlapping pooling: pooling stride < pooling kernel size
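A minimal sketch of standard vs. overlapping pooling on the 4×4 feature map above, assuming PyTorch; the 3×3 window with stride 2 mirrors AlexNet-style overlapping pooling.

import torch
import torch.nn as nn

fmap = torch.tensor([[-1., 4., 1., 2.],
                     [ 0., 1., 3., -2.],
                     [ 1., 5., -2., 6.],
                     [ 3., -1., -2., -2.]]).reshape(1, 1, 4, 4)

standard_max = nn.MaxPool2d(kernel_size=2, stride=2)   # stride = kernel size
standard_avg = nn.AvgPool2d(kernel_size=2, stride=2)
overlapping  = nn.MaxPool2d(kernel_size=3, stride=2)   # stride < kernel size (overlapping)

print(standard_max(fmap))  # [[4, 3], [5, 6]]
print(standard_avg(fmap))  # [[1, 1], [2, 0]]
print(overlapping(fmap))   # [[5.]] – only one 3x3 window fits this small map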
The University of Sydney Page 14
CNN Architecture: Part I
q AlexNet [Krizhevsky et al. 2012]
The network is divided into two parts, with half of the kernels (feature maps) executing on GPU 1 and the
other half on GPU 2.
The University of Sydney Page 15
CNN Architecture: Part I
q ZFNet [Zeiler and Fergus, 2013]
- An improved version of AlexNet: top-5 error from 16.4% to 11.7%
- First convolutional layer: 11×11 filter, stride=4 → 7×7 filter, stride=2
– The number of filters increases as we go deeper.
7×7 filters
The University of Sydney Page 16
CNN Architecture: Part I
q VGGNet [Simonyan and Zisserman, 2014]
Small filters:
– 3×3 convolutional layers (stride 1, pad 1)
– 2×2 max-pooling, stride 2
Deeper networks:
– AlexNet: 8 layers
– VGGNet: 16 or 19 layers
[Chart: number of parameters (millions) and top-5 error rate (%)]
The University of Sydney Page 17
CNN Architecture: Part I
Why small filters?
– Two stacked 3×3 layers have the same receptive field as one 5×5 layer;
three stacked 3×3 layers have the same receptive field as one 7×7 layer
– Deeper, with more non-linearity
– Fewer parameters
Receptive field (RF): the region of the input space that a
particular CNN feature looks at (i.e., is affected by).
Image credit to mc.ai
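A quick check of the parameter claim, assuming each layer has C input and C output channels and ignoring biases:

% Parameters of three stacked 3x3 conv layers vs. one 7x7 layer (C channels in and out, no biases)
\[
\underbrace{3 \times (3 \times 3 \times C \times C)}_{\text{three } 3\times 3 \text{ layers}} = 27C^2
\qquad \text{vs.} \qquad
\underbrace{7 \times 7 \times C \times C}_{\text{one } 7\times 7 \text{ layer}} = 49C^2 .
\]

Both options cover the same 7×7 effective receptive field, but the stacked 3×3 layers use fewer parameters and apply three non-linearities instead of one.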
The University of Sydney Page 18
CNN Architecture: Part I
Why popular?
– FC7 features generalize well to other tasks
– Plain network structure
The rule for network design
– Prefer a stack of small filters to a large filter
Image credit to: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
Why small filters?
– Two stacked 3×3 layers have the same receptive field as one 5×5 layer;
three stacked 3×3 layers have the same receptive field as one 7×7 layer
– Deeper, with more non-linearity
– Fewer parameters
The University of Sydney Page 19
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Deeper networks
– 22 layers.
– Auxiliary loss
Computational efficiency
– Inception module
– Remove FC layer, use global average pooling
– 12× fewer parameters than AlexNet
The University of Sydney Page 20
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Design a good local network topology (network in network, NIN) and
then stack these modules on top of each other.
Inception module
The University of Sydney Page 21
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Naive Inception module
– Different receptive fields
– The same output feature dimension
– The effectiveness of pooling
Output size = (W − F + 2P)/S + 1   (W: input size, F: filter size, P: padding, S: stride)
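As a quick worked example with the standard output-size formula above (the 28×28 input size is just an assumption for illustration): a 5×5 convolution with stride 1 and padding 2 gives

\[
\frac{28 - 5 + 2 \cdot 2}{1} + 1 = 28,
\]

so each branch of the naive Inception module preserves the spatial size and the branch outputs can be concatenated along the depth dimension.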
The University of Sydney Page 22
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
1×1 convolutional layer
– preserves spatial dimensions (56×56)
– reduces depth (64 → 32)
The University of Sydney Page 23
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Without 1×1 reduction:
Input: 100×100×128 → 256 filters of size 5×5 (stride=1, pad=2) → feature map: 100×100×256
#parameters: 128×5×5×256

With 1×1 reduction:
Input: 100×100×128 → 32 filters of size 1×1 (stride=1, pad=0) → feature map: 100×100×32
→ 256 filters of size 5×5 (stride=1, pad=2) → feature map: 100×100×256
#parameters: 128×1×1×32 + 32×5×5×256
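A minimal sketch of the comparison above, assuming PyTorch; the layer names are illustrative, not from GoogLeNet's code.

import torch
import torch.nn as nn

x = torch.randn(1, 128, 100, 100)                   # input: 100x100x128

# Direct 5x5 convolution: 128*5*5*256 = 819,200 weights
direct = nn.Conv2d(128, 256, kernel_size=5, stride=1, padding=2, bias=False)

# 1x1 reduction to 32 channels, then 5x5: 128*32 + 32*5*5*256 = 208,896 weights
reduce = nn.Sequential(
    nn.Conv2d(128, 32, kernel_size=1, stride=1, padding=0, bias=False),
    nn.Conv2d(32, 256, kernel_size=5, stride=1, padding=2, bias=False),
)

print(direct(x).shape, reduce(x).shape)               # both torch.Size([1, 256, 100, 100])
print(sum(p.numel() for p in direct.parameters()))    # 819200
print(sum(p.numel() for p in reduce.parameters()))    # 208896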
The University of Sydney Page 24
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Inception module
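A minimal sketch of an Inception module with dimension reduction, assuming PyTorch; ReLUs and batch statistics are omitted for brevity, and the channel numbers follow GoogLeNet's inception (3a) block.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3r, kernel_size=1),      # 1x1 reduction
                                     nn.Conv2d(c3r, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5r, kernel_size=1),      # 1x1 reduction
                                     nn.Conv2d(c5r, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)     # torch.Size([1, 256, 28, 28])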
The University of Sydney Page 25
CNN Architecture: Part I
q GoogLeNet [Szegedy et al., 2014]
Stacked Inception modules
No FC layers
The University of Sydney Page 26
CNN Architecture: Part I
The deeper model should be able to perform at least as well as the shallower
model.
AlexNet (8 layers) → VGG16 (16 layers) → GoogLeNet (22 layers)
The University of Sydney Page 27
CNN Architecture: Part I
q ResNet [He et al., 2015]
Stacking more layers onto a “plain” convolutional neural network
=> the deeper model performs worse, but this is not caused by overfitting (its training error is also higher).
Hypothesis: deeper models are harder to optimize.
The University of Sydney Page 28
CNN Architecture: Part I
q ResNet [He et al., 2015]
Solution: use network layers to fit a residual mapping instead of directly
trying to fit a desired underlying mapping
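A minimal sketch of a basic residual block in this spirit, assuming PyTorch; the names and the batch-norm placement follow the common pattern, not the authors' exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv layers (the residual mapping)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # identity shortcut: fit the residual, add the input back

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])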
The University of Sydney Page 29
CNN Architecture: Part I
q ResNet [He et al., 2015]
Very deep networks using residual connections
Basic design:
– (almost) all 3×3 conv layers
– when the spatial size is halved, the number of filters is doubled
– just deep
– global average pooling instead of FC layers
– no dropout
ResNet
The University of Sydney Page 30
CNN Architecture: Part I
q ResNet [He et al., 2015]
Use a “bottleneck” layer to improve efficiency (similar to GoogLeNet).
ResNet
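A sketch of the bottleneck variant under the same assumptions (batch norm omitted for brevity): 1×1 reduce, 3×3 on the reduced width, 1×1 restore, with the shortcut added at the end.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore) residual block, e.g. 256 -> 64 -> 64 -> 256 channels."""
    def __init__(self, channels: int = 256, mid: int = 64):
        super().__init__()
        self.reduce  = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.conv    = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.restore = nn.Conv2d(mid, channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.conv(out))          # the expensive 3x3 runs on the reduced width
        return F.relu(self.restore(out) + x)  # identity shortcut

print(BottleneckBlock()(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])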
The University of Sydney Page 31
CNN Architecture: Part I
q ResNet [He et al., 2015]
The University of Sydney Page 32
CNN Architecture: Part I
q ResNet [He et al., 2015]
Features matter: deeper features transfer well to other tasks
The University of Sydney Page 33
CNN Architecture: Part II
The University of Sydney Page 34
CNN Architecture: Part II
Inception family: Inception v1, v2, v3, v4; Inception-ResNet v1, v2; Xception
ResNet family: Wide ResNet, ResNeXt, DenseNet
The University of Sydney Page 35
CNN Architecture: Part II
q Inception v2
– batch normalization
– remove LRN and dropout
– small kernels (inspired by VGG)
Inception module v1
(In GoogLeNet)
The University of Sydney Page 36
CNN Architecture: Part II
q Inception v2
Factorization
– 7×7 filter -> 7×1 and 1×7 filters
– 3×3 filter -> 3×1 and 1×3 filters
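A quick check of the saving from this factorization, assuming C input and C output channels at each step and ignoring biases:

\[
7 \times 7 \times C \times C = 49C^2
\quad \longrightarrow \quad
(7 \times 1 + 1 \times 7) \times C \times C = 14C^2 ,
\]

and similarly the 3×3 → 3×1 + 1×3 factorization reduces \(9C^2\) to \(6C^2\).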
The University of Sydney Page 37
CNN Architecture: Part II
q Inception v3
Label Smoothing
For a training example with ground-truth label y, we replace the label distribution q(k|x) with
q′(k|x) = (1 − ε) q(k|x) + ε u(k),
where u(k) is a prior distribution over the labels (e.g., uniform, u(k) = 1/K) and ε is the smoothing parameter.
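A minimal sketch of label smoothing in Python (NumPy), assuming K classes and a uniform prior u(k) = 1/K; recent PyTorch versions also expose this via nn.CrossEntropyLoss(label_smoothing=...).

import numpy as np

def smooth_labels(y: np.ndarray, num_classes: int, eps: float = 0.1) -> np.ndarray:
    """Replace one-hot targets with (1 - eps) * one_hot + eps * uniform."""
    one_hot = np.eye(num_classes)[y]                    # q(k|x): one-hot ground truth
    uniform = np.full(num_classes, 1.0 / num_classes)   # u(k): uniform prior
    return (1.0 - eps) * one_hot + eps * uniform

print(smooth_labels(np.array([2]), num_classes=5, eps=0.1))
# [[0.02 0.02 0.92 0.02 0.02]]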
The University of Sydney Page 38
CNN Architecture: Part II
q Inception v4
The University of Sydney Page 39
CNN Architecture: Part II
q Inception-ResNet v1, v2
The University of Sydney Page 40
CNN Architecture: Part II
q Error on ImageNet-1000 classification
The University of Sydney Page 41
CNN Architecture: Part II
q Xception [François Chollet, 2017]
$!×&!×’!
(“×(#×’!
$$×&$
The idea behind the Inception module is to explicitly factor convolution into a series
of operations that independently look at cross-channel correlations and at spatial
correlations.
The University of Sydney Page 42
CNN Architecture: Part II
q Xception [François Chollet, 2017]
$!×&!×’!
(“×(#×’!
$$×&$
The idea behind the Inception module is to explicitly factor convolution into a series
of operations that independently look at cross-channel correlations and at spatial
correlations.
Xception goes further and assumes that cross-channel correlations and spatial
correlations can be mapped completely separately.
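A minimal sketch of the depthwise separable convolution this factorization leads to, assuming PyTorch: a per-channel spatial convolution followed by a 1×1 pointwise convolution across channels.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

layer = DepthwiseSeparableConv(64, 128)
print(layer(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 56, 56])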
The University of Sydney Page 43
CNN Architecture: Part II
q Wide ResNet (WRNs) [Zagoruyko et al. 2016]
Each percent of improved accuracy costs nearly doubling the number of layers.
Trade-off: decrease the depth and increase the width of residual networks.
The University of Sydney Page 44
CNN Architecture: Part II
q Wide ResNet (WRNs) [Zagoruyko et al. 2016]
– Residuals are important, not depth
– Increasing width instead of depth is more computationally efficient
– more convolutional layers per block
– more feature planes
The University of Sydney Page 45
CNN Architecture: Part II
q Wide ResNet (WRNs) [Zagoruyko et al. 2016]
50-layer wide ResNet outperforms 152-layer original ResNet
The University of Sydney Page 46
CNN Architecture: Part II
q ResNeXt [Xie et al. 2016]
– Split-transform-merge strategy: in an Inception module, the input is split into a few lower-dimensional
embeddings (by 1×1 convolutions), transformed by a set of specialized filters (3×3, 5×5, etc.), and
merged by concatenation.
– Stack building blocks of the same shape, e.g. VGG and ResNets.
Inception model
ResNeXt: repeating a building block that aggregates a set of transformations with the same topology.
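A minimal sketch of a ResNeXt-style block, assuming PyTorch; the grouped 3×3 convolution is equivalent to aggregating 32 parallel transformations with the same topology, and the 256→128→256 channel sizes with cardinality 32 follow the paper's template, though the code itself is only illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    """1x1 -> grouped 3x3 (cardinality groups) -> 1x1, plus an identity shortcut."""
    def __init__(self, channels: int = 256, width: int = 128, cardinality: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, width, kernel_size=1, bias=False)
        # groups=cardinality splits the 3x3 conv into 32 parallel paths with identical topology
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1,
                               groups=cardinality, bias=False)
        self.conv3 = nn.Conv2d(width, channels, kernel_size=1, bias=False)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        return F.relu(self.conv3(out) + x)

print(ResNeXtBlock()(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])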
The University of Sydney Page 47
CNN Architecture: Part II
q ResNeXt [Xie et al. 2016]
Similar complexity: ResNet block ≈ ResNeXt block (comparable #parameters and FLOPs)
The University of Sydney Page 48
CNN Architecture: Part II
q ResNeXt [Xie et al. 2016]
The University of Sydney Page 49
CNN Architecture: Part II
q DenseNet [Huang et al. 2017]
Dense Block
– alleviate the vanishing-gradient problem
– strengthen feature propagation
– encourage feature reuse
– reduce the number of parameters
Each layer has direct access to the gradients from the loss function and the original
input signal, leading to an implicit deep supervision.
ResNets: x_ℓ = H_ℓ(x_{ℓ−1}) + x_{ℓ−1}
DenseNets: x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ−1}])
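A minimal sketch of dense connectivity, assuming PyTorch; each layer takes the concatenation of all previous feature maps as input, and growth_rate is the number of feature maps each layer adds (the bottleneck and transition layers of the full DenseNet are omitted).

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth_rate: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}])
            out = F.relu(conv(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=64, growth_rate=32, num_layers=4)
print(block(torch.randn(1, 64, 28, 28)).shape)   # torch.Size([1, 192, 28, 28])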
The University of Sydney Page 50
CNN Architecture: Part II
q DenseNet [Huang et al. 2017]
Comparison of the DenseNet and ResNet on model size
The University of Sydney Page 51
CNN Architecture: Part III
(Efficient Convolutional Neural Networks)
The University of Sydney Page 52
CNN Architecture: Part III
q Efficient
q fewer parameters (most parameters are in the late FC layers)
q more efficient distributed training
q exporting new models to clients
q …
q less memory cost (most memory is used by the early conv layers)
q embedded deployment (e.g., FPGAs often have less than 10 MB of memory)
q …
q less computation
The University of Sydney Page 53
CNN Architecture: Part III
q SqueezeNet [Iandola et al. 2017]
q MobileNet [Howard et al. 2017]
q ShuffleNet [Zhang et al. 2017]
The University of Sydney Page 54
CNN Architecture: Part III
q SqueezeNet [Iandola et al. 2017]
– Replace 3×3 filters with 1×1 filters
– Decrease the number of input channels to 3×3 filters
– Downsample late in the network so that convolution layers have large
activation maps
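A minimal sketch of a SqueezeNet Fire module, assuming PyTorch, covering strategies 1 and 2: a 1×1 "squeeze" layer reduces the channels fed into the "expand" layer, which mixes 1×1 and 3×3 filters; the channel sizes follow the paper's fire2 module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Fire(nn.Module):
    def __init__(self, in_ch: int, squeeze: int, expand1x1: int, expand3x3: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze, kernel_size=1)        # strategy 2: few inputs to the 3x3 filters
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)  # strategy 1: many 1x1 instead of 3x3 filters
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)

    def forward(self, x):
        s = F.relu(self.squeeze(x))
        return torch.cat([F.relu(self.expand1x1(s)), F.relu(self.expand3x3(s))], dim=1)

fire = Fire(in_ch=96, squeeze=16, expand1x1=64, expand3x3=64)
print(fire(torch.randn(1, 96, 55, 55)).shape)   # torch.Size([1, 128, 55, 55])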
The University of Sydney Page 55
CNN Architecture: Part III
q SqueezeNet [Iandola et al. 2017]
AlexNet-level accuracy on ImageNet with 50× fewer parameters
The University of Sydney Page 56
CNN Architecture: Part III
q MobileNet [Howard et al. 2017]
– Depthwise separable convolutions
– Fewer channels
Compared with standard convolutions, depthwise separable convolutions greatly reduce the
computational cost (the costs are given below).
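Filling in the costs from the MobileNet paper, where \(D_K\) is the kernel size, \(D_F\) the spatial size of the output feature map, \(M\) the number of input channels and \(N\) the number of output channels:

\[
\text{standard convolution: } D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F
\]
\[
\text{depthwise separable: } D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \;+\; M \cdot N \cdot D_F \cdot D_F
\]
\[
\text{reduction: } \frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}
\]

For 3×3 kernels this works out to roughly 8–9× less computation than standard convolutions.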
The University of Sydney Page 57
CNN Architecture: Part III
q MobileNet [Howard et al. 2017]
– fewer parameters
– less computation
The University of Sydney Page 58
CNN Architecture: Part III
q ShuffleNet [Zhang et al. 2017]
Channel shuffle helps information flow across feature channels.
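A minimal sketch of the channel-shuffle operation, assuming PyTorch: reshape the channels into (groups, channels per group), transpose the two group dimensions, and flatten back, so that the next grouped convolution sees channels from every group.

import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    # (N, C, H, W) -> (N, groups, C // groups, H, W) -> swap the group dims -> flatten back
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)     # channels 0..7
print(channel_shuffle(x, groups=2).flatten())    # tensor([0., 4., 1., 5., 2., 6., 3., 7.])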
The University of Sydney Page 59
CNN Architecture: Part III
q ShuffleNet [Zhang et al. 2017]