

Dr Chang Xu

School of Computer Science

Neural Network Architectures


ILSVRC

q Image Classification
  q one of the core problems in computer vision
  q many other tasks (such as object detection and segmentation) can be reduced to image classification

q ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  q 2010 – 2017
  q main tasks: classification and detection
  q a large-scale benchmark for image classification methods
    q 1000 classes
    q 1.2 million training images
    q 50 thousand validation images
    q 150 thousand test images

* Some slides are borrowed from cs231n.stanford.edu


ILSVRC

[Figure: ILSVRC top-5 error over the years, dropping from the SVM-based winners (2010, 2011) through AlexNet (15.4%), ZFNet (11.2%) and VGGNet (6.7%) to ResNet (3.57%); human-level performance is about 5.1%.]

– Big Data

– GPU

– Algorithm improvement


Winning CNN Architectures

Model                  AlexNet   ZF Net   GoogLeNet   ResNet
Year                   2012      2013     2014        2015
#Layers                8         8        22          152
Top-5 error            15.4%     11.2%    6.7%        3.57%
Data augmentation      ✓         ✓        ✓           ✓
Dropout                ✓         ✓
Batch normalization                                   ✓


CNN Architecture: Part I

q LeNet [LeCun et al., 1998]

– Architecture is [CONV-POOL-CONV-POOL-FC-FC]
– CONV: 5×5 filter, stride=1
– POOL: 2×2 filter, stride=2
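As an illustration of the layout above, here is a minimal PyTorch-style sketch (PyTorch is used only for illustration; the channel and FC sizes follow the classic LeNet-5 and are assumptions where the slide does not state them):

import torch
import torch.nn as nn

# A minimal LeNet-style network: [CONV-POOL-CONV-POOL-FC-FC]
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),  # CONV: 5x5 filter, stride 1
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),                # POOL: 2x2 filter, stride 2
            nn.Conv2d(6, 16, kernel_size=5, stride=1),            # CONV: 5x5 filter, stride 1
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),                # POOL: 2x2 filter, stride 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),   # FC
            nn.Tanh(),
            nn.Linear(120, num_classes),  # FC
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Sanity check: a 28x28 grayscale input (e.g. MNIST) maps to num_classes logits.
logits = LeNet()(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])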


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

ILSVRC2010: 28.2%
ILSVRC2011: 25.8%
ILSVRC2012: 16.4% (runner-up: 26.2%)


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

– Activation function: ReLU
– Data Augmentation
– Dropout (drop rate=0.5)
– Local Response Normalization
– Overlapping Pooling

– 11×11 filters in the first convolutional layer
– 5 convolutional layers, 3 fully connected layers


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012] – 5 conv layers, 3 fully connected layers

– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling

ReLU pros:
– Easy to feed forward
– Easy to back-propagate
– Helps prevent saturation
– Sparse activations

ReLU cons:
– Dead ReLU problem


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling

Data augmentation examples:
– Randomly rotate images
– Flip images
– Shift images
– Contrast stretching
– Adaptive equalization
– etc.
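A sketch of how such augmentations are typically wired up with torchvision transforms (the parameter values are illustrative assumptions, not the ones used in the AlexNet paper; ColorJitter stands in for contrast stretching / adaptive equalization, which torchvision does not provide directly):

import torchvision.transforms as T

# Illustrative training-time augmentation pipeline (values are assumptions):
train_transform = T.Compose([
    T.RandomRotation(degrees=15),                      # randomly rotate images
    T.RandomHorizontalFlip(p=0.5),                     # flip images
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # shift images
    T.ColorJitter(contrast=0.4),                       # contrast perturbation
    T.RandomResizedCrop(224),
    T.ToTensor(),
])
# Applied on the fly to each training image, so the model rarely sees the
# exact same input twice.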


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling

In practice, dropout trains an ensemble of up to 2^n networks (n = number of units), all sharing weights.

Dropout is a technique that deals with overfitting by approximately combining the predictions of many different large neural networks at test time.
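A small sketch of how this looks in practice, assuming PyTorch's nn.Dropout: units are dropped randomly at training time, while at test time the full network is used, which approximates averaging the ensemble.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)       # drop rate 0.5, as in AlexNet
x = torch.ones(1, 8)

drop.train()                   # training: each unit is zeroed with probability 0.5,
print(drop(x))                 # survivors are scaled by 1/(1-p) = 2 ("inverted dropout")

drop.eval()                    # test time: dropout is a no-op; the full network is used,
print(drop(x))                 # approximating the average over the 2^n subnetworks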


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization (see the formula below)
– Overlapping Pooling
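For reference, the local response normalization used in AlexNet normalizes each activation across neighbouring channels (the constants below are the ones reported by Krizhevsky et al. 2012):

    b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})^2 )^β

where a^i_{x,y} is the activity of kernel i at position (x, y), N is the number of kernels in the layer, and k = 2, n = 5, α = 10^-4, β = 0.75.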


CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

– Activation function: ReLU
– Data Augmentation
– Dropout
– Local Response Normalization
– Overlapping Pooling

Feature map (4×4):

    -1   4   1   2
     0   1   3  -2
     1   5  -2   6
     3  -1  -2  -2

Max pooling (2×2, stride 2):          Average pooling (2×2, stride 2):

     4   3                                 1   1
     5   6                                 2   0
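The same numbers can be reproduced with PyTorch's pooling ops (a quick sanity check, assuming 2×2 windows with stride 2):

import torch
import torch.nn.functional as F

fmap = torch.tensor([[-1.,  4.,  1.,  2.],
                     [ 0.,  1.,  3., -2.],
                     [ 1.,  5., -2.,  6.],
                     [ 3., -1., -2., -2.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(fmap, kernel_size=2, stride=2))  # [[4., 3.], [5., 6.]]
print(F.avg_pool2d(fmap, kernel_size=2, stride=2))  # [[1., 1.], [2., 0.]]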

Standard pooling: pooling stride = pooling kernel size
Overlapping pooling: pooling stride < pooling kernel size (as used in AlexNet)

CNN Architecture: Part I

q AlexNet [Krizhevsky et al. 2012]

The network is split into two parts, with one half executing on GPU 1 and the other half on GPU 2.

CNN Architecture: Part I

q ZFNet [Zeiler and Fergus, 2013]

– An improved version of AlexNet: top-5 error from 16.4% to 11.7%
– First convolutional layer: 11×11 filter, stride=4 -> 7×7 filter, stride=2
– The number of filters increases as we go deeper.


CNN Architecture: Part I

q VGGNet [Simonyan and Zisserman, 2014]

Small filters:

– 3×3 convolutional layers (stride 1, pad 1)
– 2×2 max-pooling, stride 2

Deeper networks:

– AlexNet: 8 layers
– VGGNet: 16 or 19 layers

[Figure: number of parameters (millions) and top-5 error rate (%) for each network.]


CNN Architecture: Part I
Why small filters?

– Two stacked 3×3 conv layers have the same receptive field as one 5×5 layer;
  three stacked 3×3 layers have the same receptive field as one 7×7 layer
– Deeper, with more non-linearity
– Fewer parameters

Receptive Field (RF): the region in the input space that a particular CNN feature
is looking at (i.e. is affected by).
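A quick parameter count makes the "fewer parameters" point concrete (assuming C input and C output channels per layer, ignoring biases):

    one 5×5 layer:            5 · 5 · C · C = 25C² weights, 5×5 receptive field
    two stacked 3×3 layers:   2 · (3 · 3 · C · C) = 18C² weights, same 5×5 receptive field
    one 7×7 layer:            7 · 7 · C · C = 49C² weights, 7×7 receptive field
    three stacked 3×3 layers: 3 · (3 · 3 · C · C) = 27C² weights, same 7×7 receptive field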

Image credit to mc.ai


CNN Architecture: Part I

Why popular?

– FC7 features generalize well to other tasks
– Plain network structure

The rule for network design

– Prefer a stack of small filters to a large filter

Image credit to: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf



CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Deeper networks

– 22 layers.
– Auxiliary loss

Computational efficiency

– Inception module
– Remove FC layer, use global average pooling
– 12x fewer parameters than AlexNet


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Design a good local network topology ("network in network", NIN) and
then stack these modules on top of each other.

Inception module


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Naive Inception module

– Parallel branches with different receptive fields (1×1, 3×3, 5×5 convolutions, and pooling)
– All branches preserve the same spatial feature dimension, so their outputs can be concatenated
– Includes a pooling branch, since pooling has proven effective in CNNs

Output size = (W − F + 2P)/S + 1
(W: input size, F: filter size, P: padding, S: stride)
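For example, a 5×5 convolution applied with stride 1 and padding 2 to a 56×56 input keeps the spatial size: (56 − 5 + 2·2)/1 + 1 = 56, which is what allows the parallel branches to be concatenated along the depth dimension.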


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

1×1 convolutional layer
– preserves spatial dimensions (56×56)
– reduces depth (64 -> 32)


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Example: 5×5 convolution with and without a 1×1 "bottleneck"

Without a bottleneck:
  Input layer: 100 x 100 x 128
  256 filters of size 128 x 5 x 5, stride=1, pad=2
  Output feature map: 100 x 100 x 256
  #parameters: 128 x 5 x 5 x 256 ≈ 819K

With a 1×1 bottleneck:
  Input layer: 100 x 100 x 128
  32 filters of size 128 x 1 x 1, stride=1, pad=0   ->  feature map 100 x 100 x 32
  256 filters of size 32 x 5 x 5, stride=1, pad=2   ->  feature map 100 x 100 x 256
  #parameters: 128 x 1 x 1 x 32 + 32 x 5 x 5 x 256 ≈ 209K
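These counts are easy to verify with a small PyTorch snippet (bias terms are disabled so the numbers match the slide's weight-only counts):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

plain = nn.Conv2d(128, 256, kernel_size=5, stride=1, padding=2, bias=False)

bottleneck = nn.Sequential(
    nn.Conv2d(128, 32, kernel_size=1, stride=1, padding=0, bias=False),  # 1x1 "bottleneck"
    nn.Conv2d(32, 256, kernel_size=5, stride=1, padding=2, bias=False),
)

print(n_params(plain))       # 819200 = 128*5*5*256
print(n_params(bottleneck))  # 208896 = 128*1*1*32 + 32*5*5*256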


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Inception module


CNN Architecture: Part I

q GoogLeNet [Szegedy et al., 2014]

Stacked Inception modules

No FC layers


CNN Architecture: Part I

The deeper model should be able to perform at least as well as the shallower
model.

AlexNet (8 layers), VGG16 (16 layers), GoogLeNet (22 layers)


CNN Architecture: Part I

q ResNet [He et al., 2015]

Stacking more layers on a "plain" convolutional neural network
=> the deeper model performs worse, and this is not caused by overfitting (its training error is higher as well).

Hypothesis: deeper models are harder to optimize.


CNN Architecture: Part I

q ResNet [He et al., 2015]

Solution: use network layers to fit a residual mapping instead of directly
trying to fit a desired underlying mapping
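Concretely, if H(x) denotes the desired underlying mapping, the stacked layers are trained to fit the residual

    F(x) := H(x) − x,

and the block outputs F(x) + x through an identity skip connection; learning an identity mapping then only requires pushing F(x) towards zero, which is easy for the optimizer.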


CNN Architecture: Part I

q ResNet [He et al., 2015]

Very deep networks using residual connections

Basic design:

– all 3×3 conv (almost)
– spatial size /2 => #filter x2
– just deep
– average pooling (remove FC)
– no dropout



CNN Architecture: Part I

q ResNet [He et al., 2015]

Use a "bottleneck" layer to improve efficiency (similar to GoogLeNet).
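A minimal sketch of such a bottleneck residual block in PyTorch (the 256 -> 64 -> 64 -> 256 channel sizes follow the example in the ResNet paper; stride and downsampling handling are omitted for brevity):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),        # reduce depth
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),  # 3x3 conv at low depth
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),        # restore depth
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)   # residual (skip) connection

y = Bottleneck()(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56])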



CNN Architecture: Part I

q ResNet [He et al., 2015]


CNN Architecture: Part I

q ResNet [He et al., 2015]

Features matter: deeper features transfer well to other tasks


CNN Architecture: Part II


CNN Architecture: Part II

Inception family: Inception v1, v2, v3, v4; Inception-ResNet v1, v2; Xception

ResNet family: Wide ResNet; ResNeXt; DenseNet


CNN Architecture: Part II

q Inception v2

– batch normalization
– remove LRN and dropout
– small kernels (inspired by VGG)

Inception module v1
(In GoogLeNet)


CNN Architecture: Part II

q Inception v2

Factorization
– 7×7 filter -> 7×1 and 1×7 filters
– 3×3 filter -> 3×1 and 1×3 filters
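Per input/output channel pair, a 7×7 filter has 49 weights while a 7×1 followed by a 1×7 filter has 7 + 7 = 14 (about 3.5x fewer); likewise, 3×3 -> 3×1 + 1×3 reduces 9 weights to 6, while adding an extra non-linearity between the two smaller convolutions.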


CNN Architecture: Part II

q Inception v3

Label Smoothing

For a training example with ground-truth label y, we replace the one-hot label
distribution q(k|x) = δ_{k,y} with

    q'(k|x) = (1 − ε) δ_{k,y} + ε u(k),

where u(k) is a prior distribution over labels (e.g. uniform, u(k) = 1/K) and ε is a small smoothing constant.
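In modern frameworks this is built in; for instance, PyTorch's cross-entropy loss (version 1.10 and later) accepts a label_smoothing argument. The ε = 0.1 below is illustrative:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # ε = 0.1, uniform prior u(k) = 1/K

logits = torch.randn(4, 1000)             # batch of 4, K = 1000 classes
targets = torch.randint(0, 1000, (4,))
loss = criterion(logits, targets)         # cross-entropy against the smoothed targets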


CNN Architecture: Part II

q Inception v4


CNN Architecture: Part II

q Inception-ResNet v1, v2


CNN Architecture: Part II

q Error on ImageNet-1000 classification


CNN Architecture: Part II

q Xception [François Chollet, 2017]

The idea behind the Inception module is to explicitly factorize convolution into a
series of operations that independently look at cross-channel correlations and at
spatial correlations.


CNN Architecture: Part II

q Xception [François Chollet, 2017]

The "extreme" version of this hypothesis: assume that cross-channel correlations and
spatial correlations can be mapped completely separately, which leads to depthwise
separable convolutions.


CNN Architecture: Part II

q Wide ResNet (WRNs) [Zagoruyko et al. 2016]

Each fraction of a percent of improved accuracy costs nearly doubling the number of layers.

Trade-off: decrease the depth and increase the width of residual networks.


CNN Architecture: Part II

q Wide ResNet (WRNs) [Zagoruyko et al. 2016]

– The residual connections are what matter, not extreme depth
– Increasing width instead of depth is more computationally efficient
– Blocks are scaled up by using more convolutional layers per block and more feature planes per layer


CNN Architecture: Part II

q Wide ResNet (WRNs) [Zagoruyko et al. 2016]

50-layer wide ResNet outperforms 152-layer original ResNet


CNN Architecture: Part II

q ResNeXt [Xie et al. 2016]

– Split-transform-merge strategy. In an Inception module, the input is split into a few
  lower-dimensional embeddings (by 1×1 convolutions), transformed by a set of specialized
  filters (3×3, 5×5, etc.), and merged by concatenation.

– Stack building blocks of the same shape, e.g. VGG and ResNets.

Inception model

ResNeXt: repeating a building block that aggregates a set of transformations with the same topology.
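In the ResNeXt paper this is written as an aggregated transformation: with cardinality C (the number of parallel paths sharing the same topology), a block computes

    y = x + Σ_{i=1}^{C} T_i(x),

where each T_i is a small bottleneck transformation; in practice the aggregation is implemented efficiently as a grouped convolution.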


CNN Architecture: Part II

q ResNeXt [Xie et al. 2016]

A ResNeXt block is designed to have complexity (parameters and FLOPs) similar to the corresponding ResNet block.


CNN Architecture: Part II

q ResNeXt [Xie et al. 2016]


CNN Architecture: Part II

q DenseNet [Huang et al. 2017]

Dense Block

– alleviate the vanishing-gradient problem
– strengthen feature propagation
– encourage feature reuse
– reduce the number of parameters

Each layer has direct access to the gradients from the loss function and the original
input signal, leading to an implicit deep supervision.

ResNets:   x_l = H_l(x_{l−1}) + x_{l−1}

DenseNets: x_l = H_l([x_0, x_1, …, x_{l−1}])
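A minimal sketch of this dense connectivity in PyTorch (the growth rate and layer count are illustrative assumptions):

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """H_l: BN-ReLU-Conv applied to the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth_rate=12):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, features):              # features: list of all earlier outputs
        return self.h(torch.cat(features, dim=1))

# A tiny dense block with 3 layers and growth rate 12:
layers = [DenseLayer(16 + i * 12) for i in range(3)]
features = [torch.randn(1, 16, 32, 32)]       # x_0
for layer in layers:
    features.append(layer(features))          # x_l = H_l([x_0, ..., x_{l-1}])
print(torch.cat(features, dim=1).shape)       # torch.Size([1, 52, 32, 32]) = 16 + 3*12 channels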


CNN Architecture: Part II

q DenseNet [Huang et al. 2017]

Comparison of DenseNet and ResNet in terms of model size


CNN Architecture: Part III

(Efficient Convolutional Neural Networks)


CNN Architecture: Part III

q Efficient

q Fewer parameters (most parameters are in the late FC layers)
  – more efficient distributed training
  – exporting new models to clients
  – …

q Less memory cost (most memory is used by the early conv layers)
  – embedded deployment (e.g., FPGAs often have less than 10 MB of memory)
  – …

q Less computation


CNN Architecture: Part III

q SqueezeNet [Iandola et al. 2017]

q MobileNet [Howard et al. 2017]

q ShuffleNet [Zhang et al. 2017]


CNN Architecture: Part III

q SqueezeNet [Iandola et al. 2017]

– Strategy 1: Replace 3×3 filters with 1×1 filters
– Strategy 2: Decrease the number of input channels to 3×3 filters
– Strategy 3: Downsample late in the network so that convolutional layers have large activation maps
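The first two strategies come together in SqueezeNet's "Fire" module; a minimal sketch in PyTorch (the channel sizes follow the fire2 configuration reported in the paper and are otherwise assumptions):

import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs (strategies 1 and 2), then expand with a mix of 1x1 and 3x3 filters."""
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_channels, squeeze, kernel_size=1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze, expand1x1, kernel_size=1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)     # few channels feed the 3x3 filters (strategy 2)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

y = Fire(96, squeeze=16, expand1x1=64, expand3x3=64)(torch.randn(1, 96, 55, 55))
print(y.shape)  # torch.Size([1, 128, 55, 55])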


CNN Architecture: Part III

q SqueezeNet [Iandola et al. 2017]

AlexNet-level accuracy on ImageNet with 50x fewer parameters


CNN Architecture: Part III

q MobileNet [Howard et al. 2017]
– Depthwise separable convolutions
– Fewer channels

Standard convolutions have a computational cost of:

    D_K · D_K · M · N · D_F · D_F

Depthwise separable convolutions cost:

    D_K · D_K · M · D_F · D_F  +  M · N · D_F · D_F

This is a reduction in computation of:

    1/N + 1/D_K²

(D_K: kernel size, D_F: output feature-map size, M: input channels, N: output channels)
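A depthwise separable convolution factors a standard convolution into a per-channel (depthwise) 3×3 convolution followed by a 1×1 pointwise convolution; a PyTorch sketch with illustrative channel counts:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

M, N = 64, 128   # input / output channels (illustrative assumption)

standard = nn.Conv2d(M, N, kernel_size=3, padding=1, bias=False)

separable = nn.Sequential(
    nn.Conv2d(M, M, kernel_size=3, padding=1, groups=M, bias=False),  # depthwise: one 3x3 filter per channel
    nn.Conv2d(M, N, kernel_size=1, bias=False),                       # pointwise: 1x1 across channels
)

print(n_params(standard))   # 73728 = 3*3*64*128
print(n_params(separable))  # 8768  = 3*3*64 + 64*128  (about 1/N + 1/9 of the standard cost)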


CNN Architecture: Part III

q MobileNet [Howard et al. 2017]

– fewer parameters
– less computation


CNN Architecture: Part III

q ShuffleNet [Zhang et al. 2017]
Channel shuffle helps information flow across feature-channel groups when grouped convolutions are used.
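Channel shuffle itself is just a reshape-transpose-reshape on the channel dimension; a small sketch (the group count below is an illustrative assumption):

import torch

def channel_shuffle(x, groups):
    """Mix channels across groups so grouped convolutions can exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group dimensions
    return x.view(n, c, h, w)                  # flatten back to (n, c, h, w)

x = torch.arange(8.).view(1, 8, 1, 1)          # channels 0..7
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 4., 1., 5., 2., 6., 3., 7.])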


CNN Architecture: Part III

q ShuffleNet [Zhang et al. 2017]