The University of Sydney Page 1
Convolutional
Neural Networks
Dr Chang Xu
School of Computer Science
History of CNNs
Neocognitron (Kunihiko Fukushima, 1980)
History of CNNs
LeNet-5 (LeCun et al, 1998)
– Built the modern framework of CNNs: Convolutional Layer, Pooling
Layer, and Fully-Connected Layer
– Trained with the backpropagation algorithm
– Classifies handwritten digits; however, it cannot perform well on more
complex problems, e.g., large-scale image and video classification
History of CNNs
Linear classifier: 8% ~ 12% error
K-nearest-neighbor: 1.x% ~ 5% error
Support Vector Machine: 0.6% ~ 1.4% error
(Conventional) Neural Nets: 1% ~ 5% error
“The MNIST Database”
History of CNNs
AlexNet (Krizhevsky et al, 2012)
– Significant improvements on the image classification task, ImageNet 2012
– The network achieved a top-5 error of 15.3%, more than 10.8 percentage
points ahead of the runner-up.
– Basic framework of CNNs with a deeper structure
– Benefit from ImageNet dataset, GPUs, ReLU, Dropout …
5 convolutional layers and 3 fully connected layers
Today, CNNs are everywhere
– Image classification, Image segmentation, Pose estimation, Style
transfer, Image detection, Image caption …
(Krizhevsky et al, 2012) (Shaoli et al, 2017) (Jianfeng et al, 2017) (Xinyuan et al, 2018)
Basic CNN Components
A general CNN
– Convolutional Layer
– Pooling
– Fully-connected Layer
(https://leonardoaraujosantos.gitbooks.io)
A toy example
https://github.com/pytorch/examples/blob/master/mnist/main.py
https://pytorch.org/docs/stable/nn.html#convolution-layers
https://pytorch.org/docs/stable/nn.html#linear
Convolution layers in PyTorch
https://pytorch.org/docs/stable/nn.html#convolution-layers
Convolutional Layer
– Give a simple example: take a grayscale image as input
Grayscale image: $X$; filter: $K$; output: feature map
(figure: a $3\times3$ neighbourhood $x_{1,1}, x_{1,2}, \cdots, x_{3,3}$ of the image is weighted by the filter to produce one value of the feature map)
Convolutional Layer
– Convolution
Input (6×6):
1 2 0 1 0 1
2 1 1 0 0 1
1 0 0 2 1 0
2 0 0 0 2 1
0 1 1 2 0 2
1 0 1 0 1 1
Filter (3×3):
1 0 -1
-1 0 0
0 0 1
Output (first value): -1
(The filter then slides across the input one step at a time; successive slides fill in the remaining output values one by one.)
Convolutional Layer
– Convolution
Input (6×6):
1 2 0 1 0 1
2 1 1 0 0 1
1 0 0 2 1 0
2 0 0 0 2 1
0 1 1 2 0 2
1 0 1 0 1 1
Filter (3×3):
1 0 -1
-1 0 0
0 0 1
Output (4×4):
-1 2 0 0
0 1 3 -2
0 0 -1 4
3 -1 -2 -2
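As a check, the whole worked example can be reproduced in a few lines of plain Python. This is an illustrative re-implementation, not the lecture's code; in PyTorch, `nn.Conv2d` performs the same sliding-window sum of products, with learned weights:

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation (what deep-learning libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

image = [[1, 2, 0, 1, 0, 1],
         [2, 1, 1, 0, 0, 1],
         [1, 0, 0, 2, 1, 0],
         [2, 0, 0, 0, 2, 1],
         [0, 1, 1, 2, 0, 2],
         [1, 0, 1, 0, 1, 1]]
kernel = [[1, 0, -1],
          [-1, 0, 0],
          [0, 0, 1]]

feature_map = conv2d(image, kernel)
# → [[-1, 2, 0, 0], [0, 1, 3, -2], [0, 0, -1, 4], [3, -1, -2, -2]]
```

Each output value is the sum of the element-wise products between the filter and one 3×3 window of the input.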
Convolutional Layer
– Stride
Input (6×6):
1 2 0 1 0 1
2 1 1 0 0 1
1 0 0 2 1 0
2 0 0 0 2 1
0 1 1 2 0 2
1 0 1 0 1 1
Filter (3×3):
1 0 -1
-1 0 0
0 0 1
Stride = 1: the first two output values are -1 and 2.
The stride size is defined by how much you want to shift your filter at each step.
Convolutional Layer
– Stride
Input (6×6):
1 2 0 1 0 1
2 1 1 0 0 1
1 0 0 2 1 0
2 0 0 0 2 1
0 1 1 2 0 2
1 0 1 0 1 1
Filter (3×3):
1 0 -1
-1 0 0
0 0 1
Stride = 3: the first two output values are -1 and 0.
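The stride simply changes which window positions are evaluated. A small sketch, reusing the example's input and filter, reproduces the stride-3 values shown on the slides:

```python
def conv2d_stride(image, kernel, stride=1):
    """Valid 2-D cross-correlation, shifting the filter `stride` pixels per step."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(0, len(image[0]) - kw + 1, stride)]
            for i in range(0, len(image) - kh + 1, stride)]

image = [[1, 2, 0, 1, 0, 1],
         [2, 1, 1, 0, 0, 1],
         [1, 0, 0, 2, 1, 0],
         [2, 0, 0, 0, 2, 1],
         [0, 1, 1, 2, 0, 2],
         [1, 0, 1, 0, 1, 1]]
kernel = [[1, 0, -1],
          [-1, 0, 0],
          [0, 0, 1]]

# Stride 1 visits every window and gives the 4x4 map from the previous slides;
# stride 3 evaluates only every third window, giving a 2x2 map.
conv2d_stride(image, kernel, stride=3)  # → [[-1, 0], [3, -2]]
```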
Convolutional Layer
– Zero padding (pad = 1)
(figure: the 6×6 input surrounded by a one-pixel border of zeros, giving an 8×8 padded input)
By doing this you can apply the filter to every element of your input matrix.
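Zero padding can be sketched as wrapping the input in a border of zeros before convolving; with pad = 1 and a 3×3 filter, the output keeps the 6×6 input size. An illustrative sketch, not the lecture's code:

```python
def pad(image, p):
    """Surround a 2-D map with a border of p zeros."""
    width = len(image[0]) + 2 * p
    border = [[0] * width for _ in range(p)]
    return border + [[0] * p + row + [0] * p for row in image] + border

def conv2d(image, kernel):
    """Valid 2-D cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

image = [[1, 2, 0, 1, 0, 1],
         [2, 1, 1, 0, 0, 1],
         [1, 0, 0, 2, 1, 0],
         [2, 0, 0, 0, 2, 1],
         [0, 1, 1, 2, 0, 2],
         [1, 0, 1, 0, 1, 1]]
kernel = [[1, 0, -1],
          [-1, 0, 0],
          [0, 0, 1]]

padded = pad(image, 1)         # 8x8: every original pixel now has a full window
same = conv2d(padded, kernel)  # 6x6 output: "same" spatial size as the input
```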
Convolutional Layer
– Output Size
With input size $W$, filter size $F$, padding $P$, and stride $S$:
$$\text{Output Size} = \frac{W - F + 2P}{S} + 1$$
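A one-line check of the formula (floor division mirrors how frameworks discard incomplete windows); the example numbers below come from the earlier slides:

```python
def output_size(w, f, p, s):
    """Output size of a convolution: (W - F + 2P) // S + 1."""
    return (w - f + 2 * p) // s + 1

assert output_size(w=6, f=3, p=0, s=1) == 4  # the 4x4 map from the example
assert output_size(w=6, f=3, p=0, s=3) == 2  # the stride-3 slide
assert output_size(w=6, f=3, p=1, s=1) == 6  # pad = 1 keeps the "same" size
```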
Learn multiple filters
Input (6×6):
1 2 0 1 0 1
2 1 1 0 0 1
1 0 0 2 1 0
2 0 0 0 2 1
0 1 1 2 0 2
1 0 1 0 1 1
Filter 1 (3×3):
1 0 -1
-1 0 0
0 0 1
Feature map 1 (4×4):
-1 2 0 0
0 1 3 -2
0 0 -1 4
3 -1 -2 -2
Filter 2 (3×3; its exact entries are garbled in the extracted slides) produces a second feature map in the same way:
Feature map 2 (4×4):
3 2 4 -1
1 0 1 4
1 2 4 1
-1 0 3 4
…
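Stacking feature maps from several filters can be sketched as below. Filter 2's entries are not legible in the extracted slides, so a stand-in second filter is used here purely to show the mechanics of building one map per filter:

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

image = [[1, 2, 0, 1, 0, 1],
         [2, 1, 1, 0, 0, 1],
         [1, 0, 0, 2, 1, 0],
         [2, 0, 0, 0, 2, 1],
         [0, 1, 1, 2, 0, 2],
         [1, 0, 1, 0, 1, 1]]

filters = [
    [[1, 0, -1], [-1, 0, 0], [0, 0, 1]],  # filter 1 from the slides
    [[0, 2, 1], [0, 1, -1], [-1, 1, 0]],  # stand-in for filter 2 (original garbled)
]

# One 4x4 feature map per filter; together they form a stack of feature maps.
feature_maps = [conv2d(image, k) for k in filters]
```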
Convolutional Layer
– Above, we have only considered a 2-D image as input
– When the input has depth (e.g. RGB images), the
convolution ops should be…
$%×&%×’%
(&×(‘×’%
$#×&#
Convolutional Layer
Two filters
Stride=2
Zero-padding=1
(figure, animated over several slides: the two filters step across the zero-padded input with stride 2, each producing its own output map)
Convolutional Layer
– Suppose the stride is $(S_w, S_h)$ and the pad is $(P_w, P_h)$:
$$W_2 = \frac{W_1 - F + 2P_w}{S_w} + 1 \qquad H_2 = \frac{H_1 - F + 2P_h}{S_h} + 1$$
Input: $W_1 \times H_1 \times D_1$; filters: $F \times F \times D_1$; output: $W_2 \times H_2$.
Convolution as a matrix operation
– If the input $x^{l-1}$ and output $x^{l}$ were to be unrolled into vectors, the convolution could be represented as a sparse matrix $C^{l-1}$ where the non-zero elements are the elements $w_{i,j}$ of the kernel:
$$C^{l-1}\,\big(x^{l-1}_{1,1}, \cdots, x^{l-1}_{4,4}\big)^{\top} = \big(x^{l}_{1,1}, \cdots, x^{l}_{2,2}\big)^{\top}$$
(figure: a $3\times3$ kernel $w$ applied to a $4\times4$ input $x^{l-1}$; each row of $C^{l-1}$ contains the kernel weights $w_{1,1}, \cdots, w_{3,3}$ in the columns of the corresponding input window and zeros elsewhere)
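The unrolled form can be verified mechanically: build the sparse matrix $C$ from the kernel, multiply it with the flattened input, and compare against the direct sliding-window result. An illustrative sketch for a 3×3 kernel on a 4×4 input:

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def conv_matrix(kernel, in_h, in_w):
    """Sparse matrix C such that C @ vec(x) == vec(conv2d(x, kernel))."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    C = [[0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    # kernel weight w[m][n] multiplies input pixel (i+m, j+n)
                    C[i * out_w + j][(i + m) * in_w + (j + n)] = kernel[m][n]
    return C

x = [[1, 2, 0, 1], [2, 1, 1, 0], [1, 0, 0, 2], [2, 0, 0, 0]]  # 4x4 input
w = [[1, 0, -1], [-1, 0, 0], [0, 0, 1]]                       # 3x3 kernel

C = conv_matrix(w, 4, 4)  # 4 output pixels x 16 input pixels
vec_x = [v for row in x for v in row]
vec_y = [sum(c * v for c, v in zip(row, vec_x)) for row in C]  # C @ vec(x)
direct = [v for row in conv2d(x, w) for v in row]
assert vec_y == direct  # matrix form agrees with the sliding window
```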
Back-propagation in convolutional layer
– Forward pass (matrix form): $C^{l-1}\,\big(x^{l-1}_{1,1}, \cdots, x^{l-1}_{4,4}\big)^{\top} = \big(x^{l}_{1,1}, \cdots, x^{l}_{2,2}\big)^{\top}$
– Backward pass, gradient of the loss w.r.t. the kernel weights:
$$\frac{\partial \mathrm{Loss}}{\partial w_{i,j}} = \sum_{h,w} \frac{\partial \mathrm{Loss}}{\partial x^{l}_{h,w}}\,\frac{\partial x^{l}_{h,w}}{\partial w_{i,j}}, \qquad \frac{\partial x^{l}_{h,w}}{\partial w_{i,j}} = x^{l-1}_{h+i-1,\,w+j-1}$$
– Backward pass, gradient of the loss w.r.t. the input:
$$\frac{\partial \mathrm{Loss}}{\partial x^{l-1}} = \big(C^{l-1}\big)^{\top}\,\frac{\partial \mathrm{Loss}}{\partial x^{l}}$$
(Note that $x^{l}_{u}$ represents the $u$-th element of the unrolled vector $x^{l}$; here $u = (h-1)\times W + w$.)
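The weight-gradient formula can be sanity-checked numerically. With a toy loss that just sums the output (so every $\partial \mathrm{Loss}/\partial x^{l}_{h,w} = 1$), the analytic gradient reduces to sums of input windows, which finite differences should match. An illustrative sketch with zero-based indices:

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def loss(image, kernel):
    """Toy loss: sum of all outputs, so dLoss/dx^l = 1 everywhere."""
    return sum(v for row in conv2d(image, kernel) for v in row)

X = [[0.5, -1.0, 2.0, 0.0],
     [1.5, 0.0, -0.5, 1.0],
     [0.0, 2.0, 1.0, -1.0],
     [1.0, 0.5, 0.0, 2.0]]
W = [[1.0, 0.0, -1.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Analytic: dLoss/dw[m][n] = sum over output positions (i, j) of X[i+m][j+n]
analytic = [[sum(X[i + m][j + n] for i in range(2) for j in range(2))
             for n in range(3)] for m in range(3)]

# Numerical check by central finite differences
eps = 1e-6
for m in range(3):
    for n in range(3):
        W[m][n] += eps
        up = loss(X, W)
        W[m][n] -= 2 * eps
        down = loss(X, W)
        W[m][n] += eps  # restore the original weight
        assert abs((up - down) / (2 * eps) - analytic[m][n]) < 1e-5
```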
Receptive Field
Receptive Field
– The receptive field in Convolutional Neural Networks (CNN) is the
region of the input space that affects a particular unit of the network.
– In this example, we use a convolution filter $k$ of size $3\times3$,
padding $p = 1$, stride $s = 2\times2$.
Receptive Field
– From the left column it is hard to tell the receptive field size,
especially for deep CNNs.
– The right column shows the fixed-sized CNN visualization, which
solves the problem by keeping the size of all feature maps constant
and equal to the input map. Each feature is then marked at the center
of its receptive field location.
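The receptive field of a deep stack can also be computed rather than visualized. A standard recurrence (an illustrative helper, not from the slides): each layer adds $(k - 1) \times \text{jump}$ input pixels, where jump is the product of all earlier strides:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride); returns the receptive field,
    in input pixels, of one unit after the last layer."""
    rf, jump = 1, 1  # jump: input-pixel distance between adjacent units
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

receptive_field([(3, 2)])          # → 3 (one 3x3 conv sees a 3x3 patch)
receptive_field([(3, 2), (3, 2)])  # → 7 (stacking grows the field quickly)
```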
Convolution layers in PyTorch
https://pytorch.org/docs/stable/nn.html#convolution-layers
Dilated Convolution
– In simple terms, a dilated convolution is just a convolution applied to the input with defined gaps.
– Dilation: spacing between kernel elements. Default: 1.
– D = 2 means one pixel is skipped between sampled inputs.
– The receptive field grows exponentially while the number of parameters grows linearly.
(Yu et al, 2015)
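The exponential-growth claim can be checked with the same receptive-field recurrence: a size-$k$ kernel with dilation $d$ covers $k + (k-1)(d-1)$ input pixels, so stacking 3×3 convolutions with dilations 1, 2, 4 (as in Yu et al, 2015) reaches a 15-pixel receptive field while each layer still has only 9 weights. An illustrative sketch:

```python
def effective_kernel(k, d):
    """A size-k kernel with dilation d spans k + (k - 1) * (d - 1) input pixels."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        rf += (effective_kernel(k, d) - 1) * jump
        jump *= s
    return rf

# Three stacked 3x3 convs, stride 1, dilations 1, 2, 4:
receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])  # → 15, with only 3 x 9 weights
```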
Pooling
Pooling
Max pooling
– Filter size: (2,2)
– Stride: (2,2)
– Pooling ops: max(·)
Feature map:
-1 2 0 0
0 1 3 -2
0 0 -1 4
3 -1 -2 -2
Subsample map:
2 3
3 4
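The subsample map can be reproduced with a small helper; this is an illustrative sketch (PyTorch's `nn.MaxPool2d(2)` performs the same operation):

```python
def pool2x2(fmap, op):
    """Non-overlapping 2x2 pooling with stride 2, applying `op` to each window."""
    return [[op([fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1]])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[-1, 2, 0, 0],
        [0, 1, 3, -2],
        [0, 0, -1, 4],
        [3, -1, -2, -2]]

pool2x2(fmap, max)  # → [[2, 3], [3, 4]]
```

Passing `lambda w: sum(w) / len(w)` instead of `max` gives the average pooling discussed on a later slide.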
Motivation: Pooling
– Pooling helps the representation become slightly invariant to
small translations of the input
– Invariance to local translation can be a very useful property if we care
more about whether some feature is present than exactly where it is
– Taking max pooling as an example:
(figure: a row of detector outputs and the corresponding max-pooled outputs, before and after shifting the input by one pixel; most of the pooled values are unchanged)
Motivation: Pooling
– Because pooling summarizes the responses over a whole
neighbourhood, it is possible to use fewer pooling units than
detector units
– Since pooling is used for down sampling, it can be used to
handle inputs of varying sizes
(figure: six detector outputs pooled with a stride of two are summarized by three pooled outputs)
Pooling
Average pooling
– Filter size: (2,2)
– Stride: (2,2)
– Pooling ops: mean(·)
Feature map:
-1 4 1 2
0 1 3 -2
1 5 -2 6
3 -1 -2 -2
Max pooling:
4 3
5 6
Average pooling:
1 1
2 0
Pooling
$\ell_2$ norm pooling
– Filter size (Gaussian kernel size): (2,2)
– Stride: (2,2)
– Pooling ops: $y_k = \big(\sum_i g_i\, x_{k,i}^{2}\big)^{1/2}$, where the $g_i$ are the Gaussian window weights and the $x_{k,i}$ are the activations in the $k$-th window
(figure: a $4\times4$ feature map $x_{1,1}, \cdots, x_{4,4}$, a $2\times2$ Gaussian window $g_1, \cdots, g_4$, and the $2\times2$ output $y_1, \cdots, y_4$)
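A minimal sketch of weighted $\ell_p$ pooling over one window, assuming the Gaussian-weighted root-sum-square form above (the weights are placeholders; with $p = 2$ and uniform weights this is plain $\ell_2$ pooling):

```python
def lp_pool(window, weights, p=2):
    """Weighted l_p pooling of one window: (sum_i g_i * |x_i|**p) ** (1/p).
    p = 1 gives a weighted sum; large p approaches max pooling."""
    return sum(g * abs(x) ** p for g, x in zip(weights, window)) ** (1.0 / p)

# With uniform weights and p = 2, a window holding 3 and 4 pools to 5:
lp_pool([3, 4], [1, 1], p=2)  # → 5.0
```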
Pooling
– Other pooling
– $L_p$ pooling (preserves the class-specific spatial/geometric information in the pooled features)
$$y = \Big(\frac{1}{N}\sum_{i=1}^{N} x_i^{\,p}\Big)^{1/p}$$
– Mixed pooling (addresses the over-fitting problem)
$$y = \lambda\,\max(x_1, \ldots, x_N) + (1-\lambda)\,\mathrm{mean}(x_1, \ldots, x_N)$$
– Stochastic pooling (hyper-parameter free, regularizes large CNNs)
$$y = x_k, \quad k \sim P(p_1, \cdots, p_N), \quad p_i = \frac{x_i}{\sum_j x_j}$$
– Spectral pooling (preserves considerably more information per parameter than other pooling strategies): take $F = \mathcal{F}(x) \in \mathbb{C}^{H\times W}$, crop the low-frequency submatrix $\hat F \in \mathbb{C}^{\hat H \times \hat W}$, and invert: $\hat x = \mathcal{F}^{-1}(\hat F)$
– …
Why CNNs?
Motivation: convolution
– Problems of fully connected neural networks
– Every output unit interacts with every input unit
– The number of weights grows rapidly with the size of the input image
– Distant pixels are less correlated
Motivation: convolution
– Locally connected neural net
– Sparse connectivity: a hidden unit is only connected to a local patch
– It is inspired by biological systems, where a cell is sensitive to a small sub-region, called a receptive field
– Here, the receptive field can also be called a filter or kernel
Motivation: convolution
– Problems of locally connected neural nets
– The learned filter is a spatially local pattern
– A hidden node at a higher layer has a larger receptive field in the input
– Stacking many such layers leads to "filters" (no longer linear) that become increasingly "global" (Ranzato, CVPR'13)
Motivation: convolution
– Shared weights
– Translation invariance: capture statistics in local patches, independent of their location
– Hidden nodes at different locations share the same weights.
It greatly reduces the number of parameters to learn.
Example: 1000×1000 image, 1 filter of size 10×10 → 100 parameters (Ranzato, CVPR'13)
Motivation: convolution
– Multiple filters
– Multiple filters make it possible to detect the spatial distributions of multiple visual patterns
– One filter builds one feature map; multiple filters build a stack of feature maps
Example: 1000×1000 image, 100 filters of size 10×10 → 10k parameters (Ranzato, CVPR'13)
Motivation: convolution
– Multiple filters: intuitive examples
(figure: the same input processed by different filters: image blurring, edge detection, image enhancement, vertical-edge detection)
Visualize features
– Why do CNNs work so well? Hierarchical convolution and nonlinear operations (ReLU, max pooling, …). What happens inside the hidden layers?
(figure: an input image mapped through the network to class scores, 1000 numbers)
Visualize features
– Give insight into the function of intermediate feature layers and the operation of the classifier (Zeiler and Fergus, 2014)
Thank you!