Lecture 13: Convolutional Neural Networks
Qiuhong Ke
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
Multi-layer perceptron: A fully connected network
[Figure: a 9×9 input image is flattened into an 81×1 vector and passed through the input layer, hidden layer, and output layer.]
An MLP consists of only fully connected (FC) layers.
Disadvantage: not spatially invariant. Shifting a pattern to a different location changes the flattened input vector, so the network treats the two images as different inputs.
Disadvantage: the number of parameters grows rapidly as more hidden layers (and units) are added.
[Image: examples from the Caltech-UCSD Birds 200 dataset. Source: Welinder, Peter, et al. "Caltech-UCSD Birds 200." (2010).]
Convolutional Neural Network (CNN)
Convolution, Max-Pooling, and Fully Connected (FC) layers
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
AlexNet – ImageNet Classification with Deep Convolutional Neural Networks
Outline
• Convolutional layer
• Max-pooling layer
• Additional notes on training neural networks
  • Batch size
  • Optimisation algorithms
  • Activation function
  • How to prevent overfitting
Tool: Keras
Easy, simple, and powerful:
• Build the architecture (add layers from input to output, e.g. FC layers, convolution layers, …)
• Select an optimisation algorithm (e.g. SGD; more later in this lecture)
• Select the loss function
• Compile the model and train the model (a sketch of all three steps follows below)
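As a minimal sketch of these steps in Keras (the 784-dimensional flattened input, the layer sizes, and the 10 output classes are illustrative assumptions, not from the slides; the training data here is random dummy data just so the snippet runs):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Build the architecture: add layers from input to output
model = keras.Sequential([
    keras.Input(shape=(784,)),               # flattened image input (assumed size)
    layers.Dense(128, activation="relu"),    # hidden FC layer
    layers.Dense(10, activation="softmax"),  # output layer (assumed 10 classes)
])

# Select an optimisation algorithm (SGD) and a loss function, then compile
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data for illustration only
x_train = np.random.rand(100, 784)
y_train = keras.utils.to_categorical(np.random.randint(10, size=100), 10)

# Train the model
model.fit(x_train, y_train, batch_size=32, epochs=10)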
• To do classification, we can first extract local features (i.e. identify local patterns) and then combine the local features for classification.
• An image can be decomposed into local patches, and different local patches can contain different patterns.
Identify different patterns at local patches
A filter (kernel) is applied to a local patch by element-wise multiplication followed by a sum; the result is the response of the filter at that patch.
[Figure: element-wise multiplication of a kernel with two different patches; one gives sum = 2, the other sum = 1.]
Input and kernel have the same pattern: high response
[Figure: the same kernel gives sum = 1 on a patch with a different pattern and sum = 2 on a patch matching its own pattern.]
Identify different patterns
[Figure: a single kernel gives the same response (sum = 2) on two different patches, so one kernel alone cannot distinguish every pattern.]
Different kernels identify different patterns
[Figure: a different kernel applied to the same two patches gives distinct responses: sum = 2 and sum = 5.]
Convolution on 2D
Use a kernel to perform element-wise multiplication and a sum for every local patch of the 2D input.
(Figure 9.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.)
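A minimal NumPy sketch of this operation (stride 1, no padding; the input and kernel sizes are illustrative):

import numpy as np

def conv2d(image, kernel):
    """Element-wise multiply and sum the kernel with every local patch."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # one value of the feature map
    return out

image = np.random.rand(7, 7)        # toy 7x7 input
kernel = np.random.rand(3, 3)       # 3x3 filter
print(conv2d(image, kernel).shape)  # (5, 5), since 7 - 3 + 1 = 5

The 2D array this returns is exactly the response map discussed next.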
Response map (feature map)
A feature map is a 2D map of the presence of a pattern at different locations in an input.
(Figure 5.3 in Deep Learning with Python by François Chollet.)
Different kernels identify different patterns: use multiple filters in each layer
The number of filters determines the number of output feature maps (multiple filters give multiple response maps).
Two key parameters in convolution
• Filter (kernel) size: the size of the patches extracted from the input.
• Number of filters: the depth (number of channels) of the output feature map.
Example: a 32×32×1 input (1 channel) passed through a convolution layer with 6 filters of size 5×5 produces a 6-channel output.
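In Keras these two parameters correspond directly to the filters and kernel_size arguments; a sketch reproducing the example above:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),           # 1-channel 32x32 input
    layers.Conv2D(filters=6, kernel_size=5),  # 6 filters of size 5x5
])
model.summary()  # output shape (28, 28, 6): 6 feature maps of size 32 - 5 + 1 = 28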
Convolution on multiple-channel input
The kernel has the same number of channels (depth) as the input. For an RGB input, each channel (R, G, B) is convolved with the corresponding kernel channel, giving one feature map per channel; the three feature maps are then summed element-wise into a single output channel. One kernel therefore produces one output channel, and using multiple kernels produces multiple output channels.
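A minimal NumPy sketch of one kernel applied to a multi-channel input (sizes are illustrative):

import numpy as np

def conv2d_multichannel(image, kernel):
    """One kernel (same depth as the input) -> one output channel."""
    H, W, C = image.shape                   # input: height x width x channels
    k = kernel.shape[0]                     # kernel: k x k x C
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k, :]
            # per-channel products, then a sum over all positions and channels
            out[i, j] = np.sum(patch * kernel)
    return out

rgb = np.random.rand(5, 5, 3)     # toy RGB input
kernel = np.random.rand(3, 3, 3)  # kernel depth matches the input depth
print(conv2d_multichannel(rgb, kernel).shape)  # (3, 3): a single channel

Stacking the outputs of several such kernels along the channel axis gives a multi-channel output.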
Advantage: learns translation-invariant patterns
Because the same kernel is slid over every position, a pattern learned in one location can be recognised anywhere in the input.
Advantage: weight sharing and sparse connections
Fully connected layer: each arrow is a separate weight (no sharing), and every output depends on every input.
Convolutional layer: the same kernel weights are reused at every location (sharing), and each output depends only on a small local window of the input (sparse connections).
(Figure 9.3 in Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.)
Advantage: learns hierarchical patterns
More layers give a larger receptive field: a larger window of the input is seen by units in deeper layers, so later layers can combine simple local patterns into more complex ones.
(Figure 5.2 in Deep Learning with Python by François Chollet; Figure 9.4 in Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.)
[Figures: visualisations of the patterns captured by successive convolutional layers, up to Layer 4 and Layer 5. Source: Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer, Cham, 2014.]
Convolution in 1D: for an input vector $x = (x_1, \dots, x_7)$ and kernel weights $(w_1, w_2, w_3)$, each output element is
$$y_i = \sum_{j=1}^{3} w_j \, x_{i+j-1},$$
where $x$ is the input vector and $y = (y_1, \dots, y_5)$ is the output vector. Note that the output size ≠ the input size: 7 inputs convolved with a size-3 kernel give only 5 outputs.
(Figure 9.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.)
Padding
Padding adds an appropriate number of rows and columns (typically zeros) on each side of the input feature map.
Without padding, the output shrinks: for input size N and kernel size k, the output size is N − k + 1. For example, a 7×7 input with a 3×3 kernel gives a 7 − 3 + 1 = 5, i.e. 5×5, output; conversely, padding a 5×5 input up to 7×7 lets a 3×3 kernel produce a 5×5 output, preserving the spatial size.
(Figure 5.6 in Deep Learning with Python by François Chollet.)
Stride
The stride is the distance between two successive windows. If the stride is larger than one, the output size is smaller.
With padding: output_size = ceiling(input_size / stride)
No padding: output_size = ceiling((input_size − kernel_size + 1) / stride)
Here ceiling(·) denotes the smallest integer ≥ its argument.
(Figure 5.5 in Deep Learning with Python by François Chollet.)
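These two formulas can be checked with a small helper (a sketch; the "same"/"valid" names follow the Keras convention introduced next):

import math

def conv_output_size(input_size, kernel_size, stride, padding):
    if padding == "same":  # padded: ceiling(input_size / stride)
        return math.ceil(input_size / stride)
    else:                  # "valid": no padding
        return math.ceil((input_size - kernel_size + 1) / stride)

print(conv_output_size(7, 3, stride=1, padding="valid"))  # 5 = 7 - 3 + 1
print(conv_output_size(7, 3, stride=2, padding="valid"))  # 3: stride > 1 shrinks the output
print(conv_output_size(7, 3, stride=2, padding="same"))   # 4 = ceiling(7 / 2)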
Convolutional layer
The key arguments of a Keras convolution layer:
• filters: the number of filters in the convolution.
• kernel_size: the height and width of the 2D convolution window.
• padding: one of "valid" (do not perform any padding) or "same" (pad so that the output size equals the input size).
• strides: the strides of the convolution along the height and width.
A sketch of the two padding modes follows below.
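The effect of the two padding modes in Keras (the input size is illustrative):

from tensorflow import keras
from tensorflow.keras import layers

x = keras.Input(shape=(32, 32, 1))
same = layers.Conv2D(6, kernel_size=5, strides=1, padding="same")(x)
valid = layers.Conv2D(6, kernel_size=5, strides=1, padding="valid")(x)
print(same.shape)   # (None, 32, 32, 6): "same" keeps the spatial size
print(valid.shape)  # (None, 28, 28, 6): "valid" shrinks it to 32 - 5 + 1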
Max-pooling layer
Max-pooling downsamples a feature map by taking the maximum value over each local patch.
Advantage: downsamples the feature map, reducing the computational burden
[Figure: a convolution over inputs x_1, …, x_7 produces the feature map (0.9, 0.7, 0.3, 1, 0.4, 0.8); max-pooling with window size 2 and stride 2 reduces it to (0.9, 1, 0.8).]
Advantage: increases the size of the receptive field (a larger window of the input is seen)
[Figure: in the same 1D example, each value after convolution depends on a small window of x_1, …, x_7, while each value after max-pooling summarises a larger window of the input.]
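A minimal NumPy sketch reproducing the 1D example above (window size 2, stride 2):

import numpy as np

def max_pool_1d(x, size=2, stride=2):
    # take the maximum over each successive window of the feature map
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

conv_out = np.array([0.9, 0.7, 0.3, 1.0, 0.4, 0.8])
print(max_pool_1d(conv_out))  # [0.9 1.  0.8]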
Other pooling method: average pooling
Average pooling takes the average value over each patch instead of the maximum.
[Figure: average pooling applied to a 2D feature map.]
Why max-pooling to downsample the feature map?
• Convolution with stride > 1: misses feature-presence information.
• Average pooling: dilutes feature-presence information.
Recall that a feature map is a 2D map of the presence of a pattern at different locations in an input; taking the maximum over each patch keeps the strongest evidence that the pattern is present there. (Figure 5.3 in Deep Learning with Python by François Chollet.)
Outline
• Batch size
• Other optimisation methods (optimisers)
  • Momentum
  • Adaptive gradient (AdaGrad)
  • Root mean square propagation (RMSProp)
  • Adaptive moment estimation (Adam)
• Activation function
• How to prevent overfitting
Gradient descent algorithm
• Randomly shuffle/split all training examples into $B$ batches
• Choose initial $\theta^{(0)}$
• For $t$ from 1 to $T$
  • For $b$ from 1 to $B$
    • Do a gradient descent update using the data from batch $b$
• Advantage of such an approach: computational feasibility for large datasets
Iterations over the entire dataset are called epochs.
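A minimal NumPy sketch of this loop (the per-batch gradient function grad and the learning rate eta are assumed to be supplied by the model; all names are illustrative):

import numpy as np

def batched_gradient_descent(theta, X, y, grad, eta=0.01, B=10, T=100):
    """grad(theta, X_b, y_b) is assumed to return the gradient on one batch."""
    N = len(X)
    for t in range(T):                        # each pass over the data is an epoch
        idx = np.random.permutation(N)        # randomly shuffle the examples
        for batch in np.array_split(idx, B):  # split into B batches
            theta = theta - eta * grad(theta, X[batch], y[batch])
    return theta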
Stochastic gradient descent: $B = N$
Choose initial guess $\theta^{(0)}$, $k = 0$ (here $\theta$ is the set of all weights from all layers)
For $t$ from 1 to $T$ (epochs)
  For $i$ from 1 to $N$ (training examples)
    Consider example $(x_i, y_i)$
    Update: $\theta^{(k+1)} = \theta^{(k)} - \eta \nabla L(\theta^{(k)})$, where $L$ is the loss on example $(x_i, y_i)$; $k \leftarrow k + 1$
Stochastic gradient descent (SGD): batch number = N (batch size = 1)
Quick update at each step, but:
• high variance in the gradients
• updates the model too often
[Figure: a noisy SGD trajectory on the error surface.]
Batch SGD: batch number = 1 (batch size = N)
Stable updates, but:
• not computationally feasible for large datasets
• takes a long time to move each step
[Figure: trajectory on the error surface.]
Mini-batch SGD: 1 < batch size < N
A compromise between the two extremes: updates are more stable than single-example SGD, while each step remains computationally feasible.
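In Keras the choice between these regimes is just the batch_size argument to fit; a sketch (assuming the model, x_train and y_train from the Keras example earlier):

from tensorflow import keras

# batch_size = 1      -> SGD: quick but noisy updates
# batch_size = N      -> batch gradient descent: stable but slow per step
# 1 < batch_size < N  -> mini-batch SGD: a compromise between the two
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy")
model.fit(x_train, y_train, batch_size=32, epochs=10)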