COMP9444
Neural Networks and Deep Learning
3b. Convolutional Networks

Outline

Convolutional Networks (7.9)
Softmax (6.2.2)
Convolution Operator (9.1-9.2)
Max Pooling (9.3-9.4)
Stride (9.5)
Textbook: Sections 6.2.2, 7.9, 9.1-9.5
Convolutional Networks

Suppose we want to classify an image as a bird, sunset, dog, cat, etc.

If we can identify features such as feather, eye, or beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.

We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.

Hubel and Wiesel – Visual Cortex

cells in the visual cortex respond to lines at different angles
cells in V2 respond to more sophisticated visual features

Convolutional Neural Networks are inspired by this neuroanatomy. CNNs can now be simulated with massive parallelism, using GPUs.
Convolutional Network Components

convolution layers: extract shift-invariant features from the previous layer
subsampling or pooling layers: combine the activations of multiple units from the previous layer into one unit
fully connected layers: collect spatially diffuse information
output layer: choose between classes

MNIST Handwritten Digit Examples

[figure: sample handwritten digits from the MNIST dataset]
CIFAR Image Examples

[figure: sample images from the CIFAR dataset]

Convolutional Network Architecture

[figure: typical convolutional network architecture]

There can be multiple steps of convolution followed by pooling, before reaching the fully connected layers.

Note how pooling reduces the size of the feature map (usually by half in each direction).
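As a concrete illustration of these components, here is a minimal PyTorch sketch of such an architecture. The class name, layer sizes and activation choices are my own (loosely following LeNet); they are not prescribed by the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)   # convolution layer: shift-invariant features
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)  # second convolution layer
        self.pool  = nn.MaxPool2d(2)                  # pooling: halves each spatial dimension
        self.fc1   = nn.Linear(16 * 5 * 5, 120)       # fully connected: collects spatially diffuse information
        self.fc2   = nn.Linear(120, num_classes)      # output layer: one unit per class

    def forward(self, x):                             # x: [batch, 1, 32, 32]
        x = self.pool(F.relu(self.conv1(x)))          # 32x32 -> 28x28 -> 14x14
        x = self.pool(F.relu(self.conv2(x)))          # 14x14 -> 10x10 -> 5x5
        x = x.view(x.size(0), -1)                     # flatten the feature maps
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)      # log-probabilities over the classes
```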
Softmax (6.2.2)

Consider a classification task with N classes, and assume z_j is the output of the unit corresponding to class j.

We assume the network's estimate of the probability of each class j is proportional to exp(z_j). Because the probabilities must add up to 1, we need to normalize by dividing by their sum:

    Prob(i) = exp(z_i) / ∑_{j=1}^{N} exp(z_j)

    log Prob(i) = z_i − log ∑_{j=1}^{N} exp(z_j)

If the correct class is i, we can treat −log Prob(i) as our cost function. The first term pushes up the correct class i, while the second term mainly pushes down the incorrect class j with the highest activation (if j ≠ i).

Convolution Operator

Continuous convolution:

    s(t) = (x ∗ w)(t) = ∫ x(a) w(t − a) da

Discrete convolution:

    s(t) = (x ∗ w)(t) = ∑_{a=−∞}^{∞} x(a) w(t − a)

Two-dimensional convolution:

    S(j,k) = (K ∗ I)(j,k) = ∑_m ∑_n K(m,n) I(j + m, k + n)

Note: Theoreticians sometimes write I(j − m, k − n) so that the operator is commutative. But, computationally, it is easier to write it with a plus sign.
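As a numerical check of the softmax formulas above, here is a small NumPy sketch (the function names are mine, not from the slides); it uses the usual max-subtraction trick for numerical stability, which does not change Prob(i).

```python
import numpy as np

def softmax(z):
    # Prob(i) = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, i):
    # -log Prob(i) = log(sum_j exp(z_j)) - z_i, computed in a stable way
    zmax = np.max(z)
    return (zmax + np.log(np.sum(np.exp(z - zmax)))) - z[i]

z = np.array([2.0, 1.0, -1.0, 0.5])   # outputs z_j for N = 4 classes
print(softmax(z))                     # probabilities summing to 1
print(cross_entropy(z, 0))            # cost when the correct class is i = 0
```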
Convolutional Neural Networks

[figure: an M × N filter applied across all L channels of the image]

Assume the original image is J × K, with L channels.

We apply an M × N "filter" to these inputs to compute one hidden unit in the convolution layer. In this example J = 6, K = 7, L = 3, M = 3, N = 3.

    Z^i_{j,k} = g( b^i + ∑_l ∑_{m=0}^{M−1} ∑_{n=0}^{N−1} K^i_{l,m,n} V^l_{j+m,k+n} )

The same weights are applied to the next M × N block of inputs, to compute the next hidden unit in the convolution layer ("weight sharing").
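A direct, loop-based NumPy sketch of this formula makes the weight sharing visible: the same filter K[i] is reused at every position (j, k). Array names and shapes are my own choice, and tanh stands in for the unspecified nonlinearity g.

```python
import numpy as np

def conv_layer(V, K, b, g=np.tanh):
    """V: inputs, shape (L, J, Kw)     -- L channels, J x Kw image
       K: filters, shape (I, L, M, N)  -- I filters, each L x M x N
       b: biases, shape (I,)
       Returns Z, shape (I, J+1-M, Kw+1-N)."""
    L, J, Kw = V.shape
    I, _, M, N = K.shape
    Z = np.zeros((I, J + 1 - M, Kw + 1 - N))
    for i in range(I):                   # each filter produces one feature map
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):  # the SAME weights K[i] are used at every (j, k)
                Z[i, j, k] = g(b[i] + np.sum(K[i] * V[:, j:j+M, k:k+N]))
    return Z

V = np.random.randn(3, 6, 7)             # L = 3 channels, J = 6, K = 7 (the example above)
K = np.random.randn(1, 3, 3, 3)          # one 3 x 3 filter over 3 channels
Z = conv_layer(V, K, np.zeros(1))
print(Z.shape)                           # (1, 4, 5) = (1, J+1-M, K+1-N)
```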
Convolutional Neural Networks

If the original image size is J × K and the filter is size M × N, the convolution layer will be (J + 1 − M) × (K + 1 − N).

Question: If there are 6 filters in this layer, compute the number of:
weights per neuron?
neurons?
connections?
independent parameters?

Example: LeNet

[figure: the LeNet architecture]

Max Pooling (9.3-9.4)

[figure: max pooling example]

Example: LeNet trained on MNIST

For example, in the first convolutional layer of LeNet, J = K = 32, M = N = 5. The width of the next layer is

    J + 1 − M = 32 + 1 − 5 = 28

The 5 × 5 window of the first convolution layer extracts from the original 32 × 32 image a 28 × 28 array of features. Subsampling then halves this size to 14 × 14. The second convolution layer uses another 5 × 5 window to extract a 10 × 10 array of features, which the second subsampling layer reduces to 5 × 5. These activations then pass through two fully connected layers into the 10 output units corresponding to the digits '0' to '9'.
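A quick way to check these sizes is to apply the (J + 1 − M) rule and the halving subsampling step layer by layer; a tiny Python sketch (the helper names are mine):

```python
def conv_size(j, m):
    # size of a convolution layer with no padding and stride 1
    return j + 1 - m

def subsample(j):
    # 2 x 2 subsampling halves each dimension
    return j // 2

width = 32                    # the 32 x 32 input image from the slides
width = conv_size(width, 5)   # first 5 x 5 convolution   -> 28
width = subsample(width)      # first subsampling         -> 14
width = conv_size(width, 5)   # second 5 x 5 convolution  -> 10
width = subsample(width)      # second subsampling        -> 5
print(width)                  # 5, matching the 5 x 5 feature maps above
```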
Convolution with Zero Padding

Sometimes, we treat the off-edge inputs as zero (or some other value). This is known as "Zero Padding".

[figure: convolution with zero padding at the image border]

With Zero Padding, the convolution layer is the same size as the original image (or the previous layer).

Example: AlexNet (2012)
5 convolutional layers + 3 fully connected layers
max pooling with overlapping stride
softmax with 1000 classes
2 parallel GPUs which interact only at certain layers
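For reference, a re-implementation of AlexNet ships with torchvision; the sketch below simply inspects it. Note this is a single-GPU version that follows the 2012 design in spirit but not in every detail, so it will not show the two-column GPU split described above.

```python
import torchvision

model = torchvision.models.alexnet()    # randomly initialized; no download needed
print(model.features)                   # the convolutional and max pooling layers
print(model.classifier)                 # the fully connected layers
print(sum(p.numel() for p in model.parameters()), "parameters")
```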
Stride (9.5)

Assume the original image is J × K, with L channels. We again apply an M × N filter, but this time with a "stride" of s > 1. In this example J = 7, K = 9, L = 3, M = 3, N = 3, s = 2.

[figure: a 3 × 3 filter applied with stride s = 2]

    Z^i_{j,k} = g( b^i + ∑_l ∑_{m=0}^{M−1} ∑_{n=0}^{N−1} K^i_{l,m,n} V^l_{j+m,k+n} )

The same formula is used, but j and k are now incremented by s each time.

The number of free parameters is 1 + L × M × N.
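In the earlier NumPy sketch, stride only changes the step of the output loops; this variant (again illustrative, with my own names) also checks the free-parameter count 1 + L × M × N for a single filter.

```python
import numpy as np

def conv_layer_strided(V, K, b, s=1, g=np.tanh):
    L, J, Kw = V.shape
    I, _, M, N = K.shape
    Z = np.zeros((I, 1 + (J - M) // s, 1 + (Kw - N) // s))
    for i in range(I):
        for a, j in enumerate(range(0, J - M + 1, s)):       # j = 0, s, 2s, ..., J - M
            for c, k in enumerate(range(0, Kw - N + 1, s)):   # k = 0, s, 2s, ..., K - N
                Z[i, a, c] = g(b[i] + np.sum(K[i] * V[:, j:j+M, k:k+N]))
    return Z

V = np.random.randn(3, 7, 9)              # J = 7, K = 9, L = 3 (the example above)
K = np.random.randn(1, 3, 3, 3)           # one filter: L = 3, M = N = 3
Z = conv_layer_strided(V, K, np.zeros(1), s=2)
print(Z.shape)                            # (1, 3, 4): 1+(7-3)//2 = 3, 1+(9-3)//2 = 4
print(1 + K[0].size)                      # 1 + L*M*N = 28 free parameters for this filter
```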
Stride Dimensions

j takes on the values 0, s, 2s, …, (J − M)
k takes on the values 0, s, 2s, …, (K − N)

The next layer is (1 + (J − M)/s) by (1 + (K − N)/s).

Stride with Zero Padding

When combined with zero padding of width P,

j takes on the values 0, s, 2s, …, (J + 2P − M)
k takes on the values 0, s, 2s, …, (K + 2P − N)

The next layer is (1 + (J + 2P − M)/s) by (1 + (K + 2P − N)/s).
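These size formulas are easy to wrap in a helper (name is mine); here it is checked against LeNet's first convolution from earlier, and against the AlexNet values used in the example below.

```python
def conv_output_size(j, m, s=1, p=0):
    # 1 + (J + 2P - M)/s; integer division floors when s does not divide evenly
    return 1 + (j + 2 * p - m) // s

print(conv_output_size(32, 5))              # 28: LeNet conv layer 1 (stride 1, no padding)
print(conv_output_size(224, 11, s=4, p=2))  # 55: AlexNet conv layer 1, as computed below
```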
Example: AlexNet Conv Layer 1

For example, in the first convolutional layer of AlexNet, J = K = 224, P = 2, M = N = 11, s = 4.

The width of the next layer is

    1 + (J + 2P − M)/s = 1 + (224 + 2×2 − 11)/4 = 55

Question: If there are 96 filters in this layer, compute the number of:
weights per neuron?
neurons?
connections?
independent parameters?

Overlapping Pooling

If the previous layer is J × K, and max pooling is applied with width F and stride s, the size of the next layer will be

    (1 + (J − F)/s) × (1 + (K − F)/s)

Question: If max pooling with width 3 and stride 2 is applied to the features of size 55 × 55 in the first convolutional layer of AlexNet, what is the size of the next layer?

Answer:

Question: How many independent parameters does this add to the model?

Answer:
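Independently of the questions above (which are left as exercises), the pooling size formula can be wrapped in the same style of helper; here it is checked against LeNet's 2 × 2 subsampling step seen earlier.

```python
def pool_output_size(j, f, s):
    # 1 + (J - F)/s; integer division floors when s does not divide evenly
    return 1 + (j - f) // s

print(pool_output_size(28, 2, 2))   # 14: LeNet's 2 x 2 subsampling of the 28 x 28 feature maps
```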
Convolutional Filters

[figure: features learned by the first, second and third convolutional layers]