4a: Convolution
Week 4: Overview
This week we will focus on image processing, with particular emphasis on object classification. We
will first describe convolutional neural networks, and then look at various methods which have been
introduced to allow progressively deeper networks to be trained by backpropagation, including
weight initialisation, batch normalisation, residual networks and dense networks. We will also discuss
how the content of one image can be combined with the style of another, using neural style transfer.
Weekly learning outcomes
By the end of this module, you will be able to:
describe convolutional networks, including convolution operator, stride, padding, max pooling
describe how to train image classification networks
identify the limitations of 2-layer neural networks
explain the problem of vanishing or exploding gradients, and how it can be solved using
techniques such as weight initialisation, batch normalisation, skip connections and dense blocks
describe the process of neural style transfer
Image Classification
Ever since Rosenblatt trained his first Perceptron in 1957, image classification has played a pivotal
role in driving technological advances in neural networks and deep learning. Three of the most
commonly used image classi�cation datasets are MNIST, CIFAR-10 and ImageNet.
MNIST Dataset
The MNIST dataset of handwritten digits was released in 1998, consisting of 60,000 training images
and 10,000 test images. The images are black and white, at a resolution of 28×28 pixels.
CIFAR-10 Dataset
The CIFAR-10 dataset, introduced in 2009, consists of 60,000 colour images at resolution 32×32, in
ten different categories.
ImageNet Dataset
The ImageNet Dataset consists of 1.2 million images in 1000 di�erent categories, and became the
basis for the ImageNet LSVRC competition.
In 2012 a major advance in prediction accuracy on the ImageNet dataset was achieved using an 8-layer
convolutional neural network called AlexNet. Subsequent enhancements allowed even deeper
networks to be trained and achieve even higher accuracy.
Convolutional Networks
Suppose we want to classify an image as a bird, sunset, dog, cat, etc. If we can identify features such
as feather, eye, or beak which provide useful information in one part of the image, then those
features are likely to also be relevant in another part of the image. We can exploit this regularity by
using a convolutional layer, which applies the same weights to different parts of the image.
[image source: LeCun, 1998]
Convolutional Neural Networks generally consist of the following components:
convolutional layers: extract shift-invariant features from the previous layer
subsampling or pooling layers: combine the activations from multiple units in the previous
layer into one unit
fully connected layers: collect spatially di�use information
output layer: choose between classes
There can be multiple steps of convolution followed by subsampling, before reaching the fully
connected layers. Note how subsampling reduces the size of the feature map (typically, by about half
in each direction). For example, in the LeNet architecture shown above, the 5×5 window of the first
convolution layer extracts from the original 32×32 image a 28×28 array of features. Subsampling
then halves this size to 14×14. The second convolution layer uses another 5×5 window to extract a
10×10 array of features, which the second subsampling layer reduces to 5×5. These activations then
pass through two fully connected layers into the 10 output units corresponding to the digits ’0’ to ’9’.
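The size arithmetic in this paragraph can be traced in a few lines of Python (a minimal sketch; the rule that each 5×5 convolution shrinks the map by 4 and each subsampling step halves it is read off the LeNet description above):

```python
# Trace the LeNet feature-map widths: 32 -> 28 -> 14 -> 10 -> 5.
width = 32
for step in ["conv 5x5", "subsample", "conv 5x5", "subsample"]:
    width = width - 4 if step.startswith("conv") else width // 2
    print(f"{step}: {width} x {width}")
```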
Visual cortex
The inspiration for convolutional neural networks comes in part from studies of the primary and
secondary visual cortex by Hubel and Wiesel. They found that cells in the primary visual cortex detect
low-level features such as a line at a specific angle, or a line at a specific angle moving in a particular
direction, while cells in the secondary visual cortex respond to more sophisticated visual features.
[image source: Hubel & Wiesel, 1959]
The advent of massively parallel General-Purpose Graphics Processing Units (GPGPUs) has made
it possible to simulate convolutional networks on a large scale.
References
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P., 1998. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Hubel, D.H. & Wiesel, T.N., 1959. Receptive fields of single neurones in the cat's striate cortex. Journal
of Physiology, 148(3), 574–591.
Convolution in Detail
Convolution operator
Convolutional layers are an adaptation of the convolution operator commonly used in mathematical
analysis as well as image and signal processing.
Continuous convolution
$$s(t) = (x * w)(t) = \int x(a)\, w(t-a)\, da$$

Discrete convolution

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t-a)$$
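As a quick sanity check, the discrete formula can be evaluated with NumPy's np.convolve, which flips the kernel exactly as the w(t − a) term requires (the array values below are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # signal x(a)
w = np.array([0.5, 0.25])            # kernel w(a)

# Full discrete convolution: s[t] = sum_a x[a] * w[t - a]
s = np.convolve(x, w)
print(s)                             # [0.5  1.25 2.   2.75 1.  ]
```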
Two-dimensional convolution
$$S(j,k) = (K * I)(j,k) = \sum_{m} \sum_{n} K(m,n)\, I(j+m,\, k+n)$$

Note: Theoreticians sometimes write $I(j-m,\,k-n)$ in the above formula so that the operator is
commutative. But, computationally, it is more convenient to write it with a plus sign.
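A direct loop implementation of the plus-sign version might look as follows; this is only a sketch (the name conv2d is our own), but it is the operation, sometimes called cross-correlation, that deep learning libraries actually compute:

```python
import numpy as np

def conv2d(K, I):
    # S(j,k) = sum_m sum_n K(m,n) * I(j+m, k+n), with no padding.
    M, N = K.shape
    out_rows = I.shape[0] - M + 1
    out_cols = I.shape[1] - N + 1
    S = np.zeros((out_rows, out_cols))
    for j in range(out_rows):
        for k in range(out_cols):
            S[j, k] = np.sum(K * I[j:j+M, k:k+N])
    return S

I = np.arange(16.0).reshape(4, 4)           # a toy 4x4 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])     # a 2x2 kernel
print(conv2d(K, I).shape)                   # (3, 3)
```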
Convolutional Network Layer
Assume the network is processing an image of size J × K, with L channels (for example, there
could be three channels corresponding to the colours Red, Green, Blue). The intensity in channel l of
the pixel at location (j, k) is $V^l_{j,k}$.

We apply an M × N filter to these inputs, in order to compute the activation $Z^i_{j,k}$ of one hidden unit
in filter i of the convolution layer. In this example J = 6, K = 7, L = 3, M = 3, N = 3.
$$Z^i_{j,k} = g\left( b^i + \sum_{l} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} K^i_{l,m,n}\, V^l_{j+m,\,k+n} \right)$$

The same weights $K^i_{l,m,n}$ and bias $b^i$ are applied to the next M × N block of inputs, to compute the
next hidden unit in the convolution layer, in a process known as weight sharing.

If the original image size is J × K and the filter size is M × N, the size of the convolution layer will
be (J + 1 − M) × (K + 1 − N).
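Here is a minimal sketch of this computation, assuming (as our own convention, not prescribed by the notes) that the input V is stored as an (L, J, K) array and the filters as an (I, L, M, N) array:

```python
import numpy as np

def conv_layer(V, K, b, g=lambda x: np.maximum(x, 0.0)):  # ReLU as g
    L, J, K_img = V.shape
    I, _, M, N = K.shape
    Z = np.zeros((I, J + 1 - M, K_img + 1 - N))
    for i in range(I):
        for j in range(J + 1 - M):
            for k in range(K_img + 1 - N):
                # Z[i,j,k] = g(b[i] + sum_{l,m,n} K[i,l,m,n] * V[l,j+m,k+n])
                Z[i, j, k] = g(b[i] + np.sum(K[i] * V[:, j:j+M, k:k+N]))
    return Z

V = np.random.rand(3, 6, 7)        # L=3, J=6, K=7 as in the example
K = np.random.rand(1, 3, 3, 3)     # one 3x3 filter over 3 channels
b = np.zeros(1)
print(conv_layer(V, K, b).shape)   # (1, 4, 5) = (6+1-3) x (7+1-3)
```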
Example: LeNet
For example, in the first convolutional layer of LeNet, J = K = 32 and M = N = 5.
The width of the next layer is
J + 1 − M = 32 + 1 − 5 = 28
Question: If there are 6 filters in this layer, and the network is applied to black-and-white images,
compute the number of:
weights per neuron
neurons
connections
independent parameters
Answers:
weights per neuron? 1 + 5 × 5 × 1 = 26
neurons? 28 × 28 × 6 = 4704
connections? 28 × 28 × 6 × 26 = 122304
independent parameters? 6 × 26 = 156
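The four counts follow a pattern which can be captured in a small helper function (our own, hypothetical); here it reproduces the LeNet numbers above:

```python
def conv_layer_counts(out_w, out_h, filters, M, N, L):
    weights_per_neuron = 1 + M * N * L            # +1 for the bias
    neurons = out_w * out_h * filters
    connections = neurons * weights_per_neuron
    parameters = filters * weights_per_neuron     # shared across positions
    return weights_per_neuron, neurons, connections, parameters

print(conv_layer_counts(28, 28, 6, 5, 5, 1))      # (26, 4704, 122304, 156)
```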
Further Reading
Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):
Convolutional Networks (7.9)
Convolution Operator (9.1-9.2)
Pooling and Padding
Max Pooling
One common form of subsampling is max pooling. The previous layer is divided into small groups
of units (typically, in a 2×2 grid) and the maximum activation among each small group of units is
copied to a single unit in the subsequent layer.
[image source]
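A minimal sketch of 2×2 max pooling on a single-channel feature map, assuming the input dimensions are divisible by 2:

```python
import numpy as np

def max_pool_2x2(A):
    J, K = A.shape
    # Group the activations into 2x2 blocks, then take each block's maximum.
    return A.reshape(J // 2, 2, K // 2, 2).max(axis=(1, 3))

A = np.array([[1., 2., 0., 1.],
              [3., 4., 2., 2.],
              [0., 1., 5., 0.],
              [1., 0., 0., 6.]])
print(max_pool_2x2(A))   # [[4. 2.]
                         #  [1. 6.]]
```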
Zero Padding
Sometimes, we expand the size of the input (or the previous layer) but treat the newly added units as
having a fixed value (typically, zero).
For example, with a 3×3 filter and one extra row of zero padding on each side, the subsequent layer
will be the same size as the previous layer (or the input).
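In NumPy this kind of padding is a one-liner; the sketch below pads a 5×5 layer with one row of zeros on each side, so a 3×3 filter would leave the output at 5×5:

```python
import numpy as np

A = np.ones((5, 5))
padded = np.pad(A, pad_width=1)   # pads with zeros by default
print(padded.shape)               # (7, 7); a 3x3 filter gives 7+1-3 = 5
```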
Further Reading
Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):
Max Pooling (9.3-9.4) and Stride (9.5)
Stride and Data Augmentation
Example: AlexNet
Here is a summary of AlexNet (Krizhevsky et al., 2012):
5 convolutional layers plus 3 fully connected layers.
Softmax with 1000 classes applied at the output layer.
Two GPUs, which interact only at certain layers.
Max pooling of 3×3 neighbourhoods with stride 2.
50% dropout at the fully connected layers.
Data augmentation by randomly cropping as well as transforms on RGB space.
Stride
Assume the original image is J × K, with L channels.
We again apply an M × N filter, but this time with a stride of s > 1.
In this example J = 7, K = 9, L = 3, M = 3, N = 3, s = 2.

$$Z^i_{j,k} = g\left( b^i + \sum_{l} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} K^i_{l,m,n}\, V^l_{j+m,\,k+n} \right)$$

The same formula is used, but j and k are now incremented by s each time, so activations are only
computed for values of j, k with j mod s = 0 and k mod s = 0. The number of free parameters for
each filter is 1 + L × M × N.
Stride Dimensions
j takes on the values 0, s, 2s, …, (J − M).
k takes on the values 0, s, 2s, …, (K − N).
The next layer is (1 + ⌊(J − M)/s⌋) by (1 + ⌊(K − N)/s⌋), rounding down when s does not divide evenly.
Stride with Zero Padding
When combined with zero padding of width P,
j takes on the values 0, s, 2s, …, (J + 2P − M).
k takes on the values 0, s, 2s, …, (K + 2P − N).
The size of the next layer is (1 + ⌊(J + 2P − M)/s⌋) by (1 + ⌊(K + 2P − N)/s⌋).
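This size formula, including the rounding down when the stride does not divide evenly, can be packaged as a small helper (the function name is our own):

```python
def conv_output_size(J, M, s=1, P=0):
    # 1 + floor((J + 2P - M) / s), using Python's floor division
    return 1 + (J + 2 * P - M) // s

print(conv_output_size(32, 5))               # LeNet layer 1: 28
print(conv_output_size(224, 11, s=4, P=2))   # AlexNet layer 1 (below): 55
```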
Example: AlexNet Conv Layer 1
In the first convolutional layer of AlexNet, J = K = 224, P = 2, M = N = 11, s = 4. The width
of the next layer is:
1 + ⌊(J + 2P − M)/s⌋ = 1 + ⌊(224 + 2 × 2 − 11)/4⌋ = 1 + ⌊54.25⌋ = 55.
Question: If there are 96 filters in this layer, compute the number of:
weights per neuron
neurons
connections
independent parameters
Answers:
weights per neuron? 1 + 11 × 11 × 3 = 364
neurons? 55 × 55 × 96 = 290,400
connections? 55 × 55 × 96 × 364 = 105,705,600
independent parameters? 96 × 364 = 34,944
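As with LeNet, the arithmetic is easy to verify directly (a few lines reproducing the answers above):

```python
w = 1 + 11 * 11 * 3          # weights per neuron, including bias
n = 55 * 55 * 96             # neurons in the layer
print(w, n, n * w, 96 * w)   # 364 290400 105705600 34944
```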
Overlapping Pooling
If the previous layer is J × K, and max pooling is applied with width F and stride s, the size of the
next layer will be:
(1 + (J − F)/s) × (1 + (K − F)/s).
For example, in the first convolutional layer of AlexNet, F = 3 and s = 2, thus reducing the size of
the layer from 55×55 to 27×27. Note that in this case the 3×3 neighbourhoods overlap, so that the
activation of a single unit in the previous layer could get copied to two distinct units in the following
layer.
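A sketch of overlapping max pooling with general width F and stride s, generalising the 2×2 version shown earlier (again for a single-channel map):

```python
import numpy as np

def max_pool(A, F, s):
    J, K = A.shape
    out = np.zeros((1 + (J - F) // s, 1 + (K - F) // s))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            # Neighbouring F x F windows overlap whenever s < F.
            out[j, k] = A[j*s:j*s+F, k*s:k*s+F].max()
    return out

A = np.random.rand(55, 55)
print(max_pool(A, F=3, s=2).shape)   # (27, 27)
```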
Convolutional Filters
Each unit in a convolutional layer can be visualised by constructing the sub-image that would cause
that unit to be most strongly activated. Generally, the units in the first convolutional layer respond to
lines and boundaries in a similar way to Gabor filters, or to the neurons in the primary visual cortex
studied by Hubel and Wiesel. As we move to deeper convolutional layers, the units extract
progressively more abstract, high-level features.
[Figure: features preferred by units in the first, second and third convolutional layers]
[image source]
Data Augmentation
patches of size 224 × 224 are randomly cropped from the original images
images can be reflected horizontally
also include changes in intensity of RGB channels
at test time, average the prediction of 10 different crops of each test image
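A minimal sketch of the cropping and reflection augmentations, assuming images are stored as (height, width, 3) NumPy arrays (the 256×256 input size is the one used in the AlexNet paper; the RGB intensity changes are omitted here):

```python
import numpy as np

def augment(img, crop=224):
    h, w, _ = img.shape
    # Random 224 x 224 crop from the full image.
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = img[top:top+crop, left:left+crop]
    # Random horizontal reflection with probability 0.5.
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]
    return patch

img = np.random.rand(256, 256, 3)   # a stand-in for a training image
print(augment(img).shape)           # (224, 224, 3)
```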
References
Krizhevsky, A., Sutskever, I., & Hinton, G.E., 2012. ImageNet classification with deep convolutional
neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
Exercise: Convolution
One of the early papers on Deep Q-Learning for Atari games (Mnih et al., 2013) contains this
description of its Convolutional Neural Network:
“The input to the neural network consists of an 84 × 84 × 4 image. The first hidden layer convolves
16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second
hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The
final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-
connected linear layer with a single output for each valid action. The number of valid actions varied
between 4 and 18 on the games we considered.”
For each layer in this network, compute the number of:
1. weights per neuron in this layer (including bias)
2. width and height of layer (only for convolutional layers)
3. neurons in this layer
4. connections into the neurons in this layer
5. independent parameters in this layer
You should assume the input images are grayscale, there is no padding, and there are 18 valid
actions (outputs).
Question 1
Compute each of the above for the first convolutional layer.
Question 2
Compute each of the above for the second convolutional layer.
Question 3
Compute each of the above for the fully connected layer.
Question 4
Compute each of the above for the output layer.
Quiz 4: Convolutional Networks
Question 1
Sketch the following activation functions, and write their formula: Sigmoid, Tanh, ReLU.
Question 2
Write the formula for the activation $Z^i_{j,k}$ of the node at location (j, k) in the i-th filter of a
Convolutional Neural Network which is connected by weights $K^i_{l,m,n}$ to all nodes in an M × N
window from the L filters (or channels) in the previous layer, assuming bias weights are included
and the activation function is g(). How many free parameters would there be in this layer?
Question 3
If the previous layer has size J × K, and a filter of size M × N is applied with stride s and zero-padding
of width P, what will be the size of the resulting convolutional layer?
Question 4
If max pooling with filter size F and stride s is applied to a layer of size J × K, what will be the size
of the resulting (downsampled) layer?
Question 5
Explain the concept of Data Augmentation, and how it was used in AlexNet.