Solution – Lab 7 – Neural Networks
Lab 7: Convolutional Neural Networks for Image Classification¶
Haiping Lu – COM4509/6509 MLAI2021 @ The University of Sheffield
Accompanying lectures: YouTube video lectures recorded in Year 2020/21.
Sources: This notebook is based on the CIFAR10 PyTorch tutorial, the CNN notebook from , and Lab 2 and Lab 3 of my SimplyDeep notebooks.
There are seven questions in this notebook.
Objective¶
To perform image classification using convolutional neural network in PyTorch.
Suggested reading:
Autograd tutorial
Convolutional neural network – Wikipedia
Feature/representation learning – Wikipedia
The rapid rise of deep learning began on 30 September 2012, when a convolutional neural network (CNN) called AlexNet achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up. This is considered a breakthrough and has grabbed the attention of an increasing number of researchers, practitioners, and the general public. Since then, deep learning has penetrated many research and application areas. AlexNet contained eight layers. In 2015, it was outperformed by a very deep CNN with over 100 layers from Microsoft in the ImageNet 2015 contest. It will be interesting to take a look at the image classification task and a CNN that can do the job well.
1. Review of Autograd: Automatic Differentiation¶
In the previous lab, we briefly covered Tensor and Computational Graph, and we have actually used Autograd already. Here, we cover the basics, in a condensed and modified version of the original PyTorch tutorial on Autograd.
Why is differentiation important?¶
Because it is a key procedure in optimisation: the process of learning/training aims to minimise a predefined loss, and differentiation tells us how to adjust the parameters to reduce that loss.
How is automatic differentiation done in PyTorch?¶
The PyTorch autograd package makes differentiation (almost) transparent to you by providing automatic differentiation for all operations on Tensors, unless you explicitly disable it (to save time and space).
A torch.Tensor type variable has an attribute .requires_grad. Setting this attribute to True tracks (but does not yet compute) all operations on it. After we define the forward pass, and hence the computational graph, we call .backward() and all the gradients will be computed automatically and accumulated into the .grad attribute.
This is made possible by the chain rule of differentiation.
How to stop automatic differentiation (e.g., because it is not needed)¶
Calling method .detach() of a tensor will detach it from the computation history. We can also wrap the code block in with torch.no_grad(): so all tensors in the block do not track the gradients, e.g., in the test/evaluation stage.
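As a quick illustration (a minimal sketch, not taken from the lab itself), the snippet below tracks gradients for a tensor, runs a backward pass, and then shows how torch.no_grad() and .detach() switch the tracking off:
import torch

x = torch.ones(2, 2, requires_grad=True)   # track all operations on x
y = (3 * x ** 2).sum()                      # the forward pass builds the computational graph
y.backward()                                # compute dy/dx via the chain rule
print(x.grad)                               # accumulated gradients: 6 * x, i.e. all entries are 6.0

with torch.no_grad():                       # no graph is built inside this block
    z = 3 * x ** 2
print(z.requires_grad)                      # False

w = x.detach()                              # detached from the computation history
print(w.requires_grad)                      # False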
Question 1¶
What is the benefit of stopping automatic differentiation when it is not needed?
Answer: It reduces memory usage and speeds up computation, since autograd no longer needs to store the intermediate values required for the backward pass.
Tensors are connected by Functions to build an acyclic computational graph that encodes a complete history of computation. The .grad_fn attribute of a tensor references the Function that created the Tensor, i.e., this Tensor is the output of its .grad_fn in the computational graph.
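For instance (an illustrative sketch, not from the original notebook):
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a * 2
print(b.grad_fn)   # e.g. <MulBackward0 ...>: b is the output of a multiplication
print(a.grad_fn)   # None: a was created by the user, not by an operation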
Learn more about autograd by referring to the documentation on autograd
2. Load the Image Data – CIFAR10¶
Libraries¶
Get ready by importing commonly used APIs
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
The CIFAR10 dataset has ten classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size 3x32x32, i.e., 3-channel colour images of 32x32 pixels.
Loading and normalizing CIFAR10¶
The outputs of torchvision datasets (after loading) are PILImage images of range [0, 1].
Check out the torchvision.transforms API here (search for ToTensor and Normalize).
transforms.ToTensor() converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0].
transforms.Normalize normalizes a tensor image with mean and standard deviation. Given mean $(M_1, \ldots, M_n)$ and std $(S_1, \ldots, S_n)$ for $n$ channels, this transform will normalize each channel of the input torch.*Tensor as $\text{input}[channel] = (\text{input}[channel] - \text{mean}[channel]) / \text{std}[channel]$.
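As a small sanity check of this formula (a sketch, not part of the original notebook; it reuses the torch and transforms imports above), normalizing with mean 0.5 and std 0.5 per channel maps values from [0, 1] to [-1, 1]:
norm = transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
img = torch.tensor([[[0.0]], [[0.5]], [[1.0]]])  # a tiny 3-channel "image" with one pixel per channel
print(norm(img).flatten())                       # tensor([-1., 0., 1.]): each channel is (value - 0.5) / 0.5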
torch.utils.data.DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset. See API here
We want to use more than one image at a time. That way, we can compute the average loss across a mini-batch of $n$ images, and take a step to optimize the average loss. The average loss across multiple training inputs is going to be less “noisy” than the loss for a single input, and is less likely to provide “bad information” because of a “bad” input. The number $n$ is called the batch size.
The actual batch size that we choose depends on many things. We want our batch
size to be large enough to not be too “noisy”, but not so large as to make each
iteration too expensive to run.
People often choose batch sizes of the form $n=2^k$ so that it is easy to halve
or double the batch size.
The way DataLoader works is that it randomly groups the training data into mini-batches
with the appropriate batch size. Each data point belongs to only one mini-batch. When there
are no more mini-batches left, the loop terminates.
In general, we may wish to train the network for longer. We may wish to use each training data
point more than once. In other words, we may wish to train a neural network for more than
one epoch. An epoch is one complete pass through the training data: all training data is used
once to update the parameters.
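For illustration (a sketch only, written against the trainloader defined in the next cell; the training step itself is only a placeholder here), training for multiple epochs simply means looping over the DataLoader once per epoch:
num_epochs = 2
for epoch in range(num_epochs):
    for images, labels in trainloader:    # each iteration yields one mini-batch of batchSize images
        pass                              # placeholder: a real training step (forward, loss, backward, update) would go here
    print('epoch', epoch, 'used', len(trainloader), 'mini-batches')   # 50000 / 4 = 12500 mini-batches per epoch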
batchSize=4
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
#Load the training data
trainset = datasets.CIFAR10(root='./data', train=True,
                            download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batchSize,
                                          shuffle=True, num_workers=2)
#Load the test data
testset = datasets.CIFAR10(root='./data', train=False,
                           download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batchSize,
                                         shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
print('Training set size:', len(trainset))
print('Test set size:', len(testset))
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data\cifar-10-python.tar.gz
100%|██████████████████████████████████████████████████████████████▉| 170426368/170498071 [01:29<00:00, 2629327.62it/s]
Extracting ./data\cifar-10-python.tar.gz to ./data
170500096it [01:40, 2629327.62it/s]
Files already downloaded and verified
Training set size: 50000
Test set size: 10000
Note that the data has been downloaded to the data directory. Because the file size is large, we do not upload it to GitHub.
Dataset CIFAR10
    Number of datapoints: 50000
    Root location: ./data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
           )
Dataset CIFAR10
    Number of datapoints: 10000
    Root location: ./data
    Split: Test
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
           )
View the images¶
# functions to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize back to range [0, 1]
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))  # rearrange dimensions to numpy format for display
    plt.show()

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)  # get one batch (4 images here)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(batchSize)))

truck  deer   car  ship
3. Define the Architecture of a Convolutional Neural Network¶
A typical CNN architecture:
Let us look at the CNN in detail.
Convolution layer - with a shared kernel/filter¶
The light blue grid (middle) is the input that we are given, e.g., a 5 pixel by 5 pixel greyscale image. The grey grid (left) is a convolutional kernel/filter of size $3 \times 3$, containing the parameters of this neural network layer. To compute the output, we superimpose the kernel on a region of the image. Let's start at the top left, in the dark blue region. The small numbers in the bottom right corner of each grid element correspond to the numbers in the kernel.
To compute the output at the corresponding location (top left), we "dot" the pixel intensities in the square region with the kernel. That is, we perform the computation:
(3 * 0 + 3 * 1 + 2 * 2) + (0 * 2 + 0 * 2 + 1 * 0) + (3 * 0 + 1 * 1 + 2 * 2)
12
The green grid (right) contains the output of this convolution layer. This output is also called an output feature map. The terms feature and activation are interchangeable. The output value on the top left of the green grid is consistent with the value we obtained by hand in Python.
To compute the next activation value (say, one to the right of the previous output), we will shift the superimposed kernel over by one pixel:
The dark blue region is moved to the right by one pixel. We again dot the pixel intensities in this region with the kernel to get another 12, and continue to get 17, ...
Question 2¶
Show how we get the value 19 in the output above (in cyan on the right of the figure in this section).
Answer: for this output position (middle row, rightmost column of the output feature map), the kernel overlaps the $3 \times 3$ region of the input shown below. Multiplying the two element-wise and summing gives
$ \left( \begin{array}{ccc} 1 & 3 & 1 \\ 2 & 2 & 3 \\ 0 & 2 & 2 \end{array} \right) \odot \left( \begin{array}{ccc} 0 & 1 & 2 \\ 2 & 2 & 0 \\ 0 & 1 & 2 \end{array} \right) \;\rightarrow\; 0 + 3 + 2 + 4 + 4 + 0 + 0 + 2 + 4 = 19 $
input_grid = np.array([[3,3,2,1,0],[0,0,1,3,1],[3,1,2,2,3],[2,0,0,2,2],[2,0,0,0,1]])
print('Input Grid:')
print(input_grid)
print('\nPortion of Input grid corresponding to right middle output feature map:')
print(input_grid[1:-1,2:])
kernel_grid = np.array([[0,1,2],[2,2,0],[0,1,2]])
print('\nKernel:')
print(kernel_grid)
output_rm = np.sum(np.multiply(input_grid[1:-1,2:], kernel_grid))
print('\nOutput of sum of element wise multiplication of input portion and kernel:')
print(output_rm)

Input Grid:
[[3 3 2 1 0]
 [0 0 1 3 1]
 [3 1 2 2 3]
 [2 0 0 2 2]
 [2 0 0 0 1]]

Portion of Input grid corresponding to right middle output feature map:
[[1 3 1]
 [2 2 3]
 [0 2 2]]

Kernel:
[[0 1 2]
 [2 2 0]
 [0 1 2]]

Output of sum of element wise multiplication of input portion and kernel:
19

Note the shrunken output: here we did not use zero padding (at the edges), so the output of this layer is shrunk by 1 on all sides. If the kernel size is $k=2m+1$, the output will be shrunk by $m$ on all sides, so the width and height are both reduced by $2m$.
Convolutions with Multiple Input/Output Channels¶
For a colour image, the kernel will be a 3-dimensional tensor. This kernel will move through the input features just like before, and we "dot" the pixel intensities with the kernel at each region, exactly like before. The size of the 3rd (colour) dimension is called the number of input channels or number of input feature maps.
We also want to detect multiple features, e.g., both horizontal edges and vertical edges. We would want to learn many convolutional filters on the same input. That is, we would want to make the same computation above using different kernels, like this:
Each circle on the right of the image represents the output of a different kernel dotted with the highlighted region on the right. So, the output feature is also a 3-dimensional tensor. The size of the new dimension is called the number of output channels or number of output feature maps. In the picture above, there are 5 output channels.
The Conv2d layer expects as input a tensor in the format "NCHW", meaning that the dimensions of the tensor should follow the order: batch size (N), number of channels (C), height (H), width (W).
Let us create a convolutional layer using nn.Conv2d:
myconv1 = nn.Conv2d(in_channels=3,   # number of input channels
                    out_channels=7,  # number of output channels
                    kernel_size=5)   # size of the kernel
# Emulate a batch of 32 colour images, each of size 128x128, like this:
x = torch.randn(32, 3, 128, 128)
y = myconv1(x)
y.shape

torch.Size([32, 7, 124, 124])

The output tensor is also in the "NCHW" format. We still have 32 images and 7 channels (consistent with out_channels of myconv1), and each feature map is of size 124x124.
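As a quick sanity check (a hedged sketch; conv_out_size is a hypothetical helper, not part of the lab), the output spatial size of a 2D convolution with stride 1 follows the standard formula below, which reproduces the 124 above and the padded case that follows:
def conv_out_size(n, kernel_size, padding=0, stride=1):
    # standard convolution output-size formula: floor((n + 2*padding - kernel_size) / stride) + 1
    return (n + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(128, 5))             # 124, matching myconv1 above (no padding)
print(conv_out_size(128, 5, padding=2))  # 128, matching the padded convolution below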
If we added the appropriate padding to the convolution, namely padding = $m$ (for a kernel size of $2m+1$), then our output width and height will be consistent with the input width and height:
myconv2 = nn.Conv2d(in_channels=3,
                    out_channels=7,
                    kernel_size=5,
                    padding=2)
x = torch.randn(32, 3, 128, 128)
y = myconv2(x)
y.shape

torch.Size([32, 7, 128, 128])

The parameters of Conv2d¶
conv_params = list(myconv2.parameters())
print("len(conv_params):", len(conv_params))
print("Filters:", conv_params[0].shape)  # 7 filters, each of size 3 x 5 x 5
print("Biases:", conv_params[1].shape)   # one bias per output channel

len(conv_params): 2
Filters: torch.Size([7, 3, 5, 5])
Biases: torch.Size([7])

Pooling Layers - Subsampling¶
A pooling layer can be created like this:
mypool = nn.MaxPool2d(kernel_size=2, stride=2)
y = myconv2(x)
z = mypool(y)
z.shape

torch.Size([32, 7, 64, 64])

Usually, the kernel size and the stride length will be equal, so each pixel is pooled only once. The pooling layer has no trainable parameters:
list(mypool.parameters())

[]

In Lab 6, we did not define a class for our linear regression NN. Here we do so and define a CNN class consisting of several layers, as defined below (from the official PyTorch tutorial).
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)  # 3: #input channels; 6: #output channels; 5: kernel size
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

myCNN = CNN()

__init__() defines the layers. forward() defines the forward pass that transforms the input to the output. backward() is automatically defined using autograd.
ReLU is the rectified linear unit, a popular activation function that performs a nonlinear transformation/mapping of an input variable (element-wise operation).
Conv2d() defines a convolution layer, as shown below, where blue maps indicate inputs and cyan maps indicate outputs. Convolution with no padding, no strides. More convolution layers are illustrated nicely at Convolution arithmetic.
This network CNN() defined above has two convolutional layers: conv1 and conv2.
The first convolutional layer conv1 requires an input with 3 channels, outputs 6 channels, and has a kernel size of 5x5. We are not adding any zero-padding.
The second convolutional layer conv2 requires an input with 6 channels (note this MUST match the output channel number of the previous layer), outputs 16 channels, and has a kernel size of (again) 5x5. We are not adding any zero-padding.
In the forward function we see that the convolution operations are always followed by the usual ReLU activation function and a pooling operation. The pooling operation used is max pooling, so each pooling operation reduces the width and height of the neurons in the layer by half.
Because we are not adding any zero padding, we end up with 16 * 5 * 5 hidden units after the second convolutional layer (16 matches the output channel number of conv2, 5 * 5 is based on the input dimension 32x32, see below). These units are then passed through three fully-connected layers, with the usual ReLU activation in between.
Notice that the number of channels grew in later convolutional layers!
However, the number of hidden units in each layer is still reduced because of the convolution and pooling operations:
Initial Image Size: $3 \times 32 \times 32$
After conv1: $6 \times 28 \times 28$ ($32 \times 32$ is reduced by 2 on each side)
After Pooling: $6 \times 14 \times 14$ (image size halved)
After conv2: $16 \times 10 \times 10$ ($14 \times 14$ is reduced by 2 on each side)
After Pooling: $16 \times 5 \times 5$ (halved)
After fc1: $120$
After fc2: $84$
After fc3: $10$ (= number of classes)
This pattern of doubling the number of channels with every pooling / strided convolution is common in modern convolutional architectures. It is used to avoid losing too much information within a single reduction in resolution.
Question 3¶
If the input image size is $3 \times 64 \times 64$, can we use the same CNN defined above? If yes, show the feature sizes after each operation as above. If no, how shall we modify the network architecture to process such $3 \times 64 \times 64$ images?
Answer:
Initial Image Size: $3 \times 64 \times 64$
After conv1: $6 \times 60 \times 60$ ($64 \times 64$ is reduced by 2 on each side)
After Pooling: $6 \times 30 \times 30$ (image size halved)
After conv2: $16 \times 26 \times 26$ ($30 \times 30$ is reduced by 2 on each side)
After Pooling: $16 \times 13 \times 13$ (halved)
Here, the input to the next linear layer, fc1, needs to be of size $16 \times 13 \times 13$, i.e. self.fc1 = nn.Linear(16 * 13 * 13, 120) and x = x.view(-1, 16 * 13 * 13). After making this change we get
After fc1: $120$
After fc2: $84$
After fc3: $10$ (= number of classes)
Inspect the NN architecture¶
Now let's take a look at the CNN built. Let us check the (randomly initialised) parameters of this NN. Below, we check the first 2D convolution.
params = list(myCNN.parameters())
print(len(params))
print(params[0].size())  # First Conv2d's .weight
print(params[1].size())  # First Conv2d's .bias
print(params[1])

10
torch.Size([6, 3, 5, 5])
torch.Size([6])
Parameter containing:
tensor([ 0.1106, -0.1091, -0.0352,  0.0279, -0.0254, -0.1146], requires_grad=True)

Question 4¶
From the above, we can see the length of params is 10, i.e. there are 10 sets of parameters. Set 0 is for the weights of conv1. Set 1 is the bias of conv1. What are the remaining 8 sets for?
Answer:
weights of conv1
bias of conv1
weights of conv2
bias of conv2
weights of fc1
bias of fc1
weights of fc2
bias of fc2
weights of fc3
bias of fc3

for i in params:
    print(i.size())

torch.Size([6, 3, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
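As a final cross-check (a minimal sketch, not part of the original lab), we can sum up these parameter tensors to get the total number of trainable parameters in myCNN:
total_params = sum(p.numel() for p in myCNN.parameters())    # numel() counts the entries of each parameter tensor
print('Total number of trainable parameters:', total_params)  # 62006 for the CNN defined above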