
Machine Learning and Data Mining in Business
Lecture 11: Convolutional Neural Networks
Discipline of Business Analytics


Lecture 11: Convolutional Neural Networks
Learning objectives
• Convolutional neural networks.
• Residual blocks.
• Techniques for image tasks.

Lecture 11: Convolutional Neural Networks
1. Computer vision
2. Convolutional layers
3. Padding and stride
4. Multiple input and output channels
5. Pooling
6. Convolutional neural networks
7. Techniques for image tasks

Computer vision

Computer vision
• This lecture introduces convolutional neural networks (CNNs), which are ubiquitous in the field of computer vision.
• To motivate this topic, let’s first look at some computer vision tasks and applications.

Image classification
Image credit: CIFAR-10 database

Classification with localisation

Object detection

Image segmentation

Comparing tasks

Image segmentation
Panoptic segmentation:

Image generation
These people don’t exist!
Image credit: StyleGAN

Image generation
Text prompt: “an astronaut riding a horse in a photorealistic style”.
Image credit: Dall-E 2 by OpenAI

Some applications
• Autonomous vehicles.
• Medical imaging.
• Manufacturing and construction.
• Optical character recognition.
• Facial recognition.
• Augmented reality.

Autonomous vehicles
Image credit: Tesla

Medical imaging
Image credit: V7 Labs

Optical character recognition
Image credit: Lakshmanan, Görner, and Gillard (2021)

Datasets: MNIST

Datasets: ImageNet
The ImageNet database has over 14 million annotated images. The well-known ImageNet competition had a training set of 1.2 million images belonging to a thousand classes.
Image credit: ImageNet

Datasets: CIFAR-100
The CIFAR-100 database is a collection of natural images from everyday life, with 100 different classes represented.
Image credit: ISL.

Deep learning

How are images stored in the computer?
We represent a 28 × 28 grayscale image as a 28 × 28 matrix of pixel values between 0 and 255.
Image credit: HS13 at .

In colour images, we have matrices of pixel intensities for the red, green and blue channels.
Image credit: W. Kang’s Dev Blog.

• In PyTorch, an image input is represented as a third-order tensor (multi-dimensional array) of dimensions C × H × W , where C is the number of channels, H is the height of the image, and W is the width.
• A batch of images is represented as a fourth-order tensor with dimensions N × C × H × W , where N is the batch size.
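A minimal sketch of these shape conventions in PyTorch (the image and batch sizes here are illustrative):

import torch

image = torch.randn(3, 28, 28)      # a single RGB image stored as C x H x W
batch = torch.randn(64, 3, 28, 28)  # a minibatch of 64 such images: N x C x H x W

print(image.shape)  # torch.Size([3, 28, 28])
print(batch.shape)  # torch.Size([64, 3, 28, 28])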

Convolutional layers

Why convolutions?
• Take the iPhone 13 as an example: its main camera has a resolution of 12 MP.
• The corresponding RGB image has 36 million elements.
• A single-layer perceptron with 100 units would then have 3.6 billion parameters.
• At 4 bytes per parameter, 3.6B parameters take about 14 GB of memory (checked in the sketch below).
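A back-of-the-envelope check of these figures (the 4 bytes per parameter assumes float32 storage):

pixels = 12_000_000      # 12 MP camera
inputs = 3 * pixels      # red, green and blue channels: 36 million elements
params = 100 * inputs    # dense layer with 100 units: 3.6 billion weights
print(params)            # 3600000000
print(params * 4 / 1e9)  # ~14.4 GB at 4 bytes per parameter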

Why convolutions?
• So far, we have focused on methods for tabular data. In tabular data applications, we typically don’t assume any structure a priori concerning how the features should interact.
• Images, on the other hand, have a spatial structure. Nearby pixels are typically related to each other.

• Translational invariance: the model should be able to recognise a pattern regardless of where it occurs in the image. Therefore, the initial layers should treat all patches of an image in the same way.
• Locality: the earliest layers of the network should focus on local regions. We can eventually aggregate these local representations to make predictions at the whole image level.

2d cross correlation
We compute the cross correlation between a 3 × 3 input and a 2 × 2 kernel or filter as follows:
0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25
Image credit: Dive into Deep Learning by Zhang et al (2021).
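A minimal implementation reproducing the example above (a sketch; the function name corr2d is our own):

import torch

def corr2d(X, K):
    """2d cross-correlation of input X with kernel K (no padding, stride 1)."""
    kh, kw = K.shape
    Y = torch.zeros(X.shape[0] - kh + 1, X.shape[1] - kw + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # elementwise product of the kernel with the current window, then sum
            Y[i, j] = (X[i:i + kh, j:j + kw] * K).sum()
    return Y

X = torch.tensor([[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]])
K = torch.tensor([[0., 1.], [2., 3.]])
print(corr2d(X, K))  # tensor([[19., 25.], [37., 43.]])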

2d cross correlation
w3 w4 x x x 789
􏰉(w1x1 + w2x2 + w3x4 + w4x5) (w1x4 + w2x5 + w3x7 + w4x8)
(w1x2 + w2x3 + w3x5 + w4x6)􏰊 (w1x5 + w2x6 + w3x8 + w4x9)
􏰉w w􏰊 1 2 3
x x x = 1 2 􏰨x4 x5 x6

Cross-correlating a 2d image with a 3 × 3 kernel produces a 2d response map:
Image credit: Probabilistic Machine Learning by Kevin P. Murphy (2021).

We can think of 2d cross-correlation as template matching.
Image credit: ISL.

Convolutional layer
• A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output.
• The two parameters of a convolutional layer are the kernel and the scalar bias.

Cross correlation as matrix-vector multiplication
$$
\begin{bmatrix}
w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 \\
0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 \\
0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 \\
0 & 0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix}
=
\begin{bmatrix}
w_1 x_1 + w_2 x_2 + w_3 x_4 + w_4 x_5 \\
w_1 x_2 + w_2 x_3 + w_3 x_5 + w_4 x_6 \\
w_1 x_4 + w_2 x_5 + w_3 x_7 + w_4 x_8 \\
w_1 x_5 + w_2 x_6 + w_3 x_8 + w_4 x_9
\end{bmatrix}
$$

Cross correlation as matrix-vector multiplication
• Thus, we can see that a CNN is like an MLP where the weight matrices have a special sparse structure and the elements are tied across locations.
• This implements the idea of translational invariance and massively reduces the number of parameters compared to a dense layer.
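As a quick check of this equivalence, the sketch below builds the weight-tied matrix above for the earlier 3 × 3 input and 2 × 2 kernel and multiplies it with the flattened input:

import torch

w1, w2, w3, w4 = 0., 1., 2., 3.  # the kernel from the earlier example
W = torch.tensor([
    [w1, w2, 0., w3, w4, 0., 0., 0., 0.],
    [0., w1, w2, 0., w3, w4, 0., 0., 0.],
    [0., 0., 0., w1, w2, 0., w3, w4, 0.],
    [0., 0., 0., 0., w1, w2, 0., w3, w4],
])
x = torch.arange(9.)  # the flattened 3 x 3 input x1, ..., x9 (here 0, ..., 8)
print(W @ x)          # tensor([19., 25., 37., 43.]): the 2 x 2 output, flattened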

Padding and stride

Output size
If we cross-correlate a 4 × 4 image with a 3 × 3 filter, that produces a 2 × 2 output:
Image credit: vdumoulin at GitHub

Output size
In general, cross-correlating an nh × nw image with a kh × kw kernel produces an output of size (nh − kh + 1) × (nw − kw + 1).

• An issue when applying convolutional layers is that we tend to lose pixels on the borders of the image. This will add up as we apply many successive convolutional layers.
• One solution is to adjust the size of the output by padding the borders of the image with zeros.

• If we add a total of ph rows of padding and a total of pw columns of padding, the output shape will be (nh − kh + ph + 1) × (nw − kw + pw + 1).
• In many cases, we will want to set ph = kh − 1 and pw = kw − 1 so that the output has the same shape as the input.
• We commonly use kernels with odd height and width values, such as 1, 3, 5, or 7. In this case, we can preserve dimensionality by padding with the same number of rows on top and bottom and the same number of columns on left and right.

• So far, we’ve been sliding the kernel over the image one element at a time. In this approach, neighbouring outputs will tend to be very similar in value, since their inputs overlap and nearby pixels tend to be similar.
• In practice, we can speed up the computations and downsample the output by skipping to every s-th element when sliding the kernel over the image.

Cross-correlation with a stride of 2 for both rows and columns:

• Given a stride of sh for the height and sw for the width, the output shape is
⌊(nh − kh + ph + sh)/sh⌋ × ⌊(nw − kw + pw + sw)/sw⌋,
where ⌊x⌋ denotes the floor of x, rounding down to the nearest integer.
• If ph = kh − 1 and pw = kw − 1, the output shape simplifies to ⌊(nh + sh − 1)/sh⌋ × ⌊(nw + sw − 1)/sw⌋.
• If the input height and width are divisible by the strides, the output shape is (nh/sh) × (nw/sw).

• Suppose that the input is 32 × 32 and the kernel is 5 × 5; each case below is verified in PyTorch after this list.
• With a stride of 1 and without padding, the output is 28 × 28.
• With a padding of 2 on all borders and a stride of 1, the output is 32 × 32.
• With a padding of 2 on all borders and a stride of 2, the output is 16 × 16.
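A sketch verifying these shapes with nn.Conv2d (the single channel is illustrative):

import torch
from torch import nn

x = torch.randn(1, 1, 32, 32)  # one single-channel 32 x 32 image

# stride 1, no padding: 28 x 28
print(nn.Conv2d(1, 1, kernel_size=5)(x).shape)
# padding of 2 on all borders, stride 1: 32 x 32
print(nn.Conv2d(1, 1, kernel_size=5, padding=2)(x).shape)
# padding of 2, stride 2: 16 x 16
print(nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)(x).shape)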

Multiple input and output channels

Multiple input channels
If there are multiple input channels, we use a different kernel for each channel and sum the results over channels:
(1 × 1 + 2 × 2 + 4 × 3 + 5 × 4) + (0 × 0 + 1 × 1 + 3 × 2 + 4 × 3) = 56

Multiple output channels
• We want to have multiple kernels in each layer, each generating an output channel.
• In this case, the layer takes an input with shape (Cin, Hin, Win) and returns an output with shape (Cout, Hout, Wout).

Multiple output channels
Different filters learned by the first layer of AlexNet:
Image credit: Zeiler and Fergus (2014).

1×1 convolution
A 1×1 convolution (that is, kh = kw = 1) takes a weighted combination of features at a given location, rather than across locations.
This is useful for changing the number of channels without changing the spatial dimensionality.
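For example, a small sketch using a 1 × 1 convolution to reduce the channel count (the sizes are illustrative):

import torch
from torch import nn

# reduce 64 channels to 16 while leaving the spatial dimensions unchanged
conv1x1 = nn.Conv2d(64, 16, kernel_size=1)
x = torch.randn(1, 64, 28, 28)
print(conv1x1(x).shape)  # torch.Size([1, 16, 28, 28])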

2d convolution in PyTorch
In PyTorch, 2d convolution is implemented by the nn.Conv2d layer. It takes an input with size (N, Cin, Hin, Win) and returns an output with size (N, Cout, Hout, Wout), where Hout and Wout depend on the input size, kernel size, padding and stride as before.
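A minimal usage sketch (the channel counts and kernel size are illustrative):

import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(8, 3, 32, 32)  # (N, Cin, Hin, Win)
y = conv(x)
print(y.shape)                 # torch.Size([8, 16, 32, 32]) = (N, Cout, Hout, Wout)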

Pooling layers
We use pooling layers for two reasons:
• As we process images, we want to gradually reduce the spatial resolution of our hidden representations to aggregate information.
• Convolutions preserve the information about the location of input features, a property known as equivariance. We need some degree of invariance to the location of the features.

Max pooling
Max pooling takes the maximum value in a sliding window:
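A small sketch with nn.MaxPool2d (a 2 × 2 window with stride 1, for illustration):

import torch
from torch import nn

x = torch.tensor([[[[0., 1., 2.],
                    [3., 4., 5.],
                    [6., 7., 8.]]]])  # shape (1, 1, 3, 3)
pool = nn.MaxPool2d(kernel_size=2, stride=1)
print(pool(x))  # tensor([[[[4., 5.], [7., 8.]]]])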

Average pooling
• In average pooling, we replace the maximum by the average.
• A global average pooling layer averages over all the locations in a feature map. That is, it converts a C × H × W feature map to a C × 1 × 1 feature map.
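A sketch of global average pooling via nn.AdaptiveAvgPool2d (the sizes are illustrative):

import torch
from torch import nn

x = torch.randn(1, 64, 7, 7)   # a 64 x 7 x 7 feature map (with batch dimension)
gap = nn.AdaptiveAvgPool2d(1)  # averages over all spatial locations
print(gap(x).shape)            # torch.Size([1, 64, 1, 1])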

Pooling layers
• Pooling layers have no learnable parameters.
• We can use padding and stride as in convolutional layers.
• We apply pooling for each input channel to obtain the corresponding output channel.

Convolutional neural networks

A simple CNN
A simple CNN for classifying images may look as follows:
Image credit: Probabilistic Machine Learning by Kevin P. Murphy (2021).

LeNet-5 is an early CNN developed for handwritten digit recognition:
Image credit: Dive into Deep Learning by Zhang et al (2021).

We can implement this model as follows:
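A sketch in PyTorch, following the Dive into Deep Learning presentation of the architecture (the sigmoid activations and average pooling match the original LeNet-5):

import torch
from torch import nn

# LeNet-5 for 28 x 28 grayscale inputs (e.g. MNIST)
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),     # 6 x 28 x 28 -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),     # 16 x 10 x 10 -> 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),                          # ten digit classes
)

x = torch.randn(1, 1, 28, 28)
print(lenet(x).shape)  # torch.Size([1, 10])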

The idea of ResNet is to replace $h_{l+1} = f_l(h_l)$ by
$$h_{l+1} = g(h_l + f_l(h_l)),$$
called a residual block, since $f_l$ only needs to learn the residual, or difference, between the input and output of this layer.
The use of residual blocks allows us to train very deep models, since the gradient can flow freely from the output to earlier layers via the skip connections.
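A minimal residual block sketch: here $f_l$ is two 3 × 3 convolutions with batch normalisation and $g$ is a ReLU, which are common but illustrative choices.

import torch
from torch import nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, h):
        f = F.relu(self.bn1(self.conv1(h)))
        f = self.bn2(self.conv2(f))
        return F.relu(h + f)  # h_{l+1} = g(h_l + f_l(h_l)), the skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])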

Batch normalisation
• A batch normalisation (BN) layer ensures that each of the units has a given sample mean and variance across the samples in a minibatch.
• More precisely, if $h_i$ denotes the vector of units for example $i$, the BN layer replaces it with $\tilde{h}_i$ computed as
$$\tilde{h}_i = \gamma \cdot \hat{h}_i + \beta, \qquad \hat{h}_i = \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}},$$
$$\mu_B = \frac{1}{|B|} \sum_{i \in B} h_i, \qquad \sigma_B^2 = \frac{1}{|B|} \sum_{i \in B} (h_i - \mu_B)^2,$$
where $\gamma$ and $\beta$ are learnable parameters and $B$ denotes the minibatch.
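In PyTorch this is provided by nn.BatchNorm2d, which keeps one γ and one β per channel (a small sketch; the sizes are illustrative):

import torch
from torch import nn

bn = nn.BatchNorm2d(16)        # one learnable gamma and beta per channel
x = torch.randn(8, 16, 32, 32)
y = bn(x)                      # normalised per channel over the minibatch
print(y.mean().item(), y.std().item())  # approximately 0 and 1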

Batch normalisation
• Batch normalisation turns out to be highly beneficial in terms of training speed and stability, especially for deep CNNs. It also acts as a regulariser.
• Together with residual blocks, batch normalisation is one of the key techniques that enabled practitioners to train neural networks with over 100 layers.
• Other types of normalisation layers appear in other contexts.

Techniques for image tasks

Data augmentation
• In data augmentation for images, we extend the training set by generating new training examples through a series of random changes to the original training images.
• Some popular augmentations are flipping, rotation, scaling, cropping, and changing colours.
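A sketch of such a pipeline using torchvision.transforms (the specific transforms and parameter values are illustrative choices, not prescribed by the slides):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                  # flipping
    transforms.RandomRotation(degrees=15),              # rotation
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)), # scaling and cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # colour changes
    transforms.ToTensor(),
])
# Applying `augment` to a PIL image returns a randomly transformed tensor.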

Data augmentation
Image credit: Albumentations

Mixup
Mixup is a data augmentation technique that works as follows, for each image:
1. Select an image from the dataset at random.
2. Pick a weight at random.
3. Take a weighted average of the selected image with your image. This is the input for the constructed example.
4. Take a weighted average of the selected image’s label with your image’s label, using the same weight. This is the output for the constructed example.

Image credit: Zhang et al (2017)
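A minimal sketch of this procedure (the Beta(α, α) weight follows Zhang et al (2017); the function name and default α are illustrative):

import torch

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples and their labels with a random Beta(alpha, alpha) weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2  # weighted average of the images
    y = lam * y1 + (1 - lam) * y2  # same weighted average of the (one-hot) labels
    return x, y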

Label smoothing
• Suppose that the response is one-hot encoded. In label smoothing, we replace all zeros by ε/C and all ones by 1 − ε + ε/C, where C is the number of classes and ε is a hyperparameter.
• This encourages the model to be less confident, making it more robust. This is especially helpful when there are mislabeled examples.
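A sketch of label smoothing applied to one-hot targets (the helper name is our own; recent PyTorch versions also expose a label_smoothing argument on nn.CrossEntropyLoss):

import torch

def smooth_labels(y_onehot, eps=0.1):
    """Zeros become eps/C and ones become 1 - eps + eps/C."""
    C = y_onehot.shape[-1]
    return y_onehot * (1 - eps) + eps / C

y = torch.tensor([0., 0., 1., 0.])
print(smooth_labels(y))  # tensor([0.0250, 0.0250, 0.9250, 0.0250])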

Progressive resizing
• In progressive resizing, we start training with small images and end training using large images.
• Starting with small images considerably speeds up training, while completing training with large images maximises the final accuracy.
