
Lecture 30: Convolutional networks (1)
CS 189 (CDSS offering)
2022/04/11


Today’s lecture
Today, we take a detailed look at the most widely used class of neural network models for image-based problems (computer vision)
Convolutional neural networks (conv nets) can be used for other applications
Conversely, other types of neural networks can be used for computer vision
However, more often than not, conv nets and computer vision go “hand in hand”
But first, a small motivation for conv nets…
Today and Wednesday, we will cover the motivation behind conv nets, detail the mathematical formulation, and cherry-pick the last decade of developments

DALL·E 2: generating images from text
https://openai.com/dall-e-2/
“A painting of a small brown terrier with round ears wearing a regal robe and crown”
(my dog, Peanut)

Fully connected layers for processing images?
[Diagram: x → linear layer → z(1) → nonlinearity → a(1) → linear layer → z(2) → nonlinearity → a(2)]
x is an image, e.g., 224 (height) × 224 (width) × 3 (RGB) = 150528 dims
Let’s make z(l) modest, e.g., 128 dims (in reality, this is probably too small)
Then, we have 150528 × 128 (∼ 20M) parameters in the first layer
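As a sanity check, here is a minimal sketch (assuming PyTorch, which the lecture does not specify) of the parameter count for this first fully connected layer:

```python
# A quick check of the ~20M figure: a linear layer from a flattened
# 224 x 224 x 3 image (150528 dims) to 128 hidden units.
import torch.nn as nn

layer = nn.Linear(224 * 224 * 3, 128)
print(sum(p.numel() for p in layer.parameters()))
# 150528 * 128 weights + 128 biases = 19267712, i.e., ~20M parameters
```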

The key idea behind conv nets
The key idea behind reducing the massive number of parameters is the
observation that many useful image features are local
E.g., edge information, used by many once-popular hand-designed features
We won’t go so far as to hand design the features, but we will place limits on the
features that can be learned via the architecture
This is an inductive bias: knowledge we build into the model to make it learn more effectively
You might be wondering: surely there is useful nonlocal information as well? More on this later…

“Locally connected” layers for processing images
Before, we had ∼ 20M parameters. How many parameters do we have now?
The filter consists of 4 tensors, each with 3 × 3 × 3 = 27 parameters, so we have 108 parameters; if we add a bias term for each output, 112 parameters
Wait, but, we haven’t yet processed the whole image!
[Diagram: a 3 (w) × 3 (h) filter applied to a local patch of the 224 (width) × 224 (height) image]
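A quick check of this count, as a sketch assuming PyTorch (out_channels=4 matches the 4 filter tensors above):

```python
# 4 filters, each 3 x 3 x 3, give 4 * 27 = 108 weights; one bias per
# output channel adds 4 more, for 112 parameters in total.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 112
```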

“Sliding” the filter along the image
We process the whole image using the same filter — so, in the end, we will still
have only 112 parameters
What will our output look like?
It actually looks quite like an “image” itself… interesting…
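To make the sliding concrete, here is a minimal sketch (assuming NumPy; slide_filter is a hypothetical helper, not from the lecture) of one filter sliding over the image with stride 1 and no padding:

```python
import numpy as np

def slide_filter(image, filt, bias=0.0):
    """Slide one [3, 3, 3] filter over a [H, W, 3] image; returns [H-2, W-2]."""
    H, W, _ = image.shape
    k = filt.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the local patch, plus the bias
            out[i, j] = np.sum(image[i:i+k, j:j+k, :] * filt) + bias
    return out

out = slide_filter(np.random.randn(224, 224, 3), np.random.randn(3, 3, 3))
print(out.shape)  # (222, 222) -- the output is itself image-like
```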

The convolution layer
The processing step we have described is referred to as convolution
Convolution is performed with a filter — a tensor with dimensions [K, K, O, I] (e.g., [3, 3, 4, 3]) — and an O-dimensional bias term
(2D) convolutions take in an input of size [I, H, W] (or [H, W, I], depending on the convention) and output a tensor of size [O, H′, W′]
What are H′ and W′? It depends on certain hyperparameter values
Because the output has similar dimensions, we can stack convolutions on top of
each other to make deep convolutional networks
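As a small illustration (assuming PyTorch, which stores the filter as [O, I, K, K] rather than the [K, K, O, I] convention above), the filter, bias, and output shapes for the running example look like:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(conv.weight.shape)  # torch.Size([4, 3, 3, 3]) -- O=4, I=3, K=3
print(conv.bias.shape)    # torch.Size([4])          -- the O-dimensional bias

x = torch.randn(1, 3, 224, 224)  # [N, I, H, W]
print(conv(x).shape)             # torch.Size([1, 4, 222, 222]) = [N, O, H', W']
```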

Stacking convolutions
Stacking convolutions increases the receptive field the deeper we go: e.g., two stacked 3 × 3 convolutions let each output see a 5 × 5 patch of the original input
[Diagram: x → convolution → convolution → …]

Determining H′ and W′
Two important hyperparameters determine the size of the convolution output
First, we can choose to pad the input by a certain number of “pixels” on all sides
Most common choice: pad with zeros (make sure to use normalization)
Second, we can choose the stride that the filter shifts by, i.e., how many “pixels” it moves over every time
For a K × K filter, we will have [H′, W′] = 1 + ([H, W] + 2 × pad − K)/stride
It is common to choose a stride of 1 and (in total) pad by the size of the filter minus 1 (e.g., 1 on all sides for a 3 × 3 filter) such that H′ = H and W′ = W
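A minimal sketch (assuming PyTorch) checking the formula with the common choice of stride 1 and padding 1 for a 3 × 3 filter:

```python
import torch
import torch.nn as nn

# H' = 1 + (H + 2 * pad - K) / stride = 1 + (224 + 2 - 3) / 1 = 224
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 4, 224, 224]) -- H' = H and W' = W
```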

Convolutional networks: attempt #1
Can we just stack convolutions on top of each other? What’s the issue with this?
Convolution is a linear operator!
[Diagram: x → convolution → z(1) → convolution → z(2)]
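To see why this is a problem, here is a minimal sketch (assuming NumPy/SciPy; this demo is not from the lecture) showing that two stacked convolutions with no nonlinearity in between collapse into a single convolution:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.randn(8, 8)
k1 = np.random.randn(3, 3)
k2 = np.random.randn(3, 3)

# Convolution is associative, so (x * k1) * k2 == x * (k1 * k2):
# stacking two convolutions is no more expressive than a single, bigger one.
two_layers = convolve2d(convolve2d(x, k1, mode="full"), k2, mode="full")
one_layer = convolve2d(x, convolve2d(k1, k2, mode="full"), mode="full")
print(np.allclose(two_layers, one_layer))  # True
```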

Introducing nonlinearities
Just like before, we will interleave our linear layer (convolution) with a nonlinearity, e.g., ReLU, applied element-wise to the output of the convolution
Sometimes, the term “convolution layer” is used to refer to the popular recipe of convolution → BN → ReLU (but it could also refer to just the convolution part)
Like input standardization for images, BN on inputs of shape [N, C, H, W] computes statistics on the N, H, and W dimensions, rather than just N
If using LN instead, we compute statistics on the C, H, and W dimensions
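A minimal sketch (assuming PyTorch; conv_bn_relu is an illustrative helper, not from the lecture) of this recipe:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k=3):
    return nn.Sequential(
        # bias=False since BN's learned shift makes the conv bias redundant
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),  # statistics over N, H, W; one mean/var per channel
        nn.ReLU(),               # element-wise nonlinearity
    )
```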

Pooling
Another common operation in convolutional networks is pooling, which reduces the size of the input and possibly the number of parameters later in the network
Pooling uses a window size (typically, 2 × 2) and a stride (typically, whatever the window size is) and slides over the input as specified by these hyperparameters
Max pooling “lets through” only the largest element — a nonlinear operation
Average pooling averages all the elements in the window — this is linear
The output of the pooling layer, with a 2 × 2 window size and stride of 2, will be one quarter the size of the input
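A quick check of the size reduction, assuming PyTorch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # stride defaults to the window size
x = torch.randn(1, 4, 224, 224)
print(pool(x).shape)  # torch.Size([1, 4, 112, 112]) -- one quarter the spatial size
```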

Convolutional networks: attempt #2
A simple convolutional network repeats the convolution → BN → ReLU recipe L times to process the input image into a representation a(L)
We flatten or pool a(L) into a one dimensional vector, pass it through one or more
linear layers, and then (for classification) get our final probabilities with softmax
[Diagram: x → conv layer → a(1) → conv layer → a(2) → … → conv layer → a(L) → linear layer]
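Putting the pieces together, here is a minimal sketch (assuming PyTorch; the class name, channel sizes, and depth are illustrative, not from the lecture) of attempt #2:

```python
import torch
import torch.nn as nn

class SimpleConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        chans = [3, 32, 64, 128]  # hypothetical channel sizes
        blocks = []
        for i, o in zip(chans[:-1], chans[1:]):
            # the convolution -> BN -> ReLU recipe, plus pooling to shrink H and W
            blocks += [nn.Conv2d(i, o, 3, padding=1, bias=False),
                       nn.BatchNorm2d(o), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(chans[-1], num_classes)

    def forward(self, x):
        a = self.features(x)    # a(L), shape [N, 128, H', W']
        a = a.mean(dim=(2, 3))  # global average pool into a one-dimensional vector
        return self.head(a)     # logits; apply softmax to get final probabilities

logits = SimpleConvNet()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```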
