
CSC 311: Introduction to Machine Learning
Lecture 6 – Neural Nets II
Roger Grosse, University of Toronto, Fall 2021


Training neural networks with backpropagation

Recap: Gradient Descent
Recall: gradient descent moves opposite the gradient (the direction of steepest descent)
Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — just higher dimensional and harder to visualize!
We want to define a loss L and compute the gradient of the cost dJ /dw, which is the vector of partial derivatives.
This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.
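To make the update rule concrete, here is a minimal gradient descent sketch; the toy quadratic cost, the learning rate, and the step count are made up for illustration.

import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    # Repeatedly step opposite the gradient of the cost.
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Toy cost J(w) = 0.5 * ||w - 3||^2, so dJ/dw = w - 3 and the minimum is at w = 3.
w_star = gradient_descent(lambda w: w - 3.0, w0=np.zeros(2))
print(w_star)   # approximately [3. 3.]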

Univariate Chain Rule
Let’s now look at how we compute gradients in neural networks. We’ve already been using the univariate Chain Rule.
Recall: if f(x) and x(t) are univariate functions, then d/dt f(x(t)) = (df/dx)(dx/dt).

Univariate Chain Rule
Recall: Univariate logistic least squares model
z = wx + b
y = σ(z)
L = ½(y − t)²

Let's compute the loss derivatives ∂L/∂w and ∂L/∂b.

Univariate Chain Rule
How you would have done it in calculus class
L = ½(σ(wx + b) − t)²

∂L/∂w = ∂/∂w [½(σ(wx + b) − t)²]
      = ½ ∂/∂w (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂w (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂w (wx + b)
      = (σ(wx + b) − t) σ′(wx + b) x

∂L/∂b = ∂/∂b [½(σ(wx + b) − t)²]
      = ½ ∂/∂b (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂b (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂b (wx + b)
      = (σ(wx + b) − t) σ′(wx + b)

What are the disadvantages of this approach?

Univariate Chain Rule
A more structured way to do it
Computing the loss:
z = wx + b
y = σ(z)
L = ½(y − t)²

Computing the derivatives:
dL/dy = y − t
dL/dz = (dL/dy)(dy/dz) = (dL/dy) σ′(z)
∂L/∂w = (dL/dz)(dz/dw) = (dL/dz) x
∂L/∂b = (dL/dz)(dz/db) = dL/dz
Remember, the goal isn’t to obtain closed-form solutions, but to be
able to write a program that efficiently computes the derivatives.
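As a minimal sketch, the structured version translates almost line for line into code; the function and variable names here simply mirror the equations above.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, t):
    # Forward pass
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass: reuse dL/dz for both parameters
    dL_dy = y - t
    dL_dz = dL_dy * sigma(z) * (1 - sigma(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
    dL_dw = dL_dz * x
    dL_db = dL_dz
    return L, dL_dw, dL_db

print(loss_and_grads(w=0.5, b=0.1, x=2.0, t=1.0))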

Univariate Chain Rule
We can diagram out the computations using a computation graph.
The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.
Computing the loss:
z = wx + b
y = σ(z)
L = ½(y − t)²

Univariate Chain Rule
A slightly more convenient notation:
Use ȳ to denote the derivative dL/dy, sometimes called the error signal.
This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).
Computing the loss:
z = wx + b
y = σ(z)
L = ½(y − t)²

Computing the derivatives:
ȳ = y − t
z̄ = ȳ σ′(z)
w̄ = z̄ x
b̄ = z̄

Multivariate Chain Rule
Problem: what if the computation graph has fan-out > 1? This requires the Multivariate Chain Rule!
L2-Regularized regression
z = wx + b
y = σ(z)
L = ½(y − t)²
R = ½w²
L_reg = L + λR

Softmax regression
z_ℓ = Σ_j w_ℓj x_j + b_ℓ
y_k = e^{z_k} / Σ_ℓ e^{z_ℓ}
L = −Σ_k t_k log y_k

Multivariate Chain Rule
Suppose we have a function f(x,y) and functions x(t) and y(t). (All the variables here are scalar-valued.) Then
d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
Example:
f(x, y) = y + e^{xy}
x(t) = cos t
y(t) = t²

Plug in to Chain Rule:
df/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
      = (y e^{xy}) · (−sin t) + (1 + x e^{xy}) · 2t
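One way to sanity-check a chain-rule calculation like this is to compare it with a finite difference; a minimal sketch of that check (the evaluation point t0 is arbitrary):

import numpy as np

def df_dt(t):
    # Analytic multivariate chain rule for f(x, y) = y + exp(xy), x = cos t, y = t^2.
    x, y = np.cos(t), t ** 2
    return (y * np.exp(x * y)) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * (2 * t)

def f_of_t(t):
    # f composed with x(t), y(t)
    return t ** 2 + np.exp(np.cos(t) * t ** 2)

t0, eps = 1.3, 1e-5
numeric = (f_of_t(t0 + eps) - f_of_t(t0 - eps)) / (2 * eps)
print(df_dt(t0), numeric)   # the two values should agree closely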

Multivariate Chain Rule
In the context of backpropagation:
In our notation:
t̄ = x̄ dx/dt + ȳ dy/dt

Backpropagation
Full backpropagation algorithm:
Let v₁, …, v_N be a topological ordering of the computation graph (i.e. parents come before children).
v_N denotes the variable we're trying to compute derivatives of (e.g. the loss).
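A minimal sketch of the resulting algorithm, assuming a hypothetical Node object that stores its parents and can return its local partial derivatives (this is an illustration, not any particular library's API):

def backprop(nodes):
    """nodes: list in topological order (parents before children).
    Each node has .parents and .partials(), which returns d(node)/d(parent)
    for each parent, evaluated at the values from the forward pass.
    Returns a dict mapping each node to its error signal dL/d(node)."""
    grads = {n: 0.0 for n in nodes}
    grads[nodes[-1]] = 1.0            # error signal of the loss node is 1
    for node in reversed(nodes):      # visit children before parents
        for parent, local_grad in zip(node.parents, node.partials()):
            grads[parent] += grads[node] * local_grad
    return grads

Each node's error signal accumulates contributions from all of its children, which is exactly the multivariate Chain Rule from the previous slide.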

Backpropagation
Example: univariate logistic least squares regression

Forward pass:
z = wx + b
y = σ(z)
L = ½(y − t)²
R = ½w²
L_reg = L + λR

Backward pass:
L̄_reg = 1
R̄ = L̄_reg dL_reg/dR = L̄_reg λ
L̄ = L̄_reg dL_reg/dL = L̄_reg
ȳ = L̄ dL/dy = L̄ (y − t)
z̄ = ȳ dy/dz = ȳ σ′(z)
w̄ = z̄ ∂z/∂w + R̄ dR/dw = z̄ x + R̄ w
b̄ = z̄ ∂z/∂b = z̄
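A minimal sketch of this backward pass as code, extending the earlier univariate example with the regularizer; the variable names mirror the error-signal notation.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_loss_and_grads(w, b, x, t, lam):
    # Forward pass
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    L_reg = L + lam * R
    # Backward pass: error signals, following the computation graph in reverse
    L_reg_bar = 1.0
    R_bar = L_reg_bar * lam
    L_bar = L_reg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * sigma(z) * (1 - sigma(z))
    w_bar = z_bar * x + R_bar * w      # w receives gradient from both z and R
    b_bar = z_bar
    return L_reg, w_bar, b_bar

print(reg_loss_and_grads(w=0.5, b=0.1, x=2.0, t=1.0, lam=0.01))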

Multilayer Perceptron (multiple outputs):

Forward pass:
z_i = Σ_j w_ij^(1) x_j + b_i^(1)
h_i = σ(z_i)
y_k = Σ_i w_ki^(2) h_i + b_k^(2)
L = ½ Σ_k (y_k − t_k)²

Backward pass:
ȳ_k = L̄ (y_k − t_k)
w̄_ki^(2) = ȳ_k h_i
b̄_k^(2) = ȳ_k
h̄_i = Σ_k ȳ_k w_ki^(2)
z̄_i = h̄_i σ′(z_i)
w̄_ij^(1) = z̄_i x_j
b̄_i^(1) = z̄_i

Backpropagation
In vectorized form:
Forward pass:
z = W^(1) x + b^(1)
h = σ(z)
y = W^(2) h + b^(2)
L = ½ ∥t − y∥²

Backward pass:
ȳ = L̄ (y − t)
W̄^(2) = ȳ h⊤
b̄^(2) = ȳ
h̄ = W^(2)⊤ ȳ
z̄ = h̄ ◦ σ′(z)
W̄^(1) = z̄ x⊤
b̄^(1) = z̄
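A minimal NumPy sketch of these vectorized equations for a single training example; the layer sizes and the random initialization in the usage lines are placeholders.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_grads(W1, b1, W2, b2, x, t):
    # Forward pass
    z = W1 @ x + b1
    h = sigma(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((t - y) ** 2)
    # Backward pass
    y_bar = y - t
    W2_bar = np.outer(y_bar, h)
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * sigma(z) * (1 - sigma(z))   # elementwise product with sigma'(z)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    return L, (W1_bar, b1_bar, W2_bar, b2_bar)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
L, grads = mlp_grads(W1, b1, W2, b2, x=np.array([1.0, -1.0]), t=np.array([0.0, 1.0]))
print(L)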

Computational Cost
Computational cost of forward pass: one add-multiply operation per weight
z_i = Σ_j w_ij^(1) x_j + b_i^(1)
Computational cost of backward pass: two add-multiply operations per weight
w̄_ki^(2) = ȳ_k h_i
h̄_i = Σ_k ȳ_k w_ki^(2)
Rule of thumb: the backward pass is about as expensive as two forward passes.
For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.

Backprop is the algorithm for efficiently computing gradients in neural nets.
Gradient descent with gradients computed via backprop is used to train the overwhelming majority of neural nets today.
Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients.
Despite its practical success, backprop is believed to be neurally implausible.

Pytorch, Tensorflow, et al. (Optional)
If we construct our networks out of a series of “primitive” operations (e.g., add, multiply) with specified routines for computing derivatives, backprop can be done in a completely mechanical, and automatic, way.
This is called autodifferentiation or just autodiff.
There are many autodiff libraries (e.g., PyTorch, Tensorflow, Jax, etc.)
Practically speaking, autodiff automates the backward pass for you — but it’s still important to know how things work under the hood.
In CSC413, you’ll learn more about how autodiff works and use an autodiff framework to build complex neural networks.
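For example, in PyTorch the backward pass for the univariate logistic least squares model reduces to a single call; a minimal sketch with arbitrary parameter values.

import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x, t = torch.tensor(2.0), torch.tensor(1.0)

y = torch.sigmoid(w * x + b)
L = 0.5 * (y - t) ** 2
L.backward()              # autodiff builds and traverses the computation graph for us
print(w.grad, b.grad)     # the same values the by-hand backward pass computes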

Convolutional Networks

What makes vision hard?
Vision needs to be robust to a lot of transformations or distortions:
change in pose/viewpoint
change in illumination
deformation
occlusion (some objects are hidden behind others)
Many object categories can vary wildly in appearance (e.g. chairs)
“Imagine a medical database in which the age of the patient sometimes hops to the input dimension which normally codes for weight!”

Suppose we want to train a network that takes a 200 × 200 RGB image as input.
[Diagram: the input image densely connected to 1000 hidden units.]
What is the problem with having this as the first layer?
Too many parameters! Input size = 200 × 200 × 3 = 120K.
Parameters = 120K × 1000 = 120 million.
What happens if the object in the image shifts a little?

The same sorts of features that are useful in analyzing one part of the image will probably be useful for analyzing other parts as well.
E.g., edges, corners, contours, object parts
We want a neural net architecture that lets us learn a set of feature
detectors that are applied at all image locations.

Convolution Layers
Fully connected layers:
Each hidden unit looks at the entire image.

Convolution Layers
Locally connected layers:
Each column of hidden units looks at a small region of the image.

Convolution Layers
Convolution layers:
Tied weights
Each column of hidden units looks at a small region of the image, and the weights are shared between all image locations.

Going Deeply Convolutional
Convolution layers can be stacked:
Tied weights

Convolution
We’ve already been vectorizing our computations by expressing them in terms of matrix and vector operations. Convolution is another useful high-level operation.
Let’s look at the 1-D case first. If a and b are two arrays, the convolution is defined as:
(a ∗ b)_t = Σ_τ a_τ b_{t−τ}.
Note: indexing conventions are inconsistent. We’ll explain them in each case.
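NumPy's np.convolve follows the same convention as this definition (it flips the second argument), so small cases can be checked directly; a minimal sketch with arbitrary arrays.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, -1.0])

# (a * b)_t = sum_tau a_tau * b_{t - tau}; "full" mode returns every nonzero output index.
print(np.convolve(a, b, mode="full"))   # [ 1.  1.  1. -3.]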

Convolution
Method 1: translate-and-scale

Convolution
Method 2: flip-and-filter

Convolution
Some properties of convolution:

Commutativity
a ∗ b = b ∗ a

Linearity
a ∗ (λ₁b + λ₂c) = λ₁ a ∗ b + λ₂ a ∗ c

2-D Convolution
2-D convolution is defined analogously to 1-D convolution. If A and B are two 2-D arrays, then:
(A ∗ B)_ij = Σ_s Σ_t A_st B_{i−s, j−t}.
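scipy.signal.convolve2d implements this (flip-and-filter) definition; a minimal sketch with an arbitrary small array and kernel.

import numpy as np
from scipy.signal import convolve2d

A = np.arange(9.0).reshape(3, 3)     # a tiny "image"
B = np.array([[1.0, -1.0]])          # a horizontal-difference kernel

# "same" mode crops the full convolution back to the shape of A.
print(convolve2d(A, B, mode="same"))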

2-D Convolution
Method 1: Translate-and-Scale

2-D Convolution
Method 2: Flip-and-Filter (note that when used as a neural net layer, the flipping step is often omitted)

2-D Convolution

The thing we convolve by is called a kernel, or filter.

[A series of example kernels was shown here as figures, each posing the question: what does this convolution kernel do?]
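To get a feel for what kernels do, you can convolve an image with a few standard ones; a minimal sketch using two generic examples (a box blur and a horizontal edge detector), not necessarily the kernels shown in lecture.

import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)                      # stand-in for a grayscale image

box_blur = np.ones((3, 3)) / 9.0                # averages each 3x3 neighbourhood
edge_h   = np.array([[ 1.0,  1.0,  1.0],
                     [ 0.0,  0.0,  0.0],
                     [-1.0, -1.0, -1.0]])       # responds to horizontal edges

blurred = convolve2d(img, box_blur, mode="same")
edges   = convolve2d(img, edge_h,   mode="same")
print(blurred.shape, edges.shape)               # both (8, 8)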

Convolutional networks
Let’s finally turn to convolutional networks. These have two kinds of layers: detection layers (or convolution layers), and pooling layers.
The convolution layer has a set of filters. Its output is a set of feature maps, each one obtained by convolving the image with a filter.

Convolutional networks

[Figure: first-layer filters learned by a convolutional network (Zeiler and Fergus, 2013, "Visualizing and Understanding Convolutional Networks").]

Convolutional networks
It's common to apply a linear rectification nonlinearity: y_i = max(z_i, 0). Why might we do this?

[Diagram: convolution → linear rectification, together forming a convolution layer.]

Convolution is a linear operation. Therefore, we need a nonlinearity; otherwise two convolution layers would be no more powerful than one.
Two edges in opposite directions shouldn't cancel.
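In code, linear rectification is just an elementwise maximum over each feature map; a minimal sketch with a made-up feature map.

import numpy as np

z = np.array([[-1.0,  2.0],
              [ 3.0, -4.0]])     # a (tiny) feature map produced by convolution
y = np.maximum(z, 0.0)           # linear rectification (ReLU): y_i = max(z_i, 0)
print(y)                         # [[0. 2.] [3. 0.]]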

Pooling layers
The other type of layer is a pooling layer. These layers reduce the size of the representation and build in invariance to small transformations.

Most commonly, we use max-pooling, which computes the maximum value of the units in a pooling group:

y_i = max_{j in pooling group} z_j
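A minimal sketch of 2 × 2 max-pooling on a single feature map, assuming the height and width are divisible by the pooling size.

import numpy as np

def max_pool(feature_map, size=2):
    # Partition the map into size x size blocks and keep the maximum of each block.
    H, W = feature_map.shape
    blocks = feature_map.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))   # [[ 5.  7.] [13. 15.]]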

Convolutional networks
[Diagram: convolution → linear rectification (convolution layer) → max pooling (pooling layer) → convolution → …]

Convolutional networks
Because of pooling, higher-layer filters can cover a larger region of the input than equal-sized filters in the lower layers.

Equivariance and Invariance
We said the network’s responses should be robust to translations of the input. But this can mean two different things.
Convolution layers are equivariant: if you translate the inputs, the outputs are translated by the same amount.
We’d like the network’s predictions to be invariant: if you translate the inputs, the prediction should not change.
Pooling layers provide invariance to small translations.
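A quick numerical illustration of equivariance, using a zero-padded 1-D signal so the shift does not push anything off the edge (the signal and filter are arbitrary).

import numpy as np

a = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0])   # signal padded with zeros
k = np.array([1.0, -1.0])                            # a small filter

shifted_then_filtered = np.convolve(np.roll(a, 1), k)
filtered_then_shifted = np.roll(np.convolve(a, k), 1)
print(np.allclose(shifted_then_filtered, filtered_then_shifted))   # True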

Convolution Layers
Each layer consists of several feature maps, or channels, each of which is an array.
If the input layer represents a grayscale image, it consists of one channel. If it represents a color image, it consists of three channels.
Each unit is connected to each unit within its receptive field in the previous layer. This includes all of the previous layer’s feature maps.
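In PyTorch, for example, a convolution layer maps all input channels to every output feature map; a minimal sketch with arbitrary layer sizes.

import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
x = torch.randn(1, 3, 32, 32)    # one 3-channel (RGB) 32x32 image
y = conv(x)
print(y.shape)                   # torch.Size([1, 16, 28, 28]) -- 16 feature maps

Each output unit's receptive field spans a 5 × 5 window across all 3 input channels.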

Here’s the LeNet architecture, which was applied to handwritten digit recognition on MNIST in 1998:
[Figure: the architecture of LeNet-5.]

AlexNet, essentially like LeNet but scaled up in every way (more layers, more units, more connections, etc.):
(Krizhevsky et al., 2012)
AlexNet's stunning performance on the ImageNet competition is what got everyone excited about deep learning in 2012.

[Figure 2 from Krizhevsky et al., 2012: the AlexNet architecture, split across two GPUs. The network's input is 150,528-dimensional, and the number of neurons in the remaining layers is 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.]

Classification
ImageNet results over the years. There are 1000 classes. Note that errors are top-5 errors (the network gets to make 5 guesses), so chance = 0.5%.
Year   Model                              Top-5 error
2010   Hand-designed descriptors + SVM    28.2%
2011   Compressed Fisher Vectors + SVM    25.8%
2012   AlexNet                            16.4%
2013   a variant of AlexNet               11.7%
2014   GoogLeNet                          6.6%
2015   deep residual nets                 4.5%

Human-level performance is around 5.1%.
They stopped running the object recognition competition because the performance is already so good.
