CSC 311: Introduction to Machine Learning
Lecture 4 – Linear Models II
Roger G., University of Toronto, Fall 2021

More about gradient descent
- Choosing a learning rate
- Stochastic gradient descent

Classification: predicting a discrete-valued target
- Binary classification (this week): predicting a binary-valued target
- Multiclass classification (next week): predicting a discrete-valued target

Setting the learning rate

Learning Rate (Step Size)
In gradient descent, the learning rate α is a hyperparameter we need to tune. Here are some things that can go wrong:
- α too small: slow progress
- α too large: oscillations
- α much too large: instability
Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (i.e. try 0.1, 0.03, 0.01, …).
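
As a concrete illustration of such a grid search, here is a minimal sketch in Python/NumPy. The quadratic cost, the number of iterations, and the specific grid are assumptions for illustration only; `grad_fn` and `cost_fn` are hypothetical stand-ins for a real model's gradient and cost.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha, num_iters=1000):
    """Plain (full-batch) gradient descent; returns the final parameters."""
    theta = theta0.copy()
    for _ in range(num_iters):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Hypothetical toy cost J(theta) = 0.5 * ||theta||^2, whose gradient is theta.
grad_fn = lambda theta: theta
cost_fn = lambda theta: 0.5 * np.sum(theta ** 2)

theta0 = np.array([5.0, -3.0])
for alpha in [0.1, 0.03, 0.01, 0.003, 0.001]:   # grid of candidate learning rates
    theta = gradient_descent(grad_fn, theta0, alpha)
    print(f"alpha={alpha:<6} final cost={cost_fn(theta):.6f}")
```

In practice you would compare the candidate learning rates on validation performance rather than on the training cost alone.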

Training Curves
To diagnose optimization problems, it’s useful to look at training curves: plot the training cost as a function of iteration.
Warning: in general, it’s very hard to tell from the training curves whether an optimizer has converged. They can reveal major problems, but they can’t guarantee convergence.
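
A minimal sketch of producing such a plot, assuming matplotlib is available. The `costs` list below is a fabricated decaying curve purely for illustration; in practice it would hold the training cost recorded at each iteration.

```python
import matplotlib.pyplot as plt

# In practice `costs` would be collected inside the training loop,
# e.g. costs.append(cost_fn(theta)) after each update; here we fake a curve.
costs = [1.0 * 0.95 ** t for t in range(200)]

plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("training cost")
plt.title("Training curve")
plt.show()
```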

Stochastic gradient descent

Stochastic Gradient Descent
So far, the cost function J has been the average loss over the training examples:
J(\theta) = \frac{1}{N} \sum_{i=1}^N L^{(i)} = \frac{1}{N} \sum_{i=1}^N L\big(y(x^{(i)}, \theta),\, t^{(i)}\big)
(Here θ denotes the parameters; e.g., in linear regression, θ = (w, b).)
By linearity,
\frac{\partial J}{\partial \theta} = \frac{1}{N} \sum_{i=1}^N \frac{\partial L^{(i)}}{\partial \theta}
Computing the gradient requires summing over all of the training examples. This is known as batch training.
Batch training is impractical if you have a large dataset (N ≫ 1, e.g. millions of training examples)!
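
For concreteness, here is a minimal sketch of one batch update for linear regression with the averaged squared-error loss (NumPy; the synthetic data and step size are assumptions for illustration only).

```python
import numpy as np

def batch_gradient(w, b, X, t):
    """Gradient of J(w, b) = (1/2N) * sum_i (y^(i) - t^(i))^2 with y = Xw + b."""
    N = X.shape[0]
    y = X @ w + b            # predictions for all N training examples
    err = y - t
    grad_w = X.T @ err / N   # averages over the entire training set
    grad_b = err.mean()
    return grad_w, grad_b

# One batch gradient descent update touches every training example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # hypothetical toy data
t = X @ np.ones(5) + 0.1 * rng.normal(size=1000)
w, b, alpha = np.zeros(5), 0.0, 0.1
grad_w, grad_b = batch_gradient(w, b, X, t)
w, b = w - alpha * grad_w, b - alpha * grad_b
```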

Stochastic Gradient Descent
Stochastic gradient descent (SGD) updates the parameters based on the gradient for a single training example:
1. Choose i uniformly at random.
2. Update θ ← θ − α ∂L^{(i)}/∂θ.
Cost of each SGD update is independent of N!
SGD can make significant progress before even seeing all the data!
Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:
\mathbb{E}\left[ \frac{\partial L^{(i)}}{\partial \theta} \right] = \frac{1}{N} \sum_{i=1}^N \frac{\partial L^{(i)}}{\partial \theta} = \frac{\partial J}{\partial \theta}
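
A minimal sketch of the single-example update, again for linear regression with squared-error loss (hypothetical toy data; each step costs O(D), independent of N).

```python
import numpy as np

def sgd_step(w, b, X, t, alpha, rng):
    """One SGD update: pick one example uniformly at random, step along its gradient."""
    i = rng.integers(X.shape[0])     # choose i uniformly at random
    err = X[i] @ w + b - t[i]        # y^(i) - t^(i)
    w = w - alpha * err * X[i]       # w <- w - alpha * dL^(i)/dw
    b = b - alpha * err              # b <- b - alpha * dL^(i)/db
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # hypothetical toy data
t = X @ np.ones(5)
w, b = np.zeros(5), 0.0
for _ in range(10_000):
    w, b = sgd_step(w, b, X, t, alpha=0.01, rng=rng)
```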

Stochastic Gradient Descent
Problems with using a single training example to estimate the gradient:
- Variance in the estimate may be high
- We can’t exploit efficient vectorized operations
Compromise approach:
- Compute the gradients on a randomly chosen medium-sized set of training examples M ⊂ {1, . . . , N}, called a mini-batch.
- For purposes of analysis, we often assume the examples in the mini-batch are sampled independently and uniformly with replacement.
- In practice, we typically permute the training set and then go through it sequentially (sketched below). Each pass over the data is called an epoch.
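
A minimal sketch of this permute-then-sweep pattern for the same linear regression model (hypothetical names and hyperparameter values; one epoch is one pass over the shuffled training set).

```python
import numpy as np

def minibatch_sgd(w, b, X, t, alpha=0.01, batch_size=100, num_epochs=10):
    """Mini-batch SGD: shuffle once per epoch, then sweep through the data."""
    N = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        perm = rng.permutation(N)                   # permute the training set
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]    # one mini-batch M
            err = X[idx] @ w + b - t[idx]
            w = w - alpha * X[idx].T @ err / len(idx)
            b = b - alpha * err.mean()
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))       # hypothetical toy data
t = X @ np.ones(5)
w, b = minibatch_sgd(np.zeros(5), 0.0, X, t)
```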

Stochastic Gradient Descent
Stochastic gradients computed on larger mini-batches have smaller variance. This is similar to bagging.
If the training examples are sampled independently, we can apply the linearity rule for variance.
\mathrm{Var}\left[ \frac{1}{|M|} \sum_{i \in M} \frac{\partial L^{(i)}}{\partial \theta_j} \right] = \frac{1}{|M|^2} \sum_{i \in M} \mathrm{Var}\left[ \frac{\partial L^{(i)}}{\partial \theta_j} \right] = \frac{1}{|M|} \mathrm{Var}\left[ \frac{\partial L^{(i)}}{\partial \theta_j} \right]
The mini-batch size |M| is a hyperparameter that needs to be set.
- Too large: requires more compute; e.g., it takes more memory to store the activations, and longer to compute each gradient update
- Too small: can’t exploit vectorization, has high variance
- A reasonable value might be |M| = 100.
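
The 1/|M| scaling of the variance can be checked empirically. In the sketch below, random numbers stand in for the per-example partial derivatives (an illustrative assumption, not real model gradients).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the per-example partial derivatives dL^(i)/d(theta_j).
per_example_grads = rng.normal(size=10_000)

for m in [1, 10, 100, 1000]:                       # candidate mini-batch sizes |M|
    estimates = [per_example_grads[rng.integers(0, 10_000, size=m)].mean()
                 for _ in range(2000)]
    print(f"|M| = {m:<5} variance of gradient estimate = {np.var(estimates):.5f}")
```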

Stochastic Gradient Descent
Batch gradient descent moves directly downhill (locally speaking).
SGD takes steps in a noisy direction, but moves downhill on average.
[Figure: trajectories of batch gradient descent vs. stochastic gradient descent]

SGD Learning Rate
In stochastic training, the learning rate also influences the amount of noise in the parameters resulting from the stochastic updates.
Typical strategy:
- Use a large learning rate early in training so you can get close to the optimum
- Gradually decay the learning rate to reduce the fluctuations
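
One simple way to implement such a decay is a schedule that shrinks the step size with the iteration count. The 1/t form below is an assumption for illustration, not the only choice.

```python
def decayed_learning_rate(t, alpha0=0.1, decay=0.01):
    """Large steps early in training, smaller steps later (hypothetical 1/t schedule)."""
    return alpha0 / (1.0 + decay * t)

for t in [0, 100, 1_000, 10_000]:
    print(f"iteration {t:>6}: learning rate {decayed_learning_rate(t):.4f}")
```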

Binary Linear Classification

Binary linear classification
classification: given a D-dimensional input x ∈ R^D, predict a discrete-valued target
binary: predict a binary target t ∈ {0, 1}
- Training examples with t = 1 are called positive examples, and training examples with t = 0 are called negative examples. Sorry.
- The choice of t ∈ {0, 1} vs. t ∈ {−1, +1} is for computational convenience.

linear: the model prediction y is a linear function of x, followed by a threshold r:

z = w^\top x + b

y = \begin{cases} 1 & \text{if } z \ge r \\ 0 & \text{if } z < r \end{cases}
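
A minimal sketch of this hard-threshold prediction rule (the weights, bias, and threshold below are arbitrary placeholder values, not part of the lecture).

```python
import numpy as np

def predict(x, w, b, r=0.0):
    """Binary linear classifier: threshold the linear score z = w^T x + b at r."""
    z = w @ x + b
    return 1 if z >= r else 0

w = np.array([1.0, -2.0])   # hypothetical weights
b = 0.5                     # hypothetical bias
print(predict(np.array([2.0, 0.5]), w, b))    # z = 1.5 >= 0  ->  1 (positive)
print(predict(np.array([-1.0, 1.0]), w, b))   # z = -2.5 < 0  ->  0 (negative)
```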