
Linear Regression

Liang Zheng
Australian National University
liang. .au

1

Incremental Learning (Z. Li and D. Hoiem, “Learning without forgetting,” TPAMI, 2017)

Problem setting:
A classifier is trained on classes 1,2,3;
I have new classes 4,5,6,7;
I want a new classifier that can recognize 1,2,…,7;
In training the new classifier, we do not have access
to the old data (classes 1, 2, 3), only the new data;
Question:
Can we learn new classes without forgetting old
classes?

2

Regression problem

Regression problem: observed noisy
function values from which we wish to
infer the underlying function that
generated the data.

Regression solution: possible function
that could have generated the data
(blue) with indication of the
measurement noise of the function
value at the corresponding inputs
(orange distributions)

• Regression is a fundamental problem in machine learning

3

Linear Regression from a statistical point of view

4

Problem Formulation

• Regression belongs to supervised learning

• Task. Find a function f: ℝ^D → ℝ such that y ≈ f(x; θ)
• Experience. Training set 𝒟 := {(x_1, y_1), …, (x_N, y_N)}, consisting of N inputs
x_n ∈ ℝ^D and corresponding observations/targets y_n ∈ ℝ, n = 1, …, N
• Performance. Estimated empirical risk on the test data:
R_emp(f, X_test, y_test)

5

Working Example

[Figure: working example showing the prediction error of a fitted function]

6

Loss Function for Training (review)

• Under the i.i.d. assumption, the empirical mean is a good estimate of the population mean.

• We can use the empirical mean of the loss on the training data

• Given a training set (x_1, y_1), …, (x_N, y_N), we use the notation of an example matrix

X := [x_1, …, x_N]ᵀ ∈ ℝ^{N×D}

and a label vector

y := [y_1, …, y_N]ᵀ ∈ ℝ^N

• The average loss is given by

R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n)

where ŷ_n = f(x_n, θ). The above equation is called the empirical risk. The
learning strategy is called empirical risk minimization.

7

Least Squares Linear Regression (review)
• We use the squared loss function

ℓ(y_n, ŷ_n) = (y_n − ŷ_n)²

• The empirical risk becomes

R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n) = (1/N) Σ_{n=1}^{N} (y_n − f(x_n, θ))²

• Using the linear predictor f(x_n, θ) = θᵀx_n, we obtain the empirical risk as

R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} (y_n − θᵀx_n)²

• This equation can be equivalently expressed in matrix form

ℒ(θ) := R_emp(f, X, y) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)ᵀ(y − Xθ)

8

Working Example

[Figure: working example showing the prediction error of the least-squares fit]

9

A closed-form analytic solution

• Loss function (mean squared error):

ℒ(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)ᵀ(y − Xθ)

• We calculate the gradient of ℒ with respect to the parameters θ as

dℒ/dθ = d/dθ [(1/N) (y − Xθ)ᵀ(y − Xθ)]
      = (1/N) d/dθ [yᵀy − θᵀXᵀy − yᵀXθ + θᵀXᵀXθ]
      = (1/N) d/dθ [yᵀy − 2yᵀXθ + θᵀXᵀXθ]
      = (1/N) (−2yᵀX + 2θᵀXᵀX) ∈ ℝ^{1×D}

• The minimum is attained when the gradient is zero:

dℒ/dθ = 0ᵀ ⇔ θᵀXᵀX = yᵀX
       ⇔ θᵀ = yᵀX (XᵀX)⁻¹
       ⇔ θ = (XᵀX)⁻¹ Xᵀy

• Matrix calculus identities used: ∂(aᵀx)/∂x = aᵀ and ∂(xᵀBx)/∂x = xᵀ(B + Bᵀ)
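A minimal NumPy sketch of this closed-form solution (the sizes, true parameters, and noise level below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: N = 100 points in D = 3 dimensions, generated from a
# known parameter vector plus Gaussian noise.
N, D = 100, 3
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + rng.normal(scale=0.1, size=N)

# Closed-form least squares: theta = (X^T X)^{-1} X^T y.
# Solving the normal equations with np.linalg.solve is numerically
# preferable to forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```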

10

Linear Regression from a probability point of view

11

9.1 Problem formulation

• A linear regression problem with likelihood function

p(y | x, θ) = 𝒩(y | xᵀθ, σ²)
⇔ y = xᵀθ + ε, ε ~ 𝒩(0, σ²)

• The likelihood in p(y | x, θ) = 𝒩(y | xᵀθ, σ²) is the
probability density function of y evaluated at xᵀθ.

• The only source of uncertainty originates from the
observation noise ε (we assume the noise variance σ² is known)

12

9.2 Parameter estimation

• Given a training set (x_1, y_1), …, (x_N, y_N), where x_n ∈ ℝ^D with corresponding
observations/targets y_n ∈ ℝ

• y_i and y_j are conditionally independent given their inputs x_i and x_j
• We use the notation of the set of training inputs

𝒳 := {x_1, …, x_N}

and the set of corresponding targets

𝒴 := {y_1, …, y_N}

• The likelihood factorizes:

p(𝒴 | 𝒳, θ) = p(y_1, …, y_N | x_1, …, x_N, θ) = Π_{n=1}^{N} p(y_n | x_n, θ) = Π_{n=1}^{N} 𝒩(y_n | x_nᵀθ, σ²)

13

9.2.1 Maximum Likelihood Estimation

• Intuitively, maximizing the likelihood means maximizing the predictive
distribution of the training data given the model parameters. We obtain the
maximum likelihood parameters as

θ_ML = arg max_θ p(𝒴 | 𝒳, θ) = arg max_θ Π_{n=1}^{N} p(y_n | x_n, θ)

• Instead of maximizing the likelihood directly, we apply the log-transformation
to the likelihood function and minimize the negative log-likelihood,

−log p(𝒴 | 𝒳, θ) = −log Π_{n=1}^{N} p(y_n | x_n, θ) = −Σ_{n=1}^{N} log p(y_n | x_n, θ)

• In the linear regression model y = xᵀθ + ε, ε ~ 𝒩(0, σ²), the likelihood is
Gaussian (due to the noise term ε), such that we arrive at

log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_nᵀθ)² + const

14

Maximum Likelihood Estimation

• Negative log-likelihood:

−log p(𝒴 | 𝒳, θ) = −log Π_{n=1}^{N} p(y_n | x_n, θ) = −Σ_{n=1}^{N} log p(y_n | x_n, θ)

• We have

log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_nᵀθ)² + const

• By substitution and discarding const terms, we have

ℒ(θ) := (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_nᵀθ)² = (1/(2σ²)) (y − Xθ)ᵀ(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²

where X is the example feature matrix

X := [x_1, …, x_N]ᵀ ∈ ℝ^{N×D}

and y the label vector

y := [y_1, …, y_N]ᵀ ∈ ℝ^N

15

Linear regression from Stats and Prob views
• From the Stats view:

ℒ(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)ᵀ(y − Xθ)

• From the Prob view:

ℒ(θ) = (1/(2σ²)) (y − Xθ)ᵀ(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²

• Both N and σ are known constants, so the two losses differ only by a constant
factor: they have the same minimizer and the same closed-form solution

θ = (XᵀX)⁻¹ Xᵀy

Issues.
1. Need XᵀX ∈ ℝ^{D×D} to be invertible
   • The feature vectors x_1, …, x_N must span ℝ^D
   • Must have more data than features, N ≥ D
   • Use regularization if XᵀX is not invertible (see the sketch below)
2. What if XᵀX ∈ ℝ^{D×D} is a large matrix?
   • It takes a long time to invert
   • Use stochastic gradient descent if XᵀX is too large
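A small sketch of issue 1 with assumed sizes: when N < D, XᵀX is rank-deficient, and an SVD-based solver such as np.linalg.lstsq still returns a (minimum-norm) least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fewer samples than features (N < D): X^T X is singular.
N, D = 5, 10
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

print(np.linalg.matrix_rank(X.T @ X))  # 5, i.e. less than D = 10

# lstsq uses an SVD, so it works even when X^T X cannot be inverted;
# it returns the minimum-norm solution among all least-squares solutions.
theta, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```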

16

Linear regression with features

• So far, we have discussed linear regression, which fits straight lines to data.

• However, straight lines are not sufficiently expressive when fitting more
complex data.

• Fortunately, “linear regression” only refers to “linear in the parameters”:
• we can perform an arbitrary nonlinear transformation φ(x) of the inputs x and
then linearly combine the components of this transformation.

p(y | x, θ) = 𝒩(y | φ(x)ᵀθ, σ²)
⇔ y = φ(x)ᵀθ + ε = Σ_{k=0}^{K−1} θ_k φ_k(x) + ε

where φ: ℝ^D → ℝ^K is a (nonlinear) transformation of the inputs x and
φ_k: ℝ^D → ℝ is the k-th component of the feature vector φ. Note that the model
parameters θ still appear only linearly

17

Example – Polynomial Regression

• We are concerned with a regression problem y = φ(x)ᵀθ + ε, where x ∈ ℝ
and θ ∈ ℝ^K. A transformation that is often used in this context is

φ(x) = [φ_0(x), φ_1(x), …, φ_{K−1}(x)]ᵀ = [1, x, x², x³, …, x^{K−1}]ᵀ ∈ ℝ^K

• We create a K-dimensional feature from a 1-dimensional input
• With these features, we can model polynomials of degree ≤ K − 1 within the
framework of linear regression: a polynomial of degree K − 1 is

f(x) = Σ_{k=0}^{K−1} θ_k x^k = φ(x)ᵀθ

where θ = [θ_0, …, θ_{K−1}]ᵀ ∈ ℝ^K contains the (linear) parameters.
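A one-line sketch of this polynomial feature map (the function name is mine, not from the slides):

```python
import numpy as np

def poly_features(x, K):
    # phi(x) = [1, x, x^2, ..., x^{K-1}] for a scalar input x
    return np.array([x**k for k in range(K)])

print(poly_features(2.0, K=4))  # [1. 2. 4. 8.]
```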

18

Linear regression with features
• We consider training inputs x_n ∈ ℝ^D and targets y_n ∈ ℝ, n = 1, …, N, and
define the feature matrix (design matrix) as

Φ := [φ(x_1)ᵀ; …; φ(x_N)ᵀ] =
⎡ φ_0(x_1)  ⋯  φ_{K−1}(x_1) ⎤
⎢ φ_0(x_2)  ⋯  φ_{K−1}(x_2) ⎥
⎢     ⋮              ⋮      ⎥
⎣ φ_0(x_N)  ⋯  φ_{K−1}(x_N) ⎦  ∈ ℝ^{N×K}

where Φ_ij = φ_j(x_i) and φ_j: ℝ^D → ℝ.
• With this feature matrix Φ, the negative log-likelihood for the linear regression
model can be written as

−log p(𝒴 | 𝒳, θ) = (1/(2σ²)) (y − Φθ)ᵀ(y − Φθ) + const.

19

Linear regression with features
• With this feature matrix Φ, the negative log-likelihood for the linear regression
model can be written as

−log p(𝒴 | 𝒳, θ) = (1/(2σ²)) (y − Φθ)ᵀ(y − Φθ) + const   (1)

• Recall we have

ℒ(θ) := (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_nᵀθ)² = (1/(2σ²)) (y − Xθ)ᵀ(y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²   (2)

where X is the matrix of the data points.
• Comparing (1) and (2), we immediately see that we just need to replace X with Φ.
• The solution to (2) was

θ = (XᵀX)⁻¹ Xᵀy

• We arrive immediately at the closed-form solution to the linear regression
with features problem:

θ = (ΦᵀΦ)⁻¹ Φᵀy

20

Example – Feature Matrix for Second-order Polynomials

• For a second-order polynomial and N training points x_n ∈ ℝ, n = 1, …, N, the
feature matrix is

Φ =
⎡ 1  x_1  x_1² ⎤
⎢ 1  x_2  x_2² ⎥
⎢ ⋮   ⋮    ⋮  ⎥
⎣ 1  x_N  x_N² ⎦

• Another example:

• The dataset consists of N = 20 data pairs (x_n, y_n), where x_n ~ 𝒰[−5, 5] and
y_n = −sin(x_n/5) + cos(x_n) + ε, with ε ~ 𝒩(0, 0.2²)

• In this figure, we fit a polynomial of degree 4 using maximum likelihood
estimation (see the sketch below).

21
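A sketch reproducing this example end to end (the RNG seed is arbitrary; the degree-4 fit corresponds to K = 5 features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset from the slide: N = 20, x ~ U(-5, 5),
# y = -sin(x/5) + cos(x) + eps, eps ~ N(0, 0.2^2).
N = 20
x = rng.uniform(-5, 5, size=N)
y = -np.sin(x / 5) + np.cos(x) + rng.normal(scale=0.2, size=N)

# Design matrix for a degree-4 polynomial: columns 1, x, x^2, x^3, x^4.
Phi = np.vander(x, 5, increasing=True)        # shape (N, 5)

# Maximum likelihood estimate: theta = (Phi^T Phi)^{-1} Phi^T y.
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Evaluate the fitted polynomial on a dense grid (e.g. for plotting).
x_grid = np.linspace(-5, 5, 100)
y_grid = np.vander(x_grid, 5, increasing=True) @ theta
```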

Check your understanding

• Linear regression belongs to supervised learning.

• If we use a polynomial feature expression, linear regression is no longer
linear.

• Linear regression is linear in that it uses a linear modeling of the input feature.
• If we have limited but very high-dimensional data, we will likely have
overfitting, but it is still possible to use the equation θ = (XᵀX)⁻¹Xᵀy to
calculate the optimal parameters.

• You are running a restaurant and want to use linear regression to estimate
your daily income. Your prior knowledge tells you that the income is cyclical
with different seasons. Which feature is best for your model?

• Polynomials of days
• Cosine function of days
• Exponential function of days
• Days

22

NeRF: Representing Scenes as Neural Radiance Fields for
View Synthesis. Mildenhall et al., ECCV 2020

9.2.2 Overfitting in Linear Regression

• Loss function (mean squared error):

ℒ(θ) = (1/N) ‖y − Φθ‖² = (1/N) Σ_{n=1}^{N} (y_n − φ(x_n)ᵀθ)²

• Root-mean-square error (RMSE):

√( (1/N) ‖y − Φθ‖² ) = √( (1/N) Σ_{n=1}^{N} (y_n − φ(x_n)ᵀθ)² )

• We have 𝑁 data pairs and want to fit a polynomial model to the data

• We want to determine the polynomial degree 𝑀 that yields a low training loss
and generalizes well (low test error).

24

• To determine the best value of 𝑀 (polynomial degree), we can perform a
brute-force search and enumerate all (reasonable) values of 𝑀.

• When M = 0, 1, the polynomials fit the data poorly

• When M = 3, 4, the fits look plausible and smoothly interpolate the data
• When M = 6, 9, the polynomials fit the data better and better. In the extreme
case of M = N − 1 = 9, the function passes through every single data point.
These high-degree polynomials oscillate wildly and are a poor representation
of the underlying function that generated the data, so we suffer from
overfitting.

[Figure: polynomial fits of varying degree M on N = 10 data points]

25

9.2.2 Overfitting in Linear Regression

• Evaluate whether a model generalizes well

A test set of 200 data points is generated by the same procedure as the training set.
For each choice of M, calculate the RMSE value.

26

• Evaluate whether a model generalizes well

• Initially the test error decreases
• For fourth-order polynomials, the test error is relatively low and stays
relatively constant up to degree 5

• From degree 6 onward the test error increases significantly, and high-order
polynomials have very bad generalization properties

• The training error never increases when the degree of the polynomial
increases

• The best generalization (smallest test error) is obtained for M = 4.

A test set of 200 data points is generated by the same procedure as the
training set; for each choice of M, calculate the RMSE value (see the
sketch below).
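A sketch of this train/test RMSE comparison, reusing the assumed data-generating procedure from the earlier example:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N):
    # Same generating procedure as the earlier example.
    x = rng.uniform(-5, 5, size=N)
    y = -np.sin(x / 5) + np.cos(x) + rng.normal(scale=0.2, size=N)
    return x, y

def rmse(x, y, theta):
    Phi = np.vander(x, len(theta), increasing=True)
    return np.sqrt(np.mean((y - Phi @ theta) ** 2))

x_train, y_train = make_data(10)   # N = 10 training points, as on the slides
x_test, y_test = make_data(200)    # 200 test points, same procedure

for M in range(10):                # polynomial degree M
    Phi = np.vander(x_train, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    print(f"M={M}: train {rmse(x_train, y_train, theta):.3f}, "
          f"test {rmse(x_test, y_test, theta):.3f}")
```

Training RMSE decreases monotonically with M, while test RMSE typically bottoms out around M = 4 and then grows.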

27

9.2.4 Regularized Least Squares

• Overfitting occurs because the model is too complex (θ has too many non-
zero entries) while there are too few training samples.

• We want to penalize the magnitude of the parameters by regularization

• In regularized least squares, we consider the loss function

ℒ_reg(θ) = (1/N) ‖y − Φθ‖² + λ ‖θ‖²

where λ ≥ 0 is the regularization parameter.

• First term: data-fit term (pressure to fit the data); second term: regularizer
(pressure to simplify the model)

• We can use any p-norm ‖·‖_p.
• A smaller p leads to sparser solutions, i.e., many parameter values θ_d = 0.

28

9.2.4 Regularized Least Squares

• Loss function (written here with X; the same derivation applies with Φ):

ℒ_reg(θ) = (1/N) ‖y − Xθ‖² + λ ‖θ‖²

• We calculate the gradient of ℒ_reg with respect to the parameters θ as

dℒ_reg/dθ = d/dθ [(1/N) (y − Xθ)ᵀ(y − Xθ) + λ θᵀθ]
          = (1/N) (−2yᵀX + 2θᵀXᵀX) + d/dθ [λ θᵀθ]
          = (1/N) (−2yᵀX + 2θᵀXᵀX) + 2λθᵀ ∈ ℝ^{1×D}

• The minimum is attained when the gradient is zero:

dℒ_reg/dθ = 0ᵀ ⇔ 2λθᵀ + (2/N) θᵀXᵀX = (2/N) yᵀX
           ⇔ θᵀ = yᵀX (NλI + XᵀX)⁻¹
           ⇔ θ = (NλI + XᵀX)⁻¹ Xᵀy

29

Matrix calculus identity used above: ∂(xᵀBx)/∂x = xᵀ(B + Bᵀ)
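A minimal sketch of this regularized closed form (sizes and λ are assumed for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Regularized least squares: theta = (N*lam*I + X^T X)^{-1} X^T y
    N, D = X.shape
    return np.linalg.solve(N * lam * np.eye(D) + X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # N < D, so plain X^T X would be singular
y = rng.normal(size=5)

# Adding N*lam*I makes the matrix positive definite for any lam > 0,
# so the solve succeeds even though X^T X is not invertible.
print(ridge_fit(X, y, lam=0.1))
```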

Effect of Regularization

[Figure: training error and test error vs. the regularization parameter λ; overfitting at small λ, underfitting at large λ]

30

7. Continuous Optimization – analytic solution

• The function below has a global minimum around x = −4.5,
where the function value is about −47

• There exists another local minimum around x = 0.7

• We can solve for all the stationary points of a function by calculating its
derivative and setting it to zero

ℓ(x) = x⁴ + 7x³ + 5x² − 17x + 3

31

7. Continuous Optimization – analytic solution

• Consider the following polynomial:

ℓ(x) = x⁴ + 7x³ + 5x² − 17x + 3

• We calculate the gradient

dℓ(x)/dx = 4x³ + 21x² + 10x − 17

• Setting this cubic to zero gives, in general, three solutions; here:
  • Two are minima (x ≈ −4.5, 0.7)
  • One is a maximum (x ≈ −1.4)

• To determine whether a stationary point is a minimum or a maximum, we
calculate the second derivative

d²ℓ(x)/dx² = 12x² + 42x + 10

• Substituting our visually estimated values x = −4.5, −1.4, 0.7:

  • x ≈ −1.4 is a maximum, because d²ℓ(x)/dx² < 0 there
  • x ≈ −4.5 and x ≈ 0.7 are minima, because d²ℓ(x)/dx² > 0 there

32

7. Continuous Optimization – Motivation for
gradient descent
• In general, we are unable to find analytic solutions

• We need to start at some value, e.g., x_0 = −10, and follow the negative gradient
• The gradient points in the direction of steepest ascent of f

• When we start from x_0 = −10, we know we should go right, but we don't know
how far we should go (the step size)

• When starting from x_0 > −1, we might find the local minimum instead.

33

7.1 Optimization Using Gradient Descent

• Given f: ℝ^D → ℝ, we consider the problem of solving for the minimum of f:

min_x f(x)

• Gradient descent is a first-order optimization algorithm.

• To find a local minimum of a function using gradient descent, one takes steps
proportional to the negative of the gradient of the function at the current point.

• Gradient descent exploits the fact that f(x_0) decreases fastest if one moves
from x_0 in the direction of the negative gradient −(∇f(x_0))ᵀ of f at x_0.

• Observation: if

x_1 = x_0 − γ (∇f(x_0))ᵀ

for a small step size γ ≥ 0, then f(x_1) ≤ f(x_0)

34

7.1 Optimization Using Gradient Descent

• A simple gradient descent algorithm:

• If we want to find a local optimum f(x*) of a function f: ℝ^D → ℝ, x ↦ f(x),
we start with an initial guess x_0 (the parameter we want to optimize), and then
iterate according to

x_{i+1} = x_i − γ_i (∇f(x_i))ᵀ

• For suitable step sizes γ_i, the sequence f(x_0) ≥ f(x_1) ≥ ⋯ converges to a
local minimum.
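A minimal sketch of this loop on the quartic from the earlier slides (the fixed step size and iteration count are assumptions for illustration, not tuned values):

```python
def grad(x):
    # derivative of l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3
    return 4 * x**3 + 21 * x**2 + 10 * x - 17

x = -10.0        # starting point used on the slides
gamma = 0.001    # small fixed step size (assumed)
for _ in range(1000):
    x = x - gamma * grad(x)
print(x)         # approaches the global minimum near x = -4.5
```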

35

Example – gradient descent

• Consider a quadratic function in two dimensions

f(x) = (1/2) xᵀ A x − bᵀ x,  where A = [[2, 1], [1, 20]], b = [5, 3]ᵀ, x = [x_1, x_2]ᵀ

with gradient

∇f(x) = xᵀ A − bᵀ

• Starting at the initial location x_0 = [−3, −1]ᵀ, we apply gradient descent
iteratively to obtain a sequence of estimates that converges to the minimum
value

• The negative gradient at x_0 points north and east,

• leading to x_1 = [−1.98, 1.21]ᵀ

• We then obtain x_2 = [−1.32, −0.42]ᵀ (see the sketch below)
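The printed iterates can be reproduced with a short sketch; the step size γ = 0.085 is an assumption that happens to match the numbers on the slide:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])

def grad(x):
    # (grad f(x))^T = A x - b for f(x) = 0.5 x^T A x - b^T x, A symmetric
    return A @ x - b

x = np.array([-3.0, -1.0])
gamma = 0.085  # assumed step size; reproduces the slide's iterates
for i in range(2):
    x = x - gamma * grad(x)
    print(i + 1, x)  # x1 = [-1.98, 1.21], x2 = [-1.32, -0.42]
```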

36

Example – Solving a Linear Equation System

• When we solve linear equations of the form Ax = b, in practice we aim to
approximately find x* that minimizes the squared error

ℒ(x) = ‖Ax − b‖² = (Ax − b)ᵀ(Ax − b)

• The gradient of ℒ(x) with respect to x is

∇_x ℒ = −2bᵀA + 2xᵀAᵀA

• We can use this gradient directly in a gradient descent algorithm

37

7.1.3 Stochastic Gradient Descent

• Drawback of gradient descent
• Computing the gradient can be very time consuming, why?
• Given N data points 1, 2, …, N, the loss function is a sum of losses ℒ_n incurred
by each example n. That is,

ℒ_N(θ) = (1/N) Σ_{n=1}^{N} ℒ_n(θ)

• Example (regression):

ℒ_N(θ) = (1/N) ‖y − Φθ‖² = (1/N) Σ_{n=1}^{N} (y_n − φ(x_n)ᵀθ)²

where x_n ∈ ℝ^D are the training samples, y_n are the labels, and θ are the
parameters of the linear regression model we want to optimize.
• In standard gradient descent, optimization is performed using the full training
set, by updating the vector of parameters according to

θ_{i+1} = θ_i − γ_i (∇ℒ_N(θ_i))ᵀ = θ_i − γ_i (1/N) Σ_{n=1}^{N} (∇ℒ_n(θ_i))ᵀ
38

7.1.3 Stochastic Gradient Descent

𝜽CQ” = 𝜽C − 𝛾C ∇ℒ 𝜽C
*
= 𝜽C − 𝛾C

1
𝑁
6
$/”

#

∇ℒ$ 𝜽 E

• Evaluating the sum gradient may require expensive evaluations of the
gradients from all individual functions ℒ$

• Stochastic Gradient Descent

• Estimate the gradient by averaging over a smaller minibatch (subset of the
training data).

∇ℒ 𝜽 ≈ ∇ℒR 𝜽 =
1
𝑀

6
$∈ℬ�

∇ℒ$ 𝜽 E

where 𝑀 ≪ 𝑁, and ℬR is a subset of the permutated indices of the samples in
the minibatch. ℬR = 𝑀.

• For example, the whole dataset has 𝑁 = 4 samples indexed with 𝑛 = 1,2,3,4.
The minibatch contains 𝑀 = 2 samples. ℬR could be 1, 3 , 2, 4 ,…
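A sketch of one way to draw such minibatches, permuting the indices once per epoch (the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def epoch_minibatches(N, M):
    # Permute the sample indices, then split them into chunks of size M
    # (assumes M divides N, purely to keep the sketch short).
    return rng.permutation(N).reshape(-1, M)

for batch in epoch_minibatches(N=4, M=2):
    print(batch)  # e.g. [1 3] then [2 0]; each index appears once per epoch
```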

39

Stochastic Gradient Descent
1. Initialize θ randomly.

2. Select a minibatch B_m of data from the training set at random, then update
   θ ← θ − γ_i ∇ℒ_M(θ)

3. Repeat Step 2 until convergence.

Gradient Descent
1. Initialize θ randomly.

2. Update, using all of the training set to calculate the gradient:
   θ ← θ − γ_i ∇ℒ_N(θ)

3. Repeat Step 2 until convergence.

40

7.1.1 Step size

• Choosing a good step-size (learning rate) is important in gradient descent

• If the step-size is too small, gradient descent can be slow
• If the step-size is chosen too large, gradient descent can overshoot, fail to
converge, or even diverge

• There are several heuristics to adapt the step size:

• When the function value increases after a gradient step, the step-size was too
large: undo the step and decrease the step-size

• When the function value decreases, the step could have been larger: try to
increase the step-size

• Typically, we choose a learning rate that starts big and ends small, e.g.,
γ_i = 1/(i + 1)

41

7.1.1 Step size

• There have been quite a few papers studying the step size/learning rate.

• [1] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
• [2] Large Batch Training of Convolutional Networks

• [3] A Closer Look at Deep Learning Heuristics: Learning Rate Restarts,
Warmup and Distillation

• [4] Deep Residual Learning for Image Recognition

• A recent practice in deep learning is to start training with a small learning rate,
switch to a large learning rate, and then decrease it gradually.

42

Impact of step size

[Figure: loss ℒ as a function of θ, with gradient steps shown for several different step sizes η]

43

7.1.2 Gradient Descent With Momentum

• The convergence of gradient descent may be very slow if the curvature of the
optimization surface is such that there are regions that are poorly scaled

• The proposed method to improve convergence is to give gradient descent
some memory

• Gradient descent with momentum is a method that introduces an additional
term to remember what happened in the previous iteration.

• This memory dampens oscillations and smooths out the gradient updates
• The idea is to have a gradient update with memory to implement a moving average

44

7.1.2 Gradient Descent With Momentum

• The momentum-based method remembers the update Δ^(i) at each iteration i
and determines the next update as a linear combination of the current and
previous gradients:

x_{i+1} = x_i − γ_i Δ^(i)

Δ^(i) = (1 − α) Δ^(i−1) + α (∇f(x_i))ᵀ

• where α ∈ [0, 1].

• Sometimes we only know the gradient approximately.

• In such cases, the momentum term is useful since it averages out different
noisy estimates of the gradient.
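A sketch of this update on the 2-D quadratic from the earlier example (γ and α are illustrative assumptions):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
grad = lambda x: A @ x - b   # (grad f(x))^T for f(x) = 0.5 x^T A x - b^T x

x = np.array([-3.0, -1.0])
delta = np.zeros(2)          # moving average of gradients
gamma, alpha = 0.05, 0.5     # assumed step size and momentum weight
for _ in range(100):
    delta = (1 - alpha) * delta + alpha * grad(x)
    x = x - gamma * delta
print(x)  # approaches the minimum A^{-1} b, roughly [2.49, 0.03]
```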

45

Linear regression with gradient descent

• Loss function:

ℒ(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)ᵀ(y − Xθ)

• Gradient:

∇ℒ_N = (1/N) (−2yᵀX + 2θᵀXᵀX)

• Gradient descent:

θ ← θ − γ_i ((2/N) θᵀXᵀX − (2/N) yᵀX)ᵀ

• Stochastic gradient descent (X and y now contain only the M minibatch samples):

θ ← θ − γ_i ((2/M) θᵀXᵀX − (2/M) yᵀX)ᵀ

• Why not use the exact solution θ = (XᵀX)⁻¹Xᵀy? (See the sketch below.)

• XᵀX ∈ ℝ^{D×D} could be a large matrix
• It takes a long time to invert
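A sketch putting both updates together on assumed synthetic data (sizes, step size, and iteration counts are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: N points, D features, known generating parameters.
N, D = 200, 3
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + rng.normal(scale=0.1, size=N)

# Full-batch gradient descent.
theta = np.zeros(D)
gamma = 0.1
for _ in range(500):
    theta = theta - gamma * (2.0 / N) * X.T @ (X @ theta - y)

# Minibatch SGD (M = 20), one permutation of the indices per epoch.
theta_sgd = np.zeros(D)
M = 20
for epoch in range(50):
    for batch in rng.permutation(N).reshape(-1, M):
        Xb, yb = X[batch], y[batch]
        theta_sgd = theta_sgd - gamma * (2.0 / M) * Xb.T @ (Xb @ theta_sgd - yb)

# Both approach the closed-form solution (X^T X)^{-1} X^T y.
print(theta, theta_sgd, np.linalg.solve(X.T @ X, X.T @ y), sep="\n")
```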

46

Regularized Linear regression with gradient descent

• Loss function:

ℒ_reg(θ) = (1/N) ‖y − Xθ‖² + λ ‖θ‖²

• Gradient:

∇ℒ_N = (1/N) (−2yᵀX + 2θᵀXᵀX) + 2λθᵀ

• Gradient descent:

θ ← (1 − 2γ_i λ) θ − γ_i ((2/N) θᵀXᵀX − (2/N) yᵀX)ᵀ

• Stochastic gradient descent (with a minibatch of size M):

θ ← (1 − 2γ_i λ) θ − γ_i ((2/M) θᵀXᵀX − (2/M) yᵀX)ᵀ

47

Check your understanding

• When you increase the step size, the training time will always be shortened.

• The step size is a hyperparameter.
• “Using a small step size all the time” vs. “first use a big step size, then use a
small step size”: both methods give you similar optimization results when
starting from the same random seed.

• We can use gradient descent to solve k-means.

• SGD gives you a more precise optimization result than standard GD.
• In both GD and SGD, the overall loss function will always decrease with each
iteration.

• In linear regression, the training error (1/N) ‖y − Φθ‖² is usually greater when
using regularization than when not using regularization.

• In ℒ_reg(θ) = (1/N) ‖y − Φθ‖² + λ ‖θ‖², we can instead put λ in front of
(1/N) ‖y − Φθ‖², which will give us similar optimization results.
48