Linear Regression
Liang Zheng Australian National University liang.zheng@anu.edu.au
Analyzing Worldwide Social Distancing through Large-Scale Computer Vision, Ghodgaonkar et al., arXiv 2020
Live data stream crawling: 30,261 discovered cameras distributed by geographic location.
Example: vehicle counts over time at a park in Schwarzsee, Switzerland, on May 22, 2020.
Regression problem
• Regression is a fundamental problem in machine learning
Regression problem: observed noisy function values, from which we wish to infer the underlying function that generated the data.
Regression solution: a possible function that could have generated the data (blue), with an indication of the measurement noise of the function values at the corresponding inputs (orange distributions).
Linear Regression from a statistical point of view
Problem Formulation
• Regression belongs to supervised learning
• Task. Find a function f: R^D → R such that y ≈ f(x; θ)
• Experience. Training set D := {(x_1, y_1), ..., (x_N, y_N)}, consisting of N inputs x_n ∈ R^D and corresponding observations/targets y_n ∈ R, n = 1, ..., N
• Performance. Estimated empirical risk on the test data, R_emp(f, X_test, y_test)
Working Example
Prediction Error
Loss Function for Training (review)
• Under the i.i.d. assumption, the empirical mean is a good estimate of the population mean.
• We can use the empirical mean of the loss on the training data
• Given a training set {(x_1, y_1), ..., (x_N, y_N)}, we use the notation of an example matrix
  X := [x_1, ..., x_N]^T ∈ R^{N×D}
  and a label vector
  y := [y_1, ..., y_N]^T ∈ R^N
• The average loss is given by
  R_emp(f, X, y) = (1/N) Σ_{n=1}^N ℓ(y_n, ŷ_n)
  where ŷ_n = f(x_n, θ). The above equation is called the empirical risk. The learning strategy is called empirical risk minimization.
Least Squares Linear Regression (revision)
• We use the squared loss function
  ℓ(y_n, ŷ_n) = (y_n − ŷ_n)²
• The empirical risk becomes
  R_emp(f, X, y) = (1/N) Σ_{n=1}^N ℓ(y_n, ŷ_n) = (1/N) Σ_{n=1}^N (y_n − f(x_n, θ))²
• Using the linear predictor f(x_n, θ) = θ^T x_n, we obtain the empirical risk
  R_emp(f, X, y) = (1/N) Σ_{n=1}^N (y_n − θ^T x_n)²
• This equation can be equivalently expressed in matrix form
  L(θ) := R_emp(f, X, y) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)^T (y − Xθ)
A closed-form analytic solution
• Loss function (mean squared error)
  L(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)^T (y − Xθ)
• We calculate the gradient of L with respect to the parameters θ as
  dL/dθ = d/dθ [ (1/N) (y − Xθ)^T (y − Xθ) ]
        = (1/N) d/dθ [ y^T y − θ^T X^T y − y^T Xθ + θ^T X^T Xθ ]
        = (1/N) d/dθ [ y^T y − 2 y^T Xθ + θ^T X^T Xθ ]
        = (1/N) (−2 y^T X + 2 θ^T X^T X) ∈ R^{1×D}
• The minimum is attained when the gradient is zero:
  dL/dθ = 0^T ⇔ θ^T X^T X = y^T X
          ⇔ θ^T = y^T X (X^T X)^{−1}
          ⇔ θ = (X^T X)^{−1} X^T y
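The closed-form solution θ = (X^T X)^{−1} X^T y can be sanity-checked in a few lines of NumPy. This is a minimal sketch on made-up, noise-free data (all names and values are illustrative); np.linalg.solve is used instead of forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))              # example matrix, rows are inputs x_n
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true                        # noise-free targets, so recovery is exact

# theta = (X^T X)^{-1} X^T y, computed with a linear solve rather than an inverse
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

On noise-free data the estimate recovers theta_true up to numerical precision.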
Linear Regression from a probability point of view
9.1 Problem formulation
• A linear regression problem with likelihood function
  p(y | x, θ) = N(y | x^T θ, σ²) ⇔ y = x^T θ + ε, ε ~ N(0, σ²)
• The class of functions described by y = x^T θ + ε are straight lines that pass through the origin.
• The likelihood p(y | x, θ) = N(y | x^T θ, σ²) is the probability density function of y, a Gaussian with mean x^T θ.
• The only source of uncertainty originates from the observation noise ε (we assume the noise variance σ² is known).
9.2 Parameter estimation
• Given a training set {(x_1, y_1), ..., (x_N, y_N)}, where x_n ∈ R^D and the corresponding observations/targets are y_n ∈ R
• y_i and y_j are conditionally independent given their inputs x_i and x_j
• We use the notation of the set of training inputs
  X := {x_1, ..., x_N}
  and the set of corresponding targets
  Y := {y_1, ..., y_N}
• The likelihood factorizes:
  p(Y | X, θ) = p(y_1, ..., y_N | x_1, ..., x_N, θ) = Π_{n=1}^N p(y_n | x_n, θ) = Π_{n=1}^N N(y_n | x_n^T θ, σ²)
9.2.1 Maximum Likelihood Estimation
• Intuitively, maximizing the likelihood means maximizing the predictive distribution of the training data given the model parameters. We obtain the maximum likelihood parameters as
  θ_ML = argmax_θ p(Y | X, θ) = argmax_θ Π_{n=1}^N p(y_n | x_n, θ)
• Instead of maximizing the likelihood directly, we apply the log-transformation to the likelihood function and minimize the negative log-likelihood
  −log p(Y | X, θ) = −log Π_{n=1}^N p(y_n | x_n, θ) = −Σ_{n=1}^N log p(y_n | x_n, θ)
• In the linear regression model y = x^T θ + ε, ε ~ N(0, σ²), the likelihood is Gaussian (due to the noise term ε), such that we arrive at
  log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_n^T θ)² + const
Maximum Likelihood Estimation
• Negative log-likelihood:
  −log p(Y | X, θ) = −log Π_{n=1}^N p(y_n | x_n, θ) = −Σ_{n=1}^N log p(y_n | x_n, θ)
• We have
  log p(y_n | x_n, θ) = −(1/(2σ²)) (y_n − x_n^T θ)² + const
• By substitution and discarding the const terms, we have
  L(θ) := (1/(2σ²)) Σ_{n=1}^N (y_n − x_n^T θ)² = (1/(2σ²)) (y − Xθ)^T (y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²
  where X is the example feature matrix
  X := [x_1, ..., x_N]^T ∈ R^{N×D}
  and y the label vector
  y := [y_1, ..., y_N]^T ∈ R^N
Linear regression from Stats and Prob views
• From the Stats view:
  L(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)^T (y − Xθ)
• From the Prob view:
  L(θ) = (1/(2σ²)) (y − Xθ)^T (y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²
• Both N and σ are known constants, so the two loss functions differ only by a positive scale factor; they are equivalent and have the same closed-form solution
  θ = (X^T X)^{−1} X^T y
Issues.
1. Need X^T X to be invertible
  • The feature vectors x_1, ..., x_N must span R^D
  • Must have at least as much data as features, N ≥ D
  • Use regularization if X^T X is not invertible
2. What if X^T X ∈ R^{D×D} is a large matrix?
  • Takes a long time to invert
  • Use stochastic gradient descent if X^T X is too large to invert
Linear regression with features
• So far, we have discussed linear regression which fits straight lines to data.
• However, straight lines are not sufficiently expressive when fitting more complex data.
• Fortunately, “linear regression” only refers to “linear in the parameters”.
• We can perform an arbitrary nonlinear transformation φ(x) of the inputs x and then linearly combine the components of this transformation:
  p(y | x, θ) = N(y | φ(x)^T θ, σ²)
  ⇔ y = φ(x)^T θ + ε = Σ_{k=0}^{K−1} θ_k φ_k(x) + ε
  where φ: R^D → R^K is a (nonlinear) transformation of the inputs x and φ_k: R^D → R is the k-th component of the feature vector φ. Note that the model parameters θ still appear only linearly.
Example - Polynomial Regression
• We are concerned with a regression problem y = φ(x)^T θ + ε, where x ∈ R and θ ∈ R^K. A transformation that is often used in this context is
  φ(x) = [φ_0(x), φ_1(x), ..., φ_{K−1}(x)]^T = [1, x, x², x³, ..., x^{K−1}]^T ∈ R^K
• We create a K-dimensional feature from a 1-dimensional input
• With these features, we can model polynomials of degree ≤ K − 1 within the framework of linear regression: a polynomial of degree K − 1 is
  f(x) = Σ_{k=0}^{K−1} θ_k x^k = φ(x)^T θ
  where θ = [θ_0, ..., θ_{K−1}]^T ∈ R^K contains the (linear) parameters.
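The polynomial feature map above can be sketched directly; the function name is illustrative, not from the slides.

```python
import numpy as np

def poly_features(x, K):
    """phi(x) = [1, x, x**2, ..., x**(K-1)] for a scalar input x."""
    return np.array([x ** k for k in range(K)])

phi = poly_features(2.0, K=4)   # [1., 2., 4., 8.]
```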
Check your understanding
• Linear regression belongs to supervised learning.
• If we use a polynomial feature expansion, linear regression is no longer linear.
• Linear regression is linear in that it uses a linear model of the input features.
• If we have limited but very high-dimensional data, we will likely have overfitting, but it is still possible to use the equation θ = (X^T X)^{−1} X^T y to calculate the optimal parameters.
• You are running a restaurant and want to use linear regression to estimate your daily income. Your prior knowledge tells you that the income is cyclical with different seasons. Which feature is best for your model?
• Polynomials of days
• Cosine function of days
• Exponential function of days
• Days
Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild, Wu et al., arXiv 2020
gradient reversal layer (GRL)
Linear regression with features
• We consider training inputs x_n ∈ R^D and targets y_n ∈ R, n = 1, ..., N, and define the feature matrix (design matrix) as
  Φ := [φ(x_1)^T; ...; φ(x_N)^T] =
  [ φ_0(x_1) ⋯ φ_{K−1}(x_1) ]
  [ φ_0(x_2) ⋯ φ_{K−1}(x_2) ]
  [    ⋮             ⋮       ]
  [ φ_0(x_N) ⋯ φ_{K−1}(x_N) ]  ∈ R^{N×K}
  where Φ_ij = φ_j(x_i) and φ_j: R^D → R.
• With this feature matrix Φ, the negative log-likelihood for the linear regression model can be written as
  −log p(Y | X, θ) = (1/(2σ²)) (y − Φθ)^T (y − Φθ) + const.
Linear regression with features
• With the feature matrix Φ, the negative log-likelihood for the linear regression model can be written as
  −log p(Y | X, θ) = (1/(2σ²)) (y − Φθ)^T (y − Φθ) + const    (1)
• Recall that we had
  L(θ) := (1/(2σ²)) Σ_{n=1}^N (y_n − x_n^T θ)² = (1/(2σ²)) (y − Xθ)^T (y − Xθ) = (1/(2σ²)) ‖y − Xθ‖²    (2)
  where X is the matrix of the data points, and that the solution to (2) was
  θ = (X^T X)^{−1} X^T y
• Comparing (1) and (2), we immediately see that we just need to replace X with Φ, and we arrive immediately at the closed-form solution to the linear regression with features problem:
  θ = (Φ^T Φ)^{−1} Φ^T y
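Putting the two pieces together, a minimal sketch: build the design matrix Φ with np.vander and solve θ = (Φ^T Φ)^{−1} Φ^T y. The quadratic data here are made up (and noise-free) so recovery is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=50)
y = 1.0 + 2.0 * x - 0.5 * x ** 2            # illustrative quadratic, no noise

K = 3                                        # features 1, x, x^2
Phi = np.vander(x, K, increasing=True)       # N x K design matrix, Phi[n, k] = x_n**k
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # ~ [1.0, 2.0, -0.5]
```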
Example - Feature Matrix for Second-order Polynomials
• For a second-order polynomial and N training points x_n ∈ R, n = 1, ..., N, the feature matrix is
  Φ = [ 1  x_1  x_1² ]
      [ 1  x_2  x_2² ]
      [ ⋮   ⋮    ⋮  ]
      [ 1  x_N  x_N² ]
• Another example: the dataset consists of N = 20 data pairs (x_n, y_n), where x_n ~ U(−5, 5) and y_n = −sin(x_n/5) + cos(x_n) + ε, with ε ~ N(0, 0.2²)
• In this figure, we fit a polynomial of degree K = 4 using maximum likelihood estimation.
9.2.2 Overfitting in Linear Regression
• Loss function (mean squared error)
  L(θ) = (1/N) ‖y − Φθ‖² = (1/N) Σ_{n=1}^N (y_n − φ(x_n)^T θ)²
• Root mean squared error (RMSE)
  √L(θ) = √( (1/N) ‖y − Φθ‖² ) = √( (1/N) Σ_{n=1}^N (y_n − φ(x_n)^T θ)² )
• We have 𝑁 data pairs and want to fit a polynomial model to the data
• We want to determine the polynomial degree 𝑀 that yields a low training loss and generalizes well (low test error).
• To determine the best value of 𝑀 (polynomial degree), we can perform a brute-force search and enumerate all (reasonable) values of 𝑀.
N = 10 data points
• When M = 0, 1, the polynomials fit the data poorly
• When M = 3, 4, the fits look plausible and smoothly interpolate the data
• When M = 6, 9, the polynomials fit the data better and better. In the extreme case of M = N − 1 = 9, the function passes through every single data point. These high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, so we suffer from overfitting.
9.2.2 Overfitting in Linear Regression
• Evaluate whether a model generalizes well: use a test set of 200 data points generated by the same procedure as the training set, and for each choice of M calculate the RMSE value.
• Initially the test error decreases
• For fourth-order polynomials, the test error is relatively low and stays relatively constant up to degree 5
• From degree 6 onward the test error increases significantly, and high-order polynomials have very bad generalization properties
• The training error never increases as the degree of the polynomial increases
• The best generalization (smallest test error) is obtained for M = 4
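The degree sweep above can be reproduced in a short script. This is a sketch following the slides' data generator (x ~ U(−5, 5), y = −sin(x/5) + cos(x) + noise); the seed, helper names, and exact RMSE values are illustrative, but the qualitative pattern (training error never increases with degree, test error improves over the constant fit before blowing up) is the point.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    x = rng.uniform(-5, 5, size=n)
    y = -np.sin(x / 5) + np.cos(x) + rng.normal(0, 0.2, size=n)
    return x, y

def fit_poly(x, y, deg):
    Phi = np.vander(x, deg + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # robust to poor conditioning
    return theta

def rmse(x, y, theta):
    Phi = np.vander(x, len(theta), increasing=True)
    return np.sqrt(np.mean((y - Phi @ theta) ** 2))

x_tr, y_tr = make_data(10)       # N = 10 training points, as in the slides
x_te, y_te = make_data(200)      # 200 test points from the same procedure

train_rmse, test_rmse = [], []
for M in range(10):              # brute-force search over degrees 0..9
    theta = fit_poly(x_tr, y_tr, M)
    train_rmse.append(rmse(x_tr, y_tr, theta))
    test_rmse.append(rmse(x_te, y_te, theta))
```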
9.2.4 Regularized Least Squares
• Overfitting occurs because the model is too complex (θ has too many non-zero entries).
• We want to penalize the amplitude of the parameters by regularization.
• In regularized least squares, we consider the loss function
  L_λ(θ) = (1/N) ‖y − Φθ‖² + λ ‖θ‖²
  where the first term is the data-fit term (“pressure to fit the data”), the second term is the regularizer (“pressure to simplify the model”), and λ ≥ 0 is the regularization parameter.
• We can use any p-norm ‖·‖_p in the regularizer; a smaller p leads to sparser solutions, i.e., many parameter values θ_d = 0.
9.2.4 Regularized Least Squares
• Loss function (written with the example matrix X; the feature matrix Φ plays the same role)
  L_λ(θ) = (1/N) ‖y − Xθ‖² + λ ‖θ‖²
• We calculate the gradient of L_λ with respect to the parameters θ as
  dL_λ/dθ = d/dθ [ (1/N) (y − Xθ)^T (y − Xθ) + λ ‖θ‖² ]
          = (1/N) (−2 y^T X + 2 θ^T X^T X) + d/dθ [ λ ‖θ‖² ]
          = (1/N) (−2 y^T X + 2 θ^T X^T X) + 2λ θ^T ∈ R^{1×D}
• The minimum is attained when the gradient is zero:
  dL_λ/dθ = 0^T ⇔ 2λ θ^T + (2/N) θ^T X^T X = (2/N) y^T X
            ⇔ θ^T = y^T X (Nλ I + X^T X)^{−1}
            ⇔ θ = (Nλ I + X^T X)^{−1} X^T y
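The regularized closed form is a one-line change from the unregularized one. A minimal sketch on made-up data; note the regularizer shrinks the parameter norm relative to plain least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 4
X = rng.normal(size=(N, D))      # illustrative random data
y = rng.normal(size=N)

lam = 0.1
# theta = (N*lambda*I + X^T X)^{-1} X^T y
theta_ridge = np.linalg.solve(N * lam * np.eye(D) + X.T @ X, X.T @ y)
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)   # lambda = 0 recovers least squares
```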
Effect of Regularization
Figure: training and test error versus the regularization parameter λ. A too-large λ underfits (both errors high); a too-small λ overfits (low training error, high test error); the best test error lies at an intermediate λ.
7. Continuous Optimization – analytic solution
• Consider the function
  l(x) = x⁴ + 7x³ + 5x² − 17x + 3
• It has a global minimum around x = −4.5, where the function value is about −47
• There exists another local minimum around x = 0.7
• We can solve for all the stationary points of a function by calculating its derivative and setting it to zero
7. Continuous Optimization – analytic solution
• Considering the following polynomial
  l(x) = x⁴ + 7x³ + 5x² − 17x + 3
• We calculate the gradient
  dl(x)/dx = 4x³ + 21x² + 10x − 17
• Set to zero, this cubic equation has in general three solutions:
  • two are minima (x ≈ −4.5, 0.7)
  • one is a maximum (x ≈ −1.4)
• To determine whether a stationary point is a minimum or a maximum, we calculate the second derivative
  d²l(x)/dx² = 12x² + 42x + 10
  and substitute our visually estimated values x = −4.5, −1.4, 0.7
• We observe that x ≈ −1.4 is a maximum, because d²l(x)/dx² < 0 there
• x ≈ −4.5, 0.7 are two minima, because d²l(x)/dx² > 0 there
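The stationary-point analysis above can be checked numerically: find the real roots of the derivative with np.roots and classify each by the sign of the second derivative. A sketch; variable names are illustrative.

```python
import numpy as np

# l(x) = x^4 + 7x^3 + 5x^2 - 17x + 3; stationary points are the real roots of
# dl/dx = 4x^3 + 21x^2 + 10x - 17
roots = np.roots([4.0, 21.0, 10.0, -17.0])
stationary = np.sort(roots.real[np.abs(roots.imag) < 1e-6])

d2l = lambda x: 12 * x ** 2 + 42 * x + 10           # second derivative
kinds = ["min" if d2l(x) > 0 else "max" for x in stationary]
```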
7. Continuous Optimization – Motivation for gradient descent
• In general, we are unable to find analytic solutions
• We need to start at some value, e.g., x_0 = −10, and follow the gradient
• The gradient points in the direction of steepest ascent of f
• When we start from x_0 = −10, we know we should go right, but we don’t know how far we should go (the step size)
• When starting from x_0 > −1, we might find the other, local minimum instead
7.1 Optimization Using Gradient Descent
• Given f: R^D → R, we consider the problem of solving for the minimum of f:
  min_x f(x)
• Gradient descent is a first-order optimization algorithm.
• To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
• Gradient descent exploits the fact that f(x_0) decreases fastest if one moves from x_0 in the direction of the negative gradient −(∇f(x_0))^T of f at x_0.
• Observation: if
  x_1 = x_0 − γ (∇f(x_0))^T
  for a small step size γ ≥ 0, then f(x_1) ≤ f(x_0).
7.1 Optimization Using Gradient Descent
• A simple gradient descent algorithm: if we want to find a local optimum f(x*) of a function f: R^D → R, x ↦ f(x), we start with an initial guess x_0 (the parameter we want to optimize) and then iterate according to
  x_{i+1} = x_i − γ_i (∇f(x_i))^T
• For suitable step sizes γ_i, the sequence f(x_0) ≥ f(x_1) ≥ ⋯ converges to a local minimum.
Example – gradient descent
• Consider a quadratic function in two dimensions
  f([x_1; x_2]) = (1/2) [x_1, x_2] [[2, 1], [1, 20]] [x_1; x_2] − [5, 3] [x_1; x_2]
  with gradient
  ∇f([x_1; x_2]) = [x_1, x_2] [[2, 1], [1, 20]] − [5, 3]
• Starting at the initial location x_0 = [−3, −1]^T, we apply gradient descent iteratively to obtain a sequence of estimates that converge to the minimum
• The gradient at x_0 points north and east, leading to x_1 = [−1.98, 1.21]^T
• We then obtain x_2 = [−1.32, −0.42]^T
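The two iterates above can be reproduced in a few lines. The slide does not state the step size; γ = 0.085 is an assumption that happens to match the printed iterates.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
grad = lambda x: A @ x - b       # gradient of f(x) = 0.5 x^T A x - b^T x (A symmetric)

gamma = 0.085                    # assumed: not stated on the slide
x0 = np.array([-3.0, -1.0])
x1 = x0 - gamma * grad(x0)       # approx [-1.98,  1.21]
x2 = x1 - gamma * grad(x1)       # approx [-1.32, -0.42]
```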
Example – Solving a Linear Equation System
• When we solve linear equations of the form Ax = b, in practice we aim to approximately find x* that minimizes the squared error
  L(x) = ‖Ax − b‖² = (Ax − b)^T (Ax − b)
• The gradient of L(x) with respect to x is
  ∇_x L = −2 b^T A + 2 x^T A^T A
• We can use this gradient directly in a gradient descent algorithm
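A sketch of that idea on a random overdetermined system: iterate with the (column form of the) gradient above and compare against the least-squares answer. The step size is set from the spectral norm of A so the iteration is stable; all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

L = 2 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
gamma = 1.0 / L                            # stable step size
x = np.zeros(5)
for _ in range(5000):
    grad = 2 * A.T @ (A @ x - b)           # column form of -2 b^T A + 2 x^T A^T A
    x = x - gamma * grad

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference solution
```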
7.1.3 Stochastic Gradient Descent
• Drawback of gradient descent: computing the gradient can be very time-consuming. Why?
• Given N data points 1, 2, ..., N, the loss function is a sum of losses L_n incurred by each example n:
  L_N(θ) = (1/N) Σ_{n=1}^N L_n(θ)
• Example (regression):
  L_N(θ) = (1/N) ‖y − Φθ‖² = (1/N) Σ_{n=1}^N (y_n − φ(x_n)^T θ)²
  where x_n ∈ R^D are the training samples, y_n are the labels, and θ are the parameters of the linear regression model we want to optimize.
• In standard gradient descent, optimization is performed using the full training set, by updating the parameter vector according to
  θ_{i+1} = θ_i − γ_i (∇L(θ_i))^T = θ_i − γ_i (1/N) Σ_{n=1}^N (∇L_n(θ_i))^T
7.1.3 Stochastic Gradient Descent
θ_{i+1} = θ_i − γ_i (∇L(θ_i))^T = θ_i − γ_i (1/N) Σ_{n=1}^N (∇L_n(θ_i))^T
• Evaluating the sum gradient may require expensive evaluations of the gradients from all individual functions L_n
• Stochastic gradient descent: estimate the gradient by averaging over a smaller minibatch (a subset of the training data)
  ∇L(θ) ≈ ∇L_B(θ) = (1/M) Σ_{n ∈ B} (∇L_n(θ))^T
  where M ≪ N and B is a subset of the (permuted) sample indices forming the minibatch, with |B| = M.
• For example, if the whole dataset has N = 4 samples indexed with n = 1, 2, 3, 4 and the minibatch contains M = 2 samples, B could be {1, 3}, {2, 4}, ...
Stochastic Gradient Descent
1. Initialize θ randomly.
2. Select a minibatch B of data from the training set at random and update
   θ ← θ − γ_i ∇L_B(θ)
3. Repeat Step (2) until convergence.

Gradient Descent
1. Initialize θ randomly.
2. Update, using all the training data to calculate the gradient:
   θ ← θ − γ_i ∇L_N(θ)
3. Repeat Step (2) until convergence.
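The SGD loop above can be sketched for linear regression on a synthetic dataset; seed, sizes, and step size are all illustrative, and the fixed iteration count stands in for a proper convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 200, 3, 20                        # dataset size, input dim, minibatch size
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, size=N)

theta = np.zeros(D)
gamma = 0.01
for _ in range(3000):
    B = rng.choice(N, size=M, replace=False)          # random minibatch indices
    Xb, yb = X[B], y[B]
    theta = theta - gamma * (2 / M) * Xb.T @ (Xb @ theta - yb)
```

With a small constant step size, the iterates hover in a small neighbourhood of the least-squares solution rather than converging exactly; decaying γ_i removes this residual noise.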
7.1.1 Step-size
• Choosing a good step-size (learning rate) is important in gradient descent
• If the step-size is too small, gradient descent can be slow
• If the step-size is chosen too large, gradient descent can overshoot, fail to converge, or even diverge
• There are several heuristics to adapt the step size
• When the function value increases after a gradient step, the step-size was too large. Undo the step and decrease the step-size
• When the function value decreases the step could have been larger. Try to increase the step-size.
• Typically, we choose a learning rate that starts big and ends small, e.g. 𝛾Y = 1/(𝑖 + 1)
7.1.1 Step-size
• There have been quite a few papers studying the step size/learning rate.
• [1] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
• [2] Large Batch Training of Convolutional Networks
• [3] A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation
• [4] Deep Residual Learning for Image Recognition
• A recent practice in deep learning is to start training with a small learning rate, switch to a large learning rate, and then decrease it gradually.
Impact of step size
Figure: the loss L_n(θ) under different step sizes η_k, from too small (slow progress) to too large (overshooting).
7.1.2 Gradient Descent With Momentum
• The convergence of gradient descent may be very slow if the curvature of the optimization surface is such that there are regions that are poorly scaled
• The proposed method to improve convergence is to give gradient descent some memory
• Gradient descent with momentum is a method that introduces an additional term to remember what happened in the previous iteration.
• This memory dampens oscillations and smooths out the gradient updates
• The idea is to have a gradient update with memory to implement a moving average
7.1.2 Gradient Descent With Momentum
• The momentum-based method remembers the update Δx_i at each iteration i and determines the next update as a linear combination of the current gradient and the previous update:
  x_{i+1} = x_i − γ_i (∇f(x_i))^T + α Δx_i
  Δx_i = x_i − x_{i−1} = α Δx_{i−1} − γ_{i−1} (∇f(x_{i−1}))^T
  where α ∈ [0, 1].
• Sometimes we only know the gradient approximately. In such cases, the momentum term is useful, since it averages out the different noisy estimates of the gradient.
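A sketch of the momentum update on the poorly scaled quadratic from the earlier gradient descent example; the step size and momentum coefficient are illustrative choices, not values from the slides.

```python
import numpy as np

# the poorly scaled quadratic: curvature 2 in one direction, 20 in the other
A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)             # the true minimum

def descend(gamma, alpha, steps=100):
    x = np.array([-3.0, -1.0])
    dx = np.zeros(2)                       # remembered update Delta x
    for _ in range(steps):
        dx = -gamma * grad(x) + alpha * dx # current gradient plus memory term
        x = x + dx
    return x

plain = descend(gamma=0.04, alpha=0.0)     # ordinary gradient descent
heavy = descend(gamma=0.04, alpha=0.6)     # with momentum
```

After the same number of steps, the momentum run ends much closer to the minimum: the memory term accelerates progress along the flat direction while damping oscillation along the steep one.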
Linear regression with gradient descent
• Loss function
  L(θ) = (1/N) ‖y − Xθ‖² = (1/N) (y − Xθ)^T (y − Xθ)
• Gradient
  ∇L_N = (1/N) (−2 y^T X + 2 θ^T X^T X)
• Gradient descent
  θ ← θ − γ_i ( (2/N) θ^T X^T X − (2/N) y^T X )^T
• Stochastic gradient descent (over a minibatch of M samples)
  θ ← θ − γ_i ( (2/M) θ^T X_B^T X_B − (2/M) y_B^T X_B )^T
  where X_B and y_B contain only the minibatch samples.
• Why not use the exact solution θ = (X^T X)^{−1} X^T y?
  X^T X ∈ R^{D×D} could be a large matrix, and it takes a long time to invert.
Regularized Linear regression with gradient descent
• Loss function
  L_λ(θ) = (1/N) ‖y − Xθ‖² + λ ‖θ‖²
• Gradient
  ∇L_λ = (1/N) (−2 y^T X + 2 θ^T X^T X) + 2λ θ^T
• Gradient descent
  θ ← (1 − 2 γ_i λ) θ − γ_i ( (2/N) θ^T X^T X − (2/N) y^T X )^T
• Stochastic gradient descent (over a minibatch of M samples)
  θ ← (1 − 2 γ_i λ) θ − γ_i ( (2/M) θ^T X_B^T X_B − (2/M) y_B^T X_B )^T
  where X_B and y_B contain only the minibatch samples.
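As a check that the regularized gradient update and the closed form from the regularized least squares slide agree, a sketch on made-up data: iterate the full-batch update and compare with θ = (NλI + X^T X)^{−1} X^T y.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 100, 4
X = rng.normal(size=(N, D))      # illustrative random data
y = rng.normal(size=N)
lam, gamma = 0.1, 0.05

theta = np.zeros(D)
for _ in range(5000):
    data_grad = (2 / N) * X.T @ (X @ theta - y)
    theta = (1 - 2 * gamma * lam) * theta - gamma * data_grad   # weight decay form

# closed-form ridge solution for comparison
theta_closed = np.linalg.solve(N * lam * np.eye(D) + X.T @ X, X.T @ y)
```

The (1 − 2γλ) factor is exactly the "weight decay" multiplier: the regularizer shrinks θ a little at every step before the data-fit gradient is applied.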
Check your understanding
• When you increase the step size, the training time will be shortened.
• The step size is a hyperparameter.
• “Using a small step size all the time” vs “First use a big step size, then use a small step size”: both methods give you similar optimization results, when starting from the same random seed.
• We can use gradient descent to solve k-means.
• SGD gives you a more precise optimization result than standard GD.
• In both GD and SGD, the loss function will decrease with each iteration.
• In linear regression, the training loss (1/N) ‖y − Φθ‖² is usually greater when using regularization than when not using regularization.
• In L_λ(θ) = (1/N) ‖y − Φθ‖² + λ ‖θ‖², we can instead put the weight λ in front of the data-fit term (1/N) ‖y − Φθ‖², which will give us similar optimization results.