CSC 311: Introduction to Machine Learning
Lecture 2 – Linear Regression & Gradient Descent for Optimization
Anthony Bonner
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews
Last class, we discussed:
Supervised learning: we are given a training set consisting of
inputs (x ∈ R^d) and corresponding labels (t)
Training set {(x(1), t(1)), . . . , (x(N), t(N))}
The x(i) are vectors
The t(i) are also often called targets or the ground truth
Note that x(i) = (x1(i), x2(i), …, xd(i))⊤
The xj(i) are called features, components, or fields
Separate validation set (for tuning hyperparameters) and test set (to assess model generalizability)
So far, we have talked about procedures for learning: k Nearest Neighbours
Modular Approach to ML Algorithm Design
For the remainder of this course, we will take a more modular approach:
choose a model describing the relationships between variables of interest
define a loss function quantifying how bad the fit to the data is
choose a regularizer saying how much we prefer different candidate models (or explanations of data)
fit the model that minimizes the loss function and satisfies the constraint/penalty imposed by the regularizer, possibly using an optimization algorithm
Mixing and matching these modular components gives us a lot of new ML methods.
Modular Approach to Supervised Learning
Given training data D = {(x(i), t(i)) for i = 1, 2, …, N}, the objective is to learn a function f such that
t ≈ y = f(x)
Here, t is the ground truth and y is the prediction
Example: Grade Prediction (here, x ∈ R1 and t ∈ R)
We wish to learn a function t ≈ y = f(x), f : R → R. This task is challenging when there are no restrictions on f.
A model is a set of assumptions about (or restrictions on) the structure of f in t ≈ y = f(x).
The model or architecture restricts the hypothesis space, i.e., the possible functions f that we can learn.
Linear Regression Model
In the Linear Regression Model, we restrict f to linear functions of the inputs x = (x1, …, xD) to make predictions y of the target value t:
y = f(x) = Σ_j wj xj + b
y is the prediction
w are the weights
b is the bias (or intercept) (do not confuse with the bias-variance tradeoff in the next lecture)
w and b together are the parameters
We hope to choose “good” parameter values, so that our prediction is
close to the target: y ≈ t.
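To make this concrete, here is a minimal NumPy sketch of the model; the data and parameter values are made up for illustration:

```python
import numpy as np

def predict(x, w, b):
    """Linear model: y = w^T x + b."""
    return np.dot(w, x) + b

# A toy example with D = 3 features.
w = np.array([0.5, -1.2, 0.3])  # weights
b = 2.0                         # bias (intercept)
x = np.array([1.0, 0.0, 4.0])   # one input vector
print(predict(x, w, b))         # a single scalar prediction y
```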
What is Linear? 1 Feature vs. D Features
[Figure: data points with a fitted line; x-axis: x (features), y-axis: y (response)]
The relation between the prediction y and the inputs x is linear in both cases.
If we have only 1 feature:
y = wx + b where w, x, b ∈ R.
y is linear in x.
If we have D features:
y = w⊤x + b where w, x ∈ R^D, b ∈ R
y is linear in x.
Which hypothesis is better suited to the data?
Quantifying “badness”
A good hypothesis should make good predictions about our labeled data (x(1), t(1)), (x(2), t(2)), … (x(N), t(N))
That is, y(i) = wx(i) + b should be “close to” t(i)
find the “best” line (w, b) that minimizes
Σ_{i=1}^N L(y(i), t(i))
Linear Regression: Summary So Far
We have a dataset D = {(x(i),t(i)) for i = 1,2,…,N} where,
x(i) = (x1(i), x2(i), …, xD(i))⊤ ∈ R^D are the inputs, e.g., age, height.
t(i) ∈ R is the target or response (e.g. income),
predict t(i) with a linear function of x(i):
[Figure: data points with a fitted line; x-axis: x (features), y-axis: y (response)]
t(i) ≈ y(i) = w⊤x(i) + b
Find the “best” line (w, b) that minimizes
Σ_{i=1}^N L(y(i), t(i))
Loss Function: Square Error Loss
How to quantify the quality of the fit to data?
A loss function L(y, t) defines how bad it is if, for some example x, the algorithm predicts y, but the target is actually t.
Squared error loss function:
L(y, t) = ½ (y − t)²
y − t is the residual, and we want to make its magnitude small
As we will see later, squared error loss is both mathematically simple and computationally efficient, and has a simple statistical interpretation.
The ½ factor is just to make the calculations convenient.
Loss Function vs Cost Function
Squared error loss function:
L(y, t) = ½ (y − t)²
Cost function: loss function averaged over all training examples
J(w, b) = (1/2N) Σ_{i=1}^N (y(i) − t(i))² = (1/2N) Σ_{i=1}^N (w⊤x(i) + b − t(i))²
The terminology is not universal. Some might call “loss” pointwise
loss and the “cost function” the empirical loss or average loss.
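As a sketch of the distinction (NumPy, with made-up toy data):

```python
import numpy as np

def squared_error_loss(y, t):
    """Pointwise loss L(y, t) = 0.5 * (y - t)^2."""
    return 0.5 * (y - t) ** 2

def cost(w, b, X, t):
    """Cost J(w, b): the loss averaged over all N training examples."""
    y = X @ w + b  # predictions for every example
    return np.mean(squared_error_loss(y, t))

# Toy data: N = 4 examples, D = 2 features.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([1.0, 0.5, 1.5, 2.5])
print(cost(np.array([1.0, 1.0]), 0.0, X, t))
```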
What to do now…
Now that we have defined the loss function L, we can formalize what it means for y(i) = w⊤x(i) + b to be “close to” t(i), and find (w, b) that
minimize Σ_{i=1}^N L(y(i), t(i))
But before we do so, let’s vectorize our notation.
Vectorization helps us express ideas succinctly, and write faster code later on.
Vector Notation
We can organize all the training examples into a design matrix X with one row per training example, and all the targets into the target vector t.
(Exercise) Computing the predictions for the whole dataset:
Xw + b1 = (w⊤x(1) + b, …, w⊤x(N) + b)⊤ = (y(1), …, y(N))⊤ = y
1 is a column vector of ones
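A quick sketch checking that the vectorized form Xw + b1 matches the per-example predictions (toy data and names are illustrative):

```python
import numpy as np

N, D = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))  # design matrix: one row per training example
w = rng.normal(size=D)
b = 0.5

# Per-example loop: y(i) = w^T x(i) + b
y_loop = np.array([w @ X[i] + b for i in range(N)])

# Vectorized: y = Xw + b*1
y_vec = X @ w + b * np.ones(N)

print(np.allclose(y_loop, y_vec))  # True
```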
Vectorization
(Exercise) Computing the squared error cost across the whole dataset:
y = Xw + b1
J = (1/2N) ∥y − t∥²
Note that sometimes we may use J = ½ ∥y − t∥², without the 1/N. That would correspond to the sum of losses rather than the average loss; the optimal (w, b) are the same either way.
(Exercise) We can also add a column of 1s to the design matrix, combine the bias and the weights, and conveniently write
X = [[1, [x(1)]⊤], [1, [x(2)]⊤], …, [1, [x(N)]⊤]] ∈ R^{N×(D+1)} and w = (b, w1, …, wD)⊤ ∈ R^{D+1}
Then, our predictions reduce to y = Xw.
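A sketch of this bias-absorption trick, with made-up data:

```python
import numpy as np

N, D = 5, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
b = -0.7

# Augment: prepend a column of ones to X, and stack b on top of w.
X_aug = np.hstack([np.ones((N, 1)), X])       # shape (N, D+1)
w_aug = np.concatenate([[b], w])              # shape (D+1,)

print(np.allclose(X_aug @ w_aug, X @ w + b))  # True: y = Xw in augmented form
```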
Solving the Minimization Problem
We defined a cost function. This is what we would like to minimize.
Recall from your calculus class: the minimum of a smooth function (if it exists) occurs at a critical point, i.e., a point where the derivative is zero.
Multivariate generalization: set all the partial derivatives to zero (or equivalently set the gradient vector to zero).
We would like to find a point where the gradient is (close to) zero. How can we do it?
Sometimes it is possible to directly find the parameters that make the gradient zero in a closed-form. We call this the direct solution.
We may also use optimization techniques that iteratively get us closer to the solution. We will get back to this soon.
Direct Solution
Partial derivatives: derivatives of a multivariate function with respect to one of its arguments.
∂f/∂x1 (x1, x2) = lim_{h→0} [f(x1 + h, x2) − f(x1, x2)] / h
To compute, take the single variable derivatives, pretending the other arguments are constant.
Example: partial derivatives of the prediction y
∂y/∂wj = ∂/∂wj [Σ_k wk xk + b] = xj
∂y/∂b = ∂/∂b [Σ_k wk xk + b] = 1
Direct Solution
Chain rule for derivatives:
∂L/∂wj = (dL/dy)(∂y/∂wj) = (y − t) · xj
∂L/∂b = y − t
Cost derivatives (average over data points):
∂J/∂wj = (1/N) Σ_{i=1}^N (y(i) − t(i)) xj(i)
∂J/∂b = (1/N) Σ_{i=1}^N (y(i) − t(i))
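A vectorized sketch of these cost derivatives (illustrative names; the formulas are exactly the averages above):

```python
import numpy as np

def gradients(w, b, X, t):
    """Return dJ/dw (a vector) and dJ/db (a scalar) for the squared error cost."""
    y = X @ w + b              # predictions
    r = y - t                  # residuals y(i) - t(i)
    grad_w = X.T @ r / len(t)  # (1/N) sum_i (y(i) - t(i)) x(i)
    grad_b = np.mean(r)        # (1/N) sum_i (y(i) - t(i))
    return grad_w, grad_b
```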
Direct Solution
The minimum must occur at a point where the partial derivatives are zero, i.e.,
∂J/∂wj = 0 (∀j), ∂J/∂b = 0.
If ∂J/∂wj ≠ 0, you could reduce the cost by changing wj.
This turns out to give a system of linear equations, which we can
solve efficiently.
Optimal weights for least squares (LS):
wLS = (X⊤X)⁻¹ X⊤ t
Linear regression is one of only a handful of models in this course that permit a direct solution.
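A sketch of the direct solution on synthetic data. In practice one solves the normal equations (X⊤X)w = X⊤t rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # bias absorbed
w_true = np.array([1.0, 2.0, -1.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)                  # noisy targets

# w_LS = (X^T X)^{-1} X^T t, computed by solving the normal equations.
w_ls = np.linalg.solve(X.T @ X, X.T @ t)
print(w_ls)  # close to w_true
```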
Feature Mapping (Basis Expansion)
The relation between the input and output may not be linear.
We can still use linear regression by mapping the input feature to another space using feature mapping (or basis expansion)
ψ : R^D → R^d, and treat the mapped feature (in R^d) as the input of a linear regression procedure.
Let us see how it works when x ∈ R and we use polynomial feature mapping.
Polynomial Feature Mapping
Fit the data using a degree-M polynomial function of the form:
y = w0 + w1x + w2x² + … + wMx^M = Σ_{i=0}^M wi x^i
Here the feature mapping is ψ(x) = [1, x, x², …, x^M]⊤.
We can still use least squares to find w since y = ψ(x)⊤w is linear
in w0, w1, ….
In general, ψ can be any function. Another example: ψ = [1, sin(2πx), cos(2πx), sin(4πx), cos(4πx), sin(6πx), cos(6πx), …]⊤.
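A sketch of polynomial feature mapping followed by least squares; the sin(2πx) + noise data is made up here, in the spirit of the figures that follow:

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs x to psi(x) = [1, x, x^2, ..., x^M]."""
    return np.vander(x, M + 1, increasing=True)  # shape (N, M+1)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=20)

Psi = poly_features(x, M=3)
w = np.linalg.lstsq(Psi, t, rcond=None)[0]  # least squares in the mapped space
print(w)  # y = psi(x)^T w is still linear in w
```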
Polynomial Feature Mapping with M = 0
y = w0
[Figure from Pattern Recognition and Machine Learning, C. Bishop]
Polynomial Feature Mapping with M = 1
y = w0 + w1x
[Figure from Pattern Recognition and Machine Learning, C. Bishop]
Polynomial Feature Mapping with M = 3
y = w0 + w1x + w2x² + w3x³
[Figure from Pattern Recognition and Machine Learning, C. Bishop]
Polynomial Feature Mapping with M = 9
y = w0 + w1x + w2x² + … + w9x⁹
[Figure from Pattern Recognition and Machine Learning, C. Bishop]
Model Complexity and Generalization
Underfitting (M=0): model is too simple — does not fit the data.
Overfitting (M=9): model is too complex — fits the training data perfectly but generalizes poorly.
Good model (M=3): achieves small test error (generalizes well).
Model Complexity and Generalization
As M increases, the magnitude of the coefficients gets larger.
For M = 9, the coefficients have become finely tuned to the data. Between data points, the function exhibits large oscillations.
Regularization for Controlling the Model Complexity
The degree of the polynomial M controls the complexity of the model.
The value of M is a hyperparameter for polynomial expansion, just like k in KNN. We can tune it using a validation set.
Restricting the number of parameters of a model (M here) is a crude approach to control the complexity of the model.
A better solution: keep the number of parameters of the model large, but enforce “simpler” solutions within the same space of parameters.
This is done through regularization or penalization.
Regularizer (or penalty): a function that quantifies how much we prefer one hypothesis vs. another
l2 (or L2) Regularization
We can encourage the weights to be small by choosing as our regularizer
the l2 (or L2) penalty.
Note: To be precise, we are regularizing the squared l2 norm.
R(w) = ½ ∥w∥₂² = ½ Σ_j wj²
The regularized cost function makes a tradeoff between fit to the data and the norm of the weights:
J_reg(w) = J(w) + λR(w) = J(w) + (λ/2) Σ_j wj²
The basic idea is that “simpler” functions have smaller l2-norm of their weights w, and we prefer them to functions with larger l2-norms.
If you fit the training data poorly, J is large. If the weights are large, R is large.
Large λ penalizes weight values more.
Here, λ is a hyperparameter that we can tune with a validation set.
l2 Regularized Least Squares: Ridge Regression
For the least squares problem, we have J(w) = (1/2N) ∥Xw − t∥².
When λ > 0 (with regularization), the regularized cost gives
wRidge^λ = argmin_w J_reg(w) = argmin_w { (1/2N) ∥Xw − t∥² + (λ/2) ∥w∥² } = (X⊤X + λN I)⁻¹ X⊤ t
The case λ = 0 (no regularization) reduces to the least squares solution!
Q: What happens when λ → ∞?
Note that it is also common to formulate this problem as argmin_w ½ ∥Xw − t∥² + (λ/2) ∥w∥², in which case the solution is wRidge^λ = (X⊤X + λI)⁻¹ X⊤ t.
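A sketch of this second formulation's closed form on made-up data; note how the weight norm shrinks as λ grows:

```python
import numpy as np

def ridge(X, t, lam):
    """Ridge regression: solve (X^T X + lam*I) w = X^T t."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
t = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

for lam in [0.0, 1.0, 100.0]:
    print(lam, np.linalg.norm(ridge(X, t, lam)))  # weights shrink as lam grows
```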
Probabilistic Interpretation of the Squared Error
For least squares, our model is t ≈ x⊤w + b with (w, b) ∈ R^D × R, and we minimize the sum of the squares of the errors between the predictions for each data point x(i) and the corresponding target values t(i), i.e.,
minimize Σ_{i=1}^N (w⊤x(i) + b − t(i))²
We measure the quality of the fit using the squared error loss. Why?
Even though the squared error loss looks natural, we did not really justify it.
We provide a probabilistic perspective here.
There are other justifications too; we get to them in the next lecture.
Probabilistic Interpretation of the Squared Error
Suppose that our model arose from a statistical model (b = 0 for simplicity):
y(i) = w⊤x(i) + ε(i)
where ε(i) ∼ N(0, σ²) is independent of anything else.
Thus, y(i) | x(i) ∼ p(y | x(i), w) = N(w⊤x(i), σ²).
Under these assumptions, the least-squares solution is the maximum-likelihood solution.
Probabilistic Interpretation of the Squared Error: Maximum Likelihood Estimation (Optional)
Suppose that the input data {x(1), x(2), . . . , x(N)} are given and the outputs are independently drawn from
y(i) ∼ p(y|x(i), w)
with an unknown parameter w. So the dataset is
D = {(x(1), y(1)), (x(2), y(2)), . . . , (x(N), y(N))}.
The likelihood function is Pr(D|w).
Maximum likelihood estimation (MLE) is the principle that we should find a parameter ŵ that maximizes the likelihood, i.e.,
ŵ ← argmax_w Pr(D|w).
Maximum likelihood estimation: after observing the data samples (x(i),y(i)) for i = 1,2,…,N, we should choose w that maximizes the likelihood.
Probabilistic Interpretation of the Squared Error: Maximum Likelihood Estimation (Optional)
For independent samples, the likelihood function of samples D is the product of their likelihoods
p(y(1), y(2), …, y(N) | x(1), x(2), …, x(N), w) = ∏_{i=1}^N p(y(i) | x(i), w) = L(w).
A product of N terms is not easy to maximize. Taking the log reduces it to a sum! The two objectives are equivalent since log is strictly increasing.
Maximizing the likelihood is equivalent to minimizing the negative log-likelihood:
ℓ(w) = − log L(w) = − log ∏_{i=1}^N p(z(i)|w) = − Σ_{i=1}^N log p(z(i)|w)
Maximum Likelihood Estimator (MLE)
After observing i.i.d. samples z(i) = (x(i), y(i)), i = 1, …, N, from p(z|w), the MLE is
wMLE = argmin_w ℓ(w) = argmin_w − Σ_{i=1}^N log p(z(i)|w).
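As a sketch, the Gaussian negative log-likelihood worked out on the next slide, written in NumPy (illustrative names; σ fixed and assumed known):

```python
import numpy as np

def neg_log_likelihood(w, X, y, sigma=1.0):
    """l(w) = -sum_i log N(y(i) | w^T x(i), sigma^2)."""
    r = y - X @ w  # residuals y(i) - w^T x(i)
    # Squared-error term plus a constant that does not depend on w:
    return (np.sum(r ** 2) / (2 * sigma ** 2)
            + len(y) * 0.5 * np.log(2 * np.pi * sigma ** 2))
```

Minimizing this over w ignores the constant term, so it recovers exactly the least squares objective.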
Probabilistic Interpretation of the Squared Error: From MLE to Squared Error (Optional)
Suppose that our model arose from a statistical model:
y(i) = w⊤x(i) + ε(i)
where ε(i) ∼ N(0, σ²) is independent of anything else. Then
p(y(i) | x(i), w) = (1/√(2πσ²)) exp( −(y(i) − w⊤x(i))² / (2σ²) )
log p(y(i) | x(i), w) = −(1/(2σ²)) (y(i) − w⊤x(i))² − log √(2πσ²)
The MLE solution is
wMLE = argmin_w ℓ(w) = argmin_w (1/(2σ²)) Σ_{i=1}^N (y(i) − w⊤x(i))² + C.
As C and σ do not depend on w, they do not contribute to the minimization.
wMLE = wLS when we work with Gaussian densities.
Gradient Descent
Now let’s see a second way to minimize the cost function which is more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros) and repeatedly adjust them in the direction of steepest descent.
Gradient Descent in 1D
Consider this function f(x) : R → R
Q: If we want to move the red point closer to the minima, do we move left or right?
Q: At the red point x, is the derivative f′(x) positive or negative?
We want to move x in the negative direction of the gradient!
Gradient Descent in 1D
How much should we move?
Q: Should we make a larger jump at the red point or green?
The larger |f′(x)| is, the more we should move. We slow down close to a minimum.
x ← x − αf′(x)
The term α is called the learning rate.
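A minimal sketch of this 1D update (the function and step size are made up for illustration):

```python
def grad_descent_1d(f_prime, x0, alpha=0.1, steps=100):
    """Repeatedly step against the derivative: x <- x - alpha * f'(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x

# Example: f(x) = (x - 3)^2 has f'(x) = 2(x - 3); minimum at x = 3.
print(grad_descent_1d(lambda x: 2 * (x - 3), x0=0.0))  # approx 3.0
```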
Gradient Descent in 2D
The same idea holds in higher dimensions:
[Figure: gradient descent steps on a 2D contour plot]
Aside: Do you know how to read a contour plot?
Contours: Mixture of Gaussians
[Figure: contour plots of a mixture of Gaussians]
Contour Maps
[Figure: contour maps]
Gradient Descent for Linear Regression
if ∂J /∂wj > 0, then increasing wj increases J .
if ∂J /∂wj < 0, then increasing wj decreases J .
The following update decreases the cost function:
wj ← wj − α ∂J/∂wj = wj − (α/N) Σ_{i=1}^N (y(i) − t(i)) xj(i)
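Putting the pieces together, a sketch of gradient descent for linear regression on synthetic data (illustrative names; it reuses the averaged gradients above):

```python
import numpy as np

def fit_linear_regression_gd(X, t, alpha=0.1, steps=1000):
    """Minimize J(w, b) = (1/2N) sum_i (w^T x(i) + b - t(i))^2 by gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0        # initialize to something reasonable (zeros)
    for _ in range(steps):
        r = X @ w + b - t          # residuals y - t
        w -= alpha * (X.T @ r) / N # wj <- wj - alpha * dJ/dwj
        b -= alpha * np.mean(r)    # b  <- b  - alpha * dJ/db
    return w, b

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
t = X @ np.array([2.0, -1.0]) + 0.5 + 0.05 * rng.normal(size=100)
print(fit_linear_regression_gd(X, t))  # approx ([2, -1], 0.5)
```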