Machine Learning Linear Regression
Department of Computer Science University College London
Lecture Overview
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
Lecture Overview
Learning Outcomes for Today’s Lecture
By the end of this lecture you should:
1 Understand the purpose of Linear Regression
2 Understand motivations for some approaches to linear regression, in particular Ordinary Least Squares (OLS)
3 Know how the normal equations and gradient descent can be used to solve the OLS approach to linear regression
The Purpose of Linear Regression
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
The Purpose of Linear Regression
Recall that in regression problems we seek to learn a mapping between input features and a continuous output label
We can then use this mapping to make output predictions given novel input data
In linear regression we seek mappings which are linear functions of the input features
The Purpose of Linear Regression
x = [1, x_1, …, x_m]^T ∈ R^(m+1)
Training Data
S = {(x^(i), y^(i))}_{i=1}^n
The Purpose of Linear Regression
Probabilistic Environment
We assume:
x is the outcome of a random variable X
y is the outcome of a random variable Y
(x, y ) are drawn i.i.d. from some data generating distribution, D, i.e.:
(x,y) ∼ D S ∼ Dn
The Purpose of Linear Regression
Representation
We seek to learn a linear mapping, f_w, characterised by a weight vector w ∈ R^(m+1) and drawn from a function class F:
F = {f_w(x) = w·x | w = [w_0, w_1, …, w_m]^T ∈ R^(m+1)}
The mapping for a particular w is:
f_w(x) = w_0 + w_1 x_1 + … + w_m x_m
Where w_0 is the bias term
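To make the representation concrete, here is a minimal numpy sketch (not from the slides; the weights and feature values are made up) of evaluating f_w on a single input with the bias feature prepended:

```python
import numpy as np

# Hypothetical weights and raw features, for illustration only.
w = np.array([0.5, 2.0, -1.0])      # [w0, w1, w2], where w0 is the bias term
x_raw = np.array([3.0, 4.0])        # [x1, x2]

x = np.concatenate(([1.0], x_raw))  # x = [1, x1, x2]^T
prediction = w @ x                   # f_w(x) = w0 + w1*x1 + w2*x2
print(prediction)                    # 0.5 + 2*3 - 1*4 = 2.5
```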
The Purpose of Linear Regression
Example: CAPM
r_m is the Market Return over the Risk-Free Return
r_i is the Individual Stock Return over the Risk-Free Return
Equilibrium model of relationship between the returns of individual stocks and the broad market return
The Purpose of Linear Regression
Example: CAPM
For our purposes we can treat it as a 1-dimensional illustration of linear regression, in which:
x = [1, r_m]^T,  y = r_i
And we seek w such that: f_w(r_m) = w_0 + w_1 r_m
Empirically, the relationship is not strong, and often the model is
extended to encompass further input attributes:
For example the Fama-French Three Factor Model includes inputs related to Market Capitalisation and Book-to-Market Value
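As an aside not in the slides, a hedged sketch of how this 1-dimensional regression might be fitted in numpy; the excess returns are synthetic and np.polyfit is just one convenient way to obtain the least-squares line:

```python
import numpy as np

rng = np.random.default_rng(0)
r_m = rng.normal(0.0, 0.04, size=250)          # synthetic market excess returns
beta_true, alpha_true = 1.2, 0.001             # assumed "true" parameters, for illustration
r_i = alpha_true + beta_true * r_m + rng.normal(0.0, 0.02, size=250)

# Least-squares fit of r_i = w0 + w1 * r_m
w1, w0 = np.polyfit(r_m, r_i, deg=1)
print(f"estimated beta: {w1:.3f}, estimated alpha: {w0:.4f}")
```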
The Purpose of Linear Regression
Rephrasing The Problem
We can represent our input training data more compactly as the
design matrix, X:
X = [ x^(1)T ; x^(2)T ; ⋯ ; x^(n)T ] =
  [ 1  x_1^(1)  ⋯  x_m^(1)
    1  x_1^(2)  ⋯  x_m^(2)
    ⋮    ⋮      ⋱    ⋮
    1  x_1^(n)  ⋯  x_m^(n) ]
The Purpose of Linear Regression
Rephrasing The Problem
And our output training data as y:
y = [y^(1), y^(2), …, y^(n)]^T
And we seek w such that:
Xw = y
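For concreteness (my own toy numbers, not from the slides), a small numpy sketch of assembling X and y in exactly this layout:

```python
import numpy as np

# Toy raw inputs: n = 4 examples, m = 2 features (values are illustrative).
X_raw = np.array([[2.0, 1.0],
                  [0.5, 3.0],
                  [1.5, 2.0],
                  [4.0, 0.0]])
y = np.array([3.0, 5.5, 4.0, 2.0])

n = X_raw.shape[0]
X = np.column_stack([np.ones(n), X_raw])  # prepend the column of 1s for the bias term
print(X.shape)                            # (4, 3), i.e. n x (m + 1)
```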
The Purpose of Linear Regression
Using our knowledge of matrix algebra can’t we just solve this system of equations as follows:
w = X^(−1) y ?
No! (at least not in general)
In general X is not square (and may not be of full rank), so it is not invertible
X is only invertible if it is square, n = (m + 1), and of full rank, rank(X) = m + 1
The Purpose of Linear Regression
Recall from our work on linear algebra that:
If rank(X) = rank(X|y) = (m + 1) ≤ n:
The equations are consistent
We have at least as many equations as unknowns
We have a unique solution
If rank(X) = rank(X|y) < m + 1:
The equations are consistent…
…But rank(X) < (m + 1) (for example when n < (m + 1) and the n equations are linearly independent), so our problem is underdetermined with no unique solution
We have too few independent equations to specify the unknowns uniquely
If rank(X) < rank(X|y)
The equations are inconsistent and no solution exists
We need a further condition to make the problem well-posed...
The Purpose of Linear Regression
Evaluation
In general we cannot find a w such that y(i) = fw(x(i)) for all i
We need an extra condition to guide us in choosing a particular w given S
Let’s try using the empirical squared error evaluation function¹:
L(E, S, f_w) = (1/2) Σ_{i=1}^n (y^(i) − f_w(x^(i)))^2 = (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2
¹ N.B. - here and throughout this lecture, we drop the usual 1/n multiplicative constant for convenience. We can do this because the resulting optimisation problem will be equivalent.
The Purpose of Linear Regression
Ordinary Least Squares
We can use this evaluation function to select the optimal weight vector:
w_OLS = argmin_w (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2
In this case our optimal w is known as the Ordinary Least Squares
(OLS) estimator (wOLS)
And this approach to linear regression is known as OLS regression
Note that other evaluation functions lead to other approaches to linear regression
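A minimal sketch (my own helper name, ols_loss, on a toy dataset) of evaluating this objective for a candidate w:

```python
import numpy as np

def ols_loss(w, X, y):
    """Empirical squared-error objective: 0.5 * sum_i (y_i - w . x_i)^2."""
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2)

# Toy design matrix (bias column included) and labels.
X = np.array([[1.0, 2.0], [1.0, 0.5], [1.0, 1.5]])
y = np.array([3.0, 1.0, 2.5])
print(ols_loss(np.array([0.0, 1.0]), X, y))   # 1.125 for this candidate w
```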
Motivation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
Motivation
Motivation
Why should we pick the least squares estimator?
We can motivate this choice in a number of different ways:
Motivation
Motivation 1:
Additive Noise Model with i.i.d. Noise
Assume (y^(i), x^(i)) are related by an additive noise model:
y^(i) = w·x^(i) + ε^(i)
Where the ε^(i) are outcomes of i.i.d., zero-mean, finite-variance noise ε
If we make these assumptions then the Gauss–Markov Theorem tells us that the OLS estimator, w_OLS, is the Best Linear Unbiased Estimator (BLUE)
Motivation
Motivation 2:
Additive Noise Model with i.i.d. Gaussian Noise
Assume further that the noise is normally distributed:
ε ∼ N(0, σ^2)  ⟹  y|x ∼ N(w·x, σ^2)
Given this further assumption, the Maximum Likelihood Estimator, w_MLE = w_OLS:
p_Y({y^(i)}_{i=1}^n | {x^(i)}_{i=1}^n ; w, σ) = Π_{i=1}^n p_Y(y^(i)|x^(i); w, σ) = Π_{i=1}^n (1/√(2πσ^2)) exp( −(y^(i) − w·x^(i))^2 / (2σ^2) )
Motivation
Motivation 2:
Additive Noise Model with i.i.d. Gaussian Noise
w_MLE = argmax_w ln Π_{i=1}^n (1/√(2πσ^2)) exp( −(y^(i) − w·x^(i))^2 / (2σ^2) )
      = argmax_w Σ_{i=1}^n ln[ (1/√(2πσ^2)) exp( −(y^(i) − w·x^(i))^2 / (2σ^2) ) ]
      = argmax_w [ −n ln √(2πσ^2) − Σ_{i=1}^n (y^(i) − w·x^(i))^2 / (2σ^2) ]
      = argmin_w (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2
      = w_OLS
Motivation
Additive Noise Model with i.i.d. Laplacian Noise
Assume instead that the noise is Laplace distributed:
ε ∼ Laplace(μ, b), where μ ∈ R and b > 0
This has a characteristic pdf, p_ε:
p_ε(ε; μ, b) = (1/(2b)) exp( −|ε − μ| / b )
E_D[ε] = μ,  Var_D[ε] = 2b^2
Motivation
Additive Noise Model with i.i.d. Laplacian Noise
For μ = 0, the likelihood of our data is given by:
p_Y({y^(i)}_{i=1}^n | {x^(i)}_{i=1}^n ; w, b) = Π_{i=1}^n p_Y(y^(i)|x^(i); w, b) = Π_{i=1}^n (1/(2b)) exp( −|y^(i) − w·x^(i)| / b )
Motivation
Additive Noise Model with i.i.d. Laplacian Noise
w_MLE = argmax_w ln Π_{i=1}^n (1/(2b)) exp( −|y^(i) − w·x^(i)| / b )
      = argmax_w Σ_{i=1}^n ln[ (1/(2b)) exp( −|y^(i) − w·x^(i)| / b ) ]
      = argmax_w [ −n ln(2b) − (1/b) Σ_{i=1}^n |y^(i) − w·x^(i)| ]
      = argmin_w Σ_{i=1}^n |y^(i) − w·x^(i)|
This is the absolute deviation evaluation function
The resulting estimator is more robust to outliers than the squared-error (OLS) estimator
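To make the robustness claim concrete, a small numerical illustration of my own: for the simplest linear model, a constant f_w(x) = w_0, the squared-error minimiser is the mean and the absolute-deviation minimiser is the median, and a single outlier moves the former far more than the latter:

```python
import numpy as np

# Fit a constant model f_w(x) = w0 to 1-D labels.
y = np.array([1.0, 1.1, 0.9, 1.05, 0.95])
y_outlier = np.append(y, 100.0)                   # add one gross outlier

# Squared error is minimised by the mean, absolute deviation by the median.
print(np.mean(y), np.median(y))                   # both ~1.0
print(np.mean(y_outlier), np.median(y_outlier))   # the mean jumps, the median barely moves
```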
Motivation
Motivation 3
The squared error function itself has directly attractive properties: it is
Intuitive
Smooth
Convex
Recall the PAC approach:
It does not attempt to characterise a data generating distribution (in this case the additive noise distribution)
Instead it characterises a well-behaved evaluation function, appropriate to the task at hand
Optimisation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
Optimisation
Representation:
F = {f_w(x) = w·x | w ∈ R^(m+1)}
Evaluation:
L(E, S, f_w) = (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2
Optimisation:
w_OLS = argmin_w (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2
But how do we perform this optimisation?
Optimisation / Analytic Optimisation
Analytic Optimisation
Let us re-write our evaluation function:
L(E, S, f_w) = (1/2) Σ_{i=1}^n (y^(i) − w·x^(i))^2 = (1/2) ∥y − Xw∥_2^2
This function is continuous and differentiable in w
Optimisation / Analytic Optimisation
Analytic Optimisation
For functions of this type we can solve by finding the stationary
point, i.e. the w for which ∇_w L = 0
Here:
∇_w L = [ ∂L/∂w_0, ∂L/∂w_1, ⋯, ∂L/∂w_m ]^T
Optimisation / Analytic Optimisation
Analytic Optimisation
...And by checking for convexity, i.e. ∇²_w L ≽ 0:
This ensures that the stationary point is globally optimal
Here ∇²_w L is the Hessian Matrix, H:
H = [ ∂²L/∂w_0²       ∂²L/∂w_0∂w_1   ⋯  ∂²L/∂w_0∂w_m
      ∂²L/∂w_1∂w_0    ∂²L/∂w_1²      ⋯  ∂²L/∂w_1∂w_m
      ⋮                ⋮              ⋱   ⋮
      ∂²L/∂w_m∂w_0    ∂²L/∂w_m∂w_1   ⋯  ∂²L/∂w_m²    ]
Optimisation / Analytic Optimisation
Analytic Optimisation
Compute the derivative of L(E, S, f ):
∇_w L = (1/2) ∇_w (y − Xw)^T (y − Xw)
      = (1/2) ∇_w ( y^T y − y^T Xw − w^T X^T y + w^T X^T X w )
      = (1/2) ∇_w ( w^T X^T X w − 2 y^T X w )
      = X^T X w − X^T y
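As a quick sanity check of this expression (a sketch with random toy data, not part of the lecture), we can compare X^T X w − X^T y against a finite-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))   # toy design matrix (bias column omitted for brevity)
y = rng.normal(size=20)
w = rng.normal(size=4)

def loss(w):
    r = y - X @ w
    return 0.5 * r @ r

analytic = X.T @ X @ w - X.T @ y

eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-6 or smaller)
```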
Optimisation / Analytic Optimisation
Analytic Optimisation
And set it equal to zero:
X^T X w_OLS − X^T y = 0
X^T X w_OLS = X^T y
This is a system of linear equations (again)...how should we proceed?
Optimisation / Analytic Optimisation
Analytic Optimisation
First note the following important fact:
rank(X^T X) = rank(X^T X | X^T y)
So the system is consistent and we have one or infinitely many solutions...
Let’s consider two cases:
Optimisation / Analytic Optimisation
Case 1: XT X is Invertible
If (X^T X)^(−1) exists then we may solve the system to obtain the following,
unique solution:
w_OLS = (X^T X)^(−1) X^T y
These are the Normal Equations which give an analytic solution to the OLS approach to linear regression
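A hedged numpy sketch of the normal equations; in practice one solves the linear system with np.linalg.solve (or uses np.linalg.lstsq) rather than forming (X^T X)^(−1) explicitly:

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve X^T X w = X^T y. Assumes X^T X is invertible (Case 1)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy example: y = 1 + 2*x1, with a bias column in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(ols_normal_equations(X, y))   # approximately [1.0, 2.0]
```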
Optimisation / Analytic Optimisation
Case 2: XT X is Not Invertible
If (XT X)−1 does not exist then rank(XT X) < (m + 1)
So there are too few equations to fully specify the number of unknowns and our problem is underdetermined
And since we know that our system is consistent (and so solutions exist) then this must mean that there is no unique solution, instead there are an infinite number
Optimisation / Analytic Optimisation
Case 2: Overfitting
When does this occur?
For example, consider fitting quadratic functions to just 2 data points:
[ y^(1) ]   [ 1  x^(1)  (x^(1))^2 ]  [ w_0 ]
[ y^(2) ] = [ 1  x^(2)  (x^(2))^2 ]  [ w_1 ]
                                     [ w_2 ]
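A small numerical illustration of my own of this underdetermined case: two data points, three quadratic coefficients, so X^T X is singular and many weight vectors fit the data exactly:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([1.0, 4.0])
X = np.column_stack([np.ones(2), x, x ** 2])   # 2 equations, 3 unknowns

print(np.linalg.matrix_rank(X.T @ X))          # 2 < 3, so X^T X is not invertible

# lstsq returns the minimum-norm solution; adding any null-space direction
# of X gives another exact fit.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
null_dir = np.array([2.0, -3.0, 1.0])          # X @ null_dir == 0 for x = 1, 2
for w in (w_min_norm, w_min_norm + 5.0 * null_dir):
    print(w, X @ w)                            # both reproduce y exactly
```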
Optimisation / Analytic Optimisation
Case 2: Overfitting
Optimisation / Analytic Optimisation
Case 2: Overfitting
There are infinitely many solutions...
...We have encountered a case of overfitting
We will return to consider this problem and its remedy in later lectures...
Optimisation / Analytic Optimisation
Recall that in our optimisation we needed to check for convexity; let’s do that now by examining the Hessian:
∂²L / (∂w_i ∂w_j) = (X^T X)_{ij}
H = ∇²_w L = X^T X
Optimisation / Analytic Optimisation
Now, for any a ∈ R^(m+1):
a^T X^T X a = ∥Xa∥_2^2 ≥ 0
Therefore ∇²_w L ≽ 0
Therefore L is convex [Theorem A.1]
Therefore our solution is globally optimal [Theorem A.2]
...But not necessarily unique
Let’s consider two cases again:
Optimisation / Analytic Optimisation
Case 1: XT X is Invertible
Recall from our work on linear algebra that:
∃ (X^T X)^(−1) ⟺ X^T X ≻ 0
But if XT X ≻ 0 then our objective is strictly convex
And if our objective is strictly convex then our solution must be unique [Theorem A.3]
Optimisation / Analytic Optimisation
Case 2: XT X is Not Invertible
Recall from our work on linear algebra that:
∃ (X^T X)^(−1) ⟺ X^T X ≻ 0
So in this case X^T X ≽ 0 but X^T X ⊁ 0
Now, recall our work on the optimisation of quadratic functions in the Calculus lecture. In particular, if X^T X ≽ 0 but X^T X ⊁ 0 then:
Our objective is not strictly convex (although it is bounded below, since X^T y ∈ range(X^T X))
So we will have an infinite number of solutions
We will return to these considerations later in the module, but for now we note the symmetry between invertibility and strict convexity
Optimisation / Analytic Optimisation
Normal Equations: Drawbacks
The normal equations are elegant, but for large m they have some drawbacks:
Time Constraints:
Matrix inversion takes O(m^3) operations - for large X^T X this is expensive
Space Constraints:
For very high m, XT X will not fit into memory
Is there an alternative?
Optimisation / Numerical Optimisation
Numerical Optimisation
There are several different approaches to numerical optimisation. We will investigate one of the most common:
Gradient Descent, a first order, iterative, optimisation algorithm
Optimisation / Numerical Optimisation
Gradient Descent
Takes steps proportional to the negative of the gradient at each point...
...Because the gradient of a function points in the direction of steepest ascent at that point, so the negative gradient points in the direction of steepest descent
The magnitude of the directional derivative in a general direction, u, is:
∇_w L · u = ∥∇_w L∥_2 ∥u∥_2 cos θ
This is maximal when θ = 0, i.e. when ∇_w L and u lie in the same direction
Makes monotonic progress (subject to step size)
Stops when the gradient is zero
Optimisation / Numerical Optimisation
Batch Gradient Descent
Algorithm 1 Batch Gradient Descent
1: Initialise t ← 0
2: Initialise w^(t=0) randomly
3: repeat
4:     w^(t+1) ← w^(t) − α ∇_w L |_{w = w^(t)}
5:     t ← t + 1
6: until convergence
Here α > 0 is the learning rate which determines the step size
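A minimal Python sketch of Algorithm 1 applied to the OLS objective; the variable names, stopping rule, and value of α are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, tol=1e-8, max_iters=100_000):
    """Batch GD for OLS: w <- w - alpha * (X^T X w - X^T y)."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # random initialisation
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y)             # full-batch gradient of the OLS loss
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) < tol:  # simple convergence check
            return w_new
        w = w_new
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(batch_gradient_descent(X, y))          # close to the normal-equations answer [1, 2]
```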
Optimisation / Numerical Optimisation
Batch Gradient Descent in 1 Dimension2
Update is large when the gradient is large
As the optimum is approached steps get smaller
2Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]
Optimisation / Numerical Optimisation
Batch Gradient Descent: Learning Rate Size3
For α too small, convergence is slow
For α too large, we get divergence
3Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]
Optimisation / Numerical Optimisation
Batch Gradient Descent: Convexity
Gradient Descent is guaranteed to converge to a local optimum, should one exist (for suitable α)…
…But not necessarily a global one
If the objective is non-convex then GD may not find the global optimum
But for OLS recall that ∇²_w L ≽ 0, which implies convexity [Theorem A.1] (and recall that X^T y ∈ range(X^T X), which implies that the objective is bounded below)…
…So GD will find a global optimum… [Theorem A.2]
…Although, for OLS, recall that unless ∇²_w L ≻ 0, such an optimum will not be unique
Optimisation / Numerical Optimisation
Batch Gradient Descent: Scaling4
For unscaled attributes, the gradient component for the parameter of the largest-scale attribute dominates the update, leading to slow convergence
For scaled attributes, all parameters are updated in similar proportions
4Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]
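One common remedy (an illustrative choice; the slides do not prescribe a particular method) is to standardise each attribute to zero mean and unit variance before running gradient descent:

```python
import numpy as np

def standardise(X_raw):
    """Scale each attribute to zero mean and unit variance (leave the bias column out)."""
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)
    return (X_raw - mu) / sigma, mu, sigma

X_raw = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])
X_scaled, mu, sigma = standardise(X_raw)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # ~[0, 0] and [1, 1]
```

At prediction time the same mu and sigma computed on the training set would be applied to new inputs.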
Optimisation / Numerical Optimisation
Batch Gradient Descent: Iteration Speed
Algorithm 2 Batch Gradient Descent for OLS
1: Initialise t ← 0
2: Initialise w^(t=0) randomly
3: repeat
4:     for j = 0 to m do
5:         [w^(t+1)]_j ← [w^(t)]_j − α Σ_{i=1}^n (w^(t)·x^(i) − y^(i)) [x^(i)]_j
6:     end for
7:     t ← t + 1
8: until convergence
For each update we need to scan through the entire training set
For large n this can be costly
Optimisation / Numerical Optimisation
Stochastic Gradient Descent
Algorithm 3 Stochastic Gradient Descent for OLS
1: Initialise t ← 0
2: Initialise w^(t=0) randomly
3: repeat
4:     Randomly shuffle dataset
5:     Initialise i ← 1
6:     repeat
7:         for j = 0 to m do
8:             [w^(t+1)]_j ← [w^(t)]_j − α (w^(t)·x^(i) − y^(i)) [x^(i)]_j
9:         end for
10:        i ← i + 1
11:        t ← t + 1
12:    until i = n + 1
13: until convergence
Here an update is made immediately as each example is examined
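A corresponding Python sketch of Algorithm 3; the epoch count, α, and use of numpy's permutation for shuffling are my own illustrative choices:

```python
import numpy as np

def sgd_ols(X, y, alpha=0.05, epochs=500, seed=0):
    """Stochastic gradient descent for OLS: one update per training example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)
    for _ in range(epochs):
        for i in rng.permutation(n):              # randomly shuffle the dataset
            grad_i = (w @ X[i] - y[i]) * X[i]     # gradient from a single example
            w = w - alpha * grad_i
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(sgd_ols(X, y))   # close to [1, 2]
```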
Optimisation / Numerical Optimisation
Stochastic Gradient Descent
Each individual SGD update is less optimal than a BGD update…
…And the algorithm will not converge monotonically…
…But overall convergence (for convex objectives) is generally faster, and scales better for large n
Optimisation / Numerical Optimisation
Analytic & Numerical Solutions: Symmetry
1 In general, for OLS:
rank(X^T X) = rank(X^T X | X^T y), so the system is consistent and we have a solution of some sort
X^T X ≽ 0, so the objective is convex, and X^T y ∈ range(X^T X), so the objective is bounded below; thus gradient descent will converge to a global solution of some sort
2 If (XT X)−1 exists:
The normal equations will work, rank(XT X) = (m + 1), and we have a unique solution
XT X ≻ 0 so our objective is strictly convex and gradient descent will converge to a unique solution
3 If (XT X)−1 does not exist:
The normal equations will not work, rank(XT X) < (m + 1), and we have an infinite number of solutions
X^T X ≽ 0 but X^T X ⊁ 0, so our objective is not strictly convex, and gradient descent will converge to one of the infinitely many solutions
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
Summary
1 Linear Regression can be used to estimate a linear relationship that might exist between our input and our output data
2 Ordinary Least Squares is a well-motivated approach to linear regression which leads to a smooth, convex, optimisation problem
3 We can use the normal equations or various gradient descent procedures to solve the OLS optimisation problem and to learn the parameters of the linear hypothesis
In the next lecture we will consider extending our representation to capture richer relationships via polynomial regression.
Appendix: Convex Optimisation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation
Appendix: Convex Optimisation
Convex Sets5
Definition:
A set Ω is convex if, for any x, y ∈ Ω and any θ ∈ [0, 1], we have
θx + (1 − θ)y ∈ Ω.
In other words, if we take any two elements in Ω, and draw a line segment between these two elements, then every point on that line segment also belongs to Ω.
5Boyd & Vandenberghe, ‘Convex Optimisation’ [2004]
Appendix: Convex Optimisation
Convex Functions
Definition:
A function f : R^n → R is convex if its domain is a convex set and if, for all x, y in its domain and all λ ∈ [0, 1], we have:
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
[Figure: a convex function, where the chord value λf(x) + (1 − λ)f(y) lies above f(λx + (1 − λ)y) for points λx + (1 − λ)y between x and y, and a non-convex function, where it does not.]
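As a quick, non-rigorous sanity check of this definition for the OLS objective (random toy data; a numerical check is of course not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)

def L(w):
    r = y - X @ w
    return 0.5 * r @ r

# Check L(lam*w1 + (1-lam)*w2) <= lam*L(w1) + (1-lam)*L(w2) at random points.
for _ in range(1000):
    w1, w2 = rng.normal(size=5), rng.normal(size=5)
    lam = rng.uniform()
    assert L(lam * w1 + (1 - lam) * w2) <= lam * L(w1) + (1 - lam) * L(w2) + 1e-9
print("no violations found")
```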
Appendix: Convex Optimisation
1st & 2nd Order Characterisations of Convex, Differentiable Functions
Theorem A.1:
Suppose f is twice differentiable over an open domain. Then, the following are equivalent:
f is convex
f(y) ≥ f(x) + ∇_x f(x) · (y − x)  ∀ x, y ∈ dom(f)
∇²_x f(x) ≽ 0  ∀ x ∈ dom(f)
Appendix: Convex Optimisation
Global Optimality
Theorem A.2:
Consider an unconstrained optimisation problem: min f (x)
subject to: x ∈ Rn
Where f : Rn → R is convex and differentiable.
Then any point x that satisfies ∇xf(x) = 0 is a globally optimal solution
Appendix: Convex Optimisation
Strict Convexity
Definition:
A function f : R^n → R is strictly convex if its domain is a convex set and if, for all x, y in its domain with x ≠ y, and all λ ∈ (0, 1), we have:
f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)
First Order Characterisation:
A function f is strictly convex on Ω ⊆ R^n if and only if:
f(y) > f(x) + ∇_x f(x) · (y − x)  ∀ x, y ∈ Ω, x ≠ y