
Machine Learning Linear Regression

Department of Computer Science University College London


Lecture Overview
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

Lecture Overview
Learning Outcomes for Today’s Lecture
By the end of this lecture you should:
1 Understand the purpose of Linear Regression
2 Understand motivations for some approaches to linear regression, in particular Ordinary Least Squares (OLS)
3 Know how the normal equations and gradient descent can be used to solve the OLS approach to linear regression

The Purpose of Linear Regression
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

The Purpose of Linear Regression
Recall that in regression problems we seek to learn a mapping between input features and a continuous output label
We can then use this mapping to make output predictions given novel input data
In linear regression we seek mappings which are linear functions of the input features

The Purpose of Linear Regression
x = [1, x_1, …, x_m]^T ∈ R^{m+1}
Training Data
S = {(x^{(i)}, y^{(i)})}_{i=1}^{n}

The Purpose of Linear Regression
Probabilistic Environment
We assume:
x is the outcome of a random variable X
y is the outcome of a random variable Y
(x, y) are drawn i.i.d. from some data generating distribution, D, i.e.:
(x, y) ∼ D
S ∼ D^n

The Purpose of Linear Regression
Representation
We seek to learn a linear mapping, f_w, characterised by a weight vector w ∈ R^{m+1}, and drawn from a function class F:

F = { f_w(x) = w · x | w = [w_0, w_1, …, w_m]^T ∈ R^{m+1} }
The mapping for a particular w is:
f_w(x) = w_0 + w_1 x_1 + … + w_m x_m
where w_0 is the bias term
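
As a minimal illustration (the numbers below are made up), the mapping is just a dot product once a constant 1 is prepended to the raw input so that w_0 acts as the bias:

import numpy as np

def predict(w, x):
    # x = [1, x_1, ..., x_m]: the leading 1 multiplies the bias term w_0,
    # so f_w(x) = w . x = w_0 + w_1 x_1 + ... + w_m x_m
    return np.dot(w, x)

w = np.array([0.5, 2.0, -1.0])   # w_0 = 0.5 (bias), w_1 = 2.0, w_2 = -1.0
x = np.array([1.0, 3.0, 4.0])    # raw features (3.0, 4.0) with a 1 prepended
print(predict(w, x))             # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5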

The Purpose of Linear Regression
Example: CAPM
r_m is the Market Return over the Risk-Free Return
r_i is the Individual Stock Return over the Risk-Free Return
Equilibrium model of relationship between the returns of individual stocks and the broad market return

The Purpose of Linear Regression
Example: CAPM
For our purposes we can treat it as a 1-dimensional illustration of linear regression, in which:
x = [1, r_m]^T,  y = r_i
And we seek w such that: f_w(r_m) = w_1 r_m
Empirically, the relationship is not strong, and often the model is
extended to encompass further input attributes:
For example the Fama-French Three Factor Model includes inputs related to Market Capitalisation and Book-to-Market Value
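
As an illustration only (the excess returns below are synthetic, not real market data), the 1-D least-squares slope can be computed directly from sample statistics; an intercept is estimated alongside the slope here for generality, even though the equilibrium CAPM relationship has none:

import numpy as np

# Synthetic excess returns, for illustration only
r_m = np.array([0.02, -0.01, 0.03, 0.01, -0.02])    # market excess returns
r_i = np.array([0.025, -0.015, 0.04, 0.01, -0.03])  # stock excess returns

# For 1-D OLS with an intercept, the slope (the stock's "beta") is
# Cov(r_m, r_i) / Var(r_m); the intercept then follows from the sample means.
beta = np.cov(r_m, r_i, ddof=1)[0, 1] / np.var(r_m, ddof=1)
alpha = r_i.mean() - beta * r_m.mean()
print(alpha, beta)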

The Purpose of Linear Regression
Rephrasing The Problem
We can represent our input training data more compactly as the
design matrix, X:

    [ x^{(1)T} ]   [ 1  x_1^{(1)}  ⋯  x_m^{(1)} ]
X = [ x^{(2)T} ] = [ 1  x_1^{(2)}  ⋯  x_m^{(2)} ]
    [    ⋮     ]   [ ⋮      ⋮      ⋱      ⋮     ]
    [ x^{(n)T} ]   [ 1  x_1^{(n)}  ⋯  x_m^{(n)} ]
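
A sketch of constructing such a design matrix with NumPy (the raw feature values are illustrative):

import numpy as np

X_raw = np.array([[2.0, 5.0],
                  [1.0, 3.0],
                  [4.0, 0.5]])   # n = 3 examples, m = 2 features

# Prepend a column of ones so that w_0 acts as the bias term
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
print(X.shape)                   # (3, 3), i.e. n x (m + 1)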

The Purpose of Linear Regression
Rephrasing The Problem
And our output training data as y:

y = [ y^{(1)}, y^{(2)}, …, y^{(n)} ]^T

And we seek w such that: Xw = y

The Purpose of Linear Regression
Using our knowledge of matrix algebra can’t we just solve this system of equations as follows:
w = X^{−1} y
No! (at least not in general)
In general X is singular and therefore non-invertible
X is only invertible if it is square, n = (m + 1), and it is of full rank, rank(X) = m + 1
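
A quick numerical check of these two conditions (a sketch; np.linalg.matrix_rank estimates the rank via an SVD with a default tolerance):

import numpy as np

def is_invertible(X):
    n, cols = X.shape            # cols = m + 1 when a bias column is included
    return n == cols and np.linalg.matrix_rank(X) == cols

X = np.array([[1.0, 2.0],
              [1.0, 3.0]])
print(is_invertible(X))          # True: square and full rank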

The Purpose of Linear Regression
Recall from our work on linear algebra that:
If rank(X) = rank(X|y) = (m + 1) ≤ n:
The equations are consistent
We have at least the same number of such equations as unknowns
We have a unique solution
If rank(X) = rank(X|y) < m + 1:
The equations are consistent…
…But n < (m + 1) (assuming the n equations are distinct), so our problem is underdetermined with no unique solution
We have too few equations to specify the unknowns uniquely
If rank(X) < rank(X|y):
The equations are inconsistent and no solution exists
We need a further condition to make the problem well-posed…

The Purpose of Linear Regression
Evaluation
In general we cannot find a w such that y^{(i)} = f_w(x^{(i)}) for all i
We need an extra condition to guide us in choosing a particular w given S
Let's try using the empirical squared error evaluation function1:

L(E, S, f_w) = (1/2) Σ_{i=1}^{n} (y^{(i)} − f_w(x^{(i)}))^2 = (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2

1N.B. - here and throughout this lecture, we drop the usual 1/n multiplicative constant for convenience. We can do this because the resulting optimisation problem will be equivalent.

The Purpose of Linear Regression
Ordinary Least Squares
We can use this evaluation function to select the optimal weight vector:

w_OLS = argmin_w (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2

In this case our optimal w is known as the Ordinary Least Squares (OLS) estimator (w_OLS)
And this approach to linear regression is known as OLS regression
Note that other evaluation functions lead to other approaches to linear regression

Motivation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

Motivation
Motivation
Why should we pick the least squares estimator?
We can motivate this choice in a number of different ways:

Motivation
Motivation 1: Additive Noise Model with i.i.d. Noise
Assume (y^{(i)}, x^{(i)}) are related by an additive noise model:

y^{(i)} = w · x^{(i)} + ε^{(i)}

Where ε^{(i)} are the outcomes of an i.i.d. random variable ε
If we make these assumptions then the Gauss-Markov Theorem tells us that the OLS estimator, w_OLS, is the Best Linear Unbiased Estimator (BLUE)

Motivation
Motivation 2: Additive Noise Model with i.i.d. Gaussian Noise
Assume further that the noise is normally distributed:

ε ∼ N(0, σ^2)  ⟹  y | x ∼ N(w · x, σ^2)

Given this further assumption, the Maximum Likelihood Estimator, w_MLE = w_OLS:

p_Y({y^{(i)}}_{i=1}^{n} | {x^{(i)}}_{i=1}^{n}; w, σ) = ∏_{i=1}^{n} p_Y(y^{(i)} | x^{(i)}; w, σ) = ∏_{i=1}^{n} (1/√(2πσ^2)) exp(−(y^{(i)} − w · x^{(i)})^2 / (2σ^2))

Motivation
Motivation 2: Additive Noise Model with i.i.d. Gaussian Noise

w_MLE = argmax_w ln[ ∏_{i=1}^{n} (1/√(2πσ^2)) exp(−(y^{(i)} − w · x^{(i)})^2 / (2σ^2)) ]
      = argmax_w Σ_{i=1}^{n} ln[ (1/√(2πσ^2)) exp(−(y^{(i)} − w · x^{(i)})^2 / (2σ^2)) ]
      = argmax_w [ −n ln√(2πσ^2) − Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2 / (2σ^2) ]
      = argmin_w (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2
      = w_OLS
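
To make the Gaussian-noise equivalence concrete, here is a minimal NumPy sketch (the data, σ and candidate weight vectors below are made-up illustrative values): the squared-error objective and the Gaussian negative log-likelihood differ only by a positive scale factor and an additive constant, so the same w minimises both.

import numpy as np

# Illustrative data; X already contains a bias column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
sigma = 0.5

def squared_error(w):
    r = y - X @ w
    return 0.5 * r @ r                   # the OLS objective (1/n dropped)

def gaussian_neg_log_lik(w):
    r = y - X @ w
    n = len(y)
    return n * np.log(np.sqrt(2 * np.pi) * sigma) + (r @ r) / (2 * sigma ** 2)

# Both objectives rank candidate weight vectors identically
for w in (np.array([0.0, 1.0]), np.array([0.1, 0.95])):
    print(squared_error(w), gaussian_neg_log_lik(w))
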
Motivation
Additive Noise Model with i.i.d. Laplacian Noise
Assume instead that the noise is Laplace distributed:

ε ∼ Laplace(μ, b),  where μ ∈ R, b > 0

This has a characteristic pdf, p_ε:

p_ε(ε; μ, b) = (1/(2b)) exp(−|ε − μ| / b)

E_D[ε] = μ,  Var_D[ε] = 2b^2

Motivation
Additive Noise Model with i.i.d. Laplacian Noise
For μ = 0, the likelihood of our data is given by:

p_Y({y^{(i)}}_{i=1}^{n} | {x^{(i)}}_{i=1}^{n}; w, b) = ∏_{i=1}^{n} p_Y(y^{(i)} | x^{(i)}; w, b) = ∏_{i=1}^{n} (1/(2b)) exp(−|y^{(i)} − w · x^{(i)}| / b)

Motivation
Additive Noise Model with i.i.d. Laplacian Noise

w_MLE = argmax_w ln[ ∏_{i=1}^{n} (1/(2b)) exp(−|y^{(i)} − w · x^{(i)}| / b) ]
      = argmax_w Σ_{i=1}^{n} ln[ (1/(2b)) exp(−|y^{(i)} − w · x^{(i)}| / b) ]
      = argmax_w [ −n ln(2b) − (1/b) Σ_{i=1}^{n} |y^{(i)} − w · x^{(i)}| ]
      = argmin_w Σ_{i=1}^{n} |y^{(i)} − w · x^{(i)}|

This is the absolute deviation evaluation function
Its estimator is more robust to outliers than that of the squared error

Motivation
Motivation 3
The squared error function itself has directly attractive properties, it is:
Intuitive
Smooth
Convex
Recall the PAC approach:
Does not attempt to characterise a data generating distribution (in this case the additive noise distribution)
Instead characterises a well-behaved evaluation function, appropriate to the task at hand

Optimisation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

Optimisation
Representation: F = { f_w(x) = w · x | w ∈ R^{m+1} }
Evaluation: (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2
Optimisation: w_OLS = argmin_w (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2
But how do we perform this optimisation?

Optimisation / Analytic Optimisation
Analytic Optimisation
Let us re-write our evaluation function:

L(E, S, f) = (1/2) Σ_{i=1}^{n} (y^{(i)} − w · x^{(i)})^2 = (1/2) ‖y − Xw‖^2

This function is continuous and differentiable in w

Optimisation / Analytic Optimisation
Analytic Optimisation
For functions of this type we can solve by finding the stationary point, i.e. the w for which ∇_w L = 0
Here:

∇_w L = [ ∂L/∂w_0, ∂L/∂w_1, ⋯, ∂L/∂w_m ]^T

Optimisation / Analytic Optimisation
Analytic Optimisation
…And by checking for convexity, i.e. ∇²_w L ≽ 0: this ensures that the stationary point is globally optimal
Here ∇²_w L is the Hessian Matrix, H:

H = [ ∂²L/∂w_0²       ∂²L/(∂w_0∂w_1)   ⋯   ∂²L/(∂w_0∂w_m) ]
    [ ∂²L/(∂w_1∂w_0)  ∂²L/∂w_1²        ⋯   ∂²L/(∂w_1∂w_m) ]
    [       ⋮                ⋮          ⋱          ⋮       ]
    [ ∂²L/(∂w_m∂w_0)  ∂²L/(∂w_m∂w_1)   ⋯   ∂²L/∂w_m²       ]

Optimisation / Analytic Optimisation
Analytic Optimisation
Compute the derivative of L(E, S, f):

∇_w L = (1/2) ∇_w [ (y − Xw)^T (y − Xw) ]
      = (1/2) ∇_w ( y^T y − y^T Xw − w^T X^T y + w^T X^T X w )
      = (1/2) ∇_w ( w^T X^T X w − 2 y^T X w )
      = X^T X w − X^T y

Optimisation / Analytic Optimisation
Analytic Optimisation
And set it equal to zero:

X^T X w_OLS − X^T y = 0
X^T X w_OLS = X^T y

This is a system of linear equations (again)… how should we proceed?

Optimisation / Analytic Optimisation
Analytic Optimisation
First note the following important fact:

rank(X^T X) = rank(X^T X | X^T y)

So the system is consistent and we have one or infinitely many solutions…
Let's consider two cases:

Optimisation / Analytic Optimisation
Case 1: X^T X is Invertible
If (X^T X)^{−1} exists then we may solve the system to obtain the following, unique solution:

w_OLS = (X^T X)^{−1} X^T y

These are the Normal Equations, which give an analytic solution to the OLS approach to linear regression

Optimisation / Analytic Optimisation
Case 2: X^T X is Not Invertible
If (X^T X)^{−1} does not exist then rank(X^T X) < (m + 1)
So there are too few equations to fully specify the unknowns and our problem is underdetermined
And since we know that our system is consistent (and so solutions exist), this must mean that there is no unique solution; instead there are an infinite number
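
A minimal NumPy sketch of solving the normal equations (the data values are illustrative; np.linalg.solve assumes X^T X is invertible, i.e. Case 1):

import numpy as np

def ols_normal_equations(X, y):
    # Solve X^T X w = X^T y directly, rather than forming (X^T X)^{-1} explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
print(ols_normal_equations(X, y))

# In Case 2 (X^T X singular) np.linalg.solve raises LinAlgError; one of the
# infinitely many solutions can instead be obtained with
# np.linalg.lstsq(X, y, rcond=None), which returns the minimum-norm solution.
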
Optimisation / Analytic Optimisation
Case 2: Overfitting
When does this occur?
For example, consider fitting quadratic functions to just 2 data points:

[ y^{(1)} ]   [ 1  x^{(1)}  (x^{(1)})^2 ]   [ w_0 ]
[ y^{(2)} ] = [ 1  x^{(2)}  (x^{(2)})^2 ] · [ w_1 ]
                                            [ w_2 ]

[Figure slides omitted: Case 2 - Overfitting]

Optimisation / Analytic Optimisation
Case 2: Overfitting
There are infinite solutions…
…We have encountered a case of overfitting
We will return to consider this problem and its remedy in later lectures…

Optimisation / Analytic Optimisation
Recall that in our optimisation we needed to check for convexity; let's do that now and check the Hessian:

∂²L/(∂w_i ∂w_j) = (X^T X)_{ij}
H = ∇²_w L = X^T X

Optimisation / Analytic Optimisation
Now, for any a ∈ R^{m+1}:

a^T X^T X a = ‖Xa‖^2 ≥ 0

Therefore ∇²_w L ≽ 0
Therefore L is convex [Theorem A.1]
Therefore our solution is globally optimal [Theorem A.2]
…But not necessarily unique
Let's consider two cases again:

Optimisation / Analytic Optimisation
Case 1: X^T X is Invertible
Recall from our work on linear algebra that:

∃(X^T X)^{−1} ⟺ X^T X ≻ 0

But if X^T X ≻ 0 then our objective is strictly convex
And if our objective is strictly convex then our solution must be unique [Theorem A.3]

Optimisation / Analytic Optimisation
Case 2: X^T X is Not Invertible
Recall from our work on linear algebra that:

∃(X^T X)^{−1} ⟺ X^T X ≻ 0

So in this case X^T X ⊁ 0 (it is positive semi-definite but not positive definite)
Now, recall our work on the optimisation of quadratic functions in the Calculus lecture. In particular, if X^T X ⊁ 0 then:
Our objective is not strictly convex (although it is bounded below since X^T y ∈ range(X^T X))
So we will have an infinite number of solutions
We will return to these considerations later in the module, but for now we note the symmetry between invertibility and strict convexity

Optimisation / Analytic Optimisation
Normal Equations: Drawbacks
The normal equations are elegant, but for large m they have some drawbacks:
Time Constraints: Matrix inversion takes O(m^3) operations - for large X^T X this is expensive
Space Constraints: For very high m, X^T X will not fit into memory
Is there an alternative?

Optimisation / Numerical Optimisation
Numerical Optimisation
There are several different approaches to numerical optimisation. We will investigate one of the most common: Gradient Descent, a first order, iterative, optimisation algorithm

Optimisation / Numerical Optimisation
Gradient Descent
Takes steps proportional to the negative of the gradient at each point…
…Because the gradient of a function points in the direction of steepest ascent at that point, so the negative gradient points in the direction of steepest descent
The rate of change along a general direction, û, is:

∇_w L · û = ‖∇_w L‖_2 ‖û‖_2 cos θ

This is maximal when θ = 0 and ∇_w L and û lie in the same direction
Makes monotonic progress (subject to step size)
Stops when the gradient is zero

Optimisation / Numerical Optimisation
Batch Gradient Descent
Algorithm 1 Batch Gradient Descent
1: Initialise t ← 0
2: Initialise w^{(t=0)} randomly
3: repeat
4:   w^{(t+1)} ← w^{(t)} − α ∇_w L |_{w=w^{(t)}}
5:   t ← t + 1
6: until convergence
Here α > 0 is the learning rate, which determines the step size
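
A vectorised NumPy sketch of Algorithm 1 applied to the OLS objective (the learning rate, iteration limit, tolerance and data are illustrative choices rather than values from the lecture):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, max_iters=10000, tol=1e-8):
    w = np.zeros(X.shape[1])                 # could equally be initialised randomly
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y)             # gradient of (1/2)||y - Xw||^2
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) < tol:  # simple convergence test
            return w_new
        w = w_new
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
print(batch_gradient_descent(X, y))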

Optimisation / Numerical Optimisation
Batch Gradient Descent in 1 Dimension2
Update is large when the gradient is large
As the optimum is approached steps get smaller
2Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]

Optimisation / Numerical Optimisation
Batch Gradient Descent: Learning Rate Size3
For α too small, convergence is slow
For α too large, we get divergence
3Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]

Optimisation / Numerical Optimisation
Batch Gradient Descent: Convexity
Gradient Descent is guaranteed to converge to a local optimum
should one exist (for suitable α)… …But not necessarily a global one
If the objective is non-convex then GD may not find the global optimum
But for OLS recall that ∇²_w L ≽ 0, which implies convexity [Theorem A.1], (and recall that X^T y ∈ range(X^T X), which implies that the objective is bounded below)…
…So GD will find a global optimum… [Theorem A.2]
…Although, for OLS, recall that unless ∇²_w L ≻ 0 then such an optimum will not be unique

Optimisation / Numerical Optimisation
Batch Gradient Descent: Scaling4
For unscaled attributes the gradient of the largest parameter will dominate the update leading to slow convergence
For scaled attributes all parameters are updated in similar proportions
4Geron, ‘Hands-on Machine Learning With Scikit-Learn & Tensorflow’ [2017]
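
A minimal standardisation sketch (assumes every feature has non-zero variance; the bias column is appended after scaling so it is left untouched):

import numpy as np

def standardise(X_raw):
    mu = X_raw.mean(axis=0)
    sigma = X_raw.std(axis=0)                # assumed non-zero for every feature
    return (X_raw - mu) / sigma, mu, sigma   # keep mu, sigma to scale new inputs

X_raw = np.array([[1.0, 2000.0],
                  [2.0, 3000.0],
                  [3.0, 1000.0]])
X_scaled, mu, sigma = standardise(X_raw)
X = np.column_stack([np.ones(len(X_scaled)), X_scaled])   # add the bias column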

Optimisation / Numerical Optimisation
Batch Gradient Descent: Iteration Speed
Algorithm 2 Batch Gradient Descent for OLS
1: Initialise t ← 0
2: Initialise w^{(t=0)} randomly
3: repeat
4:   for j = 0 to m do
5:     [w^{(t+1)}]_j ← [w^{(t)}]_j − α Σ_{i=1}^{n} (w^{(t)} · x^{(i)} − y^{(i)}) [x^{(i)}]_j
6:   end for
7:   t ← t + 1
8: until convergence
For each update we need to scan through the entire training set
For large n this can be costly

Optimisation / Numerical Optimisation
Stochastic Gradient Descent
Algorithm 3 Stochastic Gradient Descent for OLS
1: Initialise t ← 0
2: Initialise w^{(t=0)} randomly
3: repeat
4:   Randomly shuffle the dataset
5:   Initialise i ← 1
6:   repeat
7:     for j = 0 to m do
8:       [w^{(t+1)}]_j ← [w^{(t)}]_j − α (w^{(t)} · x^{(i)} − y^{(i)}) [x^{(i)}]_j
9:     end for
10:    i ← i + 1
11:    t ← t + 1
12:  until i = n + 1
13: until convergence
Here an update is made immediately as each example is examined
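
A NumPy sketch of Algorithm 3 (the epoch count, learning rate and data are illustrative; a practical implementation would add a convergence test and possibly a decaying learning rate):

import numpy as np

def sgd_ols(X, y, alpha=0.05, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):    # randomly shuffle the dataset
            # update from a single example i, cf. the batch update over all n
            w = w - alpha * (X[i] @ w - y[i]) * X[i]
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.1, 1.2, 1.9, 3.1])
print(sgd_ols(X, y))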

Optimisation / Numerical Optimisation
Stochastic Gradient Descent
Each individual SGD update is less optimal than a BGD update… …And the algorithm will not converge monotonically…
…But overall convergence (for convex objectives) is generally faster, and scales better for large n

Optimisation / Numerical Optimisation
Analytic & Numerical Solutions: Symmetry
1 In general, for OLS:
rank(X^T X) = rank(X^T X | X^T y), so the system is consistent and we have a solution of some sort
X^T X ≽ 0, so the objective is convex, and X^T y ∈ range(X^T X), so the objective is bounded below; thus gradient descent will converge to a global solution of some sort
2 If (X^T X)^{−1} exists:
The normal equations will work, rank(X^T X) = (m + 1), and we have a unique solution
X^T X ≻ 0, so our objective is strictly convex and gradient descent will converge to the unique solution
3 If (X^T X)^{−1} does not exist:
The normal equations will not work, rank(X^T X) < (m + 1), and we have an infinite number of solutions
X^T X ⊁ 0, so our objective is not strictly convex and gradient descent will converge to one of the infinite number of solutions

Summary
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

Summary
1 Linear Regression can be used to estimate a linear relationship that might exist between our input and our output data
2 Ordinary Least Squares is a well-motivated approach to linear regression which leads to a smooth, convex optimisation problem
3 We can use the normal equations or various gradient descent procedures to solve the OLS optimisation problem and to learn the parameters of the linear hypothesis
In the next lecture we will consider extending our representation to capture richer relationships via polynomial regression.

Appendix: Convex Optimisation
Lecture Overview
1 Lecture Overview
2 The Purpose of Linear Regression
3 Motivation
4 Optimisation
6 Appendix: Convex Optimisation

Appendix: Convex Optimisation
Convex Sets5
Definition: A set Ω is convex if, for any x, y ∈ Ω and θ ∈ [0, 1], then θx + (1 − θ)y ∈ Ω.
In other words, if we take any two elements in Ω, and draw a line segment between these two elements, then every point on that line segment also belongs to Ω.
5Boyd & Vandenberghe, ‘Convex Optimisation’ [2004]

Appendix: Convex Optimisation
Convex Functions
Definition: A function f : R^n → R is convex if its domain is a convex set and if, for all x, y in its domain, and all λ ∈ [0, 1], we have:

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

[Figure: a convex function and a non-convex function, with the chord value λf(x) + (1 − λ)f(y) and the function value f(λx + (1 − λ)y) marked at the point λx + (1 − λ)y]

Appendix: Convex Optimisation
1st & 2nd Order Characterisations of Convex, Differentiable Functions
Theorem A.1: Suppose f is twice differentiable over an open domain. Then, the following are equivalent:
f is convex
f(y) ≥ f(x) + ∇_x f(x) · (y − x)  ∀ x, y ∈ dom(f)
∇²_x f(x) ≽ 0  ∀ x ∈ dom(f)

Appendix: Convex Optimisation
Global Optimality
Theorem A.2: Consider an unconstrained optimisation problem:

min f(x)  subject to: x ∈ R^n

Where f : R^n → R is convex and differentiable. Then any point x that satisfies ∇_x f(x) = 0 is a globally optimal solution

Appendix: Convex Optimisation
Strict Convexity
Definition: A function f : R^n → R is strictly convex if its domain is a convex set and if, for all x, y, x ≠ y in its domain, and all λ ∈ (0, 1), we have:

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)

First Order Characterisation: A function f is strictly convex on Ω ⊆ R^n, if and only if:

f(y) > f(x) + ∇_x f(x) · (y − x)  ∀ x, y ∈ Ω, x ≠ y