Regression – multiple features
Second lecture on regression
Srinandan Dasmahapatra
Linear regression with multiple weights: arbitrary (linear/non-linear) but FIXED functions
• Recap: simple fit of a straight line through points, introduce the intercept
• Flexibility of functions chosen to represent data
• Linear vs non-linear
• Fits with linear combinations of functions of inputs
• Use of a matrix to represent the hypothesised (input-output) relation
• Gradient descent to reduce the loss: average of square(prediction − training output)
• Calculus to compute the gradient vector
• Express in numpy
Fitting a straight line through points
Subtracting the average = data centring
• Subtract from each (xn, yn) the average (⟨x⟩, ⟨y⟩) = (1/N) ∑n (xn, yn):
  x̃n = xn − ⟨x⟩,  ỹn = yn − ⟨y⟩
• The origin is shifted to the location of the mean (⟨x⟩, ⟨y⟩); with centred data the line goes through the new origin (0, 0)
• yn = w0 + w1 xn ⇔ ⟨y⟩ = w0 + w1⟨x⟩
• Subtract the means: yn − ⟨y⟩ = w1(xn − ⟨x⟩) ⇔ ỹn = w1 x̃n
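A minimal numpy sketch of this centring step (the toy data and variable names are illustrative, not from the slides): subtract the means, fit the single slope w1 on the centred data, then recover the intercept w0 from ⟨y⟩ = w0 + w1⟨x⟩.

```python
import numpy as np

# Toy data, assumed for illustration: y ≈ 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2.0 * x + 1.0 + 0.3 * rng.standard_normal(x.shape)

# Centre the data: subtract the averages ⟨x⟩ and ⟨y⟩
x_tilde = x - x.mean()
y_tilde = y - y.mean()

# With centred data the line passes through the origin, so only the slope
# w1 is needed: minimising ∑(ỹn − w1 x̃n)² gives w1 = ∑ x̃n ỹn / ∑ x̃n²
w1 = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)

# Recover the intercept from ⟨y⟩ = w0 + w1 ⟨x⟩
w0 = y.mean() - w1 * x.mean()
print(w0, w1)   # close to 1.0 and 2.0
```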
Flexibility of polynomials
y = w0 + w1 x + w2 x² + … + wM x^M
• Changing each weight wi alters the shape of the function
• Each power is a fixed feature fj(x) := x^j, so
  y = w0 + w1 f1(x) + w2 f2(x) + ⋯ + wM fM(x)
• wi, for i ≥ 1, is called a “feature-touching” weight
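A short numpy sketch (my own illustration, not from the slides) showing that a polynomial fit is just weights times fixed power features:

```python
import numpy as np

def poly_features(x, M):
    """Stack the powers x^0, x^1, ..., x^M as columns: features f_j(x) = x^j."""
    return np.stack([x**j for j in range(M + 1)], axis=1)

x = np.linspace(-1, 1, 5)
w = np.array([1.0, -2.0, 3.0])     # example weights (w0, w1, w2)
F = poly_features(x, M=2)          # shape (5, 3)
y = F @ w                          # y = w0 + w1 x + w2 x² at each x
print(y)
```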
Linear regression with non-linear functions – 1
What is linearity?
• f(x) = wx ⇒
  (1) f(x1 + x2) = w(x1 + x2) = wx1 + wx2 = f(x1) + f(x2)
  (2) f(ax) = a f(x)
  so f is linear in x
• Complex relationships between inputs and outputs are not captured by linear functions
• g(x) = wx² ⇒
  g(x1 + x2) = w(x1² + 2x1x2 + x2²) ≠ wx1² + wx2² = g(x1) + g(x2)
  g(ax) = a²g(x) ≠ a g(x)
  so g is non-linear in x
• But both f and g are linear in w
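A quick numeric check of this distinction (an illustrative sketch, not from the slides): g(x) = wx² fails additivity in x but behaves linearly in the weight w.

```python
def g(w, x):
    return w * x**2

w, x1, x2 = 3.0, 1.5, -0.5

# Non-linear in x: g(x1 + x2) differs from g(x1) + g(x2)
print(g(w, x1 + x2), g(w, x1) + g(w, x2))        # 3.0 vs 7.5

# Linear in w: adding or scaling the weight passes straight through
wa, wb, a = 2.0, 5.0, 4.0
print(g(wa + wb, x1), g(wa, x1) + g(wb, x1))     # equal
print(g(a * wa, x1), a * g(wa, x1))              # equal
```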
Linear regression with non-linear functions – 2
ŷn = w0 + w1 φ1(xn) + w2 φ2(xn) + ⋯ + wp φp(xn),  where xn ∈ R^d, ŷn ∈ R, and φi : R^d → R
• Instead of fj(x) := x^j, choose arbitrary functions φj(x)
• ŷ1 = w0 · 1 + w1 φ1(x1) + w2 φ2(x1) + ⋯ + wp φp(x1),  first data point x1 ∈ R^d
• ŷ2 = w0 · 1 + w1 φ1(x2) + w2 φ2(x2) + ⋯ + wp φp(x2),  second data point x2 ∈ R^d
• ŷN = w0 · 1 + w1 φ1(xN) + w2 φ2(xN) + ⋯ + wp φp(xN),  last (N-th) data point xN ∈ R^d
• Create column vectors ŷ := (ŷ1, ŷ2, ⋯, ŷN)⊤ ∈ R^N and w := (w0, w1, w2, ⋯, wp)⊤ ∈ R^(p+1)
• Write out the (N × (p + 1)) matrix A such that Aw = ŷ
• Matrix A is called a design matrix
w0 + ∑_{j=1}^{p} wj φj(xn) = ŷn,  for the dataset D := {xn, yn}_{n=1,…,N}

Express the collection of proposed functions for each input-output pair in matrix form:

$$
\underbrace{\begin{pmatrix}
1 & \phi_1(x_1) & \cdots & \phi_p(x_1) \\
1 & \phi_1(x_2) & \cdots & \phi_p(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
1 & \phi_1(x_N) & \cdots & \phi_p(x_N)
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{pmatrix}}_{w}
=
\underbrace{\begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_N \end{pmatrix}}_{\hat y}
$$

Matrix A is called a design matrix
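A numpy sketch of building such a design matrix from a list of feature functions and forming ŷ = Aw (the helper name design_matrix and the chosen features are my own, not from the slides):

```python
import numpy as np

def design_matrix(x, features):
    """Column 0 is all ones (multiplying w0); column j is φ_j evaluated on every input."""
    return np.column_stack([np.ones(len(x))] + [phi(x) for phi in features])

# Example feature functions φ1, φ2 (arbitrary choices for illustration)
features = [np.sin, lambda x: x**2]

x = np.linspace(0.0, 1.0, 4)       # N = 4 inputs
w = np.array([0.5, 1.0, -2.0])     # (w0, w1, w2), so p = 2
A = design_matrix(x, features)     # shape (N, p + 1) = (4, 3)
y_hat = A @ w                      # predictions ŷ = A w
print(A.shape, y_hat)
```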
Minimise the mean squared residuals (1/N)‖y − ŷ‖² to find the weights w:

L(w0, w1, …, wp) := (1/N) ∑_{n=1}^{N} rn(w)² = (1/N) ∑_{n=1}^{N} ( w0 + ∑_{j=1}^{p} wj φj(xn) − yn )²

• Exercise: show that the loss function is quadratic in each of the weights w0, w1, …, wp
• Exercise: deduce that the gradient vector ∇w L of partial derivatives ∂L(w0, w1, …, wp)/∂wk is linear in the weights wk, k = 0, 1, …, p
• Exercise: go through the derivation (next slide): [∇w L(w)]k = (2/N) ∑_{n=1}^{N} rn(w) φk(xn)
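A numpy sketch of the loss and the gradient these exercises lead to (my own helper, reusing the design-matrix idea above):

```python
import numpy as np

def loss_and_grad(A, w, y):
    """Mean squared residuals for predictions ŷ = A w, and its gradient."""
    N = len(y)
    r = A @ w - y                      # residuals r_n = ŷ_n − y_n
    L = (r @ r) / N                    # (1/N) ∑ r_n²
    grad = (2.0 / N) * (A.T @ r)       # [∇_w L]_k = (2/N) ∑_n r_n φ_k(x_n)
    return L, grad
```

Component k of grad matches the exercise's formula because column k of A holds φk(xn) for n = 1, …, N.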
Taking partial derivatives: [∇w L(w)]k := ∂L(w)/∂wk = (2/N) ∑_{n=1}^{N} rn(w) φk(xn)
Recall: design matrix A maps weights to predictions
w0 + ∑_{j=1}^{p} wj φj(xn) = ŷn,  for the dataset D := {xn, yn}_{n=1,…,N}

Express the collection of proposed functions for each input-output pair in matrix form. Each column of the design matrix is a feature transform of the inputs (with φ0(xn) := 1 for the intercept); each row is a data point:

$$
\underbrace{\begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_p(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_p(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_p(x_N)
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{pmatrix}}_{w}
=
\underbrace{\begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_N \end{pmatrix}}_{\hat y}
$$
Gradient in terms of the design matrix: [∇w L(w)]k := ∂L(w)/∂wk = (2/N) ∑_{n=1}^{N} rn(w) φk(xn)

Written as a row vector, the gradient is the residual vector times the design matrix:

$$
\underbrace{\begin{pmatrix} \partial L/\partial w_0 & \partial L/\partial w_1 & \cdots & \partial L/\partial w_p \end{pmatrix}}_{(\nabla_w L)^\top}
= \frac{2}{N}\,
\underbrace{\begin{pmatrix} r_1 & r_2 & \cdots & r_N \end{pmatrix}}_{r^\top \;(\text{residuals})}
\underbrace{\begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_p(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_p(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_p(x_N)
\end{pmatrix}}_{A}
$$

Using (a⊤b)⊤ = b⊤a, transposing gives ∇w L(w) = (2/N) A⊤ r(w), with residuals r = ŷ − y = Aw − y.

Compare the single-weight case y = w·x with the multiple-weight case ŷ = A·w.
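Putting this into numpy: a minimal gradient-descent loop using ∇w L = (2/N) A⊤(Aw − y). The step size, iteration count and toy data are illustrative choices, not values from the slides.

```python
import numpy as np

def gradient_descent(A, y, step=0.5, n_iters=1000):
    """Minimise (1/N)‖Aw − y‖² by repeatedly stepping against the gradient."""
    N, P = A.shape
    w = np.zeros(P)                    # start from all-zero weights
    for _ in range(n_iters):
        r = A @ w - y                  # residuals r = ŷ − y
        grad = (2.0 / N) * (A.T @ r)   # ∇_w L = (2/N) Aᵀ r
        w -= step * grad               # move downhill
    return w

# Example: straight-line fit, A has columns [1, x] (toy noise-free data)
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x - 1.0
A = np.column_stack([np.ones_like(x), x])
print(gradient_descent(A, y))          # ≈ [-1.0, 3.0]
```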
Choosing features φj(xn) — a few choices

[Figure: plot of f(x) and the basis function φ(x; x3)]

• Monomials fj(xn) := xn^j (seen before)
• Radial basis functions φ(x; xn) = g((x − xn)/a);
  choose g(x) = exp(−x²), a = 1, (x1, x2, x3, x4) = (−1, 0, 1, 2), and
  f(x) = ∑_{n=1}^{4} wn φ(x, xn), with (w1, w2, w3, w4) = (2, −4, 7, −5)
• Local: the influence of xn is restricted, unlike monomials; acts as a kernel for similarity/“blur”
• Orthogonal polynomials such as Chebyshev, Bessel, etc.
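A numpy sketch of the radial-basis-function example above, using the centres and weights quoted on the slide (the evaluation grid is my own choice):

```python
import numpy as np

def rbf(x, centre, a=1.0):
    """Radial basis function φ(x; x_n) = g((x − x_n)/a) with g(u) = exp(−u²)."""
    return np.exp(-((x - centre) / a) ** 2)

centres = np.array([-1.0, 0.0, 1.0, 2.0])     # (x1, x2, x3, x4) from the slide
weights = np.array([2.0, -4.0, 7.0, -5.0])    # (w1, w2, w3, w4) from the slide

x = np.linspace(-4.0, 5.0, 200)
# f(x) = ∑_n w_n φ(x, x_n): each centre only influences inputs nearby (local features)
f = sum(w * rbf(x, c) for w, c in zip(weights, centres))
print(f.shape, f.min(), f.max())
```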
Readings for regression
• First Course in Machine Learning (FCML) — Rogers and Girolami, Chapter 1
• Bishop, Pattern Recognition and Machine Learning (PRML), pages 299–300
• Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Chapter 4 (with code on GitHub) — about 20 pages
Revisiting gradient descent for linear regression
• When the gradient vanishes
• Later: revisit the problem from the perspective of linear algebra
• But first, a look at classification, with logistic/softmax regression