Introduction to Regression
Srinandan (“Sri”) Dasmahapatra
COMP3223
Supervised Learning: labelled data
Compare labels with predictions
• d(ŷ_n, y_n): how far is prediction ŷ_n from the actual data y_n?
• 𝒟 := {(x_n, y_n)}, n = 1, …, N
• Given data 𝒟, construct a model f( ⋅ ; w) such that the "distance" between model output and real "output" is small
• Learning = model construction by minimising the loss L(w) = ∑_{n=1}^N d(ŷ_n, y_n)
• If y = f( ⋅ ; w) is continuous (e.g., y = 23.4), f( ⋅ ; w) is a regression model
[Diagram: input x_n → model f( ⋅ ; w) → prediction ŷ_n = f(x_n; w)]
Core idea in ML: reduce mismatch between model prediction and data
• Prediction: ŷ_n = f(x_n); residuals: r_n := (y_n − ŷ_n)
• Regression models minimise residuals — deviations of model predictions from outputs in the training data
[Figure: data points (x_n, y_n), the fitted curve, and the residuals r_n shown as vertical deviations]

Squared residual loss
• Decompose each output: y_n = ŷ_n + (y_n − ŷ_n) = f(x_n; w) + r_n(w)
• Contribution to loss function: l_n(w) = (y_n − f(x_n; w))² = r_n²(w)
• Average loss: L(w) = (1/N) ∑_{n=1}^N l_n(w)
• Minimise this by tuning w
Choose weight for minimum loss
Example: find slope of straight line
• Fit y = w x to the data; the slope w of the line is the weight/parameter to be learnt
• Loss L = sum of squares of residuals (in red), for 3 possible choices of slope {w_1, w_2, w_3}; the 3 residuals for input x_n are shown
• Choose the slope that gives the smallest value from {L(w_1), L(w_2), L(w_3)}
• l_n(w) = r_n² = (w x_n − y_n)²
• L(w) = (1/N) ∑_{n=1}^N l_n(w)
[Figure: data points and three candidate lines with slopes w_1, w_2, w_3]
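A small NumPy sketch of this procedure; the data points and the three candidate slopes below are made-up values for illustration only:

```python
import numpy as np

# Made-up training data (x_n, y_n); roughly y ≈ 2x plus noise
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

def average_loss(w, x, y):
    """L(w) = (1/N) * sum_n (w*x_n - y_n)^2, the mean squared residual."""
    residuals = w * x - y
    return np.mean(residuals ** 2)

# Three candidate slopes w_1, w_2, w_3
candidates = [1.0, 2.0, 3.0]
losses = [average_loss(w, x, y) for w in candidates]
best = candidates[int(np.argmin(losses))]
print(losses)               # loss for each candidate slope
print("best slope:", best)  # candidate with the smallest loss
```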
Loss function needs a distance: introducing the norm
• Treat all data points as a collective unit: stack the residuals into an N-dimensional vector
  r = (r_1, r_2, …, r_N)ᵀ, with ‖r‖² = ∑_{n=1}^N r_n²
• Other distances (norms) are introduced, such as L_p(r) ≜ (∑_{n=1}^N |r_n|^p)^{1/p}
• Loss = (1/N) × (length)² of the N-dimensional residual vector
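A quick illustration of these norms (the residual values are arbitrary); the squared L2 length of the stacked residual vector is N times the average loss:

```python
import numpy as np

r = np.array([0.5, -1.0, 0.2, 0.8])   # arbitrary residual vector (r_1, ..., r_N)
N = r.size

l2_sq = np.sum(r ** 2)                 # ||r||^2 = sum_n r_n^2
print(l2_sq / N)                       # loss = (1/N) * (length)^2

# General p-norm L_p(r) = (sum_n |r_n|^p)^(1/p)
for p in (1, 2, 4):
    print(p, np.sum(np.abs(r) ** p) ** (1.0 / p))
```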
Update weights to reduce loss: gradient descent
• Differentiable loss function: l_n(w) = r_n² = (w x_n − y_n)²
• Slope can take any real value w ∈ ℝ
• Iterate w(t) ↦ w(t+1), t = 0, 1, …
• Update weights w(t) to w(t+1) so that L(w(t+1)) < L(w(t))
• Change weights in the direction opposite to the slope of the loss function (we want to reduce the loss, hence descent)
[Figure: quadratic loss L(w) with its linear approximation (tangent) at w(t); the tangent makes angle θ = arctan(dL/dw) with the w-axis]
• Update rule: w(t+1) = w(t) + η (dL/dw)(w(t)), with η < 0
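A minimal gradient-descent sketch for the one-weight model. The data, starting point and step size are illustrative assumptions; it uses the common convention w ← w − η dL/dw with η > 0, which is the same update as above with a negative η:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])   # assumed toy inputs
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # assumed toy targets

def grad_L(w, x, y):
    """dL/dw for L(w) = (1/N) sum_n (w*x_n - y_n)^2."""
    return np.mean(2 * x * (w * x - y))

w = 0.0        # initial weight w(0)
eta = 0.05     # step size
for t in range(100):
    w = w - eta * grad_L(w, x, y)   # w(t+1) = w(t) - eta * dL/dw
print(w)       # approaches the loss-minimising slope
```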
Linear Regression: solving for zero gradient of loss
• Loss function: L(w) = (1/N)[(w x_1 − y_1)² + (w x_2 − y_2)² + ⋯ + (w x_N − y_N)²]
  = (1/N)[(w² x_1² − 2w x_1 y_1 + y_1²) + ⋯ + (w² x_N² − 2w x_N y_N + y_N²)]
• Loss function is quadratic in w: L(w) = a w² + b w + c
• Closed form solution exists: linear algebra
• Follow gradients until minimum reached; solution for weights: set gradient = 0
  0 = ∂L(w)/∂w |_{w=w*} ⟹ w* = −b/(2a)
[Figure: quadratic loss function L(w), with gradients at w(1), w(2), w(3) and the minimum at w*]
Exercise: differential calculus
Closed form solution to linear regression weights in terms of vector products
• L(w) = (1/N)[(w x_1 − y_1)² + (w x_2 − y_2)² + ⋯ + (w x_N − y_N)²]
• Exercise: In L(w) = a w² + b w + c, show
  a = (1/N)[x_1² + x_2² + ⋯ + x_N²]
  b = (−2/N)[x_1 y_1 + x_2 y_2 + ⋯ + x_N y_N]
• 0 = ∂L(w)/∂w |_{w=w*} ⟹ w* = −b/(2a) = x⊤y / x⊤x
[Figure: vectors x and y; at the optimum the residual y − w* x is at 90° to x]
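A sketch of the closed-form solution on the same toy data (assumed values); x⊤y and x⊤x are just dot products:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])   # assumed inputs
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # assumed targets

# w* = (x^T y) / (x^T x), obtained by setting dL/dw = 0
w_star = np.dot(x, y) / np.dot(x, x)
print(w_star)
```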
Reduce loss by gradient descent: optimisation
• Same idea in higher dimensions (more adjustable weights)
• loss_n = (y_n − (w_1 x_{n,1} + w_2 x_{n,2}))², with x = (x_1, x_2) ∈ ℝ²
• L(w) = 𝔼_{(x_n, y_n)∼𝒟} loss_n(w), the mean loss
• Evaluate partial derivatives (gradient) to choose the direction of weight updates: (∇_w L)_i = ∂L/∂w_i
• Update: w(t+1) = w(t) − η ∇_w L
[Figure: loss surface over the (w_1, w_2) plane]
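A sketch with two weights, written in matrix form; the design matrix X and targets y are assumed toy values. The gradient vector collects the partial derivatives ∂L/∂w_i:

```python
import numpy as np

# Assumed toy data: each row of X is (x_{n,1}, x_{n,2})
X = np.array([[ 1.0, -1.0],
              [ 2.0,  0.5],
              [ 3.0,  2.0],
              [-1.0,  1.5]])
y = np.array([1.1, 3.2, 5.6, -0.8])

def grad_L(w, X, y):
    """Gradient of L(w) = mean_n (y_n - X_n . w)^2, i.e. (2/N) X^T (X w - y)."""
    N = len(y)
    return (2.0 / N) * X.T @ (X @ w - y)

w = np.zeros(2)
eta = 0.1
for t in range(500):
    w = w - eta * grad_L(w, X, y)   # w(t+1) = w(t) - eta * grad L
print(w)                            # approaches the least-squares solution
```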
Automatic differentiation
• No need to differentiate by hand, except to understand (lab exercises)
• Autograd, JAX — AD libraries in Python
• Derivative: f : x ↦ f(x); dy/dx = lim_{δx→0} [f(x+δx) − f(x)]/δx = lim_{δx→0} [f(x+δx) − f(x−δx)]/(2·δx)
• Product: y(x) = f(x) ⋅ g(x); y′(x) = f′(x) ⋅ g(x) + f(x) ⋅ g′(x)
• Quotient: y(x) = f(x)/g(x); y′(x) = [ f′(x) ⋅ g(x) − f(x) ⋅ g′(x)]/g(x)²
• Taylor series: f(x_0 + δx) = f(x_0) + δx (df/dx)|_{x_0} + (1/2)(δx)² (d²f/dx²)|_{x_0} + ⋯, note (δx)² ≪ δx
• Composition: y(x) = f(g(x)); dy/dx = (df/dg)·(dg/dx)
• Program f : x ↦ f(x); forward-mode AD ( f, Df ) : x ↦ ( f(x), f′(x)), e.g., sq : x ↦ (x², 2x)
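A small check using JAX (named above) to differentiate the squared-residual loss automatically, compared against a central-difference estimate; the toy numbers are assumptions:

```python
import jax
import jax.numpy as jnp

x = jnp.array([0.5, 1.0, 1.5, 2.0, 2.5])   # assumed toy inputs
y = jnp.array([1.1, 1.9, 3.2, 3.8, 5.1])   # assumed toy targets

def L(w):
    """Average squared-residual loss for the one-weight model."""
    return jnp.mean((w * x - y) ** 2)

dL = jax.grad(L)          # automatic differentiation: w -> dL/dw
w0 = 1.0
print(dL(w0))             # AD derivative at w0

# Central difference (L(w0+d) - L(w0-d)) / (2d) for comparison
d = 1e-4
print((L(w0 + d) - L(w0 - d)) / (2 * d))
```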
Later: view large distances as small probabilities
• Different interpretation of the learning task:
  A. If |r_n| is large, p(r_n) is small (low probability)
  B. Reducing |r_n| makes p(r_n) large (high probability)
• Probability of what? f( ⋅ ; w) evaluated on the data 𝒟 := {(x_n, y_n)}_{n=1,…,N}, called the model likelihood
• Maximum likelihood estimation: estimation of weights to achieve the objective of large likelihood
• p(r_n) = (1/√(2πσ²)) e^{−r_n²/(2σ²)}
[Figure: an unlikely model has large residuals r_1, r_2, …, r_N; the probability p(r_n) of a residual decreases with its size]
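A sketch of the Gaussian residual model (σ and the residual values are assumed). Since −log p(r_n) is r_n²/(2σ²) plus a constant, making this likelihood large is the same objective as making the squared residuals small:

```python
import numpy as np

sigma = 1.0                              # assumed noise scale
r = np.array([0.2, -0.5, 1.8, 0.1])      # arbitrary residuals r_n

def p(r, sigma):
    """Gaussian probability density of a residual: small for large |r|."""
    return np.exp(-r ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(p(r, sigma))                       # large residuals -> small probability
log_likelihood = np.sum(np.log(p(r, sigma)))
print(log_likelihood)                    # maximising this = minimising sum of r_n^2 (fixed sigma)
```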
Learning or memorising? Evaluating ML models: Cross-validation
• Evaluate model performance on test data (generalisation)
• Different training sets can lead to different model parameters with different predictions, hence different residuals
• Characterise the model by the distribution of residuals obtained when training on different subsets of the available training data
[Figure: the available data split into 10%/90% blocks; coloured boxes mark the possible training sets, with a different 10% held out each time]
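A minimal 10-fold split sketch for the one-weight model, using the closed-form fit on each 90% training block; the data here are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)     # synthetic data, true slope 2

folds = np.array_split(rng.permutation(50), 10)  # ten ~10% validation blocks
val_losses = []
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(50), val_idx)
    # Fit on the 90% training block (closed form), evaluate on the held-out 10%
    w = np.dot(x[train_idx], y[train_idx]) / np.dot(x[train_idx], x[train_idx])
    val_losses.append(np.mean((w * x[val_idx] - y[val_idx]) ** 2))

print(np.mean(val_losses), np.std(val_losses))   # distribution of held-out losses
```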
Learning or memorising? Bias-variance tradeoff
Criteria for evaluation of ML models
• Different training sets can lead to different models with different model predictions and different residuals
• Bias: deviation of the average of predictions from the truth
• Variance: variability of model predictions on different parts of the test set
• Trade-off: reducing one typically increases the other
Curve-fitting, no interpretation
Polynomial fits with high degrees tend to overfit
Overfitting: good on training set, poor on test set
• Polynomial model of degree M: ŷ(x; w) = w_0 + w_1 x + w_2 x² + ⋯ + w_M x^M
• ŷ = w_0 + w_1 x (M = 1);  ŷ = w_0 + w_1 x + w_2 x² + w_3 x³ (M = 3);  ŷ = w_0 + w_1 x + ⋯ + w_8 x⁸ + w_9 x⁹ (M = 9)
From C Bishop, PRML
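A sketch of the effect, using synthetic sine-curve data in the spirit of the Bishop example (all values assumed): low degrees underfit, while degree 9 matches the 10 training points almost exactly but does worse on held-out test points.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=100)

for M in (1, 3, 9):
    w = np.polyfit(x_train, y_train, deg=M)        # least-squares polynomial fit
    train_err = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(M, train_err, test_err)   # degree 9: tiny training error, larger test error
```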
• Weights learned by minimising loss
• Polynomial models of high degree have large weights
[Table: magnitudes of the learned coefficients for polynomials of increasing degree M]
From C Bishop, PRML
Regularisation: Penalty for model complexity
Complex models are believed to overfit
• Loss_𝒟(w) = 𝔼_{(x_n, y_n)∼𝒟} (y_n − ŷ(x_n; w))²
• w* = arg min_w [ Loss_𝒟(w) + λ‖w‖² ]
• Minimise a combination of two factors:
  1. mismatch of model prediction to labelled data (Loss)
  2. a data-independent term that depends on the model alone (regularisation)
Relative penalties for loss and weight length
• How big should the model penalty λ be? w* = arg min_w [ Loss_𝒟(w) + λ‖w‖² ]
[Figure: effect of varying λ]
Read Section 1.1, p.4 C Bishop, PRML
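A sketch of the regularised fit for the one-weight model, where the penalised loss (1/N) ∑_n (w x_n − y_n)² + λw² still has a closed-form minimiser; the λ values and data are assumed:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])   # assumed inputs
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # assumed targets
N = len(x)

def ridge_w(lam):
    """Minimiser of (1/N) sum_n (w*x_n - y_n)^2 + lam * w^2 (one-weight case)."""
    return np.dot(x, y) / (np.dot(x, x) + N * lam)

for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, ridge_w(lam))   # larger lambda shrinks the learned weight towards 0
```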
Summary
Supervised Learning: reduce loss
• Predict and correct using weight-adjustable functions
• Optimise weights using a loss function — learning as optimisation
• Interpret weight spaces in terms of the mathematical framework of linear algebra
• Interpret distances and loss metrics in terms of probability theory
• Evaluate performance in terms of generalisation (bias/variance)
• Introduce regularisation for generalisation