
Introduction to Regression
Srinandan (“Sri”) Dasmahapatra
COMP3223

Supervised Learning: labelled data
Compare labels with predictions

• d(ŷ_n, y_n): how far is the prediction ŷ_n from the actual data y_n?
• 𝒟 := {(x_n, y_n)}, n = 1, …, N
• Given data 𝒟, construct a model f(⋅; w) such that the “distance” between the model output ŷ_n = f(x_n; w) and the real output y_n is small
• Learning = model construction by minimising the loss L(w) = ∑_{n=1}^N d(ŷ_n, y_n)
• y = f(⋅; w) continuous (e.g., y = 23.4): f(⋅; w) is a regression model

[Figure: input x_n → model f(⋅; w) → prediction ŷ_n = f(x_n; w)]
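The setup above can be made concrete in a few lines of Python (a minimal sketch, not part of the slides: the toy data values, the one-parameter model f(x; w) = w·x, and the choice of the absolute difference as the distance d are illustrative assumptions):

    import numpy as np

    # Toy labelled dataset D = {(x_n, y_n)}, n = 1, ..., N (values made up for illustration)
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 1.9, 4.2, 5.8])

    def f(x, w):
        """One-parameter model f(. ; w): a line through the origin (an illustrative choice)."""
        return w * x

    def d(y_hat, y):
        """One possible 'distance' between prediction and label: the absolute difference."""
        return np.abs(y_hat - y)

    def L(w):
        """Loss L(w) = sum over n of d(y_hat_n, y_n) on the whole dataset."""
        return np.sum(d(f(x, w), y))

    print(L(1.5), L(2.0))   # learning = choosing w so that this number is small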

Core idea in ML: Reduce mismatch between model prediction and data

• ŷ_n = f(x_n)
• Residuals in training data: r_n := (y_n − ŷ_n)
• y_n = ŷ_n + (y_n − ŷ_n) = f(x_n; w) + r_n(w)
• Minimise the residuals by tuning w
• Squared residual loss, the contribution to the loss function: l_n(w) = (y_n − f(x_n; w))² = r_n²(w)
• Average loss: L(w) = (1/N) ∑_{n=1}^N l_n(w)
• Regression models minimise residuals — deviations of model predictions from outputs

[Figure: data y_n, model prediction ŷ, and the residual r_n between them]
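A minimal numpy sketch of the residuals, the decomposition y_n = f(x_n; w) + r_n(w), and the average squared loss (the data values and the candidate weight are made up for illustration):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])   # inputs x_n (made up)
    y = np.array([0.1, 1.9, 4.2, 5.8])   # outputs y_n (made up)

    w = 2.0                               # one candidate weight
    y_hat = w * x                         # predictions f(x_n; w) = w * x_n
    r = y - y_hat                         # residuals r_n = y_n - y_hat_n
    assert np.allclose(y, y_hat + r)      # decomposition y_n = f(x_n; w) + r_n(w)

    loss = np.mean(r ** 2)                # average loss L(w) = (1/N) * sum_n r_n^2
    print(r, loss)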

Choose weight for minimum loss
Example: find slope of straight line
• Fit y = w x to the data; the slope w of the line is the weight/parameter to be learnt
• Loss L = sum of squares of the residuals (in red), for 3 possible choices of slope {w_1, w_2, w_3}; the 3 residuals for input x_n are shown
• Choose the slope that gives the smallest value from {L(w_1), L(w_2), L(w_3)}
• l_n(w) = r_n² = (w x_n − y_n)²
• L(w) = (1/N) ∑_{n=1}^N l_n(w)

[Figure: data points (x_n, y_n), three candidate lines with slopes w_1, w_2, w_3, and the residuals at x_n shown in red]
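A sketch of this selection in numpy (the data and the three candidate slopes are invented for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])         # inputs x_n (made up)
    y = np.array([2.1, 3.9, 6.2])         # outputs y_n (made up)

    def L(w):
        """Average squared-residual loss for the line y = w * x."""
        return np.mean((w * x - y) ** 2)

    candidates = [1.5, 2.0, 2.5]          # three possible slopes {w_1, w_2, w_3}
    losses = [L(w) for w in candidates]
    best = candidates[int(np.argmin(losses))]
    print(losses, "-> choose w =", best)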

Loss function needs a distance: introducing the norm
Treat all data points as a collective unit

• Stack the residuals into an N-dimensional vector: r = (r_1, r_2, …, r_N)⊤, with ‖r‖² = ∑_{n=1}^N r_n²
• Loss = (1/N) × (length)² of the N-dimensional residual vector
• Other distances (norms) are introduced, such as L_p(r) ≜ (∑_{n=1}^N |r_n|^p)^{1/p}
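These quantities are one-liners in numpy (a sketch; the residual values are arbitrary):

    import numpy as np

    r = np.array([0.5, -1.2, 0.3, 2.0])       # residual vector r (arbitrary values)
    N = len(r)

    sq_length = np.sum(r ** 2)                 # ||r||^2 = sum_n r_n^2
    loss = sq_length / N                       # Loss = (1/N) * (length)^2
    l1 = np.linalg.norm(r, ord=1)              # L_1 norm: sum_n |r_n|
    l2 = np.linalg.norm(r, ord=2)              # L_2 norm: sqrt(sum_n r_n^2)
    p = 3
    lp = np.sum(np.abs(r) ** p) ** (1.0 / p)   # general L_p norm
    print(sq_length, loss, l1, l2, lp)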

Update weights to reduce loss: gradient descent
• Differentiable loss function: l_n(w) = r_n² = (w x_n − y_n)²
• The slope can take any real value, w ∈ ℝ
• Iterate w(t) ↦ w(t+1), t = 0, 1, …
• Update the weights w(t) to w(t+1) so that L(w(t+1)) < L(w(t))
• Change the weights in the direction opposite to the slope of the loss function (we want to reduce the loss, hence descent):
  w(t+1) = w(t) + η (dL(w(t))/dw),  with η < 0

[Figure: loss L(w) and its linear approximation (tangent) at w(t); the tangent makes angle θ = arctan(dL/dw) with the w-axis]
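A minimal gradient-descent sketch for this one-parameter problem, using the hand-derived gradient dL/dw = (2/N) ∑_n x_n (w x_n − y_n); the data, the starting weight, and the step size are arbitrary choices, and the update is written with a positive step size and a minus sign, which is equivalent to the η < 0 form above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])    # inputs x_n (made up)
    y = np.array([2.2, 3.8, 6.1, 8.3])    # outputs y_n (made up)

    def L(w):
        """Average squared-residual loss for the line y = w * x."""
        return np.mean((w * x - y) ** 2)

    def dL_dw(w):
        """Hand-derived gradient: dL/dw = (2/N) * sum_n x_n * (w * x_n - y_n)."""
        return 2.0 * np.mean(x * (w * x - y))

    w, step = 0.0, 0.05                    # initial weight w(0) and step size (arbitrary)
    for _ in range(100):
        w = w - step * dL_dw(w)            # move against the slope of the loss
    print(w, L(w))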
Linear Regression: solving for zero gradient of loss
Loss function is quadratic in w

• L(w) = (1/N)[(w x_1 − y_1)² + (w x_2 − y_2)² + ⋯ + (w x_N − y_N)²]
• Each term expands as (w² x_n² − 2 w x_n y_n + y_n²), so the loss function is quadratic in w: L(w) = a w² + b w + c
• A closed-form solution exists: linear algebra
• Follow the gradients until the minimum is reached
• Solution for the weights: set the gradient to zero, 0 = ∂L(w)/∂w |_{w=w*} ⟹ w* = −b/(2a)
• Exercise: differential calculus

[Figure: quadratic loss L(w) with gradients at w(1), w(2), w(3) and the minimum at w*]

Closed form solution to linear regression weights in terms of vector products

• L(w) = (1/N)[(w x_1 − y_1)² + (w x_2 − y_2)² + ⋯ + (w x_N − y_N)²]
• Exercise: in L(w) = a w² + b w + c, show that
  a = (1/N)[x_1² + x_2² + ⋯ + x_N²]
  b = (−2/N)[x_1 y_1 + x_2 y_2 + ⋯ + x_N y_N]
• 0 = ∂L(w)/∂w |_{w=w*} ⟹ w* = −b/(2a) = x⊤y / x⊤x

[Figure: vectors x and y; at the optimum the residual vector is at 90° to x]

Reduce loss by gradient descent: optimisation
Same idea in higher dimensions (more adjustable weights)

• loss_n = (y_n − (w_1 x_{n,1} + w_2 x_{n,2}))², with x = (x_1, x_2) ∈ ℝ²
• Evaluate the partial derivatives (the gradient) to choose the direction of the weight updates: (∇_w L)_i = ∂L/∂w_i
• w(t+1) = w(t) − η ∇_w L
• L(w) = 𝔼_{(x_n, y_n)∼𝒟} loss_n(w), the mean loss

[Figure: contours of L(w) over the (w_1, w_2) weight plane with gradient-descent steps]

Automatic differentiation

• No need to differentiate by hand, except to understand (lab exercises)
• Autograd, JAX — AD libraries in Python
• f : x ↦ f(x);  dy/dx = lim_{δx→0} [f(x + δx) − f(x)]/δx = lim_{δx→0} [f(x + δx) − f(x − δx)]/(2·δx)
• Product: y(x) = f(x)·g(x);  y′(x) = f′(x)·g(x) + f(x)·g′(x)
• Quotient: y(x) = f(x)/g(x);  y′(x) = [f′(x)·g(x) − f(x)·g′(x)]/g(x)²
• Taylor series: f(x_0 + δx) = f(x_0) + δx (df/dx)|_{x_0} + (1/2)(δx)² (d²f/dx²)|_{x_0} + ⋯, note (δx)² ≪ δx
• Composition: y(x) = f(g(x));  dy/dx = (df/dg)(dg/dx)
• Program f : x ↦ f(x);  forward-mode AD: (f, Df) : x ↦ (f(x), f′(x)), e.g., sq : x ↦ (x², 2x)
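As noted above, libraries such as JAX remove the need to differentiate by hand. A minimal sketch (assuming the jax package is installed; the data and settings are made up), which also checks the result against the closed-form w* = x⊤y / x⊤x from the earlier slide:

    import jax
    import jax.numpy as jnp

    x = jnp.array([1.0, 2.0, 3.0, 4.0])    # inputs x_n (made up)
    y = jnp.array([2.2, 3.8, 6.1, 8.3])    # outputs y_n (made up)

    def L(w):
        """Average squared-residual loss for the one-parameter model y_hat = w * x."""
        return jnp.mean((w * x - y) ** 2)

    dL_dw = jax.grad(L)                     # automatic differentiation: w -> dL/dw at w

    w = 0.0
    for _ in range(100):
        w = w - 0.05 * dL_dw(w)             # gradient-descent update
    print(w, L(w))

    print(jnp.dot(x, y) / jnp.dot(x, x))    # closed-form w* = x^T y / x^T x, for comparison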
Later: view large distances as small probabilities
An unlikely model has large residuals

• Different interpretation of the learning task
• A. If |r_n| is large, p(r_n) is small (low probability)
• B. Reducing |r_n| makes p(r_n) large (high probability)
• Probability of what? Of f(⋅; w) evaluated on the data 𝒟 := {(x_n, y_n)}_{n=1,…,N}, called the model likelihood
• Maximum likelihood estimation: estimation of the weights to achieve the objective of a large likelihood
• p(r_n) = (1/√(2πσ²)) exp(−r_n²/(2σ²)), the probability of residual r_n

[Figure: the residuals r_1, r_2, r_3, …, r_n, …, r_N plotted by size, and the probability p(r_n) assigned to each]

Evaluating ML models: Cross-validation
Learning or memorising?

• Evaluate model performance on test data (generalisation)
• Different training sets can lead to different model parameters with different predictions, hence different residuals
• Coloured boxes: possible training sets (each row trains on 90% of the data and holds out a different 10%)
• Characterise the model by the distribution of residuals obtained when it is trained on different subsets of the available training data
• Cross-validation: learning or memorising?

[Figure: cross-validation folds; in each row a different 10% block of the data is held out and the remaining 90% is used for training]

Bias-variance tradeoff
Criteria for evaluation of ML models

• Different training sets can lead to different models, with different model predictions and different residuals
• Bias: deviation of the average of the predictions from the truth
• Variance: variability of the model predictions on different parts of the test set
• Trade-off

Curve-fitting, no interpretation
Polynomial fits with high degrees tend to overfit

• Overfitting: good on the training set, poor on the test set
• ŷ(x; w) = w_0 + w_1 x + w_2 x² + ⋯ + w_M x^M
• ŷ = w_0 + w_1 x
• ŷ = w_0 + w_1 x + w_2 x² + w_3 x³
• ŷ = w_0 + w_1 x + ⋯ + w_8 x⁸ + w_9 x⁹

[Figure: polynomial fits of increasing degree M to the same data, from C Bishop, PRML]

Weights learned by minimising loss

• Polynomial models of high degree have large weights

[Table: learned coefficients for polynomials of increasing degree M, from C Bishop, PRML]

Regularisation: Penalty for model complexity
Complex models are believed to overfit

• Loss_𝒟(w) = 𝔼_{(x_n, y_n)∼𝒟} (y_n − ŷ(x_n; w))²
• w* = arg min_w [ Loss_𝒟(w) + λ‖w‖² ]
• Minimise a combination of two factors:
  1. mismatch of the model prediction to the labelled data (loss)
  2. a data-independent term that depends on the model alone (regularisation)

Relative penalties for loss and weight length

• How big should the model penalty λ be?  w* = arg min_w [ Loss_𝒟(w) + λ‖w‖² ]
• Read Section 1.1, p.4, C Bishop, PRML
• A numpy sketch of this penalised polynomial fit is given after the summary below.

Summary
Supervised Learning: reduce loss

• Predict and correct using weight-adjustable functions
• Optimise weights using a loss function — learning as optimisation
• Interpret weight spaces in terms of the mathematical framework of linear algebra
• Interpret distances and loss metrics in terms of probability theory
• Evaluate performance in terms of generalisation (bias/variance)
• Introduce regularisation for generalisation
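A short numpy sketch of the penalised polynomial fit from the regularisation slides (the synthetic sine data, the degree M = 9, and the values of λ are illustrative choices, loosely following the Bishop PRML example; with λ = 0 the learned weights blow up, and a small λ shrinks them):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: noisy samples of a sine curve (illustrative, loosely after Bishop)
    N = 10
    x = np.linspace(0.0, 1.0, N)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

    M = 9                                        # polynomial degree, high enough to overfit
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix with columns 1, x, ..., x^M

    def ridge_fit(lam):
        """Closed-form minimiser of mean squared error + lam * ||w||^2."""
        A = Phi.T @ Phi / N + lam * np.eye(M + 1)
        b = Phi.T @ y / N
        return np.linalg.solve(A, b)

    for lam in [0.0, 1e-6, 1e-2]:                # lam = 0 is the unregularised fit
        w = ridge_fit(lam)
        mse = np.mean((Phi @ w - y) ** 2)
        print(f"lambda={lam:g}  max|w|={np.max(np.abs(w)):.1f}  train MSE={mse:.4f}")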