Machine Learning
Lecture 4: Linear Regression
Prof. Dr. Günnemann
Data Analytics and Machine Learning, Technical University of Munich
Winter term 2020/2021
Notation:
• scalar – lowercase and not bold
• vector – lowercase and bold
• matrix – uppercase and bold
• fw(x) – predicted value for inputs x
• y – vector of targets
• yi – target of the i-th example
• w0 – bias term (not to be confused with bias in general)
• φ – basis function
• E – error function
• D – training data
• X† – Moore-Penrose pseudoinverse of X
There is not a special symbol for vectors or matrices augmented by the bias term, w0. Assume it is always included.
Basic Linear Regression
Example: Housing price prediction
We are given a dataset D = {(xi, yi)}Ni=1 of house areas xi and corresponding prices yi.
How do we estimate the price of a new house with area xnew?
Regression problem
• observations¹: X = {x1, x2, …, xN}, xi ∈ RD
• targets: y = {y1, y2, …, yN}, yi ∈ R
• mapping f(·) from inputs to targets such that yi ≈ f(xi)
1A common way to represent the samples is as a data matrix X ∈ RN×D, where each row represents one sample.
Linear model
Target y is generated by a deterministic function f of x plus noise
yi =f(xi)+εi, εi ∼N(0,β−1) (1)
Let’s choose f(x) to be a linear function
fw(xi) = w0 + w1xi1 + w2xi2 + … + wDxiD (2)
= w0 + wT xi (3)
From now on, we will always assume that the bias term is absorbed into the x vector.
Absorbing the bias term
The linear function is given by
fw(x)=w0 +w1x1 +w2x2 +…+wDxD (4)
= w0 + wT x (5)
Here w0 is called the bias or offset term. For simplicity, we can "absorb" it by prepending a 1 to the feature vector x and correspondingly adding w0 to the weight vector w:

x̃ = (1, x1, …, xD)T,  w̃ = (w0, w1, …, wD)T

The function fw can then be compactly written as fw(x) = w̃T x̃.
To unclutter the notation, we will assume the bias term is always absorbed and write w and x instead of w ̃ and x ̃.
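To make the bookkeeping concrete, here is a minimal NumPy sketch (not from the slides; the toy numbers are made up) of absorbing the bias: prepend a column of ones to the data matrix so that w0 becomes just another weight.

```python
import numpy as np

# Toy data: N = 4 samples with D = 2 features (illustrative values only)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])

# Prepend a column of ones: x_tilde = (1, x1, ..., xD)^T for every sample
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

# A weight vector w_tilde = (w0, w1, ..., wD)^T now carries the bias as its first entry
w_tilde = np.array([0.5, 1.0, -2.0])

# f_w(x) = w_tilde^T x_tilde, evaluated for all samples at once
predictions = X_tilde @ w_tilde
print(predictions)
```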
Now, how do we choose the "best" w that fits our data?
Loss function
A loss function measures the "misfit" or error between our model (parametrized by w) and the observed data D = {(xi, yi)}Ni=1.
Standard choice – least squares (LS):

ELS(w) = (1/2) Σ_{i=1}^N (fw(xi) − yi)²   (6)
       = (1/2) Σ_{i=1}^N (wT xi − yi)²   (7)

The factor 1/2 is for later convenience.
Find the optimal weight vector w⋆ that minimizes the error
w∗ = arg min_w ELS(w)   (8)
   = arg min_w (1/2) Σ_{i=1}^N (xiT w − yi)²   (9)

By stacking the observations xi as rows of the matrix X ∈ RN×D, this becomes

w∗ = arg min_w (1/2) (Xw − y)T (Xw − y)   (10)
Optimal solution
To find the minimum of the loss E(w), compute the gradient ∇w E(w):

∇w E(w) = ∇w (1/2) (Xw − y)T (Xw − y)   (11)
        = ∇w (1/2) (wT XT Xw − 2 wT XT y + yT y)   (12)
        = XT Xw − XT y   (13)

See Equations (69) and (81) from the Matrix Cookbook for details.
Optimal solution
Now set the gradient to zero and solve for w to obtain the minimizer:²

XT Xw − XT y = 0   (14)

This leads to the so-called normal equation of the least squares problem:

w∗ = (XT X)−1 XT y = X† y   (15)

X† = (XT X)−1 XT is called the Moore-Penrose pseudo-inverse of X (because for an invertible square matrix, X† = X−1).

²Because the Hessian ∇w∇w E(w) is positive (semi)definite → see Optimization.
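As a sanity check, the closed-form solution can be reproduced in a few lines of NumPy on synthetic data (a sketch, not part of the original slides); using np.linalg.solve or np.linalg.pinv avoids forming (XT X)−1 explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # bias column absorbed
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)             # y = Xw + noise

# Normal equation: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically more stable): Moore-Penrose pseudo-inverse, w = X† y
w_pinv = np.linalg.pinv(X) @ y

print(np.allclose(w_normal, w_pinv))  # True up to numerical precision
print(w_pinv)                          # close to w_true for small noise
```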
Nonlinear dependency in data
What if the dependency between y and x is not linear?
Data generating process: yi = sin(2πxi) + εi,  εi ∼ N(0, β−1)
For this example assume that the data dimensionality is D = 1
Polynomials
Solution: Polynomials are universal function approximators, so for 1-dimensional x we can define f as

fw(x) = w0 + Σ_{j=1}^M wj x^j   (16)

Or more generally, in terms of basis functions φj,

fw(x) = w0 + Σ_{j=1}^M wj φj(x)   (17)

Defining φ0 = 1, this can be written compactly as

fw(x) = wT φ(x)   (18)

The function f is still linear in w (despite not being linear in x)!
Typical basis functions
Polynomials: φj(x) = x^j

Gaussians: φj(x) = exp(−(x − μj)² / (2s²))

Logistic sigmoid: φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + e^−a)
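A small NumPy sketch of these basis functions (the choice of μj and s is up to the modeler; the values below are only illustrative):

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """Logistic sigmoid basis: phi_j(x) = sigma((x - mu_j) / s)."""
    a = (x - mu_j) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(0, 1, 5)
print(poly_basis(x, 3))
print(gaussian_basis(x, 0.5, 0.1))
print(sigmoid_basis(x, 0.5, 0.1))
```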
Linear basis function model
For D-dimensional data x the basis functions map φj : RD → R. Prediction for one sample:

fw(x) = w0 + Σ_{j=1}^M wj φj(x) = wT φ(x)   (19)

Using the same least squares error function as before,

ELS(w) = (1/2) Σ_{i=1}^N (wT φ(xi) − yi)² = (1/2) (Φw − y)T (Φw − y)   (20)

with

    ⎡ φ0(x1)  φ1(x1)  …  φM(x1) ⎤
Φ = ⎢ φ0(x2)  φ1(x2)  …  φM(x2) ⎥
    ⎢   ⋮        ⋮     ⋱     ⋮   ⎥
    ⎣ φ0(xN)  φ1(xN)  …  φM(xN) ⎦

being the design matrix of φ.
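A sketch of assembling the design matrix Φ for a polynomial basis and reusing the closed-form solution; the sin(2πx) toy data mirrors the running example, but the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3                                   # samples, polynomial degree
x = rng.uniform(0, 1, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix Phi: row i is (phi_0(x_i), ..., phi_M(x_i)) with phi_j(x) = x^j
Phi = np.vander(x, M + 1, increasing=True)     # column 0 is all ones (phi_0 = 1)

# w* = Phi† y
w_star = np.linalg.pinv(Phi) @ y

def predict(x_new, w):
    """Evaluate f_w(x_new) = w^T phi(x_new) for scalar or array input."""
    return np.vander(np.atleast_1d(x_new), len(w), increasing=True) @ w

print(predict(0.25, w_star))                   # roughly sin(2*pi*0.25) = 1 if the fit is reasonable
```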
Optimal solution
Recall the final form of the least squares loss that we arrived at for the original feature matrix X
ELS(w) = (1/2) (Xw − y)T (Xw − y)
and compare it to the expression we found with the design matrix
Φ ∈ RN×(M+1)
ELS(w) = (1/2) (Φw − y)T (Φw − y).   (21)
This means that the optimal weights w∗ can be obtained in the same way w∗ = (ΦTΦ)−1ΦTy = Φ†y (22)
Compare this to Equation 15:
w∗ = (XTX)−1XTy = X†y (23)
Choosing degree of the polynomial

How do we choose the degree of the polynomial M?

[Figure: least squares polynomial fits of degree M = 0, 1, 3 and 9 to the noisy sin(2πx) data]
Choosing degree of the polynomial
[Figure: training and validation error as a function of the polynomial degree M]
One valid solution is to choose M using the standard train-validation split approach.
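A sketch of that train-validation procedure for selecting M (the split sizes and the candidate range of degrees below are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

# Random 2/3 train, 1/3 validation split (a common but arbitrary choice)
perm = rng.permutation(len(x))
train, val = perm[:20], perm[20:]

def fit(x, y, M):
    """Least squares fit of a degree-M polynomial via the pseudo-inverse."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.pinv(Phi) @ y

def rmse(x, y, w):
    """Root mean squared error of the polynomial defined by w."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

# Pick the degree with the lowest validation error
val_errors = {M: rmse(x[val], y[val], fit(x[train], y[train], M)) for M in range(10)}
best_M = min(val_errors, key=val_errors.get)
print(best_M, val_errors[best_M])
```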
Choosing degree of the polynomial
We also make another observation: overfitting occurs when the
coefficients w become large.
What if we penalize large weights?
Controlling overfitting with regularization
Least squares loss with L2 regularization (also called ridge regression)
Eridge(w) = (1/2) Σ_{i=1}^N (wT φ(xi) − yi)² + (λ/2) ∥w∥²   (24)

• ∥w∥² ≡ wT w = w0² + w1² + w2² + ··· + wM² – squared L2 norm of w
• λ – regularization strength
Larger regularization strength λ leads to smaller weights w
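The slides do not spell out the minimizer of Eq. (24), but setting its gradient to zero gives the standard closed form w∗ = (ΦT Φ + λI)−1 ΦT y. A sketch on synthetic data, illustrating that a larger λ shrinks the weights:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize E_ridge(w) = 1/2 ||Phi w - y||^2 + lam/2 ||w||^2 in closed form."""
    M1 = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M1), Phi.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=15)
Phi = np.vander(x, 10, increasing=True)        # degree-9 polynomial, prone to overfitting

for lam in [0.0, 1e-3, 1.0]:
    w = ridge_fit(Phi, y, lam)
    print(lam, np.linalg.norm(w))              # larger lambda -> smaller weight norm
```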
Bias-variance tradeoff
The error of an estimator can be decomposed into two parts:³
• Bias – expected error due to model mismatch
• Variance – variation due to randomness in training data
[Dartboard analogy figure] The center of the target: the true model that predicts the correct values. The different hits (the blue dots): different realizations of the model given different training data.
3See Bishop Section 3.2 for a more rigorous mathematical derivation
Bias-variance tradeoff: high bias
• In case of high bias, the model is too rigid to fit the underlying data distribution.
• This typically happens if the model is misspecified and/or the regularization strength λ is too high.
Bias-variance tradeoff: high variance
• In case of high variance, the model is too flexible, and therefore captures noise in the data.
• This is exactly what we call overfitting.
• This typically happens when the model has high capacity (i.e. it "memorizes" the training data) and/or λ is too low.
Bias-variance tradeoff
• Of course, we want models that have low bias and low variance, but often those are conflicting goals.
• A popular technique is to select a model with large capacity (e.g. high degree polynomial), and keep the variance in check by choosing appropriate regularization strength λ.
Bias-variance tradeoff
• Bias-variance tradeoff in the case of unregularized least squares regression (λ = 0)
The upper-left figure from: https://eissanematollahi.com/wp-content/uploads/2018/09/Machine-Learning-Basics-1.pdf.
Correlation vs. Causation
[Figure: air temperature (°C, y-axis) plotted against outdoor pool visitors (x-axis), with the least squares fit f(x) = 0.018x + 13.43]
• The weights wi can be interpreted as the strength of the (linear) relationship between feature xi and y
• A weight of 0.018 shows a strong correlation (considering the different scales)
• With actual data, you would normalize the data to handle the different scales of X and y and find a weight of about 1
• But correlation does not imply causation! Putting more people in the pool does not increase the air temperature.
Probabilistic Linear Regression
In the following section, we will use probabilistic graphical models. If you do not know them yet, watch our separate Introduction to PGMs video.
Probabilistic formulation of linear regression
Remember from our problem definition at the start of the lecture,
yi = fw(xi) + εi,  where εi is noise

The noise has a zero-mean Gaussian distribution with fixed precision β = 1/σ²:

εi ∼ N(0, β−1)

This implies that the distribution of the targets is

yi ∼ N(fw(xi), β−1)

Remember: our function fw can be represented as fw(xi) = wT φ(xi) using basis functions.
Maximum likelihood
Likelihood of a single sample
p(yi | fw(xi),β) = N(yi | fw(xi),β−1) (25)
Assume that the samples are drawn independently =⇒ the likelihood of the entire dataset D = {X, y} is

p(y | X, w, β) = ∏_{i=1}^N p(yi | fw(xi), β)   (26)

We can now use the same approach as in the previous lecture – maximize the likelihood w.r.t. w and β:

wML, βML = arg max_{w,β} p(y | X, w, β)   (27)
Maximum likelihood
Like in the coin flip example, we can make a few simplifications:

wML, βML = arg max_{w,β} p(y | X, w, β)   (28)
         = arg max_{w,β} ln p(y | X, w, β)   (29)
         = arg min_{w,β} −ln p(y | X, w, β)   (30)
Let's denote this quantity as the maximum likelihood error function that we need to minimize:

EML(w, β) = −ln p(y | X, w, β)   (31)
Maximum likelihood
Simplify the error function
EML(w, β) = −ln ∏_{i=1}^N N(yi | fw(xi), β−1)   (32)
          = −ln ∏_{i=1}^N √(β/2π) exp(−(β/2)(wT φ(xi) − yi)²)   (33)
          = −Σ_{i=1}^N ln[√(β/2π) exp(−(β/2)(wT φ(xi) − yi)²)]   (34)
          = (β/2) Σ_{i=1}^N (wT φ(xi) − yi)² − (N/2) ln β + (N/2) ln 2π   (35)
Optimizing log-likelihood w.r.t. w
wML = arg min_w EML(w, β)   (36)
    = arg min_w (β/2) Σ_{i=1}^N (wT φ(xi) − yi)² − (N/2) ln β + (N/2) ln 2π   (37)
    = arg min_w Σ_{i=1}^N (wT φ(xi) − yi)²   (38)   ← least squares error fn!
    = arg min_w ELS(w)   (39)
Maximizing the likelihood is equivalent to minimizing the least squares error function!
wML = (ΦTΦ)−1ΦTy = Φ†y (40)
Optimizing log-likelihood w.r.t. β
Plug in the estimate for w and minimize w.r.t. β
βML = arg min_β EML(wML, β)   (41)
    = arg min_β (β/2) Σ_{i=1}^N (wMLT φ(xi) − yi)² − (N/2) ln β + (N/2) ln 2π   (42)

Take the derivative w.r.t. β and set it to zero:

∂/∂β EML(wML, β) = (1/2) Σ_{i=1}^N (wMLT φ(xi) − yi)² − N/(2β) = 0   (43)

Solving for β:

1/βML = (1/N) Σ_{i=1}^N (wMLT φ(xi) − yi)²   (44)
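A sketch combining Eq. (40) and Eq. (44) on synthetic data with a known noise level (σ = 0.2, i.e. true β = 25), to check that the recovered precision is close; the basis and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
x = rng.uniform(0, 1, size=N)
beta_true = 1.0 / 0.2 ** 2                     # noise precision for sigma = 0.2
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

Phi = np.vander(x, 6, increasing=True)         # degree-5 polynomial basis

# w_ML = Phi† y
w_ml = np.linalg.pinv(Phi) @ y

# 1/beta_ML = (1/N) sum_i (w_ML^T phi(x_i) - y_i)^2
residuals = Phi @ w_ml - y
beta_ml = 1.0 / np.mean(residuals ** 2)

print(beta_ml, beta_true)                      # beta_ML should be close to 25, up to sampling noise
```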
Posterior distribution
Recall from Lecture 3 that the MLE leads to overfitting (especially when little training data is available).
Solution – consider the posterior distribution instead
p(w | X, y, β, ·) = p(y | X, w, β) · p(w | ·) / p(y | X, β, ·)   (45)
                    (likelihood × prior, divided by the normalizing constant)
                  ∝ p(y | X, w, β) · p(w | ·)   (46)

Connection to the coin flip example:

              coin flip       regression
train data    D = X           D = {X, y}
likelihood    p(D | θ)        p(y | X, w, β)
prior         p(θ | a, b)     p(w | ·)
posterior     p(θ | D)        p(w | X, y, β, ·)
How do we choose the prior p(w | ·)?
Precision β = 1/σ2 is treated as a known parameter to simplify the calculations.
Prior for w
We set the prior over w to an isotropic multivariate normal distribution
with zero mean
p(w | α) = N(w | 0, α−1 I) = (α/2π)^(M/2) exp(−(α/2) wT w)   (47)

• α – precision of the distribution
• M – number of elements in the vector w

Motivation:
• Higher probability is assigned to small values of w =⇒ prevents overfitting (recall slide 20)
• The likelihood is also Gaussian – this simplifies the calculations
Maximum a posteriori (MAP)
We are looking for the w that corresponds to the mode of the posterior:

wMAP = arg max_w p(w | X, y, α, β)   (48)
     = arg max_w [ln p(y | X, w, β) + ln p(w | α) − ln p(y | X, β, α)]   (49)
     = arg min_w [−ln p(y | X, w, β) − ln p(w | α)]   (50)

Similar to ML, define the MAP error function as the negative log-posterior:

EMAP(w) = −ln p(w | X, y, α, β)   (51)
        = −ln p(y | X, w, β) − ln p(w | α) + const   (52)
We ignore the constant terms in the error function, as they are independent of w
MAP error function
Simplify the error function:

EMAP(w) = −ln p(y | X, w, β) − ln p(w | α)
        = (β/2) Σ_{i=1}^N (wT φ(xi) − yi)² − (N/2) ln β + (N/2) ln 2π − (M/2) ln(α/2π) + (α/2) wT w
        = (β/2) Σ_{i=1}^N (wT φ(xi) − yi)² + (α/2) ∥w∥² + const
        ∝ (1/2) Σ_{i=1}^N (wT φ(xi) − yi)² + (λ/2) ∥w∥² + const,  where λ = α/β   ← ridge regression error fn!
        = Eridge(w) + const
MAP estimation with Gaussian prior is equivalent to ridge regression!
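A quick numeric check of this equivalence (a sketch, not from the slides): the MAP solution with a Gaussian prior of precision α matches the ridge solution with λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=25)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=25)
Phi = np.vander(x, 8, increasing=True)

alpha, beta = 0.5, 25.0                        # prior precision and noise precision (illustrative)

# MAP: minimize beta/2 ||Phi w - y||^2 + alpha/2 ||w||^2
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(Phi.shape[1]),
                        beta * Phi.T @ y)

# Ridge with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

print(np.allclose(w_map, w_ridge))             # True
```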
Full Bayesian approach
Instead of representing p(w | D) with the point estimate wMAP, we can compute the full posterior distribution
p(w | D) ∝ p(y | X, w, β) · p(w | α)   (54)

Since both the likelihood and the prior are Gaussian, the posterior is Gaussian as well:⁴

p(w | D) = N(w | μ, Σ),  where μ = β Σ ΦT y and Σ−1 = αI + β ΦT Φ.
Observations
• The posterior is Gaussian, so its mode is the mean and wMAP = μ
• In the limit of an infinitely broad prior α → 0, wMAP → wML
• For N = 0, i.e. no data points, the posterior equals the prior
• Even though we assume an isotropic prior p(w), the posterior covariance is in general not diagonal
4The Gaussian distribution is a conjugate prior of itself
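A sketch of computing the posterior parameters μ = βΣΦT y and Σ−1 = αI + βΦT Φ stated above (the α and β values are arbitrary here):

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Posterior p(w | D) = N(w | mu, Sigma) with Sigma^-1 = alpha*I + beta*Phi^T Phi and mu = beta*Sigma*Phi^T y."""
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    Sigma = np.linalg.inv(S_inv)
    mu = beta * Sigma @ Phi.T @ y
    return mu, Sigma

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
Phi = np.vander(x, 6, increasing=True)

mu, Sigma = posterior(Phi, y, alpha=0.5, beta=25.0)
print(mu)              # posterior mean; coincides with w_MAP
print(np.diag(Sigma))  # posterior marginal variances of the weights
```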
Predicting for new data: MLE and MAP
After observing data D = {(xi, yi)}Ni=1, we can compute the MLE/MAP.
Usually, what we are actually interested in is the prediction yˆnew for a new data point xnew – the model parameters w are just a means to achieve this.
Recall that y ∼ N(fw(x), β−1).

Plugging in the estimated parameters, we get a predictive distribution that lets us make a prediction ŷnew for new data xnew.

• Maximum likelihood: wML and βML

  p(ŷnew | xnew, wML, βML) = N(ŷnew | wMLT φ(xnew), βML−1)   (55)

• Maximum a posteriori: wMAP

  p(ŷnew | xnew, wMAP, β) = N(ŷnew | wMAPT φ(xnew), β−1)   (56)
Recall, that we assume β to be known a priori (for simplified calculations).
Posterior predictive distribution
Alternatively, we can use the full posterior distribution p(w | D). This allows us to compute the posterior predictive distribution
p(ŷnew | xnew, D) = ∫ p(ŷnew, w | xnew, D) dw
                  = ∫ p(ŷnew | xnew, w) p(w | D) dw
                  = N(ŷnew | μT φ(xnew), β−1 + φ(xnew)T Σ φ(xnew))

Advantage: We get a more accurate estimate of the uncertainty in the prediction (i.e. the variance of the Gaussian, which now also depends on the input xnew).
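A self-contained sketch of evaluating this posterior predictive distribution at a new input (the basis, data, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
alpha, beta = 0.5, 25.0

Phi = np.vander(x, 6, increasing=True)                       # degree-5 polynomial features
Sigma = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
mu = beta * Sigma @ Phi.T @ y                                # posterior p(w | D) = N(mu, Sigma)

def predictive(x_new):
    """Posterior predictive N(mu^T phi(x_new), 1/beta + phi(x_new)^T Sigma phi(x_new))."""
    phi = np.vander(np.atleast_1d(x_new), len(mu), increasing=True)[0]
    mean = mu @ phi
    var = 1.0 / beta + phi @ Sigma @ phi
    return mean, var

print(predictive(0.25))   # mean near sin(pi/2) = 1; variance is larger away from the training inputs
```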
Example of posterior predictive distribution
[Figure] Green: underlying function, blue: observations, dark red: mode, light red: variance
Summary

• Optimization-based approaches to regression have probabilistic interpretations:
  • Least squares regression ⇐⇒ Maximum likelihood (Slide 32)
  • Ridge regression ⇐⇒ Maximum a posteriori (Slide 38)
• Even nonlinear dependencies in the data can be captured by a model linear w.r.t. weights w (Slide 13)
• Penalizing large weights helps to reduce overfitting (Slide 20)
• Full Bayesian gives us data-dependent uncertainty estimates (Slide 41)
Reading material
Main reading
• “Pattern Recognition and Machine Learning” by Bishop [ch. 1.1, 3.1, 3.2, 3.3.1, 3.3.2, 3.6]
Extra reading
• “Machine Learning: A Probabilistic Perspective” by Murphy [ch. 7.2–7.3, 7.5.1, 7.6.1, 7.6.2]
Slides are based on an older version by G. Jensen and C. Osendorfer. Some figures are from Bishop’s “Pattern Recognition and Machine Learning”.