Prediction and Regularization
Chris Hansman
Empirical Finance: Methods and Applications
Imperial College Business School
February 1st and 2nd
Overview
1. The prediction problem and an example of overfitting
2. The Bias-Variance Tradeoff
3. LASSO and RIDGE
4. Implementing LASSO and RIDGE via glmnet()
A Basic Prediction Model
Suppose y is given by:
y = f(X) + ε
X is a vector of attributes
ε has mean 0, variance σ², ε ⊥ X
Our goal is to find a model f̂(X) that approximates f(X)
Suppose We Are Given 100 Observations of y
[Scatter plot: Outcome vs. Observation for the 100 observations of y]
How Well Can We Predict Out-of-Sample Outcomes (y^oos)?
[Scatter plot: Outcome and Prediction vs. Observation]
Predicting Out-of-Sample Outcomes (f̂(X^oos))
[Scatter plot: Outcome and Prediction vs. Observation]
A Good Model Has Small Distance (y^oos − f̂(X^oos))²
[Scatter plot: Outcome and Prediction vs. Observation]
A Simple Metric of Fit is the MSE
A standard measure of fit is the mean squared error between y and f̂:
MSE = E[(y − f̂(X))²]
With N observations we can estimate this as:
MSE = (1/N) ∑_{i=1}^{N} (y_i − f̂(X_i))²
Overfitting
Key tradeoff in building a model:
A model that fits the data you have
A model that will perform well on new data
Easy to build a model that fits the data you have well
With enough parameters
An f̂ fit too closely to one dataset may perform poorly out-of-sample
This is called overfitting
An Example of Overfitting
On the Hub you will find a dataset called polynomial.csv
Variables y and x
Split into test and training data
y is generated as a polynomial in x plus random noise:
y_i = ∑_{p=0}^{P} θ_p x_i^p + ε_i
Don't know the order P…
Exercise:
Fit a regression with a 2nd order polynomial in x
y_i = ∑_{p=0}^{2} θ_p x_i^p + ε_i
What is the In-Sample MSE?
What is the Out-of-Sample MSE?
Exercise:
Fit a regression with a 25th order polynomial in x
y_i = ∑_{p=0}^{25} θ_p x_i^p + ε_i
What is the In-Sample MSE?
What is the Out-of-Sample MSE?
Is the in-sample fit better or worse than the quadratic model?
Is the out-of-sample fit better or worse than the quadratic model?
If you finish early:
What order polynomial gives the best out-of-sample fit?
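If you want to check your answers in R, here is a rough sketch. It assumes polynomial.csv has columns named y and x and builds its own 50/50 train/test split; the column names and split rule are assumptions, so adapt them to the actual file.

```r
# Rough sketch: fit polynomials of increasing order and compare
# in-sample vs. out-of-sample MSE (column names and split are assumed).
df <- read.csv("polynomial.csv")

set.seed(1)
train_id <- sample(nrow(df), size = floor(0.5 * nrow(df)))
train <- df[train_id, ]
test  <- df[-train_id, ]

fit_mse <- function(p) {
  fit <- lm(y ~ poly(x, p), data = train)   # orthogonal polynomial basis of order p
  c(in_sample  = mean((train$y - fitted(fit))^2),
    out_sample = mean((test$y - predict(fit, newdata = test))^2))
}

fit_mse(2)    # quadratic
fit_mse(25)   # 25th order: better in-sample MSE, typically worse out-of-sample
sapply(1:25, function(p) fit_mse(p)["out_sample"])  # which order predicts best?
```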
Formalizing Overfitting: Bias-Variance Tradeoff
Consider an algorithm to build model f̂(X) given training data D
Could write f̂_D(X)
Consider the MSE at some particular out-of-sample point X_0:
MSE(X_0) = E[(y_0 − f̂(X_0))²]
Here the expectation is taken over y and all D
We may show that:
MSE(X_0) = (E[f̂(X_0)] − f(X_0))² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ²
              (Bias²)                  (Variance)
Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(y_0 − f̂(X_0))²]
         = E[(f(X_0) − f̂(X_0))²] + E[ε²] + 2·E[f(X_0) − f̂(X_0)]·E[ε]
         = E[(f(X_0) − f̂(X_0))²] + σ_ε²
Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(f(X_0) − f̂(X_0))²] + σ_ε²
         = E[(f(X_0) − E[f̂(X_0)] + E[f̂(X_0)] − f̂(X_0))²] + σ_ε²
         = E[(f(X_0) − E[f̂(X_0)])²] + E[(f̂(X_0) − E[f̂(X_0)])²]
           − 2·E[(f(X_0) − E[f̂(X_0)])(f̂(X_0) − E[f̂(X_0)])] + σ_ε²
         = (f(X_0) − E[f̂(X_0)])² + E[(f̂(X_0) − E[f̂(X_0)])²]
           − 2·(f(X_0) − E[f̂(X_0)])·E[f̂(X_0) − E[f̂(X_0)]] + σ_ε²
           (the last expectation equals 0)
         = (f(X_0) − E[f̂(X_0)])² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ_ε²
The Bias-Variance Tradeoff
MSE(X_0) = (E[f̂(X_0)] − f(X_0))² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ²
              (Bias²)                  (Variance)
This is known as the Bias-Variance Tradeoff
More complex models can pick up subtle elements of true f (X )
Less bias
More complex models vary more across different training datasets
More variance
Introducing a little bias may allow a substantial decrease in variance
And hence reduce MSE (or prediction error); see the simulation sketch below
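To make the tradeoff concrete, here is a small simulation sketch. The data-generating process (f(x) = sin(2x), σ = 1, 30 observations per training set) is made up for illustration and is not from the slides.

```r
# Simulation sketch: bias^2 and variance of f-hat at a single point x0
# for a low-complexity (linear) and a high-complexity (10th-order) model.
set.seed(1)
f  <- function(x) sin(2 * x)   # assumed true f, for illustration only
x0 <- 0.9                      # out-of-sample point of interest
R  <- 2000                     # number of training datasets D

fits <- sapply(1:R, function(r) {
  d <- data.frame(x = runif(30, 0, 3))
  d$y <- f(d$x) + rnorm(30)
  c(deg1  = unname(predict(lm(y ~ x, data = d), newdata = data.frame(x = x0))),
    deg10 = unname(predict(lm(y ~ poly(x, 10), data = d), newdata = data.frame(x = x0))))
})

bias2    <- (rowMeans(fits) - f(x0))^2   # squared bias at x0
variance <- apply(fits, 1, var)          # variance across training datasets
rbind(bias2, variance)  # expect: deg1 has more bias, deg10 has more variance
```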
Depicting the Bias-Variance Tradeoff
[Figures: graphical illustration of the bias-variance tradeoff]
Regularization and OLS
y_i = X_i′β + ε_i
Today: two tweaks on linear regression, RIDGE and LASSO
Both operate by regularizing or shrinking components of β̂ toward 0
This introduces bias, but may reduce variance
Simple intuition:
Force βˆ to be small: won’t vary so much across training data sets
Take the extreme case: β̂_k = 0 for all k… no variance at all
The RIDGE Objective
y_i = X_i′β + ε_i
OLS objective:
β̂^OLS = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²
Ridge objective:
β̂^RIDGE = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²  subject to  ∑_{k=1}^{K} β_k² ≤ c
The RIDGE Objective
Equivalent to minimizing the penalized residual sum of squares:
PRSS(β)_ℓ2 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} β_k²
PRSS(β)_ℓ2 is convex ⇒ unique solution
λ is called the penalty; it penalizes large β_k
Different values of λ provide different β̂_λ^RIDGE
As λ → 0 we have β̂_λ^RIDGE → β̂^OLS
As λ → ∞ we have β̂_λ^RIDGE → 0
Aside: Standardization
By convention y_i and X_i are assumed to be mean 0
X_i should also be standardized (unit variance)
The penalty treats all β_k the same, so we don't want predictors on different scales (see the sketch below)
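A minimal sketch of the standardization step in R, on made-up data. Note that glmnet() standardizes predictors internally by default (standardize = TRUE), so this is mainly to fix intuition.

```r
# Center the outcome and standardize each predictor to mean 0, unit variance.
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)              # toy predictors
y <- as.vector(X %*% c(5, 4, 3, 2, 1) + rnorm(100))  # toy outcome

X_std <- scale(X)      # each column: mean 0, variance 1
y_ctr <- y - mean(y)   # mean-zero outcome, so no intercept is needed
```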
Closed Form Solution to Ridge Objective
The ridge solution is given by (you will prove this):
β̂_λ^RIDGE = (X′X + λI_K)⁻¹ X′y
Here X is the N×K matrix with X_i′ as rows
I_K is the K×K identity matrix
Note that λI_K makes X′X + λI_K invertible even if X′X isn't
For example if K > N
This was actually the original motivation for ridge regression (a numerical check appears below)
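A quick numerical sketch (made-up data) of the closed-form solution. At λ = 0 it reproduces OLS; as λ grows the coefficients shrink toward 0.

```r
# Closed-form ridge estimate: beta = (X'X + lambda * I_K)^(-1) X'y
set.seed(1)
N <- 100; K <- 5
X <- scale(matrix(rnorm(N * K), nrow = N))         # standardized predictors
y <- as.vector(X %*% c(5, 4, 3, 2, 1) + rnorm(N))
y <- y - mean(y)                                   # centered outcome

ridge_beta <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

cbind(ols   = drop(ridge_beta(X, y, 0)),    # lambda = 0 reproduces OLS
      ridge = drop(ridge_beta(X, y, 50)))   # lambda = 50: shrunk toward zero
```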
RIDGE is Biased
Define A = X′X
β̂_λ^RIDGE = (X′X + λI_K)⁻¹ X′y
          = (A + λI_K)⁻¹ A (A⁻¹X′y)
          = (A[I_K + λA⁻¹])⁻¹ A (A⁻¹X′y)
          = (I_K + λA⁻¹)⁻¹ A⁻¹ A ((X′X)⁻¹X′y)
          = (I_K + λA⁻¹)⁻¹ β̂^OLS
Therefore, if λ ≠ 0:
E[β̂_λ^RIDGE] = E[(I_K + λA⁻¹)⁻¹ β̂^OLS] ≠ β
Pros and cons of RIDGE
Pros:
Simple, closed form solution
Can deal with K >> N and multicollinearity
Introduces bias but can improve out of sample fit
Cons:
Shrinks coefficients but will not simplify model by eliminating variables
LASSO Objective
RIDGE will include all K predictors in the final model: no simplification
LASSO is a relatively recent alternative that overcomes this:
β̂^LASSO = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²  subject to  ∑_{k=1}^{K} |β_k| ≤ c
Can also write this as minimizing:
PRSS(β)_ℓ1 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} |β_k|
LASSO and Sparse Models
Like RIDGE, LASSO will shrink the β_k toward 0
However, the ℓ1 penalty will force some coefficient estimates to be exactly 0 if λ is large enough
Sparse models: this lets us ignore some features
Again, different values of λ provide different β̂_λ^LASSO
Need to find a good choice of λ
Why Does LASSO Set some βk to 0?
LASSO Details
Unlike RIDGE, LASSO has no closed form solution: it requires numerical methods
Neither LASSO nor RIDGE universally dominates
Elastic Net: Combining LASSO and RIDGE Penalties
Simplest version of elastic net (nests LASSO and RIDGE):
β̂^elastic = argmin_β (1/N) ∑_{i=1}^{N} (y_i − X_i′β)² + λ [ α ∑_{k=1}^{K} |β_k| + (1−α) ∑_{k=1}^{K} β_k² ]
α ∈ [0, 1] weights LASSO- vs. RIDGE-style penalties
α = 1 is LASSO
α = 0 is RIDGE
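A minimal glmnet sketch on made-up data showing how α moves between the two penalties (glmnet's objective matches the penalized form above up to a scaling convention).

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)
y <- as.vector(X[, 1:5] %*% c(5, 4, 3, 2, 1) + rnorm(100, sd = 3))

fit_lasso <- glmnet(X, y, alpha = 1)    # alpha = 1: LASSO (l1 penalty only)
fit_ridge <- glmnet(X, y, alpha = 0)    # alpha = 0: RIDGE (l2 penalty only)
fit_enet  <- glmnet(X, y, alpha = 0.5)  # 0 < alpha < 1: elastic net
```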
Implementing LASSO
1. An example of a prediction problem
2. Elastic Net and LASSO in R
3. How to choose the hyperparameter λ
4. Cross-validation in R
An Example of a Prediction Problem
Suppose we see 100 observations of some outcome yi
Example: residential real estate prices in London (i.e. home prices)
We have 50 characteristics x_1i, x_2i, ..., x_50i that might predict y_i
E.g. number of rooms, size, neighborhood dummy, etc.
Want to build a model that helps us predict y_i out of sample
I.e. the price of some other house in London
We Are Given 100 Observations of y_i
[Scatter plot: Outcome vs. Observation for the 100 training observations]
How Well Can We Predict Out-of-Sample Outcomes (y_i^oos)?
[Scatter plot: Outcome and Prediction vs. Observation]
Using x_1i, x_2i, ..., x_50i to Predict y_i
The goal is to use x_1i, x_2i, ..., x_50i to predict any y_i^oos
If you give us number of rooms, size, etc., we will tell you the home price
Need to build a model f̂(·):
ŷ_i = f̂(x_1i, x_2i, ..., x_50i)
A good model will give us predictions close to y_i^oos
We can accurately predict prices for other homes
A Good Model Has Small Distance (y_i^oos − ŷ_i^oos)²
[Scatter plot: Outcome and Prediction vs. Observation]
Our Working Example: Suppose Only a Few x_ki Matter
We see y_i and x_1i, x_2i, ..., x_50i
The true model (which we will pretend we don't know) is:
y_i = 5·x_1i + 4·x_2i + 3·x_3i + 2·x_4i + 1·x_5i + ε_i
Only the first 5 attributes matter!
In other words β_1 = 5, β_2 = 4, β_3 = 3, β_4 = 2, β_5 = 1
β_k = 0 for k = 6, 7, ..., 50
Prediction Using OLS
A first approach would be to try to predict using OLS:
y_i = β_0 + β_1 x_1i + β_2 x_2i + β_3 x_3i + β_4 x_4i + β_5 x_5i + β_6 x_6i + β_7 x_7i + ... + β_50 x_50i + v_i
Have to estimate 51 different parameters with only 100 data points
⇒ not going to get precise estimates
⇒ out of sample predictions will be bad (a quick lm() sketch follows below)
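A quick lm() sketch of this OLS baseline on simulated data mimicking the working example (100 observations, 50 predictors, only the first five matter); the numbers will not match the slides exactly.

```r
# OLS with 51 parameters from 100 observations, then out-of-sample MSE.
set.seed(1)
gen_data <- function(N = 100, K = 50) {
  X <- matrix(rnorm(N * K), nrow = N, dimnames = list(NULL, paste0("x", 1:K)))
  y <- as.vector(X[, 1:5] %*% c(5, 4, 3, 2, 1) + rnorm(N, sd = 3))
  data.frame(y = y, X)
}
train <- gen_data()
test  <- gen_data()

fit <- lm(y ~ ., data = train)                     # 50 slopes + intercept
mean((test$y - predict(fit, newdata = test))^2)    # out-of-sample MSE
```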
OLS Coefficients With 100 Observations Aren’t Great
[Figure: OLS coefficient estimates for each of the 50 predictors]
OLS Doesn't Give Close Predictions for New y_i^oos:
[Scatter plot: Outcome and Prediction vs. Observation]
Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 39.33
Aside: OLS Does Much Better With 10,000 Observations
[Figure: OLS coefficient estimates for each of the 50 predictors]
OLS Out-of-Sample Predictions: 100 Training Obs.
[Scatter plot: Outcome and Prediction vs. Observation]
Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 39.33
OLS Out-of-Sample Predictions: 10,000 Training Obs.
[Scatter plot: Outcome and Prediction vs. Observation]
Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 18.92
Solution to The OLS Problem: Regularization
With 100 observations OLS didn't do very well
Solution: regularization
LASSO/RIDGE/Elastic Net
Simplest version of elastic net (nests LASSO and RIDGE):
β̂^elastic = argmin_β (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_1i − ... − β_K x_Ki)² + λ [ α ∑_{k=1}^{K} |β_k| + (1−α) ∑_{k=1}^{K} β_k² ]
For λ = 0 this is just OLS
For λ > 0, α = 1 this is LASSO:
β̂^LASSO = argmin_β (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_1i − ... − β_K x_Ki)² + λ ∑_{k=1}^{K} |β_k|
Implementing Elastic Net in R
The big question when running LASSO is the choice of λ
By default, glmnet(·) tries 100 different choices for λ
It starts with λ just large enough that all β_k = 0, then proceeds in steps towards λ = 0
For each λ we estimate the corresponding coefficients β̂^LASSO(λ)
How do we decide which one is best? (A glmnet sketch follows below.)
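A minimal sketch of the default λ path on made-up data (the data-generating code is illustrative, not the course dataset).

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)
y <- as.vector(X[, 1:5] %*% c(5, 4, 3, 2, 1) + rnorm(100, sd = 3))

fit <- glmnet(X, y, alpha = 1)     # LASSO over a sequence of (up to) 100 lambdas
length(fit$lambda)                 # the lambda sequence glmnet actually used
plot(fit, xvar = "lambda")         # coefficient paths as lambda varies
coef(fit, s = 1)                   # coefficients at a particular lambda, e.g. 1
```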
LASSO Coefficients With 100 Observations (λ=0.2)
[Figure: LASSO coefficient estimates for each of the 50 predictors]
LASSO Coefficients With 100 Observations (λ=1)
[Figure: LASSO coefficient estimates for each of the 50 predictors]
LASSO Coefficients With 100 Observations (λ=3)
[Figure: LASSO coefficient estimates for each of the 50 predictors]
LASSO Coefficients For All λ
[Figure: LASSO coefficient paths against log(λ); the top axis shows the number of nonzero coefficients]
How to Choose λ (tuning)
One option would be to look at performance out-of-sample
Compare out-of-sample mean squared error for different values of λ
For example:
MSE^oos(0.2) = 26.18
MSE^oos(1) = 22.59
MSE^oos(3) = 56.31
Would like a way of choosing λ that does not require going out of sample…
How to Choose λ?
Need a disciplined way of choosing λ without going out of sample
Could split training data into a training and validation sample
Estimate the model on the training data for many λ
Compute MSE on the validation sample
Choose λ that gives smallest MSE
What if there is something weird about your particular validation sample?
K-fold Cross Validation
Most common approach is K-fold cross-validation
Partition training data into K separate subsets (folds) of equal size
Usually K is either 5 or 10
For each k = 1, 2, ..., K, exclude the kth fold and estimate the model for many λ on the remaining data
For each λ, compute MSE^cv_{k,λ} on the excluded fold
Do this for all K folds:
Now you have K estimates of MSE^cv_{k,λ} for each λ
K-fold Cross Validation
K estimates of MSE^cv_{k,λ} for each λ
Compute the mean of the MSEs for each λ:
mean CV error: MSE^cv(λ) = (1/K) ∑_{k=1}^{K} MSE^cv_{k,λ}
Can also compute standard deviations
Choose λ that gives a small mean MSE^cv(λ)
How to Choose λ: k-fold Cross-Validation
Partition the sample into k equal folds
The default for R is k = 10
For our sample, this means 10 folds with 10 observations each
Cross-validation proceeds in several steps (a cv.glmnet() sketch follows below):
1. Choose k−1 folds (9 folds in our example, with 10 observations each)
2. Run LASSO on these 90 observations
3. Find β̂^LASSO(λ) for all 100 λ
4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
5. Repeat for all 10 possible combinations of k−1 folds
This provides 10 estimates of MSE(λ) for each λ
Can construct means and standard deviations of MSE(λ) for each λ
Choose λ that gives a small mean MSE(λ)
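A cv.glmnet() sketch of this procedure on made-up data (nfolds = 10 is also the default).

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 50), nrow = 100)
y <- as.vector(X[, 1:5] %*% c(5, 4, 3, 2, 1) + rnorm(100, sd = 3))

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # 10-fold CV over the lambda path
plot(cv_fit)          # mean CV MSE with one-s.e. bars against log(lambda)
cv_fit$lambda.min     # lambda minimizing the mean cross-validated error
cv_fit$lambda.1se     # most regularized lambda within 1 s.e. of that minimum
```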
Cross-Validated Mean Squared Errors for All λ
[Figure: mean cross-validated MSE (with error bars) against log(λ); the top axis shows the number of nonzero coefficients]
λ = 0.50 Minimizes Cross-Validation MSE
[Figure: LASSO coefficient estimates for each of the 50 predictors]
OOS Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 22.39
λ = 0.80 Is Most Regularized Within 1 S.E.
[Figure: LASSO coefficient estimates for each of the 50 predictors]
OOS Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 21.79
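To compare the two choices out of sample, here is a sketch continuing the cv_fit object from the cross-validation sketch above; X_new and y_new are fresh draws from the same made-up process.

```r
# Out-of-sample comparison of lambda.min vs. lambda.1se (continues cv_fit above).
X_new <- matrix(rnorm(100 * 50), nrow = 100)
y_new <- as.vector(X_new[, 1:5] %*% c(5, 4, 3, 2, 1) + rnorm(100, sd = 3))

coef(cv_fit, s = "lambda.min")   # coefficients at the CV-minimizing lambda
coef(cv_fit, s = "lambda.1se")   # sparser coefficients at the 1-s.e. lambda

mean((y_new - predict(cv_fit, newx = X_new, s = "lambda.min"))^2)  # OOS MSE, lambda.min
mean((y_new - predict(cv_fit, newx = X_new, s = "lambda.1se"))^2)  # OOS MSE, lambda.1se
```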
Now Do LASSO Yourself
On the Hub you will find two files:
Training data: menti 200.csv
Testing data (out of sample): menti 200 test.csv
There are 200 observations and 200 predictors
Three questions:
1. Run LASSO on the training data: what is the out-of-sample MSE for the λ that gives minimum mean cross-validated error?
2. How many coefficients (excluding intercept) are included in the most regularized model with error within 1 s.e. of the minimum?
3. Run a RIDGE regression. Is the out-of-sample MSE higher or lower than for LASSO?
Extra time: estimate an elastic net regression with α = 0.5 (a starter sketch follows below)
How would you tune α? Google caret or train…
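A rough starter sketch for the exercise. It assumes the outcome column is named y and that the file names are exactly as written on the slide; adjust both to the actual files.

```r
library(glmnet)

train <- read.csv("menti 200.csv")        # file names as written on the slide
test  <- read.csv("menti 200 test.csv")

X_tr <- as.matrix(train[, setdiff(names(train), "y")])   # assumes outcome column "y"
X_te <- as.matrix(test[,  setdiff(names(test),  "y")])

cv_lasso <- cv.glmnet(X_tr, train$y, alpha = 1)
mean((test$y - predict(cv_lasso, newx = X_te, s = "lambda.min"))^2)  # Q1: OOS MSE

sum(coef(cv_lasso, s = "lambda.1se")[-1] != 0)      # Q2: nonzero coefficients, excluding intercept

cv_ridge <- cv.glmnet(X_tr, train$y, alpha = 0)     # Q3: RIDGE comparison
mean((test$y - predict(cv_ridge, newx = X_te, s = "lambda.min"))^2)
```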