Prediction and Regularization
Chris Hansman
Empirical Finance: Methods and Applications
Imperial College Business School
January 31st and February 1st
1. The prediction problem and an example of overfitting
2. The Bias-Variance Tradeoff
3. LASSO and RIDGE
4. Implementing LASSO and RIDGE via glmnet()
A Basic Prediction Model
Suppose y is given by:
y = f(X) + ε
X is a vector of attributes
ε has mean 0 and variance σ_ε^2, with ε ⊥ X
Our goal is to find a model fˆ(X) that approximates f(X)
Suppose We Are Given 100 Observations of y
[Figure: the 100 observed outcomes plotted against observation index]
How Well Can We Predict Out-of-Sample Outcomes (y^oos)?
[Figure: out-of-sample outcomes plotted against observation index]
Predicting Out-of-Sample Outcomes (fˆ(X^oos))
[Figure: out-of-sample outcomes and model predictions plotted against observation index]
A Good Model Has Small Distance (y^oos − fˆ(X^oos))^2
[Figure: out-of-sample outcomes and model predictions plotted against observation index]
A Simple Metric of Fit is the MSE
A standard measure of fit is the mean squared error between y and fˆ:
MSE = E[(y − fˆ(X))^2]
With N observations we can estimate this as:
\widehat{MSE} = (1/N) ∑_{i=1}^{N} (y_i − fˆ(X_i))^2
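As a minimal illustration, the estimated MSE is just the average of the squared prediction errors. A sketch in R (the vector names y and y_hat are hypothetical placeholders for outcomes and predictions):

```r
# Mean squared error between outcomes y and predictions y_hat
# (both assumed to be numeric vectors of the same length)
mse <- function(y, y_hat) {
  mean((y - y_hat)^2)
}
```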
Overfitting
Key tradeoff in building a model:
A model that fits the data you have
A model that will perform well on new data
Easy to build a model that fits the data you have well
With enough parameters
An fˆ fit too closely to one dataset may perform poorly out-of-sample
This is called overfitting
An Example of Overfitting
On the Hub you will find a dataset called polynomial.csv, with variables y and x
Split into test and training data
y is generated as a polynomial of order P in x plus random noise:
y_i = ∑_{p=0}^{P} θ_p x_i^p + ε_i
We don't know the order P…
Fit a regression with a 2nd order polynomial in x:
y_i = ∑_{p=0}^{2} θ_p x_i^p + ε_i
What is the In-Sample MSE?
What is the Out-of-Sample MSE?
Fit a regression with a 25th order polynomial in x:
y_i = ∑_{p=0}^{25} θ_p x_i^p + ε_i
What is the In-Sample MSE?
What is the Out-of-Sample MSE?
Is the in-sample fit better or worse than the quadratic model?
Is the out-of-sample fit better or worse than the quadratic model?
If you finish early:
What order polynomial gives the best out-of-sample fit?
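A minimal sketch of this exercise in R, assuming polynomial.csv has columns named y and x and using an arbitrary 50/50 train/test split (the split you were given may differ):

```r
library(readr)

# Hypothetical split: first half as training data, second half as test data
poly_data <- read_csv("polynomial.csv")
half      <- floor(nrow(poly_data) / 2)
train     <- poly_data[1:half, ]
test      <- poly_data[(half + 1):nrow(poly_data), ]

# Quadratic and 25th-order polynomial fits on the training data
fit2  <- lm(y ~ poly(x, 2),  data = train)
fit25 <- lm(y ~ poly(x, 25), data = train)

# In-sample and out-of-sample MSE for each model
mse <- function(y, y_hat) mean((y - y_hat)^2)
c(in_sample  = mse(train$y, predict(fit2)),
  out_sample = mse(test$y,  predict(fit2,  newdata = test)))
c(in_sample  = mse(train$y, predict(fit25)),
  out_sample = mse(test$y,  predict(fit25, newdata = test)))
```

The 25th-order fit should have the lower in-sample MSE but typically a much higher out-of-sample MSE, which is the overfitting pattern the exercise is after.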
Formalizing Overfitting: Bias-Variance Tradeoff
Consider an algorithm that builds a model fˆ(X) given training data D
We could write fˆ_D(X) to emphasize the dependence on D
Consider the MSE at some particular out-of-sample point X_0:
MSE(X_0) = E[(y_0 − fˆ(X_0))^2]
Here the expectation is taken over y_0 and over training datasets D
We may show that:
MSE(X_0) = (E[fˆ(X_0)] − f(X_0))^2 + E[(fˆ(X_0) − E[fˆ(X_0)])^2] + σ_ε^2
           (first term: Bias^2, second term: Variance)
Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(y_0 − fˆ(X_0))^2]
         = E[(f(X_0) − fˆ(X_0))^2] + E[ε^2] + 2 E[f(X_0) − fˆ(X_0)] E[ε]
         = E[(f(X_0) − fˆ(X_0))^2] + σ_ε^2
(the cross term drops out because ε is independent of fˆ(X_0) and E[ε] = 0)
Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(f(X_0) − fˆ(X_0))^2] + σ_ε^2
         = E[((f(X_0) − E[fˆ(X_0)]) − (fˆ(X_0) − E[fˆ(X_0)]))^2] + σ_ε^2
         = E[(f(X_0) − E[fˆ(X_0)])^2] + E[(fˆ(X_0) − E[fˆ(X_0)])^2]
           − 2 E[(f(X_0) − E[fˆ(X_0)])(fˆ(X_0) − E[fˆ(X_0)])] + σ_ε^2
         = (f(X_0) − E[fˆ(X_0)])^2 + E[(fˆ(X_0) − E[fˆ(X_0)])^2]
           − 2 (f(X_0) − E[fˆ(X_0)]) E[fˆ(X_0) − E[fˆ(X_0)]] + σ_ε^2
         = (f(X_0) − E[fˆ(X_0)])^2 + E[(fˆ(X_0) − E[fˆ(X_0)])^2] + σ_ε^2
The Bias-Variance Tradeoff
MSE(X_0) = (E[fˆ(X_0)] − f(X_0))^2 + E[(fˆ(X_0) − E[fˆ(X_0)])^2] + σ_ε^2
           (first term: Bias^2, second term: Variance)
This is known as the Bias-Variance Tradeoff
More complex models can pick up subtle elements of true f (X )
Less bias
More complex models vary more across different training datasets
More variance
Introducing a little bias may allow a substantial decrease in variance
And hence reduce MSE (i.e. prediction error)
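A minimal Monte Carlo sketch of the tradeoff in R. The true function f(x) = sin(2πx), the sample size, the evaluation point x0, and the two model choices are all illustrative assumptions, not taken from the slides; the point is only that the flexible model tends to show lower bias but much higher variance at x0.

```r
set.seed(1)
f    <- function(x) sin(2 * pi * x)   # assumed "true" function
x0   <- 0.9                           # out-of-sample evaluation point
sims <- 500                           # number of simulated training datasets
n    <- 50                            # observations per training dataset

pred_simple   <- numeric(sims)        # predictions at x0 from a linear fit
pred_flexible <- numeric(sims)        # predictions at x0 from a 10th-order polynomial fit
for (s in 1:sims) {
  x <- runif(n)
  train <- data.frame(x = x, y = f(x) + rnorm(n))
  pred_simple[s]   <- predict(lm(y ~ x, data = train),
                              newdata = data.frame(x = x0))
  pred_flexible[s] <- predict(lm(y ~ poly(x, 10), data = train),
                              newdata = data.frame(x = x0))
}

# Squared bias and variance of each model's prediction at x0
c(bias2 = (mean(pred_simple)   - f(x0))^2, variance = var(pred_simple))
c(bias2 = (mean(pred_flexible) - f(x0))^2, variance = var(pred_flexible))
```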
Depicting the Bias-Variance Tradeoff
[Figure: the bias-variance tradeoff]
Regularization and OLS
y_i = X_i′β + ε_i
Today: two tweaks on linear regression: RIDGE and LASSO
Both operate by regularizing or shrinking components of βˆ toward 0
This introduces bias, but may reduce variance
Simple intuition:
Force βˆ to be small: won’t vary so much across training data sets
Take the extreme case: βk = 0 for all k… No variance
The RIDGE Objective
Consider the linear model y_i = X_i′β + ε_i
OLS objective:
βˆ_OLS = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)^2
RIDGE objective:
βˆ_RIDGE = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)^2   subject to   ∑_{k=1}^{K} β_k^2 ≤ c
The RIDGE Objective
Equivalent to minimizing the penalized residual sum of squares:
PRSS(β)_ℓ2 = ∑_{i=1}^{N} (y_i − X_i′β)^2 + λ ∑_{k=1}^{K} β_k^2
PRSS(β)_ℓ2 is convex ⇒ unique solution
λ is called the penalty
It penalizes large β_k
Different values of λ give different estimates βˆ_λ^RIDGE
As λ → 0 we have βˆ_λ^RIDGE → βˆ_OLS
As λ → ∞ we have βˆ_λ^RIDGE → 0
Aside: Standardization
By convention yi and Xi are assumed to be mean 0
Xi should also be standardized (unit variance)
All β_k are treated the same by the penalty, so we don't want the predictors on different scales
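A minimal sketch in R (assuming a numeric predictor matrix X and outcome vector y; note that glmnet() standardizes the predictors internally by default, so manual standardization mainly matters if you implement RIDGE or LASSO yourself):

```r
# Demean the outcome and standardize each predictor to mean 0, unit variance
y_centered <- y - mean(y)
X_std      <- scale(X, center = TRUE, scale = TRUE)
```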
Closed Form Solution to Ridge Objective
The ridge solution is given by (you will prove this):
βˆ_λ^RIDGE = (X′X + λ I_K)^{−1} X′y
Here X is the N × K matrix with the X_i as rows
I_K is the K × K identity matrix
Note that λIK makes X′X +λIK invertible even if X′X isn’t
For example if K >N
This was actually the original motivation for the problem
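A minimal sketch of this closed form in R, using the standardized X_std and demeaned y_centered from the sketch above and an arbitrary penalty (note that glmnet() scales its penalty differently, so this will not numerically match its output for the same λ):

```r
# Ridge estimate: (X'X + lambda * I_K)^{-1} X'y
ridge_closed_form <- function(X, y, lambda) {
  K <- ncol(X)
  solve(t(X) %*% X + lambda * diag(K), t(X) %*% y)
}

# Example call with an illustrative penalty
beta_ridge <- ridge_closed_form(X_std, y_centered, lambda = 1)
```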
RIDGE is Biased
Define A = X′X
βˆ_λ^RIDGE = (X′X + λ I_K)^{−1} X′y
           = (A + λ I_K)^{−1} A (A^{−1} X′y)
           = (A[I_K + λ A^{−1}])^{−1} A (A^{−1} X′y)
           = (I_K + λ A^{−1})^{−1} A^{−1} A ((X′X)^{−1} X′y)
           = (I_K + λ A^{−1})^{−1} βˆ_OLS
Therefore, if λ ≠ 0:
E[βˆ_λ^RIDGE] = E[(I_K + λ A^{−1})^{−1} βˆ_OLS] ≠ β
Pros and cons of RIDGE
Simple, closed form solution
Can deal with K >> N and multicollinearity
Introduces bias but can improve out of sample fit
Shrinks coefficients but will not simplify model by eliminating variables
LASSO Objective
RIDGE will include all K predictors in the final model
No simplification
LASSO is a relatively recent alternative that overcomes this:
βˆ_LASSO = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)^2   subject to   ∑_{k=1}^{K} |β_k| ≤ c
Can also write this as minimizing:
PRSS(β)_ℓ1 = ∑_{i=1}^{N} (y_i − X_i′β)^2 + λ ∑_{k=1}^{K} |β_k|
LASSO and Sparse Models
Like RIDGE, LASSO will shrink βk s toward 0
However, the l1 penalty will force some coefficient estimates to be exactly 0 if λ is large enough
Sparse models: they let us ignore some features
Again, different values of λ give different estimates βˆ_λ^LASSO
Need to find a good choice of λ
Why Does LASSO Set some βk to 0?
LASSO Details
Unlike RIDGE, LASSO has no closed form solution
It requires numerical methods
Neither LASSO nor RIDGE universally dominates
Elastic Net: Combining LASSO and RIDGE Penalties
Simplest version of elastic net (nests LASSO and RIDGE):
βˆ_elastic = argmin_β (1/N) ∑_{i=1}^{N} (y_i − X_i′β)^2 + λ [ α ∑_{k=1}^{K} |β_k| + (1 − α) ∑_{k=1}^{K} β_k^2 ]
α ∈ [0, 1] weights LASSO- vs. RIDGE-style penalties
α = 1 is LASSO
α = 0 is RIDGE
Implementing LASSO
1. An example of a prediction problem
2. Elastic Net and LASSO in R
3. How to choose the hyperparameter λ
4. Cross-validation in R
An Example of a Prediction Problem
Suppose we see 100 observations of some outcome yi
Example: residential real estate prices in London (i.e. home prices)
We have 50 characteristics x_{1i}, x_{2i}, ···, x_{50i} that might predict y_i
E.g. number of rooms, size, neighborhood dummy, etc.
Want to build a model that helps us predict y_i out of sample
I.e. the price of some other house in London
We Are Given 100 Observations of yi
[Figure: the 100 observed outcomes y_i plotted against observation index]
How Well Can We Predict Out-of-Sample Outcomes (y_i^oos)?
[Figure: out-of-sample outcomes plotted against observation index]
Using x1i,x2i,··· ,x50i to Predict yi
The goal is to use x_{1i}, x_{2i}, ···, x_{50i} to predict any y_i^oos
If you give us the number of rooms, size, etc., we will tell you the home price
Need to build a model fˆ(·):
yˆ_i = fˆ(x_{1i}, x_{2i}, ···, x_{50i})
A good model will give us predictions close to y^oos
We can accurately predict prices for other homes
A Good Model Has Small Distance (y_i^oos − yˆ_i^oos)^2
[Figure: out-of-sample outcomes and predictions plotted against observation index]
Our Working Example: Suppose Only a Few xki Matter
We see yi and x1i,x2i,··· ,x50i
The true model (which we will pretend we don’t know) is:
y_i = 5·x_{1i} + 4·x_{2i} + 3·x_{3i} + 2·x_{4i} + 1·x_{5i} + ε_i
Only the first 5 attributes matter!
In other words β_1 = 5, β_2 = 4, β_3 = 3, β_4 = 2, β_5 = 1
β_k = 0 for k = 6, 7, ···, 50
Prediction Using OLS
A first approach would be to try to predict using OLS:
yi = β0 +β1x1i +β2x2i +β3x3i +β4x4i +β5x5i +β6x6i +β7x7i +···+β50x50i +vi
Have to estimate 51 different parameters with only 100 data points ⇒ not going to get precise estimates
⇒ out of sample predictions will be bad
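A minimal sketch of this OLS benchmark in R (assuming data frames train and test that contain y and the predictors x1, …, x50 and nothing else; these object names are hypothetical):

```r
# Regress y on all 50 predictors using the 100 training observations
ols_fit <- lm(y ~ ., data = train)

# Out-of-sample predictions and MSE on the test data
y_hat_oos <- predict(ols_fit, newdata = test)
mean((test$y - y_hat_oos)^2)
```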
OLS Coefficients With 100 Observations Aren’t Great
[Figure: OLS coefficient estimate for each X variable]
OLS Doesn't Give Close Predictions for New y_i^oos
[Figure: out-of-sample outcomes and OLS predictions; Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − yˆ_i^oos)^2]
Aside: OLS Does Much Better With 10,000 Observations
[Figure: OLS coefficient estimate for each X variable]
OLS Out-of-Sample Predictions: 100 Training Obs.
[Figure: out-of-sample outcomes and OLS predictions; Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − yˆ_i^oos)^2]
OLS Out-of-Sample Predictions: 10,000 Training Obs.
[Figure: out-of-sample outcomes and OLS predictions; Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − yˆ_i^oos)^2]
Solution to The OLS Problem: Regularization
With 100 observations, OLS didn't do very well
Solution: regularization
LASSO/RIDGE/Elastic Net
Simplest version of elastic net (nests LASSO and RIDGE):
βˆ_elastic = argmin_β (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_{1i} − ··· − β_K x_{Ki})^2 + λ [ α ∑_{k=1}^{K} |β_k| + (1 − α) ∑_{k=1}^{K} β_k^2 ]
For λ = 0 this is just OLS
For λ > 0, α = 1 this is LASSO:
βˆ_LASSO = argmin_β (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_{1i} − ··· − β_K x_{Ki})^2 + λ ∑_{k=1}^{K} |β_k|
Implementing Elastic Net in R
The big question when running LASSO is the choice of λ
By default, glmnet(·) tries 100 different choices for λ
Starts with λ just large enough that all βˆ_k = 0
Proceeds with steps towards λ = 0
For each λ, we estimate the corresponding coefficients βˆ_LASSO(λ)
How do we decide which one is best?
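A minimal sketch of the LASSO fit in R using glmnet (assuming a numeric 100 × 50 predictor matrix X and outcome vector y as in the working example; glmnet() requires a matrix, not a data frame):

```r
library(glmnet)

# alpha = 1 gives LASSO; glmnet chooses its own decreasing sequence of 100 lambdas
lasso_fit <- glmnet(X, y, alpha = 1)

# Coefficient paths: one set of coefficients per lambda
plot(lasso_fit, xvar = "lambda", label = TRUE)

# Coefficients at a particular penalty, e.g. lambda = 0.2
coef(lasso_fit, s = 0.2)
```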
LASSO Coefficients With 100 Observations (λ=0.2)
[Figure: LASSO coefficient estimate for each X variable]
LASSO Coefficients With 100 Observations (λ=1)
[Figure: LASSO coefficient estimate for each X variable]
LASSO Coefficients With 100 Observations (λ=3)
[Figure: LASSO coefficient estimate for each X variable]
LASSO Coefficients For All λ
[Figure: LASSO coefficient paths plotted against log(λ)]
How to Choose λ (tuning)
One option would be to look at performance out-of-sample
Compare out-of-sample mean squared error for different values of λ
For example:
MSE^oos(0.2) = 26.18
MSE^oos(1) = 22.59
MSE^oos(3) = 56.31
Would like a way of choosing λ that does not require going out of sample…
How to Choose λ?
Need a disciplined way of choosing λ without going out of sample
Could split the training data into a training and a validation sample
Estimate the model on the training data for many λ
Compute the MSE on the validation sample
Choose λ that gives smallest MSE
What if there is something weird about your particular validation sample?
K-fold Cross Validation
Most common approach is K-fold cross-validation
Partition training data into K separate subsets of equal size
Usually K is either 5 or 10
For each k = 1, 2, ···, K: exclude the kth fold and estimate the model for many λ on the remaining data
For each λ, compute MSE^cv_{k,λ} on the excluded fold
Do this for all K folds:
Now you have K estimates of MSE^cv_{k,λ} for each λ
K-fold Cross Validation
K estimates of MSE^cv_{k,λ} for each λ
Can also compute standard deviations
Compute the mean of the MSEs for each λ:
\bar{MSE}^cv_λ = (1/K) ∑_{k=1}^{K} MSE^cv_{k,λ}
Choose the λ that gives a small \bar{MSE}^cv_λ
How to Choose λ: k-fold Cross-Validation
Partition the sample into k equal folds
The default for R is k=10
For our sample, this means 10 folds with 10 observations each
Cross-validation proceeds in several steps:
1. Choose k-1 folds (9 folds in our example, with 10 observations each)
2. Run LASSO on these 90 observations
3. Find βˆ_LASSO(λ) for all 100 values of λ
4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
5. Repeat for all 10 possible combinations of k−1 folds
This provides 10 estimates of MSE(λ) for each λ
Can construct means and standard deviations of MSE(λ) for each λ
Choose λ that gives small mean MSE(λ)
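In R this whole procedure is automated by cv.glmnet(). A minimal sketch, assuming the same X matrix and y vector as before (nfolds = 10 is the default):

```r
library(glmnet)

set.seed(1)  # the fold assignment is random
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

# Mean cross-validated MSE (with standard error bars) against log(lambda)
plot(cv_fit)

# lambda minimizing the mean CV error, and the most regularized lambda
# whose CV error is within one standard error of that minimum
cv_fit$lambda.min
cv_fit$lambda.1se
```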
Cross-Validated Mean Squared Errors for All λ
[Figure: mean cross-validated MSE plotted against log(λ)]
λ = 0.50 Minimizes Cross-Validation MSE
[Figure: coefficient estimate for each X variable at λ = 0.50, with the out-of-sample MSE reported]
λ = 0.87 Is Most Regularized Within 1 S.E.
[Figure: coefficient estimate for each X variable at λ = 0.87, with the out-of-sample MSE reported]
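Continuing the cv.glmnet() sketch above, the coefficients and out-of-sample MSE at these two penalties can be extracted as follows (X_test and y_test are hypothetical hold-out data):

```r
# Coefficients at the CV-minimizing lambda and at the 1-s.e. lambda
coef(cv_fit, s = "lambda.min")
coef(cv_fit, s = "lambda.1se")

# Out-of-sample predictions and MSE at each choice
y_hat_min <- predict(cv_fit, newx = X_test, s = "lambda.min")
y_hat_1se <- predict(cv_fit, newx = X_test, s = "lambda.1se")
mean((y_test - y_hat_min)^2)
mean((y_test - y_hat_1se)^2)
```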
Now Do LASSO Yourself
On the Hub you will find two files:
Training data: menti 200.csv
Testing data (out of sample): menti 200 test.csv
There are 200 observations and 200 predictors
Three questions:
1. Run LASSO on the training data: what is the out-of-sample MSE for
the λ that gives minimum mean cross-validated error?
2. How many coefficients (excluding intercept) are included in the most
regularized model with error within 1 s.e. of the minimum?
3. Run a RIDGE regression. Is the out-of-sample MSE higher or lower than for LASSO?
Extra time: estimate an elastic net regression with α = 0.5
How would you tune α? Google caret or train()…
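A minimal starting skeleton for these questions (the file names are as printed above and may actually contain underscores; the column layout — an outcome column followed by the 200 predictors — is an assumption, so adjust the indexing to the real files):

```r
library(glmnet)
library(readr)

train <- read_csv("menti 200.csv")
test  <- read_csv("menti 200 test.csv")

# Assumed layout: first column is the outcome, the rest are the 200 predictors
X_train <- as.matrix(train[, -1]);  y_train <- train[[1]]
X_test  <- as.matrix(test[, -1]);   y_test  <- test[[1]]

set.seed(1)
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)   # LASSO
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)   # RIDGE

# Q1: out-of-sample MSE for LASSO at lambda.min
mean((y_test - predict(cv_lasso, newx = X_test, s = "lambda.min"))^2)

# Q2: number of nonzero coefficients (excluding the intercept) at lambda.1se
sum(coef(cv_lasso, s = "lambda.1se")[-1] != 0)

# Q3: out-of-sample MSE for RIDGE at lambda.min, to compare with LASSO
mean((y_test - predict(cv_ridge, newx = X_test, s = "lambda.min"))^2)
```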