
Prediction and Regularization
Chris Hansman
Empirical Finance: Methods and Applications
Imperial College Business School
January 31st and February 1st

1. The prediction problem and an example of overfitting
2. The Bias-Variance Tradeoff
3. LASSO and RIDGE
4. Implementing LASSO and RIDGE via glmnet()

A Basic Prediction Model
- Suppose y is given by:
      y = f(X) + ε
- X is a vector of attributes
- ε has mean 0, variance σ², and ε ⊥ X
- Our goal is to find a model f̂(X) that approximates f(X)

Suppose We Are Given 100 Observations of y
[Figure: scatter plot of the 100 observed outcomes; x-axis: Observation (0 to 100)]

How Well Can We Predict Out-of-Sample Outcomes (y^oos)?
[Figure: out-of-sample outcomes to be predicted; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction]

Predicting Out-of-Sample Outcomes (f̂(X^oos))
[Figure: out-of-sample outcomes together with model predictions f̂(X^oos); x-axis: Observation (0 to 100), y-axis: Outcome and Prediction]

A Good Model Has Small Distance (y^oos − f̂(X^oos))²
[Figure: out-of-sample outcomes and predictions lying close together; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction]

A Simple Metric of Fit is the MSE
- A standard measure of fit is the mean squared error between y and f̂:
      MSE = E[(y − f̂(X))²]
- With N observations we can estimate this as (see the sketch below):
      MSE-hat = (1/N) ∑_{i=1}^{N} (y_i − f̂(X_i))²
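As a quick illustration, the sample MSE is a one-liner in R. This is a minimal sketch: the data frames train and test and the fitted model fit are hypothetical names, not objects from the slides.

mse <- function(actual, predicted) {
  mean((actual - predicted)^2)   # sample mean of squared prediction errors
}

# Hypothetical usage with some fitted model `fit`:
# in_sample_mse  <- mse(train$y, predict(fit, newdata = train))
# out_sample_mse <- mse(test$y,  predict(fit, newdata = test))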

Overfitting
- Key tradeoff in building a model:
  - A model that fits the data you have
  - A model that will perform well on new data
- It is easy to build a model that fits the data you have well
  - With enough parameters
- An f̂ fit too closely to one dataset may perform poorly out-of-sample
  - This is called overfitting

An Example of Overfitting
- On the Hub you will find a dataset called polynomial.csv
  - Variables y and x
  - Split into test and training data
- y is generated as a polynomial in x plus random noise:
      y_i = ∑_{p=0}^{P} θ_p x_i^p + ε_i
- We don't know the order P...

- Fit a regression with a 2nd order polynomial in x (a sketch follows this list):
      y_i = ∑_{p=0}^{2} θ_p x_i^p + ε_i
- What is the In-Sample MSE?
- What is the Out-of-Sample MSE?
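One way to carry out this exercise in R is sketched below. The simulated data stands in for polynomial.csv; the true order and coefficients used here are assumptions made only so the sketch runs end to end.

set.seed(1)
n     <- 200
x     <- runif(n, -2, 2)
y     <- 1 + 2 * x - 1.5 * x^2 + 0.5 * x^3 + rnorm(n)   # assumed stand-in DGP
idx   <- sample(n, n / 2)
train <- data.frame(y = y[idx],  x = x[idx])
test  <- data.frame(y = y[-idx], x = x[-idx])

mse <- function(actual, predicted) mean((actual - predicted)^2)

fit2 <- lm(y ~ poly(x, 2, raw = TRUE), data = train)    # quadratic fit
mse(train$y, predict(fit2, newdata = train))             # in-sample MSE
mse(test$y,  predict(fit2, newdata = test))              # out-of-sample MSE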

- Fit a regression with a 25th order polynomial in x:
      y_i = ∑_{p=0}^{25} θ_p x_i^p + ε_i
- What is the In-Sample MSE?
- What is the Out-of-Sample MSE?
- Is the in-sample fit better or worse than the quadratic model?
- Is the out-of-sample fit better or worse than the quadratic model?
- If you finish early:
  - What order polynomial gives the best out-of-sample fit? (see the sketch after this list)
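Continuing the simulated example above, a sketch of the search over polynomial orders:

orders  <- 1:25
oos_mse <- sapply(orders, function(p) {
  # orthogonal polynomials for numerical stability at high orders;
  # fitted values match the raw-power specification
  fit <- lm(y ~ poly(x, p), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})
orders[which.min(oos_mse)]   # order with the smallest out-of-sample MSE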

Formalizing Overfitting: Bias-Variance Tradeoff
- Consider an algorithm to build a model f̂(X) given training data D
  - Could write f̂_D(X)
- Consider the MSE at some particular out-of-sample point X_0:
      MSE(X_0) = E[(y_0 − f̂(X_0))²]
- Here the expectation is taken over y and all D
- We may show that:
      MSE(X_0) = (E[f̂(X_0)] − f(X_0))² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ²
      (the first term is the Bias², the second the Variance, the third the irreducible error)

Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(y_0 − f̂(X_0))²]
         = E[(f(X_0) − f̂(X_0))²] + E[ε²] + 2 E[f(X_0) − f̂(X_0)] E[ε]
         = E[(f(X_0) − f̂(X_0))²] + σ_ε²

Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X_0) = E[(f(X_0) − f̂(X_0))²] + σ_ε²
         = E[((f(X_0) − E[f̂(X_0)]) − (f̂(X_0) − E[f̂(X_0)]))²] + σ_ε²
         = E[(f(X_0) − E[f̂(X_0)])²] + E[(f̂(X_0) − E[f̂(X_0)])²]
           − 2 E[(f(X_0) − E[f̂(X_0)])(f̂(X_0) − E[f̂(X_0)])] + σ_ε²
         = (f(X_0) − E[f̂(X_0)])² + E[(f̂(X_0) − E[f̂(X_0)])²]
           − 2 (f(X_0) − E[f̂(X_0)]) E[f̂(X_0) − E[f̂(X_0)]] + σ_ε²
         = (f(X_0) − E[f̂(X_0)])² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ_ε²
(the cross term drops out because E[f̂(X_0) − E[f̂(X_0)]] = 0)

The Bias-Variance Tradeoff
      MSE(X_0) = (E[f̂(X_0)] − f(X_0))² + E[(f̂(X_0) − E[f̂(X_0)])²] + σ²
      (the first term is the Bias², the second the Variance, the third the irreducible error)
- This is known as the Bias-Variance Tradeoff
- More complex models can pick up subtle elements of the true f(X)
  - Less bias
- More complex models vary more across different training datasets
  - More variance
- Introducing a little bias may allow a substantial decrease in variance
  - And hence reduce MSE (or Prediction Error)

Depicting the Bias-Variance Tradeoff
[Figures: three slides depicting the bias-variance tradeoff; figures not reproduced]
Regularization and OLS
      y_i = X_i′β + ε_i
- Today: two tweaks on linear regression
  - RIDGE and LASSO
- Both operate by regularizing or shrinking components of β̂ toward 0
  - This introduces bias, but may reduce variance
- Simple intuition:
  - Force β̂ to be small: it won't vary so much across training datasets
  - Take the extreme case: β̂_k = 0 for all k...
  - No variance

The RIDGE Objective
      y_i = X_i′β + ε_i
- OLS objective:
      β̂_OLS = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²
- Ridge objective:
      β̂_RIDGE = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²   subject to   ∑_{k=1}^{K} β_k² ≤ c

The RIDGE Objective
- Equivalent to minimizing the penalized residual sum of squares:
      PRSS(β)_ℓ2 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} β_k²
- PRSS(β)_ℓ2 is convex ⇒ unique solution
- λ is called the penalty
  - Penalizes large β_k
- Different values of λ provide different β̂_λ^RIDGE
- As λ → 0 we have β̂_λ^RIDGE → β̂_OLS
- As λ → ∞ we have β̂_λ^RIDGE → 0

Aside: Standardization
- By convention y_i and X_i are assumed to be mean 0
- X_i should also be standardized (unit variance)
- All β_k are treated the same by the penalty, so we don't want the predictors on different scales (a small sketch follows)
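A small sketch of what this means in practice; X and y here are assumed to be a numeric matrix and vector.

X_std <- scale(X)       # each column recentred to mean 0 and rescaled to sd 1
y_dm  <- y - mean(y)    # demean the outcome

# Note: glmnet() standardizes the columns of X internally by default
# (standardize = TRUE) and reports coefficients on the original scale.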

Closed Form Solution to the Ridge Objective
- The ridge solution is given by (you will prove this; a sketch follows this list):
      β̂_RIDGE = (X′X + λI_K)⁻¹ X′y
- Here X is the matrix with the X_i as rows
- I_K is the K×K identity matrix
- Note that λI_K makes X′X + λI_K invertible even if X′X isn't
  - For example if K > N
  - This was actually the original motivation for the problem
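A sketch of the closed form computed directly in R. The simulated data below (with true coefficients 5, 4, 3, 2, 1 and zeros, as in the working example later in the deck) is an assumption used only to make the snippet runnable.

set.seed(1)
N <- 100; K <- 50
X <- scale(matrix(rnorm(N * K), N, K))                # standardized predictors
beta_true <- c(5, 4, 3, 2, 1, rep(0, K - 5))
y <- as.vector(X %*% beta_true + rnorm(N))
y <- y - mean(y)                                      # demeaned outcome

lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(K), t(X) %*% y)
head(beta_ridge)                                      # shrunken coefficients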

RIDGE is Biased
- Define A = X′X
      β̂_RIDGE = (X′X + λI_K)⁻¹ X′y
               = (A + λI_K)⁻¹ A (A⁻¹ X′y)
               = (A[I_K + λA⁻¹])⁻¹ A (A⁻¹ X′y)
               = (I_K + λA⁻¹)⁻¹ A⁻¹ A ((X′X)⁻¹ X′y)
               = (I_K + λA⁻¹)⁻¹ β̂_OLS
- Therefore, if λ ≠ 0:
      E[β̂_λ^RIDGE] = E[(I_K + λA⁻¹)⁻¹ β̂_OLS] ≠ β

Pros and Cons of RIDGE
- Simple, closed form solution
- Can deal with K >> N and multicollinearity
- Introduces bias but can improve out-of-sample fit
- Shrinks coefficients but will not simplify the model by eliminating variables

LASSO Objective
- RIDGE will include all K predictors in the final model
  - No simplification
- LASSO is a relatively recent alternative that overcomes this:
      β̂_LASSO = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²   subject to   ∑_{k=1}^{K} |β_k| ≤ c
- Can also write this as minimizing:
      PRSS(β)_ℓ1 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} |β_k|

LASSO and Sparse Models
- Like RIDGE, LASSO will shrink the β_k toward 0
- However, the ℓ1 penalty will force some coefficient estimates to be exactly 0 if λ is large enough
- Sparse models: let us ignore some features
- Again, different values of λ provide different β̂_λ^LASSO
- Need to find a good choice of λ

Why Does LASSO Set Some β_k to 0?

LASSO Details
- Unlike RIDGE, LASSO has no closed form solution
  - Requires numerical methods
- Neither LASSO nor RIDGE universally dominates

Elastic Net: Combining LASSO and RIDGE Penalties
- Simplest version of elastic net (nests LASSO and RIDGE):
      β̂_elastic = argmin_β { (1/N) ∑_{i=1}^{N} (y_i − X_i′β)² + λ [ α ∑_{k=1}^{K} |β_k| + (1 − α) ∑_{k=1}^{K} β_k² ] }
- α ∈ [0, 1] weights the LASSO vs. RIDGE style penalties (see the sketch after this list)
  - α = 1 is LASSO
  - α = 0 is RIDGE
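In R all three penalties are available through glmnet's alpha argument. A sketch, reusing the simulated X and y from the ridge closed-form sketch above (an assumption of this snippet):

library(glmnet)

fit_ridge   <- glmnet(X, y, alpha = 0)     # alpha = 0: RIDGE penalty
fit_lasso   <- glmnet(X, y, alpha = 1)     # alpha = 1: LASSO penalty (the default)
fit_elastic <- glmnet(X, y, alpha = 0.5)   # in between: elastic net

coef(fit_lasso, s = 1)   # coefficients along the path, here at lambda = 1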

Implementing LASSO
1. An example of a prediction problem
2. Elastic Net and LASSO in R
3. How to choose the hyperparameter λ
4. Cross-validation in R

An Example of a Prediction Problem
- Suppose we see 100 observations of some outcome y_i
  - Example: residential real estate prices in London (i.e. home prices)
- We have 50 characteristics x_1i, x_2i, ..., x_50i that might predict y_i
  - E.g. number of rooms, size, neighborhood dummy, etc.
- Want to build a model that helps us predict y_i out of sample
  - I.e. the price of some other house in London

We Are Given 100 Observations of y_i
[Figure: scatter plot of the 100 observed outcomes; x-axis: Observation (0 to 100)]

How Well Can We Predict Out-of-Sample Outcomes (y_i^oos)?
[Figure: out-of-sample outcomes to be predicted; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction]

Using x_1i, x_2i, ..., x_50i to Predict y_i
- The goal is to use x_1i, x_2i, ..., x_50i to predict any y_i^oos
  - If you give us the number of rooms, size, etc., we will tell you the home price
- Need to build a model f̂(·):
      ŷ_i = f̂(x_1i, x_2i, ..., x_50i)
- A good model will give us predictions close to y^oos
  - We can accurately predict prices for other homes

A Good Model Has Small Distance (y_i^oos − ŷ_i^oos)²
[Figure: out-of-sample outcomes and predictions lying close together; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction]

Our Working Example: Suppose Only a Few x_ki Matter
- We see y_i and x_1i, x_2i, ..., x_50i
- The true model (which we will pretend we don't know) is:
      y_i = 5·x_1i + 4·x_2i + 3·x_3i + 2·x_4i + 1·x_5i + ε_i
- Only the first 5 attributes matter!
  - In other words β_1 = 5, β_2 = 4, β_3 = 3, β_4 = 2, β_5 = 1
  - β_k = 0 for k = 6, 7, ..., 50

Prediction Using OLS
- A first approach would be to try to predict using OLS:
      y_i = β_0 + β_1 x_1i + β_2 x_2i + β_3 x_3i + β_4 x_4i + β_5 x_5i + β_6 x_6i + β_7 x_7i + ... + β_50 x_50i + v_i
- Have to estimate 51 different parameters with only 100 data points
  - ⇒ not going to get precise estimates
  - ⇒ out-of-sample predictions will be bad (see the sketch below)
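A sketch of this experiment in R. The simulation below follows the working example's true model; the noise level and the independent-normal design are assumptions of the sketch, not taken from the slides.

set.seed(123)
simulate <- function(n, K = 50) {
  X <- matrix(rnorm(n * K), n, K)
  beta <- c(5, 4, 3, 2, 1, rep(0, K - 5))
  data.frame(y = as.vector(X %*% beta + rnorm(n, sd = 5)), X)
}
train <- simulate(100)   # 100 training observations
test  <- simulate(100)   # out-of-sample observations

fit_ols <- lm(y ~ ., data = train)                    # 51 parameters
coef(fit_ols)[1:6]                                    # noisy estimates
mean((test$y - predict(fit_ols, newdata = test))^2)   # out-of-sample MSE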

OLS Coefficients With 100 Observations Aren't Great
[Figure: OLS coefficient estimates for the 50 predictors; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate]

OLS Doesn't Give Close Predictions for New y_i^oos
[Figure: out-of-sample outcomes and OLS predictions; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction.
Mean Squared Error = (1/100) ∑ (y_i^oos − ŷ_i^oos)²]

Aside: OLS Does Much Better With 10,000 Observations
[Figure: OLS coefficient estimates with 10,000 training observations; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate]

OLS Out-of-Sample Predictions: 100 Training Obs.
[Figure: out-of-sample outcomes and OLS predictions with 100 training observations; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction.
Mean Squared Error = (1/100) ∑ (y_i^oos − ŷ_i^oos)²]

OLS Out-of-Sample Predictions: 10,000 Training Obs.
[Figure: out-of-sample outcomes and OLS predictions with 10,000 training observations; x-axis: Observation (0 to 100), y-axis: Outcome and Prediction.
Mean Squared Error = (1/100) ∑ (y_i^oos − ŷ_i^oos)²]

Solution to the OLS Problem: Regularization
- With 100 observations OLS didn't do very well
  - Solution: regularization
  - LASSO / RIDGE / Elastic Net
- Simplest version of elastic net (nests LASSO and RIDGE):
      β̂_elastic = argmin_β { (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_1i − ... − β_K x_Ki)² + λ [ α ∑_{k=1}^{K} |β_k| + (1 − α) ∑_{k=1}^{K} β_k² ] }
- For λ = 0 this is just OLS
- For λ > 0, α = 1 this is LASSO:
      argmin_β { (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1 x_1i − ... − β_K x_Ki)² + λ ∑_{k=1}^{K} |β_k| }

Implementing Elastic Net in R
- The big question when running LASSO is the choice of λ
- By default, glmnet(·) tries 100 different choices of λ (see the sketch after this list)
  - Starts with λ just large enough that all β_k = 0
  - Proceeds with steps towards λ = 0
- For each λ, we estimate the corresponding coefficients β̂_lasso(λ)
- How do we decide which one is best?
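A sketch of the LASSO path on the simulated working example from the OLS sketch above (object names carried over from that sketch):

library(glmnet)

x_train <- as.matrix(train[, -1])   # the 50 predictors
y_train <- train$y

fit_lasso <- glmnet(x_train, y_train, alpha = 1)   # 100 lambdas by default
length(fit_lasso$lambda)                           # the lambda sequence tried
plot(fit_lasso, xvar = "lambda")                   # coefficient paths vs. log(lambda)
coef(fit_lasso, s = 0.2)                           # coefficients at lambda = 0.2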

LASSO Coefficients With 100 Observations (λ = 0.2)
[Figure: LASSO coefficient estimates at λ = 0.2; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate]

LASSO Coefficients With 100 Observations (λ = 1)
[Figure: LASSO coefficient estimates at λ = 1; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate]

LASSO Coefficients With 100 Observations (λ = 3)
[Figure: LASSO coefficient estimates at λ = 3; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate]

LASSO Coefficients For All λ
[Figure: LASSO coefficient paths against log(Lambda) from −5 to 1; y-axis: Coefficients (−1 to 5); top axis: number of nonzero coefficients (49 down to 3)]

How to Choose λ (Tuning)
- One option would be to look at performance out-of-sample
- Compare the out-of-sample mean squared error for different values of λ (a sketch follows this list)
- For example:
  - MSE_oos(0.2) = 26.18
  - MSE_oos(1) = 22.59
  - MSE_oos(3) = 56.31
- Would like a way of choosing λ that does not require going out of sample...
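For reference, a sketch of how such a comparison looks with glmnet. The MSE values on the slide come from the instructor's data; the test objects below are from the earlier simulation, so the numbers will differ.

x_test <- as.matrix(test[, -1])
for (lam in c(0.2, 1, 3)) {
  pred <- predict(fit_lasso, newx = x_test, s = lam)
  cat("lambda =", lam, " out-of-sample MSE =", mean((test$y - pred)^2), "\n")
}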

How to Choose λ?
- Need a disciplined way of choosing λ without going out of sample
- Could split the training data into a training and a validation sample
  - Estimate the model on the training data for many λ
  - Compute the MSE on the validation sample
  - Choose the λ that gives the smallest MSE
- What if there is something weird about your particular validation sample?

K-fold Cross Validation
- The most common approach is K-fold cross validation
- Partition the training data into K separate subsets of equal size
  - Usually K is either 5 or 10
- For each k = 1, 2, ..., K exclude the kth fold and estimate the model for many λ on the remaining data
- For each λ, compute MSE_{k,λ}^cv on the excluded fold
- Do this for all K folds:
  - Now you have K estimates of MSE_{k,λ}^cv for each λ

K-fold Cross Validation
- K estimates of MSE_{k,λ}^cv for each λ
- Compute the mean of the MSEs for each λ:
      MSE-bar_λ^cv = (1/K) ∑_{k=1}^{K} MSE_{k,λ}^cv
- Can also compute standard deviations
- Choose the λ that gives a small MSE-bar_λ^cv

How to Choose λ: k-fold Cross Validation

How to Choose λ: k-fold Cross Validation
- Partition the sample into k equal folds
  - The default for R is k = 10
  - For our sample, this means 10 folds with 10 observations each
- Cross-validation proceeds in several steps (see the cv.glmnet sketch after this list):
  1. Choose k−1 folds (9 folds in our example, with 10 observations each)
  2. Run LASSO on these 90 observations
  3. Find β̂_lasso(λ) for all 100 λ
  4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
  5. Repeat for all 10 possible combinations of k−1 folds
- This provides 10 estimates of MSE(λ) for each λ
- Can construct means and standard deviations of MSE(λ) for each λ
- Choose the λ that gives a small mean MSE(λ)
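In practice cv.glmnet() automates all of these steps. A sketch, continuing with the simulated training data from the earlier OLS/LASSO sketches:

set.seed(42)
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

plot(cv_fit)        # mean CV MSE (with 1 s.e. bars) against log(lambda)
cv_fit$lambda.min   # the lambda with the smallest mean cross-validated error
cv_fit$lambda.1se   # the most regularized lambda within 1 s.e. of that minimum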

Cross-Validated Mean Squared Errors for All λ
[Figure: mean cross-validated MSE against log(Lambda) from −5 to 1; y-axis: Mean-Squared Error (roughly 30 to 90); top axis: number of nonzero coefficients]

λ = 0.50 Minimizes Cross-Validation MSE
[Figure: LASSO coefficient estimates at λ = 0.50; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate.
OOS Mean Squared Error = (1/100) ∑ (y_i^oos − ŷ_i^oos)²]

λ = 0.87 Is the Most Regularized Model Within 1 S.E.
[Figure: LASSO coefficient estimates at λ = 0.87; x-axis: X Variable (5 to 50), y-axis: Coefficient Estimate.
OOS Mean Squared Error = (1/100) ∑ (y_i^oos − ŷ_i^oos)²]
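The two λ choices highlighted on these slides correspond to cv.glmnet's built-in shortcuts. A sketch of extracting them and checking the out-of-sample fit, with objects carried over from the cross-validation sketch; the specific values 0.50 and 0.87 are from the instructor's data and will differ in any re-simulation.

coef(cv_fit, s = "lambda.min")   # coefficients at the CV-minimizing lambda
coef(cv_fit, s = "lambda.1se")   # most regularized model within 1 s.e.

pred_min <- predict(cv_fit, newx = x_test, s = "lambda.min")
mean((test$y - pred_min)^2)      # out-of-sample MSE at lambda.min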

Now Do LASSO Yourself
- On the Hub you will find two files:
  - Training data: menti 200.csv
  - Testing data (out of sample): menti 200 test.csv
- There are 200 observations and 200 predictors
- Three questions (a starter sketch follows this list):
  1. Run LASSO on the training data: what is the out-of-sample MSE for the λ that gives the minimum mean cross-validated error?
  2. How many coefficients (excluding the intercept) are included in the most regularized model with error within 1 s.e. of the minimum?
  3. Run a RIDGE regression. Is the out-of-sample MSE higher or lower than for LASSO?
- Extra time: estimate an elastic net regression with α = 0.5
  - How would you tune α? Google caret or train...
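A starter sketch for loading the files and setting up glmnet. The file names are from the slide; the assumption that the outcome column is called y is mine, so adjust to the actual layout.

library(glmnet)
library(readr)

train <- read_csv("menti 200.csv")
test  <- read_csv("menti 200 test.csv")

x_tr <- as.matrix(train[, setdiff(names(train), "y")])
x_te <- as.matrix(test[,  setdiff(names(test),  "y")])

cv_lasso <- cv.glmnet(x_tr, train$y, alpha = 1)
# e.g. predict(cv_lasso, newx = x_te, s = "lambda.min") for question 1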
