
Prediction and Regularization
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
February 1st and 2nd
1/59

Overview
1. The prediction problem and an example of overfitting
2. The Bias-Variance Tradeoff
3. LASSO and RIDGE
4. Implementing LASSO and RIDGE via glmnet()
2/59

A Basic Prediction Model
- Suppose y is given by:
    y = f(X) + ε
- X is a vector of attributes
- ε has mean 0, variance σ², ε ⊥ X
- Our goal is to find a model f̂(X) that approximates f(X)
3/59

Suppose We Are Given 100 Observations of y
[Figure: scatter plot of the 100 observed outcomes; x-axis: Observation, y-axis: Outcome]
4/59

How Well Can We Predict Out-of-Sample Outcomes (y^oos)
[Figure: the out-of-sample outcomes to be predicted; x-axis: Observation, y-axis: Outcome and Prediction]
5/59

Predicting Out-of-Sample Outcomes (f̂(X^oos))
[Figure: out-of-sample outcomes overlaid with model predictions; x-axis: Observation, y-axis: Outcome and Prediction]
5/59

A Good Model Has Small Distance (y^oos − f̂(X^oos))²
[Figure: outcomes and predictions plotted together; a good model keeps the squared gaps small; x-axis: Observation, y-axis: Outcome and Prediction]
6/59

A Simple Metric of Fit is the MSE
- A standard measure of fit is the mean squared error (MSE) between y and f̂:
    MSE = E[(y − f̂(X))²]
- With N observations we can estimate this as (the short R snippet below computes this sample analogue):
    MSE-hat = (1/N) ∑_{i=1}^{N} (y_i − f̂(X_i))²
7/59
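The sample analogue above is one line in R. A minimal sketch on simulated toy data (the data and model here are illustrative assumptions, not the course dataset):

```r
# Toy illustration of the MSE estimate above (simulated data)
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)                 # y = f(x) + noise, with f(x) = 2x
fit <- lm(y ~ x)                        # a candidate model f_hat
mse_hat <- mean((y - predict(fit))^2)   # (1/N) * sum of squared prediction errors
mse_hat
```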

Overfitting
- Key tradeoff in building a model:
  - A model that fits the data you have
  - A model that will perform well on new data
- Easy to build a model that fits the data you have well
  - With enough parameters
- An f̂ fit too closely to one dataset may perform poorly out-of-sample
  - This is called overfitting
8/59

An Example of Overfitting
- On the Hub you will find a dataset called polynomial.csv
  - Variables y and x
  - Split into test and training data
- y is generated as a polynomial in x plus random noise:
    y_i = ∑_{p=0}^{P} θ_p x_i^p + ε_i
  - Don't know the order P…
9/59

Exercise:
- Fit a regression with a 2nd order polynomial in x:
    y_i = ∑_{p=0}^{2} θ_p x_i^p + ε_i
- What is the In-Sample MSE?
- What is the Out-of-Sample MSE?
10/59

Exercise:
- Fit a regression with a 25th order polynomial in x:
    y_i = ∑_{p=0}^{25} θ_p x_i^p + ε_i
- What is the In-Sample MSE?
- What is the Out-of-Sample MSE?
- Is the in-sample fit better or worse than the quadratic model?
- Is the out-of-sample fit better or worse than the quadratic model?
- If you finish early:
  - What order polynomial gives the best out-of-sample fit?
- (A hedged R sketch for this exercise follows this slide)
11/59
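One way to work through this exercise in R is sketched below. The variables y and x come from the slide; how polynomial.csv marks the train/test split is an assumption here (a column called sample), so adjust to the actual file on the Hub.

```r
# Sketch of the polynomial exercise; the split indicator `sample` is an assumption
poly_data <- read.csv("polynomial.csv")
train <- subset(poly_data, sample == "train")
test  <- subset(poly_data, sample == "test")

fit_mse <- function(p) {
  fit <- lm(y ~ poly(x, p), data = train)            # p-th order polynomial in x
  c(in_sample  = mean((train$y - predict(fit))^2),
    out_sample = mean((test$y  - predict(fit, newdata = test))^2))
}

fit_mse(2)     # quadratic model
fit_mse(25)    # 25th-order model: lower in-sample MSE, typically higher out-of-sample MSE

# If you finish early: which order gives the best out-of-sample fit?
oos <- sapply(1:25, function(p) fit_mse(p)["out_sample"])
which.min(oos)
```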

Formalizing Overfitting: Bias-Variance Tradeoff
- Consider an algorithm to build model f̂(X) given training data D
  - Could write f̂_D(X)
- Consider the MSE at some particular out-of-sample point X0:
    MSE(X0) = E[(y0 − f̂(X0))²]
  - Here the expectation is taken over y and all D
- We may show that:
    MSE(X0) = (E[f̂(X0)] − f(X0))² + E[(f̂(X0) − E[f̂(X0)])²] + σ²
            =          Bias²      +          Variance        + σ²
12/59

Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X0) = E[(y0 − f̂(X0))²]
         = E[(f(X0) − f̂(X0))²] + E[ε²] + 2E[f(X0) − f̂(X0)]E[ε]
         = E[(f(X0) − f̂(X0))²] + σ_ε²
13/59

Formalizing Overfitting: Bias-Variance Tradeoff
MSE(X0) = E[(f(X0) − f̂(X0))²] + σ_ε²
         = E[(f(X0) − E[f̂(X0)] − f̂(X0) + E[f̂(X0)])²] + σ_ε²
         = E[(f(X0) − E[f̂(X0)])²] + E[(f̂(X0) − E[f̂(X0)])²]
             − 2E[(f(X0) − E[f̂(X0)])(f̂(X0) − E[f̂(X0)])] + σ_ε²
         = (f(X0) − E[f̂(X0)])² + E[(f̂(X0) − E[f̂(X0)])²]
             − 2(f(X0) − E[f̂(X0)]) E[f̂(X0) − E[f̂(X0)]] + σ_ε²        (the last expectation = 0)
         = (f(X0) − E[f̂(X0)])² + E[(f̂(X0) − E[f̂(X0)])²] + σ_ε²
14/59

The Bias-Variance Tradeoff
MSE(X0) = (E[f̂(X0)] − f(X0))² + E[(f̂(X0) − E[f̂(X0)])²] + σ²
         =          Bias²      +          Variance        + σ²

- This is known as the Bias-Variance Tradeoff
- More complex models can pick up subtle elements of the true f(X)
  - Less bias
- More complex models vary more across different training datasets
  - More variance
- Introducing a little bias may allow a substantial decrease in variance
  - And hence reduce MSE (or Prediction Error); the small simulation sketch below illustrates this
15/59
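The tradeoff can also be seen numerically. Below is a small Monte Carlo sketch, entirely illustrative: the true f, the noise level, and the two candidate models are assumptions, not anything from the course data.

```r
# Illustrative bias-variance simulation at a single point X0
set.seed(1)
f  <- function(x) sin(2 * x)                # assumed "true" f(X)
x0 <- 1                                     # the out-of-sample point X0

fits <- replicate(500, {                    # 500 independent training sets D
  x <- runif(50, -2, 2)
  y <- f(x) + rnorm(50, sd = 0.5)
  d <- data.frame(x = x, y = y)
  new <- data.frame(x = x0)
  c(simple   = as.numeric(predict(lm(y ~ x, data = d),           newdata = new)),
    flexible = as.numeric(predict(lm(y ~ poly(x, 10), data = d), newdata = new)))
})

bias2    <- (rowMeans(fits) - f(x0))^2      # (E[f_hat(X0)] - f(X0))^2
variance <- apply(fits, 1, var)             # E[(f_hat(X0) - E[f_hat(X0)])^2]
rbind(bias2, variance)                      # flexible model: less bias, more variance
```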

Depicting the Bias Variance Tradeoff
16/59

Depicting the Bias Variance Tradeoff
17/59

Depicting the Bias Variance Tradeoff
18/59

Regularization and OLS
    y_i = X_i′β + ε_i

- Today: two tweaks on linear regression
  - RIDGE and LASSO
- Both operate by regularizing or shrinking components of β̂ toward 0
  - This introduces bias, but may reduce variance
- Simple intuition:
  - Force β̂_k to be small: won't vary so much across training data sets
  - Take the extreme case: β̂_k = 0 for all k…
    - No variance
19/59

The RIDGE Objective
    y_i = X_i′β + ε_i

- OLS objective:
    β̂^OLS = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²
- Ridge objective:
    β̂^RIDGE = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²   subject to   ∑_{k=1}^{K} β_k² ≤ c
20/59

The RIDGE Objective
- Equivalent to minimizing the penalized residual sum of squares:
    PRSS(β)_ℓ2 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} β_k²
- PRSS(β)_ℓ2 is convex ⇒ unique solution
- λ is called the penalty
  - Penalizes large β_k
- Different values of λ provide different β̂_λ^RIDGE
  - As λ → 0 we have β̂_λ^RIDGE → β̂^OLS
  - As λ → ∞ we have β̂_λ^RIDGE → 0
21/59

Aside: Standardization
- By convention y_i and X_i are assumed to be mean 0
- X_i should also be standardized (unit variance)
  - All β_k are treated the same by the penalty, so we don't want different scalings across predictors
22/59

Closed Form Solution to Ridge Objective
- The ridge solution is given by (you will prove this):
    β̂_λ^RIDGE = (X′X + λI_K)⁻¹ X′y
- Here X is the matrix with the X_i as rows
  - I_K is the K×K identity matrix
- Note that λI_K makes X′X + λI_K invertible even if X′X isn't
  - For example if K > N
  - This was actually the original motivation for the problem
- (A small R sketch of this formula follows below)
23/59
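A small R sketch of the closed-form solution on simulated data. The data-generating process below mirrors the working example used later in the deck, but it is an assumption here; X is standardized and y demeaned as on the previous slide.

```r
# Closed-form ridge on simulated, standardized data (illustrative setup)
set.seed(1)
n <- 100; K <- 50
X <- scale(matrix(rnorm(n * K), n, K))           # standardized predictors
y <- drop(X[, 1:5] %*% (5:1)) + rnorm(n)
y <- y - mean(y)                                 # demeaned outcome

ridge_beta <- function(lambda) {
  solve(crossprod(X) + lambda * diag(K), crossprod(X, y))   # (X'X + lambda*I_K)^(-1) X'y
}

round(head(ridge_beta(10)), 2)                   # shrunk coefficients
max(abs(ridge_beta(0) - coef(lm(y ~ X - 1))))    # lambda = 0 recovers OLS (up to rounding)
```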

RIDGE is Biased
- Define A = X′X
    β̂_λ^RIDGE = (X′X + λI_K)⁻¹ X′y
              = (A + λI_K)⁻¹ A (A⁻¹X′y)
              = (A[I_K + λA⁻¹])⁻¹ A (A⁻¹X′y)
              = (I_K + λA⁻¹)⁻¹ A⁻¹ A ((X′X)⁻¹X′y)
              = (I_K + λA⁻¹)⁻¹ β̂^OLS
- Therefore, if λ ≠ 0,
    E[β̂_λ^RIDGE] = E[(I_K + λA⁻¹)⁻¹ β̂^OLS] ≠ β
24/59

Pros and cons of RIDGE
- Pros:
  - Simple, closed form solution
  - Can deal with K >> N and multicollinearity
  - Introduces bias but can improve out-of-sample fit
- Cons:
  - Shrinks coefficients but will not simplify the model by eliminating variables
25/59

LASSO Objective
- RIDGE will include all K predictors in the final model
  - No simplification
- LASSO is a relatively recent alternative that overcomes this:
    β̂^LASSO = argmin_β ∑_{i=1}^{N} (y_i − X_i′β)²   subject to   ∑_{k=1}^{K} |β_k| ≤ c
- Can also write this as minimizing:
    PRSS(β)_ℓ1 = ∑_{i=1}^{N} (y_i − X_i′β)² + λ ∑_{k=1}^{K} |β_k|
26/59

LASSO and Sparse Models
- Like RIDGE, LASSO will shrink the β_k toward 0
- However, the ℓ1 penalty will force some coefficient estimates to be exactly 0 if λ is large enough
  - Sparse models let us ignore some features
- Again, different values of λ provide different β̂_λ^LASSO
  - Need to find a good choice of λ
27/59

Why Does LASSO Set Some β_k to 0?
28/59

LASSO Details
- Unlike RIDGE, LASSO has no closed form solution
  - Requires numerical methods
- Neither LASSO nor RIDGE universally dominates
29/59

Elastic Net: Combining LASSO and RIDGE Penalties
- Simplest version of elastic net (nests LASSO and RIDGE):
    β̂^elastic = argmin_β { (1/N) ∑_{i=1}^{N} (y_i − X_i′β)² + λ [ α ∑_{k=1}^{K} |β_k| + (1−α) ∑_{k=1}^{K} β_k² ] }
- α ∈ [0, 1] weights LASSO- vs. RIDGE-style penalties
  - α = 1 is LASSO
  - α = 0 is RIDGE
30/59

Implementing LASSO
1. An example of a prediction problem
2. Elastic Net and LASSO in R
3. How to choose hyperparameter λ
4. Cross-validation in R
31/59

An Example of a Prediction Problem
- Suppose we see 100 observations of some outcome y_i
  - Example: residential real estate prices in London (i.e. home prices)
- We have 50 characteristics x_1i, x_2i, ···, x_50i that might predict y_i
  - E.g. number of rooms, size, neighborhood dummy, etc.
- Want to build a model that helps us predict y_i out of sample
  - I.e. the price of some other house in London
32/59

We Are Given 100 Observations of y_i
[Figure: scatter plot of the 100 observed outcomes; x-axis: Observation, y-axis: Outcome]
33/59

How Well Can We Predict Out-of-Sample Outcomes (y_i^oos)
[Figure: the out-of-sample outcomes to be predicted; x-axis: Observation, y-axis: Outcome and Prediction]
34/59

Using x_1i, x_2i, ···, x_50i to Predict y_i
- The goal is to use x_1i, x_2i, ···, x_50i to predict any y_i^oos
  - If you give us number of rooms, size, etc., we will tell you home price
- Need to build a model f̂(·):
    ŷ_i = f̂(x_1i, x_2i, ···, x_50i)
- A good model will give us predictions close to y_i^oos
  - We can accurately predict prices for other homes
35/59

A Good Model Has Small Distance (y_i^oos − ŷ_i^oos)²
[Figure: out-of-sample outcomes and predictions plotted together; x-axis: Observation, y-axis: Outcome and Prediction]
36/59

Our Working Example: Suppose Only a Few x_ki Matter
- We see y_i and x_1i, x_2i, ···, x_50i
- The true model (which we will pretend we don't know) is:
    y_i = 5·x_1i + 4·x_2i + 3·x_3i + 2·x_4i + 1·x_5i + ε_i
- Only the first 5 attributes matter!
  - In other words β_1 = 5, β_2 = 4, β_3 = 3, β_4 = 2, β_5 = 1
  - β_k = 0 for k = 6, 7, ···, 50
37/59

Prediction Using OLS
- A first approach would be to try to predict using OLS:
    y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_3i + β_4x_4i + β_5x_5i + β_6x_6i + β_7x_7i + ··· + β_50x_50i + v_i
- Have to estimate 51 different parameters with only 100 data points
  - ⇒ not going to get precise estimates
  - ⇒ out-of-sample predictions will be bad
- (An R sketch of this OLS baseline follows below)
38/59
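An R sketch of this OLS baseline. The file names and column layout (an outcome y plus x1, …, x50) are assumptions, not the actual course files.

```r
# OLS baseline for the working example (file and column names are assumptions)
train <- read.csv("training_data.csv")     # 100 observations of y, x1, ..., x50
test  <- read.csv("test_data.csv")         # out-of-sample observations

ols_fit  <- lm(y ~ ., data = train)        # 51 parameters from only 100 observations
ols_pred <- predict(ols_fit, newdata = test)
mean((test$y - ols_pred)^2)                # out-of-sample MSE; the slides report 39.33 for their data
```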

OLS Coefficients With 100 Observations Aren’t Great
[Figure: estimated OLS coefficients for the 50 predictors (100 observations); x-axis: X Variable, y-axis: Coefficient Estimate]
39/59

OLS Doesn't Give Close Predictions for New y_i^oos:
[Figure: out-of-sample outcomes vs. OLS predictions; x-axis: Observation, y-axis: Outcome and Prediction]

    Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 39.33
40/59

Aside: OLS Does Much Better With 10,000 Observations
[Figure: estimated OLS coefficients with 10,000 training observations; x-axis: X Variable, y-axis: Coefficient Estimate]
41/59

OLS Out-of-Sample Predictions: 100 Training Obs.
[Figure: out-of-sample outcomes vs. OLS predictions (100 training observations); x-axis: Observation, y-axis: Outcome and Prediction]

    Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 39.33
42/59

OLS Out-of-Sample Predictions: 10,000 Training Obs.
[Figure: out-of-sample outcomes vs. OLS predictions (10,000 training observations); x-axis: Observation, y-axis: Outcome and Prediction]

    Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 18.92
43/59

Solution to The OLS Problem: Regularization
- With 100 observations OLS didn't do very well
  - Solution: regularization
  - LASSO/RIDGE/Elastic Net
- Simplest version of elastic net (nests LASSO and RIDGE):
    β̂^elastic = argmin_β { (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1x_1i − ··· − β_Kx_Ki)² + λ [ α ∑_{k=1}^{K} |β_k| + (1−α) ∑_{k=1}^{K} β_k² ] }
- For λ = 0 this is just OLS
- For λ > 0, α = 1 this is LASSO:
    β̂^LASSO = argmin_β { (1/N) ∑_{i=1}^{N} (y_i − β_0 − β_1x_1i − ··· − β_Kx_Ki)² + λ ∑_{k=1}^{K} |β_k| }
44/59

Implementing Elastic Net in R
- The big question when running LASSO is the choice of λ
- By default, glmnet(·) tries 100 different choices for λ
  - Starts with λ just large enough that all β_k = 0
  - Proceeds with steps towards λ = 0
- For each λ, we estimate the corresponding coefficients β̂^lasso(λ)
  - How do we decide which one is best?
- (A minimal glmnet() sketch follows below)
45/59
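A minimal glmnet() sketch. Here x_train is assumed to be the 100 × 50 predictor matrix and y_train the outcome vector from the working example; glmnet() expects a numeric matrix, not a data frame.

```r
# Fitting the LASSO path with glmnet (x_train / y_train are assumed objects)
library(glmnet)

lasso_fit <- glmnet(x_train, y_train, alpha = 1)   # alpha = 1: LASSO penalty
length(lasso_fit$lambda)                           # the lambda sequence (up to 100 values)
plot(lasso_fit, xvar = "lambda")                   # coefficient paths, as on the later slide
coef(lasso_fit, s = 1)                             # coefficients at lambda = 1; many are exactly 0
```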

LASSO Coefficients With 100 Observations (λ=0.2)
[Figure: LASSO coefficient estimates at λ = 0.2; x-axis: X Variable, y-axis: Coefficient Estimate]
46/59

LASSO Coefficients With 100 Observations (λ=1)
[Figure: LASSO coefficient estimates at λ = 1; x-axis: X Variable, y-axis: Coefficient Estimate]
47/59

LASSO Coefficients With 100 Observations (λ=3)
[Figure: LASSO coefficient estimates at λ = 3; x-axis: X Variable, y-axis: Coefficient Estimate]
48/59

LASSO Coefficients For All λ
[Figure: LASSO coefficient paths over the full λ sequence; x-axis: Log Lambda, y-axis: Coefficients; top axis: number of nonzero coefficients]
49/59

How to Choose λ (tuning)
- One option would be to look at performance out-of-sample
  - Compare the out-of-sample mean squared error MSE^oos(λ) for different values of λ
- For example:
  - MSE^oos(0.2) = 26.18
  - MSE^oos(1) = 22.59
  - MSE^oos(3) = 56.31
- Would like a way of choosing that does not require going out of sample…
50/59

How to Choose λ?
- Need a disciplined way of choosing λ without going out of sample
- Could split the training data into a training and validation sample
  - Estimate the model on the training data for many λ
  - Compute the MSE on the validation sample
  - Choose the λ that gives the smallest MSE
- What if there is something weird about your particular validation sample?
51/59

K-fold Cross Validation
- The most common approach is K-fold cross validation
- Partition the training data into K separate subsets of equal size
  - Usually K is either 5 or 10
- For any k = 1, 2, ···, K, exclude the kth fold and estimate the model for many λ on the remaining data
  - For each λ, compute MSE^cv_{k,λ} on the excluded fold
- Do this for all K folds:
  - Now you have K estimates of MSE^cv_{k,λ} for each λ
52/59

K-fold Cross Validation
- K estimates of MSE^cv_{k,λ} for each λ
- Compute the mean of the MSEs for each λ:
    MSE-bar^cv_λ = (1/K) ∑_{k=1}^{K} MSE^cv_{k,λ}
- Choose the λ that gives a small MSE-bar^cv_λ
  - Can also compute standard deviations
53/59

How to Choose λ: k-fold Cross Validation
54/59

How to Choose λ: k-fold Cross Validation
- Partition the sample into k equal folds
  - The default for R is k = 10
  - For our sample, this means 10 folds with 10 observations each
- Cross-validation proceeds in several steps (see the cv.glmnet() sketch after this slide):
  1. Choose k−1 folds (9 folds in our example, with 10 observations each)
  2. Run LASSO on these 90 observations
  3. Find β̂^lasso(λ) for all 100 λ
  4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
  5. Repeat for all 10 possible combinations of k−1 folds
- This provides 10 estimates of MSE(λ) for each λ
- Can construct means and standard deviations of MSE(λ) for each λ
- Choose the λ that gives a small mean MSE(λ)
55/59
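These steps are automated by cv.glmnet(), sketched below. As before, x_train, y_train, x_test, and y_test are assumed objects holding the training and out-of-sample data.

```r
# 10-fold cross-validation for the LASSO (x_train, y_train, x_test, y_test assumed)
library(glmnet)

set.seed(1)                                    # fold assignment is random
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

plot(cv_fit)               # mean cross-validated MSE with s.e. bars against log(lambda)
cv_fit$lambda.min          # lambda minimizing the mean cross-validated MSE
cv_fit$lambda.1se          # most regularized lambda within one s.e. of the minimum

pred_oos <- predict(cv_fit, newx = x_test, s = "lambda.min")
mean((y_test - pred_oos)^2)                    # out-of-sample MSE at lambda.min
```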

Cross-Validated Mean Squared Errors for All λ
[Figure: mean cross-validated MSE with standard-error bars against log(Lambda); x-axis: log(Lambda), y-axis: Mean-Squared Error; top axis: number of nonzero coefficients]
56/59

λ = 0.50 Minimizes Cross-Validation MSE
[Figure: LASSO coefficient estimates at λ = 0.50; x-axis: X Variable, y-axis: Coefficient Estimate]

    OOS Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 22.39
57/59

λ = 0.80 Is Most Regularized Within 1 S.E.
[Figure: LASSO coefficient estimates at λ = 0.80; x-axis: X Variable, y-axis: Coefficient Estimate]

    OOS Mean Squared Error = (1/100) ∑_{i=1}^{100} (y_i^oos − ŷ_i^oos)² = 21.79
58/59

Now Do LASSO Yourself
- On the Hub you will find two files:
  - Training data: menti 200.csv
  - Testing data (out of sample): menti 200 test.csv
- There are 200 observations and 200 predictors
- Three questions (a hedged starting sketch follows this slide):
  1. Run LASSO on the training data: what is the out-of-sample MSE for the λ that gives the minimum mean cross-validated error?
  2. How many coefficients (excluding the intercept) are included in the most regularized model with error within 1 s.e. of the minimum?
  3. Run a RIDGE regression. Is the out-of-sample MSE higher or lower than for LASSO?
- Extra time: estimate an elastic net regression with α = 0.5
  - How would you tune α? Google caret or train…
59/59
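A hedged starting sketch for the exercise. The column layout (an outcome column y plus 200 predictors) is an assumption, and the file names on the Hub may differ slightly; adjust as needed.

```r
# Starting point for the exercise (file and column names are assumptions)
library(glmnet)

train <- read.csv("menti 200.csv")         # actual file names on the Hub may differ
test  <- read.csv("menti 200 test.csv")
x_tr <- as.matrix(train[, names(train) != "y"]); y_tr <- train$y
x_te <- as.matrix(test[,  names(test)  != "y"]); y_te <- test$y

set.seed(1)
cv_lasso <- cv.glmnet(x_tr, y_tr, alpha = 1)                       # Q1: LASSO
mean((y_te - predict(cv_lasso, newx = x_te, s = "lambda.min"))^2)  # OOS MSE at lambda.min

b_1se <- as.matrix(coef(cv_lasso, s = "lambda.1se"))               # Q2: coefficients at lambda.1se
sum(b_1se[-1, ] != 0)                                              # nonzero, excluding intercept

cv_ridge <- cv.glmnet(x_tr, y_tr, alpha = 0)                       # Q3: RIDGE (alpha = 0)
mean((y_te - predict(cv_ridge, newx = x_te, s = "lambda.min"))^2)

cv_enet <- cv.glmnet(x_tr, y_tr, alpha = 0.5)                      # extra: elastic net
```

For tuning α itself, one option is caret::train(), which can cross-validate over a grid of (α, λ) pairs.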