
ECONOMETRICS I ECON GR5411
Lecture 25 – Big Data (MSPE)
by Seyhan Erden, Columbia University
MA in Economics

Big Data and Ridge – Outline:
1. What is “big data”?
2. Prediction with many predictors: the MSPE, OLS and the principle of shrinkage
3. Ridge regression

1. What is “Big Data”?
“Big Data” means many things:
ØData sets with many observations (millions)
ØData sets with many variables (thousands or more)
ØData sets with nonstandard data types, like texts, voice, or images

“Big Data” has many different applications:
ØPrediction using many predictors
ØGiven your browsing history, what products are you most likely to shop for now?
ØGiven your loan application profile, how likely are you to repay a bank loan?
ØPrediction using highly nonlinear models (for which you need many observations)
ØRecognition problems, like facial and voice recognition

“Big Data” has different jargon
”Big Data” has different jargon, which makes it seem very different from statistics and econometrics…
Ø“Machine learning”: a computer (machine) uses a large data set to learn (e.g., about your online shopping preferences)
But at its core, machine learning builds on familiar tools of prediction:
ØOne of the major big data applications is prediction with many predictors. We treat this as a regression problem, but we need new methods that go beyond OLS.
ØFor prediction, we do not need causal coefficients.

2. Prediction with many predictors:
The MSPE, OLS, and the principle of shrinkage
The many-predictor problem:
ØThe goal is to provide a good prediction of some outcome 𝑌 given a large number of 𝑋’𝑠, when the number of 𝑋’𝑠 (𝑘) is large relative to the number of observations (𝑛) – in fact, maybe 𝑘 > 𝑛!

The goal is good out-of-sample prediction
ØThe estimation sample is the 𝑛 observations used to estimate the prediction model
ØThe prediction is made using the estimated model for an out-of-sample (OOS) observation – an observation not in the estimation sample.

The Predictive Regression Model
The standardized predictive regression model is the linear model, with all the 𝑋’𝑠 normalized (standardized) to have mean of zero and a standard deviation of one, and 𝑌 deviated from its mean:
$$Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i$$
The intercept is excluded because all the variables have mean zero. Let $X^*_{1i}, \dots, X^*_{ki}, Y^*_i$, $i = 1, \dots, n$, denote the data as originally collected, where $X^*_{ji}$ is the $i$th observation on the $j$th original regressor.
In matrix form:
$$y = X\beta + u$$

The Predictive Regression Model (cont’d)
Throughout this lecture, we use standardized 𝑋’𝑠, demeaned 𝑌’𝑠, and the standardized predictive regression model:
$$Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i$$
ØWe assume $E[Y|X] = \beta_1 X_1 + \dots + \beta_k X_k$, so $E[u_i|X_i] = 0$.
ØBecause all the variables, including 𝑌, are deviated from their means, the intercept is zero – so it is omitted from the model.
ØThe model allows for nonlinearities in 𝑋 by letting some of the 𝑋’𝑠 be squares, cubes, logs, interactions, etc.
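As a concrete illustration of this setup, here is a minimal Python sketch that standardizes the regressors and demeans 𝑌. The array names (X_raw, Y_raw) and the simulated data are hypothetical, not from the lecture.

```python
import numpy as np

def standardize_data(X_raw, Y_raw):
    """Standardize each regressor to mean 0 and sd 1, and demean Y,
    as in the standardized predictive regression model."""
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
    Y = Y_raw - Y_raw.mean()
    return X, Y

# Hypothetical example with simulated data: n = 200 observations, k = 5 regressors
rng = np.random.default_rng(0)
X_raw = 10.0 + 3.0 * rng.normal(size=(200, 5))
Y_raw = X_raw @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)
X, Y = standardize_data(X_raw, Y_raw)
```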

The Mean Squared Prediction Error
The Mean Squared Prediction Error (MSPE) is the expected value of the squared error made by predicting 𝑌 for an observation not in the estimation data set:
$$MSPE = E\left[\left(Y^{oos} - \hat{Y}\left(X^{oos}\right)\right)^2\right]$$
where:
Ø𝑌 is the variable to be predicted
Ø𝑋 denotes the 𝑘 variables used to make the prediction; $X^{oos}$, $Y^{oos}$ are the values of 𝑋 and 𝑌 in the out-of-sample data set.
ØThe prediction $\hat{Y}\left(X^{oos}\right)$ uses a model estimated using the estimation data set, evaluated at $X^{oos}$.
ØThe MSPE measures the expected quality of the prediction made for an out-of-sample observation.
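If an out-of-sample data set is actually on hand, the population expectation above can be approximated by the average squared prediction error over the OOS observations. A minimal sketch, assuming standardized estimation data (X_est, Y_est) and OOS data (X_oos, Y_oos) as NumPy arrays; the names are hypothetical placeholders.

```python
import numpy as np

def estimate_mspe(X_est, Y_est, X_oos, Y_oos):
    """Fit OLS (no intercept; data already standardized/demeaned) on the
    estimation sample, then average squared prediction errors out of sample."""
    beta_hat, *_ = np.linalg.lstsq(X_est, Y_est, rcond=None)   # OLS coefficients
    pred_err = Y_oos - X_oos @ beta_hat                        # Y_oos - Yhat(X_oos)
    return np.mean(pred_err ** 2)
```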

The Mean Squared Prediction Error
$$MSPE = E\left[\left(Y^{oos} - \hat{Y}\left(X^{oos}\right)\right)^2\right], \qquad \text{where } \hat{Y}\left(X^{oos}\right) = \hat{\beta}_1 X_1^{oos} + \dots + \hat{\beta}_k X_k^{oos}$$
The prediction error is
$$Y^{oos} - \left(\hat{\beta}_1 X_1^{oos} + \dots + \hat{\beta}_k X_k^{oos}\right)$$
$$= \beta_1 X_1^{oos} + \dots + \beta_k X_k^{oos} + u^{oos} - \left(\hat{\beta}_1 X_1^{oos} + \dots + \hat{\beta}_k X_k^{oos}\right)$$
$$= u^{oos} - \left[\left(\hat{\beta}_1 - \beta_1\right)X_1^{oos} + \dots + \left(\hat{\beta}_k - \beta_k\right)X_k^{oos}\right]$$
Then $MSPE = E\left[\left(Y^{oos} - \hat{Y}\left(X^{oos}\right)\right)^2\right]$ is … next slide

The Mean Squared Prediction Error
$$MSPE = E\left[\left(u^{oos} - \left[\left(\hat{\beta}_1 - \beta_1\right)X_1^{oos} + \dots + \left(\hat{\beta}_k - \beta_k\right)X_k^{oos}\right]\right)^2\right]$$
$$= \sigma_u^2 + E\left[\left(\left(\hat{\beta}_1 - \beta_1\right)X_1^{oos} + \dots + \left(\hat{\beta}_k - \beta_k\right)X_k^{oos}\right)^2\right]$$
(The cross term vanishes because $u^{oos}$ has conditional mean zero and is independent of the estimation sample used to compute the $\hat{\beta}$’s.)
The first term, $\sigma_u^2$, is the variance of the oracle prediction error – the prediction error made using the true (unknown) conditional mean, $E[Y|X]$. This term is the MSPE of the oracle forecast – it can’t be beaten!
The second term is the contribution to the prediction error arising from the estimated regression coefficients: it is the cost, measured in terms of increased mean squared prediction error, of having to estimate the coefficients instead of using the oracle prediction. It arises because the 𝛽’s aren’t known and must be estimated using the estimation sample.

The Oracle Prediction
The oracle prediction is the best-possible prediction – the prediction that minimizes the MSPE – if you knew the joint distribution of 𝑌 and 𝑋.
The oracle prediction is the conditional expectation of 𝑌 given 𝑋, $E\left[Y^{oos} \mid X = X^{oos}\right]$.
ØSuppose not. Then the forecast error could be predicted using $X^{oos}$ – and if so, the forecast couldn’t have been the best possible, because it could be improved using the predicted error.
ØThe math: Suppose that your prediction of 𝑌, given the random variable 𝑋, is 𝑔(𝑋). The prediction error is 𝑌 − 𝑔(𝑋), and the quadratic loss associated with this prediction is defined as $Loss = E\left[\left(Y - g(X)\right)^2\right]$. We must show that, of all possible functions 𝑔(𝑋), the loss is minimized by $g(X) = E[Y|X]$.
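The minimization itself is only asserted on the slide; a short sketch of the standard argument (not spelled out in the original) is:
$$E\left[\left(Y - g(X)\right)^2\right] = E\left[\left(Y - E[Y|X]\right)^2\right] + E\left[\left(E[Y|X] - g(X)\right)^2\right] + 2\,E\left[\left(Y - E[Y|X]\right)\left(E[Y|X] - g(X)\right)\right]$$
The cross term is zero by the law of iterated expectations, since $E\left[\left(Y - E[Y|X]\right)h(X)\right] = E\left[h(X)\,E\left[Y - E[Y|X] \mid X\right]\right] = 0$ for any function $h(X)$. The second term is nonnegative and equals zero exactly when $g(X) = E[Y|X]$, so that choice minimizes the loss; the minimized value is the oracle MSPE, $E\left[\left(Y - E[Y|X]\right)^2\right]$.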

The MSPE for Linear Regression Estimated by OLS:
Let the 𝑘×1 vector $X^{oos}$ denote the values of the 𝑋’𝑠 for the out-of-sample observation (“oos”) to be predicted. With this notation, the MSPE on the slide above can be written using matrix notation as
$$MSPE = \sigma_u^2 + E\left[\left(\left(\hat{\beta} - \beta\right)'X^{oos}\right)^2\right] \qquad (1)$$
where $\hat{\beta}$ denotes any estimator of 𝛽, not just the OLS estimator.
Under the least squares assumptions for prediction, the out-of-sample observation is assumed to be an i.i.d. draw from the same population as the estimation sample.

Under this assumption, the MSPE in eq. (1) can be written as
$$MSPE = \sigma_u^2 + \mathrm{trace}\left(E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'\right]Q_X\right)$$
where $Q_X = E\left[XX'\right]$, and we can write the 2nd term of eq. (1) as
$$E\left[\left(\left(\hat{\beta} - \beta\right)'X^{oos}\right)^2\right] = E\left[X^{oos\prime}\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'X^{oos}\right]$$
$$= \mathrm{trace}\left(E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'X^{oos}X^{oos\prime}\right]\right) = \mathrm{trace}\left(E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'\right]Q_X\right)$$
because the out-of-sample observation is independent of the estimation observations and drawn from the same distribution, hence $E\left[X^{oos}X^{oos\prime}\right] = Q_X$.

The MSPE for OLS can then be found by using the following fact:
$$\hat{\beta} = \left(X'X\right)^{-1}X'y = \left(X'X\right)^{-1}X'\left(X\beta + u\right) = \beta + \left(X'X\right)^{-1}X'u$$
Hence,
$$\hat{\beta} - \beta = \left(X'X\right)^{-1}X'u$$
Plugging this back in,
$$E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'\right] = E\left[\left(X'X\right)^{-1}X'uu'X\left(X'X\right)^{-1}\right]$$
$$= E\left[\left(X'X\right)^{-1}X'E\left[uu'|X\right]X\left(X'X\right)^{-1}\right]$$
$$= E\left[\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}\right]\sigma_u^2 = E\left[\left(X'X\right)^{-1}\right]\sigma_u^2$$
where the last line uses homoskedasticity, $E\left[uu'|X\right] = \sigma_u^2 I_n$.

Now, plugging this back into the MSPE:
$$MSPE_{OLS} = \sigma_u^2 + \frac{1}{n}\,\mathrm{trace}\left(E\left[\left(\frac{X'X}{n}\right)^{-1}\right]Q_X\right)\sigma_u^2 \qquad (2)$$
This is the MSPE for a prediction made using the OLS estimator under the least squares assumptions for prediction with homoskedastic errors.
When $n > k$ and $X'X/n \cong Q_X$ (specifically, for fixed 𝑘, $X'X/n \xrightarrow{p} Q_X$), we have
$$\mathrm{trace}\left(E\left[\left(\frac{X'X}{n}\right)^{-1}\right]Q_X\right) \cong \mathrm{trace}\left(Q_X^{-1}Q_X\right) = \mathrm{trace}\left(I_k\right) = k$$

Now, plugging this back into eq. (2):
$$MSPE_{OLS} \cong \left(1 + \frac{k}{n}\right)\sigma_u^2$$
This result is valid under homoskedasticity and when 𝑘/𝑛 is small.
ØFor a given 𝑛, the MSPE of OLS increases linearly with the number of predictors 𝑘 – a big problem with many predictors!
ØIs there an estimator for which the MSPE increases more slowly than OLS as more predictors are added (say, for hundreds or thousands of predictors)?
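A small Monte Carlo sketch of the OLS result above (the sample sizes, number of replications, and function name are illustrative choices, not from the lecture): it repeatedly draws an estimation sample plus one out-of-sample observation and compares the simulated MSPE of OLS with $\left(1 + k/n\right)\sigma_u^2$, which should roughly agree when 𝑘/𝑛 is small.

```python
import numpy as np

def simulate_mspe_ols(n=200, k=10, sigma_u=1.0, n_reps=5000, seed=1):
    """Monte Carlo approximation of the MSPE of OLS with k standardized
    regressors and homoskedastic errors; setting true beta = 0 is WLOG here,
    since the distribution of beta_hat - beta does not depend on beta."""
    rng = np.random.default_rng(seed)
    sq_err = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.normal(size=(n, k))                      # estimation sample
        y = rng.normal(scale=sigma_u, size=n)            # y = u since beta = 0
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        x_oos = rng.normal(size=k)                       # one out-of-sample draw
        u_oos = rng.normal(scale=sigma_u)
        sq_err[r] = (u_oos - x_oos @ beta_hat) ** 2      # squared prediction error
    return sq_err.mean()

print("simulated MSPE of OLS :", round(simulate_mspe_ols(), 3))
print("(1 + k/n) * sigma_u^2 :", (1 + 10 / 200) * 1.0 ** 2)
```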

The Principle of Shrinkage
In the 1950s, statisticians figured out that you could reduce the MSPE relative to OLS by allowing the estimator to be biased in the right way.
ØWhen the 𝑋’𝑠 are uncorrelated, these estimators are biased towards zero – or “shrunk” towards zero – and have the form
$$\hat{\beta}^{JS} = c\hat{\beta}$$
where $0 < c < 1$ and 𝐽𝑆 stands for James-Stein.
ØBut how could introducing bias possibly help???
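The next slide explains why. As a numerical preview, here is a small Monte Carlo sketch (all parameter values are illustrative, not from the lecture) comparing the estimated MSPE of OLS ($c = 1$) with the shrunk predictor $c\hat{\beta}$ when 𝑘 is large relative to 𝑛:

```python
import numpy as np

def mspe_of_shrinkage(c, n=50, k=40, sigma_u=1.0, n_reps=4000, seed=2):
    """Monte Carlo MSPE of the shrinkage predictor c*beta_hat (c = 1 is OLS)."""
    rng = np.random.default_rng(seed)
    beta = np.full(k, 0.1)                               # small true coefficients
    sq_err = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.normal(size=(n, k))                      # estimation sample
        y = X @ beta + rng.normal(scale=sigma_u, size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        x_oos = rng.normal(size=k)                       # out-of-sample observation
        y_oos = x_oos @ beta + rng.normal(scale=sigma_u)
        sq_err[r] = (y_oos - x_oos @ (c * beta_hat)) ** 2
    return sq_err.mean()

for c in (1.0, 0.5, 0.25):                               # c = 1 is plain OLS
    print(f"c = {c:4.2f}  estimated MSPE = {mspe_of_shrinkage(c):.3f}")
```

In this particular simulation, shrinking toward zero lowers the estimated MSPE even though $c\hat{\beta}$ is biased, anticipating the bias-variance tradeoff described next.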

The Principle of Shrinkage
ØThe James-Stein shrinkage estimator: $\hat{\beta}^{JS} = c\hat{\beta}$, where $0 < c < 1$.
ØAs 𝑐 gets smaller:
ØThe squared bias of the estimator increases.
ØBut the variance decreases (the variance of $c\hat{\beta}$ is $c^2$ times that of $\hat{\beta}$, and the square of a number between zero and one is smaller than the number itself).
ØThis produces a bias-variance tradeoff.
ØIf 𝑘 is large, the benefit of smaller variance can beat out the cost of larger bias, for the right choice of 𝑐 – thus reducing the MSPE.
ØThe estimators we will learn here all have a shrinkage interpretation.

Estimating the MSPE
The MSPE is a bit tricky to estimate – it isn’t just the regression SER, because it refers to an out-of-sample, not an in-sample, observation.
Split-sample estimation of the MSPE
ØThis method simulates the out-of-sample prediction exercise – but using only the estimation sample (which is all you have!).
1. Estimate the model using half the estimation sample.
2. Use the estimated model to predict 𝑌 for the other half of the data – called the “reserve” or the “test” sample – and calculate the prediction errors.
3. Estimate the MSPE using the prediction errors for the test sample:
$$\widehat{MSPE}_{split\text{-}sample} = \frac{1}{n_{test}}\sum_{i \in \text{test sample}}\left(Y_i - \hat{Y}_i\right)^2$$

Estimating the MSPE by m-fold cross-validation
The split-sample estimate typically overstates the MSPE because the model is estimated on only 50% of the data.
ØThis problem is reduced by using m-fold cross-validation.
Øm-fold cross-validation, for the case m = 10:
1. Estimate the model on 90% of the data and use it to predict the remaining 10%.
2. Repeat this for the remaining 9 possible subsamples (so there is no overlap in the test samples), and estimate the MSPE using the full set of out-of-sample predictions.
3. The m-fold cross-validation estimator of the MSPE averages these m subsample estimates of the MSPE:
$$\widehat{MSPE}_{m\text{-}fold} = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{n_i}{n/m}\right)\widehat{MSPE}_i$$
where $n_i$ is the number of observations in subsample 𝑖 and the factor in parentheses allows for different numbers of observations in different subsamples.

3. Ridge Regression
The ridge regression estimator shrinks the estimate towards zero by penalizing large squared values of the coefficients. The ridge regression estimator minimizes the penalized sum of squared residuals,
$$S^{Ridge}\left(b; \lambda_{Ridge}\right) = \left(y - Xb\right)'\left(y - Xb\right) + \lambda_{Ridge}\,b'b$$
where $\lambda_{Ridge}\,b'b$ is the “penalty” term. Expanding and minimizing over 𝑏:
$$S^{Ridge} = y'y - b'X'y - y'Xb + b'X'Xb + \lambda_{Ridge}\,b'b$$
$$\frac{dS^{Ridge}}{db} = -2X'y + 2X'Xb + 2\lambda_{Ridge}\,b = 0$$
$$\hat{\beta}^{Ridge} = \left(X'X + \lambda_{Ridge}\,I_k\right)^{-1}X'y$$

The Ridge Regression in a Picture
The ridge regression penalty term penalizes the sum of squared residuals for large values of 𝛽, as shown here for k = 1 (figure not reproduced):
• The value of the ridge objective function, $S^{Ridge}(b)$, is the sum of squared residuals plus a penalty which is a quadratic in 𝑏.
• Thus, the penalized sum of squared residuals is minimized at a smaller value of 𝑏 than is the unpenalized SSR.

Choosing the Ridge Regression penalty factor, $\lambda_{Ridge}$
The ridge regression estimator has an additional parameter, $\lambda_{Ridge}$:
$$S^{Ridge}\left(b; \lambda_{Ridge}\right) = \left(y - Xb\right)'\left(y - Xb\right) + \lambda_{Ridge}\,b'b$$
ØIt would seem natural to choose $\lambda_{Ridge}$ by minimizing over both 𝑏 and $\lambda_{Ridge}$ – but doing so would simply choose $\lambda_{Ridge} = 0$, which would just get us back to OLS!
ØInstead, $\lambda_{Ridge}$ can be chosen by minimizing the m-fold cross-validated estimate of the MSPE (see the sketch at the end of this lecture):
ØChoose some value of $\lambda_{Ridge}$ and estimate the MSPE by m-fold cross-validation.
ØRepeat for many values of $\lambda_{Ridge}$, and choose the one that yields the lowest MSPE.

Empirical Example: Predicting School-level test scores
Data set: a school-level version of the California elementary district data set, augmented with additional variables describing school, student, and district characteristics.
The full data set has 3932 observations. Half of those (1966) are used now – the remaining 1966 are reserved for an out-of-sample comparison of ridge vs. other prediction methods, done later.
The data set has 817 predictors...

Predicting School-level test scores
Variables in the 817-predictor school test score data set (table not reproduced)

Predicting School-level test scores: Ridge Regression
$\lambda_{Ridge}$ is estimated by minimizing the 10-fold cross-validated MSPE.
The resulting estimate of the shrinkage parameter is 39.5.
Root MSPEs: OLS: 78.2; Ridge: 39.5
(k = 817, n = 1966)
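To make the ridge formula and the cross-validated choice of $\lambda_{Ridge}$ concrete, here is a minimal NumPy sketch (not the code used for the empirical example above; the simulated data, candidate λ grid, and function names are all illustrative). It implements the closed form $\hat{\beta}^{Ridge} = \left(X'X + \lambda_{Ridge}\,I_k\right)^{-1}X'y$ and picks λ by minimizing a 10-fold cross-validated MSPE, using a simple unweighted average across folds (equivalent to the lecture's formula when the folds have equal size).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator (X standardized, y demeaned, no intercept)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def mfold_cv_mspe(X, y, lam, m=10, seed=0):
    """m-fold cross-validated estimate of the MSPE for a given ridge lambda."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, m)
    fold_mspes = []
    for test in folds:
        train = np.setdiff1d(idx, test)                 # all observations not in this fold
        b = ridge_fit(X[train], y[train], lam)
        fold_mspes.append(np.mean((y[test] - X[test] @ b) ** 2))
    return float(np.mean(fold_mspes))                   # unweighted average over folds

def choose_lambda(X, y, lam_grid, m=10):
    """Pick the lambda on the grid with the smallest m-fold cross-validated MSPE."""
    cv = [mfold_cv_mspe(X, y, lam, m=m) for lam in lam_grid]
    best = int(np.argmin(cv))
    return lam_grid[best], cv[best]

# Illustrative use on simulated standardized data with k close to n
rng = np.random.default_rng(3)
n, k = 200, 150
X = rng.normal(size=(n, k))
y = X @ rng.normal(scale=0.1, size=k) + rng.normal(size=n)
lam_grid = np.geomspace(0.1, 1000.0, 30)
lam_star, cv_min = choose_lambda(X, y, lam_grid)
print(f"chosen lambda = {lam_star:.2f}, cross-validated root MSPE = {np.sqrt(cv_min):.3f}")
```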