Overview of the data
The data is from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.
The sample contains household data in which the reference person is aged 25-64, at least one household member is employed, and no one is self-employed. The observation units correspond to the household reference persons.
The data set contains a number of feature variables that you can choose from to predict total wealth. The outcome variable (total wealth) and the feature variables are described on the next slide, followed by a short data-loading sketch.
Dataframe with the following variables
Variable to predict (outcome variable):
• tw: total wealth (in US $).
• Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity plus the value of business, property, and motor vehicles.
Variables related to retirement (features):
• ira: individual retirement account (IRA) (in US $).
• e401: 1 if eligible for 401(k), 0 otherwise
Financial variables (features):
• nifa: non-401k financial assets (in US $).
• inc: income (in US $).
Variables related to home ownership (features):
• hmort: home mortgage (in US $).
• hval: home value (in US $).
• hequity: home value minus home mortgage.
Other covariates (features):
• educ: education (in years).
• male: 1 if male, 0 otherwise.
• twoearn: 1 if two earners in the household, 0 otherwise.
• nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.
• age: age.
• fsize: family size.
• marr: 1 if married, 0 otherwise.
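A minimal loading sketch in Python, assuming the data come as a CSV file with the column names above; the file name "sipp1991.csv" is a placeholder for whatever file you were given.

```python
# Hedged sketch: load the SIPP data and split outcome from features.
# The file name is an assumption; substitute the file provided for the project.
import pandas as pd

df = pd.read_csv("sipp1991.csv")   # 7,933 observations expected

outcome = "tw"                     # total wealth (in US $)
features = ["ira", "e401", "nifa", "inc", "hmort", "hval", "hequity",
            "educ", "male", "twoearn", "nohs", "hs", "smcol", "col",
            "age", "fsize", "marr"]

y = df[outcome]
X = df[features]
print(X.shape)                     # should be (7933, 17)
```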
What are 401(k) and IRA accounts?
• Both 401(k) and IRA are tax-deferred savings options that aim to increase individual saving for retirement
• The 401(k) plan:
   • a company-sponsored retirement account to which employees can contribute; employers can match a certain % of an employee's contribution
   • 401(k) plans are offered by employers, so only employees in companies offering such plans can participate
   • The feature variable e401 contains information on 401(k) eligibility
• IRA accounts:
   • Everyone can participate: you can go to a bank to open an IRA account
   • The feature variable ira contains the IRA account balance (in US $)
Collection of methods
We have already seen:
• OLS
• Ridge regression
• Stepwise selection methods
• Lasso
Note:
1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability
2. For Ridge, stepwise selection, and Lasso, don't forget to use cross-validation
3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense (see the sketch after this list)
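A minimal sketch of points 2 and 3 using scikit-learn, assuming the X and y objects from the data-loading sketch; the penalty grid and the 10-fold split are illustrative choices, not project requirements.

```python
# Hedged sketch: Ridge and Lasso with cross-validated penalty choice,
# plus a quick look at which predictors the Lasso keeps (interpretability).
# X, y come from the data-loading sketch; the lambda grid is an arbitrary assumption.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

lambdas = np.logspace(-2, 5, 100)                   # candidate penalties (assumption)

Xs = StandardScaler().fit_transform(X)              # standardize before penalizing

ridge = RidgeCV(alphas=lambdas, cv=10).fit(Xs, y)   # 10-fold CV picks lambda for Ridge
lasso = LassoCV(alphas=lambdas, cv=10).fit(Xs, y)   # 10-fold CV picks lambda for Lasso

print("Ridge lambda*:", ridge.alpha_)
print("Lasso lambda*:", lasso.alpha_)
print("Predictors kept by Lasso:",
      [name for name, b in zip(X.columns, lasso.coef_) if b != 0])
```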
Compare the prediction performances of different methods — an example (this is just ONE EXAMPLE)
• Say you have applied Ridge regression and the Lasso
• For Ridge regression, you use the K-fold CV (Slide 12) to choose the best $\lambda$, say $\lambda^*_{RR}$. Given $\lambda^*_{RR}$, estimate the model with the ENTIRE data
   • Note that you have computed $\mathrm{avgCVerror}(\lambda^*_{RR})$ in Step 6 of Slide 12
• For the Lasso, you also use the K-fold CV (Slide 12) to choose the best $\lambda$, say $\lambda^*_{L}$. Given $\lambda^*_{L}$, estimate the model with the ENTIRE data
   • Note that you have computed $\mathrm{avgCVerror}(\lambda^*_{L})$ in Step 6 of Slide 12
• The best $\lambda$ for Ridge does not have to be the same as the best $\lambda$ for Lasso; that is, $\lambda^*_{RR}$ does not necessarily equal $\lambda^*_{L}$
• Which do you choose to build the prediction/fitted model: the Ridge estimates or the Lasso estimates?
   • You compare $\mathrm{avgCVerror}(\lambda^*_{RR})$ with $\mathrm{avgCVerror}(\lambda^*_{L})$
   • If $\mathrm{avgCVerror}(\lambda^*_{RR}) > \mathrm{avgCVerror}(\lambda^*_{L})$, choose Lasso to build the prediction/fitted model; otherwise, choose Ridge
   • If $\mathrm{avgCVerror}(\lambda^*_{RR})$ and $\mathrm{avgCVerror}(\lambda^*_{L})$ are similar, choose the one whose fitted model is easier to understand (e.g., the one with fewer predictors, and predictors that are intuitive); a code sketch of this comparison follows
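One way to carry out this comparison, sketched under the same assumptions as before: the ridge and lasso objects (and the standardized matrix Xs) come from the earlier CV sketch, and the 10-fold split is an illustrative choice.

```python
# Hedged sketch: compare avgCVerror at the chosen penalties and refit the winner
# on the ENTIRE data. ridge.alpha_ and lasso.alpha_ are the lambda* values selected
# in the earlier sketch (an assumption about how that step was run).
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

def avg_cv_error(model, X, y, k=10):
    """K-fold average of the CV prediction errors (Slide 12, Step 5)."""
    mse = -cross_val_score(model, X, y, cv=k, scoring="neg_mean_squared_error")
    return mse.mean()

err_ridge = avg_cv_error(Ridge(alpha=ridge.alpha_), Xs, y)
err_lasso = avg_cv_error(Lasso(alpha=lasso.alpha_), Xs, y)

# If Ridge's avgCVerror is larger, build the fitted model with Lasso; otherwise Ridge
winner = Lasso(alpha=lasso.alpha_) if err_ridge > err_lasso else Ridge(alpha=ridge.alpha_)
winner.fit(Xs, y)                  # estimate the chosen model with the ENTIRE data
print("avgCVerror Ridge:", err_ridge, " avgCVerror Lasso:", err_lasso)
```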
K-fold cross validation
1. Partition the data $T$ into $K$ separate sets of equal size
   • $T = (T_1, T_2, \dots, T_K)$; e.g., $K = 5$ or $10$
2. For a given $\lambda$ and each $m = 1, 2, \dots, K$, estimate the model with the training data excluding $T_m$
   • Denote the obtained model by $\hat{f}_{-m,\lambda}(\cdot)$
3. Predict the outcomes for $T_m$ with the model from Step 2 and the input data in $T_m$
   • The predicted outcomes are $\hat{f}_{-m,\lambda}(x)$, where $x \in T_m$
4. Compute the sample mean squared (prediction) error for $T_m$, known as the CV prediction error
   • $\mathrm{CVerror}_m(\lambda) = |T_m|^{-1} \sum_{(x,y) \in T_m} \big( y - \hat{f}_{-m,\lambda}(x) \big)^2$
5. Compute the average of $\mathrm{CVerror}_m(\lambda)$ over all $K$ sets for each $\lambda$
   • $\mathrm{avgCVerror}(\lambda) = K^{-1} \sum_{m=1}^{K} \mathrm{CVerror}_m(\lambda)$
6. Select $\lambda = \lambda^*$ that gives the smallest $\mathrm{avgCVerror}(\lambda)$; a code sketch of these steps follows
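A minimal sketch of Steps 1-6 written out by hand for Ridge regression (the same loop works for the Lasso). X and y are assumed to be numpy arrays (e.g., the standardized design matrix and outcome from the data slide), and the fold count and $\lambda$ grid are illustrative assumptions.

```python
# Hedged sketch: K-fold CV (Steps 1-6) implemented directly for Ridge regression.
# X, y are assumed to be numpy arrays; K = 10 and the lambda grid are arbitrary choices.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

lambdas = np.logspace(-2, 5, 100)
kf = KFold(n_splits=10, shuffle=True, random_state=0)              # Step 1: partition T into K sets

avg_cv_error = []
for lam in lambdas:
    cv_errors = []
    for train_idx, test_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])   # Step 2: fit excluding T_m
        pred = model.predict(X[test_idx])                          # Step 3: predict outcomes in T_m
        cv_errors.append(np.mean((y[test_idx] - pred) ** 2))       # Step 4: CVerror_m(lambda)
    avg_cv_error.append(np.mean(cv_errors))                        # Step 5: average over the K sets

lam_star = lambdas[int(np.argmin(avg_cv_error))]                   # Step 6: smallest avgCVerror
print("lambda* =", lam_star)
```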