1. The dataset 401KSUBS contains information on net financial wealth (nettfa), age of the survey respondent

(age), annual family income (inc), family size (fsize), and a binary variable for eligibility in

a 401(k) plan (e401k) among other variables. The wealth and income variables are both

recorded in thousands of dollars. Our response variable for this problem is nettfa. Note: The

complete dataset includes 10 predictors (please refer to the file description for details).

(a) Provide a descriptive analysis of your variables. This should include histograms and

fitted distributions, quantile plots, correlation plot, boxplots, scatterplots, and statistical

summaries (e.g., the five-number summary). All figures must include comments.

(b) For each variable, test if a transformation to linearity is appropriate, and if so, apply

the respective transformation. We will use these results later in part (m).

(c) Estimate a multiple linear regression model for nettfa that includes income, age, and

e401k as explanatory variables. We will use this model as a baseline. Comment on the

statistical and economic significance of your estimates. Also, make sure to provide an

interpretation of your estimates. If there are any outliers worth removing, remove them

before proceeding with the next steps.

(d) For your model in part (c) plot the respective residuals vs. ŷ, and y vs. ŷ, and comment

on your results.

(e) For a more economically realistic model, the income and age variables should appear as

quadratics. Re-estimate your model from part (c) including these two quadratic terms.

Now, what is the estimated dollar effect of 401(k) eligibility?

(f) For the model estimated in part (e), add the interactions $\mathit{e401k}\cdot(\mathit{age}-41)$ and

$\mathit{e401k}\cdot(\mathit{age}-41)^2$. Note that the average age in the sample is about 41, so that in the new model, the

coefficient on e401k is the estimated effect of 401(k) eligibility at the average age. Are

the interaction terms significant? Would you suggest keeping one of the interactions (or

both)? Explain.

(g) Comparing the estimates from parts (e) and (f), do the estimated effects of 401(k)

eligibility at age 41 differ much? Explain.

(h) Now, drop the interaction terms from the model in part (f), but define five family size

dummy variables: fsize1, fsize2, fsize3, fsize4, and fsize5. The variable fsize5 is

unity for families with five or more members. Include the family size dummies in the

model estimated in part (e) and make sure to choose the base group as fsize2. Comment

on your estimates.

(i) Now, do a Chow test for the model

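\[ \mathit{nettfa} = \beta_0 + \beta_1\,\mathit{inc} + \beta_2\,\mathit{inc}^2 + \beta_3\,\mathit{age} + \beta_4\,\mathit{age}^2 + \beta_5\,\mathit{e401k} + e \]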

across the five family size categories, allowing for intercept differences. The restricted

sum of squared residuals, SSRr, is obtained from part (h) because that regression assumes

all slopes are the same. The unrestricted sum of squared residuals is SSRur =

SSR1 + ⋯ + SSR5, where SSRf is the sum of squared residuals for the equation estimated

using only family size f . You should convince yourself that there are 30 parameters in

the unrestricted model (5 intercepts plus 25 slopes) and 10 parameters in the restricted

model (5 intercepts plus 5 slopes). Therefore, the number of restrictions being tested is

q = 20, and the df for the unrestricted model is 9,275 − 30 = 9,245.

(j) Based on your model in part (f), plot and discuss the marginal effects plots across your

predictors.

(k) Is there an optimal level of net financial wealth based on income and age? If so, compute

this level and show the respective perspective and image plots.

(l) For each predictor, plot the predictor effect plot by family size.

(m) Estimate a multiple regression model using your transformed predictor variables from

part (b) using the same model as in part (f) and compare the two models. Which one

do you prefer?

(n) Based on all the available predictors, estimate a model with additive and interaction

terms, and compare it to your model in part (f).

(o) Lastly, choose your favorite model from all the ones estimated and perform a five-fold

cross validation test on it.
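
For the Chow test in part (i), the arithmetic can be sketched as follows. This is a minimal illustration, not a prescribed solution: the function name `chow_f_statistic` and all SSR values are hypothetical placeholders, and the sample size of 9,275 applies only if no observations were dropped in part (c). The statistic computed is F = [(SSRr − SSRur)/q] / [SSRur/df], with q and df derived from the parameter counts given in part (i).

```python
def chow_f_statistic(ssr_restricted, ssr_by_group, n_obs, k_unrestricted, k_restricted):
    """Return (F, q, df) for a Chow test that allows intercept differences.

    ssr_restricted : SSR from the pooled model with group dummies (common slopes)
    ssr_by_group   : list of SSRs from separate regressions, one per group
    """
    ssr_unrestricted = sum(ssr_by_group)        # SSRur = SSR1 + ... + SSR5
    q = k_unrestricted - k_restricted           # number of restrictions tested
    df_unrestricted = n_obs - k_unrestricted    # denominator degrees of freedom
    f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (
        ssr_unrestricted / df_unrestricted
    )
    return f_stat, q, df_unrestricted


# Hypothetical SSR values, for illustration only (not results from 401KSUBS).
ssr_r = 1050.0
ssr_groups = [180.0, 220.0, 200.0, 210.0, 190.0]

f_stat, q, df_ur = chow_f_statistic(
    ssr_r, ssr_groups, n_obs=9275, k_unrestricted=30, k_restricted=10
)
print(q, df_ur)   # 20 restrictions, 9245 degrees of freedom
print(f_stat)
```

The resulting F value would then be compared with the critical value of an F distribution with q and df degrees of freedom.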

2. Assume a healthcare insurance company hired you as a consultant to develop an econometric

model to estimate the number of doctor visits a patient has over a 3-month period. The

rationale behind this study is that patients with a higher number of doctor visits would pose

a higher liability in terms of insurance expenses, and therefore, this may be mitigated via a

higher insurance premium. The panel data are from the German Health Care Usage Dataset,

and consist of 7,293 individuals across varying numbers of periods with a total of 27,326

observations.

(a) Build a multiple regression model with a subset of 10 predictors (at most), including

interaction and non-linear transformations if appropriate. For this part you only need to

briefly discuss a justification for the model chosen, and discuss the respective regression

output.

(b) Differences in Differences: In 1987 the German Government passed a series of laws

to improve healthcare access for unemployed people and women.

i. Determine whether or not the policy worked for women.

ii. Determine whether or not the policy worked for the unemployed.

(c) Test the hypothesis that the number of doctor visits a patient has over a 3-month period

is greater for women than for men.

(d) Based on your findings propose and test your own hypothesis of interest using the linear

functional form: λ = c1β1 + c2β2 +⋯.
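
For the linear functional form in part (d), one possible way to compute the test statistic is sketched below. It assumes you already have the coefficient estimates and their estimated covariance matrix from your regression output; the function name `linear_combination_test` and every number used are hypothetical. The estimate is λ̂ = Σ c_i b_i, its variance is the quadratic form Σ_i Σ_j c_i c_j V_ij, and the t statistic is (λ̂ − λ0)/se(λ̂).

```python
import math


def linear_combination_test(coefs, cov, weights, null_value=0.0):
    """Return (lambda_hat, standard_error, t_stat) for H0: sum(c_i * beta_i) = null_value."""
    k = len(weights)
    lambda_hat = sum(c * b for c, b in zip(weights, coefs))
    # Variance of a linear combination: c' V c, written out as a double sum.
    variance = sum(
        weights[i] * weights[j] * cov[i][j] for i in range(k) for j in range(k)
    )
    std_err = math.sqrt(variance)
    t_stat = (lambda_hat - null_value) / std_err
    return lambda_hat, std_err, t_stat


# Hypothetical estimates and covariance matrix, for illustration only.
coefs = [2.0, 0.5, -1.0]                 # intercept and two slopes
cov = [
    [0.04, 0.0, 0.0],
    [0.0, 0.01, 0.002],
    [0.0, 0.002, 0.09],
]
weights = [0.0, 1.0, 1.0]                # lambda = (first slope) + (second slope)

lambda_hat, std_err, t_stat = linear_combination_test(coefs, cov, weights)
print(lambda_hat, std_err, t_stat)
```

Note that the covariance term between the two slopes enters the variance twice, which is why testing the combination directly differs from testing each coefficient separately.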

Data Description (For Problem 2)

This is a large data set. There are altogether 27,326 observations. The number of observations per person

ranges from 1 to 7. (Frequencies are: 1=1525, 2=2158, 3=825, 4=926, 5=1051, 6=1000, 7=987).

Note, the variable NUMOBS below tells how many observations there are for each person. This

variable is repeated in each row of the data for the person. Below are the variable definitions.

Note: You can ignore the variables TI and INCOME (INCOME is just a copy of HHNINC).
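
Because the panel is unbalanced, it may help to confirm the structure described above before modeling. The sketch below uses a few made-up (ID, NUMOBS) rows, not the actual data, and checks that NUMOBS matches the number of rows per person; the variable names in the code are illustrative.

```python
from collections import Counter

# Toy rows of (ID, NUMOBS); hypothetical values, not taken from the real file.
rows = [(1, 3), (1, 3), (1, 3), (2, 1), (3, 2), (3, 2)]

# Count how many rows each person actually has.
rows_per_person = Counter(person_id for person_id, _ in rows)

# Rows where the recorded NUMOBS disagrees with the actual row count.
mismatches = [
    (person_id, numobs)
    for person_id, numobs in rows
    if rows_per_person[person_id] != numobs
]

# Distribution of panel lengths: how many people have 1, 2, 3, ... rows.
panel_length_freq = Counter(rows_per_person.values())

# Total observations implied by the panel-length distribution.
total_obs = sum(length * count for length, count in panel_length_freq.items())

print(mismatches)
print(sorted(panel_length_freq.items()))
print(total_obs == len(rows))
```

Applied to the real file, the same check would use the ID and NUMOBS columns in place of the toy rows.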

ID = person – identification number

FEMALE = female = 1; male = 0

YEAR = calendar year of the observation

AGE = age in years

HSAT = health satisfaction, coded 0 (low) – 10 (high). Note: this variable has 40 coding errors.

Variable NEWHSAT below fixes them.

HANDDUM = handicapped = 1; otherwise = 0

HEALTHY = self reported to be healthy = 1; otherwise = 0

ALC = average alcohol consumption in the last 3 months

TRAVEL = traveled in the last 3 months abroad = 1; otherwise = 0

HANDPER = degree of handicap in percent (0 – 100)

HHNINC = household nominal monthly net income in German marks / 10000

LOGINC = Natural log (ln) of household nominal monthly net income in German marks / 10000

HHKIDS = children under age 16 in the household = 1; otherwise = 0

EDUC = years of schooling

MARRIED = married = 1; otherwise = 0

HAUPTS = highest schooling degree is Hauptschul degree = 1; otherwise = 0

REALS = highest schooling degree is Realschul degree = 1; otherwise = 0

FACHHS = highest schooling degree is Polytechnical degree = 1; otherwise = 0

ABITUR = highest schooling degree is Abitur = 1; otherwise = 0

UNIV = highest schooling degree is university degree = 1; otherwise = 0

WORKING = employed = 1; otherwise = 0

BLUEC = blue collar employee = 1; otherwise = 0

WHITEC = white collar employee = 1; otherwise = 0

SELF = self employed = 1; otherwise = 0

BEAMT = civil servant = 1; otherwise = 0

DOCVIS = number of doctor visits in last three months

HOSPVIS = number of hospital visits in last calendar year

UNEMPLOY = unemployed = 1; otherwise = 0

DOCTOR = dummy variable = 1 if DOCVIS > 0, 0 otherwise.

HOSPITAL = dummy variable = 1 if HOSPVIS > 0, 0 otherwise.

PUBLIC = insured in public health insurance = 1; otherwise = 0

ADDON = insured by add-on insurance = 1; otherwise = 0

NUMOBS = number of observations for this person. Repeated in each row of data.

NEWHSAT = recoded value of HSAT with coding errors corrected.

PRESCRIP = number of prescriptions in last three months