HW3
HW3
Question 1
load(“401ksubs.RData”)
1.a
the first 10 line of the data.
head(data,10)
## e401k inc marr male age fsize nettfa p401k pira incsq agesq
## 1 0 13.170 0 0 40 1 4.575 0 1 173.4489 1600
## 2 1 61.230 0 1 35 1 154.000 1 0 3749.1128 1225
## 3 0 12.858 1 0 44 2 0.000 0 0 165.3282 1936
## 4 0 98.880 1 1 44 2 21.800 0 0 9777.2539 1936
## 5 0 22.614 0 0 53 1 18.450 0 0 511.3930 2809
## 6 0 15.000 1 0 60 3 0.000 0 0 225.0000 3600
## 7 0 37.155 1 0 49 5 3.483 0 1 1380.4939 2401
## 8 0 31.896 1 0 38 5 -2.100 0 0 1017.3548 1444
## 9 0 47.295 1 0 52 2 5.290 0 1 2236.8169 2704
## 10 1 29.100 0 1 45 1 29.600 0 1 846.8100 2025
the summary of data
summary(data)
## e401k inc marr male
## Min. :0.0000 Min. : 10.01 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 21.66 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median : 33.29 Median :1.0000 Median :0.0000
## Mean :0.3921 Mean : 39.25 Mean :0.6286 Mean :0.2044
## 3rd Qu.:1.0000 3rd Qu.: 50.16 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :199.04 Max. :1.0000 Max. :1.0000
## age fsize nettfa p401k
## Min. :25.00 Min. : 1.000 Min. :-502.30 Min. :0.0000
## 1st Qu.:33.00 1st Qu.: 2.000 1st Qu.: -0.50 1st Qu.:0.0000
## Median :40.00 Median : 3.000 Median : 2.00 Median :0.0000
## Mean :41.08 Mean : 2.885 Mean : 19.07 Mean :0.2762
## 3rd Qu.:48.00 3rd Qu.: 4.000 3rd Qu.: 18.45 3rd Qu.:1.0000
## Max. :64.00 Max. :13.000 Max. :1536.80 Max. :1.0000
## pira incsq agesq
## Min. :0.0000 Min. : 100.2 Min. : 625
## 1st Qu.:0.0000 1st Qu.: 469.2 1st Qu.:1089
## Median :0.0000 Median : 1108.1 Median :1600
## Mean :0.2543 Mean : 2121.2 Mean :1794
## 3rd Qu.:1.0000 3rd Qu.: 2516.0 3rd Qu.:2304
## Max. :1.0000 Max. :39617.3 Max. :4096
five-number of each variable
sapply(data,quantile)
## e401k inc marr male age fsize nettfa p401k pira incsq
## 0% 0 10.008 0 0 25 1 -502.3020 0 0 100.1601
## 25% 0 21.660 0 0 33 2 -0.5000 0 0 469.1556
## 50% 0 33.288 1 0 40 3 2.0000 0 0 1108.0908
## 75% 1 50.160 1 0 48 4 18.4495 1 1 2516.0256
## 100% 1 199.041 1 1 64 13 1536.7980 1 1 39617.3203
## agesq
## 0% 625
## 25% 1089
## 50% 1600
## 75% 2304
## 100% 4096
Normal Q-Q Plot of income
qqnorm(data$inc)
qqline(data$inc)
Normal Q-Q Plot of nettfa
qqnorm(data$nettfa)
qqline(data$nettfa)
barplot of factor variable
library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr
p1<-ggplot(data = data)+geom_bar(aes(x=as.factor(e401k)))
p2<-ggplot(data = data,aes(as.factor(marr)))+geom_bar()
p3<-ggplot(data = data,aes(as.factor(fsize)))+geom_bar()
p4<-ggplot(data = data,aes(as.factor(pira)))+geom_bar()
p5<-ggplot(data = data,aes(as.factor(age)))+geom_bar()
p6<-ggplot(data = data,aes(as.factor(male)))+geom_bar()
p7<-ggplot(data = data,aes(as.factor(p401k)))+geom_bar()
ggarrange(p1,p2,p3,p4,p5,p6,p7,ncol = 2,nrow = 4)
Histogram
library(ggplot2)
ggplot(data = data,aes(inc))+geom_histogram(binwidth = 10)
ggplot(data = data,aes(nettfa))+geom_histogram(binwidth = 10)
ggplot(data,aes(inc))+geom_density()
ggplot(data=data,aes(inc,nettfa))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(data=data,aes(incsq,nettfa))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
boxplot
ggplot(data = data,aes(y=inc))+geom_boxplot()
ggplot(data = data,aes(y=nettfa))+geom_boxplot()
ggplot(data = data,aes(y=fsize))+geom_boxplot()
ggplot(data = data,aes(y=age))+geom_boxplot()
ggplot(data = data,aes(y=incsq))+geom_boxplot()
ggplot(data = data,aes(y=agesq))+geom_boxplot()
scatter matrix of the data
1.b
After transformation on inc and nettfa, from the density curve and scatter plot, we can see that inc_ln and nettfa_ln's relationship is more linear.
data$inc_ln<-log(data$inc)
data$nettfa_ln<-log(data$nettfa)
## Warning in log(data$nettfa): NaNs produced
data<-data[data$nettfa!=0,]
ggplot(data=data,aes(inc))+geom_density()
ggplot(data=data,aes(inc_ln))+geom_density()
ggplot(data=data,aes(nettfa_ln))+geom_density()
## Warning: Removed 2682 rows containing non-finite values (stat_density).
ggplot(data=data,aes(inc,nettfa))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(data=data,aes(inc_ln,nettfa_ln))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 2682 rows containing non-finite values (stat_smooth).
## Warning: Removed 2682 rows containing missing values (geom_point).
1.c
model_c = lm(nettfa~inc+age+e401k, data=data)
summary(model_c)
##
## Call:
## lm(formula = nettfa ~ inc + age + e401k, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513.62 -19.68 -4.61 10.58 1461.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65.33757 2.78934 -23.424 < 2e-16 ***
## inc 0.93464 0.02753 33.951 < 2e-16 ***
## age 1.09909 0.06276 17.513 < 2e-16 ***
## e401k 6.24593 1.34766 4.635 3.63e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.96 on 8707 degrees of freedom
## Multiple R-squared: 0.1702, Adjusted R-squared: 0.1699
## F-statistic: 595.2 on 3 and 8707 DF, p-value: < 2.2e-16
We can see that the p-value for coefficents are much less than 0.05, we reject the null hypothesis that coefficient is 0. There is a significant relationship between nettfa and income, age, e401k.
The coefficient are all positive which indicates that income, age and e401k are positve related to the nettfa. It conforms our intuition that people with larger income tend to have larger net financial wealth. Older people tend to have larger net financial wealth. People eligible in e401k tend to have larger net financial wealth.
1.d
plot(model_c$residuals, model_c$fitted.values, xlab="residual", ylab="fitted nettfa", title("residual vs fitted nettfa"))
plot(data$nettfa, model_c$fitted.values, xlab="actual nettfa", ylab="fitted nettfa", title("actual nettfa vs fitted nettfa"))
From the two plots, we can see that for majority cases where the absolute value of net wealth not too large, the model fits them very well. For data with large absolute value of net wealth, the model has large error.
1.e
model_e = lm(nettfa~inc + age + e401k + I(inc^2)+I(age^2), data=data)
summary(model_e)
##
## Call:
## lm(formula = nettfa ~ inc + age + e401k + I(inc^2) + I(age^2),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -517.19 -16.55 -3.61 6.36 1463.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.8110442 10.6489138 2.142 0.032213 *
## inc -0.2826054 0.0795455 -3.553 0.000383 ***
## age -2.0036313 0.5163537 -3.880 0.000105 ***
## e401k 9.7576971 1.3380392 7.293 3.31e-13 ***
## I(inc^2) 0.0102701 0.0006184 16.607 < 2e-16 ***
## I(age^2) 0.0358502 0.0059194 6.056 1.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.88 on 8705 degrees of freedom
## Multiple R-squared: 0.2001, Adjusted R-squared: 0.1996
## F-statistic: 435.5 on 5 and 8705 DF, p-value: < 2.2e-16
Now the coefficient for e401k become larger.
1.f
model_f = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age - 41)) + (e401k:I((age - 41)^2)) , data=data)
summary(model_f)
##
## Call:
## lm(formula = nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) +
## (e401k:I(age - 41)) + (e401k:I((age - 41)^2)), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -519.24 -15.95 -2.95 4.72 1459.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1694143 13.2273919 2.659 0.007856 **
## inc -0.2845365 0.0794590 -3.581 0.000344 ***
## age -2.3870468 0.6402432 -3.728 0.000194 ***
## e401k 10.0308821 1.8055944 5.555 2.85e-08 ***
## I(inc^2) 0.0102556 0.0006178 16.601 < 2e-16 ***
## I(age^2) 0.0377743 0.0073319 5.152 2.63e-07 ***
## e401k:I(age - 41) 0.6271855 0.1385814 4.526 6.10e-06 ***
## e401k:I((age - 41)^2) -0.0041193 0.0122523 -0.336 0.736724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.81 on 8703 degrees of freedom
## Multiple R-squared: 0.2022, Adjusted R-squared: 0.2015
## F-statistic: 315.1 on 7 and 8703 DF, p-value: < 2.2e-16
The p-value for e401k:I(age - 41) is much less 0.05, whereas p value for e401k:I((age - 41)^2) is much larger than 0.05. Thus e401k:I(age - 41) is significant and e401k:I((age - 41)^2) is not. I suggest to keep the significant term e401k:I(age - 41). It is better to remove non-significant term and keep the significant one.
1.g
The coefficient of e401k in (c) is 5.98908 and it is 9.7046880 in (e). The estimated effects of e401k become larger.
1.h
data$fsize1 = rep(0, nrow(data))
data$fsize2 = rep(0, nrow(data))
data$fsize3 = rep(0, nrow(data))
data$fsize4 = rep(0, nrow(data))
data$fsize5 = rep(0, nrow(data))
data$fsize1[data$fsize == 1] = 1
data$fsize2[data$fsize == 2] = 1
data$fsize3[data$fsize == 3] = 1
data$fsize4[data$fsize == 4] = 1
data$fsize5[data$fsize >= 5] = 1
model_h = lm(nettfa~inc + age + e401k + I(inc^2)+I(age^2) + fsize1 + fsize3 + fsize4 + fsize5, data=data)
summary(model_h)
##
## Call:
## lm(formula = nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) +
## fsize1 + fsize3 + fsize4 + fsize5, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -517.4 -16.9 -3.5 6.9 1460.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.0592788 10.8225775 1.299 0.19395
## inc -0.2367342 0.0805880 -2.938 0.00332 **
## age -1.4889417 0.5287242 -2.816 0.00487 **
## e401k 9.5065189 1.3381626 7.104 1.31e-12 ***
## I(inc^2) 0.0100179 0.0006213 16.125 < 2e-16 ***
## I(age^2) 0.0296686 0.0060848 4.876 1.10e-06 ***
## fsize1 1.5646389 1.9409414 0.806 0.42019
## fsize3 -3.6864050 1.9362175 -1.904 0.05695 .
## fsize4 -5.5494769 1.9204821 -2.890 0.00387 **
## fsize5 -6.7664837 2.2223662 -3.045 0.00234 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.82 on 8701 degrees of freedom
## Multiple R-squared: 0.2021, Adjusted R-squared: 0.2013
## F-statistic: 244.9 on 9 and 8701 DF, p-value: < 2.2e-16
We can see that coefficients for fsize3, fsize4 and fsize5 are significant and coefficient for fsize1 is not significant. The coefficients for fsize3, fsize4 and fsize5 are all negative indicating large family size tend to have smaller net financial wealth.
1.i
model_f = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age - 41)) + (e401k:I((age - 41)^2)) , data=data) summary(model_f)
m1 = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + fsize1, data = data)
m2 = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + fsize2, data = data )
m3 = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + fsize3, data = data)
m4 = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + fsize4, data = data)
m5 = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + fsize5, data = data)
anova(model_e, m1, m2, m3, m4, m5)
## Analysis of Variance Table
##
## Model 1: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2)
## Model 2: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + fsize1
## Model 3: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + fsize2
## Model 4: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + fsize3
## Model 5: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + fsize4
## Model 6: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + fsize5
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8705 30176646
## 2 8704 30141355 1 35291 10.191 0.001416 **
## 3 8704 30161157 0 -19802
## 4 8704 30173951 0 -12794
## 5 8704 30157552 0 16399
## 6 8704 30155251 0 2301
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
1.j
library(“margins”)
margin = margins(model_f)
summary(margin)
## factor AME SE z p lower upper
## age 0.9819 0.0667 14.7126 0.0000 0.8511 1.1127
## e401k 9.7197 1.3384 7.2624 0.0000 7.0965 12.3428
## inc 0.5449 0.0368 14.8254 0.0000 0.4728 0.6169
plot(margin)
From the plot, we can see that the marginal effect of age are much larger than inc and e401k.
1.k
library(plot3D)
scatter3D(data$age, data$inc, data$nettfa)
plot(data$age, data$nettfa)
plot(data$inc, data$nettfa)
We can see that the optimal level of net financial wealth is about 1500k.
1.l
library(“margins”)
margin2 = margins(model_h)
summary(margin2)
## factor AME SE z p lower upper
## age 0.9557 0.0669 14.2776 0.0000 0.8245 1.0869
## e401k 9.5065 1.3383 7.1032 0.0000 6.8834 12.1296
## fsize1 1.5646 1.9409 0.8061 0.4202 -2.2395 5.3688
## fsize3 -3.6864 1.9362 -1.9039 0.0569 -7.4814 0.1085
## fsize4 -5.5495 1.9205 -2.8896 0.0039 -9.3136 -1.7854
## fsize5 -6.7665 2.2224 -3.0447 0.0023 -11.1222 -2.4107
## inc 0.5734 0.0377 15.2269 0.0000 0.4996 0.6472
plot(margin2)
1.m
model_m = lm(nettfa_ln ~ inc_ln + age + e401k + I(inc_ln^2) + I(age^2) + (e401k:I(age – 41)) + (e401k:I((age – 41)^2)) , data=data)
summary(model_m)
##
## Call:
## lm(formula = nettfa_ln ~ inc_ln + age + e401k + I(inc_ln^2) +
## I(age^2) + (e401k:I(age – 41)) + (e401k:I((age – 41)^2)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8911 -0.9062 0.1832 1.0973 5.4122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.2655534 0.8113440 -5.257 1.51e-07 ***
## inc_ln 1.3889796 0.3803356 3.652 0.000262 ***
## age -0.0297986 0.0222403 -1.340 0.180346
## e401k 0.7485993 0.0580033 12.906 < 2e-16 ***
## I(inc_ln^2) 0.0368092 0.0522519 0.704 0.481176
## I(age^2) 0.0009168 0.0002504 3.661 0.000254 ***
## e401k:I(age - 41) -0.0019541 0.0046271 -0.422 0.672817
## e401k:I((age - 41)^2) -0.0008438 0.0003912 -2.157 0.031055 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.607 on 6021 degrees of freedom
## (2682 observations deleted due to missingness)
## Multiple R-squared: 0.3588, Adjusted R-squared: 0.358
## F-statistic: 481.2 on 7 and 6021 DF, p-value: < 2.2e-16
We can see that the new model's Adjusted R-squared 0.358 is much larger than the model in F which is 0.2036. I prefer this new model.
1.n
model_n = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age - 41)) + (e401k:I((age - 41)^2)) + marr + male , data=data)
summary(model_n)
##
## Call:
## lm(formula = nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) +
## (e401k:I(age - 41)) + (e401k:I((age - 41)^2)) + marr + male,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -515.71 -16.29 -2.95 5.56 1459.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.9921873 13.2662812 2.713 0.006680 **
## inc -0.1827555 0.0835317 -2.188 0.028706 *
## age -2.4198696 0.6400510 -3.781 0.000157 ***
## e401k 9.7770599 1.8053469 5.416 6.27e-08 ***
## I(inc^2) 0.0097560 0.0006301 15.484 < 2e-16 ***
## I(age^2) 0.0382992 0.0073278 5.227 1.77e-07 ***
## marr -5.6238291 1.5448765 -3.640 0.000274 ***
## male 1.5188891 1.7207114 0.883 0.377417
## e401k:I(age - 41) 0.6229432 0.1384506 4.499 6.90e-06 ***
## e401k:I((age - 41)^2) -0.0040842 0.0122406 -0.334 0.738644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.75 on 8701 degrees of freedom
## Multiple R-squared: 0.2039, Adjusted R-squared: 0.2031
## F-statistic: 247.6 on 9 and 8701 DF, p-value: < 2.2e-16
anova(model_f, model_n)
## Analysis of Variance Table
##
## Model 1: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age -
## 41)) + (e401k:I((age - 41)^2))
## Model 2: nettfa ~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age -
## 41)) + (e401k:I((age - 41)^2)) + marr + male
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8703 30097790
## 2 8701 30032124 2 65666 9.5125 7.469e-05 ***
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
By intuition, whether the person is married or is man has impact on the net wealth. Thus I add two predictors marr and male to the model. Then I use anova to compare the two models. The p value is much less than 0.001, thus the new model is significantly better than the model in (f).
1.o
set.seed(8888)
shuffled<-data[sample(nrow(data)),]
folds <- cut(seq(1,nrow(data)),breaks=5,labels=FALSE)
mses = rep(0, 5)
for(i in 1:5) {
testInds <- which(folds==i,arr.ind=TRUE)
testData <- shuffled[testInds, ]
trainData <- shuffled[-testInds, ]
m = lm(nettfa~ inc + age + e401k + I(inc^2) + I(age^2) + (e401k:I(age - 41)) + (e401k:I((age - 41)^2)) + marr + male , data=trainData)
pred = predict(m, testData)
mses[i] = mean( (testData$nettfa-pred )^2)
}
mses
## [1] 2802.584 2955.837 4111.155 2895.160 4579.648
mean(mses)
## [1] 3468.877
I conduct the five cross validation on the model from (n). The average mean squared errors across the 5 folds is 3268.31.
Question 2
2.a
german_healthcare_usage <- read.csv("german_healthcare_usage.csv")
# ALC ADDON
model_2_a = lm(DOCVIS ~ HEALTHY + HANDPER + LOGINC + EDUC + WORKING + PUBLIC, data = german_healthcare_usage)
summary(model_2_a)
##
## Call:
## lm(formula = DOCVIS ~ HEALTHY + HANDPER + LOGINC + EDUC + WORKING +
## PUBLIC, data = german_healthcare_usage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.635 -2.305 -1.326 0.696 113.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013808 0.252342 19.869 < 2e-16 ***
## HEALTHY -2.866023 0.069067 -41.496 < 2e-16 ***
## HANDPER 0.041297 0.001759 23.475 < 2e-16 ***
## LOGINC -0.144574 0.070697 -2.045 0.040867 *
## EDUC -0.037854 0.015349 -2.466 0.013658 *
## WORKING -0.719214 0.073655 -9.765 < 2e-16 ***
## PUBLIC 0.423669 0.109105 3.883 0.000103 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.363 on 27313 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.112, Adjusted R-squared: 0.1118
## F-statistic: 574 on 6 and 27313 DF, p-value: < 2.2e-16
I choose this model. I think self report health (HEALTHY) and degree of handicap (HANDPER) has direct impact on doctor visits. Other factors such as income, years of education, whether employed, insured will also impact the doctor visits.
From the model output, we can see that p-value for all coefficietns are less than 0.05. They are all significant. The Multiple R-squared is 0.112 and the Adjusted R-squared is 0.1118.
2.b
2.b.i
The control group is all men observation, the treatment group is all women observation, the null hypothesis is , versus .
female<-german_healthcare_usage[which(german_healthcare_usage$FEMALE==1),]
male<-german_healthcare_usage[which(german_healthcare_usage$FEMALE==0),]
female$before<-as.numeric(female$YEAR1984|female$YEAR1985|female$YEAR1986)
female$after<-as.numeric(female$YEAR1987|female$YEAR1988|female$YEAR1991|female$YEAR1994)
male$before<-as.numeric(male$YEAR1984|male$YEAR1985|male$YEAR1986)
male$after<-as.numeric(male$YEAR1987|male$YEAR1988|male$YEAR1991|male$YEAR1994)
control_before<-male[which(male$before==1),]
control_after<-male[which(male$before==0),]
treatment_before<-female[which(female$before==1),]
treatment_after<-female[which(female$before==0),]
D_in_D<-(mean(treatment_after$DOCVIS)-mean(control_after$DOCVIS))-(mean(treatment_before$DOCVIS)-mean(control_before$DOCVIS))
D_in_D
## [1] -0.1885151
The Differences-in-Differences estimator on women is less than 0, the policy is worked for women.
2.b.ii
The control group is all unemployed observation, the treatment group is all employed observation, the null hypothesis is , versus .
unemployed<-german_healthcare_usage[which(german_healthcare_usage$UNEMPLOY==1),]
employed<-german_healthcare_usage[which(german_healthcare_usage$UNEMPLOY==0),]
unemployed$before<-as.numeric(unemployed$YEAR1984|unemployed$YEAR1985|unemployed$YEAR1986)
unemployed$after<-as.numeric(unemployed$YEAR1987|unemployed$YEAR1988|unemployed$YEAR1991|unemployed$YEAR1994)
employed$before<-as.numeric(employed$YEAR1984|employed$YEAR1985|employed$YEAR1986)
employed$after<-as.numeric(employed$YEAR1987|employed$YEAR1988|employed$YEAR1991|employed$YEAR1994)
control_before<-employed[which(employed$before==1),]
control_after<-employed[which(employed$before==0),]
treatment_before<-unemployed[which(unemployed$before==1),]
treatment_after<-unemployed[which(unemployed$before==0),]
D_in_D<-(mean(treatment_after$DOCVIS)-mean(control_after$DOCVIS))-(mean(treatment_before$DOCVIS)-mean(control_before$DOCVIS))
D_in_D
## [1] -0.2587182
The Differences-in-Differences estimator on unemployed is less than 0, the policy is worked for unemployed.
2.c
The null hypothesis is , the alternative hypothesis is , we perform a Two Sample t-test here:
men<-german_healthcare_usage[which(german_healthcare_usage$FEMALE==0),]
t.test(female$DOCVIS,men$DOCVIS, alternative = c("greater"))
##
## Welch Two Sample t-test
##
## data: female$DOCVIS and men$DOCVIS
## t = 16.889, df = 25813, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.051871 Inf
## sample estimates:
## mean of x mean of y
## 3.791087 2.625711
The p value of Two Sample t-test is less than 0.05, which means that there is a reason to reject the null hypothesis, the number of doctor visits a patient has over a 3 month period is greater for women than for men.
2.d
We are interested in the effect of policy on civil servant, the regression model is , where is a dummy variable for the period, equal tp 1 when after, and is a dummy variable for group membership, equal to 1 when treatment.
beamt<-german_healthcare_usage[which(german_healthcare_usage$BEAMT==1),]
beamt$T<-as.numeric(beamt$YEAR1987|beamt$YEAR1988|beamt$YEAR1991|beamt$YEAR1994)
beamt$T_DOCTOR<-beamt$T*beamt$DOCTOR
reg2<-lm(DOCVIS~DOCTOR+T+T_DOCTOR,data = beamt)
summary(reg2)
##
## Call:
## lm(formula = DOCVIS ~ DOCTOR + T + T_DOCTOR, data = beamt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.524 -1.940 0.000 0.000 41.476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.258e-15 1.682e-01 0.000 1.0000
## DOCTOR 4.524e+00 2.404e-01 18.822 <2e-16 ***
## T -1.601e-16 2.309e-01 0.000 1.0000
## T_DOCTOR -5.844e-01 3.171e-01 -1.843 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.511 on 2037 degrees of freedom
## Multiple R-squared: 0.2604, Adjusted R-squared: 0.2593
## F-statistic: 239.1 on 3 and 2037 DF, p-value: < 2.2e-16
The Differences-in-Differences estimator on civil servant is -0.058 less than 0, the policy is worked for civil servant, and it is significant at 0.1 significant level.