Homework 4
To read a data set in the Stata format into R, we have used
These commands make the EAWE21 (EARNINGS) data set available to us on Lee and, because we included the attach command, we can refer to individual variables by name. If you want to know more about this data set, it is described in Dougherty’s data appendix.
(Those of you who would like to work with your own copy of RStudio can download this data set in Excel format, or as a .csv file (under ASCII), from the web site for the textbook. You can also work with the .dta version of the file, but you will need to make sure you have installed the foreign package for your version of R and RStudio.)
1. Estimate a regression model that uses years of schooling (S) and years of work experience (EXP) to predict EARNINGS.
The general format for R to include more than one variable in the lm command is
lm(y~X2+X3)
Any number of X variables can be added in this manner.
Recall that you then need the summary and anova commands to see results comparable to the Stata output shown in the book.
Verify that your estimated coefficients, standard errors, and t-statistics match those in Dougherty’s Table 3.1 (p. 158).
How do you interpret the R² you obtained?
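As a reminder of the mechanics, here is a minimal sketch of a two-variable regression. It uses simulated stand-ins for S, EXP, and EARNINGS (an assumption made only so that the code runs without the data file), so its numbers will not match Table 3.1.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)               # years of schooling
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))          # years of work experience
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)  # earnings

fit = lm(EARNINGS ~ S + EXP)   # two explanatory variables joined with +
results = summary(fit)
print(results$coefficients)    # estimates, standard errors, t-statistics, p-values
print(results$r.squared)       # R-squared from the fitted model
```

With the real data attached, only the lm and summary lines are needed.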
2. Use the confint command in R to obtain confidence intervals for each coefficient.
Verify that your results match those in Table 3.1.
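The following sketch shows what confint computes: each interval is the estimate plus or minus a critical t value times the standard error. The data are simulated stand-ins (an assumption), so the intervals are illustrative only.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit = lm(EARNINGS ~ S + EXP)
ci = confint(fit, level = 0.95)          # 95% intervals for all three coefficients
print(ci)

# The same intervals by hand: estimate +/- t critical value * standard error
est = coef(summary(fit))[, "Estimate"]
se = coef(summary(fit))[, "Std. Error"]
tcrit = qt(0.975, df = df.residual(fit)) # n - K degrees of freedom
manual = cbind(est - tcrit*se, est + tcrit*se)
print(manual)
```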
3. Use the anova command to obtain the Analysis of Variance table.
Verify that your Residual Sum of Squares matches the result in Table 3.1.
To compare the Model Sum of Squares, note that you will have to sum the two separate values that R allocates to S and to EXP. We will have more to say about this shortly.
Verify that your three different sums of squares sum to the Total SS that appears in Table 3.1.
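The bookkeeping in this question can be sketched as follows, again with simulated stand-in data (an assumption). The row and column labels used to index the table are the ones R prints for anova applied to an lm fit.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit = lm(EARNINGS ~ S + EXP)
tab = anova(fit)
print(tab)

rss = tab["Residuals", "Sum Sq"]                 # Residual Sum of Squares
mss = tab["S", "Sum Sq"] + tab["EXP", "Sum Sq"]  # Model SS = sum of the two rows
tss = sum((EARNINGS - mean(EARNINGS))^2)         # Total SS computed directly
print(c(model = mss, residual = rss, total = tss))
```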
4. Now re-estimate your model, by changing the lm command. If you typed S+EXP, then you should now type EXP+S for the right-hand side variables.
Verify that nothing changes about your model, except the sums of squares in the anova command.
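A small sketch of this comparison, using simulated stand-in data in which S and EXP are correlated (an assumption). R's anova reports sequential sums of squares, so the amount credited to each variable depends on the order in which it enters.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))   # correlated with S by construction
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit_a = lm(EARNINGS ~ S + EXP)
fit_b = lm(EARNINGS ~ EXP + S)

# Same coefficients, once they are put in the same order
print(coef(fit_a))
print(coef(fit_b)[names(coef(fit_a))])

# Different sequential sums of squares for S, but the same total for the model
ss_a = anova(fit_a)[c("S", "EXP"), "Sum Sq"]
ss_b = anova(fit_b)[c("S", "EXP"), "Sum Sq"]
print(rbind(S_first = ss_a, EXP_first = ss_b))
```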
4. Our assumption in this regression (consistent with equation 3.1 in Dougherty) is that V(u_i) = σ².
library(foreign)
data = read.dta("/home/are106/EAWE21.dta")
attach(data)
What is your estimate of this variance, using your results in question 1 above?
We use the Residual Sum of Squares, divided by its associated degrees of freedom, which will be n − K in general; here, K = 3 of course.
What would your estimate of this variance have been, using only β1 to predict EARNINGS (i.e., using an intercept-only model)?
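The two variance estimates can be sketched as below, with simulated stand-in data (an assumption). The first divides the Residual Sum of Squares by n − K with K = 3; the second uses an intercept-only model, where K = 1.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit = lm(EARNINGS ~ S + EXP)
K = length(coef(fit))                      # 3 parameters here
s2 = sum(resid(fit)^2) / (n - K)           # RSS / (n - K)
print(s2)
print(summary(fit)$sigma^2)                # R's residual standard error, squared

fit0 = lm(EARNINGS ~ 1)                    # intercept-only model
s2_0 = sum(resid(fit0)^2) / (n - 1)        # RSS / (n - 1)
print(s2_0)
print(var(EARNINGS))                       # the ordinary sample variance
```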
5. Suppose you believed that the true coefficient on S was 1.2, because you estimated it to be in that neighborhood using our model from Chapter 2 that did not include EXP.
Using your new results from Chapter 3, would you reject the hypothesis that the true coefficient on schooling is 1.2?
Would you reject the hypothesis that the coefficient on EXP is zero?
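One way to set up these tests is sketched below with simulated stand-in data (an assumption), using a two-sided test at the 5 percent level (also an assumption; choose your own significance level). The t-statistic that summary prints tests a zero coefficient, so a null value of 1.2 has to be subtracted by hand.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit = lm(EARNINGS ~ S + EXP)
coefs = coef(summary(fit))
df = df.residual(fit)

# H0: coefficient on S equals 1.2
t_S = (coefs["S", "Estimate"] - 1.2) / coefs["S", "Std. Error"]
tcrit = qt(0.975, df)
p_S = 2 * pt(-abs(t_S), df)
print(c(t = t_S, critical = tcrit, p = p_S))
reject_S = abs(t_S) > tcrit

# H0: coefficient on EXP equals 0 (this is the t value that summary reports)
t_EXP = coefs["EXP", "Estimate"] / coefs["EXP", "Std. Error"]
reject_EXP = abs(t_EXP) > tcrit
print(c(reject_S = reject_S, reject_EXP = reject_EXP))
```

Equivalently, the first hypothesis is rejected exactly when 1.2 lies outside the 95 percent confidence interval for the coefficient on S.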
6. This is something worth doing once and then probably never again.
a. Dougherty provides a new formula for β̂2 on page 160. Program this formula yourself and verify, by comparing the result to Table 3.1, that his formula is correct and that it gives you the same estimated coefficient as your lm command above.
b. Reverse the roles of X2 (arbitrarily chosen to be S) and X3 (arbitrarily chosen to be EXP) and show that you also obtain Dougherty’s β̂3 using the formula on page 160.
c. What would happen to these two estimates if the sample covariance between the two variables were actually equal to zero? (This is not an R question; it requires considering the formulas for β̂2 and β̂3.)
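For orientation, here is a sketch of the kind of calculation parts a and b ask for. It assumes the page 160 formula is the standard two-regressor expression written with sample variances and covariances (check this against the book, which may write it with sums of deviations), and two_regressor_slope is a hypothetical helper name. The data are simulated stand-ins (an assumption).

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

# Hypothetical helper: slope on x2 in a regression of y on x2 and x3
two_regressor_slope = function(y, x2, x3) {
  num = cov(x2, y) * var(x3) - cov(x3, y) * cov(x2, x3)
  den = var(x2) * var(x3) - cov(x2, x3)^2
  num / den
}

b2 = two_regressor_slope(EARNINGS, S, EXP)   # S plays the role of X2
b3 = two_regressor_slope(EARNINGS, EXP, S)   # roles reversed: EXP plays X2
fit = lm(EARNINGS ~ S + EXP)
print(c(b2 = b2, b3 = b3))
print(coef(fit))
```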
7. Create variables corresponding to û_i and ŷ_i, using y to denote EARNINGS.
Verify that the sum of squared û_i values equals the Residual Sum of Squares in Table 3.1.
Plot your predicted values against years of schooling, and then, in a separate plot, against years of experience.
Give a brief interpretation for each plot.
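The steps in this question can be sketched as follows, with simulated stand-in data (an assumption). The names uhat and yhat are just illustrative choices.

```r
# Simulated stand-in for the EAWE21 variables (an assumption, not the real data)
set.seed(1)
n = 200
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))
EARNINGS = -10 + 2*S + 1*EXP + rnorm(n, sd = 10)

fit = lm(EARNINGS ~ S + EXP)
uhat = resid(fit)     # residuals
yhat = fitted(fit)    # predicted values

rss_direct = sum(uhat^2)
rss_anova = anova(fit)["Residuals", "Sum Sq"]
print(c(direct = rss_direct, anova = rss_anova))

# Predicted values against each explanatory variable, in separate plots
plot(S, yhat, xlab = "Years of schooling", ylab = "Predicted EARNINGS")
plot(EXP, yhat, xlab = "Years of work experience", ylab = "Predicted EARNINGS")
```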
8. I would like to perform a simulation experiment like the one we did in Question 2 of Homework 3, but I would like to use realistic data.
Here is a modified version of that old program that uses our EAWE21 data set to show that OLS estimates are unbiased, when we estimate a three-parameter model.
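set.seed(4)
# Fit the three-parameter model and treat its estimates as the "true" values
model = lm(EARNINGS~S+EXP)
coefs = coefficients(model)
beta1 = coefs[1]
beta2 = coefs[2]
beta3 = coefs[3]
model_summary = summary(model)
uhat1 = model_summary$residuals
sigma = sqrt(sum(uhat1^2)/(497))  # RSS/(n - K), with n = 500 and K = 3
n = 500; nreps = 10000
# Storage for the simulated samples and the estimates from each replication
ys = matrix(-9999,n,nreps)
b1s = matrix(-9999,nreps,1)
b2s = matrix(-9999,nreps,1)
b3s = matrix(-9999,nreps,1)
for (i in 1:nreps) {
  # Generate a new sample of y values, then re-estimate the model
  ys[,i] = beta1 + beta2*S + beta3*EXP + sigma*rnorm(n)
  model1 = lm(ys[,i]~S+EXP)
  b1s[i] = model1$coefficients[1]
  b2s[i] = model1$coefficients[2]
  b3s[i] = model1$coefficients[3]
}
mean(b1s)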
## [1] -14.66225
var(b1s)
## [,1]
## [1,] 18.43267
hist(b1s)
[Figure: Histogram of b1s (x-axis: b1s; y-axis: Frequency)]
mean(b2s)
## [1] 1.877419
var(b2s)
## [,1]
## [1,] 0.05026629
hist(b2s)
[Figure: Histogram of b2s (x-axis: b2s; y-axis: Frequency)]
mean(b3s)
## [1] 0.9837542
var(b3s)
## [,1]
## [1,] 0.04388279
hist(b3s)
[Figure: Histogram of b3s (x-axis: b3s; y-axis: Frequency)]
Modify the program to show that when the correct model is the one from Chapter 3, but you erroneously assume that the Chapter 2 model with S included and EXP excluded is correct, you will obtain biased estimates for your β1 and β2.
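To fix ideas, here is a small self-contained sketch of the comparison, not a modification of the program itself. The "true" parameter values, the sample size, and the simulated S and EXP are all assumptions chosen for illustration, and only the slope on S is tracked. The last line uses the fact that the average short-regression slope should sit near beta2 plus beta3 times the slope from regressing EXP on S.

```r
# Hypothetical setup: simulated regressors and made-up "true" parameters
set.seed(4)
n = 200; nreps = 2000
S = sample(8:20, n, replace = TRUE)
EXP = pmax(0, 22 - S + rnorm(n, sd = 3))   # correlated with S by construction
beta1 = -10; beta2 = 2; beta3 = 1; sigma = 10

b2_long = numeric(nreps)    # slope on S when EXP is included
b2_short = numeric(nreps)   # slope on S when EXP is wrongly omitted
for (i in 1:nreps) {
  y = beta1 + beta2*S + beta3*EXP + sigma*rnorm(n)
  b2_long[i] = coef(lm(y ~ S + EXP))[2]
  b2_short[i] = coef(lm(y ~ S))[2]
}

d = unname(coef(lm(EXP ~ S))[2])           # slope of EXP on S in this sample
print(c(true = beta2, long = mean(b2_long), short = mean(b2_short)))
print(beta2 + beta3*d)                     # where the short-model average should sit
```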
Here are a few problems you should be familiar with before the next exam: 3.1, 3.9, 3.12, 3.15, 3.17, 3.18, 3.19, 3.22.