Econ 527 Assignment 11
The due date for this assignment is Friday December 4.
1. (A dummy variable IV) Consider a simple IV regression model with a single endogenous
regressor, no exogenous regressors, and a dummy IV:
$$Y_i = \beta_1 + \beta_2 E_i + U_i,$$
where $Y_i$ is the income of individual $i$ and $E_i$ is the years of education of individual $i$ (the endogenous regressor). The IV is the dummy variable for Quarter I: $Q_i = 1(i \text{ is born in Quarter I})$.
Consider the IV estimator of $\beta$:
$$\hat\beta = (Z'X)^{-1} Z'Y,$$
where $Z = [l \;\; Q]$, $X = [l \;\; E]$, and $l$ is the $n$-vector of ones.
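As a numerical sanity check on the estimator just defined, the following R sketch computes $(Z'X)^{-1}Z'Y$ on simulated data and, alongside it, the ratio of differences in group means between the two quarter-of-birth groups. The data-generating process, sample size, and all variable names are assumptions made for illustration; they do not come from any real dataset.

```r
# Minimal sketch on simulated data; all parameter values are hypothetical.
set.seed(1)
n <- 1000
Q <- rbinom(n, 1, 0.25)        # dummy IV: 1 if born in Quarter I
V <- rnorm(n)
U <- 0.5 * V + rnorm(n)        # error correlated with education (endogeneity)
E <- 12 - 0.5 * Q + V          # education shifts with the instrument
Y <- 1 + 0.1 * E + U           # structural equation

ones <- rep(1, n)
Z <- cbind(ones, Q)            # instrument matrix [l Q]
X <- cbind(ones, E)            # regressor matrix  [l E]

# IV estimator: solve (Z'X) b = Z'Y
beta_iv <- drop(solve(crossprod(Z, X), crossprod(Z, Y)))
iv_slope <- unname(beta_iv[2])

# Ratio of differences in group means (Quarter I vs. Quarters II-IV)
wald <- (mean(Y[Q == 1]) - mean(Y[Q == 0])) /
        (mean(E[Q == 1]) - mean(E[Q == 0]))

print(c(iv_slope = iv_slope, wald_ratio = wald))
```

The two printed numbers can be compared directly; a check like this does not replace the algebraic argument requested in part (a).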
(a) Show that $\hat\beta_2$, the IV estimator of $\beta_2$, can be written as
$$\hat\beta_2 = \frac{\bar Y_1 - \bar Y_0}{\bar E_1 - \bar E_0}, \qquad (1)$$
where $\bar E_1$ is the average education of the individuals born in Quarter I:
$$\bar E_1 = \frac{\sum_{i=1}^n Q_i E_i}{\sum_{i=1}^n Q_i},$$
$\bar E_0$ is the average education of the individuals born in Quarters II-IV:
$$\bar E_0 = \frac{\sum_{i=1}^n (1 - Q_i) E_i}{\sum_{i=1}^n (1 - Q_i)},$$
and $\bar Y_1$ and $\bar Y_0$ are defined similarly to $\bar E_1$ and $\bar E_0$ (respectively), but with education $E_i$ replaced with income $Y_i$.
(b) In view of the expression in (1), discuss the condition needed for the relevance of the IV.
2. (Sargan’s test: The overidentifying restrictions test under homoskedasticity) In the IV regression model
$$Y_i = X_i'\beta + U_i, \quad \beta \in \mathbb{R}^k,$$
with an $l$-vector of IVs $Z_i$, where $l > k$, the overidentifying restrictions test is designed to test exogeneity of the IVs. The null hypothesis is given by $H_0: E Z_i U_i = 0$, and it is tested against the alternative $H_1: E Z_i U_i \neq 0$. Assume that errors are homoskedastic:
$$E(U_i^2 \mid Z_i) = \sigma^2.$$
In the homoskedastic case, the overidentifying restrictions test is known as Sargan’s test and is based on Sargan’s statistic
$$S_n = \hat U' P_Z \hat U / \hat\sigma_n^2,$$
where
$$P_Z = Z(Z'Z)^{-1}Z',$$
the vector of fitted residuals
$$\hat U = Y - X\hat\beta_n$$
is constructed using the 2SLS estimator $\hat\beta_n$:
$$\hat\beta_n = (X'P_Z X)^{-1} X'P_Z Y,$$
and
$$\hat\sigma_n^2 = n^{-1}\hat U'\hat U$$
is an estimator of $\sigma^2$. Note that, similarly to the OLS case, $\hat\sigma_n^2 \to_p \sigma^2$, provided that $\hat\beta_n \to_p \beta$.
In this question, you will show that under $H_0: E Z_i U_i = 0$,
$$S_n \to_d \chi^2_{l-k}.$$
Recall that $l$ is the number of variables in $Z_i$, and $k$ is the number of variables in $X_i$. Hence, the test can only be used when the model is overidentified: there are more IVs than endogenous regressors.
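To make the definitions of $P_Z$, $\hat\beta_n$, $\hat U$, $\hat\sigma_n^2$, and $S_n$ concrete, here is a minimal base-R sketch that evaluates them on simulated data with $k = 2$ regressors and $l = 3$ instruments. The design (coefficients, sample size, variable names) is hypothetical, and the p-value line simply uses the $\chi^2_{l-k}$ limit stated above.

```r
# Minimal sketch on simulated data; the design below is an assumption.
set.seed(2)
n  <- 500
Z1 <- rnorm(n); Z2 <- rnorm(n)
V  <- rnorm(n)
U  <- 0.6 * V + rnorm(n)          # U is independent of Z1, Z2, so H0 holds
X1 <- 0.8 * Z1 + 0.5 * Z2 + V     # endogenous regressor
Y  <- 1 + 2 * X1 + U

X <- cbind(1, X1)                 # n x k
Z <- cbind(1, Z1, Z2)             # n x l
k <- ncol(X); l <- ncol(Z)

# P_Z X, computed without forming the n x n matrix P_Z
PZX <- Z %*% solve(crossprod(Z), crossprod(Z, X))

# 2SLS: (X'P_Z X)^{-1} X'P_Z Y
beta_2sls <- solve(crossprod(PZX, X), crossprod(PZX, Y))

U_hat  <- Y - X %*% beta_2sls     # fitted residuals
sigma2 <- sum(U_hat^2) / n        # n^{-1} U_hat'U_hat

# U_hat'P_Z U_hat = (Z'U_hat)'(Z'Z)^{-1}(Z'U_hat)
ZU  <- crossprod(Z, U_hat)
S_n <- drop(t(ZU) %*% solve(crossprod(Z), ZU)) / sigma2

p_value <- pchisq(S_n, df = l - k, lower.tail = FALSE)
print(c(S_n = S_n, df = l - k, p_value = p_value))
```

Working with `crossprod()` instead of an explicit $n \times n$ projection matrix keeps the computation cheap for large $n$; the result is the same quantity.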
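(a) Using the definitions of $\hat U$ and the 2SLS estimator $\hat\beta_n$, as well as the matrix version of the regression equation $Y = X\beta + U$, show that
$$\hat U = \left(I_n - X(X'P_Z X)^{-1}X'P_Z\right)Y = \left(I_n - X(X'P_Z X)^{-1}X'P_Z\right)U.$$
(b) Using the result in (a), show that
$$\hat U'P_Z\hat U = U'\left(P_Z - P_Z X(X'P_Z X)^{-1}X'P_Z\right)U.$$
(c) Using the result in (b), show that
$$\hat U'P_Z\hat U = \left(\frac{U'Z}{\sqrt n}\right) B_n \left(\frac{Z'U}{\sqrt n}\right),$$
where
$$B_n = \left(\frac{Z'Z}{n}\right)^{-1} - \left(\frac{Z'Z}{n}\right)^{-1}\frac{Z'X}{n}\left(\frac{X'Z}{n}\left(\frac{Z'Z}{n}\right)^{-1}\frac{Z'X}{n}\right)^{-1}\frac{X'Z}{n}\left(\frac{Z'Z}{n}\right)^{-1}.$$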
(d) Suppose that data are iid, and
$$\Sigma = E Z_i Z_i'$$
is a finite and positive definite $l \times l$ matrix. Assume further that
$$Q = E Z_i X_i'$$
is a finite $l \times k$ matrix of rank $k$. Show that
$$B_n \to_p \Sigma^{-1} - \Sigma^{-1}Q\left(Q'\Sigma^{-1}Q\right)^{-1}Q'\Sigma^{-1}.$$
(e) Assume that $H_0: E Z_i U_i = 0$ is true and that $\sigma^2 > 0$. Show that
$$\frac{Z'U}{\sqrt n} \to_d N(0, \sigma^2\Sigma) = \sigma\Sigma^{1/2}\mathcal{Z},$$
where the $l$-vector $\mathcal{Z}$ has a standard normal distribution: $\mathcal{Z} \sim N(0, I_l)$.
(f) Assume that $\hat\sigma_n^2 \to_p \sigma^2$. Using the results in (c)-(e), show that under $H_0: E Z_i U_i = 0$,
$$S_n \to_d \mathcal{Z}'\left(I_l - \Sigma^{-1/2}Q\left(Q'\Sigma^{-1}Q\right)^{-1}Q'\Sigma^{-1/2}\right)\mathcal{Z}.$$
(g) Show that the matrix $I_l - \Sigma^{-1/2}Q\left(Q'\Sigma^{-1}Q\right)^{-1}Q'\Sigma^{-1/2}$ is symmetric and idempotent.
(h) Show that
$$\operatorname{trace}\left(I_l - \Sigma^{-1/2}Q\left(Q'\Sigma^{-1}Q\right)^{-1}Q'\Sigma^{-1/2}\right) = l - k.$$
(i) Using the results in (f)-(h) and the result for the distribution of quadratic forms in Lecture 4, show that under $H_0: E Z_i U_i = 0$,
$$S_n \to_d \chi^2_{l-k}.$$
(j) Based on the result in (i), when should the econometrician reject $H_0: E Z_i U_i = 0$? I.e., describe the overidentifying restrictions test. Find the asymptotic size of the proposed test.
(k) Find the exact value of $S_n$ when $l = k$. Does it depend on the null hypothesis $H_0: E Z_i U_i = 0$ being true or false? Hint: Show first that the 2SLS estimator simplifies to the IV estimator $(Z'X)^{-1}Z'Y$ when $l = k$.
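The limiting result in (i) can also be explored by simulation. The sketch below draws repeated samples from a hypothetical design in which the instruments are valid and the errors are homoskedastic, computes $S_n$ in each sample, and compares the simulated statistics with the $\chi^2_{l-k}$ distribution. The helper `sargan_stat()`, the design, the number of replications, and the level `alpha` are all assumptions made for illustration, and comparing $S_n$ with an upper $\chi^2_{l-k}$ quantile is only one natural way to use the result in (i).

```r
# Hypothetical helper: Sargan statistic for Y (n-vector), X (n x k), Z (n x l)
sargan_stat <- function(Y, X, Z) {
  PZX   <- Z %*% solve(crossprod(Z), crossprod(Z, X))   # P_Z X
  b     <- solve(crossprod(PZX, X), crossprod(PZX, Y))  # 2SLS estimate
  U_hat <- Y - X %*% b
  ZU    <- crossprod(Z, U_hat)
  drop(t(ZU) %*% solve(crossprod(Z), ZU)) / (sum(U_hat^2) / nrow(X))
}

set.seed(3)
n <- 500; reps <- 2000; alpha <- 0.05

# Simulated design with valid IVs: k = 2 regressors, l = 4 instruments
stats <- replicate(reps, {
  Z1 <- rnorm(n); Z2 <- rnorm(n); Z3 <- rnorm(n)
  V  <- rnorm(n)
  U  <- 0.6 * V + rnorm(n)                    # independent of the IVs
  X1 <- 0.8 * Z1 + 0.5 * Z2 + 0.5 * Z3 + V    # endogenous regressor
  Y  <- 1 + 2 * X1 + U
  sargan_stat(Y, cbind(1, X1), cbind(1, Z1, Z2, Z3))
})

df          <- 4 - 2                          # l - k
crit        <- qchisq(1 - alpha, df)          # upper-alpha chi-square quantile
reject_rate <- mean(stats > crit)             # share of samples above crit
mean_stat   <- mean(stats)                    # a chi-square(df) has mean df

print(c(reject_rate = reject_rate, mean_stat = mean_stat, df = df))
```

With valid instruments, the share of simulated statistics above the critical value should be close to `alpha`, and their average close to `df`, up to simulation noise and finite-sample error.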
3. (a) An Excel file “Mroz.xls” contains, for each individual, data on wages and log wages (wage and lwage), education (educ), experience (exper), experience squared (expersq), as well as parents’ education (fatheduc and motheduc). Import the file into R. Exclude observations with missing wage. Hints:
• An R package “readxl” includes the command read_excel() for reading XLS files. E.g., use data=read_excel("Mroz.xls",na="."). The second option indicates that missing values are coded as “.”.
• Make sure that the Excel file is located in the working directory or navigate to the directory with the file.
• To drop the observations with missing values for the wage variable, you can use my_data=subset(data,is.na(wage)==FALSE).
• You can alternatively use the read.csv() command and the “Mroz.csv” file.
(b) Estimate the following model by OLS:
$$\text{lwage} = \beta_1 + \beta_2\,\text{educ} + \beta_3\,\text{exper} + \beta_4\,\text{expersq} + U.$$
Report the estimated return to education. Report the homoskedastic standard errors, as well as the heteroskedasticity-robust standard errors. Does the model appear to be heteroskedastic?
(c) Estimate the model in (b) by 2SLS using fatheduc and motheduc as IVs to instrument educ. Report the IV estimate of the return to education, and compare it with the result in part (b). Does the difference between the OLS and IV estimates make sense? Report and compare the homoskedastic and heteroskedasticity-robust standard errors. Does the model appear to be heteroskedastic? Compare the standard errors for educ in the OLS and IV models. Hints:
• The 2SLS estimator can be computed using the ivreg() command in package AER: ivreg(lwage~educ+exper+expersq | fatheduc+motheduc+exper+expersq). Note that the option "| fatheduc+motheduc+exper+expersq" is used to indicate the IVs and exogenous regressors.
• The heteroskedasticity-robust standard errors can be computed using the command robust.se() from an R package “ivpack”.
(d) To check whether the IVs fatheduc and motheduc are related to the endogenous regressor educ, estimate by OLS the first-stage equation:
$$\text{educ} = \pi_1 + \pi_2\,\text{fatheduc} + \pi_3\,\text{motheduc} + \pi_4\,\text{exper} + \pi_5\,\text{expersq} + V.$$
Report homoskedastic and heteroskedasticity-robust standard errors. Test $H_0: \pi_2 = \pi_3 = 0$. Are the IVs related to the endogenous regressor?
(e) Explain why the IVs fatheduc and motheduc may be invalid due to correlation with the error term U in the equation in part (b). Perform Sargan’s test of IV validity. Do the IVs pass Sargan’s test? Hint:
• Suppose an R object m_iv is generated using the ivreg() command. Sargan’s test can be performed using the option diagnostics=TRUE in the summary() command: summary(m_iv,diagnostics=TRUE).
(f) When the regressors are exogenous and the IVs are valid, the OLS and IV estimators should be very close to each other. Hence, exogeneity of the regressors (educ in this case) can be tested by comparing the OLS and IV estimates. This is known as the Hausman test (or the Durbin-Wu-Hausman test). The test can be performed by including the first-stage residuals in the main (structural) equation. Consider the structural and first-stage equations:
$$Y_i = X_i'\beta + U_i, \qquad X_i = \Pi Z_i + V_i,$$
where $Z_i$ are the exogenous variables (IVs and exogenous regressors): $E Z_i U_i = 0$. Since $Z_i$ is uncorrelated with $U_i$, the regressors $X_i$ can be endogenous only if $V_i$ and $U_i$ are correlated. Let $\hat V_i$ denote the fitted first-stage residuals. The Hausman test can be performed by regressing $Y_i$ against $X_i$ and $\hat V_i$, and testing that the coefficients on $\hat V_i$ are zero. If $X_i$ is endogenous, $V_i$ and $U_i$ are correlated, and therefore $\hat V_i$ will have explanatory power for $Y_i$.
Perform the Hausman test. Does educ appear to be endogenous? Should the econometrician use OLS or IV estimation? Hint:
• In R, the Hausman test is performed by specifying diagnostics=TRUE in the summary() command.
(g) Perform the Hausman test directly by adding the first-stage residual into the OLS regression in (b) and testing that the coefficient on the first-stage residual is zero. Compare with the output from the summary() command with the option diagnostics=TRUE. Compare the estimates of the coefficients on educ, exper, and expersq with the 2SLS estimates from the IV regression in part (c).
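The regression-based version of the Hausman test described in (f) and (g) can be sketched in base R as follows. The example uses simulated data rather than the Mroz file, so every variable name and parameter value is hypothetical; the t-statistic shown relies on the default homoskedastic standard errors from lm().

```r
# Minimal sketch on simulated data; the design below is an assumption.
set.seed(4)
n  <- 800
Z1 <- rnorm(n); Z2 <- rnorm(n)     # two excluded IVs
W  <- rnorm(n)                     # one exogenous regressor
V  <- rnorm(n)
U  <- 0.7 * V + rnorm(n)           # U and V correlated: X1 is endogenous
X1 <- 0.6 * Z1 + 0.4 * Z2 + 0.3 * W + V
Y  <- 1 + 0.5 * X1 + 0.2 * W + U

# First stage: endogenous regressor on all exogenous variables
first_stage <- lm(X1 ~ Z1 + Z2 + W)
V_hat <- resid(first_stage)

# Augmented regression: structural equation plus the first-stage residual
aug <- lm(Y ~ X1 + W + V_hat)
hausman_t <- coef(summary(aug))["V_hat", "t value"]

# 2SLS point estimates via two explicit OLS stages.
# The standard errors reported by this second-stage lm() are NOT valid
# 2SLS standard errors; only the coefficients are used here.
X1_hat <- fitted(first_stage)
second_stage <- lm(Y ~ X1_hat + W)

coef_aug  <- unname(coef(aug)[c("(Intercept)", "X1", "W")])
coef_2sls <- unname(coef(second_stage)[c("(Intercept)", "X1_hat", "W")])

print(hausman_t)
print(rbind(augmented = coef_aug, two_stage = coef_2sls))
```

The printed rows place the coefficients from the augmented regression next to the two-stage point estimates, which is the comparison part (g) asks for. With the Mroz data, the same steps would use lwage, educ, exper, expersq, fatheduc, and motheduc in place of the simulated variables.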