
Problem Set #7

Problem 1. (Theory for Linear Panel Data Models)

Consider the linear panel data model with individual-specific ‘fixed’ effects:

𝑦𝑖𝑡 = 𝛼𝑖 + 𝛽𝑥𝑖𝑡 + 𝑢𝑖𝑡

Where 𝑖 = 1,2, … , 𝑁 indexes individuals/firms/states/etc and 𝑡 = 1, … , 𝑇 indexes time periods.

Suppose that the error terms 𝑢𝑖𝑡 are “strictly exogenous for given i”, i.e., 𝐸(𝑢𝑖𝑡|𝑥𝑖1, 𝑥𝑖2, . . . , 𝑥𝑖𝑇) = 0.

a) Suppose we run a ‘pooled OLS’ regression, i.e., ignore 𝛼𝑖 and run a regression of 𝑦𝑖𝑡 on 𝑥𝑖𝑡 using

the full sample of NT observations. Note that, if we do this, the ‘error term’ in this regression is

v_it = α_i + u_it. Show that the slope estimator β̂_OLS is consistent if and only if Cov(α_i, x_it) = 0.

b) Define first differences as ∆𝑦𝑖𝑡 = 𝑦𝑖𝑡 − 𝑦𝑖𝑡−1 (and similarly for ∆𝑥𝑖𝑡 and ∆𝑢𝑖𝑡). Show that

∆𝑦𝑖𝑡 = 𝛽∆𝑥𝑖𝑡 + ∆𝑢𝑖𝑡,

and that Cov(∆x_it, ∆u_it) = 0. (So the slope estimate from regressing ∆y_it on ∆x_it is consistent.)

Consider the ‘within transformation’

ȳ_i = (1/T) ∑_{t=1}^T y_it        x̄_i = (1/T) ∑_{t=1}^T x_it        ū_i = (1/T) ∑_{t=1}^T u_it

ÿ_it = y_it − ȳ_i        ẍ_it = x_it − x̄_i        ü_it = u_it − ū_i

As we said in class, the slope coefficient from a regression of ÿ_it on ẍ_it has become known in

econometrics as β̂_FE, the ‘fixed effects estimator’.

c) Using the definitions above, show that ȳ_i = α_i + βx̄_i + ū_i.

d) Explain why the ‘between estimator’, i.e. the slope coefficient β̂_BE from a regression of ȳ_i on x̄_i,

has the same problem we saw for β̂_OLS in part a).

e) Show using the definitions above that ÿ_it = βẍ_it + ü_it. In other words, by writing the model in

terms of within-transformed y and x, we eliminate α_i from the model. (This implies that β̂_FE is

consistent even when Cov(α_i, x_it) ≠ 0!)

f) Suppose that the original error terms 𝑢𝑖𝑡 are homoscedastic and serially uncorrelated, i.e.

Var(u_it | X) = σ_u²        Cov(u_it, u_is | X) = 0 for any t ≠ s

Again using the definitions above, what is Cov(ü_it, ü_is | X)?

(Hint: It is not zero! This is why, when using fixed effects estimators, we always employ

‘clustered’ standard errors, which allow correlation across time periods within a given i.

Important: You should be using ,cluster() as an option every time with both

first differences and xtreg ,fe, including on all remaining problems in this problem set!)
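
For concreteness, here is a minimal sketch of what this looks like in Stata, assuming a panel that has already been declared with xtset and whose cross-sectional identifier is called id (the names y, x, and id are just placeholders):

* first-differences regression with standard errors clustered by panel unit
reg d.y d.x, cluster(id)

* fixed effects regression with clustered standard errors
xtreg y x, fe cluster(id)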

Problem 2. (Monte Carlo for Linear Panel Data Estimators)

Download the file fesim.do from the course website. The idea in this problem is to use Monte Carlo

simulation to “check” our theoretical results from the previous problem.

Input the program pOLS into memory either by doing one of the following:

• Run the entire .do file (the simple but slow option),

• Highlight lines 11-22 of the .do file and click ‘Execute (do)’ on the upper right of the Do-

file editor window, or

• Copy and paste lines 11-22 of the .do file into the Command window and press Enter.

a) Run the program pOLS once. What is the sample correlation between x_it and u_it? Do you

expect that the pooled OLS estimator will be consistent (cf. Problem 1a)?

Now run the .do file fesim.do. This compares three different estimators for the slope parameter 𝛽 in a

panel with N=200 cross-sectional observations over T=3 time periods. Note that in the simulation, the

error terms 𝑢𝑖𝑡 are actually iid (in particular they have no time dependence). Give brief answers to the

following questions (I really just want you to run the simulation and spend some time thinking).

b) Explain how you can tell from the simulation that the pooled OLS estimator is biased while the

FE and FD estimators are (approximately) unbiased (cf. Problems 1b, 1e). How can you tell

based on these results if the magnitude of the bias is likely to affect our conclusions in practice?

c) Compare the standard errors of the three estimators of 𝛽. Is the most precise estimator the one

you would want to use in practice?

d) Explain how you can tell that White standard errors (,robust) do not correctly capture

sampling uncertainty for the FD and FE estimators (in the latter case this happens because of

what we did in Problem 1f). For which of the two estimators is the problem worse?

e) The last two programs FErobust and FEcluster compute the fixed effects estimator in two

different ways.1 Explain how you can tell that the clustered standard errors (used in FEcluster) fix

the problem you noticed in part d).

1 This ended up being necessary because, due to the same issue we investigated in Problems 1f and 2d, Stata
actually automatically clusters by panel id when you use xtreg. For any two variables y and x the output of

xtreg y x, fe robust is exactly the same as xtreg y x, fe cluster(panelvar) where

panelvar is the cross-sectional identifier you used with the xtset command. You should try this and see for

yourself!! I ended up having to compute the FE estimator ‘manually’ using the within transformation as shown in
the notes so we could compare clustered standard errors to the heteroskedasticity robust (White) version we
normally use with reg, robust. This is also why I don’t bother making a 4th histogram at the end – the last

program computes the same estimator as the 3rd, but uses a different standard error formula (clustering).

Problem 3. (Tricks for Working with Panel Data in Stata)

In this problem we’ll study a traffic data set that’s slightly different from the one we worked with in

class. The goal is to study the impact of three different types of laws on traffic fatalities. We want to

estimate the model:

𝑑𝑡ℎ𝑟𝑡𝑒𝑖𝑡 = 𝛼𝑖 + 𝛽1𝑎𝑑𝑚𝑛𝑖𝑡 + 𝛽2𝑜𝑝𝑒𝑛𝑖𝑡 + 𝛽3𝑠𝑝𝑒𝑒𝑑𝑖𝑡 + 𝑢𝑖𝑡

The dependent variable 𝑑𝑡ℎ𝑟𝑡𝑒𝑖𝑡 is traffic accident fatalities per 100 million miles of road in state i

during year t. The variables admn, open, and speed are dummy variables indicating whether state i had

three different laws in place during year t:

admn Permits revocation of drivers license during traffic stops without trial

(mostly used for DUI stops, similar to what we talked about in class)

open Open container law (no open containers of alcohol in car when driving)

speed Statewide speed limit of 65mph in effect (including on interstates)

Load the data in traffic1.dta. This is a panel of N=51 cross-sectional units (the 50 U.S. states plus the District of Columbia) observed in T=2 years

(1985 and 1990). However, you should notice it’s formatted differently from the examples in class.

Instead of the ‘long’ format with N*T = 51*2 = 102 observations (one row per state-year), this dataset has only 51 observations (one row per state),

with one variable per time period. For example, dthrte85 gives the traffic fatality rate for each state in

1985, while dthrte90 contains fatality rates for 1990.

a) The first thing we need to do is reformat the data so that Stata’s (and most other software

packages’!) panel data commands will work. Enter the following command:
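
(The command itself did not survive in this copy of the handout. A plausible reconstruction, assuming the wide-form variables are named dthrte85/dthrte90, admn85/admn90, open85/open90, and speed85/speed90 with a string identifier state, is the reshape below; the exact command in the original handout may differ.)

* reshape from wide (one row per state) to long (one row per state-year)
reshape long dthrte admn open speed, i(state) j(year)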

Open the Data Editor and inspect your new data. Also notice we’ve created a new variable

called year. Use tabulate on this new variable and explain why the result makes sense.

b) As we saw in class, Stata’s panel commands require us to first declare our data as a panel using

the xtset command. The second problem with this data (this came up as a question in class) is

that the state variable is a string. (Try typing xtset state year; you get an error.) To fix

this, enter the following commands:
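
(These commands also did not survive in this copy. Given that the text below refers to a new numeric variable stateid, a plausible reconstruction is:)

* create a numeric state identifier from the string variable, then declare the panel
encode state, gen(stateid)
xtset stateid year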

Open the data editor again. Though the new variable stateid still looks like a string (the only

difference is that it’s blue instead of red), it is actually numeric. How can we tell the difference?

c) Estimate the model for 𝑑𝑡ℎ𝑟𝑡𝑒𝑖𝑡 we wrote down above. Compare four different estimators for

the slope coefficients 𝛽1, 𝛽2, and 𝛽3: Pooled OLS, Between Estimator, First Differences (FD), and

Fixed Effects (FE). Which two estimators give identical answers and why?
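
As a starting point, here is a minimal sketch of the four estimators in Stata, assuming the data have been xtset as in part b) (the comparison and explanation are up to you):

* pooled OLS
reg dthrte admn open speed, cluster(stateid)

* between estimator
xtreg dthrte admn open speed, be

* first differences
reg d.dthrte d.admn d.open d.speed, cluster(stateid)

* fixed effects
xtreg dthrte admn open speed, fe cluster(stateid)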

d) Explain why, if we’re trying to estimate causal impacts of these laws, we prefer the FE and FD

estimates to pooled OLS. (Give an answer in practical/economic terms specific to this problem.)

e) Type preserve and hit Enter. Drop 3 states from the sample: Arkansas (AK), Florida (FL),

and New Mexico (NM). Re-run the FE and FD regressions and notice that the variable open

“drops out” (Stata omits it from the model!). Type restore to get back the dropped states.

Why were we unable to estimate the impact of open container laws without these states?

Problem 4. (“Police Cause Crime”)

Load the file CRIME4.dta from the course website. This file contains panel data on crime rates in the

state of North Carolina for the years 1981 – 1987. We’ll work with the following four variables:

lcrmrte Log of crime rate (crimes committed per person) in county i during year t

lpolpc Log of size of police force per capita in county i during year t

density Population density (people per square mile) in county i during year t

west Dummy variable, =1 if county i is in the western half of the state (0 otherwise)

a) Run a pooled OLS regression of log crime rate on log police per capita. Explain what the slope

coefficient estimate means in an English sentence (include magnitude and be specific about

units/percentages). Also explain why interpreting this as a causal effect is a very bad idea.
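
A minimal sketch, assuming the panel identifier in CRIME4.dta is called county:

* pooled OLS of log crime rate on log police per capita
reg lcrmrte lpolpc, cluster(county)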

b) Make two histograms of density, one for counties in the western half of the state (west=1)

and another for counties in the central/eastern region (west=0). Comment on what you see.
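
A minimal sketch, assuming the variable names above:

* density distributions for western vs. central/eastern counties
histogram density if west == 1
histogram density if west == 0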

c) Use the separate command to generate two variables lcrmrte0 and lcrmrte1

containing log crime rates for only counties in the central/eastern and western regions,

respectively (we did something similar for murder rates in the file class-Oct29-panel.do). Make

a scatterplot showing the variables lcrmrte0 and lcrmrte1 on the vertical axis versus

lpolpc on the horizontal axis. Comment on what you see. How might this relate to part b)?
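
A sketch of one way to do this, assuming the variable names above (separate creates lcrmrte0 and lcrmrte1 automatically from the 0/1 variable west):

* split log crime rate by region, then plot both series against log police per capita
separate lcrmrte, by(west)
twoway (scatter lcrmrte0 lpolpc) (scatter lcrmrte1 lpolpc)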

d) Run a regression of log crime rates on log police per capita, this time using the fixed effects

estimator. Is your answer any better than part a)? Does including log(𝑑𝑒𝑛𝑠𝑖𝑡𝑦)𝑖𝑡 as an

additional control (regressor) help?
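
A minimal sketch, assuming the panel identifier is county and that log(density) must be created by hand (the name ldensity is a placeholder):

* fixed effects regression, with and without log population density as a control
xtreg lcrmrte lpolpc, fe cluster(county)
gen ldensity = log(density)
xtreg lcrmrte lpolpc ldensity, fe cluster(county)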

e) Run the same regression from part d) using west as an additional regressor. What happens?

f) Conveniently, this data set has the first differences clcrmrte and clpolpc already included.

Make a scatterplot showing these two variables. Can this scatterplot help explain your results

from part d)? (Note that while the FE and FD estimators don’t give the same answers, until T

gets very large the two will usually be fairly close. Problems with one are shared by the other.)

g) Enter the following commands in Stata:
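
(The two commands are missing from this copy. One plausible reconstruction, given that part h) asks you to find counties with |∆log(polpc)| larger than 1 in some year, is shown below; the names absdlpolpc and maxdlpolpc are placeholders, county is assumed to be the panel identifier, and the handout's actual commands may differ.)

* absolute change in log police per capita, and its maximum within each county
gen absdlpolpc = abs(clpolpc)
egen maxdlpolpc = max(absdlpolpc), by(county)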

Explain in English what these two new variables tell you (use the Data Editor!)

h) Using the variables from part f), drop any counties j that have |∆log(𝑝𝑜𝑙𝑝𝑐)𝑗𝑡| > 1 for any t.

Re-run the fixed effects regression from part d). Try this with and without log(𝑑𝑒𝑛𝑠𝑖𝑡𝑦)𝑖𝑡 as an

additional control. Also compare your answers with and without clustered standard errors.

Comment on your results.

(Important Note: This problem is a nice object lesson. Most classroom examples in statistics

and econometrics classes have one issue that, once fixed, gives us a nice neat answer. Real data

aren’t like that. There are multiple issues (individual-specific effects and outliers…) and once

we’ve addressed all of them, we very often don’t have enough variation left in our data to

conclude that our estimated effects are different from zero. Get used to this. Good

econometrics skills are well compensated precisely because working with real data is not easy!)

Problem 5. (Panel with Individual and Time Effects)

Load the file jtrain.dta from the course website. This is a sample of N=54 firms in T=3 years (1987-89).

This is a sample of manufacturing firms. For each firm, the variable lscrap is the logarithm of the

‘scrap rate’ per 100 units produced. This is the rate at which manufactured items fail quality control

tests at some point in the production process and must be discarded entirely. High scrap rates are

indicative of waste that both increases firm costs and negatively impacts the environment.

In 1988 and 89, a subset of these firms were given government grants to implement job training

programs. The idea here is that better trained workers will be less likely to make errors during

manufacturing and be better able to recognize and fix potential problems earlier in the process,

resulting (hopefully!) in lower scrap rates. Note that each firm received at most one grant (firms that

received grants in 1988 were ineligible in 1989).

a) For each of the three years, how many firms received grants? (Hint: Use tabulate.)

b) For the years 1988 and 1989 only, what are the average scrap rates for firms that received

grants versus firms that did not receive grants? Does this result suggest the grants were

effective? (Hint: tabulate grant if d88+d89, summarize(lscrap))

c) The variable clscrap is the change in log scrap rate versus the previous year. Repeat part b),

but this time find the average change in log scrap rates. Now do the grants appear effective?

d) Parts b) & c) seem to give contradictory answers. Can you give an explanation that reconciles

these two results? (Hint: Do you think grants were awarded at random… ?)

Consider the following model:

log(𝑠𝑐𝑟𝑎𝑝𝑖𝑡) = 𝛼𝑖 + 𝛿𝑡 + 𝛽1𝑔𝑟𝑎𝑛𝑡𝑖𝑡 + 𝛽2 log(𝑒𝑚𝑝𝑙𝑜𝑦𝑖𝑡) + 𝛽3 log(𝑠𝑎𝑙𝑒𝑠𝑖𝑡) + 𝑢𝑖𝑡 ,

where 𝑒𝑚𝑝𝑙𝑜𝑦𝑖𝑡 is number of employees and 𝑠𝑎𝑙𝑒𝑠𝑖𝑡 is annual sales in US$.

e) Estimate this model twice: The first time, use the reg command (i.e., ignoring 𝛼𝑖 or treating

them as uncorrelated with the regressors). The second time, use xtreg with the ,fe option

(so that 𝛼𝑖 drop out of the model). Explain how your results relate to parts b) – c).
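
A minimal sketch, assuming the panel has been xtset by the firm identifier fcode and that the logs of employ and sales need to be created first (if jtrain.dta already contains them, use those instead):

* pooled regression, ignoring the firm effects
gen lemploy = log(employ)
gen lsales = log(sales)
reg lscrap grant lemploy lsales, cluster(fcode)

* fixed effects regression, so the firm effects drop out
xtreg lscrap grant lemploy lsales, fe cluster(fcode)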

f) Enter the following in Stata:
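
(These commands are missing from this copy of the handout. They most likely look at how scrap rates move over time even for firm-years without a grant, since that is what the year dummies in part g) pick up; one hedged guess is below, but the handout's actual commands may be different.)

* average log scrap rate by year, restricted to firm-years without a grant
tabulate year if grant == 0, summarize(lscrap)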

Explain what we’re doing here and why the results might make you question whether the second

regression from part e) is causal.

g) Re-run the second regression from part e) including the dummy variables d88 and d89. What

happens to the coefficient on grant, and how is this related to part f)?

h) Re-run the regression from part g) including grant_1 as a regressor. This is the lagged value of

grant, assumed to be zero for all firms in 1987. Why is your answer different from part g)?

Overall, do these results suggest the grants were effective?