Lecture 5 and 6: Cross Section and Panel Data methods
Rigissa Megalokonomou University of Queensland
Reading for lecture 5
In the main textbook (Wooldridge 2013): Chapter 13
Mostly Harmless Econometrics (Angrist and Pischke): Chapters 3 and 5
Additional Reading:
Microeconometrics: Methods and Applications (Cameron and Trivedi): Chapter 25.4
Pooled Cross Section
Including Time Effects
Interacting Variables with Time Dummies
Differences in Differences (DiD) Estimator
Differences in Differences in Differences (DiDiD) Estimator
Panel Data
Pooled OLS
First Differences
Fixed Effects (FE) Estimator
Random Effects (RE) Estimator
Hausman Test
1. What is a Pooled Cross Section?
Some data sets have both cross-sectional and time series features.
For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990.
In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on.
In 1990, a new random sample of households is taken using the same survey questions.
In order to increase our sample size, we can form a pooled cross section by combining the two years.
Pooled Cross Section Vs Panel Data
Pooling cross sections from different years is usually helpful, since it increases the sample size and we get more precise estimates and test statistics with more power.
Because random samples are taken in each year, it would be a fluke if the same household appeared in the sample during both years.
These should be independently sampled observations.
This important factor distinguishes a pooled cross section from a panel (or longitudinal) data set.
In panel data sets we follow the same individuals, families, firms, cities, states, etc. across time.
Example: Pooled Cross Section
The idea here is to collect data from the years before and after a key policy change.
Consider the following data set on housing prices taken in 1993 and 1995, when there was a reduction in property taxes in 1994.
Suppose we have data on 250 houses for 1993 and on 270 houses for 1995.
We have data for 1 year before and 1 year after the policy change.
One way to store such a data set is:
Source: Wooldridge, Introductory Econometrics, A Modern Approach, Chapter 13
1.2. Including Time Effects (Dummies) in Pooled Cross Sections
Typically, if we believe that the population may have different distributions in different time periods, we allow the intercept to differ across periods (usually years).
This is easily accomplished by including dummy variables for the time periods in the regression model.
If we have T years of data, we include dummies for T−1 of the years and choose one year (often the earliest) to be the base year.
An F test of the joint significance of the year dummies can be used to test whether the intercept changes over time.
EXAMPLE: U.S. Women’s Fertility during 1972-1984
Suppose we are interested in the following questions:
After controlling for other observable factors, what has happened to fertility over time?
How much does education affect the number of children born in each family? What about age, ethnicity, etc.?
How much of the fall in average fertility cannot be explained by changes in observed factors, including education?
Here we require a pooled cross section (PCS) and look at the coefficients on the year dummies.
For a generic unit in the population we can write an equation:
$fertility = \beta_0 + \delta_1 d2 + \delta_2 d3 + \dots + \delta_{T-1} dT + \beta X + u$ (1)
where d2, d3, …, dT are time dummies and X includes edu, age, age², ethnicity dummies, characteristics of residential area controls, etc.
What happens to the average fertility rate over this period (from 1972 to 1984)?
So far we assumed the effect of education (and all other variables) was the same over time!
Interacting Variables with Time Dummies
We can easily allow the slopes to change over time by forming interactions and adding them to the model.
As always with interactions, we must be careful in interpreting the level terms, including those on the year dummies.
Is the effect of education constant over time?
The joint test for all interactions with educ gives a p-value of 0.318, so we cannot reject the null that the effect of education has been constant. However, fertility seems to have become more sensitive to education in the last years of the data (1982 and 1984).
Policy Analysis with Pooled Cross Section (PCS)
Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy.
With a PCS, often a goal is to see how the mean value of a variable has changed over time.
From a policy perspective, PCSs are at the foundation of Difference-in-Differences (DID) estimation.
A typical DID setup is that data can be collected both before and after an intervention (or "treatment"), and there is (at least) one "control group" and (at least) one "treatment group".
Often the intervention is of a yes/no form. But other non-binary treatments (such as class size) can be handled, too.
Pooled Cross Section and Diff-in-Diff Setup
Suppose we can observe test scores for a sample of 4th graders in two large school districts (A and B) at the end of the 2011 and 2012 school years.
District A was given a grant to reduce class size in 4th-grade in 2012.
In 2011, we would have a random sample of 4th graders from both districts. We would sample these two districts in 2012, and get a different set of 4th graders.
We effectively have random samples from four different populations (District A and B in 2011 and 2012).
By having a "control district" (i.e., district B), which was not given the grant in either year, we can control for aggregate changes over time that affect test scores of students in both districts.
We want to estimate the effect of the grant intervention.
The differences-in-differences strategy aims to compare the change in test scores in district A to the change in test scores in district B.
This DID estimate is:
$(\bar{Y}_{\text{district A},t_2} - \bar{Y}_{\text{district A},t_1}) - (\bar{Y}_{\text{district B},t_2} - \bar{Y}_{\text{district B},t_1})$
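As a minimal sketch of this calculation, the Python snippet below computes the DID estimate from the four group means. The mean test scores are hypothetical numbers chosen for illustration, not data from the lecture.

```python
# Hypothetical mean test scores (illustrative numbers only).
# District A received the class-size grant in 2012; district B did not.
mean_score = {
    ("A", 2011): 70.0,
    ("A", 2012): 76.0,
    ("B", 2011): 65.0,
    ("B", 2012): 68.0,
}

def did_estimate(means, treated, control, before, after):
    """Change in the treated group's mean minus change in the control group's mean."""
    change_treated = means[(treated, after)] - means[(treated, before)]
    change_control = means[(control, after)] - means[(control, before)]
    return change_treated - change_control

did = did_estimate(mean_score, treated="A", control="B", before=2011, after=2012)
print(did)  # 3.0
```

District A improves by 6 points and district B by 3, so the grant is credited with the remaining 3 points.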
1.3. DD with Two Groups and Two Time Periods (More formally)
This setup is often used to study the effects of policy interventions.
Outcomes are observed for two groups over two time periods. One of the groups is exposed to a "treatment" (or intervention) in the second period but not in the first period. The second group is not exposed to the treatment during either period.
Let A be the control group and B the treatment group. (Note that these labels are the reverse of the school-district example, where district A received the grant.)
Let d2 be a time period dummy equal to one in the second period.
Let dB be a dummy equal to one if the unit is in the treatment group.
Then $(\bar{Y}_{B,2} - \bar{Y}_{B,1}) - (\bar{Y}_{A,2} - \bar{Y}_{A,1})$ can be estimated by δ1 in the following model:
y = β0 + β1dB + δ0d2 + δ1d2 · dB + u (2)
where y is the outcome of interest.
Short introduction to the DD method
y = β0 + β1dB + δ0d2 + δ1d2 · dB + u (3)
y = β0 + β1dB + δ0d2 + δ1d2 · dB + u
                 Control (A)    Treatment (B)        B − A
before (1)       β0             β0 + β1              β1
after (2)        β0 + δ0        β0 + δ0 + β1 + δ1    β1 + δ1
after − before   δ0             δ0 + δ1              δ1
dB captures possible differences between the treatment and control groups prior to the policy change. Its coefficient, β1, is the difference between treatment and control before the intervention.
d2 captures aggregate factors that would cause changes in y over time even in the absence of an intervention. Notice its coefficient, δ0, is the change in the mean of the control group across the two periods.
The coefficient of interest is δ1, the difference in the average changes over time for the treatment and control groups. Conveniently, δ1 is the coefficient on the interaction d2 · dB, which is one if and only if the unit is in the treatment group in period 2.
The difference-in-differences (DD) estimate can be obtained by applying OLS to equation (2).
Writing $\hat{\delta}_1 = (\bar{Y}_{B,2} - \bar{Y}_{B,1}) - (\bar{Y}_{A,2} - \bar{Y}_{A,1})$ shows that we are comparing the change in means over time for the treatment group to the change in means for the control group.
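The equivalence between the OLS coefficient on d2 · dB and the difference in mean changes can be checked numerically. The Python sketch below uses a small hand-written OLS routine (normal equations solved by Gauss-Jordan elimination) and a hypothetical data set of eight observations; neither the helper `ols` nor the numbers come from the lecture.

```python
def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting, then Gauss-Jordan elimination on this column.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical pooled cross section: (dB, d2, y) for each observation.
data = [
    (0, 0, 10.0), (0, 0, 12.0),   # control, before
    (0, 1, 13.0), (0, 1, 15.0),   # control, after
    (1, 0, 20.0), (1, 0, 22.0),   # treatment, before
    (1, 1, 28.0), (1, 1, 30.0),   # treatment, after
]

# Regressors of equation (2): constant, dB, d2, and the interaction d2*dB.
X = [[1.0, dB, d2, d2 * dB] for dB, d2, _ in data]
y = [obs[2] for obs in data]
b0, b1, d0, d1 = ols(X, y)

def cell_mean(group, period):
    """Mean of y for one group (dB) in one period (d2)."""
    vals = [yv for dB, d2, yv in data if dB == group and d2 == period]
    return sum(vals) / len(vals)

dd_from_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(d1, dd_from_means)  # the two agree up to floating-point error
```

Because the model is saturated in the two dummies, the fitted values equal the four cell means, which is why the two numbers coincide.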
Advantages of DD method:
It is easy to calculate standard errors.
We can control for other variables, which may reduce the residual variance (and lead to smaller standard errors).
It is easy to include multiple periods.
Challenge with the DD method:
Common trend assumption: The key assumption for any DD strategy is that the outcome in the treatment and control groups would follow the same time trend in the absence of the treatment, for reasons that have nothing to do with the intervention.
This is an important assumption, since diff-in-diff does not identify the treatment effect if the treatment and comparison groups were on different trajectories prior to the program/policy.
It is difficult to verify, but one often uses pre-treatment data to show that the trends are the same.
Even if pre-trends are the same, one still has to worry about other policies changing at the same time.
Common Trends Assumption
Sometimes, the common trends assumption is clearly OK
Common Trends Assumption
Sometimes, the common trends assumption is fairly clearly violated
In the class-size example, what if performance in district B was at an initial lower level (on average) than district A but was increasing faster? dB accounts for initial level differences, but the effect of the policy is measured by the growth in the averages.
The only way to solve this problem is to get another control group or more years of data (preferably at least two before the intervention).
How do we account for trend differences across groups A and B? We need multiple control groups.
Regression DD Including Leads and Lags
Including leads and lags in the DD model is an easy way to 1) analyse pre-trends, and 2) check whether the treatment effect changes over time after treatment.
The estimated regression would be:
Treatment occurs in year 0.
The regression includes q anticipatory effects (leads).
The regression includes m post-treatment effects (lags).
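One way to write such a specification, in the notation of Angrist and Pischke (Mostly Harmless Econometrics, Chapter 5), is sketched below. The symbols $\gamma_s$, $\lambda_t$, $D_{st}$ and $X_{ist}$ are taken from that textbook and are assumptions here, not necessarily the notation used in the lecture.

```latex
Y_{ist} = \gamma_s + \lambda_t
        + \sum_{\tau=0}^{m} \delta_{-\tau} D_{s,t-\tau}
        + \sum_{\tau=1}^{q} \delta_{+\tau} D_{s,t+\tau}
        + X_{ist}'\beta + \varepsilon_{ist}
```

Here $D_{st}$ is the treatment indicator for group s in period t. The first sum collects the m post-treatment effects (lags) and the second sum the q anticipatory effects (leads).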
Card and Krueger (1994) analyse the effect of a minimum wage increase in New Jersey using a differences-in-differences methodology.
Research question
Suppose we are interested in the effect of the minimum wage on employment, a classic question in labor economics.
On April 1, 1992, NJ increased the state minimum wage from $4.25 to $5.05. Pennsylvania's minimum wage stayed at $4.25.
They surveyed about 400 fast food stores in NJ and in PA, both before and after the minimum wage increase in NJ.
The DiD idea compares:
$Y_{1ist}$: employment at restaurant i, state s, with a high min wage, before and after, to
$Y_{0ist}$: employment at restaurant i, state s, with a low min wage, before and after.
The differences-in-differences strategy amounts to comparing the change in employment in NJ to the change in employment in PA.
The typical DiD regression model that we estimate is:
Treatment = a dummy equal to one if the observation is in the treatment group
Post = post-treatment dummy
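Written out with these two dummies, a regression of this type mirrors equation (2), with Treatment in place of dB and Post in place of d2. The coefficient names below are an assumption chosen for illustration.

```latex
Y_{ist} = \alpha + \gamma \, Treatment_s + \lambda \, Post_t
        + \delta \, (Treatment_s \times Post_t) + \varepsilon_{ist}
```

The coefficient $\delta$ on the interaction term is the DiD estimate.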
The table reports average full-time equivalent (FTE) employment at restaurants in Pennsylvania and New Jersey before and after a minimum wage increase in New Jersey.
Surprisingly, employment rose in NJ relative to PA after the minimum wage change.
Study that includes Leads and Lags: Autor (2003)
Hint: normalise the adoption year to 0.
The estimates for the years prior to adoption are very close to 0. No evidence for anticipatory effects! (Good news for the common trends assumption.)
The estimates for the years after adoption show that the effect increases during the first years of the treatment and then remains relatively constant.
What if the common trend assumption is violated?
In some cases, it happens.
Then, the standard DD method would lead to biased estimates.
Abadie & Gardeazabal (2003) pioneered a synthetic control method when estimating the effects of the terrorist conflict in the Basque Country using other Spanish regions as a comparison group.
The basic idea behind synthetic controls is that a combination of units often provides a better comparison for the unit exposed to the intervention than any single unit alone.
Synthetic Control Methods
Paper: Abadie & Gardeazabal (2003) - The Economic Costs of Conflict: A Case Study of the Basque Country
They want to evaluate whether terrorism in the Basque Country had a negative effect on growth.
They cannot use a standard DD method because none of the other Spanish regions followed the same time trend as the Basque Country.
They therefore take a weighted average of other Spanish regions as a synthetic control group.
They have J available control regions (the 16 Spanish regions other than the Basque Country).
They want to assign weights $W = (w_1, \dots, w_J)$ to the regions.
The weights are chosen so that the synthetic Basque Country most closely resembles the actual one before terrorism.
The Basque Country is Different from the Rest of Spain
Synthetic Control Methods:
Let $X_1$ be a vector of K pre-terrorism economic growth predictors (i.e., the values in the previous table: investment ratio, population density, …) in the Basque Country.
Let $X_0$ be a matrix which contains the values of the same variables for the J possible control regions.
Let V be a diagonal matrix with non-negative components reflecting the relative importance of the different growth predictors. The vector of weights $W^*$ is then chosen to minimize: $(X_1 - X_0 W)' V (X_1 - X_0 W)$
They choose the matrix V such that the real per capita GDP path for the Basque Country during the 1960s (pre-terrorism) is best reproduced by the resulting synthetic Basque Country.
The optimal weights they get are: Catalonia: 0.8508, Madrid: 0.1492, and all other regions: 0.
The Synthetic Basque Country Looks Similar
Constructing the Counterfactual Using the Weights
$Y_1$ is a vector whose elements are the values of real per capita GDP for T years in the Basque Country.
$Y_0$ is a matrix whose elements are the values of real per capita GDP for T years in the control regions.
They then construct the counterfactual GDP (in the absence of terrorism) as: $Y_1^* = Y_0 W^*$
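The two steps (choose W* to minimize the weighted distance, then form Y0 W*) can be sketched in Python for a toy case with J = 2 control regions and K = 2 predictors. All numbers are hypothetical, V is simply fixed by hand rather than chosen as in the paper, and a grid search stands in for a proper optimizer. The weights are restricted to be non-negative and to sum to one, so that the synthetic control is a weighted average.

```python
# Hypothetical pre-treatment predictors for the treated unit (x1) and two controls (x0).
x1 = [0.30, 250.0]                 # e.g. investment ratio, population density
x0 = [[0.34, 0.20],                # row k holds predictor k for each control region
      [300.0, 100.0]]
v = [1.0, 1e-4]                    # diagonal of V, fixed by hand for this sketch

def loss(w):
    """(X1 - X0 W)' V (X1 - X0 W) for a weight vector w."""
    total = 0.0
    for k in range(len(x1)):
        gap = x1[k] - sum(x0[k][j] * w[j] for j in range(len(w)))
        total += v[k] * gap * gap
    return total

# With two controls, weights (w1, 1 - w1) with w1 in [0, 1] cover all weighted
# averages, so a fine grid over w1 is enough for this illustration.
grid = [i / 1000 for i in range(1001)]
w1_star = min(grid, key=lambda w1: loss([w1, 1.0 - w1]))
w_star = [w1_star, 1.0 - w1_star]

# Counterfactual outcome path Y1* = Y0 W* (rows are years, columns are controls).
y0 = [[100.0, 80.0],
      [104.0, 82.0],
      [108.0, 85.0]]
y1_star = [sum(row[j] * w_star[j] for j in range(len(w_star))) for row in y0]
print(w_star, y1_star)
```

With more than two control regions the grid would be replaced by a constrained quadratic optimizer, but the objective and the construction of the counterfactual stay the same.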
Growth in the Basque Country with and without Terrorism
The Basque Country and the synthetic control behave similarly until 1975. From 1975, the Basque Country per capita GDP takes values lower than those of the synthetic control.
Growth in the Basque Country with and without Terrorism
After the outbreak of terrorism in the late 1960s, per capita GDP in the Basque Country declined about 10 percentage points relative to a synthetic control region without terrorism.
1.4. Multiple Control Groups and Two Time Periods (DDD method)
We can refine the definition of treatment and control groups to account for different trends in the absence of treatment.
If we have another group that is not subject to the treatment, we can use it to account for different trends.
Example:
In the class size example, suppose that the grant only applied to 4th grade, not 3rd grade. Let "F" denote that a student is in 4th grade and "T" denote that a student is in 3rd grade.
Now we have a third dummy variable, dF, to denote 4th grade. Assume that two years are still available, but now use third graders as an additional control.
We could do two diff-in-diffs: we can compute the DD estimates for districts A and B, and then difference those:
$\hat{\delta}_{DDD} = \hat{\delta}_3 = [(\bar{Y}_{B,F,2} - \bar{Y}_{B,F,1}) - (\bar{Y}_{B,T,2} - \bar{Y}_{B,T,1})] - [(\bar{Y}_{A,F,2} - \bar{Y}_{A,F,1}) - (\bar{Y}_{A,T,2} - \bar{Y}_{A,T,1})]$
$\hat{\delta}_{DDD}$ is the difference-in-differences-in-differences (DDD) estimate.
This estimate accounts for differing trends as long as, in the absence of treatment, the gap between the trends of fourth and third graders would have been the same in both districts.
The estimating equation, estimated by OLS, is:
y = β0 + β1dB + β2dF + β3dB · dF + δ0d2 + δ1d2 · dB + δ2d2 · dF + δ3d2 · dB · dF + u
We can add student-specific regressors as controls. Even if the regressors are independent of the intervention, adding them can improve efficiency if they help to predict y.
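The DDD estimate is just a difference of two DD estimates, which the following Python sketch computes from eight cell means. The means are hypothetical; district B is taken as the treated district, as in the formula for the DDD estimate.

```python
# Hypothetical mean test scores indexed by (district, grade, period).
# Grade "F" (4th) is affected by the policy in district B; grade "T" (3rd) is not.
means = {
    ("B", "F", 1): 60.0, ("B", "F", 2): 70.0,
    ("B", "T", 1): 58.0, ("B", "T", 2): 62.0,
    ("A", "F", 1): 65.0, ("A", "F", 2): 69.0,
    ("A", "T", 1): 63.0, ("A", "T", 2): 66.0,
}

def dd_within_district(district):
    """Change for 4th graders minus change for 3rd graders in one district."""
    change_f = means[(district, "F", 2)] - means[(district, "F", 1)]
    change_t = means[(district, "T", 2)] - means[(district, "T", 1)]
    return change_f - change_t

ddd = dd_within_district("B") - dd_within_district("A")
print(ddd)  # 5.0
```

In district B the fourth-grade gain exceeds the third-grade gain by 6 points, in district A by 1 point, so the DDD estimate is 5.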
Reading for lecture 6
In the main textbook (Wooldridge 2013): Chapter 14
Mostly Harmless Econometrics (Angrist and Pischke): Chapter 5
Additional Reading:
Microeconometrics: Methods and Applications (Cameron and Trivedi): Chapter 21
Panel Data
A panel data set differs meaningfully from a PCS.
When panel data are collected, information from the same individual units is recorded at each point in time.
Panel data are quite easily assembled for many economic units: countries, states, cities, counties, school districts, firms, or others.
Most panel data in microeconomic applications cover many firms/individuals/… (> 1,000) and relatively few time periods (< 50).
However, when we work with panel data, we cannot assume that observations are independently distributed over time. For each cross-section unit, errors are correlated across time and this leads to serial correlation of errors.
With individual-level data, the unobservable factors that affect an individual's wage will be present at each point in time. We call these factors unobserved heterogeneity; because they are common to all periods, they induce correlation across time.
The advantages of panel data
With multiple years of data we can control for unobserved characteristics that do not change (or change slowly) over time. Very useful for policy analysis.
Relationships can be estimated more efficiently.
The increase in sample size increases the precision of estimators.
Can solve some omitted variables bias problems.
E.g. Control for unobserved family background variables (e.g. mother's intelligence) by comparing siblings within families.
Allows analysis of dynamic phenomena.
E.g. How many of those currently unemployed were also unemployed last month?
E.g. How does Australia's economic performance last year affect its performance this year?
Formats for Storing Panel Data
The best way to store panel data is to stack the time periods for each i on top of each other. In particular, the time periods for each unit should be adjacent, and stored in chronological order (from earliest period to the most recent). This is sometimes called the "long" storage format. It is by far the most common.
Formats for Storing Panel Data
Using xtset at the outset heads off most problems, but it is nice to have the data appropriately sorted. Old versions of Stata had commands that would jumble the data. To get it sorted again, use sort distid year
You do not want to sort by year and then district ID. (That would make the data set look more like independently pooled cross sections, and mask the panel structure.)
Sometimes panel data sets (especially with two years) will be stored as having only n records (rather than 2n, as above), with the variables from the different years given different suffixes (to distinguish the years). Generally, this makes the data harder to work with, especially if there are more than two years. It is sometimes called the "wide" storage method; the above is called the "long" storage method.
Stata has a command, reshape, that allows one to go from wide to long, and vice versa.
See example. VOTE2.DTA is a two-year panel data set.
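To make the two storage formats concrete, the Python sketch below converts a small wide data set to long format and sorts it by unit and then year. The variable names (`distid`, `score2011`, `score2012`) and the values are hypothetical. In Stata the corresponding call would be along the lines of reshape long score, i(distid) j(year).

```python
# Hypothetical two-year panel stored "wide": one record per district.
wide = [
    {"distid": 2, "score2011": 71.0, "score2012": 74.0},
    {"distid": 1, "score2011": 65.0, "score2012": 69.0},
]

def wide_to_long(records, id_var, stub, years):
    """Stack the year-specific columns so that each row is one (unit, year) pair."""
    long_rows = []
    for rec in records:
        for year in years:
            long_rows.append({id_var: rec[id_var], "year": year, stub: rec[f"{stub}{year}"]})
    # Sort by unit first, then year, so each unit's periods are adjacent and chronological.
    return sorted(long_rows, key=lambda r: (r[id_var], r["year"]))

long_data = wide_to_long(wide, id_var="distid", stub="score", years=[2011, 2012])
for row in long_data:
    print(row)
```

The n = 2 wide records become 2n = 4 long records, with the two years of each district stored next to each other.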
Long
Wide
Two-period Panel Data Analysis-Introduction
Assume that for a cross-section of individuals, schools, firms, cities or whatever, we have two years of data: t = 1, 2.
For example, we have data on unemployment and crime for 46 cities for 1982 and 1987. Assume that t = 1 (year = 1982) and t = 2 (year = 1987). Using the 1987 cross-section we obtain:
If we interpret the estimated equation causally, it implies that an increase in the unemployment rate lowers the crime rate.
This is certainly not what we expect.
The coefficient on unem is not statistically significant at standard significance levels: at best, we have found no link between crime and unemployment rates.
For sure, we have our standard omitted variable problem.
Two-period Panel Data Analysis-Introduction
Solution 1: Control for more factors (such as age distribution, gender distribution, education levels, law enforcement efforts, etc.). But many factors might be hard to control for.
Solution 2: Controlling for crmrte from a previous year (in this case, 198–) might help to control for the fact that different cities have historically different crime rates (autoregressive models; we will not discuss them now).
Solution 3: Use panel data. That is, view the unobserved factors affecting the dependent variable as consisting of two types: those that are constant and those that vary over time.
Analysis with Two-Periods of Panel Data
We have time periods t = 1 and t = 2 for each unit i. The units can be aggregated (schools or cities) or disaggregated (students or teachers).
First consider the case with a single explanatory variable: $y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + c_i + u_{it}, \quad t = 1, 2.$
We observe $x_{it}$ and $y_{it}$ in both periods.
The variable $d2_t$ is a constructed time dummy for the second time period: $d2_t = 1$ if t = 2 and $d2_t = 0$ if t = 1.
In t = 1 the intercept is $\beta_0$ and in t = 2 the intercept is $\beta_0 + \delta_0$. (Allowing the intercept to change over time is important in most applications. In the previous example, the crime rates in the U.S. cities might change considerably over a five-year period.)
Time-varying component of the error: $u_{it}$ is the unobserved idiosyncratic error or time-varying error.
Time-invariant component of the error: The variable $c_i$ captures all unobserved, time-constant factors that affect $y_{it}$. It is called the unobserved effect, unobserved heterogeneity, or fixed effect (because it is fixed over time).
We are interested in estimating $\beta_1$, the partial effect of x on y. Note that the model assumes this effect is constant over time.
Analysis with Two-Periods of Panel Data
First consider the case with a single explanatory variable: $y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + c_i + u_{it}, \quad t = 1, 2.$
Suppose we had data for 200 schools and 2 time periods.
Q1: How many variables are in $c_i$?
Answer 1: 199 (one dummy per school, omitting a baseline school).
Q2: How would you interpret $\delta_0$?
Answer 2: It is the difference in the mean of y in period 2 compared to period 1, holding $x_{it}$ and $c_i$ fixed.
Q3: How many time dummies do you have when there are 2 time periods?
Answer 3: 1.
Q4: How would you interpret $c_i$?
Answer 4: It is the difference in the mean of y between each school and the baseline school.
Analysis with Two-Periods of Panel Data
How should we estimate the parameter of interest β1 given that we have only two years of panel data?
One possibility is to just pool the two years and use OLS. To do this, we need to consider a composite error term $v_{it} = c_i + u_{it}$, t = 1, 2, and thus write the equation as:
$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + v_{it}, \quad t = 1, 2.$
If we apply OLS, we will simply regress y on d2 and x. Would this estimator have the desired properties?
Analysis with Two-Periods of Panel Data
Would this estimator have the desired properties? Two problems!
One concern is that even if we assume random sampling across i, we cannot reasonably assume that the observations for i across t = 1, 2 are independent. The errors will be correlated because of $c_i$. Thus:
$E(v_{i1} v_{i2}) \neq 0$
We call this serial correlation or cluster correlation (each unit i is a cluster of two time periods).
The usual OLS standard errors will be invalid if there is serial correlation.
Just using heteroskedasticity-robust standard errors does not solve the problem.
Solution: obtaining "cluster-robust" standard errors and test statistics is very easy these days.
Easy to implement in Stata.
A more serious concern is the exogeneity assumption.
In order for pooled OLS to produce an unbiased and consistent estimator of $\beta_1$, we would have to assume that the composite error, $v_{it}$, is uncorrelated with $x_{it}$.
Because $v_{it} = c_i + u_{it}$, we need
$Cov(x_{it}, c_i) = 0$; $Cov(x_{it}, u_{it}) = 0$
Suppose we are willing to assume that the second holds.
What about the first one?
If $Cov(x_{it}, c_i) \neq 0$, then the resulting bias in pooled OLS is called heterogeneity bias.
Solution?
Ways of coping with $Cov(x_{it}, c_i) \neq 0$ (heterogeneity bias):
1) First Difference Estimation (which is POLS, but on the differences)
2) Fixed Effects Estimation (which is POLS, but on the time-demeaned variables)
3) Random Effects Estimation (which is POLS, but on the partially time-demeaned variables; it is a GLS procedure)
The first two methods eliminate $c_i$.
First-Difference estimator
When $Cov(x_{it}, c_i) \neq 0$, it is often said that (pooled) OLS suffers from heterogeneity bias.
If the explanatory variable changes over time, at least for some units in the population, heterogeneity bias can be solved by differencing away $c_i$.
To remove the source of bias in POLS, $c_i$, write the time periods in reverse order for any unit i (remember: $d2_t = 0$ in period 1, $d2_t = 1$ in period 2).
$y_{i2} = \beta_0 + \delta_0 + \beta_1 x_{i2} + c_i + u_{i2}$
$y_{i1} = \beta_0 + \beta_1 x_{i1} + c_i + u_{i1}$
Subtract time period 1 from time period 2 to get
$y_{i2} - y_{i1} = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})$
If we define $\Delta y_i = y_{i2} - y_{i1}$, where $\Delta$ denotes the change over time, then we can apply OLS to the first-differenced model:
$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i$ (4)
Important: $\beta_1$ is the original coefficient we are interested in, and it can now be estimated by OLS.
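A small numerical sketch in Python shows what differencing buys us. The data are hypothetical and generated without idiosyncratic noise from the two-period model, with $c_i$ deliberately correlated with $x_{it}$; the helper `slope` is just the textbook simple-regression slope.

```python
# Hypothetical two-period panel with no idiosyncratic noise:
# y_it = 1 + 0.5*d2_t + 2*x_it + c_i, where c_i is correlated with x_it.
units = [
    # (c_i, x_i1, x_i2)
    (0.0, 1.0, 2.0),
    (2.0, 2.0, 2.5),
    (4.0, 3.0, 5.0),
    (6.0, 4.0, 4.5),
]
beta0, delta0, beta1 = 1.0, 0.5, 2.0

def slope(x, y):
    """OLS slope from a regression of y on x and a constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

y1 = [beta0 + beta1 * x1 + c for c, x1, _ in units]
y2 = [beta0 + delta0 + beta1 * x2 + c for c, _, x2 in units]

# First differences remove c_i: dy = delta0 + beta1 * dx.
dx = [x2 - x1 for _, x1, x2 in units]
dy = [b - a for a, b in zip(y1, y2)]
beta1_fd = slope(dx, dy)

# A levels regression on the period-1 cross section ignores c_i and is biased here.
beta1_levels = slope([x1 for _, x1, _ in units], y1)
print(beta1_fd, beta1_levels)
```

The first-difference slope recovers the true value of 2, while the levels regression attributes the effect of $c_i$ to x and returns roughly 4.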
Differencing away the unobserved effect, $c_i$, is simple and can be very powerful for isolating causal effects.
The OLS estimator applied to (4) is often called the First-Difference (FD) estimator (with more than two time periods, other orders of differencing are possible; hence the qualifier "first").
However, the following assumption should hold for the FD estimator to be consistent:
$Cov(x_{is}, u_{it}) = 0$, for all s, t = 1, ..., T.
We call this the strict exogeneity assumption.
$c_i$ is removed, so it cannot be a source of serial correlation. But the $u_{it}$ might still have serial correlation (and heteroskedasticity). So in Stata: use cluster-robust standard errors.
The same differencing strategy works if $x_{it}$ is a binary program indicator. The differenced equation is the same.
Fixed Effects Estimation
We can also use the "fixed effects" or "within" transformation to remove $c_i$ using the within-i time averages.
In the simple model with only $x_{it}$:
$y_{it} = \beta_0 + \beta_1 x_{it} + c_i + u_{it}$ (5)
Average this equation across t to get
$\bar{y}_i = \beta_0 + \beta_1 \bar{x}_i + c_i + \bar{u}_i$ (6)
where $\bar{y}_i = T^{-1} \sum_{t=1}^{T} y_{it}$ is a "time average" for unit i. Similarly for $\bar{x}_i$ and $\bar{u}_i$.
Subtract (6) from (5):
$y_{it} - \bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + (u_{it} - \bar{u}_i)$ (7)
We apply OLS to equation (7) and we call it the "Fixed Effects" (FE) estimator or the "within" estimator.
We view this "time-demeaned" (or within) equation as an estimating equation. As with FD, we interpret $\beta_1$ as in the levels equation (5).
The important thing is that the unobserved effect, $c_i$, has disappeared.
This is a very common technique in applied work!
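The within transformation in equations (5) to (7) can be sketched in a few lines of Python. The panel below is hypothetical (three units, T = 3) and generated without noise, so the time-demeaned regression should return the true slope.

```python
# Hypothetical balanced panel (T = 3) generated without noise from
# y_it = 1 + 2*x_it + c_i; the data are illustrative only.
panel = {
    # unit: (c_i, [x_i1, x_i2, x_i3])
    "a": (0.0, [1.0, 2.0, 4.0]),
    "b": (3.0, [2.0, 3.0, 3.5]),
    "c": (6.0, [5.0, 5.5, 7.0]),
}

x_dm, y_dm = [], []
for c_i, xs in panel.values():
    ys = [1.0 + 2.0 * x + c_i for x in xs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    # Time-demean each variable within the unit; c_i (and the intercept) cancel out.
    x_dm += [x - x_bar for x in xs]
    y_dm += [y - y_bar for y in ys]

# OLS on the demeaned data (no intercept needed: both series have mean zero).
beta1_fe = sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)
print(beta1_fe)  # equals the true slope of 2 up to floating-point error
```

Note that only the within-unit variation in x is used: a unit whose x never changes contributes nothing to the estimate.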
As with FD, a sufficient condition for the FE estimator to be consistent is strict exogeneity:
$Cov(x_{is}, u_{it}) = 0$, for all s, t = 1, ..., T.
The idiosyncratic error should be uncorrelated with each explanatory variable across all time periods.
FE has an advantage over FD when strict exogeneity fails: under reasonable assumptions and with large T, FE tends to have less bias than FD.
$c_i$ is removed, so it cannot be a source of serial correlation. But the $u_{it}$ might still have serial correlation (and heteroskedasticity). So in Stata: use cluster-robust standard errors.
If the FD and FE estimates are very different, it is a sign that strict exogeneity fails.
Do not worry too much about R² with FE. The "within" R² (from the time-demeaned equation) is probably the most informative.
Drawback of FD and FE
Drawback with FD and FE methods: Any explanatory variable that is constant over time ($X_i$, such as gender, or whether a city is located near a river) gets swept away by the fixed effects or first differences transformation.
The impact of those time-invariant variables cannot be estimated with these transformations.
Random Effects Estimation
Suppose we start with the same equation as before, written in shorthand for a unit i:
$y_{it} = x_{it}\beta + \delta_t + c_i + u_{it}$, where $\delta_t$ represents a time effect (time dummy).
This equation becomes an RE model when we assume that the unobserved effect $c_i$ is uncorrelated with each explanatory variable $x_{it}$ in all time periods: $Cov(x_{it}, c_i) = 0$, for all t = 1, ..., T.
Unlike FD and FE, Random Effects (RE) estimation leaves $c_i$ in the error term.
RE accounts for the serial correlation over time in $v_{it} = c_i + u_{it}$ via a generalized least squares (GLS) procedure.
RE estimation allows time-constant explanatory variables in $x_{it}$ (such as gender), and this is often important in RE applications.
For consistency, RE maintains that the composite error term, $v_{it} = c_i + u_{it}$, is uncorrelated with the explanatory variables in all time periods.
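"Partially time-demeaned" means that RE subtracts only a fraction θ of each unit's time average. In Wooldridge's textbook treatment, θ = 1 − sqrt(σu² / (σu² + T σc²)); that formula is not stated in these slides and is brought in here as an assumption from the textbook. The Python sketch below computes θ for hypothetical variance values and applies the quasi-demeaning to one unit's data.

```python
import math

def re_theta(sigma2_u, sigma2_c, T):
    """Quasi-demeaning weight used by random effects GLS (textbook formula)."""
    return 1.0 - math.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))

def quasi_demean(values, theta):
    """Subtract theta times the unit's time average from each observation."""
    avg = sum(values) / len(values)
    return [v - theta * avg for v in values]

# theta = 0 reproduces pooled OLS (no demeaning); theta = 1 would reproduce fixed effects.
theta_no_heterogeneity = re_theta(sigma2_u=1.0, sigma2_c=0.0, T=5)
theta_example = re_theta(sigma2_u=1.0, sigma2_c=3.0, T=5)
print(theta_no_heterogeneity, theta_example)         # 0.0 0.75
print(quasi_demean([2.0, 4.0, 6.0], theta_example))  # [-1.0, 1.0, 3.0]
```

The larger the variance of $c_i$ relative to that of $u_{it}$ (or the larger T), the closer θ is to one and the closer RE is to FE.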
For policy analysis, RE is typically less convincing than FD or FE.
With time-constant controls, RE may be convincing.
When it is consistent, RE is typically more efficient (sometimes much more efficient) than FD or FE.
The "standard" RE assumptions also include:
$Cov(c_i, u_{it}) = 0$ (not especially controversial)
$Var(u_{it}) = \sigma_u^2$ for all t (constant variance over time)
$Cov(u_{is}, u_{it}) = 0$, $t \neq s$ (no serial correlation)
The second and third assumptions can fail, and empirically they often do.
Choosing Among POLS, FD, FE, and RE
We have covered four panel data estimators:
POLS, which is on the levels.
FD, which is POLS on the differences (changes).
FE, which is POLS on the time-demeaned variables.
RE, which is POLS on the partially time-demeaned variables.
POLS on the levels is usually deficient, unless we include things like lagged y (not allowed in the other methods). With good controls and lags of y, we might be able to make a convincing analysis. But many economists prefer unobserved effects models.
FD versus FE: If FD and FE are different in important ways, the strict exogeneity assumption may be violated. In that case FE is usually preferred, since it tends to have less bias than FD when T is large.
RE versus FE
Time-constant variables drop out of FE estimation. RE does a better job in this respect.
On the time-varying covariates, what if FE and RE are very different?
When FE and RE are similar, it does not really matter which we choose.
When they differ a lot and in a statistically significant way, RE is likely to be inappropriate. A harder case is when they differ by a lot in practical terms but are not statistically different.
There is a way to directly compare the RE and FE estimators on explanatory variables that change across i and t. It is called the Hausman test.
Hausman Test
We want to use a formal statistical test in selecting between RE and FE estimators.
If $E(x_{it} c_i) = 0$ ($H_0$), then $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ should be similar because both are consistent.
If $E(x_{it} c_i) \neq 0$ ($H_1$), then $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ should be different because $\hat{\beta}_{FE}$ is consistent and $\hat{\beta}_{RE}$ is inconsistent.
Therefore, we can test whether $E(x_{it} c_i) = 0$, and hence choose between FE and RE estimators, by testing whether $\hat{\beta}_{FE} = \hat{\beta}_{RE}$.
The single-parameter version of the test consists of examining the t-statistic:
$t = \dfrac{\hat{\beta}_{FE} - \hat{\beta}_{RE}}{\sqrt{var(\hat{\beta}_{FE}) - var(\hat{\beta}_{RE})}}$
When we reject the null hypothesis, we conclude that $E(x_{it} c_i) \neq 0$, and we prefer the FE estimator.
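The single-parameter statistic is simple arithmetic once the two estimates and their variances are available. The Python sketch below uses hypothetical numbers; the function name `hausman_t` is chosen here for illustration.

```python
import math

def hausman_t(b_fe, b_re, var_fe, var_re):
    """Single-coefficient Hausman statistic comparing FE and RE estimates."""
    var_diff = var_fe - var_re
    if var_diff <= 0:
        # Can happen in finite samples; the statistic is then undefined.
        raise ValueError("var(FE) - var(RE) must be positive")
    return (b_fe - b_re) / math.sqrt(var_diff)

# Hypothetical estimates and variances, for illustration only.
t_stat = hausman_t(b_fe=0.50, b_re=0.38, var_fe=0.0025, var_re=0.0009)
print(round(t_stat, 2))  # 3.0
```

A value of 3 is well beyond the usual critical values of a standard normal, so in this made-up example we would reject the null and prefer FE.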
Hausman Test
We want to use a formal statistical test in selecting between RE and FE estimators.
The Hausman test is performed conditional on proper specification of the underlying model. If we have omitted an important explanatory variable from both forms of the model, then we are comparing two inconsistent estimators of the population model.
The null hypothesis is that $E(x_{it} c_i) = 0$.
When we reject the null hypothesis, we prefer the FE estimator.
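* Fixed effects estimates, stored under the name fe
xtreg depvar indepvars1, fe
estimates store fe
* Random effects estimates, stored under the name re
xtreg depvar indepvars2, re
estimates store re
* Hausman test of FE against RE; sigmamore bases both covariance matrices on the disturbance variance from the efficient (RE) estimator
hausman fe re, sigmamore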
Supplementary Material
In applications where $x_{it}$ is a dummy variable with no assignment in the first period, the FD estimator has a simple interpretation. Since $x_{i1} = 0$, we have $\Delta x_i = x_{i2}$, so it is the same as applying OLS to
$\Delta y_i = \delta_0 + \beta_1 x_{i2} + \Delta u_i$
where $x_{i2}$ is the second-period program participation (zero or one).
Regression on a single dummy variable is easy to characterize: the estimate of $\beta_1$ is just the difference in the mean change between the "treated" group and the control group:
$\hat{\beta}_{1,FD} = \overline{\Delta y}_{treat} - \overline{\Delta y}_{control}$
This has also been called a "difference-in-differences" estimator. Here, unlike in the case of pooled cross sections, the differences are within the same unit.
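As a quick check of this characterization, the Python sketch below computes the estimate both ways for a hypothetical sample of five units: once as a difference in mean changes and once as the OLS slope of the change in y on the participation dummy.

```python
# Hypothetical changes in the outcome (y_i2 - y_i1) and second-period program status.
changes = [
    # (x_i2, delta_y_i)
    (1, 4.0), (1, 6.0), (1, 5.0),
    (0, 1.0), (0, 3.0),
]

def mean(values):
    return sum(values) / len(values)

treated = [dy for x2, dy in changes if x2 == 1]
control = [dy for x2, dy in changes if x2 == 0]
beta1_from_means = mean(treated) - mean(control)

# OLS slope of delta_y on the dummy x_i2 (with a constant) gives the same number.
xs = [x2 for x2, _ in changes]
dys = [dy for _, dy in changes]
x_bar, dy_bar = mean(xs), mean(dys)
beta1_ols = (sum((x - x_bar) * (d - dy_bar) for x, d in zip(xs, dys))
             / sum((x - x_bar) ** 2 for x in xs))
print(beta1_from_means, beta1_ols)  # both equal 3 up to floating-point error
```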