An Introduction to Causality
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
Week Two
January 18, 2021
Last Week’s Lecture: Two Parts
(1) Introduction to the conditional expectation function (CEF)
(2) Ordinary Least Squares and the CEF
Today’s Lecture: Four Parts
(1) Analyzing an experiment in R
Comparing means (t-test) and regression
(2) Causality and the potential outcomes framework
What do we mean when we say X causes Y?
(3) Linear regression, the CEF, and causality
How do we think about causality in a regression framework?
(4) Instrumental Variables (if time)
The intuition behind instrumental variables
Part 1: Analyzing an Experiment
Last year I performed an experiment for my MODES scores
Can I bribe students to get great teaching evaluations?
Two sections: morning (9:00-12:00) vs. afternoon (1:00-4:00)
Evaluation day: gave candy only to the morning students
Compared evaluations across the two sections (scored 1-5)
Part 1: Analyzing an Experiment
Let’s define a few variables:
yi: Teaching evaluation for student i, from 1 to 5
Di: Treatment status (candy vs. no candy)
Di = 1 if student received candy, 0 otherwise
How do we see if the bribe was effective?
Are evaluations higher, on average, for students who got candy?
E[yi|Di = 1] > E[yi|Di = 0]
Equivalently:
E[yi|Di = 1] − E[yi|Di = 0] > 0
Plotting the Difference in (Conditional) Means
[Figure: bar chart of average MODES score (scale 0-5) by treatment status (0 = no candy, 1 = candy)]
Estimating the Difference in (Conditional) Means
Two exercises in R
What is the difference in means between the two groups?
What is the magnitude of the t-statistic from a t-test for a difference in these means (two sample, equal variance)?
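A minimal sketch in R of both exercises, on simulated data (the variable names, sample size, and data-generating numbers are hypothetical, not the actual course data):

```r
set.seed(1)
n     <- 160
candy <- rep(c(0, 1), each = n / 2)              # treatment indicator D_i
score <- 3.3 + 1.0 * candy + rnorm(n, sd = 0.6)  # evaluations y_i

# Difference in conditional means: E[y|D=1] - E[y|D=0]
mean(score[candy == 1]) - mean(score[candy == 0])

# Two-sample t-test assuming equal variances
t.test(score ~ candy, var.equal = TRUE)
```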
Regression Provides Simple way to Analyze an Experiment
yi = β0 + β1 Di + vi
β1 gives the difference in means
The t-statistic is equivalent to that of the (two sample, equal variance) t-test
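Continuing the sketch above, the regression route gives the same answer:

```r
# The coefficient on candy equals the difference in conditional means,
# and its t-statistic matches the equal-variance t-test (up to sign)
summary(lm(score ~ candy))
```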
Part 2: Causality and the potential outcomes framework
Differences in conditional means often represent correlations, not causal effects
The potential outcomes framework helps us define a notion of causality
And understand the assumptions necessary for conditional expectations to reflect causal effects
Experiments allow the CEF to capture causal effects
What's really key is a certain assumption: conditional independence
Conditional Means and Causality
In economics/finance we often want more than conditional means
We are interested in causal questions:
Does a change in X cause a change in Y?
How do corporate acquisitions affect the value of the acquirer?
How does a firm's capital structure impact investment?
Does corporate governance affect firm performance?
Thinking Formally About Causality
Consider the example we just studied
Compare evaluations (1-5) for two (equal sized) groups: treated with candy vs. not treated

Group       Sample Size   Evaluation   Standard Error
No Candy    80            3.32         0.065
Candy       80            4.33         0.092
The treated group provided significantly higher evaluations. Difference in conditional means:
E[yi|Di = 1] − E[yi|Di = 0] = 1.01
So does candy cause higher scores?
Any Potential Reasons Why This Might not be Causal?
What if students in the morning class would have given me higher ratings even without candy?
Maybe I teach better in the morning? Or morning students are more generous.
We call this a "selection effect"
What if students in the morning respond better to candy? Perhaps they are hungrier
For both of these, we would need to answer the following question:
What scores would morning students have given me without candy?
I'll never know…
The Potential Outcomes Framework
Ideally, how would we find the impact of candy on evaluations (yi)?
Imagine we had access to two parallel universes and could observe:
The exact same student (i)
At the exact same time
In one universe they receive candy; in the other they do not
And suppose we could see the student's evaluations in both worlds
Define the variables we would like to see, for each individual i:
yi1 = evaluation with candy
yi0 = evaluation without candy
The Potential Outcomes Framework
If we could see both yi1 and yi0, the impact would be easy to find.
The causal effect or treatment effect for individual i is defined as:
yi1 − yi0
This would answer our question, but we never see both yi1 and yi0!
Some people call this the "fundamental problem of causal inference"
Intuition: there are two "potential" worlds out there
The treatment variable Di decides which one we see:
yi = yi1 if Di = 1
yi = yi0 if Di = 0
The Potential Outcomes Framework
We can never see the individual treatment effect
yi1 −yi0
We are typically happy with population level alternatives
For example, the average treatment effect:
Average Treatment Effect = E[yi1 −yi0] = E[yi1]−E[yi0]
This is usually what’s meant by the “effect” of x on y
We often aren't even able to see the average treatment effect
We typically only see conditional means
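A hypothetical simulation in R of the framework: because we construct both potential outcomes ourselves, the ATE is computable, but any real dataset reveals only one of yi1, yi0 per person (the 0.5 treatment effect is invented):

```r
set.seed(2)
n  <- 1e5
y0 <- rnorm(n, mean = 3.3, sd = 0.5)  # evaluation without candy
y1 <- y0 + 0.5                        # evaluation with candy: effect = 0.5

mean(y1 - y0)  # ATE: knowable here only because we built both worlds

# In real data we observe just one potential outcome, selected by D
D <- rbinom(n, 1, 0.5)
y <- ifelse(D == 1, y1, y0)
```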
So What Do Differences in Conditional Means Tell You?
In the MODES example, we compared:
E[yi|Di = 1]−E[yi|Di = 0] = 1.01
(both of these conditional means can be estimated from data)
Or, written in terms of potential outcomes:
⇒ E[yi1|Di = 1] − E[yi0|Di = 0] = 1.01
≠ E[yi1] − E[yi0]
Why is this not equal to E[yi1] − E[yi0]?
E[yi1|Di = 1] − E[yi0|Di = 0] = E[yi1|Di = 1] − E[yi0|Di = 1]   (average treatment effect for the treated group)
+ E[yi0|Di = 1] − E[yi0|Di = 0]   (selection effect)
So What Do Differences in Conditional Means Tell You?
E[yi1|Di = 1] − E[yi0|Di = 0] = E[yi1|Di = 1] − E[yi0|Di = 1]   (average treatment effect for the treated group)
+ E[yi0|Di = 1] − E[yi0|Di = 0]   (selection effect)
≠ E[yi1] − E[yi0]   (average treatment effect)
So our estimate could differ from the average treatment effect E[yi1] − E[yi0] for two reasons:
(1) The morning section might have given better reviews anyway:
E[yi0|Di = 1] − E[yi0|Di = 0] > 0   (selection effect)
(2) Candy matters more in the morning:
E[yi1|Di = 1] − E[yi0|Di = 1] ≠ E[yi1] − E[yi0]   (ATE for the treated ≠ ATE)
What are the Benefits of Experiments?
Truly random experiments solve this "identification" problem. Why?
Suppose Di is chosen randomly for each individual
This means that Di is independent of (yi1, yi0) in a statistical sense:
(yi1, yi0) ⊥ Di
Intuition: the potential outcomes yi1 and yi0 are unrelated to treatment
What are the Benefits of Experiments?
(yi1,yi0) ⊥ Di
Sidenote: if two random variables are independent (X ⊥ Z ):
E[X|Z] = E[X]
Hence in an experiment:
E[yi1|Di = 1] = E[yi1] and E[yi0|Di = 0] = E[yi0]
so
E[yi1|Di = 1] − E[yi0|Di = 0] = E[yi1] − E[yi0]
(both terms on the left can be estimated, and the right-hand side is the average treatment effect!)
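Continuing the earlier sketch: with randomly assigned D the difference in conditional means recovers the ATE, while assignment that depends on the potential outcomes does not:

```r
# Random assignment: difference in means is approximately the ATE (0.5)
mean(y[D == 1]) - mean(y[D == 0])

# Selection: candy goes disproportionately to students inclined to rate highly
D_sel <- rbinom(n, 1, plogis(2 * (y0 - mean(y0))))
y_sel <- ifelse(D_sel == 1, y1, y0)
mean(y_sel[D_sel == 1]) - mean(y_sel[D_sel == 0])  # overstates the ATE
```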
What are the Benefits of Experiments?
Why does independence fix the two problems with using E[yi1|Di = 1] − E[yi0|Di = 0]?
(1) The selection effect is now 0:
E[yi0|Di = 1] − E[yi0|Di = 0] = 0   (selection effect)
(2) The average treatment effect for the treated group now accurately measures the average treatment effect in the whole sample:
E[yi1|Di = 1] − E[yi0|Di = 1] = E[yi1] − E[yi0]
(the left-hand side is the average treatment effect for the treated group)
The Conditional Independence Assumption
Of course, an experiment is not strictly necessary, as long as (yi1, yi0) ⊥ Di
This happens, but it is unlikely in most practical applications
Slightly more reasonable is the conditional independence assumption
Let Xi be a set of control variables
Conditional Independence: (yi1, yi0) ⊥ Di | Xi
Independence holds within a group with the same characteristics Xi:
E[yi1|Di = 1, Xi] − E[yi0|Di = 0, Xi] = E[yi1 − yi0|Xi]
A (silly) Example of Conditional Independence
Suppose I randomly treat (Di = 1) 75% of the morning class (with candy)
And randomly treat (Di = 1) 25% of the afternoon class
And suppose I am a much better teacher in the morning
Then
(yi1,yi0) ̸⊥ Di
Because E[yi0|Di = 1] > E[yi0|Di = 0]
A (silly) Example of Conditional Independence
Let xi = 1 for the morning class, xi = 0 for the afternoon
We can estimate the means for all groups:
Afternoon, no candy: E[yi|Di = 0, xi = 0] = 3.28
Afternoon, with candy: E[yi|Di = 1, xi = 0] = 3.78
Morning, no candy: E[yi|Di = 0, xi = 1] = 3.95
Morning, with candy: E[yi|Di = 1, xi = 1] = 4.45
A (silly) Example of Conditional Independence
If we try to calculate the difference in means directly:
E[yi|Di = 1] = (1/4)·E[yi|Di = 1, xi = 0] + (3/4)·E[yi|Di = 1, xi = 1] = 4.2825
E[yi|Di = 0] = (3/4)·E[yi|Di = 0, xi = 0] + (1/4)·E[yi|Di = 0, xi = 1] = 3.4475
Our estimate is contaminated because the morning class is better:
E[yi|Di = 1] − E[yi|Di = 0] = 4.2825 − 3.4475 = 0.835
A (silly) Example of Conditional Independence
E[yi|Di = 0, xi = 0] = 3.28 and E[yi|Di = 1, xi = 0] = 3.78
E[yi|Di = 0, xi = 1] = 3.95 and E[yi|Di = 1, xi = 1] = 4.45
However, within each class treatment is random: (yi1, yi0) ⊥ Di | xi
So we may recover the average treatment effect conditional on xi:
E[yi1 − yi0|xi = 0] = ?
E[yi1 − yi0|xi = 1] = ?
A (silly) Example of Conditional Independence
E[yi|Di = 0, xi = 0] = 3.28 and E[yi|Di = 1, xi = 0] = 3.78
E[yi|Di = 0, xi = 1] = 3.95 and E[yi|Di = 1, xi = 1] = 4.45
However, within each class treatment is random: (yi1, yi0) ⊥ Di | xi
So we may recover the average treatment effect conditional on xi:
For the afternoon:
E[yi1 − yi0|xi = 0] = 3.78 − 3.28 = 0.5
For the morning:
E[yi1 − yi0|xi = 1] = 4.45 − 3.95 = 0.5
In this case:
E[yi1 − yi0] = (1/2)·E[yi1 − yi0|xi = 0] + (1/2)·E[yi1 − yi0|xi = 1] = 0.5
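A sketch in R of this example, with a data-generating process invented to match the slide's numbers: the raw difference in means is contaminated, but averaging the within-class differences with equal class weights recovers 0.5:

```r
set.seed(3)
n       <- 1e5
morning <- rbinom(n, 1, 0.5)                          # x_i: class indicator
D       <- rbinom(n, 1, ifelse(morning == 1, 0.75, 0.25))
y       <- 3.28 + 0.67 * morning + 0.5 * D + rnorm(n, sd = 0.3)

# Contaminated raw comparison: approx. 0.835, not 0.5
mean(y[D == 1]) - mean(y[D == 0])

# Within-class differences, averaged with class weights of 1/2 each
d_pm <- mean(y[D == 1 & morning == 0]) - mean(y[D == 0 & morning == 0])
d_am <- mean(y[D == 1 & morning == 1]) - mean(y[D == 0 & morning == 1])
0.5 * d_pm + 0.5 * d_am                               # approx. 0.5
```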
Part 3: Causality and Regression
When does regression recover a causal effect? We need conditional mean independence
Threats to recovering causal effects:
Omitted variables and measurement error
Controlling for confounding variables
Causality and Regression
When does linear regression capture a causal effect?
Start with a simple case: constant treatment effects
Suppose yi depends only on two random variables: Di ∈ {0, 1} and vi
yi = α + ρ Di + vi
Di is some treatment (say candy)
vi is absolutely everything else that impacts yi
How good you are at R, how much you liked Paolo's course, etc.
Normalize E[vi] = 0
Causality and Regression
yi = α + ρ Di + vi
We can then write the potential outcomes:
yi1 = α + ρ + vi
yi0 = α + vi
Because ρ is constant, the individual and average treatment effects coincide:
yi1 − yi0 = E[yi1 − yi0] = ρ
So ρ is what we want: the effect of treatment
But suppose we don't know ρ, and only see yi and Di
Causality and Regression
yi = α + ρ Di + vi
Suppose we regress yi on Di and recover β1^OLS. When will β1^OLS = ρ?
Because Di is binary, we have:
β1^OLS = E[yi|Di = 1] − E[yi|Di = 0]
= E[α + ρ + vi|Di = 1] − E[α + vi|Di = 0]
= ρ + E[vi|Di = 1] − E[vi|Di = 0]
Causality and Regression
β1^OLS = ρ + E[vi|Di = 1] − E[vi|Di = 0]   (selection effect)
So when does β1^OLS = ρ?
It holds under the independence assumption (yi1, yi0) ⊥ Di
Since yi1 = α + ρ + vi and yi0 = α + vi:
(yi1, yi0) ⊥ Di ⇐⇒ vi ⊥ Di
This independence means:
E[vi|Di = 1] = E[vi|Di = 0] = E[vi]
Causality and Regression
β1^OLS = ρ + E[vi|Di = 1] − E[vi|Di = 0]   (selection effect)
So when does β1^OLS = ρ?
β1^OLS = ρ even under a weaker assumption than independence:
Mean independence: E[vi|Di] = E[vi]
When will β1^OLS = ρ?
Suppose we don't know ρ:
yi = α + ρ Di + vi
Our regression coefficient captures the causal effect (β1^OLS = ρ) if:
E[vi|Di] = E[vi]
The conditional mean is the same for every Di
More intuitively: E[vi|Di] = E[vi] implies Corr(vi, Di) = 0
What if vi and Di are Correlated?
Our regression coefficient captures the causal effect (β1^OLS = ρ) if:
E[vi|Di] = E[vi]
This implies that Corr(vi, Di) = 0
So anytime Di and vi are correlated:
β1^OLS ≠ ρ and β0^OLS ≠ α
That is, anytime Di is correlated with anything else unobserved that impacts yi
Causality and Regression: Continuous xi
Suppose there is a continuous xi with a causal relationship with yi:
A 1-unit increase in xi increases yi by a fixed amount β1
e.g. an hour of studying increases your final grade by β1
It is tempting to write:
yi = β0 + β1 xi
But in practice other things impact yi; again call these vi:
yi = β0 + β1 xi + vi
e.g. intelligence also matters for your final grade
OLS Estimator Fits a Line Through the Data
[Figure: scatter of (X, Y) data points, with the fitted OLS line β0^OLS + β1^OLS·X drawn through them]
Causality and Regression: Continuous xi
yi = β0 + β1 xi + vi
The regression coefficient captures the causal effect (β1^OLS = β1) if:
E[vi|xi] = E[vi]
This fails anytime Corr(xi, vi) ≠ 0
An aside: we have used similar notation for 3 different things:
1. β1: the causal effect on yi of a 1-unit change in xi
2. β1^OLS = Cov(xi, yi)/Var(xi): the population regression coefficient
3. β̂1^OLS: the sample regression coefficient, Cov(xi, yi)/Var(xi) estimated from the sample
The Causal Relationship Between X and Y
[Figure: the causal line β0 + β1·X in the (X, Y) plane]
One Data Point
[Figure: a single point (xi, yi) with yi = β0 + β1·xi + vi; vi is the vertical distance from the point to the causal line]
If E[vi|xi] = E[vi] then β1^OLS = β1
[Figure: the fitted OLS line β0^OLS + β1^OLS·X coincides with the causal line β0 + β1·X]
What if vi and xi are Positively Correlated?
[Figure: points with larger vi tend to have larger xi, tilting the data away from the causal line β0 + β1·X]
If Corr(vi, xi) ≠ 0 then β1^OLS ≠ β1
[Figure: the fitted OLS line β0^OLS + β1^OLS·X is steeper than the causal line β0 + β1·X]
An Example from Economics
Consider the model for wages
Wagesi = β0 + β1 Si + vi
where Si is years of schooling
Are there any reasons that Si might be correlated with vi?
If so, this regression won't uncover β1
Examples from Corporate Finance
Consider the model for leverage
Leveragei = α + β Profitabilityi + vi
Why might we have trouble recovering β?
(1) Unprofitable firms tend to have higher bankruptcy risk and should have lower leverage than more profitable firms (tradeoff theory)
⇒ corr(Profitabilityi, vi) > 0
⇒ E[vi|Profitabilityi] ≠ E[vi]
(2) Unprofitable firms have accumulated lower profits in the past and may have to use debt financing, implying higher leverage (pecking order theory)
⇒ corr(Profitabilityi, vi) < 0
⇒ E[vi|Profitabilityi] ≠ E[vi]
One reason vi and xi might be correlated?
Suppose that we know yi is generated by the following
yi = β0 + β1 xi + γ ai + ei
where xi and ei are uncorrelated, but Corr(ai, xi) > 0
Could think of yi as wages, xi as years of schooling, ai as ability
Suppose we see yi and xi but not ai, and have to consider the model:
yi = β0 + β1 xi + vi,  where vi = γ ai + ei
A Quick Review: Properties of Covariance
A few properties of covariance. If W, X, Z are random variables:
Cov(W, X + Z) = Cov(W, X) + Cov(W, Z)
Cov(X, X) = Var(X)
If a and b are constants:
Cov(aW, bX) = ab·Cov(W, X)
Cov(a + W, X) = Cov(W, X)
Finally, remember that correlation is just the covariance scaled:
Corr(X, Z) = Cov(X, Z) / √(Var(X)·Var(Z))
I'll switch back and forth between them sometimes
Omitted Variables Bias
So if we have:
yi = β0 + β1 xi + vi
what will the regression of yi on xi give us?
Recall that the regression coefficient is β1^OLS = Cov(yi, xi)/Var(xi):
β1^OLS = Cov(yi, xi)/Var(xi)
= Cov(β0 + β1 xi + vi, xi)/Var(xi)
= β1·Cov(xi, xi)/Var(xi) + Cov(vi, xi)/Var(xi)
= β1 + Cov(vi, xi)/Var(xi)
Omitted Variables Bias
So β1^OLS is biased:
β1^OLS = β1 + Cov(vi, xi)/Var(xi)   (the second term is the bias)
If vi = γ ai + ei with Corr(ai, xi) ≠ 0 and Corr(ei, xi) = 0,
we can characterize this bias in simple terms:
β1^OLS = β1 + Cov(γ ai + ei, xi)/Var(xi)
= β1 + γ·Cov(ai, xi)/Var(xi)   (bias = γ·Cov(ai, xi)/Var(xi))
Omitted Variables Bias
β1^OLS = β1 + γ·Cov(ai, xi)/Var(xi)
= β1 + γ·δ1^OLS
where δ1^OLS is the coefficient from the regression:
ai = δ0^OLS + δ1^OLS xi + ηi^OLS
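A quick R check of this formula, with made-up parameter values: the short regression's slope lands near β1 + γ·δ1^OLS, where δ1^OLS comes from regressing ai on xi:

```r
set.seed(4)
n <- 1e5
a <- rnorm(n)                        # ability (omitted variable)
x <- 0.8 * a + rnorm(n)              # schooling, correlated with ability
y <- 1 + 2 * x + 1.5 * a + rnorm(n)  # beta1 = 2, gamma = 1.5

coef(lm(y ~ x))["x"]                 # biased: approx. beta1 + gamma * delta1
delta1 <- coef(lm(a ~ x))["x"]
2 + 1.5 * delta1                     # the omitted variables bias formula
```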
Omitted Variables Bias
Good heuristic for evaluating OLS estimates:
β1^OLS = β1 + γ·δ1^OLS   (bias = γ·δ1^OLS)
γ: the relationship between ai and yi
δ1^OLS: the relationship between ai and xi
We might not be able to measure γ or δ1^OLS, but we can often make a good guess
Impact of Schooling on Wages
Suppose wages (yi ) are determined by:
yi = β0 + β1 xi + γ ai + ei
and we see years of schooling (xi) but not ability (ai)
Corr(xi, ai) > 0 and Corr(yi, ai) > 0
We estimate:
yi = β0 + β1 xi + vi
And recover:
β1^OLS = β1 + γ·δ1^OLS   (bias = γ·δ1^OLS)
Impact of Schooling on Wages
β1^OLS = β1 + γ·δ1^OLS   (bias = γ·δ1^OLS)
Is our estimated β1^OLS larger or smaller than β1?
(answer on menti.com)
Controlling for a Confounding Variable
yi = β0 + β1 xi + γ ai + ei
Suppose we are able to observe ability
e.g. an IQ test is all that matters
For simplicity, let xi be binary:
xi = 1 if individual i has an MSc, 0 otherwise
Suppose we regress yi on xi and ai:
β1^OLS = E[yi|xi = 1, ai] − E[yi|xi = 0, ai]
= E[β0 + β1 + γ ai + ei|xi = 1, ai] − E[β0 + γ ai + ei|xi = 0, ai]
Controlling for a Confounding Variable
β1^OLS = E[β0 + β1 + γ ai + ei|xi = 1, ai] − E[β0 + γ ai + ei|xi = 0, ai]
Canceling out terms gives:
β1^OLS = β1 + E[ei|xi = 1, ai] − E[ei|xi = 0, ai]
So β1^OLS = β1 if the following condition holds:
E[ei|xi, ai] = E[ei|ai]
This is called Conditional Mean Independence
A slightly weaker version of our conditional independence assumption (yi1, yi0) ⊥ xi | ai
Example: Controlling for a Confounding Variable
Republican Votes and Income: South and North
[Figures: scatter plots of Republican vote share against income for Southern and Northern states]
Controlling for a Confounding Variable
Suppose we run the following regression:
repvotesi = β0 + β1 incomei + vi
What is β̂1^OLS? (answer on menti.com)
So does being rich decrease Republican votes?
Suppose we run the regression separately in the South and in the North:
repvotesi = β0 + β1 incomei + vi   (South only)
repvotesi = β0 + β1 incomei + vi   (North only)
What is β̂1^OLS in the South?
Within regions, income is positively associated with Republican votes
Controlling for a Confounding Variable
Now suppose instead we run the following regression:
repvotesi = β0 + β1 incomei + γ southi + ei
where southi = 1 for southern states
We estimate β̂1^OLS = 0.340
This is just the (weighted) average of the two within-region estimates:
β̂1^OLS ≈ (1/2)·β̂1^South + (1/2)·β̂1^North
(the weights in general do not have to be 1/2)
Why is This? Regression Anatomy
Suppose we have the following (multivariate) regression:
yi = β0^OLS + β1^OLS xi + Zi′γ + vi
Here Zi is a potentially multidimensional set of controls
Then the OLS estimator is algebraically equal to:
β1^OLS = Cov(yi, x̃i)/Var(x̃i)
where x̃i is the residual from a regression of xi on Zi:
x̃i = xi − (δ0^OLS + Zi′δ^OLS)
Why is This? Regression Anatomy
β1^OLS = Cov(yi, x̃i)/Var(x̃i)
where x̃i is the residual from a regression of xi on Zi:
x̃i = xi − (δ0^OLS + Zi′δ^OLS)
The coefficient from a multiple regression is the same as that from a single regression!
After first subtracting (partialling out) the part of xi explained by Zi
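A sketch in R of this regression-anatomy (Frisch-Waugh-Lovell) result on simulated data: the multiple-regression coefficient on xi equals the slope from regressing yi on the residualized x̃i:

```r
set.seed(5)
n <- 1e4
z <- rnorm(n)                     # a control variable
x <- 1 + 0.7 * z + rnorm(n)
y <- 2 + 3 * x - 2 * z + rnorm(n)

coef(lm(y ~ x + z))["x"]          # multiple-regression coefficient on x

x_tilde <- resid(lm(x ~ z))       # partial out z from x
coef(lm(y ~ x_tilde))["x_tilde"]  # identical coefficient
```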
Republican Votes and Income: South and North
[Figure: raw scatter of Republican votes against income, by region]
Residualizing W.R.T. South Removes Difference in Income
[Figure: income residualized with respect to the South dummy; the regional difference in income disappears]
OLS Finds Line of Best Fit on the "Residuals"
[Figure: OLS line fitted to the residualized data]
Measurement Error
An omitted variable is a common reason why β1^OLS ≠ β1
There are several others, including measurement error
Suppose yi is generated by:
yi = β0 + β1 xi* + ei
But we can't see xi* exactly; instead we see:
xi = xi* + ηi,  where Corr(xi*, ηi) = 0
Why might this happen?
Measurement Error
yi = β0 + β1 xi* + ei
xi = xi* + ηi ⇒ xi* = xi − ηi
⇒ yi = β0 + β1 xi − β1 ηi + ei
yi = β0 + β1 xi + vi,  where vi = −β1 ηi + ei
Measurement Error
So we can only estimate:
yi = β0 + β1 xi + vi,  where vi = −β1 ηi + ei
But because of the measurement error: Corr(xi, vi) ≠ 0
Why? Both xi and vi contain the measurement error ηi
And our estimates are off!
β1^OLS = Cov(yi, xi)/Var(xi) ≠ β1
Measurement Error
Measurement error in xi is really also an omitted variables problem, with ηi as the omitted variable
What happens if we mismeasure yi?
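A simulated illustration in R, with hypothetical variances: classical measurement error in xi attenuates the slope toward zero by the factor Var(xi*)/(Var(xi*) + Var(ηi)), while noise in yi alone leaves the slope unbiased (it only inflates standard errors):

```r
set.seed(6)
n      <- 1e5
x_star <- rnorm(n)        # true regressor, variance 1
eta    <- rnorm(n)        # measurement error, variance 1
x      <- x_star + eta    # observed, mismeasured regressor
y      <- 1 + 2 * x_star + rnorm(n)

coef(lm(y ~ x))["x"]      # attenuated: approx. 2 * 1 / (1 + 1) = 1

# Mismeasuring y instead: slope stays approx. 2, just noisier
y_noisy <- y + rnorm(n)
coef(lm(y_noisy ~ x_star))["x_star"]
```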
Part 4: An Overview of IV
Assumptions for a valid instrument
Two interpretations of an IV:
The ratio of two OLS coefficients
Two stage least squares
An example: the impact of schooling on wages
Hope remains for β
Suppose Corr(xi, vi) ≠ 0, but we want to estimate the causal effect β1 in:
yi = β0 + β1 xi + vi
For now keep yi = wages, xi = years of schooling in mind
An instrument (or instrumental variable) can help
What makes a good instrument?
Suppose there is a variable zi that should be totally unrelated to yi
For example: should the quarter you were born in impact your wage?
A sample of birth quarters
[Figure: histogram of the number of observations by quarter of birth (1-4)]
What makes a good instrument?
Suppose there is a variable zi that should be totally unrelated to yi
For example: should your birth quarter impact your wage?
Except for one thing:
zi has a direct impact on xi
In the US birth quarter influences years of schooling
[Figure: average years of education (about 12.6 to 12.9) by quarter of birth (1-4)]
In the US birth quarter influences years of schooling
Many US states mandate that students begin school in the calendar year when they turn 6
School years start in mid- to late- year (e.g., September)
Many states mandate students stay in school until they turn 16
In the US birth quarter influences years of schooling
Consider two students
(1) One born in February (Q1)
(2) One born in November (Q4)
Q1 student: starting school age ≈ 6.5
Q4 student: starting school age < 6
There is then variation in schooling completed when each turns 16
(Q4 > Q1)
Intuition behind instrumental variables
Suppose we really believe that zi should have no impact on yi except by changing xi
e.g. no possible other way birth quarter could influence wages
Then the impact of zi on yi should tell us something about the effect we are looking for
Any impact of birth quarter on wages must be because it raises education
Birth quarter impacts wages
[Figure: average yearly wage (about $22,500 to $23,000) by quarter of birth (1-4)]
Intuition behind instrumental variables
Being born in 4th quart. vs. 1st quart. increases yearly wages by: ≈ $300
If this all happens because quarter of birth increases education, how do we recover the impact of education on wages?
Being born in 4th quart. vs. 1st quart. increases education by: ≈ 0.2 years
So each year of schooling increases wages by:
β^IV = (impact of birth quarter (zi) on wages (yi)) / (impact of birth quarter (zi) on education (xi))
= $300 / 0.2 = $1,500
Formalizing the Instrumental Variables Assumptions
yi = β0 + β1 xi + vi
Our informal assumption: zi should change xi, but have absolutely no other impact on yi. Formally:
1. Cov[zi, xi] ≠ 0 (Instrument Relevance)
Intuition: zi must change xi
2. Cov[zi, vi] = 0 (Exclusion Restriction)
Intuition: zi has absolutely no other impact on yi
Recall that vi is everything else outside of xi that influences yi
Hope remains for β
1. Instrument Relevance: Cov[zi, xi] ≠ 0
2. Exclusion Restriction: Cov[zi, vi] = 0
Under these assumptions, we may consistently estimate β1 with the IV estimator:
β^IV = Cov(yi, zi)/Cov(xi, zi) = β1
This ratio should look familiar from our example…
And Cov(yi, zi) and Cov(xi, zi) can be estimated from the data!
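A simulated sketch in R (all parameters invented): OLS is biased because xi is correlated with vi, but the ratio Cov(yi, zi)/Cov(xi, zi) recovers β1:

```r
set.seed(7)
n <- 1e5
z <- rnorm(n)                      # instrument
v <- rnorm(n)                      # unobservable, unrelated to z
x <- 0.5 * z + 0.8 * v + rnorm(n)  # x correlated with both z and v
y <- 1 + 2 * x + v                 # beta1 = 2

coef(lm(y ~ x))["x"]               # OLS: biased upward
cov(y, z) / cov(x, z)              # IV ratio: approx. 2
```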
Getting βIV in practice
An alternative way of writing this:
β^IV = Cov(yi, zi)/Cov(xi, zi) = [Cov(yi, zi)/Var(zi)] / [Cov(xi, zi)/Var(zi)]
The numerator is the coefficient from a regression of yi on zi
The denominator is the coefficient from a regression of xi on zi
Getting βIV in practice
This means that if we run two regressions:
1. "First stage": the impact of zi on xi
xi = α1 + φ zi + ηi
2. "Reduced form": the impact of zi on yi
yi = α2 + ρ zi + ui
Then we can write:
β^IV = [Cov(yi, zi)/Var(zi)] / [Cov(xi, zi)/Var(zi)] = ρ^OLS / φ^OLS
(the impact of zi on yi, divided by the impact of zi on xi)
Getting β^IV in practice: Two stage least squares
A more common way of estimating β^IV:
1. Estimate φ^OLS in the first stage:
xi = α1 + φ zi + ηi
2. Predict the part of xi explained by zi:
x̂i = α1^OLS + φ^OLS zi
3. Regress yi on the predicted x̂i in a second stage:
yi = α2 + β x̂i + ui
Note that:
β^2ndStage = Cov(yi, x̂i)/Var(x̂i) = β
Getting βIV in practice: Two stage least squares
β^2ndStage = Cov(yi, x̂i)/Var(x̂i)
= Cov(yi, α1^OLS + φ^OLS zi)/Var(α1^OLS + φ^OLS zi)
= φ^OLS·Cov(yi, zi) / (φ^OLS·φ^OLS·Var(zi))
= ρ^OLS / φ^OLS
= β^IV = β
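Continuing the simulated sketch above, 2SLS by hand with two lm() calls, confirming it equals the reduced form over the first stage. In practice one would use a canned routine (e.g. ivreg from the AER package), since the naive second-stage standard errors below are not valid:

```r
fs   <- lm(x ~ z)                  # first stage: x on z
xhat <- fitted(fs)                 # predicted part of x explained by z

coef(lm(y ~ xhat))["xhat"]         # second stage: approx. 2

# Equals the reduced-form coefficient over the first-stage coefficient
coef(lm(y ~ z))["z"] / coef(fs)["z"]
```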
Two stage least squares works with more Xs and Zs
The same approach works with multiple X's, Z's, and with control variables:
yi = α + Xi′β + Wi′θ + vi
Xi = [xi1, …, xiM]′ with Cov(xim, vi) ≠ 0 ("endogenous variables")
Wi = [wi1, …, wiL]′ (control variables)
Zi = [zi1, …, ziK]′ with E[zi vi] = 0 (instruments)
Note that K must be at least as large as M:
At least as many instruments as endogenous variables
Two stage least squares works with multiple Xs and Zs
For each xim, run the first-stage regression:
xim = δ + Zi′γ + Wi′φ + εi
Generate predicted values:
x̂im = δ^OLS + Zi′γ^OLS + Wi′φ^OLS
Run the second-stage regression:
yi = α + X̂i′β + Wi′θ + vi,  where X̂i = [x̂i1, …, x̂iM]′
β^2ndStage = β
Back to Example IV: Education and Wages
Adapted from Angrist and Krueger, 1991
Does an additional year of education lead to higher wages?
yi = β0 + β1 xi + vi
Are there any concerns about OVB here?
What are the requirements for a valid instrument zi?
Example IV: Education and Wages
Angrist and Krueger use the quarter of birth as an instrument for education
Many US states mandate that students begin school in the calendar year when they turn 6
School years start in mid- to late- year (e.g., September)
Many states mandate students stay in school until they turn 16
Example IV: Education and Wages
Consider two students
(1) One born in February (Q1)
(2) One born in November (Q4)
Q1 student: starting school age ≈ 6.5
Q4 student: starting school age < 6
There is then variation in schooling completed when each turns 16
(Q4 > Q1)
First Stage
[Figure: years of education by quarter of birth, the first-stage relationship]
Reduced Form
[Figure: wages by quarter of birth, the reduced-form relationship]
Example IV: Education and Wages
A simple instrument:
zi = 1{Quarter of Birth = 1}
First stage:
xi = γ0 + γ1 zi + εi
Do you expect γ1^OLS > 0 or γ1^OLS < 0?
Predict x̂i = γ0^OLS + γ1^OLS zi
Second stage:
yi = β0 + β1 x̂i + vi
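A hypothetical sketch in R of this design with a binary instrument: the Wald estimate (reduced form over first stage) recovers the return to schooling, while OLS is biased by ability. The data are simulated, not the Angrist-Krueger census data:

```r
set.seed(8)
n   <- 1e5
q1  <- rbinom(n, 1, 0.25)                # z_i = 1{quarter of birth = 1}
abl <- rnorm(n)                          # unobserved ability
edu <- 12.9 - 0.2 * q1 + abl + rnorm(n)  # first stage: gamma1 < 0
wage <- 10000 + 1500 * edu + 4000 * abl + rnorm(n, sd = 2000)

# Wald estimate: approx. 1500 per year of schooling
(mean(wage[q1 == 1]) - mean(wage[q1 == 0])) /
  (mean(edu[q1 == 1]) - mean(edu[q1 == 0]))

coef(lm(wage ~ edu))["edu"]              # OLS: biased upward by ability
```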
OLS and IV estimates of the economic returns to schooling
Today’s Lecture: Four Parts
(1) Analyzing an experiment in R
Comparing means (t-test) and regression
(2) Causality and the potential outcomes framework
What do we mean when we say X causes Y?
(3) Linear regression, the CEF, and causality
How do we think about causality in a regression framework?
(4) Instrumental Variables (if time)
The intuition behind instrumental variables
Next Week
Next week: introduction to panel data