DSUA-201 Problem Set 1
Fall 2021
This homework must be turned in on Brightspace by Oct 12th 2021, 11:59pm.
It must be your own work, and your own work only—you must not copy anyone’s work, or allow anyone to
copy yours. This extends to writing code. You may consult with others, but when you write up, you must
do so alone.
Your homework submission must be written and submitted using Rmarkdown. No handwritten solutions
will be accepted. You should submit:
1. A compiled PDF file named yourNetID_solutions.pdf containing your solutions to the problems.
2. A .Rmd file named yourNetID_solutions.Rmd containing the code and text used to produce your compiled PDF.
Note that math can be typeset in Rmarkdown in the same way as Latex.
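As a minimal illustration of the typesetting convention (the formulas are placeholders, not part of any answer), inline math goes between single dollar signs and display math between double dollar signs:

```latex
The sample mean is $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$.

$$
\hat{\tau} = \bar{Y}_1 - \bar{Y}_0
$$
```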
Please make sure your answers are clearly structured in the Rmarkdown file:
1. Label each question part (e.g., 3.a).
2. Do not include written answers as code comments.
3. The code used to obtain the answer for each question part should accompany the written answer.
Problem 1
Suppose we have a sample of observations, each assigned a binary treatment $D_i \in \{0, 1\}$, with $D_i = 1$ indicating
that a unit is treated and $D_i = 0$ indicating that the unit is assigned to control. Assume $0 < \Pr(D_i = 1) < 1$.
We observe an outcome $Y_i$ for each observation. Define potential outcomes $Y_i(1)$ and $Y_i(0)$, denoting the
outcome observed for unit $i$ if it were assigned treatment ($Y_i(1)$) or control ($Y_i(0)$), respectively. Assume
that $Y_i(1), Y_i(0)$ are iid from the same distribution, which implies that $E[Y_1(d)] = \cdots = E[Y_n(d)] = \mu(d)$
and $Var[Y_1(d)] = \cdots = Var[Y_n(d)] = \sigma^2(d)$.
In class we have talked about the Average Treatment Effect (ATE), which is defined as $\tau = E[Y_i(1) - Y_i(0)]$.
Another very common effect of interest to researchers is the Average Treatment Effect on the Treated (ATT),
which is defined as:
$$\tau_t = E[Y_i(1) - Y_i(0) \mid D_i = 1].$$
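To make the two estimands concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the outcome model, the sample size, and the rule that units with larger individual effects are more likely to be treated. Because the data are simulated, both potential outcomes are visible, which is never the case with real data.

```r
# Simulated potential outcomes; all distributions here are invented for illustration.
set.seed(4)
n      <- 10000
y0     <- rnorm(n)                 # potential outcome under control
effect <- rnorm(n, mean = 1)       # unit-level treatment effect
y1     <- y0 + effect              # potential outcome under treatment

# Hypothetical assignment rule: larger effects make treatment more likely.
# Assignment depends on the effect only, so y0 is independent of d but y1 is not.
d <- rbinom(n, size = 1, prob = plogis(effect))

ate <- mean(y1 - y0)               # average effect over all units
att <- mean((y1 - y0)[d == 1])     # average effect over treated units only

c(ATE = ate, ATT = att)
```

In this setup the two averages are taken over different sets of units, so they need not coincide.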
Part a (5pts):
What is the interpretation of the ATT? Give your description of what this effect means and how it is different
from the ATE.
Part b (10pts):
Assume that:
• Consistency holds for all treatment levels: $Y_i = Y_i(1) D_i + Y_i(0)(1 - D_i)$
• Weak ignorability holds only for the control outcome, i.e., $Y_i(0) \perp\!\!\!\perp D_i$, and it is not true that
$Y_i(1) \perp\!\!\!\perp D_i$
Show that consistency for all treatment levels and weak ignorability of the control condition (the assumptions
just made) are enough to identify the ATT, i.e., show that $\tau_t = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$.
Part c (10pts):
Write and simplify the difference between the ATE and the ATT under the same assumptions as Part b.
What additional assumption is necessary for this difference to be 0, and for the ATT to be equal to the
ATE? Why is this assumption enough?
Problem 2
Under the same setting as Problem 1, suppose that $D_i$ is assigned to the $n$ units in a Bernoulli trial; that is,
each unit receives treatment independently with probability $\Pr(D_i = 1) = p$.
Part a (10pts)
Recall that the number of treated units, $N_t$, is defined as $N_t = \sum_{i=1}^{n} D_i$. Recall that in the case of a Bernoulli
trial, $N_t$ is a random variable.
1. What distribution does $D_i$ follow?
2. What is $E[N_t]$?
3. What is $Var[N_t]$?
4. Suppose that we wanted the expected number of treated units in our Bernoulli trial to be the same
as the number of treated units in a completely randomized experiment with $n_t$ treated units. What
value of $p$ should we choose?
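The contrast between the two assignment mechanisms can be seen in a short simulation sketch. The values of n, p, and the fixed treated count are arbitrary assumptions chosen for illustration and are unrelated to the questions above.

```r
set.seed(5)
n <- 100
p <- 0.4                       # hypothetical assignment probability

# Bernoulli trial: each unit is treated independently, so the treated count varies.
n_t_bernoulli <- replicate(5, sum(rbinom(n, size = 1, prob = p)))

# Complete randomization: the treated count is fixed in advance (25 here, arbitrary).
n_t_complete <- replicate(5, sum(sample(rep(c(1, 0), times = c(25, n - 25)))))

n_t_bernoulli
n_t_complete
```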
Part b (15pts)
After conducting the experiment as described above, we wish to estimate the ATE. To do so, we employ the
following estimator:
$$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i D_i}{p} - \frac{Y_i (1 - D_i)}{1 - p} \right).$$
Show that under consistency, positivity, and ignorability for all treatments, this estimator is unbiased for the
ATE.
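As a computational aid (not a proof), the estimator can be evaluated on simulated data as in the sketch below. The sample size, the value of p, the outcome model, and the helper name ipw_estimate are all assumptions introduced for illustration.

```r
# Minimal sketch of the IPW estimator on simulated data.
# All numbers below (n, p, the outcome model) are made up for illustration.
set.seed(1)
n <- 1000
p <- 0.3

y0 <- rnorm(n)                          # potential outcome under control
y1 <- y0 + 2                            # potential outcome under treatment
d  <- rbinom(n, size = 1, prob = p)     # Bernoulli assignment with known p
y  <- y1 * d + y0 * (1 - d)             # observed outcome (consistency)

# Hypothetical helper: weights each observed outcome by the inverse of the
# probability of the arm it was observed in.
ipw_estimate <- function(y, d, p) {
  mean(y * d / p - y * (1 - d) / (1 - p))
}

tau_hat_ipw <- ipw_estimate(y, d, p)
tau_hat_ipw
```

A single draw gives one realization of the estimator; repeating the assignment step many times and averaging the estimates is one way to check an unbiasedness argument numerically.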
Part c (5pts)
Suppose now that we used the same estimator defined above in a completely randomized experiment,
where exactly $n_t$ units are treated. Show that, in this case, the estimator above is equal to the Neyman
“difference-in-means” estimator we saw in class.
Problem 3
Gerber, Green and Larimer randomly assigned households to receive a mailing encouraging them to turn
out to vote before the Michigan 2006 primary election (Gerber, Green and Larimer (APSR, 2008)). We will
be using the individual data obtained from the experiment. Each row in the dataset represents an individual
record, where p2000 represents whether the individual voted in August 2000 and g2000 represents whether
the individual voted in November 2000 (and likewise for p2002, g2002, and p2004). Each individual belongs to a
household specified by hh_id.
Part a. Data preparation (10pts):
In order to analyze the GOTV data we will need to reproduce the household-level dataset of the original
paper.
1. Recode the variable “sex” from character to float (i.e., “female” → 1.0, “male” → 0.0).
2. Recode the variable “yob” into a new variable called “age” by subtracting yob from the year the
experiment took place, 2006.
3. Group the data into households, i.e., create a new dataframe where each row is a household with a
unique hh_id, and each column is the mean value of each of the other individual-level variables in
that household. (Hint: you may consider using dplyr.)
4. In the paper, the authors analyzed households rather than individuals. Why did they do this?
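The recoding and grouping steps can be sketched on a toy data frame as follows. The rows are invented, only a few of the real columns are included, and base R's aggregate is used here as one option; a dplyr pipeline with group_by and summarise would be an equivalent alternative.

```r
# Toy individual-level data; values are invented for illustration only.
toy <- data.frame(
  hh_id = c(1, 1, 2, 3, 3),
  sex   = c("female", "male", "female", "male", "male"),
  yob   = c(1960, 1958, 1975, 1980, 1982),
  p2000 = c(1, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

toy$sex <- as.numeric(toy$sex == "female")   # "female" -> 1, "male" -> 0
toy$age <- 2006 - toy$yob                    # age in the year of the experiment

# One row per household: mean of every other column within each hh_id.
households <- aggregate(. ~ hh_id, data = toy, FUN = mean)
households
```

Note that the formula interface of aggregate drops rows with missing values by default, which may or may not be what is wanted on the real data.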
Part b. Validate Randomization (10pts):
Using the household dataset you obtained above, show that the experimental assignment is randomized at the
household level by computing and showing the sample means of each of the variables p2000, g2000, p2002,
g2002, p2004, hhsize, sex, and age in each of the treatment groups. Are these means similar across groups?
If so, what does that imply for randomization and ignorability?
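A table of covariate means by group can be produced along the following lines. This is a sketch on simulated data: the column name treatment, the two group labels, and all values are assumptions, and only two covariates are shown.

```r
# Simulated household-level data; the treatment column name is hypothetical.
set.seed(2)
hh <- data.frame(
  treatment = sample(c("Control", "Neighbors"), 200, replace = TRUE),
  age       = rnorm(200, mean = 50, sd = 10),
  hhsize    = sample(1:4, 200, replace = TRUE)
)

# Mean of each covariate within each treatment group.
balance <- aggregate(cbind(age, hhsize) ~ treatment, data = hh, FUN = mean)
balance
```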
Part c. ATE (5pts):
Using the household dataset you obtained above, use the Neyman estimator, denoted here as $\hat{\tau}$, to compute
the average treatment effect for each treatment group compared to the control group. Name and briefly
explain two assumptions in this experiment that allow us to compute the ATE.
Part d. Variance and Average HP testing (10pts):
Assuming that the experiment is a completely randomized experiment, give an estimate of the variance
of the estimated ATE of the Neighbors treatment compared to the control group, using the Neyman
variance estimator, denoted as $\widehat{Var}[\hat{\tau}]$. In addition, conduct a two-sided hypothesis test against the null
that the ATE is 0, i.e., $H_0: \tau = 0$, with the alternative $H_1: \tau \neq 0$, using the Z-statistic as your test
statistic, i.e.:
$$Z_n = \frac{\sqrt{n}(\hat{\tau} - \tau)}{\sqrt{\widehat{Var}[\hat{\tau}]}}.$$
Report both the value of $Z_n$ and the p-value for the test.
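The mechanics of the difference in means, the variance estimate, and the normal-approximation test can be sketched on toy data as below. The outcome vectors and the helper name neyman are assumptions. The sketch also assumes the variance estimate refers to the estimator itself, so the z-statistic is simply the estimate divided by its estimated standard error; rescale it if the definition used in class differs.

```r
# Toy outcome data; the values are invented for illustration only.
y_treat   <- c(1, 0, 1, 1, 0, 1)
y_control <- c(0, 0, 1, 0, 1, 0)

# Hypothetical helper bundling the estimate, its variance, and the test.
neyman <- function(y1, y0) {
  tau_hat <- mean(y1) - mean(y0)                           # difference in means
  var_hat <- var(y1) / length(y1) + var(y0) / length(y0)   # sum of per-arm variance terms
  z       <- tau_hat / sqrt(var_hat)                       # z-statistic under H0: tau = 0
  p_value <- 2 * pnorm(-abs(z))                            # two-sided normal p-value
  list(tau_hat = tau_hat, var_hat = var_hat, z = z, p_value = p_value)
}

res <- neyman(y_treat, y_control)
res
```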
Part e. Randomization Inference (15pts):
Conduct a randomization inference hypothesis test on the experiment data for the sharp null hypothesis
that $Y_i(\text{neighbors}) = Y_i(\text{control})$ for all $i$. Using $Z_n$ as defined before as your test statistic, follow the steps
below:
1. Simulate the value of $Z_n$ under the sharp null for at least N = 1000 iterations.
2. Plot the values you obtained as a histogram.
3. Add a marker for the observed value of $Z_n$.
4. Report the two-sided p-value for the test.
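The steps above can be sketched on simulated data as follows. The data, the helper name z_stat, and the form of the statistic (estimate divided by its estimated standard error) are assumptions for illustration. Under the sharp null the observed outcomes stay fixed, so only the assignment vector is re-drawn; permuting it keeps the number of treated units fixed, as in a completely randomized experiment.

```r
# Simulated two-arm data; all values are invented for illustration.
set.seed(3)
n <- 40
d <- rep(c(1, 0), each = n / 2)
y <- rnorm(n) + 0.5 * d

# Hypothetical helper: difference in means over its estimated standard error.
z_stat <- function(y, d) {
  y1 <- y[d == 1]
  y0 <- y[d == 0]
  (mean(y1) - mean(y0)) / sqrt(var(y1) / length(y1) + var(y0) / length(y0))
}

z_obs <- z_stat(y, d)

# Re-draw the assignment 1000 times, holding the outcomes fixed.
z_sim <- replicate(1000, z_stat(y, sample(d)))

# Two-sided p-value: share of simulated statistics at least as extreme as observed.
p_value <- mean(abs(z_sim) >= abs(z_obs))

hist(z_sim, main = "Z under the sharp null", xlab = "Z")
abline(v = z_obs, col = "red", lwd = 2)   # marker for the observed statistic
p_value
```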
Part f. Compare hypothesis tests (5pts):
Briefly comment on the difference between the p-values you obtained in parts d and e. Which is smaller?
What could this difference be due to?