General Instructions
DS-UA 201 Causal Inference Instructor: Homework 2-100 points Due: July 28, 2022
This homework must be turned in on Gradescope by July 28, 2022, 11:59pm. It must be your own work, and your own work only: you must not copy anyone’s work, or allow anyone to copy yours. This extends to writing code. You may consult with others, but when you write up, you must do so alone. Your homework submission must be written and submitted using Rmarkdown. No handwritten solutions will be accepted. You should submit:
1. A compiled PDF file named yourNetID_solutions.pdf containing your solutions to the problems.
2. A .Rmd file named yourNetID_solutions.Rmd containing the code and text used to produce your compiled PDF. Note that math can be typeset in Rmarkdown in the same way as in LaTeX.
Please make sure your answers are clearly structured in the Rmarkdown file:
1. Label each question part (e.g., 3.a).
2. Do not include written answers as code comments.
3. The code used to obtain the answer for each question part should accompany the written answer.
Problem 1 – Changing Minds on Gay Marriage? 30 points
In this exercise, we analyze the data from two experiments in which households were canvassed for support on gay marriage. Note that the original study was later retracted due to allegations of fabricated data. In this exercise, however, we analyze the original data while ignoring the allegations. Canvassers were given a script leading to conversations that averaged about twenty minutes. A distinctive feature of this study is that gay and straight canvassers were randomly assigned to households, and canvassers revealed whether they were straight or gay in the course of the conversation. The experiment aims to test the “contact hypothesis,” which contends that out-group hostility (towards gay people in this case) diminishes when people from different groups interact with one another. The data file is gay.csv, which is a CSV file. Table 2.7 presents the names and descriptions of the variables in this data set. Each observation of this data set is a respondent giving a response to a four-point survey item on same-sex marriage. There are two different studies in this data set, involving interviews during seven different time periods (i.e., seven waves). In both studies, the first wave consists of the interview before the canvassing treatment occurs.
1. (4 points) Using the baseline interview wave before the treatment is administered, examine whether randomization was properly conducted. Base your analysis on the three groups of study 1: “same-sex marriage script by gay canvasser,” “same-sex marriage script by straight canvasser” and “no contact.” Briefly comment on the results.
2. (4 points) The second wave of the survey was implemented two months after canvassing. Using study 1, estimate the average treatment effects of gay and straight canvassers on support for same-sex marriage, separately. Give a brief interpretation of the results.
3. (5 points) The study contained another treatment that involves contact, but does not involve using the gay marriage script. Specifically, the authors used a script to encourage people to recycle. What is the purpose of this treatment? Using study 1 and wave 2, compare outcomes from the treatment “same-sex marriage script by gay canvasser” to “recycling script by gay canvasser.” Repeat the same for straight canvassers, comparing the treatment “same-sex marriage script by straight canvasser” to “recycling script by straight canvasser.” What do these comparisons reveal? Give a substantive interpretation of the results.
4. (5 points) In study 1, the authors reinterviewed the respondents six different times (in waves 2 to 7) after treatment, at two-month intervals. The last interview, in wave 7, occurs one year after treatment. Do the effects of canvassing last? If so, under what conditions? Answer these questions by separately computing the average effects of straight and gay canvassers with the same-sex marriage script for each of the subsequent waves (relative to the control condition).
5. (4 points) The researchers conducted a second study to replicate the core results of the first study. In this study, same-sex marriage scripts are given only by gay canvassers. For study 2, use the treatments “same-sex marriage script by gay canvasser” and “no contact” to examine whether randomization was appropriately conducted. Use the baseline support from wave 1 for this analysis.
6. (3 points) For study 2, estimate the treatment effects of gay canvassing using data from wave 2. Are the results consistent with those of study 1?
7. (5 points) Using study 2, estimate the average effect of gay canvassing at each subsequent wave and observe how it changes over time. Note that study 2 did not have a fifth or sixth wave, but the seventh wave occurred one year after treatment, as in study 1. Draw an overall conclusion from both study 1 and study 2.
Problem 2 – Election Fraud in Russia 20 points
In this exercise, we use the rules of probability to detect election fraud by examining voting patterns in the 2011 Russian State Duma election. The State Duma is the federal legislature of Russia. The ruling political party, United Russia, won this election, but faced many accusations of election fraud, which the Kremlin, or Russian government, denied. Some protesters highlighted irregular patterns of voting as evidence of election fraud. In particular, the protesters pointed out the relatively high frequency of common fractions such as 1/4, 1/3, and 1/2 in the official vote shares.
Note: The results of each election are stored in a data frame. The RData file fraud.RData contains data on four elections: the 2003 and 2011 Russian Duma elections, the 2012 Russian presidential election, and the 2011 Canadian election.
We analyze the official election results, contained in the russia2011 data frame in the RData file fraud.RData, to investigate whether there is any evidence for election fraud. The RData file can be loaded using the load() function. Besides russia2011, the RData file contains the election results from the 2003 Russian Duma election, the 2012 Russian presidential election, and the 2011 Canadian election, as separate data frames. The table above presents the names and descriptions of variables used in each data frame. Note: Part of this exercise may require computationally intensive code.
1. (5 points) To analyze the 2011 Russian election results, first compute United Russia’s vote share as a proportion of the voters who turned out. Identify the 10 most frequently occurring fractions for the vote share. Create a histogram that sets the number of bins to the number of unique fractions, with one bar created for each uniquely observed fraction, to differentiate between similar fractions like 1/2 and 51/100. This can be done by using the breaks argument in the hist() function. What does this histogram look like at fractions with low numerators and denominators such as 1/2 and 2/3?
2. (10 points) The mere existence of high frequencies at low fractions may not imply election fraud. Indeed, more numbers are divisible by smaller integers like 2, 3, and 4 than by larger integers like 22, 23, and 24. To investigate the possibility that the low fractions arose by chance, assume the following probability model. The turnout for a precinct has a binomial distribution, whose size equals the number of voters and whose success probability equals the turnout rate for the precinct. The vote share for United Russia in this precinct is assumed to follow a binomial distribution, conditional on the turnout, where the size equals the number of voters who turned out and the success probability equals the observed vote share in the precinct. Conduct a Monte Carlo simulation under this alternative assumption (1000 simulations should be sufficient). What are the 10 most frequent vote share values? Create a histogram similar to the one in the previous question. Briefly comment on the results you obtain. Note: This question requires computationally intensive code. Write the code with a small number of simulations first and then run the final code with 1000 simulations.
3. (5 points) To judge the Monte Carlo simulation results against the actual results of the 2011 Russian election, we compare the observed fraction of observations within a bin of a certain size with its simulated counterpart. To do this, create histograms showing the simulated distribution of question 2’s four most frequently occurring fractions, i.e., 1/2, 1/3, 3/5, and 2/3, and compare them with the corresponding fractions’ proportion in the actual election. Briefly interpret the results.
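The two-stage binomial model in question 2 can be sketched as follows. This is only an illustration in Python on synthetic precincts (the submission itself is written in R Markdown); the arrays `N`, `turnout`, and `votes` and the helper names `simulate_shares` and `top_fractions` are hypothetical stand-ins, not the columns of russia2011.

```python
import numpy as np
from collections import Counter
from fractions import Fraction

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, NOT the real election results):
# N = registered voters, turnout = voters who turned out, votes = votes for the party.
N = rng.integers(50, 500, size=200)
turnout = np.maximum(rng.binomial(N, 0.6), 1)  # at least one voter per precinct
votes = rng.binomial(turnout, 0.5)

def simulate_shares(N, turnout, votes, n_sims, rng):
    """Draw turnout, then votes given turnout, from the two binomial models."""
    p_turnout = turnout / N       # observed turnout rate per precinct
    p_vote = votes / turnout      # observed vote share per precinct
    sim_turnout = rng.binomial(N, p_turnout, size=(n_sims, N.size))
    sim_votes = rng.binomial(sim_turnout, p_vote)
    return sim_votes, sim_turnout

def top_fractions(num, den, k=10):
    """Return the k most frequent vote shares as exact reduced fractions."""
    counts = Counter(
        Fraction(int(a), int(b))
        for a, b in zip(num.ravel(), den.ravel())
        if b > 0  # skip simulated precincts with zero turnout
    )
    return counts.most_common(k)

# Start with a small number of simulations, as the note in question 2 suggests.
sim_votes, sim_turnout = simulate_shares(N, turnout, votes, n_sims=20, rng=rng)
top = top_fractions(sim_votes, sim_turnout)
```

Using exact fractions rather than floating-point shares keeps values such as 1/2 and 51/100 distinct, which is the point of the one-bar-per-unique-fraction histogram.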
Problem 3 – 35 points
Gerber, Green and Larimer randomly assigned households to receive a mailing encouraging them to turn out to vote before the Michigan 2006 primary election (Gerber, Green and Larimer (APSR, 2008)). We will be using the individual data obtained from the experiment. Each row in the dataset represents an individual record, where p2000 represents whether the individual had voted in August 2000, g2000 represents whether the individual had voted in November 2000 (same for p2002, g2002, p2004). Each individual belongs to a household specified by hh_id.
Part a. (4 points) Data preparation: In order to analyze the GOTV data we will need to reproduce the household-level dataset of the original paper.
(a) Recode the variable “sex” by changing the character to float (i.e., “female” → 1.0, “male” → 0.0).
(b) Recode the variable “yob” into a new variable called “age” by subtracting yob from the year the experiment took place, 2006.
(c) Group the data into households, i.e., create a new dataframe where each row is a household with a unique hh_id, and each column is the mean value of each of the other individual-level variables in that household. (Hint: you may consider using dplyr.)
(d) In the paper, the authors analyzed households rather than individuals. Why did they do this?
Part b. (4 points) Validate Randomization:
Using the household dataset you obtained above, show that the experimental assignment is randomized at the household level by computing and showing the sample means of each of the variables p2000, g2000, p2002, g2002, p2004, hhsize, sex, and age in each of the treatment groups. Are these means similar across groups? If so, what does that imply for randomization and ignorability?
Part c. (4 points) ATE:
Using the household dataset you obtained above, use the Neyman estimator, denoted here as τ̂, to compute the average treatment effect for each treatment group compared to the control group. Name and briefly explain two assumptions in this experiment that allow us to compute the ATE.
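As a rough illustration of the estimator in Part c, the difference in means between one treatment arm and the control group can be sketched as below. The sketch is in Python on toy data (the homework itself is done in R); the function name `diff_in_means` and the arm labels are hypothetical and do not come from the GOTV dataset.

```python
import numpy as np

def diff_in_means(y, arm, treated_label, control_label="control"):
    """Difference-in-means estimate: mean outcome in one arm minus mean in control."""
    y = np.asarray(y, dtype=float)
    arm = np.asarray(arm)
    return y[arm == treated_label].mean() - y[arm == control_label].mean()

# Toy outcomes (1 = voted) for two hypothetical treatment arms and a control arm.
y = np.array([1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
arm = np.array(["neighbors"] * 4 + ["other"] * 4 + ["control"] * 4)

# One estimate per treatment arm, each compared against the same control group.
estimates = {a: diff_in_means(y, arm, a) for a in ("neighbors", "other")}
```

In the toy data the "neighbors" arm has mean 0.75 and the control arm has mean 0.25, so its estimate is 0.5.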
Part d. (10 points) Variance and Average HP testing:
Assuming that the experiment is a completely randomized experiment, give an estimate of the variance of the estimated ATE of the Neighbors treatment compared to the control group, using the Neyman variance estimator, denoted as V̂ar[τ̂]. In addition, conduct a two-sided hypothesis test against the null that the ATE is 0, i.e., H0 : τ = 0, with the alternative H1 : τ ≠ 0, using the Z-statistic as your test statistic, i.e.:
$$Z_n = \frac{\hat{\tau}}{\sqrt{\widehat{\mathrm{Var}}[\hat{\tau}]}}$$
Report both the value of Zn and the p-value for the test.
Part e. (10 points) Randomization Inference:
Conduct a randomization inference hypothesis test on the experiment data for the sharp null hypothesis that Yi(neighbors) = Yi(control) for all i. Using Zn as defined before as your test statistic, follow the steps below:
(a) Simulate the value of Zn under the sharp null for at least N = 1000 iterations.
(b) Plot the values you obtained as a histogram.
(c) Add a marker for the observed value of Zn.
(d) Report the two-sided p-value for the test.
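The steps above can be sketched as follows, in Python on synthetic data (the actual submission is in R). The sketch assumes Zn is the difference in means divided by the square root of the Neyman variance estimate, and that treatment is re-drawn by permuting the observed assignment vector; `z_stat`, `randomization_test`, and the simulated outcomes are hypothetical names and values used only for illustration.

```python
import numpy as np

def z_stat(y, d):
    """Difference in means divided by the Neyman standard error (assumed form of Zn)."""
    y1, y0 = y[d == 1], y[d == 0]
    tau_hat = y1.mean() - y0.mean()
    var_hat = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size
    return tau_hat / np.sqrt(var_hat)

def randomization_test(y, d, n_iter=1000, seed=0):
    """Simulate Zn under the sharp null by re-randomizing treatment labels."""
    rng = np.random.default_rng(seed)
    z_obs = z_stat(y, d)
    # Under the sharp null the outcomes are fixed; only the assignment changes.
    z_null = np.array([z_stat(y, rng.permutation(d)) for _ in range(n_iter)])
    # Two-sided p-value: share of simulated statistics at least as extreme as observed.
    p_value = np.mean(np.abs(z_null) >= np.abs(z_obs))
    return z_obs, z_null, p_value

# Synthetic experiment: 100 treated and 200 control units with binary outcomes.
data_rng = np.random.default_rng(1)
d = np.repeat([1, 0], [100, 200])
y = data_rng.binomial(1, np.where(d == 1, 0.45, 0.30)).astype(float)

z_obs, z_null, p_value = randomization_test(y, d)
```

Here `z_null` holds the values that the histogram in step (b) would display, and `z_obs` is the value to mark in step (c).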
Part f. (3 points) Compare hypothesis tests:
Briefly comment on the difference between the p-values you obtained in parts d and e. Which is smaller? What could this difference be due to?
Problem 4 – 15 points
Problem 1 from 2021 hw2. Marks distribution for subparts: 1(4), 2(5), 3(6).
In this question we will be using the same household-level dataset that you constructed in
part a of Problem 3.
Part a (4 points):
Compute the ATE of the “Neighbors” treatment using the standard difference-in-means estimator, i.e., τ̂ = Ȳt − Ȳc. Provide standard errors and 95% confidence intervals for your estimates.
Part b (5 points):
Now compute the same ATE but with the stratification estimator that is defined as the weighted mean of the stratum CATEs that you computed in the previous problem:
$$\hat{\tau}_{\text{block}} = \sum_{x=1}^{5} \hat{\tau}(x)\,\frac{N_x}{N}.$$
Compute the variance and 95% confidence intervals for this estimator as well, using the stratified variance estimator defined as:
$$\widehat{\mathrm{Var}}(\hat{\tau}_{\text{block}}) = \sum_{x=1}^{5} \widehat{\mathrm{Var}}(\hat{\tau}(x)) \left(\frac{N_x}{N}\right)^{2}.$$
Comment on the difference between the ATE estimates you obtained here and in part a and their variances. What is it due to?
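A minimal sketch of the stratified estimator, in Python on toy data (the homework itself is done in R), is given below. It assumes the usual form in which each stratum effect is weighted by its share of units, the variance uses the squared shares, and the within-stratum variance is the Neyman estimate; the function name `stratified_ate` and the toy arrays are hypothetical.

```python
import numpy as np

def stratified_ate(y, d, strata):
    """Weighted mean of stratum-level differences in means, with its variance and 95% CI."""
    y, d, strata = map(np.asarray, (y, d, strata))
    n = y.size
    tau_block, var_block = 0.0, 0.0
    for s in np.unique(strata):
        in_s = strata == s
        y1, y0 = y[in_s & (d == 1)], y[in_s & (d == 0)]
        weight = in_s.sum() / n                      # stratum share N_x / N
        tau_s = y1.mean() - y0.mean()                # stratum-level effect estimate
        var_s = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size
        tau_block += weight * tau_s
        var_block += weight ** 2 * var_s             # squared weights for the variance
    se = np.sqrt(var_block)
    ci = (tau_block - 1.96 * se, tau_block + 1.96 * se)
    return tau_block, var_block, ci

# Toy data: two strata of four units each, two treated and two control per stratum.
y = np.array([1, 0, 0, 0, 1, 1, 1, 0])
d = np.array([1, 1, 0, 0, 1, 1, 0, 0])
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])

tau_block, var_block, ci = stratified_ate(y, d, strata)
```

On this toy data both stratum estimates equal 0.5 with variance 0.25, so the weighted estimate is 0.5 and the stratified variance is 0.125.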
Part c (6 points):
Now divide the data set into 6 strata in such a way that each stratum has the same proportion of treated and control observations. You can do so by creating a new variable called “group” with values 0, 1, 2, 3, 4, 5 and randomly assigning each value to Nt/6 treated units and Nc/6 control units. You may exclude enough treated and control units from the data to make Nt and Nc divisible by 6.
Compute the ATE by applying the stratified estimator τ̂block to these newly created strata. Provide block variance estimates and 95% confidence intervals for these ATE estimates as well, using the stratified variance estimator. Is the variance of this estimator much different from that of τ̂ you computed in part a? Why do you think this is the case?