BU.450.760 Technical Document T6.1 – Diff-in-Diff Prof.
Differences-in-Differences Analysis in R
This is a companion document for the script of S6.1.
We use data set D6.1 (described in C6.1), which lists the number of weekly visits (“weeklyvisits”, expressed in hundreds of thousands of visits) for 88 Spanish and French websites (indexed by “siteid”) over the 20 weeks before and the 20 weeks after the shutdown of the Spanish Google News site. Weeks are indexed by week = 1, ..., 40, with the shutdown occurring in week 21 (early December 2014). For each website, the data indicate whether it has regional (as opposed to national) reach and whether it focuses on sports topics.
1. Preliminaries
The first few lines of code set the working directory, clear the workspace and load the data.
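The original script is not reproduced here, so the following is only a minimal sketch of what such preliminary lines might look like. The folder (a temporary directory), the file name “D6.1_toy.csv” and the toy values are assumptions made for illustration; in practice, setwd() would point at the folder that holds the actual D6.1 file.

```r
# Clear the workspace (removes all objects from the current environment).
rm(list = ls())

# Hypothetical stand-in for the course folder: a temporary directory that
# holds a tiny toy CSV. Replace both with your own folder and the D6.1 file.
demo_dir <- tempdir()
toy <- data.frame(
  siteid       = c(1, 1, 2, 2),
  week         = c(20, 21, 20, 21),
  country      = c("Spain", "Spain", "France", "France"),
  weeklyvisits = c(900, 650, 1440, 1410)
)
write.csv(toy, file.path(demo_dir, "D6.1_toy.csv"), row.names = FALSE)

# Set the working directory, remembering the previous one.
old_wd <- setwd(demo_dir)

# Load the data and inspect its structure.
ds <- read.csv("D6.1_toy.csv")
str(ds)

# Restore the previous working directory (only needed for this demo).
setwd(old_wd)
```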
2. Assessing balance in website composition
The shutdown of the Spanish site of Google News can be taken as a natural experiment that eliminates the news aggregator for Spanish sites. We could be tempted to evaluate the impact of this treatment by comparing site visits (treated = Spanish sites vs. control = French sites) after the shutdown. However, this could be misleading if there were differences in the types of websites that make up the Spanish and French subsamples. Hence, we start by evaluating the composition of websites in each subsample.
We will do this using the command “aggregate”, which can perform a range of data aggregation procedures. In line 18, we instruct the command to compute the mean (third argument) of ds$regional (first argument, the variable of interest) by country (second argument). The instruction is repeated in line 19 for the sports-focus indicator. Line 20 is a more compact way of giving the same instructions, whereby both variables are aggregated at the same time (note: the summarized variables are the 5th and 6th columns of ds, respectively).
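As a rough illustration of the three aggregate calls just described, the sketch below runs them on a toy data frame. The column names “country” and “sports”, the column order, and all values are assumptions made for this example; only ds$regional and the 5th/6th column positions come from the text.

```r
# Toy stand-in for ds (hypothetical values): two rows per site, with
# regional and sports placed in the 5th and 6th columns.
ds <- data.frame(
  siteid       = rep(1:8, each = 2),
  week         = rep(c(20, 21), times = 8),
  country      = rep(c("France", "Spain"), each = 8),
  weeklyvisits = rep(c(1400, 900), each = 8),
  regional     = rep(c(0, 0, 0, 1, 1, 1, 0, 0), each = 2),
  sports       = rep(c(1, 1, 0, 0, 1, 0, 0, 0), each = 2)
)

# Mean of the regional indicator by country (variable, grouping, function).
reg_by_country <- aggregate(ds$regional, by = list(country = ds$country), FUN = mean)

# Same instruction for the sports-focus indicator.
spo_by_country <- aggregate(ds$sports, by = list(country = ds$country), FUN = mean)

# Compact version: aggregate the 5th and 6th columns in one call.
both <- aggregate(ds[, 5:6], by = list(country = ds$country), FUN = mean)
print(both)
```

When a single vector is aggregated, the result column is named “x”; the compact version keeps the original column names, which makes the output easier to read.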
The results below show noticeable disparities in website composition: the sample of Spanish websites contains about 50% more regional websites and about 50% fewer sports websites. From this evidence, we should reject the premise that the treatment was assigned as-if-random.
3. Assessing balance in terms of baseline website visits
Another way to investigate whether it is valid to infer treatment effects from simple differences is based on baseline (i.e., pre-treatment) outcome levels. If one group of websites (treated or control) had a systematically larger number of visits before the treatment (for example, due to each country’s total population or internet speed), we would expect this baseline difference to also appear in the post period. Not accounting for this baseline difference would lead to an incorrect estimate of the treatment effect.
To compute the baseline difference, we first code an indicator “post”, which equals one for the weeks from the shutdown onward (weeks 21 to 40) and zero otherwise (line 26). Next, we repeat the aggregation command used before, but restricted to the pre-treatment (i.e., post=0) period, as shown in line 27.
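The two steps just described could look roughly as follows. This is a sketch on a toy data frame, not the original script: the column name “country” is an assumption, and the toy values are constructed so that the pre-period country averages equal the 1440 and 905 reported in the diff-in-diff table of this document.

```r
# Toy stand-in for ds: four sites observed in two pre and two post weeks.
ds <- expand.grid(week = c(19, 20, 21, 22), siteid = 1:4)
ds$country <- ifelse(ds$siteid <= 2, "France", "Spain")

# Hypothetical visits: a country- and period-specific level plus a site shift.
level <- ifelse(ds$country == "France",
                1440 - 29 * (ds$week >= 21),
                905 - 264 * (ds$week >= 21))
ds$weeklyvisits <- level + ifelse(ds$siteid %% 2 == 0, 40, -40)

# Indicator for the post period (shutdown occurs in week 21).
ds$post <- as.numeric(ds$week >= 21)

# Average visits by country, using pre-treatment observations only.
is_pre <- ds$post == 0
pre <- aggregate(ds$weeklyvisits[is_pre],
                 by = list(country = ds$country[is_pre]),
                 FUN = mean)
print(pre)

# Ratio of Spanish to French baseline visits.
ratio <- pre$x[pre$country == "Spain"] / pre$x[pre$country == "France"]
print(ratio)
```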
The results shown below suggest that, in the pre-treatment period, Spanish websites had only about 60% of the average visits of French websites. This baseline difference is expected to continue in the “post” period. The treatment will make the difference either bigger or smaller. Our inference about the treatment effect follows from how much this difference changes (i.e., differences-in-differences).
4. Differences-in-differences table
A quick and simple way to investigate the magnitude of the treatment effect of interest is a difference-in-differences table, namely, a table that lists the average pre- and post-period outcomes for treated and control units.
As before, we will rely on the aggregate command. In this case, however, we need to aggregate by period (post=0,1) and country, so the second argument (line 33) corresponds to a list of these dimensions for aggregation. For later use, we will store the aggregation results in an object that we are calling “aggs”.
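A sketch of this two-way aggregation is given next. The toy data are hypothetical and are constructed so that the four cell averages equal the values reported in this document (1440 and 1411 for France, 905 and 641 for Spain); the column name “country” is again an assumption.

```r
# Toy stand-in for ds: four sites observed in two pre and two post weeks.
ds <- expand.grid(week = c(19, 20, 21, 22), siteid = 1:4)
ds$country <- ifelse(ds$siteid <= 2, "France", "Spain")
ds$post <- as.numeric(ds$week >= 21)

# Hypothetical visits: cell-specific level plus a site-specific shift.
cell_mean <- ifelse(ds$country == "France",
                    1440 - 29 * ds$post,
                    905 - 264 * ds$post)
ds$weeklyvisits <- cell_mean + ifelse(ds$siteid %% 2 == 0, 40, -40)

# Average visits by period and country; the second argument is a list
# with one element per aggregation dimension.
aggs <- aggregate(ds$weeklyvisits,
                  by = list(post = ds$post, country = ds$country),
                  FUN = mean)
print(aggs)
```

The result has one row per (post, country) combination, with the averages stored in the column “x”.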
The results shown above indicate that average weekly visits in the post period decreased both for French and Spanish websites. However, for Spanish websites the drop was much larger.
Intuitively, the differences-in-differences estimator corresponds to the difference between these country-specific pre-post differences. The calculations below show how to derive a “model-free” estimate of the treatment effect, namely, as the difference of first differences. This estimate suggests that the shutdown of the Spanish site of Google News led to an average decrease of about 235 (x 100,000) weekly visits to Spanish websites.
“First differences” (i.e., pre/post differences for each group):
• France: 1411-1440 = -29
• Spain: 641-905 = -264
Model-free diff-in-diff estimate: (-264) - (-29) = -235
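The same arithmetic can be reproduced in R. The sketch below types in the four averages reported above by hand; the object layout (columns “post”, “country” and “x”) and the helper first_diff are assumptions made for illustration.

```r
# Cell averages reported above, entered manually.
aggs <- data.frame(
  post    = c(0, 1, 0, 1),
  country = c("France", "France", "Spain", "Spain"),
  x       = c(1440, 1411, 905, 641)
)

# Hypothetical helper: post-period average minus pre-period average.
first_diff <- function(ctry) {
  aggs$x[aggs$country == ctry & aggs$post == 1] -
    aggs$x[aggs$country == ctry & aggs$post == 0]
}

d_france <- first_diff("France")  # first difference for control sites
d_spain  <- first_diff("Spain")   # first difference for treated sites

# Diff-in-diff: treated first difference minus control first difference.
did <- d_spain - d_france
print(c(France = d_france, Spain = d_spain, DiD = did))
```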
5. Graphical assessment
A diff-in-diff analysis should always be complemented with a graphical analysis. Its main goal is to generate additional evidence that either supports or undermines a causal interpretation of the diff-in-diff estimate.
Our graphical analysis consists of plotting the average weekly visits for treated and control websites in each of the weeks covered by the sample. The resulting graph is presented below (the code is provided next). First, notice that, consistent with the diff-in-diff table, average visits in the period prior to the event (marked with a black vertical line) are larger for French sites (horizontal lines show period-specific averages). Following the event, visits to Spanish sites drop sharply and then remain stable at a new, lower level. Most importantly, the drop in Spanish visits after the event is abrupt; that is, it does not appear to follow from a pre-existing downward trend. If the lower average level of Spanish sites after week 20 had been the result of a decline that started before week 20, we might not be able to attribute our estimate to the shutdown of Google News Spain.
The code that implements this plot is presented below. It uses the command ggplot, which is provided by the library ggplot2. (If you have not installed this library, un-comment and run line 42.) Lines 49-53 are responsible for plotting the conditional means (horizontal lines).
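Since the original plotting code is not reproduced here, the following is only one possible way to build such a plot. The simulated data, the column name “country”, the helper columns “xstart” and “xend”, and the placement of the vertical line at 20.5 (between the last pre week and the first post week) are all assumptions made for this sketch. The plot is drawn only if ggplot2 is installed.

```r
# Simulated stand-in for ds: four sites over 40 weeks (hypothetical values).
ds <- expand.grid(week = 1:40, siteid = 1:4)
ds$country <- ifelse(ds$siteid <= 2, "France", "Spain")
ds$post <- as.numeric(ds$week >= 21)
ds$weeklyvisits <- ifelse(ds$country == "France",
                          1440 - 29 * ds$post,
                          905 - 264 * ds$post) + 10 * sin(ds$week)

# Average visits per week and country (the plotted lines).
weekly <- aggregate(weeklyvisits ~ week + country, data = ds, FUN = mean)

# Average visits per period and country (the horizontal segments).
period <- aggregate(weeklyvisits ~ post + country, data = ds, FUN = mean)
period$xstart <- ifelse(period$post == 0, 1, 21)
period$xend   <- ifelse(period$post == 0, 20, 40)

# install.packages("ggplot2")  # un-comment if the library is not installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(weekly, aes(x = week, y = weeklyvisits, colour = country)) +
    geom_line() +
    geom_vline(xintercept = 20.5, colour = "black") +   # marks the event
    geom_segment(data = period,
                 aes(x = xstart, xend = xend,
                     y = weeklyvisits, yend = weeklyvisits),
                 linetype = "dashed") +                 # period-specific means
    labs(x = "Week", y = "Average weekly visits")
  print(p)
}
```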
6. Formal diff-in-diff estimate
Lastly, we will derive a formal (i.e., model-based) diff-in-diff estimate. The advantage of this estimate is twofold: (i) it controls for website characteristics (regional, sports), and (ii) it provides a measure of statistical significance (i.e., the standard error of the estimate and the corresponding p-value).
To obtain it, we create an indicator for Spanish websites (line 63) and an interaction variable that equals one only for Spanish websites in the post period (line 64). The diff-in-diff estimate corresponds to the coefficient associated with this variable. From the regression results, note that (i) the estimate is statistically significant at the 10% level (p-value = 0.056 < 0.1), though not at the 5% level, and (ii) the estimate of -234.54 is very similar to its model-free counterpart derived above.
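A regression of this kind might be set up as sketched below. The exact specification of the original script is not shown in this document, so the formula (Spain indicator, post indicator, their interaction, and the two website characteristics), the variable names “spain”, “spainpost”, “country” and “sports”, and the simulated data are all assumptions. The simulation builds in an effect of -235 on purpose, so its output illustrates the mechanics only and does not reproduce the results reported above.

```r
set.seed(1)

# Simulated stand-in for ds: eight sites over 40 weeks (hypothetical values).
ds <- expand.grid(week = 1:40, siteid = 1:8)
ds$country  <- ifelse(ds$siteid <= 4, "France", "Spain")
ds$regional <- as.numeric(ds$siteid %% 2 == 0)
ds$sports   <- as.numeric(ds$siteid %in% c(3, 4, 7, 8))
ds$post     <- as.numeric(ds$week >= 21)

# Indicator for Spanish websites and its interaction with the post period.
ds$spain     <- as.numeric(ds$country == "Spain")
ds$spainpost <- ds$spain * ds$post

# Simulated outcome with a built-in treatment effect of -235 plus noise.
ds$weeklyvisits <- 1400 - 500 * ds$spain - 30 * ds$post - 235 * ds$spainpost +
  50 * ds$regional - 40 * ds$sports + rnorm(nrow(ds), sd = 20)

# Diff-in-diff regression with controls for website characteristics.
fit <- lm(weeklyvisits ~ spain + post + spainpost + regional + sports, data = ds)
print(summary(fit))

# The diff-in-diff estimate is the coefficient on the interaction term.
did <- unname(coef(fit)["spainpost"])
print(did)
```

In the summary output, the row for the interaction term reports the estimate together with its standard error and p-value.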