程序代写代做代考 flex —


title: “Assignment 1”
subtitle: “Econ 3210, Fall 2020”
author: “Your name here”
output: html_document

“`{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
# Set the graphical theme
ggplot2::theme_set(ggplot2::theme_light())

library(AER)
library(tidyverse)

“`

## Instructions

Use this .rmd template to answer the 3 questions below. Once you have done this, you can `knit` the document to `HTML` and upload this document to moodle/e-class. I will **only** accept HTML documents, **no exceptions**. This assignment is due October 4th.

Please read all of the assignment before attempting it.

## Introduction

This assignment is meant to be a tutorial on using `R` markdown to “knit” together text, and statistical input and output. There will be a short video that accompanies this document to help get you started.

A basic ability to work with data, perform statistical analysis and interpret statistical output are part of the learning objectives of this course. `R` will be how we perform the statistical analysis, and `Rmarkdown` will be how we present our analysis. Markdown is a markup language used to produced documents in with a simple text syntax. Just some basics:

Section headers are done with the pound sign

## Section 1
### New subsection

Lists are done enumerated

1. first,
2. second,
3. third.

Or itemized

– first,
– second,
– third

The spacing is important.

We can write in math by using the dollar sign **notation**. This is an _inline_ equation $Y_i = \beta_0 + \beta_1x_i + u_i$ and the double dollar sign notation for regular equations:

$$
Y_i = \beta_0 + \beta_1x_i + u_i
$$

That is all the basic markdown syntax required for this course. The [cheat sheet is a good reference.](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)

## Input statistical commands

There are two ways to input statistical commands in `Rmarkdown`. The main one is called a code-chunk. They are started with three ticks (beside the one key) and then {r}, and then closed with three ticks. You can put them anywhere in the document. A code chunk is set up properly when it turns shaded gray in the markdown editor. Inside a code chunk, it is no longer a markdown language, but the `R` language.

### Data management

I will always use the `tidyverse` syntax in this course. In terms of data management, there are three main commands to know. These are `filter()`, `select()`, and `mutate()`. The filter command subsets rows with a logical argument. Select subsets columns by selecting variables — there will not be much need for this. Finally, we can create or alter variables by sign `mutate()`. I will demonstrate below. For the first couple of assignments, I will provide the code, so this is just a reference so that you can learn how the code works.

`R` is a an object orientated programming language. This means that we will create objects and manipulate them. Data is an object and, as we will see, so are regression models. Here is some basic usage:

“`{r}
# the pound sign is meant to comment within the code. R ignores any line starting with a pound sign.

# Load data
data(‘CPS1985’)

# Create data frame called df from CPS1985, and do some basic data management
df <- CPS1985 %>%
filter(education > 7, wage < 30) %>%
mutate(age_squared = age * age,
BA = ifelse(education >= 16, 1, 0))
“`

In the above code, I did the following:

1. Loaded the data `CPS1985` from the package `AER`
2. I created a new data frame called `df` from `CPS1985` using the assignment operator `<-`, this is called the "create and assign" approach, 3. I subsetted or selected only observations with education greater than 7 and wage less than 30, 4. I created two new variables, age squared and a dummy variable for having a Bachelor's degree (if education >= 16).

This is nearly all of the data management commands we will require for this course. I will introduce a few more commands as we go along, but these will come up a lot.

### Basic statistical commands

Lets start with the basic question of examining wages of those with a Bachelor’s degree (`BA` = 1). We have to know some basic statistical commands and how to tell `R` what data to work with. The most basic commands are `mean()`, `var()`, `sd()`, and `sum()`. Suppose we want to know the average wage of degree holders — this is a conditional mean, the mean wage of those with a bachelor degree. We can start a code chunk and work with the data from called `df` we created above:

“`{r}
mean(df$wage[df$BA == 1], na.rm = T)
“`

Lets break down this command.

1. we called the function `mean()`. This command will always fail if there are any missing data. This can occur, for example, if for some observation they refuse to answer a survey question. Since this happens from time to time, we will use the option `na.rm = T` which stands for not applicable remove equals true. I always use it for `mean()` and the othe basic functions.

2. Inside the function `mean()` we have to tell `R` what data to use. We tell it to use the data frame called `df` and the column called `wage` by the syntax `df$wage`.

3. Since we want a conditional mean, we have to tell `R` what rows to use. We want to use all observations who have a bachelor degree. These are denoted by the variable `BA` being one. Thus, we use the syntax `[df$BA == 1]`. Put together, the line `df$wage[df$BA == 1]` says use the column `wage` in the data `df` and use rows where the variable `BA` is equal to 1 in the data frame `df`.

There are alternatives that you may use to achieve the same thing. Its not necessary to know them, but for reference.

“`{r}
with(subset(df, BA == 1), mean(wage, na.rm = T))
“`

We could also just subset the data frame to include only those with bachelor’s degrees using the “create and assign” approach above.

“`{r}
df_BA <- df %>%
filter(BA == 1)

mean(df_BA$wage)
“`

This creates a new data frame called `df_BA`.

When we just run the command `mean()` it prints the answer to the console. However, we can chose to save the answer as an object. For example,

“`{r}
my_mean <- mean(df$wage[df$BA == 1], na.rm = T) ``` This code will print nothing to the screen, but running it will create a object in the environment below. We then can simply print this to the console or use it in some other calculation. ```{r} my_mean # print to console my_mean / (sd(df$wage[df$BA==1])/sqrt(sum(df$BA==1))) # t-ratio calculation ``` ## Class size and educational outcomes I want to demo some of these concepts in the context of the textbook example using the California schools data. We'll end up using this data a lot. It comes with the package `AER`, so it also easy to load. The basic idea behind the data set is that we want to examine whether smaller class sizes improve students performance. We want to think about this in a causal way, with an econometric model. But for now, we will just focus on a descriptive question "Do we observe that test scores are higher in schools with smaller class sizes?" We will also study a binary case, where classes can either be small or not. $$ \begin{align*} D_i = \left\{ \begin{array}{ll} 1 & \mbox{if class is small (treatment)};\\ 0 & \mbox{if class is not small (control)}.\end{array} \right. \end{align*} $$ Let $score$ denote a schools average performance on reading and math standardized exams. If average outcomes depend on class size, then we can write: $$ E(score|D) = \mu_0 + \mu_1 D $$ and its the case that expected difference in outcomes is just: $$ E(Y|D=1) - E(Y|D=0) = \mu_1. $$ However, $\mu_1$ is unknown to us because its a population parameter. We can hope to estimate $\mu_1$ by just using sample means to replace the expectations in the above equations. Lets take a look at the data: ```{r} # load the textbook data data('CASchools') # data management df <- CASchools %>%
filter(complete.cases(.)) %>%
mutate(score = (math + read)/2,
str = students / teachers,
D = ifelse(str < 19, 1, 0)) ``` In the above code chunk, I have loaded the California schools data and filtered to include only rows with complete data (some schools did not provide test scores, so we remove them from the data). Then I created three variables: $score$ which is just the average of the school's reading and math test scores, $str$ or the student-teacher ratio, and then a dummy variable called $D$ that is equal to one if a school has less than 19 students per teacher (a "small class school"). We want to use this data to estimate $\hat \mu_1$. In order to do so, we need to compute two conditional sample means. ```{r} Y_1 <- mean(df$score[df$D == 1]) Y_0 <- mean(df$score[df$D == 0]) mu_1 <- Y_1 - Y_0 mu_1 ``` This calculation suggests that schools with small classes perform better; on average they score about 8.8 points higher on the standardized exams. However, we used sample means to compute this number and we know that they have a sampling distribution because they are dependent on random data. Thus, we want to ask, is 8.8 **statistically** different from zero? To answer this, we can compute the standard error of our sample means. The formula is 3.19 of the text book: $$ SE(Y_1 - Y_0) = \sqrt{\frac{\sigma^2_{Y_1}}{N_{Y_1}} + \frac{\sigma^2_{Y_0}}{N_{Y_0}}} $$ This is calculated below: ```{r} var_1 <- var(df$score[df$D == 1]) / sum(df$D == 1) var_0 <- var(df$score[df$D == 0]) / sum(df$D == 0) se_difference <- sqrt(var_1 + var_0) ``` Finally, if we want to ask if we are confident that there really is a difference between the groups, taking into account sampling variation, we can compute a $t$-statistic with the Null of $$ H_0: \mu_1 = 0, \\ H_1: \mu_1 \neq 1 $$ ```{r} t <- mu_1 / se_difference t ``` ### Question 1: Interpret the $t$-statistic above: > answer here …

### Using regression

The above calculations are a bit of a hassle to type. Fortunately, we can use a simple regression to get the answer for us. **When you regress some outcome on a dummy variable, the coefficient on a dummy variable is simply the difference-in-means comparison**:

We can do all the above steps in the following code:

“`{r}
lm(score ~ D, data = df) %>%
coeftest(., sandwich)
“`
The first line of code uses the function `lm()` to run a regression. The syntax is `y_variable ~ x_variable, data = data you want to use`. The 2nd line just outputs the results with the correct standard errors for inference. We’ll use this code a lot in the course, but we’ll leave the details for later. Examining the output, the coefficient on $D$ is 8.79 as before, and the $t$-value is 4.33. We get all that in two lines of code. Plus, as we’ll see, regression offeres some additional flexiblilty that we’ll use down the road.

### Casual inference

In the above example, we calculated the difference in means between two groups. This is simply descriptive: we look at the data and compute the means and report the results. But what if you were a decision maker, and you were in charge of a school board. You know from this data that schools that have smaller class sizes do better. Should you spend a lot of money to reduce your jurisdiction’s class sizes?

This is different than simply describing the data or making a prediction. This is a **causal** question. If you lower class sizes, will it improve performance? The answer to this depends on the **interpretation** of the comparison in means.

To answer this question, we need an econometric model. Say,
$$
score_i = \beta_0 + \beta_1 D_i + u_i
$$
where $score_i$ is the average of math and reading scores in school $i$. $u_i$ is term that captures unobserved features of schools that also influence scores (if the school is a high income area, high quality, or other things that could contribute to test scores). $\beta_1$ is the causal effect of having smaller class sizes. It is the object of interest, because it answers an ‘if-then’ question about lowering class sizes.

The comparison of means analysis we just did gives us:
$$
\begin{align}
E(score|D=1) – E(score|D=0) &= \beta_0 + \beta_1 + E(u|D=1) – \left( \beta_0 + E(u|D=0)\right) \\
&= \beta_1 + \left(E(u|D=1) – E(u|D=0)\right) \\
\end{align}
$$
This says that the difference in means is equal to the causal effect that we want $\beta_1$, plus another term in brackets. This term has to do with the average observable characteristics between schools that have large and small classes. We don’t observe the $u$ term, so we don’t know if this is zero or not.

### Question 2

State the assumption required about the relationship between $u$ and $D$ to interpret the difference in means as **casual**

> answer here.

### Question 3

In the California Schools data set, there is an other variable called `income`, which is the median family income in the school district. Using the same methods as above, test whether schools with smaller class sizes are in higher income districts. What do you think this tells you about the casual interpretation difference in means comparison in test scores between schools with smaller and larger classes above?

> answer here

“`{r}
# Put your code here

“`