
Important Information
EC420 – PS #3
Spring 2020 – Dr. Kirkpatrick 3/16/2020
I have made a small change in homework format. I will use this pdf document to describe the tasks you will need to code, and to ask the questions you will need to answer. You will write your code in the template .Rmd file and render that to pdf to turn in. Many students felt the combination of tasks and answers in Problem Set #2 was confusing. Hopefully this is simpler.
However, any code written in the pdf will likely not work if copied and pasted into a code chunk. This is because a pdf includes invisible characters that you cannot see, but will cause R to crash or other strange bugs. So, any code written in this document should be copied manually to your Rmd file. “Manually” as in “type it by hand”. For longer pieces of code that you might need for your assignment, I will include them in the template.
This assignment is due Sunday, March 29th at 11:59pm on D2L. The point total is 142 points.

Problem 1
The goal of this problem is to get used to interpreting interaction terms, which we introduced before the midterm. There is additional discussion on continuous x continuous interaction terms at the end of the Instrumental Variables slides.
Task 1.1 (6 points)
1. (1 point) Install the package AER using install.packages(…). You only need to install the package once.
2. (1 point) Use require(…) to load the wooldridge package, the AER package, and the lmtest and sandwich packages for robust standard errors.
3. (2 points) Use the command data("CASchools") to load the CASchools dataset. This creates a data.frame called CASchools in your session of R. You do not need to create the object CASchools; it happens automatically.
4. (0 points) Typing (directly into the console) ?CASchools will bring up a help window that will tell you about each variable in the data. It won’t appear in your Rmarkdown output. Read the variable descriptions.
5. (2 points) And finally, let’s get a count of observations by county to see which counties have the most observations. You’ll want to use the command table(…) on the column of CASchools that contains the county data.
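A minimal sketch of these steps might look like the following (the install.packages() call is commented out so the document knits without re-installing):

```r
# Install once (commented out so the chunk knits cleanly on later runs)
# install.packages("AER")

# Load the packages; require() returns TRUE/FALSE rather than erroring
require(wooldridge)
require(AER)
require(lmtest)
require(sandwich)

# Load the CASchools data.frame into the session
data("CASchools")

# Count observations by county
table(CASchools$county)
```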
Question 1.1 (5 points)
(a) (1 point) How many observations are in the data?
(b) (2 points) We are interested in test scores. Which variable(s) in CASchools would be our outcome of interest?
(c) (2 points) Using the output from the table(…) command, what county has the most observations in the data?

Task 1.2 (9 points)
This task will create some useful variables and run our first regression.
1. (1 point) Many education experts think that the average student:teacher ratio is important in test scores. Create a variable called studentTeacherRatio in the CASchools data.frame by dividing the number of students by the number of teachers.
2. (2 points) Let’s keep only the three counties with the most observations: Sonoma, Los Angeles, and Kern. Create a conditional called bigCounties that is TRUE if the variable county is any of these three counties. Remember that | is the “or” logical operator.
3. (1 point) Subset CASchools so that it contains only the rows (observations) from those three counties.
4. Make two scatterplots using plot(…).
(a) (1 point) Plot math scores (y-axis) against student:teacher ratios (the variable we created in the first part of this task) on the x-axis. Plot the points in green using col = "green" in your plot.
(b) (1 point) Plot reading scores against the same variable as well.
5. (2 points) Make two more plots (for a total of four) with math and reading plotted against income. Plot these points in blue.
6. (1 point) We’ll make one last plot. This one will be of income against student:teacher ratios, so that we can see if higher-income schools have lower student:teacher ratios. Let’s color-code each point by county. To do this, add col = as.factor(CASchools$county) in your call to plot(…). R will make each point a different color.
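A sketch of these steps might look like the following (it assumes the CASchools data contains columns named students and teachers, as described in the dataset's help page):

```r
# Student:teacher ratio (assumes 'students' and 'teachers' columns exist)
CASchools$studentTeacherRatio = CASchools$students / CASchools$teachers

# TRUE when the observation is in one of the three biggest counties
bigCounties = CASchools$county == "Sonoma" |
              CASchools$county == "Los Angeles" |
              CASchools$county == "Kern"

# Keep only the rows from those three counties
CASchools = CASchools[bigCounties, ]

# Scatterplots of test scores against the student:teacher ratio
plot(CASchools$studentTeacherRatio, CASchools$math, col = "green")
plot(CASchools$studentTeacherRatio, CASchools$read, col = "green")

# Test scores against income
plot(CASchools$income, CASchools$math, col = "blue")
plot(CASchools$income, CASchools$read, col = "blue")

# Income against student:teacher ratio, color-coded by county
plot(CASchools$studentTeacherRatio, CASchools$income,
     col = as.factor(CASchools$county))
```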
Question 1.2 (7 points)
(a) (2 points) Using your first two plots, does there appear to be a relationship between higher student:teacher ratios and math scores? What about reading scores?
(b) (2 points) Using your last two plots, does there appear to be a relationship between higher income and math scores? What about reading scores?
(c) (3 points) Using the final plot, does it appear that some counties have higher income or higher student:teacher ratios (or both)? Discuss.
Task 1.3 (10 points)
Let’s run some regressions. Before we do that, remember that we almost always want to use heteroskedasticity-consistent errors. We do this by running a regression like myOLS = lm(Y ~ X1 + X2, df), then using coeftest(myOLS, vcov = vcovHC(myOLS, "HC1")).
Someone has pointed out that we can combine the steps. coeftest lets us say what errors we want beforehand: coeftest(lm(Y ~ X1 + X2, df), vcov = vcovHC, "HC1")
This gives the same result as above, but in one line. The vcov and "HC1" arguments are always the same.
We think there might be a relationship between reading scores and the share of students receiving aid from calworks, the State of California’s aid program for families with children.
1. (10 points) Start by running a regression of read on calworks. Use an output that shows heteroskedasticity-robust standard errors.
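Assuming the AER, lmtest, and sandwich packages are loaded and CASchools has been subset as in Task 1.2, a sketch of this regression might be:

```r
# Bivariate regression of reading scores on calworks,
# reported with HC1 heteroskedasticity-robust standard errors
coeftest(lm(read ~ calworks, CASchools), vcov = vcovHC, "HC1")
```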
Question 1.3 (8 points)
(a) (3 points) What is the coefficient on calworks and what does it mean? Note that calworks is in percentage points (you can see the range using range(CASchools$calworks))

(b) (5 points) What potential omitted variables might bias this coefficient? That is, is there something unobserved correlated with calworks that might also be correlated with read?
Task 1.4 (5 points)
We might worry that there is something unobserved about the counties that is common in all of the districts within them. After all, Kern County is a rural, agricultural county, while Los Angeles is urban and highly populated.
(a) (5 points) Run a regression that controls for anything that is common within all observations in a county.
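One way to sketch this, assuming county fixed effects are the intended approach: adding the factor variable county to the regression gives each county its own intercept, absorbing anything common to all districts within it.

```r
# County fixed effects: county is a factor, so lm() creates a dummy
# for each county (minus the base level)
coeftest(lm(read ~ calworks + county, CASchools), vcov = vcovHC, "HC1")
```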
Question 1.4 (5 points)
(a) (2 points) What is the new coefficient on calworks? Is it larger or smaller?
(b) (1 point) What is the base county level?
(c) (2 points) What is the expected reading score for an observation in Sonoma County, at a school with a calworks value of 25%?
Task 1.5 (8 points)
We might think that, within each county, schools with higher shares of students who speak English as a second language might have different reading scores. Furthermore, we might think that the effect of calworks on read is different for districts with higher shares of English-as-a-second-language students.
1. (8 points) Run a regression with this interaction between calworks and english. Include county fixed effects as well, but do not interact the county fixed effects with anything.
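A sketch of one way to write this, assuming the same robust-error setup as before: in R's formula syntax, calworks * english expands to calworks + english + calworks:english, while county enters only as (uninteracted) fixed effects.

```r
# Interaction of calworks and english, plus county fixed effects
# (county is not interacted with anything)
coeftest(lm(read ~ calworks * english + county, CASchools),
         vcov = vcovHC, "HC1")
```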
Question 1.5 (10 points)
(a) (4 points) What is the effect of an increase in calworks by one unit for a school with 0% english learners (english=0)? Remember to consider both the main effect of english and the interaction. See lecture notes on continuous x continuous interactions.
(b) (4 points) What is the effect of an increase in calworks by one unit for a school with 40% english learners?
(c) (2 points) What is the formula you would use to determine dRead/dCalworks? Hint: it includes the variable english.
Problem 2
Task 2.0 (0 points; already in template)
For this problem, we are going to generate the data ourselves. This is a common practice in econometrics when we want to test out a new estimation method. We can set up the data to have whatever issues we want to examine, and see how a “naive” regression fares versus our preferred method.
The code to generate the data is already in your template. We’ll use Y as the outcome, D as the variable of interest (treatment), Z as the instrument, and X1,X2 as other exogenous variables. We’ll have UO as the unobserved variable causing the problem. It will be part of our data construction, but not part of our estimation. We’ll see how UO biases a naive regression, and we’ll see how well our instrument, Z, addresses the issue.

We will set the R variables that will define our data. We need to tell R how many observations we will create using the variable NN = 1000, the true β’s that will determine the outcome Y , as well as the true δ’s that will determine the endogenous treatment D.
In order for our data to have an endogeneity problem in D, it must be the case that D is determined within the system. This occurs when D is determined in part by UO. For instance, in our KIPP example, if Y is “test score” and D is “attends KIPP”, then UO might be “parental involvement in the child’s education”.
The true data generating process for D and Y will be:
Y = β0 + βD·D + βx1·X1 + βx2·X2 + βUO·UO + u    (1)
D = δ0 + δZ·Z + δUO·UO + v    (2)
We will create values for δUO and βUO (the increase in D and Y, respectively, per unit increase in UO, ceteris paribus) that reflect this endogeneity. The code in Chunk Task 2.0 has these values written in already. Our goal is to recover the true βD = 1.25, which is true by construction.
A couple of notes about the code before we move on:
1. We use rnorm and rpois to create our exogenous random variables X1,X2,Z,UO,u,v.
(a) rnorm draws n = NN = 1000 random variables from a normal distribution with the mean and std. dev specified in the template code.
(b) rpois does the same, but from a Poisson distribution, which generates count data. Try plot(Y ~ X2, P2) to see what this looks like. All the X2 values are integers.
2. u and v are exogenous and have specified σ2.
3. Once we have all of the exogenous (determined outside the system) variables, we can create the endogenous variables:
(a) D is a combination of UO (the unobserved variable), Z, and v (which is exogenous and random).
(b) Y is a combination of X1, X2, D, UO, and u. Z does not appear in Y.
Once we have constructed the data, we will pretend that we do not observe UO,u,v at all and try to recover the correct value for βD.
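The template contains the actual data-generating code; a self-contained sketch of the same structure might look like the following. Only βD = 1.25 is given in the text, so every other parameter value (and the seed) here is a hypothetical stand-in, not the template's actual numbers.

```r
set.seed(420)   # hypothetical seed, for reproducibility
NN = 1000       # number of observations

# True parameters: betaD = 1.25 is from the text; the rest are
# hypothetical stand-ins for the template's values
beta0 = 1; betaD = 1.25; betaX1 = 0.5; betaX2 = -0.5; betaUO = 2
delta0 = 0; deltaZ = 1; deltaUO = 1.5

# Exogenous variables and error terms
X1 = rnorm(NN, mean = 0, sd = 1)
X2 = rpois(NN, lambda = 2)       # count variable
Z  = rnorm(NN, mean = 0, sd = 1)
UO = rnorm(NN, mean = 0, sd = 1) # unobserved confounder
u  = rnorm(NN, mean = 0, sd = 1)
v  = rnorm(NN, mean = 0, sd = 1)

# Endogenous variables, following equations (1) and (2):
# D depends on UO, so D is endogenous in the Y equation
D = delta0 + deltaZ*Z + deltaUO*UO + v
Y = beta0 + betaD*D + betaX1*X1 + betaX2*X2 + betaUO*UO + u

P2 = data.frame(Y, D, Z, X1, X2, UO, u, v)
```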
Task 2.1 (13 points)
1. (2 points) Drop the variables UO, v, u from the data entirely. The fastest way to do this is to subset P2 with the columns you do want by using a column index c("Y", "D", "Z", "X1", "X2"). Leave the row index empty so we keep all rows.
2. (3 points) Plot the relationship between Y and D using plot(…). Set the color to as.factor(P2$X2).
3. (5 points) Run a naive regression of Y on D,X1,X2. Do not include UO.
4. (3 points) Generate an approximate 95% Confidence Interval for the coefficient on D. You can do this by taking β̂D and adding/subtracting 1.96 × se(β̂D). The se(β̂D) is in your results from Task 2.1.3.
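A self-contained sketch of the naive regression and hand-built confidence interval (the data simulation below uses hypothetical parameter values standing in for the template's, with true βD = 1.25):

```r
# Simulate endogenous data with hypothetical parameters (true betaD = 1.25)
set.seed(420)
NN = 1000
UO = rnorm(NN); Z = rnorm(NN); X1 = rnorm(NN); X2 = rpois(NN, 2)
D  = 1.5*UO + Z + rnorm(NN)                    # D depends on unobserved UO
Y  = 1.25*D + 0.5*X1 - 0.5*X2 + 2*UO + rnorm(NN)
P2 = data.frame(Y, D, Z, X1, X2)               # UO, u, v not kept

# Naive regression: UO is omitted, so the coefficient on D is biased
naiveOLS = lm(Y ~ D + X1 + X2, P2)

# Approximate 95% CI by hand: estimate +/- 1.96 * standard error
bD  = coef(naiveOLS)["D"]
seD = coef(summary(naiveOLS))["D", "Std. Error"]
c(lower = bD - 1.96*seD, upper = bD + 1.96*seD)
```

With these (hypothetical) parameters the omitted-variable bias is positive, so the interval should sit well above the true βD = 1.25.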
Question 2.1 (16 points)
(a) (5 points) What is the coefficient on D in our naive regression and what does it mean?
(b) (5 points) We can think of UO as being “in the error term”. Using what we learned about partialling out and the known values of δUO, βUO, what is the sign of the bias?
(c) (2 points) Is the true value of βD within the 95% Confidence Interval of our estimate for βD?

(d) (2 points) Are the true values of βx1, βx2 within the 95% Confidence Intervals of our naive estimates?
(e) (2 points) Why would you expect (or not expect) (d) to be true?
Task 2.2 (20 points)
1. (4 points) Run the first stage regression using lecture notes as your guide. Include X1,X2 in your first stage. Call it FirstStageOLS.
2. (4 points) Create a predicted D̂ from this first stage. You can use P2$Dhat = predict(FirstStageOLS) to create a column in P2 that contains the variable Dhat.
3. (2 points) Make a scatterplot with D on the x-axis and Dhat on the y-axis.
4. (7 points) Run the second stage regression using lecture notes as your guide. Include X1, X2 in this stage as well.
5. (3 points) Create an approximate 95% Confidence Interval for βˆIV as in Task 2.1.d.
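A self-contained sketch of the two stages (again simulating data with hypothetical parameter values; only the true βD = 1.25 is from the text):

```r
# Simulate endogenous data with hypothetical parameters (true betaD = 1.25)
set.seed(420)
NN = 1000
UO = rnorm(NN); Z = rnorm(NN); X1 = rnorm(NN); X2 = rpois(NN, 2)
D  = 1.5*UO + Z + rnorm(NN)      # Z shifts D but does not enter Y directly
Y  = 1.25*D + 0.5*X1 - 0.5*X2 + 2*UO + rnorm(NN)
P2 = data.frame(Y, D, Z, X1, X2)

# First stage: regress D on the instrument and the exogenous controls
FirstStageOLS = lm(D ~ Z + X1 + X2, P2)
P2$Dhat = predict(FirstStageOLS)

# Scatterplot of D (x-axis) against Dhat (y-axis)
plot(P2$D, P2$Dhat)

# Second stage: replace D with its exogenous prediction Dhat
SecondStage = lm(Y ~ Dhat + X1 + X2, P2)
coef(SecondStage)["Dhat"]        # should land near the true betaD = 1.25
```

Note that manually running the second stage this way gives the right point estimate but not the right standard errors; that is one reason packaged routines such as ivreg in AER exist.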
Question 2.2 (20 points)
(a) (3 points) Is Dhat exogenous or endogenous? Why?
(b) (3 points) Do you think Dhat is correlated with D given your scatterplot from Task 2.2.3?
(c) (3 points) What is the coefficient on D in the second stage and what is its interpretation?
(d) (3 points) How does it compare to the true value of βD that we created in Task 2.0? Is the true value within the 95% Confidence Interval?
(e) (3 points) Given the 2SLS (two stage least squares) method we used, why would our estimate of βD be unbiased?
(f) (2 points) Are the true values of βx1,βx2 within a 95% Confidence Interval of our estimates?
(g) (3 points) Approximately how long did you spend working on this problem set?
End matter
Please ensure that your whole assignment knits to one .html, then convert that .html file to a .pdf to turn in. We will not accept any .Rmd or .html files.