---
title: "Homework #2"
author: "Your name here"
date: 2/5/2020
output:
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(tinytex.verbose = TRUE)
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
  fig.align="center", fig.height=4,
  dpi=300,
  cache=TRUE,
  echo=TRUE)
options("getSymbols.warning4.0"=FALSE)
```
### Front Matter
This assignment is due **Sunday Feb 16th at 11:59pm on D2L**.
This Rmarkdown document is both your homework assignment and the template for you to use to complete your homework. Tasks require coding, which must be done within the code chunks as specified. Questions following each task will refer back to the results from the previous code chunk.
# Part 1
We will use a small dataset of home prices and characteristics which I host online. In Part 1, you will load the data, examine the variables, clean the data, and generate some additional variables. Next, you will test a hypothesis on the mean of the variable *homeprice*. Then, you will calculate a $\hat{\beta}$ coefficient for the regression of home price on square footage, its standard error, and the t-statistic you will use to test a hypothesis.
I will tell you which functions you will need to use, but you will need to write much of the code yourself. Remember to run the chunks in order as you go (or use the “run from the top” button in a code chunk). In the end, your whole assignment has to run from top to bottom in order to render the file you can turn in.
Please remember to save the .html as a .pdf and upload **just your .pdf** for your homework assignment. I do not need your Rmarkdown document. Remember to put your name up in the header.
## Task 1.1 (8 points)
```{r Task11, out.width='60%'}
# (a) (1 point) Load the data into an object called "df" ("df" is a common shorthand for "data.frame") and use "head(df)" to see the top of the data.frame
df = read.csv('https://people.duke.edu/~ajk41/HW2.csv', stringsAsFactors = FALSE) # this lets you load from the web directly.
head(df)
# (b) (2 points) Print a list of all of the variable names in the data.frame df using the "names(df)" function.
# <
# (c) (2 points) The column "X" in data.frame df isn't very useful. We can set df$X = NULL to drop it
# <
# (d) (1 point) Let's look at the range of the homeprice. Use the "range(...)" function on df$homeprice to see the min and max
# <
# (e) (2 points) Plot a scatterplot with the variable homeprice on the y-axis and sqft on the x-axis.
# hint: use "plot" and specify the formula in the form "Y ~ X", then the data name.
# <
```
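As a rough illustration of the functions named in Task 1.1, here is a minimal sketch on a made-up data frame. The object `toy` and its columns `X`, `price`, and `area` are hypothetical and are not the homework data. The block is written as plain `r` code (not a `{r}` chunk), so it does not run when you knit.

```r
# Toy data only: `toy`, `X`, `price`, and `area` are invented for illustration.
toy <- data.frame(X = 1:4,
                  price = c(250000, 410000, 325000, 980000),
                  area  = c(900, 1600, 1200, 3100))

toy_names <- names(toy)          # character vector of the column names
toy$X <- NULL                    # assigning NULL drops the column
price_range <- range(toy$price)  # two numbers: the min, then the max

print(toy_names)
print(names(toy))
print(price_range)
```

A scatterplot of the same toy data would follow the formula pattern from the hint: `plot(price ~ area, data = toy)`.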
## Question 1.1 (5 points)
(a) (1 point) How many observations are there in this dataset?
<
(b) (1 point) What are the column names?
(c) (1 point) What was the max in the range of *homeprice*?
(d) (2 points) Does the plot show much of a relationship between homeprice and square footage?
## Task 1.2 (7 points)
```{r Task12, out.width='90%'}
# (a) (2 points) Using a conditional statement like: which(df$Variable > 100), create an index of which rows of df have a homeprice less than $2,000,000. Call this index largeHomeprice
# <
# (d) (2 points) Now, plot the relationship between khomeprice and sqft again
# <
```
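If `which()` is new to you, the sketch below shows how it behaves on invented numbers. The names `toy`, `big`, and `toy_small` and the 1,000,000 cutoff are assumptions made for the example, not part of the assignment.

```r
# Toy data only: the names and the cutoff are invented for illustration.
toy <- data.frame(price = c(250000, 2600000, 325000, 980000),
                  area  = c(900, 5200, 1200, 3100))

# which() turns a TRUE/FALSE condition into the row numbers where it is TRUE.
big <- which(toy$price > 1000000)

# A negative index drops those rows. This assumes `big` is not empty:
# toy[-integer(0), ] would return zero rows instead of all of them.
toy_small <- toy[-big, ]

print(big)
print(toy_small)
```

A logical index such as `toy[toy$price <= 1000000, ]` avoids the empty-index pitfall, at the cost of not having the row numbers saved anywhere.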
## Question 1.2 (3 points)
(a) (1 point) What do the units in *khomeprice* refer to?
(b) (2 points) Does it look like there is a relationship between sqft (area of the home) and khomeprice? If so, what is the relationship?
## Task 1.3 (10 points) – Testing a hypothesis about *khomeprice*
```{r Task13}
# (a) (2 points) Generate the following:
# ybar as the mean of khomeprice
# N as the number of observations
# A new column in df called yMinusYbar that is equal to khomeprice - ybar
# Another new column in df that is equal to yMinusYbar^2. Call it df$yMinusYbar2
# (b) (2 points) Using df$yMinusYbar2, calculate the sum of squares and call it SSTy
# (c) (2 points) Calculate and print the sample standard deviation of khomeprice using SSTy. Call this new object sigma2y
# (d) (2 points) Using the formula for the standard error of the mean, calculate and print the std. error of the mean of khomeprice. Call it seMean
# (e) (2 points) Calculate and print a t-statistic called "tstat" testing the hypothesis "H0: the mean of khomeprice is 0". See our stats review slides if you forget how to do this.
# (f) (2 points) The command below will give the critical values for a t-distribution's rejection region with N=10 degrees of freedom. We do not have 10 degrees of freedom. Replace "10" below with the correct number of degrees of freedom.
qt(c(.025,.975), 10)
```
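The steps in Task 1.3 can be sketched on a small made-up vector. Everything named here (`y`, `n_toy`, `ybar_toy`, and so on) is hypothetical; the sketch assumes the usual one-sample t-test with the sample variance computed using n - 1 in the denominator, and it checks the hand calculation against R's `t.test`.

```r
# Toy vector only: these eight numbers are invented for illustration.
y <- c(2, 4, 4, 4, 5, 5, 7, 9)

n_toy    <- length(y)
ybar_toy <- mean(y)
sst_toy  <- sum((y - ybar_toy)^2)        # sum of squared deviations
sd_toy   <- sqrt(sst_toy / (n_toy - 1))  # sample standard deviation
se_toy   <- sd_toy / sqrt(n_toy)         # standard error of the mean
t_toy    <- (ybar_toy - 0) / se_toy      # t-statistic for H0: mean = 0

# Two-sided 5% critical values with n - 1 degrees of freedom.
crit_toy <- qt(c(.025, .975), df = n_toy - 1)

# R's built-in test of the same null hypothesis, for comparison.
builtin <- t.test(x = y, mu = 0)
print(c(by_hand = t_toy, built_in = unname(builtin$statistic)))
print(crit_toy)
```

If the hand calculation is right, the two t-statistics printed at the end agree, and the null is rejected when the statistic falls outside the two critical values.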
## Question 1.3 (5 points)
(a) (1 point) What is the mean of *khomeprice*?
(b) (1 point) What is the t-statistic for the stated null hypothesis?
(c) (3 points) Interpret the t-statistic for the stated null hypothesis using the critical values in Task 1.3(f). Do we reject the null hypothesis?
## Task 1.4 (3 points)
```{r Task14}
# (a) R has a built-in function for the t-test we just ran. It is t.test(x =
# Use t.test here to test the same null hypothesis (that mu = 0)
```
## Question 1.4 (2 points)
(a) (2 points) Are the results the same?
## Task 1.5 (6 points)
```{r Task15}
# (a) (4 points) Repeat Task 1.3, testing the mean, but do it *only* for those observations in df where the city is Long Beach. The easiest
# way of doing this is to set df = df[which(df$city=='LONG BEACH'),]
# You will need to generate a new ybar and new columns like yMinusYbar, and you'll have a different N
# Generate a t-statistic and use qt() to find the new critical values using the new degrees of freedom.
# (b) (2 points) Repeat the t.test with R's built-in function
```
## Question 1.5 (2 points)
(a) (1 point) What is the result of your test? Do we reject the null hypothesis? (Yes it’s a silly null hypothesis)
(b) (1 point) Does it match R’s t.test results?
## Task 1.6 (10 points)
Now, we calculate our coefficient of interest for the relationship between *sqft* and *khomeprice*.
```{r Task16}
# (a) (1 point) Reload the data using the read.csv('https://….') command from Task 1.1 *AND* drop the observations in df where homeprice > 2000000 as in Task 1.2. Also, re-create df$khomeprice again as in Task 1.2.
# (b) (2 points)
# Now we want to calculate the coefficient in a regression of khomeprice = beta_0 + beta_1 sqft + u. Create:
# ybar as the mean of df$khomeprice
# xbar as the mean of df$sqft
# N as the number of rows (hint: use NROW(df))
# (c) (5 points) create:
# A new column called yMinusYbar that is khomeprice - ybar
# A new column called xMinusXbar that is sqft - xbar
# A new column called yMinusYbar2 that is yMinusYbar^2 (we will need this later)
# A new column called xMinusXbar2
# A new column called xy
# (d) (5 points) Now, create:
# A new object called beta1 that uses df$xy and df$xMinusXbar2
# A new object called beta0 that uses ybar, xbar, and beta1 to get beta0
beta1=0 # placeholder: delete this line once you create your own beta1.
# (e) (1 point) Print both beta1 and beta0
```
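For orientation, here is a minimal sketch of the textbook single-regression formulas on made-up numbers: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept is recovered from the two means. The names `toy`, `xd`, `yd`, `b1_toy`, and `b0_toy` are hypothetical, and the slides may use different notation.

```r
# Toy data only: five invented (x, y) pairs.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

xbar_toy <- mean(toy$x)
ybar_toy <- mean(toy$y)
toy$xd <- toy$x - xbar_toy   # x deviations from the mean
toy$yd <- toy$y - ybar_toy   # y deviations from the mean

# Slope: sum of cross-deviations over sum of squared x deviations.
b1_toy <- sum(toy$xd * toy$yd) / sum(toy$xd^2)
# Intercept: the fitted line passes through (xbar, ybar).
b0_toy <- ybar_toy - b1_toy * xbar_toy

# lm() should reproduce the same two numbers.
fit_toy <- lm(y ~ x, data = toy)
print(c(b0 = b0_toy, b1 = b1_toy))
print(coef(fit_toy))
```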
## Question 1.6 (13 points)
(a) (2 points) What is the intercept (beta0) and how would you interpret it in this regression of home price on square footage?
(b) (4 points) What is the beta1, and how would you interpret it?
(c) (2 points) If square footage were to increase by 300 square feet, what would be the change in expected home value?
(d) (2 points) What have we assumed about u to say that beta1 is unbiased?
(e) (3 points) What might violate this assumption? That is, think of something that might be left out of the regression that might bias our estimate?
## Task 1.7 (10 points)
Now, we will calculate the standard error for beta1.
```{r Task17}
# We know beta0 and beta1, so we can calculate y-hat
# (a) (2 points) Create a new column called yhat in df that is equal to y-hat
# (b) (2 points) Create a new column called uhat in df that is equal to the residual
# (c) (2 points) In an object called SSR, square this column and sum it up.
# (d) (2 points) In an object called sigma2, calculate the sigma^2 of u-hat. Don't forget we lose TWO degrees of freedom
# See the slides from Single Regression Inference if you aren't sure how to calculate this.
# (e) (2 points) In an object called SEbeta1, calculate the standard error of beta1. Don't forget, you'll need to take the square root at some point.
# (f) (2 points) In an object called tBeta1, calculate the t-statistic testing the null hypothesis that beta1 = 0.
# (g) (2 points) Print beta1, SEbeta1, and the t-statistic
# (h) (1 point) Get the critical values of the t-distribution using qt(c(.025, .975),
```
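The same kind of check works for the standard error. The sketch below uses invented data and hypothetical names (`toy`, `uhat_toy`, `se_b1_toy`, and so on) and assumes the standard homoskedastic formula, where the residual variance is SSR / (n - 2) and the variance of the slope is that quantity divided by the sum of squared x-deviations.

```r
# Toy data only: five invented (x, y) pairs.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
n_toy <- nrow(toy)

sxx_toy <- sum((toy$x - mean(toy$x))^2)
b1_toy  <- sum((toy$x - mean(toy$x)) * (toy$y - mean(toy$y))) / sxx_toy
b0_toy  <- mean(toy$y) - b1_toy * mean(toy$x)

yhat_toy   <- b0_toy + b1_toy * toy$x      # fitted values
uhat_toy   <- toy$y - yhat_toy             # residuals
ssr_toy    <- sum(uhat_toy^2)              # sum of squared residuals
sigma2_toy <- ssr_toy / (n_toy - 2)        # two parameters were estimated
se_b1_toy  <- sqrt(sigma2_toy / sxx_toy)   # standard error of the slope
t_b1_toy   <- b1_toy / se_b1_toy           # t-statistic for H0: slope = 0

# Compare with the row for x in lm()'s coefficient table.
fit_toy <- lm(y ~ x, data = toy)
print(c(se = se_b1_toy, t = t_b1_toy))
print(summary(fit_toy)$coefficients["x", ])
```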
## Question 1.7
(a) (4 points) What is the coefficient estimate for beta1? What is the interpretation of this result?
(b) (2 points) What is the t-statistic and what does it tell us about our null hypothesis (using the critical values)?
## Task 1.8 (5 points)
```{r Task18}
# (a) (5 points) using lm(...), run a regression of khomeprice ~ sqft using the df data. Wrap the lm(...) call in summary() to see a complete output.
```
## Question 1.8 (3 points)
(a) (3 points) How do your estimates for beta1 (sqft) compare to the coefficients in R’s regression?
## Task 1.9 (5 points)
We are going to calculate the $R^2$ of our regression.
```{r Task19}
# (a) (5 points)
# You already have SSR from Task 1.7
# And you can get SST (call it SSTy, as in Task 1.3) by summing df$yMinusYbar2 from Task 1.6
# Create an object called R2 that uses these to calculate the R2 of the regression
```
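As a sanity check on the idea, here is a short sketch on invented data. It assumes the usual definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares; the names `toy`, `ssr_toy`, `sst_toy`, and `r2_toy` are hypothetical.

```r
# Toy data only: five invented (x, y) pairs.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit_toy <- lm(y ~ x, data = toy)

ssr_toy <- sum(resid(fit_toy)^2)           # unexplained variation
sst_toy <- sum((toy$y - mean(toy$y))^2)    # total variation in y
r2_toy  <- 1 - ssr_toy / sst_toy           # share of variation explained

print(c(by_hand = r2_toy, from_lm = summary(fit_toy)$r.squared))
```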
## Question 1.9 (3 points)
(a) What is the R^2 and what is the interpretation of it?
# Part 2
It’s short. Don’t worry.
## Task 2.1 (10 points)
We want to also include the effect of number of bedrooms in our regression since we might worry that leaving it out is creating bias. First, let’s think about how we might “sign the bias”, then let’s see if we were right. Before you get to coding, answer the following:
## Question 2.1′ (7 points)
(a) (2 points) Can we sign the bias from leaving bedrooms out of our estimates in Task 1.7? What will the sign be on the relationship between bedrooms and square footage? Positive or negative? Why?
(b) (2 points) And what would the sign be on the relationship between *bedrooms* and home price? Why?
(c) (3 points) Our slides on Multiple Regression Inference have the formula for bias (the one with the tildes). Using your answers to 2.1′(a) and (b), what will happen to beta1 when we add in *bedrooms* as a covariate?
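
For reference, a common textbook form of the omitted-variable relationship is written below. The notation is an assumption and may not match the slides exactly: $\tilde{\beta}_1$ is the slope from the short regression of *khomeprice* on *sqft* alone, $\hat{\beta}_1$ and $\hat{\beta}_2$ are the coefficients on *sqft* and *bedrooms* in the long regression, and $\tilde{\delta}_1$ is the slope from regressing *bedrooms* on *sqft*.

$$
\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \, \tilde{\delta}_1
$$

The sign of the gap between the short and long estimates is therefore the sign of the product $\hat{\beta}_2 \, \tilde{\delta}_1$.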
```{r Task21}
# (a) (3 points) Since we want to see how beta1 might change when we include a coefficient, beta2, on bedrooms, we should first check the relationship between sqft and bedrooms. Run a regression of sqft on bedrooms using lm(...) to see. Save the regression (the lm(...) object) in an object called lmFirstStage, then print summary(lmFirstStage) to see the full results.
# (b) (2 points) We can pull the coefficients out of the lmFirstStage object. We do this using lmFirstStage$coefficients, much like we'd use the $ to see a column in a data.frame. Create an object called delta that is equal to lmFirstStage$coefficients and print it. Then, try typing delta['bedrooms'].
# Note: delta, it turns out, can be indexed with names, but it is not a data.frame, it is just a named vector, so we don't use a comma when indexing. Just delta['bedrooms'] will do.
# (c) (3 points) Generate a new column in df called vhat that is the residual of sqft once we remove the part of sqft explained by bedrooms.
# I'll write it for you, but make sure you understand what's going on here:
# df$vhat = df$sqft - (delta['(Intercept)'] + delta['bedrooms']*df$bedrooms)
# (d) (2 points) Now, run the regression of khomeprice ~ vhat. This is what we did in class on Wednesday the 5th. Print the results.
# (e) (3 points) Run the full (unbiased) regression of khomeprice on bedrooms and sqft. Print the results.
```
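The partialling-out logic in this chunk can be checked on a small invented dataset. In the sketch below, `toy`, `x1`, `x2`, `v`, and the model objects are all hypothetical names: `x1` plays the role of sqft and `x2` the role of bedrooms. The slope on the residualized regressor should match the coefficient on `x1` in the full regression.

```r
# Toy data only: eight invented observations.
toy <- data.frame(y  = c(10, 14, 13, 20, 24, 23, 30, 31),
                  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
                  x2 = c(1, 1, 2, 2, 3, 3, 4, 5))

# First stage: how much of x1 is explained by x2?
first_toy <- lm(x1 ~ x2, data = toy)
d_toy <- first_toy$coefficients   # a named vector, indexed without a comma

# Residual of x1: what is left after removing the fitted part.
toy$v <- toy$x1 - (d_toy["(Intercept)"] + d_toy["x2"] * toy$x2)

partial_toy <- lm(y ~ v, data = toy)        # y on the leftover part of x1
full_toy    <- lm(y ~ x1 + x2, data = toy)  # full two-variable regression

print(coef(partial_toy)["v"])
print(coef(full_toy)["x1"])
```

Note the parentheses around the fitted value: the intercept and the slope term are both subtracted from `x1`.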
## Question 2.1 (10 points)
(a) (2 points) From Task 1.7, what was beta1? Note: since you already have an R object called “beta1”, you can use an “inline code chunk” like this to print the value of beta1: `r beta1`. Take a look at it when you render – you’ll see the value of beta1 in there!
(b) (2 points) In Question 2.1′ above, we asked what we thought the bias would be on the “naive” single-variable estimate of beta1. How does our naive, single-variable (biased) estimate of beta1 compare to the unbiased estimate in Task 2.1 above?
(c) (3 points) What is the unbiased estimate of beta1, the coefficient on sqft, and how would you interpret it?
(d) (3 points) What is the unbiased estimate of beta2, the coefficient on bedrooms, and how would you interpret it?
## Task 2.2 (5 points)
This one is easy: plot *something* from the dataset `df` and explain what the plot tells us. This may involve a little research to see what sort of plots you can make. It’s your call, just plot something besides the scatterplot we made earlier!
```{r Task22, out.width='90%'}
# (5 points) plot something!
```
## Postword
Don’t forget to render this to .html, then open it with your browser and print it to .pdf. Do **not** turn in your Rmarkdown file. I’ll be able to see your code in the chunk outputs.