Front Matter
Problem Set 2 – EC420
Due by Jul 19th 11:59pm
This assignment is due Monday Jul 19th at 11:59pm on D2L.
This assignment covers loading and exploring the data and conducting a hypothesis test. To start:
• Download a fresh copy of the EC420 Assignment Template from D2L. Save it in a folder you created just for EC420 Problem Set 2 and give it a real name.
• For each of the numbered main headers (Part 1) in this document, create a main header in your markdown document using a double ##, as shown in the template.
• When there are tasks (which require only coding) and questions (which are to be answered) (e.g. “Task 1.A” and “Question 1.B”), then use triple ###’s, as shown in the template.
• Tasks only require coding. Make sure your code is “echo”ed into your .pdf. Each Task can be done in one code chunk. Questions should not require any additional coding (though you can add to the corresponding code chunk if you want).
Part 1
We will use a small dataset of home prices and characteristics which I host online. In Part 1, you will load the data, examine the variables, clean the data, and generate some additional variables. Next, you will test a hypothesis on the mean of the variable homeprice. Then, you will calculate a β̂ coefficient for the regression of home price on number of bedrooms, its standard error, and the t-statistic you will use to test a hypothesis.
I will tell you which functions you will need to use, but you will need to do much of the writing of the code. Remember to run each chunk sequentially (or use the “run from the top” button in a code chunk) to run as you go. In the end, your whole assignment has to run from top to bottom in order to render the file you can turn in.
Task 1.A: Load and explore data (8 points)
Read in the data located at https://people.duke.edu/~ajk41/HW2.csv and name the data.frame df. This is a dataset of house prices and characteristics (square feet, bedrooms, bathrooms) for a sample of homes in California. In one code chunk, please do the following:
• (1 point) Read the data using the following syntax: df = read.csv("https://people.duke.edu/~ajk41/HW2.csv", stringsAsFactors = F)
• (1 point) Use print to output the column names of df
• (2 points) Use range to output the range of df$homeprice. You may have to use the argument na.rm=T to eliminate NA’s.
• (2 points) Again, use range to output the range of df$sqft (eliminate NA’s).
• (2 points) Use plot to output a scatterplot of homeprice on sqft.
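To see how these functions behave before you point them at df, here is a minimal sketch on a made-up data frame. The object toy, its columns price and size, and all of the numbers are hypothetical and only for illustration; they are not the assignment data.

```r
# Hypothetical toy data (NOT the assignment data), with a few NA's on purpose
toy <- data.frame(
  price = c(250000, 410000, NA, 330000),
  size  = c(1200, 2100, 1600, NA)
)

# Column names of the data frame
print(names(toy))

# range() returns c(min, max); na.rm = TRUE drops NA's before computing
price_range <- range(toy$price, na.rm = TRUE)
print(price_range)

# Scatterplot with size on the x-axis and price on the y-axis;
# rows with an NA in either variable are simply not drawn
plot(toy$size, toy$price, xlab = "size", ylab = "price")
```

Without na.rm = TRUE, range returns NA for both endpoints whenever the variable contains an NA, which is why the argument matters here.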
Note: if you’re in an area with limited internet access, then load your data once and save it as a .csv locally:
df = read.csv("https://people.duke.edu/~ajk41/HW2.csv", stringsAsFactors = F)
write.csv(df, "mylocaldata.csv", row.names = F)
df = read.csv("mylocaldata.csv")
This will work if you have properly created a folder on your drive for your .Rmd and are working from that folder. Once you have the local copy, you can use # to comment out the first two lines. That way, if your internet is down, it won’t matter: R will look for ‘mylocaldata.csv’ rather than going online to find it.
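If you would rather not comment lines in and out by hand, one possible pattern is to write the local copy only when it does not already exist. The sketch below uses a made-up data frame (toy) and a temporary folder so that it runs without internet; the file name toy_local.csv is hypothetical. In your own work you would swap in the read.csv call on the course URL and your own file name.

```r
# Hypothetical stand-in for data that was downloaded once
toy <- data.frame(price = c(250000, 410000), size = c(1200, 2100))

# Assumed location for the local copy (a temporary folder, for illustration)
local_file <- file.path(tempdir(), "toy_local.csv")

# Only write the file if there is no local copy yet
if (!file.exists(local_file)) {
  # row.names = FALSE stops write.csv from saving the row numbers,
  # which would otherwise come back as an extra column named X
  write.csv(toy, local_file, row.names = FALSE)
}

# From here on, everything reads the local copy
toy_again <- read.csv(local_file, stringsAsFactors = FALSE)
print(names(toy_again))
```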
Question 1.A: Load and explore data (5 points)
Please answer the following questions. Create a header in your template using ## Question 1.A, and label each of your answers with the corresponding letter (a)-(d).
• (a) (1 point) How many observations are there in this dataset?
• (b) (1 point) What are the min and max in the range of homeprice?
• (c) (1 point) What are the min and max in the range of sqft?
• (d) (2 points) Does the plot show much of a relationship between homeprice and square feet?
Task 1.B: Data Cleaning (7 points)
In the next code chunk, do the following four tasks:
• (2 points) Using a conditional statement like: df$Variable < 100, create an index of which rows of df have a homeprice less than $2,000,000. Call this index smallHomeprice
• (2 points) Using that index of rows where df$homeprice < 2000000, make it so that df does not contain any observations greater than $2,000,000
• (1 point) Create a new variable called khomeprice that equals homeprice divided by 1000
• (2 points) Now, plot the relationship between khomeprice and sqft again
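The general pattern behind these steps (build a TRUE/FALSE index, keep only the rows where it is TRUE, then add a rescaled column) can be sketched on made-up data as follows. The names toy, keepRow, price, size, and kprice are hypothetical, and the cutoff is chosen just for the example.

```r
# Hypothetical toy data (NOT the assignment data) with one very large value
toy <- data.frame(
  price = c(250000, 3100000, 410000, 330000),
  size  = c(1200, 5400, 2100, 1600)
)

# A conditional statement returns one TRUE/FALSE per row
keepRow <- toy$price < 2000000
print(keepRow)

# Rows go before the comma, columns after; leaving the column slot
# empty keeps every column
toy <- toy[keepRow, ]

# New variable: price measured in thousands
toy$kprice <- toy$price / 1000

# Re-plot after dropping the extreme row
plot(toy$size, toy$kprice, xlab = "size", ylab = "kprice")
```

Note that the index is built before the rows are dropped, so it has one entry per row of the original data frame.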
Question 1.B: Data Cleaning (3 points)
• (a) (1 point) What do the units in khomeprice refer to?
• (b) (2 points) Does it look like there is a relationship between sqft (area of the home) and khomeprice? If so, what is the relationship?
Task 1.C: Testing hypothesis about khomeprice (10 points)
Now we will calculate a test statistic that will let us test a hypothesis about khomeprice. In a new code chunk, do the following:
• (4 points) Generate the following:
– ybar as the mean of khomeprice
– N as the number of observations
– A new column in df called yMinusYbar that is equal to khomeprice - ybar
– Another new column in df that is equal to yMinusYbar^2. Call it df$yMinusYbar2
• (2 points) Using df$yMinusYbar2, calculate the sum of squares and call it SSTy
• (2 points) Calculate and print the sample variance of khomeprice using SSTy and N. Call this new object sigma2y. Make sure you use the correct degrees of freedom.
• (2 points) Using the formula for the standard error of the mean, calculate and print the std. error of the mean of khomeprice. Call it seMean
• (2 points) Calculate and print a t-statistic called tstat testing the hypothesis “H0: the mean of khomeprice is 0”. See our stats review slides if you forget how to do this. And yes, this is a silly hypothesis test.
• (2 points) The command below will give the critical values for a t-distribution’s rejection region with N=10 degrees of freedom. We do not have 10 degrees of freedom. Replace “10” below with the correct number of degrees of freedom.
– qt(c(.025,.975), 10)
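The chain of calculations above can be sketched on a small made-up vector. This assumes the usual textbook one-sample formulas (sample variance equal to the sum of squared deviations divided by n - 1, and standard error equal to the square root of variance over n); check them against our stats review slides. Every name and number below is hypothetical and is not the assignment data.

```r
# Made-up sample of five values (NOT the assignment data)
y <- c(4.1, 5.3, 3.8, 6.0, 4.8)

n      <- length(y)            # number of observations
y_bar  <- mean(y)              # sample mean
dev_sq <- (y - y_bar)^2        # squared deviations from the mean
sst    <- sum(dev_sq)          # sum of squares
s2     <- sst / (n - 1)        # sample variance (assumed n - 1 divisor)
se     <- sqrt(s2 / n)         # standard error of the mean

mu0   <- 0                     # value of the mean under the null
t_toy <- (y_bar - mu0) / se    # t-statistic

# Two-sided 5% critical values for this toy sample
crit <- qt(c(.025, .975), df = n - 1)

# Reject the null when the t-statistic falls outside the critical values
reject <- t_toy < crit[1] | t_toy > crit[2]

print(c(t = t_toy, lower = crit[1], upper = crit[2]))
print(reject)

# Sanity check: the hand-computed variance should match R's var()
print(all.equal(s2, var(y)))
```

Comparing the hand-computed variance with var() is a quick way to confirm the divisor is right before moving on to the standard error.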
Question 1.C: Testing hypothesis about khomeprice (5 points)
• (a) (1 point) What is the mean of khomeprice?
• (b) (1 point) What is the t-statistic for the stated null hypothesis?
• (c) (3 points) Interpret the t-statistic for the stated null hypothesis using the critical values from the qt function. Do we reject the null hypothesis?
Task 1.D: T-tests and Calculate P-value in R (5 points)
• (a) (2 points) R has a built in function for the t-test we just ran. It is t.test(x =
In a new code chunk, use t.test to test the same null hypothesis (that mu = 0)
• (b) (3 points) R has a built in function for the p-value: pt(q=