STA231 – Assignment 2
This assignment continues the analysis from Assignment 1. You will come up with some estimators, and use them and the gas data to make inferences about the fuel consumption of Professor Stringer’s car(s).
This assignment is out of 117 total points.
You must have a working installation of the tidyverse package to complete this assignment. Type
install.packages('tidyverse') to install this package.
You will submit two types of documents:
a. A total of three .pdf files containing your answers to questions 1 – 3 below. Upload each question to the corresponding question in Crowdmark. Within each document, please clearly mark the part
(a, b, c…) to which each answer corresponds.
b. A .R file containing R code that reproduces all your answers, including reading in the data. (You are given appropriate code on page 2 below)
You must include all answers to the questions in the .pdfs you submit to Crowdmark. Include all written, mathematical, numeric, and graphical output requested in the question. Do NOT refer to your R code file anywhere in your Crowdmark submission, and do NOT include any R code in your Crowdmark submission. The Crowdmark submission IS the assignment. The R code file must reproduce all the output in your Crowdmark submission. Answers not included in the Crowdmark submission will not receive marks, regardless of what is or isn’t in the R code file.
You must submit your .R file to LEARN to receive any marks for the Crowdmark submission. Your .R file must reproduce all numerical and graphical output contained in your Crowdmark submission. Failure to submit a .R file that exactly reproduces your answers will result in a grade of 0 for this assignment.
You may create your .pdf document in Word, Google Docs, LaTeX or any other word processor. If you wish to use LaTeX then you may find Overleaf particularly useful for this. See https://www.overleaf.com/edu/uwaterloo. You can also use RMarkdown, where you type your code and text in one file and it runs all your code and creates the document with all the output in it for you. If you use RMarkdown, please set knitr::opts_chunk$set(echo = FALSE) in the setup chunk to disable printing of code. Regardless of how you do the assignment, you must follow the submission instructions: one .pdf per question, and one .R file.
You will upload your .pdf file to Crowdmark for marking. You can upload your assignment as one document or individually for each problem. If you upload one document then you must drag and drop the pages for each problem to the appropriate question as indicated in Crowdmark. You can resubmit your assignment any number of times before the due time. Therefore, to ensure that there are no issues with uploading we advise you to upload your assignment well in advance of the due time. Assignments which are left as a single document and not uploaded to the appropriate places in Crowdmark will be assigned a 10% penalty.
You will submit your .R file to the appropriate dropbox on LEARN.
The instructions on this page are to get you started. Do not submit any answers for this page. Nothing on this page is directly for marks. You need to do these things in order to answer questions 1 – 3 below.
First read in the gas data from the previous assignment. You only need the one for car S.
You should type dplyr::glimpse(gasS) and get the following output. Make sure the number of rows and
columns, the column names, and the column datatypes all match mine:
dplyr::glimpse(gasS)
## Rows: 20
## Columns: 4
## $ car
## $ km
## $ gas
## $ date
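If you prefer to check the shape of your data programmatically rather than by eye, the following is a minimal sketch. The data frame gasS_demo and the helper check_shape are hypothetical names introduced here, and the values and column types in gasS_demo are invented placeholders, not the real gas data. Only the row count, column count and column names come from the output above.

```r
# Stand-in for the real data. Values AND column types are invented placeholders;
# replace gasS_demo with your own gasS object.
gasS_demo <- data.frame(
  car  = rep("S", 20),
  km   = seq(400, 590, by = 10),
  gas  = seq(40, 59, by = 1),
  date = as.Date("2020-01-01") + 7 * (0:19)
)

# Hypothetical helper: stops with an error if the shape or names do not match.
check_shape <- function(df) {
  stopifnot(
    nrow(df) == 20,
    ncol(df) == 4,
    identical(names(df), c("car", "km", "gas", "date"))
  )
  invisible(TRUE)
}

check_shape(gasS_demo)
str(gasS_demo)  # base-R alternative to dplyr::glimpse
```

This sketch does not check column datatypes, so you still need to compare those against the glimpse output yourself.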
Recall the model from Question 4 on Assignment 1:
$$Y_i \sim N(\mu, \sigma^2),$$
independently, for 𝑖 = 1,…,𝑛, where 𝜇 ∈ R and 𝜎 > 0. Review that question on that assignment for all definitions, etc.
Starting on the next page, everything is for marks. Refer again to the instructions at the beginning of the assignment for what to submit.
1. (12 points) Study Design. Professor Stringer “designed” this “study” by him and his wife just driving around their normal amount, and recording how much gas was used. Could you do better, given what you learned in Chapter 3 of the course notes?
a) (8 points) Design an empirical study of the fuel consumption of Professor Stringer’s cars, using the steps of PPDAC.
b) (4 points) Clearly identify which steps of PPDAC Professor Stringer did not do, or did poorly, when he “designed” his “study” (he will not be offended). The fact that you are given limited information about how these data were collected is part of the question.
2. (40 points) Estimation. This question expands on Question 4 from Assignment 1. We wish to estimate 𝜇 and 𝜎² in the model stated in that question and at the top of this assignment. We propose the following estimators:
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$
$$S = \sqrt{S^2}.$$
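As a sanity check on the three formulas above, here is a minimal sketch that evaluates them by hand on a small made-up vector and confirms they agree with the base-R functions mean, var and sd. The vector y_demo and all names ending in _demo are hypothetical. The numbers are invented and are not the gas data.

```r
# Made-up observations, NOT the gas data
y_demo <- c(41.2, 52.7, 38.9, 47.5, 60.1, 44.3)
n_demo <- length(y_demo)

# Hand-coded versions of the three formulas
ybar_demo <- sum(y_demo) / n_demo
s2_demo   <- sum((y_demo - ybar_demo)^2) / (n_demo - 1)  # note the n - 1 divisor
s_demo    <- sqrt(s2_demo)

# Base-R equivalents should agree up to floating-point error
stopifnot(
  isTRUE(all.equal(ybar_demo, mean(y_demo))),
  isTRUE(all.equal(s2_demo, var(y_demo))),
  isTRUE(all.equal(s_demo, sd(y_demo)))
)

c(ybar = ybar_demo, s2 = s2_demo, s = s_demo)
```

The point of the comparison is that var and sd in R use the n − 1 divisor, matching the estimator written above.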
a) (1 point) State formulas for the corresponding estimates of 𝜇, 𝜎², and 𝜎. Your answer should include math.
b) (3 points) Compute the estimates of 𝜇, 𝜎², and 𝜎 using the data for car S. Your answer should include numerical output.
c) (2 points) What is the sampling distribution of 𝑌̄? Specify any parameters explicitly. State, without proof, a formula for a pivotal quantity for 𝜇 based off of 𝑌̄. Your answer should include math.
d) (2 points) What is the sampling distribution of 𝑆²? Specify any parameters explicitly. State, without proof, a formula for a pivotal quantity for 𝜎² based off of 𝑆². Your answer should include math.
e) (2 points) What is the sampling distribution of 𝑇 = (𝑌̄ − 𝜇)/(𝑆/√𝑛)? Specify any parameters explicitly. Is 𝑇 a pivotal quantity? Why or why not? Your answer should include math and a written explanation.
f) (30 points) We will perform simulations to help explain the meaning of these quantities. A sampling distribution is a probability distribution of a statistic. To simulate one draw from a sampling distribution of a statistic, we draw an entire sample, and compute that statistic. To simulate many draws from a sampling distribution of a statistic, we repeat this entire process many times. To do this for the sampling distribution of the sample mean 𝑌̄, when 𝜎 = 11 is known, we repeat the following procedure 𝐵 = 10000 (arbitrary) times: a) fix a value 𝜇 = 47 (arbitrary), b) draw a sample from a Normal distribution with this 𝜇 and 𝜎, and c) compute and store the mean of this sample. Here is the code:
set.seed(54534)
n <- nrow(gasS)
B <- 1e04 # Number of simulations to do, an arbitrary "large" number
mn <- 47 # Known
ss <- 11 # Known
# Generate a sample of sample means
samps <- numeric(B)
for (b in 1:B) {
  thesample <- rnorm(n,mn,ss)
  samps[b] <- mean(thesample)
}
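The parts that follow ask for a histogram with a theoretical density drawn over it. As a minimal sketch of that plotting technique only, the example below uses an unrelated distribution (Exponential with rate 2, chosen arbitrarily here) so that it does not answer any part of the question. The names draws_demo and h_demo are hypothetical.

```r
set.seed(1)  # arbitrary seed for this illustration only

# Draws from an Exponential(rate = 2) distribution, unrelated to the assignment
draws_demo <- rexp(5000, rate = 2)

# freq = FALSE puts the histogram on the density scale (bar areas sum to 1),
# which is what makes it comparable to a density curve
h_demo <- hist(draws_demo, freq = FALSE, breaks = 50,
               main = "Histogram with theoretical density", xlab = "x")

# Overlay the matching theoretical density on the existing plot
curve(dexp(x, rate = 2), from = 0, to = max(draws_demo), add = TRUE, lwd = 2)
```

The same two steps (hist with freq = FALSE, then curve with add = TRUE) should carry over to any statistic once you substitute the appropriate density function for dexp.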
i) (5 points) First, plot a histogram of these samples, with the theoretical density of the sampling distribution of the sample mean drawn as a line on the plot.
Reminder: you must include your plot in the output you submit to Crowdmark. The answer to this question is a plot. You have to print that plot in your Crowdmark submission to receive marks.
ii) (20 points) Please repeat this simulation for the sampling distribution of the sample variance, 𝑆² (use the var function), fixing the value 𝜎 = 11 for purposes of simulation. Please plot a histogram of samples from the sampling distribution of the sample variance, with the theoretical density drawn as a line on the plot. Deriving the theoretical density of 𝑆² is a part of this question. Your answer to this question is a mathematical derivation, and plot. Make sure to set the seed as follows:
set.seed(53498)
Reminder: you must include your plot in the output you submit to Crowdmark. The answer to this question is a plot. You have to print that plot in your Crowdmark submission to receive marks.
iii) (5 points) Finally, for the case that 𝜎 is unknown and estimated by 𝑆 (use the sd function), but 𝜇 = 47 is known, please repeat the simulation for the pivotal quantity 𝑇 :
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}}.$$
Please plot a histogram of samples from the probability distribution of 𝑇, with the theoretical density drawn as a line on the plot. Your answer to this question is a plot. Make sure to set the seed as follows:
set.seed(75294)
Reminder: you must include your plot in the output you submit to Crowdmark. The answer to this question is a plot. You have to print that plot in your Crowdmark submission to receive marks.
3. (65 points) Inference. We will now use what we know about the sampling distributions of our various statistics and the probability distributions of our pivotal quantities, to construct confidence intervals for 𝜇 and 𝜎, and perform simulations to enhance our understanding of the meaning of these intervals.
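The intervals in this question rely on quantiles of standard distributions. As a minimal sketch, base R provides qnorm, qt and qchisq for the standard Normal, Student t and chi-squared quantiles. The names below ending in _demo or _crit are hypothetical, and n_demo = 20 is taken from the row count of gasS shown earlier. Which quantile belongs to which interval is left for you to justify from the course notes.

```r
n_demo <- 20    # number of rows in gasS, per the glimpse output
alpha  <- 0.05  # for a 95% interval

# Upper alpha/2 quantile of the standard Normal distribution
z_crit <- qnorm(1 - alpha / 2)

# Upper alpha/2 quantile of the Student t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n_demo - 1)

# Lower and upper alpha/2 quantiles of the chi-squared distribution with n - 1 df
chisq_crit <- qchisq(c(alpha / 2, 1 - alpha / 2), df = n_demo - 1)

list(z = z_crit, t = t_crit, chisq = chisq_crit)
```

Note that the chi-squared distribution is not symmetric, so both of its quantiles have to be computed separately.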
a) (3 points) State the formula for a 95% confidence interval for 𝜇 when 𝜎 = 11 is known, and evaluate your interval for the gasS data. Your answer is a mathematical formula AND a pair of numbers.
b) (3 points) State the formula for a 95% confidence interval for 𝜇 when 𝜎 is unknown and estimated by 𝑆, and evaluate your interval for the gasS data. Your answer is a mathematical formula AND a pair of numbers.
c) (3 points) State the formula for a 95% confidence interval for 𝜎² when 𝜇 is unknown and estimated by 𝑌̄, and evaluate your interval for the gasS data. Your answer is a mathematical formula AND a pair of numbers.
d) (3 points) State the formula for a 95% confidence interval for 𝜎 when 𝜇 is unknown and estimated by 𝑌̄, and evaluate your interval for the gasS data. (Hint: use your answer for part c). Your answer is a mathematical formula AND a pair of numbers.
e) (6 points) Recall the relative likelihood 𝑅(𝜇) from Assignment 1. Plot the quantity −2 log 𝑅(𝜇), with a horizontal line at the value 𝑦 = Φ⁻¹((𝑞+1)/2), where Φ is the CDF of a 𝑁(0,1) random variable and 𝑞 = 0.95. Your answer to this question is a plot. Reminder: you must include your plot in the output you submit to Crowdmark. The answer to this question is a plot. You have to print that plot in your Crowdmark submission to receive marks.
f) (4 points) Use the relative likelihood from Assignment 1 to construct a 95% confidence interval for 𝜇 when 𝜎 = 11 is known, and compare this interval to the one constructed in part a. Your answer is a mathematical formula AND a pair of numbers.
g) (30 points) We will perform simulations to enhance your understanding of what a confidence interval is and how to interpret it. The endpoints of a confidence interval (𝐿(𝑌), 𝑈(𝑌)) are statistics, random variables which are functions of the data. They are constructed such that there is a (for example) 95% probability that 𝐿(𝑌) ≤ 𝜇 ≤ 𝑈(𝑌). The interval is random, and 𝜇 (or any parameter) is fixed. We will fix the value of 𝜇, simulate a data set 𝑌 with that value of 𝜇, compute 𝐿(𝑌), 𝑈(𝑌), and check whether 𝐿(𝑌) ≤ 𝜇 ≤ 𝑈(𝑌). We will repeat this procedure a large number of times. The proportion of data sets for which 𝐿(𝑌) ≤ 𝜇 ≤ 𝑈(𝑌) should be close to 95%. We call this proportion the empirical coverage probability of the interval.
For all parts, fix 𝜇 = 47 and 𝜎 = 11 for checking the coverages 𝐿(𝑌) ≤ 𝜇 ≤ 𝑈(𝑌) (say), but construct your intervals using the formula that corresponds to whether each parameter is assumed known or unknown, as stated in each question.
i) (10 points) Estimate the empirical coverage probability of a 95% confidence interval for 𝜇 when 𝜎 = 11 is known, based on 10,000 simulated datasets having 𝜇 = 47. Use the value 𝜇 = 47 to calculate the coverage. Create a plot of the confidence intervals for the first 100 simulated datasets, and comment on how the plot communicates the meaning of the empirical coverage probability you report. Here is the code for this question. To answer this question, you have to run this code and put the number and the plot that it outputs in your Crowdmark submission, along with text containing the requested comments.
set.seed(435)
n <- nrow(gasS)
B <- 1e04 # 10,000 simulated datasets
mn <- 47 # Known
ss <- 11 # Known
intlower <- intupper <- covr <- numeric(B)
zval <- qnorm(.975)
for (b in 1:B) {
  samp <- rnorm(n,mn,ss)
  # Treating ss as known, so it appears in the interval
  intlower[b] <- mean(samp) - zval * ss/sqrt(n)
  intupper[b] <- mean(samp) + zval * ss/sqrt(n)
  # Regardless of whether we pretend mu is known or not, calculate the coverage
  # using the true value of mu.
  covr[b] <- intlower[b] <= mn & mn <= intupper[b]
}
mean(covr)
## [1] 0.9507
# Plot the first 100 simulated intervals
numplot <- 100
plot(c(intlower[1],intupper[1]),c(1,1),
     xlim = c(min(intlower),max(intupper)),
     ylim = c(1,numplot),
     type='l',lty='dashed')
abline(v = mn)
for (i in 1:numplot) {
  if (covr[i] == 1) {
    lines(c(intlower[i],intupper[i]),c(i,i))
  } else {
    lines(c(intlower[i],intupper[i]),c(i,i),col='red')
  }
}
[Plot: the first 100 simulated confidence intervals; x-axis labelled c(intlower[1], intupper[1]), with ticks from 35 to 60]
ii) (10 points) Please repeat the exact simulation from i), except estimate 𝜎 for each simulated dataset using the sd function, instead of using the true value. So 𝜎 is being estimated, but we are pretending that it is known by using the formula for a confidence interval that was derived by assuming it was known. Comment on whether the coverage of the resulting intervals should be higher, the same, or lower than when we used the true 𝜎. Your answer to this question is one number representing the empirical coverage probability from your simulations, one plot of the first 100 individual confidence intervals exactly like that shown in i), and written explanation of why the coverage should be lower/higher/the same as in i). Make sure to set the random seed as follows immediately before running your simulations:
set.seed(435)
iii) (10 points) Please repeat the exact simulation from ii), except compute the correct confidence intervals under the assumption that 𝜇 and 𝜎 are both unknown. Comment on whether the coverage of the resulting intervals should be higher, the same, or lower than in each of i) and ii). Your answer to this question is one number representing the empirical coverage probability from your simulations, one plot of the first 100 individual confidence intervals exactly like that shown in i), and written explanation of why the coverage should be lower/higher/the same as in each of i) and ii). Make sure to set the random seed as follows immediately before running your simulations:
set.seed(435)
h) (2 points) What is the width of the 95% confidence interval for 𝜇 with 𝜎 unknown that you reported
i) (2 points) How many times should Professor Stringer have filled up car S if he wanted his confidence
intervals to have width 3 litres, assuming 𝜎 is unknown?
j) (9 points) How can Professor Stringer budget for his gas bill? Let’s make a prediction interval for a future fill-up.
I) (3 points) State the formula for a prediction interval for a future fill-up, treating 𝜇 and 𝜎 as unknown. Compute this interval for the gasS data. Your answer is a formula and a pair of numbers.
II) (6 points) Perform a simulation-based assessment of the coverage of your interval. Use B=1e04 simulations and set.seed(4723698). Report the average empirical coverage probability of your intervals. Your answer to this question is a number.