Instructions
STAT 341: Assignment 4 – Winter 2022
Instructor:
Due: Tuesday, April 5 at 11:59 pm ET
You must upload your solutions in the form of one pdf file for each part of each question by the deadline onto Crowdmark. Your instructor will NOT accommodate mistakes in submitting the pdf file of one question for another question. No assignment submission through email will be accepted. Note that your pdf solution file must have been generated by R Markdown. Additionally:
• For mathematical questions: your solutions must be typeset in LaTeX (from within R Markdown), unless the question specifically states that hand-written solutions are accepted, in which case you may import a clear image of your hand-written solution. If no such note is included, screenshots and scanned/photographed handwritten solutions receive zero points.
• For computational questions: R code should always be included in your solution (via code chunks in R Markdown). If you don't provide your R code, you will receive zero points for the corresponding question.
• For interpretation questions: plain text (within R Markdown) is required. Text responses embedded as comments within code chunks will not be accepted.
• Alternative accommodations including, but not limited to, email submission and/or extensions due to R Markdown failures and/or problems compiling to pdf will not be granted.
• The formatting requirement will be taken seriously. Screenshots of your solutions and/or R code, even if the original file is R Markdown-generated, will receive 0 marks. The file you submit to Crowdmark must be compiled directly by R Markdown.
Organization and formatting are part of a full solution. Consequently, points will be deducted for solutions that are disorganized or incomprehensible. A solution that is difficult to understand, or whose parts are difficult to find, will not receive full marks.
Academic Integrity:
• While you may discuss the questions with your classmates on Piazza, consulting another student’s solution is prohibited, and submitted solutions may not be copied from any source. You may not talk to any other individual about the questions in this assignment. The instructor will hold online office hours during which he will answer clarification questions. You also have access to Piazza where you can ask questions.
• You may not use and/or search the internet (except for LEARN and Piazza) to answer the questions in this assignment. However, you may search the internet for R syntax.
• If a question which you would like to post on Piazza shares your solution, you must make it a private post.
• In short, you can treat this assignment like an open-book exam, where you are only allowed to use the course material provided to you during lectures and on Piazza and/or LEARN as well as books you may find at the library.
• Any violation of the academic integrity regulations outlined here and in the course syllabus (make sure to read the course outline again!) will be counted as cheating and will be reported to the Dean’s Office.
• The instructor reserves the right to conduct an online interview with you during which you will be asked questions about your solutions and the details of how you came to these responses. Should such an interview take place and you are unable to explain and defend your solutions, your grade for this assignment and, consequently, your course grade will be affected.
Question One – 25 Marks
Consider the Economic Mobility data-set discussed in Assignment #2 which includes information about 729 US communities to study the economic mobility across generations in the contemporary USA. The variables in the data-set are:
• Mobility: The probability that a child born in 1980-1982 into the lowest quintile (20%) of household income will be in the top quintile (20%) at age 30. Individuals are assigned to the community they grew up in, not the one they were in as adults.
• Commute: Fraction of workers with a commute of less than 15 minutes.
• Longitude / Latitude: Geographic coordinate for the centre of the community.
• Name: the name of principal city or town.
• State: the state of the principal city or town of the community.
• Population: The population of the community.
In particular, we are interested in the spread of the variable Mobility, measured by the variance $\sigma^2$ and the inter-quartile range $\mathrm{IQR} = Q_{0.75} - Q_{0.25}$. Recall that $\sigma^2 = \sum_{u \in \mathcal{P}} (y_u - \bar{y})^2 / N$.
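As a minimal sketch of these two definitions, the R snippet below computes a variance with divisor N (rather than the N − 1 used by R's var()) and an inter-quartile range on a small made-up vector. The helper names popVar and iqr and the data y are hypothetical, and quantile() is called with its default type, which is an assumption and may differ from the convention used in the lectures.

```r
# Hypothetical toy data; these are NOT the Mobility values.
y <- c(2, 4, 4, 5, 7, 9)

# Variance with divisor N, unlike var(), which divides by N - 1.
popVar <- function(y) {
  mean((y - mean(y))^2)
}

# Inter-quartile range Q_0.75 - Q_0.25, using quantile()'s default type
# (an assumption; other quantile types give slightly different values).
iqr <- function(y) {
  unname(quantile(y, 0.75) - quantile(y, 0.25))
}

N <- length(y)
popVar(y)   # equals var(y) * (N - 1) / N
iqr(y)      # agrees with stats::IQR(y) for the default type
```

The rescaling by (N − 1) / N is the only difference between this variance and the output of var(), which is easy to overlook when the divisor matters.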
a) [2 Marks] Write a function named VarIQR that takes in a population or sample of variates and outputs the variance $\sigma^2$ and the inter-quartile range IQR. Apply this function to the population of Mobility values.
b) [5 Marks] Here, we study the sampling distribution of the two attributes.
• Select M = 1000 samples of size n = 100 without replacement from the original data-set, i.e. construct $S_1, S_2, \ldots, S_{1000}$.
• For each sample calculate the variance and the IQR. Then construct two histograms (in a single row) of the sample error for each attribute.
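The two steps above can be sketched as follows on a simulated stand-in population. This is only an illustration under stated assumptions: the vector pop is made up, popVar is a hypothetical helper, and the sample error is taken to be a(S) − a(P), the attribute on the sample minus the attribute on the population.

```r
set.seed(341)

# Hypothetical stand-in population of 729 values (not the Mobility data).
pop <- rexp(729)
N <- length(pop)
M <- 1000   # number of samples
n <- 100    # size of each sample

# Variance with divisor equal to the number of units supplied.
popVar <- function(y) mean((y - mean(y))^2)

# Each column holds the unit indices of one sample, drawn without replacement.
samples <- replicate(M, sample(N, n, replace = FALSE))

# Sample error, assumed here to mean a(S) - a(P).
varErr <- apply(samples, 2, function(s) popVar(pop[s])) - popVar(pop)
iqrErr <- apply(samples, 2, function(s) IQR(pop[s])) - IQR(pop)

# Two histograms in a single row.
par(mfrow = c(1, 2))
hist(varErr, main = "Sample error: variance", xlab = "a(S) - a(P)")
hist(iqrErr, main = "Sample error: IQR", xlab = "a(S) - a(P)")
```

Storing indices rather than values keeps each sample tied to its units, so the same samples can be reused for both attributes.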
c) The following sample S, labelled CommunitiesSample, was obtained by random sampling without replacement from the population of 729 communities in the data-set.
CommunitiesSample = c(265, 596, 270, 334, 653, 273, 93, 58, 113,
668, 235, 243, 703, 672, 411, 231, 723, 127, 640, 217, 626,
279, 482, 395, 410, 162, 7, 603, 28, 100, 68, 141, 593, 564,
557, 604, 443, 202, 480, 285, 210, 585, 199, 224, 577, 551,
464, 611, 292, 649, 80, 180, 3, 463, 479, 77, 453, 241, 548,
488, 447, 396, 124, 552, 340, 615, 63, 380, 599, 590, 386,
99, 374, 225, 116, 610, 215, 651, 55, 563, 562, 122, 476,
355, 36, 293, 534, 652, 53, 571, 398, 353, 383, 627, 352,
377, 537, 151, 392, 51)
Using the given sample CommunitiesSample and the variable Mobility,
i) [2 Marks] Calculate the two attributes of interest using the given sample.
ii) [5 Marks] By re-sampling the sample S with replacement, construct B = 1000 bootstrap samples $S_1^\star, S_2^\star, \ldots, S_{1000}^\star$ and calculate the two attributes of interest on each bootstrap sample. Then construct two histograms (in a single row) of the bootstrap sample error for each attribute. Make sure you label your histograms clearly.
iii) [5 Marks] Calculate standard errors for each sample estimate and then construct a 95% confidence interval for the population quantity using the percentile method.
d) [6 Marks] For each of the two attributes of interest, estimate the coverage probability when using the percentile method and give the standard error of your estimate. For the simulation, choose an appropriate number of samples and number of bootstrap samples. In addition, provide a conclusion about the procedure.
Note: this analysis can be computationally intensive. Do not leave it until the last minute.
Question Two – 25 Marks
The data file OzoneData.csv includes the daily air pollution data in Chicago for a period of 550 days. Ozone, also known as O3 or trioxygen, is a strong oxidant with a chlorine-like odour that can cause respiratory damage if inhaled. Hence, O3 is an air pollutant, and when it is present where humans live, it can cause illnesses. The objective of this question is to explore the nature of the predictive accuracy of various polynomial regression models $Y_t = \beta_0 + \beta_1 t + \cdots + \beta_p t^p + \varepsilon_t$. We are interested in using time (day: 1 to 550) in the data collection window to predict future daily ozone concentration in the air.
DISCLAIMER: If you are interested in modeling time series data you should, in general, not rely on polynomials. Many more sophisticated and useful time series models exist for this purpose. If you are interested in this, I recommend taking STAT 443: Forecasting.
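A polynomial regression model of the form described above can be fit in base R with lm() and poly(). The sketch below uses simulated data in place of OzoneData.csv (whose column names are not given here), so the variables day and ozone are assumptions, and it is a stand-in for the getmuhat function from the lectures rather than a reproduction of it.

```r
set.seed(341)

# Hypothetical stand-in for the ozone series: 550 days of noisy seasonal data.
day <- 1:550
ozone <- 20 + 10 * sin(2 * pi * day / 365) + rnorm(550, sd = 4)

degrees <- c(1, 2, 5)
cols <- c("red", "blue", "darkgreen")

# One panel per degree: the data with the fitted polynomial overlaid.
par(mfrow = c(1, 3))
fits <- lapply(seq_along(degrees), function(k) {
  # poly() uses orthogonal polynomials, which is numerically more stable
  # than raw powers of day for higher degrees.
  fit <- lm(ozone ~ poly(day, degrees[k]))
  plot(day, ozone, pch = 19, cex = 0.4, col = "grey60",
       xlab = "Day", ylab = "Ozone", main = paste("Degree", degrees[k]))
  lines(day, fitted(fit), col = cols[k], lwd = 2)
  legend("topright", legend = paste("degree", degrees[k]),
         col = cols[k], lwd = 2, bty = "n")
  fit
})
```

A degree p fit has p + 1 coefficients, which can be confirmed with length(coef(fits[[3]])) for the degree 5 model.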
(a) [5 Marks] Generate six scatter plots of the data in a 3 × 2 grid, each showing the data with one fitted polynomial overlaid, of degree 1, 2, 5, 10, 15, and 20, respectively. Use the getmuhat function defined in the lectures to estimate these polynomial predictor functions. Use a different colour for each of the different degrees, and use a legend to indicate which degree polynomial is visualized in each plot.
(b) [6 Marks] Generate M = 50 samples $S_1, S_2, \ldots, S_{50}$ of size n = 100. You are encouraged (but not required) to use the functions getSampleComp and getXYSample from the lectures. Fit polynomials of degree 1, 2, 5, 10, 15, and 20 to every sample. Now, create another 3 × 2 grid of plots like in part (a), except this time:
• on the first scatter plot overlay all M = 50 degree 1 polynomial predictor functions, and
• on the second scatter plot overlay all M = 50 degree 2 polynomial predictor functions, and
• on the third scatter plot overlay all M = 50 degree 5 polynomial predictor functions, and
• on the fourth scatter plot overlay all M = 50 degree 10 polynomial predictor functions, and
• on the fifth scatter plot overlay all M = 50 degree 15 polynomial predictor functions, and
• on the sixth scatter plot overlay all M = 50 degree 20 polynomial predictor functions.
Use colours that are consistent with part (a).
(c) [6 points] Using the M = 50 samples of size n = 100 generated in part (b), calculate the APSE (and each of its components) for degrees 0:15. In particular, print out a table that shows, for each degree, apse, var_mutilde, bias2 and var_y. Note: You are encouraged (but not required) to use the functions apse_all, getmubar, and gettauFun from the lectures.
Note: this analysis can be computationally intensive. Do not leave it until the last minute.
(d) [5 points] Using your results from part (c) construct a plot whose x-axis is degree and which has four lines: one for apse, one for var_mutilde, one for bias2 and one for var_y. Specifically, and for interpretability, plot sqrt(apse), sqrt(var_mutilde), sqrt(bias2) and sqrt(var_y) vs. degree. Be sure to distinguish the lines with different colours and a legend. Briefly describe the trends you see in the plot.
(e) [3 points] Based on your findings in parts (c) and (d), which degree polynomial has the best predictive accuracy? Construct a scatter plot – like the ones from (a) and (b) – but this time create just one plot, and overlay just the polynomial predictor function with the degree you identified as best.
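For part (c), the overall APSE (without its decomposition into components) can be sketched in base R as follows. This is a rough illustration under stated assumptions, not the lecture implementation: the data are simulated, the function apse is a hypothetical helper that is unrelated to apse_all, and the test set for each sample is assumed to be its complement in the population.

```r
set.seed(341)

# Hypothetical stand-in population of 550 (x, y) pairs.
pop <- data.frame(x = 1:550)
pop$y <- 20 + 10 * sin(2 * pi * pop$x / 365) + rnorm(550, sd = 4)

M <- 50    # number of samples
n <- 100   # size of each sample

# A list of M index vectors, each drawn without replacement.
samples <- replicate(M, sample(nrow(pop), n), simplify = FALSE)

# Average, over the M samples, of the mean squared prediction error on the
# units outside each sample (assumed test set: the sample's complement).
apse <- function(degree) {
  mean(sapply(samples, function(s) {
    fit <- lm(y ~ poly(x, degree), data = pop[s, ])
    test <- pop[-s, ]
    mean((test$y - predict(fit, newdata = test))^2)
  }))
}

degrees <- 1:5
apseTable <- data.frame(degree = degrees, apse = sapply(degrees, apse))
print(apseTable)
```

Reusing the same M samples for every degree keeps the comparison across degrees fair, since differences in the table then reflect the model and not the sampling.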