STAC51 (Winter 2020): Assignment 3
Due on 31st March, 2020, 11:59 PM sharp on Quercus. All relevant work must be shown for credit.
Note: In any question, if you are using R, all R code and R output must be included in your answers. You should assume that the reader is not familiar with R output, so explain all your findings, quoting the necessary values from your output. Please note that academic integrity is fundamental to learning and scholarship. You may discuss questions with other students. However, the work you submit should be your own. If I feel suspicious of any assignment (e.g. if your work doesn't appear to be consistent with what we have discussed in class), I will not mark the assignment. Instead, I will ask you to present your work in my office and your grade will be assigned based on your presentation. Assignments can be handwritten, but the R code and output should be printed.
Late Submissions: Late submissions will not be marked. You will get a '0' on the assignment. However, if for some reason you are late to class or could not come to class, then you need to provide sufficient proof of your reason. Only then can you submit by email ASAP.
1. (a) Perform the following simulation (for this please set the seed to your student ID),
• Generate 500 random values from X1 ∼ Uniform[−10, 10], X2 ∼ N(0, 4) and X3 ∼ Bernoulli(0.7)
• Set β = (−0.8, 0.1, 0.2, 0.3)
• Simulate Yi ∼ Poisson(μi), where μi = exp(Σj xij βj)
[10 Marks]
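The simulation steps above could be set up roughly as follows. This is only a sketch under stated assumptions: the seed is a placeholder, N(0, 4) is read as variance 4 (so `sd = 2` in `rnorm`), and the first entry of β is treated as an intercept because β has four entries while there are three covariates. Check these readings against the lecture notation before relying on them.

```r
# Minimal sketch of the simulation in 1(a); all object names are illustrative.
set.seed(1234567)                      # placeholder; replace with your own student ID
n  <- 500
x1 <- runif(n, min = -10, max = 10)    # X1 ~ Uniform[-10, 10]
x2 <- rnorm(n, mean = 0, sd = 2)       # assumes N(0, 4) means variance 4, so sd = 2
x3 <- rbinom(n, size = 1, prob = 0.7)  # X3 ~ Bernoulli(0.7)

beta <- c(-0.8, 0.1, 0.2, 0.3)
X    <- cbind(1, x1, x2, x3)           # assumes beta[1] is the intercept
mu   <- exp(as.vector(X %*% beta))     # mu_i = exp(sum_j x_ij * beta_j)
y    <- rpois(n, lambda = mu)          # Y_i ~ Poisson(mu_i)
```

If N(0, 4) is instead meant as standard deviation 4, only the `sd` argument changes.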
(b) Estimate the βs using the Iteratively Reweighted Least Squares (IRLS) method. Explain the procedure and state the W matrix as mentioned in lecture 7. Compare the results with the output of glm in R. [15 Marks]
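As an illustration of the kind of procedure part (b) refers to, here is a sketch of IRLS for a Poisson model with log link. It assumes the standard textbook form, in which W is diagonal with entries μi and the working response is zi = ηi + (yi − μi)/μi; the notation in lecture 7 may differ. The function name `irls_poisson`, the seed, and the regenerated data are all illustrative.

```r
# Self-contained IRLS sketch for a Poisson GLM with log link.
set.seed(1)                                   # placeholder seed
n <- 500
X <- cbind(1, runif(n, -10, 10), rnorm(n, 0, 2), rbinom(n, 1, 0.7))
y <- rpois(n, exp(as.vector(X %*% c(-0.8, 0.1, 0.2, 0.3))))

irls_poisson <- function(X, y, tol = 1e-8, max_iter = 100) {
  mu   <- y + 0.1                  # starting values; avoids log(0)
  eta  <- log(mu)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    w <- mu                        # diagonal of W for the log link
    z <- eta + (y - mu) / mu       # working response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta_new <- as.vector(solve(crossprod(X, w * X), crossprod(X, w * z)))
    eta <- as.vector(X %*% beta_new)
    mu  <- exp(eta)
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_irls <- irls_poisson(X, y)
beta_glm  <- unname(coef(glm(y ~ X - 1, family = poisson)))
cbind(IRLS = beta_irls, glm = beta_glm)       # side-by-side comparison
```

The design matrix already contains a column of ones, so the glm call uses `- 1` to avoid adding a second intercept.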
2. For this problem the Horseshoe Crab Mating data will be used. The description can be found in Agresti, page 123. To obtain the data you need to run the following R code:
## Horseshoe Crab data ##
## To install the package rsq
install.packages("rsq")
library(rsq)
data(hcrabs)
(i) Execute a Poisson regression to estimate the mean number of satellites using all the other covariates [5 Marks]
(ii) Execute a negative binomial regression to estimate the mean number of satellites using all the other covariates [5 Marks]
(iii) Compare the models. Which model performed better? [5 Marks]
3. Let Y | μ ∼ Poisson(μ). That is, P(Y = y | μ) = exp(−μ) μ^y / y!. Let μ ∼ Gamma(α, β), i.e., f(μ) = (β^α / Γ(α)) μ^(α−1) exp(−βμ). Find the marginal distribution of Y, i.e., find P(Y = y). [10 Marks]
4. For this problem you need to load the NHANES dataset using the following commands:
## If the package is not installed then use ##
install.packages("NHANES") ## And install.packages("tidyverse")
library(tidyverse)
library(NHANES)
small.nhanes <- na.omit(NHANES[NHANES$SurveyYr=="2009_10"
& NHANES$Age > 17,c(1,3,4,7, 9:11,13,25,61)])
small.nhanes <- small.nhanes %>%
group_by(ID) %>% filter(row_number()==1)
This is data collected by the US National Center for Health Statistics (NCHS). The preceding code creates a small subset of the original NHANES dataset. Using this dataset, answer the following questions:
(a) Randomly select 500 observations from the data. For this selection use your student ID as seed. Fit a logistic regression to predict smoking status (variable SmokeNow), using all the other variables (excluding ID). Explain your results in a few sentences. [10 Marks]
(b) Perform model selection using an AIC/BIC-based stepwise approach. [10 Marks]
(c) Perform an internal validation using cross-validation. Explain your results. [10 Marks]
(d) Construct the receiver operating characteristic (ROC) curve. Calculate the area under the curve (AUC). How would you interpret the AUC? [10 Marks]
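For intuition about what the AUC measures, the sketch below computes it in base R from ranks: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `auc_rank` and the simulated scores are illustrative assumptions, not part of the assignment data; a dedicated ROC package would normally be used to draw the curve itself.

```r
# Rank-based (Mann-Whitney) AUC in base R; names and data are illustrative.
auc_rank <- function(score, label) {
  r  <- rank(score)                    # average ranks handle tied scores
  n1 <- sum(label == 1)                # number of positives
  n0 <- sum(label == 0)                # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)                                    # placeholder seed
p_hat <- runif(200)                            # hypothetical predicted probabilities
obs   <- rbinom(200, size = 1, prob = p_hat)   # hypothetical 0/1 outcomes
auc_rank(p_hat, obs)
```

A value of 0.5 corresponds to scores that do not separate the two classes at all, and 1 corresponds to perfect separation.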
(e) Predict the probabilities for the remaining 476 observations. Calculate the deciles of the predicted probabilities. Do the observed and the predicted probabilities differ across the deciles? [10 Marks]
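One way to organise the decile comparison in (e) is sketched below in base R. The predicted probabilities and outcomes here are simulated stand-ins (`p_hat`, `obs`, and `calib` are illustrative names), so the numbers say nothing about the NHANES data; only the grouping logic is the point.

```r
# Group predictions into deciles and compare mean predicted vs observed rates.
set.seed(1)                                    # placeholder seed
p_hat <- runif(476)                            # hypothetical predicted probabilities
obs   <- rbinom(476, size = 1, prob = p_hat)   # hypothetical 0/1 outcomes

# Decile cut points of the predicted probabilities (0%, 10%, ..., 100%)
breaks <- quantile(p_hat, probs = seq(0, 1, by = 0.1))
decile <- cut(p_hat, breaks = breaks, include.lowest = TRUE, labels = FALSE)

calib <- data.frame(
  decile    = 1:10,
  predicted = as.vector(tapply(p_hat, decile, mean)),  # mean predicted probability
  observed  = as.vector(tapply(obs, decile, mean))     # observed proportion of 1s
)
calib
```

If many predicted probabilities are tied, the cut points may not be unique and `cut` will raise an error; in that case the breaks need to be de-duplicated first.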