MTHM506/COMM511: Statistical Data Modelling
Question Sheet 2
Marks achieved in this assignment will contribute towards 25% of the final module mark. You should attempt all questions on this sheet. Note that the questions are organised in the order we covered the topics, and not in order of difficulty. Therefore it is advised that you read through the questions first, and start working on those that you feel more comfortable with.
Deadline: Noon (12pm), on 25th March 2022
You should submit one PDF via eBART containing your solutions. It should be written up using word processing software (e.g. LaTeX, R Markdown, or Word). Solutions are expected to be concise, well structured and well presented. Commented R code (e.g. ‘model <- glm(...)’) and the outcomes/plots should form part of your solutions. Do not display too much raw R output (e.g. don’t display the full output of ‘summary(model)’), but edit this down to the essentials. Be sure to justify each step of your analyses, provide comments alongside your R code to explain what you are doing, and add appropriate titles and labelled axes to your plots. You are expected to work independently; strict disciplinary action will be taken for any plagiarism. Late submissions will also be penalised.

The data required for this assignment, datasets_sheet2.RData, can be downloaded from the ELE page and loaded into R using the load() function.

Question 1
In this question we return to Question 1 from Question Sheet 1, where we fit the following non-linear Gaussian model:
Yi ∼ N(θ1xi/(θ2 + xi), σ²), i = 1, 2, ..., 100, Yi independent
on the dataframe nlmodel, which contains data on a response variable y and a single explanatory variable x.
(a) [1 mark] Fit a Gaussian GAM with identity link using the function gam() in the R package mgcv. Use a cubic spline basis and assume a basis dimension of q = 9.
(b) [5 marks] Determine whether the rank of 9 is enough by running the function gam.check() on the fitted model and checking whether the effective degrees of freedom are close to the maximum possible (q − 1). This function also produces residual plots, so comment on those as well (but note that it plots deviance residuals and not standardised deviance residuals).
(c) [3 marks] Use the function predict() with se.fit=T to produce the fitted line along with 95% confidence intervals. Given that the true relationship between x and the mean is the one given above, i.e.
(θ1x)/(θ2 + x), state what you may want to do with the model.

Question 2
In this question we return to Question 2 from Question Sheet 1, where we fit a series of models to the quarterly number of AIDS cases in the UK, yi, from January 1983 to March 1994. The data are in dataframe aids, where the variable cases is yi and date is time, symbolised here as xi.
(a) [7 marks] Fit a Poisson GAM (using a cubic spline) with a log link, where the response is the number of cases and date is the predictor. Plot the counts against date and add the predicted line with associated 95% confidence intervals, and comment on the fit. Make sure to use an appropriate rank and perform relevant model checking with respect to residual plots and the deviance.
(b) [8 marks] Suggest two alternative models that would improve the fit. Implement one of these, and perform the same model checking as in part (a). Also produce a plot of the predicted line with 95% CIs. Comment on any differences between the predicted smooth lines of the two models and the possible reason behind this.

Question 3
The dataframe pupils contains language scores from Dutch schools. This is an example of a two-level situation. Specifically, the data cover 131 schools (but only 1 class per school), with i = 1, 2, ..., 2287 students in grades 7 and 8. The nesting therefore occurs within each school j = 1, 2, ..., 131. Interest lies in assessing the impact on language scores of pupil factors such as IQ (IQ) and pupil social status (ses). The response variable is denoted as test. The categorical variable (factor) Class refers to the class that each pupil belongs to (so Class = 1, 2, ..., 131).
(a) [2 marks] First fit a (Gaussian) linear model using glm() with test as the response and IQ, ses and the factor Class as the covariates. Comment on the significance of the two continuous variables and perform a likelihood ratio test to assess the overall significance of the factor Class.
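For orientation only, the mechanics of a likelihood ratio comparison between two nested Gaussian models fitted with glm() can be sketched on simulated data. This is a minimal sketch and not a model solution: the dataframe sim, the numbers of classes and pupils, and all parameter values are invented for illustration, and only the variable names test, IQ, ses and Class mirror the sheet.

```r
# Simulated stand-in for the pupils data; every value below is made up.
set.seed(1)
n_class <- 20   # hypothetical number of classes
n_per   <- 15   # hypothetical number of pupils per class
sim <- data.frame(
  Class = factor(rep(seq_len(n_class), each = n_per)),
  IQ    = rnorm(n_class * n_per, mean = 0, sd = 2),
  ses   = rnorm(n_class * n_per, mean = 0, sd = 10)
)
class_shift <- rnorm(n_class, sd = 3)   # class-level shifts in the mean
sim$test <- 40 + 2 * sim$IQ + 0.2 * sim$ses +
  class_shift[as.integer(sim$Class)] + rnorm(nrow(sim), sd = 6)

# Full model, and the reduced model without the factor Class.
fit_full    <- glm(test ~ IQ + ses + Class, family = gaussian, data = sim)
fit_reduced <- glm(test ~ IQ + ses,         family = gaussian, data = sim)

# Likelihood ratio statistic: twice the difference in maximised log-likelihoods.
ll_full    <- logLik(fit_full)
ll_reduced <- logLik(fit_reduced)
lrt_stat <- 2 * (as.numeric(ll_full) - as.numeric(ll_reduced))

# Degrees of freedom: difference in the number of estimated parameters.
lrt_df  <- attr(ll_full, "df") - attr(ll_reduced, "df")
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(statistic = lrt_stat, df = lrt_df, p.value = p_value)
```

A related shortcut is anova(fit_reduced, fit_full, test = "LRT"). For Gaussian families that call scales the deviance difference by an estimated dispersion, so its numbers may differ slightly from the log-likelihood calculation above.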
(b)
(i) [2 marks] State two reasons why one might want to treat the class effects as random.
(ii) [2 marks] Write down the mathematical formulation of a Normal random effect model (IQ and ses as fixed effects and Class as a random effect).
(iii) [3 marks] Fit this model and comment on the significance of the fixed effects based on t-tests. State any assumptions you are making.
(iv) [5 marks] What are the estimates of the “within-class” variance and the “between-class” variance? What is the estimate of the marginal variance of the response? Is it different from the (marginal) variance from the model in (a) and, if so, why?
(v) [3 marks] Test whether the variance of the random effects is zero (i.e. the significance of the random effects) using a likelihood ratio test.
(vi) [6 marks] Comment on the validity of likelihood ratio tests in mixed effects models, suggest an alternative way of implementing these tests, and use it to compare with the results in (b)(v).
(vii) [3 marks] Plot a density estimate of the predicted random effects and superimpose their theoretical Normal distribution using the estimate of their variance. Use the functions qqnorm() and qqline() to produce a QQ plot of the random effects and comment on the validity of a Gaussian model for the random effects.
(viii) [3 marks] Note that the functions fitted() and resid() in lme4 will produce the fitted values ŷ and raw residuals y − ŷ. Use these functions, in conjunction with the two functions in (vii), to produce a QQ plot of the residuals and a residuals vs fitted values plot. Comment on the model assumptions using the two plots.
(c) One of the student-level covariates is IQ, which may affect the test results per student. However, there may be class-level (latent) variables, such as teacher competence, which may have an effect on how IQ relates to the test result in each class. Such a scenario may be accommodated by considering the parameter of IQ to be random rather than fixed (and constant across classes).
(i) [4 marks] Extend the model in (b) to make the parameter of IQ vary with Class. Compare (qualitatively) the overall effect of IQ on test between this model and the model in (b).
(ii) [3 marks] Test for the significance of the random slope for IQ using a likelihood ratio test.
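As a reminder of the lme4 formula syntax involved, the difference between a random intercept and a random slope specification can be sketched on simulated data. This is a minimal sketch under stated assumptions, not a model solution: it assumes the lme4 package is installed, the dataframe sim and all parameter values are invented, and only the variable names test, IQ, ses and Class mirror the sheet.

```r
library(lme4)  # assumed to be installed

# Simulated stand-in for the pupils data; every value below is made up.
set.seed(2)
n_class <- 30   # hypothetical number of classes
n_per   <- 20   # hypothetical number of pupils per class
sim <- data.frame(
  Class = factor(rep(seq_len(n_class), each = n_per)),
  IQ    = rnorm(n_class * n_per, sd = 2),
  ses   = rnorm(n_class * n_per, sd = 10)
)
u0 <- rnorm(n_class, sd = 3)     # class-specific intercept deviations
u1 <- rnorm(n_class, sd = 0.8)   # class-specific IQ slope deviations
cl <- as.integer(sim$Class)
sim$test <- 40 + u0[cl] + (2 + u1[cl]) * sim$IQ + 0.2 * sim$ses +
  rnorm(nrow(sim), sd = 6)

# Random intercept only, then random intercept plus random IQ slope.
# REML = FALSE so that both models are fitted by maximum likelihood.
fit_int   <- lmer(test ~ IQ + ses + (1 | Class),      data = sim, REML = FALSE)
fit_slope <- lmer(test ~ IQ + ses + (1 + IQ | Class), data = sim, REML = FALSE)

# Likelihood ratio comparison of the two nested fits.
anova(fit_int, fit_slope)
```

The term (1 + IQ | Class) adds a slope variance and an intercept-slope covariance, so the two fits differ by two parameters.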