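1. (a) Consider the one-way ANOVA model
$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \quad i = 1, 2, 3; \; j = 1, 2.$
Assume that the $\epsilon_{ij}$ are IID Normal random variables with $E(\epsilon_{ij}) = 0$ and $\mathrm{Var}(\epsilon_{ij}) = \sigma^2 > 0$ for all $i, j$.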
i) Suppose the model is used to determine the efficacy of three drugs A, B and C on the cholesterol levels of patients. Interpret, within this context, each term in the model above and the corresponding assumptions.
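ii) For the model above, construct the corresponding design matrix $\boldsymbol{X}$, the vector of responses $\boldsymbol{y}$, the vector of regression coefficients $\boldsymbol{\beta}$, and the error vector $\boldsymbol{\epsilon}$. Justify why the least squares estimator $(\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$ of $\boldsymbol{\beta}$ cannot be computed without further constraints.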
[15 marks]
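As a quick numerical illustration of part ii), the sketch below assumes the usual over-parameterised coding: an intercept column followed by one indicator column per drug. The column order and the helper names `gram` and `det` are illustrative choices, not part of the question. Under that assumption the intercept column equals the sum of the three indicator columns, so $X^T X$ is singular.

```python
from fractions import Fraction

# Design matrix for i = 1, 2, 3 and j = 1, 2.
# Assumed column order: mu, alpha_1, alpha_2, alpha_3.
X = [[1] + [1 if i == k else 0 for k in range(3)]
     for i in range(3) for _ in range(2)]

def gram(X):
    """Return X^T X as a list of lists."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)]
            for a in range(p)]

def det(M):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in M]
    n, d = len(A), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)      # no pivot in this column: singular
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            d = -d                  # a row swap flips the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return d

XtX = gram(X)
print(XtX)       # 4 x 4 matrix X^T X
print(det(XtX))  # a zero determinant means (X^T X)^{-1} does not exist
```

A constraint such as a sum-to-zero or corner-point condition on the $\alpha_i$ removes one column and restores full rank.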
(b) A farmer wanted to compare four types of wheat to find which gives the greatest yield. Since he suspected that growing conditions might vary across his field, he divided the field into four plots and performed experiments, which led to the following data on yield (in tonnes).
         Wheat 1   Wheat 2   Wheat 3   Wheat 4     Sum
Plot 1     6.5       6.6       6.3       5.9      25.3
Plot 2     7.2       6.4       6.4       6.2      26.2
Plot 3     6.3       6.1       5.9       5.9      24.2
Plot 4     6.4       6.4       6.3       6.1      25.2
Sum       26.4      25.5      24.9      24.1     100.9
Note: $\sum_{i=1}^{4} \sum_{j=1}^{4} y_{ij}^2 = 637.85$.
i) What type of design has been used by the farmer?
ii) Explain how you would ensure this design is randomised.
iii) Write down an appropriate model for this experiment, clearly defining your notation and explaining any assumptions you make.
iv) Calculate the ANOVA table for this data.
v) Test for the significance of wheat type and comment on your findings.
[25 marks]
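The sums of squares needed for part iv) can be checked numerically. The sketch below assumes an additive model with the plots treated as blocks and the wheat types as treatments; the variable names are illustrative. It computes the correction factor, the sums of squares and the F ratios from the yield table.

```python
# Yields (tonnes): rows are Plots 1-4, columns are Wheat 1-4.
y = [
    [6.5, 6.6, 6.3, 5.9],
    [7.2, 6.4, 6.4, 6.2],
    [6.3, 6.1, 5.9, 5.9],
    [6.4, 6.4, 6.3, 6.1],
]
b, t = len(y), len(y[0])            # number of blocks (plots) and treatments (wheats)
grand = sum(map(sum, y))
cf = grand ** 2 / (b * t)           # correction factor

ss_total = sum(v * v for row in y for v in row) - cf
ss_plot = sum(sum(row) ** 2 for row in y) / t - cf
ss_wheat = sum(sum(col) ** 2 for col in zip(*y)) / b - cf
ss_resid = ss_total - ss_plot - ss_wheat

df_plot, df_wheat = b - 1, t - 1
df_resid = df_plot * df_wheat
ms_plot = ss_plot / df_plot
ms_wheat = ss_wheat / df_wheat
ms_resid = ss_resid / df_resid
f_wheat = ms_wheat / ms_resid       # compare with tabulated F on (3, 9) df
f_plot = ms_plot / ms_resid

for name, ss, df in [("Plot", ss_plot, df_plot), ("Wheat", ss_wheat, df_wheat),
                     ("Residual", ss_resid, df_resid),
                     ("Total", ss_total, b * t - 1)]:
    print(f"{name:<9}{df:>3}{ss:>10.4f}")
print(f"F (wheat) = {f_wheat:.3f}, F (plot) = {f_plot:.3f}")
```

The F ratio for wheat is then referred to the F distribution on 3 and 9 degrees of freedom.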
MATH3029-E1
2. (a) Show that the pdf of a normal distribution with mean $\mu \in \mathbb{R}$ and variance 1 belongs to the one-parameter GLM family. Clearly identify $\theta$, $b(\cdot)$, $c(\cdot, \cdot)$, $\phi$ and $a(\cdot)$. [5 marks]
(b) Suppose $Y_i$, $i = 1, \dots, n$, are IID $N(0, 1)$ random variables. Denote by $\phi$ and $\Phi$ their pdf and cdf (cumulative distribution function), respectively. For real numbers $t_i$, define $Z_i = 1$ if $Y_i \le t_i$ and $Z_i = 0$ otherwise.
i) For fixed $t_i$, write down the joint distribution of the $Z_i$.
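ii) Consider $t_i = \beta_1 + \beta_2 x_i$ with $i = 1, \dots, n$, where the $x_i$ are real-valued. Using the $Z_i$, write down the log-likelihood function $l(\beta_1, \beta_2)$. Also show that the score statistic $\boldsymbol{U} = (U_1, U_2)^T$, where $U_i = \partial l / \partial \beta_i$, $i = 1, 2$, is:
$$U_1 = \sum_{i=1}^{n} \left[ \frac{Z_i \, \phi(\beta_1 + \beta_2 x_i)}{\Phi(\beta_1 + \beta_2 x_i)} - \frac{(1 - Z_i) \, \phi(\beta_1 + \beta_2 x_i)}{1 - \Phi(\beta_1 + \beta_2 x_i)} \right]$$
$$U_2 = \sum_{i=1}^{n} \left[ \frac{Z_i \, x_i \, \phi(\beta_1 + \beta_2 x_i)}{\Phi(\beta_1 + \beta_2 x_i)} - \frac{(1 - Z_i) \, x_i \, \phi(\beta_1 + \beta_2 x_i)}{1 - \Phi(\beta_1 + \beta_2 x_i)} \right]$$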
iii) Verify that 𝐸(𝑼) = 𝟎.
iv) Why is $\Phi^{-1} : [0, 1] \to \mathbb{R}$ a valid link function for linking $E(Z_i)$ with $x_i$?
[20 marks]
(c) In a study examining the relationship between Alzheimer’s disease (yes = 1, no = 0) and Age in 98 people, a binary logistic regression model was used. Output from R is given below.
i) Using Output1: (1) interpret, in the context of the problem, the estimate of the Age parameter, and (2) explain the values obtained for the degrees of freedom.
ii) Using Output1 explain, using the GLM form of a Bernoulli distribution, the statement: ‘Dispersion parameter for binomial family taken to be 1’.
iii) Information on economic status (‘Lower’, ‘Middle’, ‘Higher’) of each person was added to the model containing Age. Using Output1 and Output2 perform a Deviance test to ascertain if economic status has a significant relationship with the chances of being diagnosed with Alzheimer’s.
iv) Using Output2, predict the probability of being diagnosed with Alzheimer’s for a person aged 48 and classified as having a ‘Lower’ economic status.
[15 marks]

Output 1:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -1.62437     0.40575   -4.003  6.25e-05 ***
Age           0.03183     0.01204    2.644   0.00819 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 114.91 on 96 degrees of freedom
Output 2:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -1.49037     0.52223   -2.854   0.00432 **
                          0.01247    2.507   0.01216 *
                          0.56145   -1.252   0.21047
                          0.55692    0.682   0.49517
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 122.32 on 97 degrees of freedom
Residual deviance: 111.50 on 94 degrees of freedom
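The arithmetic behind parts (c) i) and iii) can be sketched as follows. This assumes the usual large-sample chi-squared approximation for a difference in residual deviances between nested models, and it uses the fact that a chi-squared variable on 2 degrees of freedom has survival function $e^{-x/2}$. Variable names are illustrative.

```python
import math

dev_age, df_age = 114.91, 96      # residual deviance, Output 1 (Age only)
dev_full, df_full = 111.50, 94    # residual deviance, Output 2 (Age + economic status)

stat = dev_age - dev_full         # deviance difference between the nested models
df = df_age - df_full             # number of extra parameters in the larger model

# The closed form below is valid only for 2 degrees of freedom.
assert df == 2
p_value = math.exp(-stat / 2)

# Multiplicative change in the odds per extra year of Age (Output 1 estimate).
odds_ratio_age = math.exp(0.03183)

print(f"deviance difference = {stat:.2f} on {df} df, p = {p_value:.3f}")
print(f"odds ratio per year of Age = {odds_ratio_age:.4f}")
```

The deviance difference can equivalently be compared with the tabulated 5% critical value of the chi-squared distribution on 2 degrees of freedom.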

3. (a)
i) Give an example of an offset in a Poisson GLM.
ii) How would you test for the significance of an offset variable in a Poisson GLM?
iii) Suppose $Y_i$ are independent Poisson random variables with mean $\mu_i$, offset $n_i$, and rate $\theta_i$ for $i = 1, \dots, N$. With $Y_i$ as responses, consider a Poisson GLM with log link function consisting of a single real-valued predictor $x_i$ with regression coefficient $\beta$. Show that, for each $i = 1, \dots, N$, the rate parameter $\theta_i$ changes by a factor of $e^{\beta}$ when $x_i$ increases by one unit.
[15 marks]
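A small numerical check of the claim in part iii), written under the assumption that the linear predictor is $\log \mu_i = \log n_i + \beta_0 + \beta x_i$, so that $\theta_i = \mu_i / n_i = \exp(\beta_0 + \beta x_i)$. The intercept $\beta_0$ and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

beta0, beta = -2.0, 0.3   # hypothetical coefficient values, for illustration only

def rate(x):
    """Rate theta = mu / n under log(mu) = log(n) + beta0 + beta * x."""
    return math.exp(beta0 + beta * x)

def mean(x, n):
    """Poisson mean mu = n * theta for an offset (exposure) n."""
    return n * rate(x)

for x in (0.0, 1.5, 4.0):
    ratio = rate(x + 1) / rate(x)
    # The ratio equals exp(beta) regardless of the starting value of x.
    print(x, ratio, math.exp(beta))
```

Because the offset enters with a fixed coefficient of 1, it cancels when the mean is divided by $n_i$, which is why the factor does not depend on $n_i$.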
(b) The data below is on the monthly accident counts on a major US highway for each of the 12 months of 1970, then for each of the 12 months of 1971, and finally for the first 9 months of 1972.
       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
1970    52   37   49   29   31   32   28   34   32   39   50   63
1971    35   22   27   27   34   23   42   30   36   56   48   40
1972    33   26   31   25   23   20   25   20   36
Output from R showing results from fitting a GLM modelling number of accidents with appropriately defined predictors year and month is provided below.
glm(formula = y~year + month, family = poisson)
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   3.81969     0.09896   38.600   < 2e-16 ***
Year1971     -0.12516     0.06694   -1.870  0.061521 .
Year1972     -0.28794     0.08267   -3.483  0.000496 ***
month2       -0.34484     0.14176   -2.433  0.014994 *
month3       -0.11466     0.13296   -0.862  0.388459
month4       -0.39304     0.14380   -2.733  0.006271 **
month5       -0.31015     0.14034   -2.210  0.027108 *
month6       -0.47000     0.14719   -3.193  0.001408 **
month7       -0.23361     0.13732   -1.701  0.088889 .
month8       -0.35667     0.14226   -2.507  0.012168 *
month9       -0.14310     0.13397   -1.068  0.285444
month10       0.10167     0.13903    0.731  0.464628
month11       0.13276     0.13788    0.963  0.335639
month12       0.18252     0.13607    1.341  0.179812
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 101.143 on 32 degrees of freedom
Residual deviance: 27.273 on 19 degrees of freedom
Number of Fisher Scoring iterations: 3
i) Write down the mathematical model fitted, along with its assumptions.
ii) Based on the output, is it fair to state that the average number of accidents appears to have decreased from 1970 to 1972? Justify your answer.
iii) The Transport Authority wishes to check whether the number of accidents tends to be higher from September to December than in January. What would be your recommendation? Justify accordingly.
iv) Construct a 95% confidence interval for the coefficient of Year1972 in the model in i), and corroborate the conclusion obtained from the p-value corresponding to Year1972 in the output.
v) What is your prediction for the number of accidents in October 1972?
[25 marks]
MATH3029-E1 END
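The calculations asked for in parts iv) and v) can be sketched as below. The sketch assumes a Wald interval with the normal quantile 1.96, and it assumes that R's default treatment coding was used, so that 1970 and January are the baseline levels absorbed into the intercept. Variable names are illustrative.

```python
import math

est, se = -0.28794, 0.08267      # Year1972 row of the output
z = 1.96                         # approximate 97.5% standard normal quantile

# Wald 95% confidence interval on the log scale.
ci = (est - z * se, est + z * se)
excludes_zero = ci[1] < 0 or ci[0] > 0

# October 1972 on the log scale: intercept + Year1972 + month10.
eta = 3.81969 + est + 0.10167
pred = math.exp(eta)             # predicted mean accident count

print(f"95% CI for Year1972: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"interval excludes zero: {excludes_zero}")
print(f"predicted count, October 1972: {pred:.2f}")
```

An interval that excludes zero agrees with a two-sided p-value below 0.05 for the same coefficient.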