Student Number

Research School of Finance, Actuarial Studies & Statistics

EXAMINATION

Semester 2 – Final, 2020

STAT3015/STAT4030/STAT7030

GENERALISED LINEAR MODELLING

Writing Time: 2 hours
Reading Time: 15 minutes
Submission Time: 15 minutes
Total Time: 2 hours and 30 minutes

Exam Conditions:

This is an Open Book Exam, so any materials are permitted.
For the duration of the exam, no communication with other people
is allowed. Any such communication will constitute a breach
of ANU Academic Regulations.

Materials Permitted In The Exam Venue:

This is an Open Book Exam, so any materials are permitted.

Materials to Be Supplied To Students:

The exam paper will be available on Wattle in the assessment section.
It is recommended that you download the paper as soon as it becomes available.

Semester 2 – Final, 2020 STAT3015/STAT4030/STAT7030 GLM
Page 1 of 13

Instructions to Students for the exam:

Attempt ALL questions
Each of the three questions carries equal marks
Start your solution to each question on a new page
To be a candidate for full marks, show all steps in working out your solution
and, where appropriate, briefly explain your reasoning. Marks may be
deducted for failure to show appropriate calculations or formulae or for not
providing any reasoning.

Instructions to Students for submission of scripts:

It is recommended that you write your solutions to the exam questions by hand.
At the end of the exam, you should transfer your solutions into a single pdf
file in one of two ways:
scan your solutions into a single pdf file, if you have access to a scanner;
photograph your solutions, e.g. using a mobile phone, and save into a single
pdf file.
Then upload this pdf file to Wattle.
You may also email the pdf file with your solutions directly to me as “insur-
ance”, but please be sure also to upload this file to Wattle.
The deadline for uploading the pdf file to Wattle containing your solutions
is 150 minutes (two and a half hours) from the start of the exam.

1. We return to the productivity improvement data that was considered
during the course. The dataset consisted of a measure of the productivity
improvement of 27 business firms, where each firm was classified
according to whether their average expenditure for research and development
in the past three years was high, moderate or low. Some R
output is presented below.

> out1=lm(prodscre~RandD)
> summary(out1)

Call:
lm(formula = prodscre ~ RandD)

Residuals:
     Min       1Q   Median       3Q      Max
-1.43333 -0.50556  0.02222  0.53333  1.32222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2000     0.3266  28.167  < 2e-16 ***
RandDLow     -2.3222     0.4217  -5.507 1.16e-05 ***
RandDMod     -1.0667     0.4000  -2.666   0.0135 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8001 on 24 degrees of freedom
Multiple R-squared: 0.5671, Adjusted R-squared: 0.531
F-statistic: 15.72 on 2 and 24 DF, p-value: 4.331e-05

(a) What type of model is being fitted here? Be brief but as precise as possible.

(b) Write down, in mathematical form, the model under consideration, using the "reference" parametrisation (i.e. the same parametrisation that is used by R). Briefly state the modelling assumptions that are being made.

(c) What are the degrees of freedom in each of the t-tests performed in the R summary output given above?

(d) Explain why, in this model, the (theoretical) variance of the estimator (Intercept) is equal to the (theoretical) covariance between the estimators RandDLow and RandDMod.

(e) There is a suspicion that there is no difference between low investment and moderate investment in the effect on productivity improvement. State the corresponding null hypothesis and then perform a suitable test of this hypothesis, expressing your result in the form of a p-value.

(f) In view of the p-value obtained in 1(e), what action, if any, would you consider taking concerning RandD? Briefly explain your thinking.

HINT FOR PART (d): Consider carefully the implied R definitions of RandDLow and RandDMod, and hence decide which covariance to calculate.

Solution to 1(a): One-way analysis of variance.

Solution to 1(b): The fitted model may be written

$$y_{ij} = \mu + \tau_j + \epsilon_{ij} = \mu_j + \epsilon_{ij}, \qquad (1)$$

where $y_{ij}$ is the response of firm $i$ in group $j$, $\tau_1 = 0$, and $j = 1$ corresponds to high investment (RandDHig), $j = 2$ corresponds to low investment (RandDLow) and $j = 3$ corresponds to moderate investment (RandDMod). The modelling assumptions are that the $\epsilon_{ij}$ are IID $N(0, \sigma^2)$ and that $\mu$, $\tau_2$ and $\tau_3$ are fixed (i.e. non-random).

Solution to 1(c): The degrees of freedom in each t-test is the degrees of freedom in the residual sum of squares, which (in the notation of the lecture notes) is $n - g = 27 - 3 = 24$.

Solution to 1(d): Bearing in mind that R uses the reference parametrisation for the one-way ANOVA, the parameter estimate (Intercept) equals $\hat{\mu}_1$, the sample mean of group 1 (high investment); the parameter estimate RandDLow equals $\hat{\mu}_2 - \hat{\mu}_1$, where $\hat{\mu}_2$ equals the sample mean of group 2 (low investment); and the parameter estimate RandDMod equals $\hat{\mu}_3 - \hat{\mu}_1$, where $\hat{\mu}_3$ equals the sample mean of group 3 (moderate investment). A key point is that, under the one-way ANOVA model, $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\mu}_3$ are all independent. Therefore

$$\begin{aligned}
\mathrm{Cov}[\mathrm{RandDLow}, \mathrm{RandDMod}] &= \mathrm{Cov}[\hat{\mu}_2 - \hat{\mu}_1, \hat{\mu}_3 - \hat{\mu}_1] \\
&= \mathrm{Cov}[\hat{\mu}_2, \hat{\mu}_3] - \mathrm{Cov}[\hat{\mu}_2, \hat{\mu}_1] - \mathrm{Cov}[\hat{\mu}_1, \hat{\mu}_3] + \mathrm{Cov}[\hat{\mu}_1, \hat{\mu}_1] \\
&= 0 - 0 - 0 + \mathrm{Var}[\hat{\mu}_1] = \mathrm{Var}[\mathrm{(Intercept)}],
\end{aligned}$$

as required.

Solution to 1(e): The null hypothesis is $H_0 : \mu_2 = \mu_3$ or, equivalently, $H_0 : \mu_2 - \mu_1 = \mu_3 - \mu_1$. From the R summary output, using the identity $\mathrm{Var}[aX + bY] = a^2 \mathrm{Var}[X] + b^2 \mathrm{Var}[Y] + 2ab\,\mathrm{Cov}[X, Y]$ with $a = 1$, $b = -1$, and the result of 1(d) that the covariance equals $\mathrm{Var}[\mathrm{(Intercept)}] = 0.3266^2$,

$$\mathrm{Var}[\mathrm{RandDLow} - \mathrm{RandDMod}] = 0.4217^2 + 0.4000^2 - 2 \times 0.3266^2 = 0.1245,$$

so the standard error of RandDLow - RandDMod is $\sqrt{0.1245} = 0.3528$. So we should refer

$$\frac{\mathrm{RandDLow} - \mathrm{RandDMod}}{\mathrm{se}(\mathrm{RandDLow} - \mathrm{RandDMod})} = \frac{-2.3222 + 1.0667}{0.3528} = -3.559$$

to the t-distribution with 24 degrees of freedom. The corresponding 2-sided p-value is 0.0016. This provides fairly strong evidence for rejecting $H_0$, in that it is comfortably significant at the 0.01 level.

Solution to 1(f): We might consider combining these two categories but, unless there is expert support for combining the low investment and moderate investment categories, it might well be best to do nothing. Note: Any sensible comments should be marked sympathetically.

Mark scheme for Question 1. (a)=2. (b)=4. (c)=4. (d)=6. (e)=6. (f)=3. Total: 25.

2. Consider a random variable $Y$ with a discrete distribution whose probability mass function is given by

$$\mathrm{Prob}[Y = y] = f(y; p) = (1 - p)^2 (y + 1) p^y, \qquad y = 0, 1, 2, \ldots \qquad (2)$$

(a) Demonstrate that $f(y; p)$ may be written in the form of a Generalised Linear Model (GLM) distribution, i.e. show that

$$f(y; p) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\},$$

where $\theta$, $\phi$, $b(\theta)$ and $c(y, \phi)$ should be determined.

(b) Find the mean $\mu = E[Y]$ in terms of $\theta$.

(c) What is the canonical link function for this GLM?

(d) Find the variance function $V(\mu)$ for the GLM associated with (2).

(e) Let $\ell(\mu; y)$ denote the log likelihood of this GLM for a single observation from (2), but parametrised with $\mu$ rather than $p$. Show that $\ell(\mu; y)$ has the form

$$\ell(\mu; y) = y \log(\mu) - (y + 2) \log(\mu + 2) + a(y), \qquad (3)$$

where the function $a(y)$ should be determined. What is the maximum likelihood estimator of $\mu$ in the model (3)? You may state the result without proof.

(f) Suppose now that we have response data $y_1, \ldots, y_n$. After fitting a particular GLM with response distribution of the form (2), the fitted values corresponding to $y_1, \ldots, y_n$ were found to be $\hat{\mu}_1, \ldots, \hat{\mu}_n$, respectively. Find an expression for the deviance residual for observation $i$.

(g) Consider a plot of fitted values versus deviance residuals, with the fitted values on the horizontal axis. Provide a rough sketch of what you might expect to see in the plot if the variance function is approximately correct for smaller fitted values and approximately correct for fitted values in the middle of the range, but the variance function tends to over-estimate the variance of the response variable for larger fitted values.

Solution to 2(a): Taking logs, and using the fact that if $\theta = \log(p)$ then $p = e^\theta$, we have

$$\log\{f(y; p)\} = 2 \log(1 - p) + \log(y + 1) + y \log(p) = y\theta + 2 \log(1 - e^\theta) + \log(y + 1),$$

which has the required form with $\theta = \log(p)$, $b(\theta) = -2 \log(1 - e^\theta)$, $\phi = 1$ and $c(y, 1) = \log(y + 1)$.
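As a quick numerical sanity check on 2(a), the following R sketch evaluates the probability mass function (2) and its exponential-family form with $\theta = \log(p)$, $b(\theta) = -2\log(1 - e^\theta)$, $\phi = 1$ and $c(y, 1) = \log(y + 1)$. The function names `f_pmf` and `f_glm`, the value p = 0.3 and the truncation point 200 are illustrative choices, not part of the exam.

```r
# Probability mass function (2): f(y; p) = (1 - p)^2 (y + 1) p^y
f_pmf <- function(y, p) (1 - p)^2 * (y + 1) * p^y

# The same function written as exp{(y * theta - b(theta)) / phi + c(y, phi)}
f_glm <- function(y, p) {
  theta <- log(p)                      # natural parameter
  b     <- -2 * log(1 - exp(theta))    # cumulant function b(theta)
  phi   <- 1                           # dispersion parameter
  exp((y * theta - b) / phi + log(y + 1))
}

p <- 0.3      # illustrative value of p (assumption)
y <- 0:200    # truncation of the infinite support (assumption)

total_prob <- sum(f_pmf(y, p))                     # should be very close to 1
max_diff   <- max(abs(f_pmf(y, p) - f_glm(y, p)))  # should be essentially 0
```

If the identification of $\theta$, $b(\theta)$, $\phi$ and $c(y, \phi)$ is right, `total_prob` is 1 up to rounding and the two functions agree at every y.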
Solution to 2(b): From the theory of GLMs,

$$\mu = E[Y] = b'(\theta) = \frac{2 e^\theta}{1 - e^\theta}. \qquad (4)$$

Solution to 2(c): The canonical link function is characterised by the requirement that $\eta = \theta$, where $\eta$ is the linear predictor and $\theta$ is the natural parameter. Making $e^\theta$ the subject of (4), we find that

$$e^\theta = \frac{\mu}{\mu + 2}, \qquad (5)$$

so $\theta = \log(\mu) - \log(\mu + 2) = g(\mu)$ is the canonical link function.

Solution to 2(d): The variance function is given by $b''(\theta)$, expressed as a function of $\mu$. Now

$$b''(\theta) = \frac{2 e^\theta}{(1 - e^\theta)^2} = \mu (1 - e^\theta)^{-1}.$$

From (5) it follows that $1 - e^\theta = 1 - \mu/(\mu + 2) = 2/(\mu + 2)$, and therefore $(1 - e^\theta)^{-1} = 1 + (\mu/2)$. Consequently, the variance function is given by

$$V(\mu) = \mu \left(1 + \frac{\mu}{2}\right) = \mu + \frac{\mu^2}{2}.$$

Solution to 2(e): Since $p = e^\theta = \mu/(\mu + 2)$, after substituting $p = \mu/(\mu + 2)$ into (2), using $1 - p = 2/(\mu + 2)$, we find

$$f(y; p(\mu)) = \left(\frac{2}{\mu + 2}\right)^2 (y + 1) \left(\frac{\mu}{\mu + 2}\right)^y,$$

so taking logs we obtain

$$\begin{aligned}
\ell(\mu; y) = \log\{f(y; p(\mu))\} &= 2 \log(2) - 2 \log(\mu + 2) + \log(y + 1) + y \log(\mu) - y \log(\mu + 2) \\
&= y \log(\mu) - (y + 2) \log(\mu + 2) + a(y),
\end{aligned}$$

as required, where $a(y) = 2 \log(2) + \log(y + 1)$. The maximum likelihood estimator of $\mu$ is $y$.

Solution to 2(f): The deviance contribution from observation $i$ is

$$\begin{aligned}
d_i &= 2[\ell(y_i; y_i) - \ell(\hat{\mu}_i; y_i)] \\
&= 2[y_i \log(y_i) - (y_i + 2) \log(y_i + 2) + a(y_i) - \{y_i \log(\hat{\mu}_i) - (y_i + 2) \log(\hat{\mu}_i + 2) + a(y_i)\}] \\
&= 2\left[ y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i + 2) \log\left(\frac{y_i + 2}{\hat{\mu}_i + 2}\right) \right],
\end{aligned}$$

with the convention that $y_i \log(y_i) = 0$ when $y_i = 0$, and so the deviance residual for observation $i$ is given by $\mathrm{sign}(y_i - \hat{\mu}_i) \sqrt{d_i}$, where sign(.) is the sign function defined in the lectures.

Solution to 2(g): Assuming for simplicity that the horizontal density of the points is not too non-uniform, one would expect to see more or less constant vertical spread for fitted values below the upper region; and for fitted values in the upper region we would expect to see reduced vertical spread compared to the lower and middle regions. Any reasonable attempt should be marked sympathetically.

Mark scheme for Question 2. (a)=4. (b)=3. (c)=3. (d)=4. (e)=4. (f)=4. (g)=3. Total: 25.
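The mean and variance function from 2(b) and 2(d), and the deviance residual from 2(f), can be checked numerically. The R sketch below is only an illustration: the value p = 0.4, the truncation at 500, the helper name `dev_resid` and the example inputs are assumptions made here, and the deviance residual uses the usual square-root definition with the convention that y log(y) = 0 at y = 0.

```r
# pmf (2): f(y; p) = (1 - p)^2 (y + 1) p^y
f_pmf <- function(y, p) (1 - p)^2 * (y + 1) * p^y

p  <- 0.4       # illustrative value of p (assumption)
y  <- 0:500     # truncation of the infinite support (assumption)
pr <- f_pmf(y, p)

mu_num  <- sum(y * pr)                # mean computed directly from the pmf
var_num <- sum((y - mu_num)^2 * pr)   # variance computed directly from the pmf

mu_formula  <- 2 * p / (1 - p)                  # b'(theta) with e^theta = p
var_formula <- mu_formula + mu_formula^2 / 2    # V(mu), since phi = 1

# Deviance residual: sign(y - mu) * sqrt(d), with y * log(y) taken as 0 at y = 0
dev_resid <- function(y, mu) {
  ylogy <- ifelse(y == 0, 0, y * log(y / mu))
  d <- 2 * (ylogy - (y + 2) * log((y + 2) / (mu + 2)))
  sign(y - mu) * sqrt(pmax(d, 0))   # pmax guards against tiny negative rounding
}

# Hypothetical observations and fitted values, for illustration only
dev_resid(c(0, 1, 5), mu = c(1.2, 1.0, 2.5))
```

The residual is zero when the observation equals its fitted value, negative when the observation lies below it and positive when it lies above.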
3(a) Different doses of two chemicals, A and B, were used in a trial whose purpose was to reduce cockroach numbers. The variable x1 gives the dose of chemical A and the variable x2 gives the dose of chemical B. In the R code, the first column of c gives the number of cockroaches killed and the second column of c gives the number of cockroaches that survived. The following R outputs were obtained:

> out=glm(c~x1+x2,family=binomial)
> summary(out)

Call:
glm(formula = c ~ x1 + x2, family = binomial)

Deviance Residuals:
      1        2        3        4        5        6        7        8
 0.7922   0.5388  -0.3190  -1.2973   0.4378  -0.7025   0.4556   2.1441

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -97.8769    22.8731       A 1.88e-05 ***
x1           56.4856    13.6157       B 3.35e-05 ***
x2           -0.5368     0.3122       C        D .
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 284.2024 on 7 degrees of freedom
Residual deviance: 8.1925 on 5 degrees of freedom
AIC: 40.391

Number of Fisher Scoring iterations: 5

> anova(out)
Analysis of Deviance Table

Model: binomial, link: E

Response: c

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                     7          F
x1    1        G         H     11.232
x2    J     3.04         5          K

(i) Determine the missing information indicated by the letters A,B,C,D,
E,F,G,H,J and K. Note that for E you are required to specify the
link function.

(ii) Write down the relevant model in mathematical form, focusing on
the contribution of observation i to the likelihood.

(iii) Briefly indicate your impressions of the results of the statistical
analysis so far.

(iv) What are the next questions you would investigate in the statistical
analysis? State what your next two steps would be.

Solution to 3(a)(i): The missing information is

• A = -4.279 (= -97.8769/22.8731).
• B = 4.149 (= 56.4856/13.6157).
• C = -1.719.
• D = 0.086.
• E is the logit (or logistic) link.
• F = 284.2024.
• G = 272.9704.
• H = 6.
• J = 1.
• K = 8.1925.
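The numerical entries can be recomputed from figures that are printed in the output. The following R sketch is a minimal check under that assumption: the estimates, standard errors and deviances are copied from the summary and anova output above, and the object names are chosen here for illustration.

```r
# Estimates and standard errors copied from the summary(out) table
est <- c("(Intercept)" = -97.8769, x1 = 56.4856, x2 = -0.5368)
se  <- c("(Intercept)" =  22.8731, x1 = 13.6157, x2 =  0.3122)

z <- est / se              # Wald z values: A, B and C
p <- 2 * pnorm(-abs(z))    # two-sided p-values; the x2 entry is D

null_dev <- 284.2024       # F: null deviance on 7 degrees of freedom
G <- null_dev - 11.232     # reduction in deviance when x1 is added
H <- 7 - 1                 # residual df after adding x1
J <- H - 5                 # df used by x2
K <- 11.232 - 3.04         # residual deviance after adding x2 (8.1925 unrounded)

round(z, 3)
round(p[["x2"]], 3)
c(G = G, H = H, J = J, K = K)
```

Note that the z value for the intercept is negative, because the estimate is negative and the standard error is positive.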

Solution to 3(a)(ii): The model states that $y_i$, $i = 1, \ldots, n$, where
here $n = 8$, are independent binomial random variables, with
$y_i \sim \mathrm{Binomial}(m_i, p_i)$, where $m_i$ is the number of binomial trials for
observation $i$ and $p_i$ is the probability of killing a cockroach in each trial
for observation $i$. The mathematical form of $p_i$ is

$$p_i = \frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)} = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})},$$

where $\beta = (\beta_0, \beta_1, \beta_2)^\top$ and $x_i = (1, x_{i1}, x_{i2})^\top$. As usual, the binomial
probabilities are given by

$$\mathrm{Prob}[Y_i = y_i] = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}, \qquad y_i = 0, 1, \ldots, m_i.$$

Solution to 3(a)(iii): It appears that variable x1, the dose of chemical
A, is quite important as the p-value of the Wald test is very small
and the associated reduction in deviance is large. The case for the
importance of the variable x2, the dose of chemical B, is somewhat less
clear; specifically, the p-value based on the Wald test is around 0.09 and
the sign of the coefficient is negative, which perhaps is a bit strange.

Solution to 3(a)(iv): Four possibilities are: to fit the model with
x1 included and x2 excluded, to assess the importance of x2 through
the change-in-deviance test; see if there is any interaction between x1
and x2; try replacing x1 and x2 by their logs, assuming these variables
are positive; and perhaps try some other link functions. Any sensible
suggestions should be marked sympathetically.
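The change-in-deviance test for x2 can already be read off the anova table without refitting. The R sketch below is an illustration that relies on the usual chi-squared approximations, which may be rough if the binomial denominators are small; the deviances are copied from the output above and the object names are chosen here.

```r
# Change in deviance when x2 is added after x1 (anova table): 3.04 on 1 df
p_x2_deviance <- pchisq(3.04, df = 1, lower.tail = FALSE)

# Residual deviance of the model with x1 and x2, as a rough goodness-of-fit check
p_gof <- pchisq(8.1925, df = 5, lower.tail = FALSE)

c(x2_deviance_test = p_x2_deviance, goodness_of_fit = p_gof)
```

The deviance-based p-value for x2 is close to the Wald p-value of about 0.09, so both tests tell a similar story about chemical B.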

Mark scheme for Question 3(a). (a)(i) = 5. (a)(ii)=3. (a)(iii)=3.
(a)(iv)=3. Total: 14.

3(b) Blood groups of peptic ulcer and control patients in London, Manchester
and Newcastle were recorded in a case-control study. Blood
groups A and O were represented in a factor B with two levels; the
cities mentioned were represented as a factor C with three levels, L, M
and N; and U represented a factor with two levels, Control and Ulcer.
In this case-control study it is appropriate to treat B as a response
factor and C and U both as covariate factors. The data were entered
into R as follows.

> B=c("A","A","A","A","A","A","O","O","O","O","O","O")
> C=c("L","L","M","M","N","N","L","L","M","M","N","N")
> U=c("C","U","C","U","C","U","C","U","C","U","C","U")
> count=c(4219,579,3775,246,5261,219,4578,911,4532,361,6598,396)

The following models were fitted, using Poisson regression with log
link. The deviance (as defined in the lecture notes) and the degrees of
freedom (df) are given in the third and fourth columns, respectively.

Model Model Formula Deviance df
M1 count~B+C+U 754.47 7
M2 count~B*C+U 737.75 5
M3 count~B*U+C 700.97 6
M4 count~B+C*U 83.559 5
M5 count~B*U+C*U 30.106 4
M6 count~B*C+C*U 66.878 3
M7 count~B*C+B*U 684.25 4
M8 count~B*C+B*U+C*U 2.9655 2

(i) Construct the three-way contingency table from the data that has
been input into R.

(ii) Provide a list of the models in the table that are relevant to the
situation where B is a response factor and C and U are covariate
factors. Briefly explain why the models on your list, and no others,
are the relevant models.

(iii) What do you conclude from the results in the table? Which model
is to be preferred? How should this model be interpreted? Discuss

Solution to 3(b)(i): It should be an easy exercise though I have not asked
them to do this before so it is possible one or two might struggle.
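One way to obtain the three-way table is to let R cross-tabulate the counts. The sketch below uses the data exactly as entered in the question; the data frame name `d`, the object name `tab` and the choice of layout in `ftable` are illustrative.

```r
# Data as entered in the question, collected into a data frame
d <- data.frame(
  B = c("A","A","A","A","A","A","O","O","O","O","O","O"),
  C = c("L","L","M","M","N","N","L","L","M","M","N","N"),
  U = c("C","U","C","U","C","U","C","U","C","U","C","U"),
  count = c(4219,579,3775,246,5261,219,4578,911,4532,361,6598,396)
)

# Three-way table of counts indexed by blood group, city and ulcer status
tab <- xtabs(count ~ B + C + U, data = d)

# Flattened layout: one row per city and status, one column per blood group
ftable(tab, row.vars = c("C", "U"), col.vars = "B")
```

Each row of the flattened table is one combination of city and case/control status, which is the natural layout when B is treated as the response.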

Solution to 3(b)(ii): If we wish to treat B as a response factor and C and
U as covariate factors, then we need to fix the totals at every combination of
levels of C and U. This is achieved by only considering those models which
include the interaction term C*U. In the list, the models which have the
interaction term C*U are: M4, M5, M6 and M8.
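The role of the C*U term can be seen numerically: in a Poisson log-linear model that contains C*U, the fitted counts summed over B reproduce the observed total at every combination of C and U. The R sketch below illustrates this for M4 and also fits M8; the data frame `d` and the other object names are chosen here for illustration.

```r
d <- data.frame(
  B = c("A","A","A","A","A","A","O","O","O","O","O","O"),
  C = c("L","L","M","M","N","N","L","L","M","M","N","N"),
  U = c("C","U","C","U","C","U","C","U","C","U","C","U"),
  count = c(4219,579,3775,246,5261,219,4578,911,4532,361,6598,396)
)

# M4 is the smallest listed model containing C*U; M8 adds both B interactions
m4 <- glm(count ~ B + C * U, family = poisson, data = d)
m8 <- glm(count ~ B * C + B * U + C * U, family = poisson, data = d)

# Observed and fitted totals for each (C, U) combination, summed over B
d$fit4 <- fitted(m4)
obs_CU <- xtabs(count ~ C + U, data = d)
fit_CU <- xtabs(fit4 ~ C + U, data = d)

all.equal(as.vector(obs_CU), as.vector(fit_CU), tolerance = 1e-6)
c(M4 = df.residual(m4), M8 = df.residual(m8))
```

The residual degrees of freedom agree with the df column of the table of models (5 for M4 and 2 for M8).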

Solution to 3(b)(iii): Using the change of deviance test, we clearly reject
all of the models in the table except for M8, which we do not reject at any
reasonable level as the p-value is 0.23. This model is not particularly easy to
interpret: the distribution of B depends on the level of both C and U, but this
model is still simpler than the saturated model (corresponding to the 3-way
interaction).
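The p-values behind this conclusion can be computed from the deviances in the table of models. The following R sketch is a minimal illustration using the chi-squared approximation; the deviances and degrees of freedom are copied from the table for the models containing C*U, and the object names are chosen here.

```r
# Residual deviances and degrees of freedom copied from the table of models
dev      <- c(M4 = 83.559, M5 = 30.106, M6 = 66.878, M8 = 2.9655)
resid_df <- c(M4 = 5,      M5 = 4,      M6 = 3,      M8 = 2)

# Test of each model against the saturated model
p_fit <- pchisq(dev, df = resid_df, lower.tail = FALSE)

# Change-of-deviance test of M5 (nested in M8): is the B*C term needed?
p_M5_vs_M8 <- pchisq(dev[["M5"]] - dev[["M8"]],
                     df = resid_df[["M5"]] - resid_df[["M8"]],
                     lower.tail = FALSE)

signif(p_fit, 3)
p_M5_vs_M8
```

Only M8 has a p-value that is not small, and dropping B*C from M8 (giving M5) increases the deviance by far more than two degrees of freedom can explain.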

Mark scheme for Question 3(b). (b)(i)=3. (b)(ii)=5. (b)(iii)=3.
Total: 11.

Total for Question 3: 14+11=25.
