Assignment 1 – CSC/DSC 265/465 – Spring 2020 – Due February 27, 2020
Unless otherwise specified, statistical significance can be taken to hold when the relevant P-value is no larger than α = 0.05. Note that problem Q4 is reserved for graduate students. All questions have equal marks.
Q1: Consider the matrix representation of the multiple linear regression model
y = Xβ + ε    (1)
where y is an n × 1 response vector, X is an n × q matrix, β is a q × 1 vector of coefficients, and ε is an n × 1 vector of error terms.
(a) Why can a unique least squares estimate of β exist only if the matrix XᵀX is invertible?
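Recall that β̂ is a least squares estimate exactly when it solves the normal equations
XᵀX β̂ = Xᵀy,
which follow from setting the gradient of the residual sum of squares ‖y − Xβ‖² to zero.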
(b) Suppose we are given paired observations of the form (x1, y1), . . . , (xn, yn), where each xi ∈ {1, 2, 3} is one of three values, and yi ∼ N(μk, σ²) if xi = k. Assume that the responses yi are independent, and that the variance σ² is the same for all responses.
We decide to express this model as a linear regression model by defining three predictors X1,X2,X3, associated with the three outcomes of xi, using indicator variables, that is,
Xi1 = I{xi = 1},
Xi2 = I{xi = 2},
Xi3 = I{xi = 3},
for i = 1,…,n. Then suppose we attempt to fit the model
yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + εi,  i = 1, …, n,    (2)
where εi ∼ N(0, σ²). We may express this model in the matrix form of Equation (1). Derive the matrix XᵀX. Is this matrix invertible? HINT: Let nk be the number of times xi = k, for each k = 1, 2, 3.
(c) Show that if any of the four terms associated with coefficients β0, …, β3 is deleted from Equation (2), then the resulting matrix XᵀX will be invertible.
(d) In Part (c), four linear regression models are obtained by deleting one of the four terms associated with the coefficients. Show that the least squares fits of these four models all give the same fitted values, and that the models are therefore equivalent.
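A minimal R sketch for checking Parts (b)–(d) numerically; the simulated data and object names below are illustrative only:
> set.seed(1)
> x = sample(1:3, 20, replace = TRUE)              # simulated labels xi
> y = rnorm(20, mean = c(1, 2, 3)[x])              # yi ~ N(mu_k, 1) when xi = k
> X1 = as.numeric(x == 1); X2 = as.numeric(x == 2); X3 = as.numeric(x == 3)
> fit.full = lm(y ~ X1 + X2 + X3)                  # one coefficient is NA: XᵀX is singular
> fit.drop1 = lm(y ~ X2 + X3)                      # delete the X1 term
> fit.drop0 = lm(y ~ X1 + X2 + X3 - 1)             # delete the intercept
> max(abs(fitted(fit.drop1) - fitted(fit.drop0)))  # fitted values agree up to rounding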
Q2: For this question, use the cats data set from the MASS package. This data set includes the following variables, observed for each of n = 144 cats:
Sex: factor with levels “F” and “M”.
Bwt: body weight in kg.
Hwt: heart weight in g.
(a) Suppose we have a linear relationship y = β0 + β1x between two variables x, y. If β1 ≠ 0, this can always be rewritten as x = β0′ + β1′y. Express β0′ and β1′ as functions of β0 and β1.
(b) Fit the following linear models using the lm() function:
Hwt ∼ Bwt
and
Bwt ∼ Hwt.
Do the least squares coefficients of the two models conform to the equivalence relationship given in Part (a)? Construct a scatter plot of the paired Hwt and Bwt observations, placing Hwt on the vertical axis. For both models, superimpose on this plot the estimated linear relationship between Hwt and Bwt. Provide a brief explanation for your results.
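One possible R sketch for the fits and the overlay (the colors and the re-expression of the second fit are one choice among several):
> library(MASS)
> fit1 = lm(Hwt ~ Bwt, data = cats)
> fit2 = lm(Bwt ~ Hwt, data = cats)
> plot(Hwt ~ Bwt, data = cats)                    # Hwt on the vertical axis
> abline(fit1, col = "blue")                      # Hwt = b0 + b1 Bwt
> b = coef(fit2)                                  # Bwt = b0' + b1' Hwt
> abline(a = -b[1]/b[2], b = 1/b[2], col = "red") # solved for Hwt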
(c) Fit the following three models (expressed using R’s model formula notation):
Hwt ∼ Bwt [Model 1]
Hwt ∼ Bwt + Sex [Model 2]
Hwt ∼ Bwt * Sex [Model 3]
For each model, construct a scatter plot of Hwt and Bwt (place Hwt on the vertical axis) and superimpose the estimated regression line (for Models 2 and 3, plot separate lines for the two Sex classes, and use a legend to identify the line associated with each class). Is there statistical evidence at an α = 0.05 significance level that either Model 2 or Model 3 improves on Model 1?
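As a hint on mechanics, nested linear models can be compared in R with an F-test via anova(); a sketch, with per-class lines drawn from predict():
> library(MASS)
> fitA = lm(Hwt ~ Bwt, data = cats)            # Model 1
> fitB = lm(Hwt ~ Bwt + Sex, data = cats)      # Model 2
> fitC = lm(Hwt ~ Bwt * Sex, data = cats)      # Model 3
> anova(fitA, fitB)                            # does Model 2 improve Model 1?
> anova(fitA, fitC)                            # does Model 3 improve Model 1?
> plot(Hwt ~ Bwt, data = cats)
> bw = seq(min(cats$Bwt), max(cats$Bwt), length.out = 50)
> lines(bw, predict(fitC, data.frame(Bwt = bw, Sex = "F")), col = "red")
> lines(bw, predict(fitC, data.frame(Bwt = bw, Sex = "M")), col = "blue")
> legend("topleft", legend = c("F", "M"), col = c("red", "blue"), lty = 1)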
Q3: For this question, use the Insurance data set from the MASS package. This data set includes the following variables, observed for each of n = 64 groups of policyholders:
District: factor; district of residence of policyholder (1 to 4); 4 is major cities.
Group: an ordered factor; group of car with levels <1 litre, 1–1.5 litre, 1.5–2 litre, >2 litre.
Age: an ordered factor; the age of the insured, in 4 groups labelled <25, 25–29, 30–35, >35.
Holders: numbers of policyholders.
Claims: numbers of claims.
(a) Fit a linear model with response Claims, and the remaining variables as predictors. Create a residual plot (residuals against fitted values). Also create a normal quantile plot for the residuals. Do the usual assumptions for linear regression seem reasonable in this case? Comment briefly.
(b) We will try to transform Claims using the function h(x) = log(x + a) (use the natural logarithm). For the standard log-transformation we would set a = 0. Why can’t we do that here? Repeat Part (a) after replacing response Claims with the transformed response h(Claims). Use a = 1, then a = 10. Which succeeds better in normalizing the residuals?
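A sketch of the diagnostics for one choice of a (here a = 10; the a = 0 and a = 1 versions are analogous):
> library(MASS)
> fit10 = lm(log(Claims + 10) ~ District + Group + Age + Holders, data = Insurance)
> plot(fitted(fit10), resid(fit10))            # residuals against fitted values
> qqnorm(resid(fit10)); qqline(resid(fit10))   # normal quantile plot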
(c) We can, in principle, consider all models built from subsets of the original four predictors, ranging from the full model with all four predictors down to the model with no predictors. We can assume all models include an intercept term. How many such models are there?
(d) Create a list in R of model formulae representing the collection of models defined in Part (c). Note that we can obtain the full model formula, then remove a predictor from the model with the following code:
> fit1 = lm(log(Claims+10) ~ .,data=Insurance)
> full.formula = formula(terms(fit1))
> next.formula = update(full.formula, ~ . -District)
> full.formula
log(Claims + 10) ~ District + Group + Age + Holders
> next.formula
log(Claims + 10) ~ Group + Age + Holders
>
Use this list to calculate R²adj for each model. Identify the model with the largest R²adj.
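Continuing the hint above, the formulae can be evaluated in a loop; a sketch, where the list shown holds only two formulae and extending it to the full collection is part of the exercise:
> formulae = list(full.formula, next.formula)  # extend to all models from Part (c)
> fits = lapply(formulae, lm, data = Insurance)
> r2.adj = sapply(fits, function(f) summary(f)$adj.r.squared)
> formulae[[which.max(r2.adj)]]                # formula with the largest adjusted R²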
Q4: [For Graduate Students] Consider question Q2.
(a) Using Model 3, show how to construct a two-sided hypothesis test against the null hypothesis
H0 : μM,x − μF,x = 0,
where μM,x, μF,x are the mean heart weights of male and female cats of body weight x kg. Construct a plot of the observed t-statistic used in this hypothesis test as a function of x, where x ranges from 0 to 5 in increments of 0.1. Does the t-statistic appear to be bounded over all x?
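A sketch of the mechanics, assuming R's default parameterization of Model 3 with coefficients ordered as (Intercept, Bwt, SexM, Bwt:SexM); check coef() on your own fit before relying on this ordering:
> library(MASS)
> fitC = lm(Hwt ~ Bwt * Sex, data = cats)
> b = coef(fitC); V = vcov(fitC)
> tstat = function(x) {
+   a = c(0, 0, 1, x)                          # contrast for SexM + x * (Bwt:SexM)
+   drop(sum(a * b) / sqrt(t(a) %*% V %*% a))
+ }
> xs = seq(0, 5, by = 0.1)
> plot(xs, sapply(xs, tstat), type = "l", xlab = "x (kg)", ylab = "t-statistic")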
(b) What is the P-value for testing the null hypothesis that Model 3 does not improve Model 1? Is there a significant improvement at an α = 0.1 significance level? What is the two-sided P-value against H0 : μM,3.5 − μF,3.5 = 0? If μM,3.5 ≠ μF,3.5, does this imply that Model 1 is incorrect?
(c) For large samples, we may reject simultaneously at a level of significance α (two-sided) all hypotheses
H0 : aᵀβ = 0
for which the absolute value of the t-statistic exceeds (χ²p;α)^(1/2), where χ²p;α is the α critical value of a χ² distribution with p degrees of freedom, and p is the model degrees of freedom (for example, Cox & Ma (1995) Biometrics). What implication does this have for the issue raised in Part (b)?