MAST90138 Assignment 1
Instructions:
• The assignment contains 2 problems worth a total of 100 points, which will count towards 15% of the final mark for the course. If you typeset your assignment neatly with LaTeX and knitr, you can earn up to 0.75% towards the final mark for the course as extra credit.
• Use tables, graphs and concise text explanations to support your answers. Unclear answers may not be marked, at your own cost. All tables and graphs must be clearly identified and commented.
• No late submission is allowed.
Data: In the assignment you will analyse some wheat data. The dataset is available in .txt format on the LMS along with this assignment. The data come from three different varieties of wheat, denoted by 1 to 3 in the dataset. Each row of the dataset corresponds to a different wheat kernel. Seven numerical characteristics were measured on each kernel: X1: area, X2: perimeter, X3: compactness, X4: length of kernel, X5: width of kernel, X6: asymmetry coefficient, X7: length of kernel groove. The eighth variable, X8, takes the value 1, 2 or 3 depending on the variety of wheat the kernel comes from.
Problem 1 [60 marks]:
(a) State explicitly all possible values that a and b can take in order for the following matrix to be a covariance matrix. Give arguments that justify your answer. [20 marks]
\[ \Sigma = \begin{pmatrix} 1 & 2 \\ a & b \end{pmatrix}. \]
(b) Compute explicitly, and without using R, all the eigenvectors and eigenvalues of the matrix
\[ \Sigma = \begin{pmatrix} 13 & -4 \\ -4 & 7 \end{pmatrix}. \]
Deduce from these two orthogonal eigenvectors of norm 1 of that matrix. Give explicitly an orthogonal matrix Γ and a diagonal matrix Λ such that we can write [20 marks]
\[ \Sigma = \Gamma \Lambda \Gamma^T. \]
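As a reminder of the mechanics, here is the same kind of hand computation on a different matrix. The matrix A below is an illustrative choice and is not part of the assignment; it is only meant to show the steps (characteristic polynomial, eigenvectors, normalisation, assembly of Γ and Λ).

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2 - \lambda)^2 - 1 = (\lambda - 3)(\lambda - 1),
\]
so the eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = 1$. Solving
$(A - \lambda_j I)v = 0$ gives
\[
v_1 \propto \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
v_2 \propto \begin{pmatrix} 1 \\ -1 \end{pmatrix},
\]
which are orthogonal. Dividing each by its norm $\sqrt{2}$ yields
\[
\Gamma = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
\Lambda = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\Gamma \Lambda \Gamma^T = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = A.
\]
```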
(c) Read the wheat data in R and create a data matrix X of size n × p, where n = 210 and p = 7, which contains the seven attributes X1 to X7 described above from all n kernels. Then create a vector of length n which contains, for each kernel, the wheat variety it comes from, coded 1 to 3 as described above. If you use the menus in RStudio to read your data, please print out the corresponding instructions (they are given by RStudio). [10 marks]
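A minimal sketch of this kind of import is given below. Since the name and exact layout of the LMS file are not stated here, the sketch writes a tiny stand-in file with three made-up rows (the numbers are placeholders, not wheat measurements) and assumes whitespace-separated columns with no header line. The names toy_file, wheat and variety are illustrative.

```r
# Hypothetical stand-in for the LMS file: three made-up rows, eight columns.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "15.2 14.8 0.87 5.7 3.3 2.2 5.2 1",
  "17.6 15.9 0.88 6.0 3.8 4.1 5.9 2",
  "11.8 13.2 0.85 5.2 2.8 5.0 5.0 3"
), toy_file)

# read.table() splits on whitespace by default; header = FALSE is an assumption.
wheat <- read.table(toy_file, header = FALSE)

# The first seven columns form the n x p data matrix.
X <- as.matrix(wheat[, 1:7])
colnames(X) <- paste0("X", 1:7)

# The eighth column holds the variety label, coded 1 to 3.
variety <- wheat[, 8]

dim(X)           # number of rows and columns of X
table(variety)   # number of kernels per variety
```

With the real file, only the path passed to read.table() (and possibly the header argument) would need to change; checking dim(X) against the stated n and p is a quick sanity test.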
(d) Using R, for the unbiased sample covariance matrix S of X at (c), give explicitly an orthogonal matrix Γ and a diagonal matrix Λ such that we can write [10 marks]
\[ S = \Gamma \Lambda \Gamma^T. \]
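The following sketch shows one way such a decomposition can be obtained and checked in base R. It uses four numeric columns of the built-in iris data purely as illustrative input, not the wheat data; the names Z, Gamma and Lambda are assumptions of the sketch.

```r
# Illustrative data only: four numeric columns of the built-in iris data set.
Z <- as.matrix(iris[, 1:4])

# cov() divides by n - 1, so S is the unbiased sample covariance matrix.
S <- cov(Z)

# Spectral decomposition of the symmetric matrix S.
eig <- eigen(S, symmetric = TRUE)
Gamma  <- eig$vectors        # columns are orthonormal eigenvectors
Lambda <- diag(eig$values)   # eigenvalues on the diagonal, in decreasing order

# Numerical checks: Gamma' Gamma = I and Gamma Lambda Gamma' = S.
orth_err  <- max(abs(crossprod(Gamma) - diag(ncol(Z))))
recon_err <- max(abs(Gamma %*% Lambda %*% t(Gamma) - S))
c(orth_err, recon_err)       # both should be at rounding-error level
```

Passing symmetric = TRUE tells eigen() to treat S as symmetric, which guarantees real eigenvalues and orthonormal eigenvectors in the output.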
Problem 2 [40 marks]:
In our lecture and in Ch. 5 of the textbook by Härdle and Simar, we briefly discussed Hotelling's $T^2$ test, which is a multivariate generalization of the univariate t-test. Here we will get a taste of how it is done in R. You are expected to use the help() function in R to learn the suggested R functions below. The data frame pulmonary in the ICSNP package of R measures the difference in pulmonary function in 12 workers after being exposed to cotton dust for 6 hours. There are three measurements: forced vital capacity, forced expiratory volume, and closing capacity. For convenience we will let $Y_i = (Y_{i1}, Y_{i2}, Y_{i3})'$ be the vector of observations for worker $i$. We will apply Hotelling's $T^2$ test to see whether the means of the three variables are all zero, i.e. $E[Y] = 0$.
(a) Make a scatterplot for this dataset. [5]
(b) One assumption of Hotelling's $T^2$ test is that the data come from a multivariate normal distribution. If this assumption is valid, we would expect the squared Mahalanobis distances
\[ (Y_i - \bar{Y})' S^{-1} (Y_i - \bar{Y}) \]
to roughly follow a chi-squared distribution with 3 degrees of freedom, where S is the unbiased sample covariance matrix of the $Y_i$'s. Use this fact to check with R whether normality holds. What is your conclusion? [15]
(Suggested functions to use: mahalanobis, qqnorm, pchisq, qnorm. Recall that for a random variable X with a continuous cumulative distribution function C(·), the variable C(X) is uniformly distributed on [0, 1].)
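One way the suggested functions can be chained together is sketched below on simulated data (12 rows, 3 columns of independent standard normals), not on the pulmonary data. The idea is the probability integral transform from the hint: if the squared distances are roughly chi-squared, pchisq() maps them to roughly uniform values and qnorm() maps those to roughly standard normal values, which qqnorm() can then display. The variable names are illustrative.

```r
set.seed(1)   # simulated stand-in, not the pulmonary data
n <- 12
p <- 3
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)

# mahalanobis() returns the SQUARED distances (Y_i - Ybar)' S^{-1} (Y_i - Ybar).
d2 <- mahalanobis(Y, center = colMeans(Y), cov = cov(Y))

# If d2 is roughly chi-squared with p degrees of freedom, u is roughly uniform
# and z is roughly standard normal.
u <- pchisq(d2, df = p)
z <- qnorm(u)

# Normal Q-Q plot of the transformed distances; points near the line y = x
# are consistent with the chi-squared approximation.
qqnorm(z, main = "Q-Q plot of transformed Mahalanobis distances")
abline(0, 1)
```

With only 12 observations the plot will be noisy even for truly normal data, so moderate departures from the line should be read with caution.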
(c) Regardless of your conclusion above, we will proceed with the Hotelling test. Do this by using the function HotellingsT2 in R, which automatically gives a p-value for the test. Report this p-value. Next, compute this same p-value "manually" as follows: compute the $T^2$ statistic using elementary matrix operations in R, and calibrate the p-value using the function pf(), based on Theorem 5.9 in Härdle and Simar. Be careful with the degrees of freedom. [20]
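A rough sketch of the manual computation is shown below on simulated data, not on the pulmonary data. It relies on the standard one-sample result that, under the null hypothesis and multivariate normality, (n - p) / (p (n - 1)) times $T^2$ follows an F distribution with p and n - p degrees of freedom; how this matches the statement of Theorem 5.9 in the textbook is left for the reader to verify. The names mu0, F_stat and p_value are illustrative.

```r
set.seed(2)   # simulated stand-in, not the pulmonary data
n <- 12
p <- 3
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)

ybar <- colMeans(Y)   # sample mean vector
S    <- cov(Y)        # unbiased sample covariance matrix
mu0  <- rep(0, p)     # hypothesised mean under H0

# T^2 = n (ybar - mu0)' S^{-1} (ybar - mu0); solve(S, b) avoids forming S^{-1}.
T2 <- n * drop(t(ybar - mu0) %*% solve(S, ybar - mu0))

# Rescale T^2 to an F statistic and take the upper-tail probability.
F_stat  <- (n - p) / (p * (n - 1)) * T2
p_value <- pf(F_stat, df1 = p, df2 = n - p, lower.tail = FALSE)

c(T2 = T2, F = F_stat, p.value = p_value)
```

If the ICSNP package is installed, the result can be compared with the output of HotellingsT2(Y, mu = mu0); assuming its default F-based test, the two p-values should agree.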