Question 1
We start with some short theory questions:
(a) [1 mark] Suppose we had 250-dimensional observations $x_1, x_2, \ldots, x_{1000}$ that have a multivariate normal distribution with mean zero and covariance $2I_p$. From these observations we construct the sample covariance matrix $S$ and plot a histogram of the eigenvalues of $S$. What distribution will approximate the density of the eigenvalues? And with what parameters?
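To visualise this set-up, here is a minimal simulation sketch in R (assuming the sample covariance is formed as $S = \frac{1}{n}X^T X$, since the mean is known to be zero):

set.seed(1)
p <- 250; n <- 1000
X <- matrix(rnorm(n * p, mean = 0, sd = sqrt(2)), nrow = n)  # rows are observations
S <- crossprod(X) / n                       # sample covariance (known zero mean)
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
hist(ev, breaks = 50, freq = FALSE, xlab = "eigenvalue",
     main = "Eigenvalues of S")             # compare with your limiting density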
(b) [1 mark] What does the Fisher limiting spectral distribution (LSD) $P_{s,t}(x)$ describe? And why does the left endpoint of this distribution converge to $\frac{1}{4}(1-s)^2$ as $t \to 1$?
(c) [2 marks] Consider a sequence of two-dimensional random vectors $(x_i)_{i \ge 0}$ with $x_i := (x_{i1}, x_{i2})'$ drawn from the bivariate normal distribution
$$x_i \sim N_2\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \right).$$
What is the asymptotic distribution of $X_n := \bar{x}_{n1} + \bar{x}_{n2}^2$ as $n \to \infty$, where $\bar{x}_{n1}$ and $\bar{x}_{n2}$ denote the sample means of the two coordinates?
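A Monte Carlo sanity-check sketch for (c), using the parameters as reconstructed above (by the law of large numbers $X_n \to 1 + 2^2 = 5$, so we centre there; MASS::mvrnorm draws the bivariate normals):

library(MASS)
set.seed(1)
mu <- c(1, 2); Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
n <- 5000; reps <- 2000
Xn <- replicate(reps, {
  x <- mvrnorm(n, mu, Sigma)
  mean(x[, 1]) + mean(x[, 2])^2        # X_n = xbar_{n1} + xbar_{n2}^2
})
hist(sqrt(n) * (Xn - 5), breaks = 50, freq = FALSE,
     main = "Centered and scaled X_n") # compare with your delta-method answer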
(d) [1 mark] Why can we associate a distribution to the largest eigenvalue $\lambda_1$ of a sample covariance matrix $S$?
Question 2
In this question, we are going to consider the topic of factor analysis where the aim is to describe the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Consider n daily returns from 1 Jan 2018 to 1 Jan 2019 for p = 11 stocks: BHP, RIO, ANZ, NAB, CBA, WBC, GXY, NUF, CGC, CGF, WSA. You can use the Rmd file I’ve provided to download this data.
Now implement the "Principal Component Method" (PCM) found in Section 9.3 of [A] using the correlation matrix $R$ of the daily returns. That is:
(a) [1 mark] Determine the number of factors $m$ using a screeplot and print out $\tilde{L}$ as a table showing stock names as row labels and factor numbers as column headers.
(b) [1 mark] Print out the communalities $h_i^2$, the estimated specific variances $\psi_i$, and the proportion of total sample variance due to the $j$th factor. Use these values to argue that you have made the correct choice of $m$.
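A sketch of the Principal Component Method for (a) and (b), assuming daily.returns is the $n \times 11$ matrix of returns from the provided Rmd file (the choice m <- 2 is a placeholder to be replaced by your screeplot decision):

R <- cor(daily.returns)
e <- eigen(R, symmetric = TRUE)
plot(e$values, type = "b", xlab = "component", ylab = "eigenvalue")  # screeplot
m <- 2                                        # placeholder: your choice of m
tilde.L <- e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]))
dimnames(tilde.L) <- list(colnames(daily.returns), paste0("Factor", 1:m))
round(tilde.L, 3)                             # loadings table for (a)
h2   <- rowSums(tilde.L^2)                    # communalities
psi  <- 1 - h2                                # specific variances (diag(R) = 1)
prop <- e$values[1:m] / ncol(R)               # proportion of total sample variance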
Now implement the "Maximum Likelihood Method" (see p. 495 in [A]) using the correlation matrix $R$ of the daily returns. For this part, you are allowed to use the built-in factanal command in R, which implements the ML method by default.
(c) [1 mark] First, perform the Maximum Likelihood Method (MLM) without rotation using something like fit = factanal(daily.returns, factors = m, rotation = "none", covmat = R), where m is the value of $m$ you found previously and R is the correlation matrix. Print out the loadings with print(fit) and compare them to those found using the principal component method. What do you notice? Can you give a label to one (or more) of these factors?
(d) [1 mark] We are now going to perform some factor rotations; see Section 9.4 in [A]. Perform a varimax orthogonal rotation using varimax(tilde.L), where tilde.L was derived using PCM. Now extract the MLM estimated loading matrix $\hat{L}$ from fit using hat.L = loadings(fit). Now perform a varimax orthogonal rotation using varimax(hat.L) and an oblique rotation using promax(hat.L). After rotation, do the loadings group the stocks in the same manner? Which rotation do you prefer? For your favourite rotation, can you give labels to the factors?
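A sketch of the rotations in (d), assuming tilde.L and fit are the objects from the previous parts:

rot.pcm <- varimax(tilde.L)   # varimax rotation of the PCM loadings
hat.L   <- loadings(fit)      # MLM estimated loading matrix
rot.mlm <- varimax(hat.L)     # orthogonal rotation of the MLM loadings
obl.mlm <- promax(hat.L)      # oblique rotation of the MLM loadings
print(rot.pcm$loadings); print(rot.mlm$loadings); print(obl.mlm$loadings)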
(e) [1 mark] Now perform a large-sample test for the number of common factors by testing the hypothesis $H_0 : \Sigma = LL^T + \Psi$ with your choice of $m$ at level $\alpha = 0.05$. The test is given in Eq. (9-39) of [A] and, since we are using the correlation matrix $R$, it is based on the determinant of the matrix
$$R^{-1}(LL^T + \Psi) \qquad (1)$$
and uses a chi-square approximation to the sampling distribution. Implement this test in R. For your choice of $m$, do you accept or reject the null hypothesis?
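A sketch of one standard form of this test, assuming n is the number of daily returns and m is your chosen number of factors; the Bartlett-corrected statistic below should be checked against Eq. (9-39) in [A]:

p <- ncol(R)
hat.L <- loadings(fit)                 # MLM loading matrix
psi   <- fit$uniquenesses              # MLM specific variances
LLt.psi <- hat.L %*% t(hat.L) + diag(psi)
stat <- (n - 1 - (2 * p + 4 * m + 5) / 6) * log(det(LLt.psi) / det(R))
df   <- ((p - m)^2 - p - m) / 2
pval <- pchisq(stat, df = df, lower.tail = FALSE)
pval                                   # reject H0 at alpha = 0.05 if pval < 0.05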
(f) [1 mark] Considering the theory we have learnt this semester, comment on the form of (1) and the use of the chi-square approximation to the sampling distribution in the situation where the number of stocks $p$ becomes large and $y_n := p/n = 0.5$. What might be a better alternative for this high-dimensional case?
Question 3
We are now going to consider the theory of spiked Fisher matrices from the recent paper [B]. Consider two $p$-variate populations with covariance matrices $\Sigma_1$ and $\Sigma_2 = I_p$, and let $S_1$ and $S_2$ be the sample covariance matrices for samples of the two populations with degrees of freedom $m$ and $n$, respectively. We set $S := S_2^{-1} S_1$.
(a) [1 mark] Suppose we had $p$-dimensional random variables $x_1, \ldots, x_{m+1} \sim N_p(0, \Sigma_1)$ and $p$-dimensional random variables $z_1, \ldots, z_{n+1} \sim N_p(0, I_p)$. We stack these random variables to obtain the data matrices $X$ and $Z$ and sample covariance matrices
$$S_1 := \frac{1}{m} XX^T, \qquad S_2 := \frac{1}{n} ZZ^T, \qquad S := S_2^{-1} S_1.$$
Now assume $n, m, p \to \infty$ such that $y_p := p/n \to y \in (0, 1)$ and $c_p := p/m \to c > 0$. For $y = 1/2$ and $c = 1/4$, what is the upper bound of the limiting spectral distribution of $S$? [0.5 marks] Plot the limiting spectral density of the eigenvalues of $S$. [0.5 marks]
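A finite-size simulation sketch to compare against the limiting density in (a), using $y = 1/2$ and $c = 1/4$ (so $n = 2p$ and $m = 4p$):

set.seed(1)
p <- 200; n <- 2 * p; m <- 4 * p
X <- matrix(rnorm(p * m), nrow = p)   # columns x_1, ..., x_m, Sigma_1 = I_p
Z <- matrix(rnorm(p * n), nrow = p)   # columns z_1, ..., z_n, Sigma_2 = I_p
S1 <- X %*% t(X) / m
S2 <- Z %*% t(Z) / n
ev <- Re(eigen(solve(S2) %*% S1, only.values = TRUE)$values)
hist(ev, breaks = 60, freq = FALSE, xlab = "eigenvalue",
     main = "Eigenvalues of the Fisher matrix S")  # overlay your LSD density here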
(b) [1 mark] Suppose that $\Sigma_1 = \Sigma_2 + \Delta$ where $\Delta = \mathrm{diag}(\underbrace{a_1, \ldots, a_1}_{n_1}, 0, \ldots, 0)$ and $a_1 > 0$, i.e., $\Sigma_2$ is perturbed by a rank-$n_1$ diagonal matrix $\Delta$. What is the critical value $\kappa$ for which $a_1 > \kappa$ creates "outlier" sample eigenvalues? [0.5 marks] Suppose that $a_1 = \kappa + 1$, $c = 2/3$ and $y = 1/3$; what value do you expect these outlier eigenvalues to cluster around? [0.5 marks]
(c) [1 mark] Continuing question (b), what would you expect to happen if $a_1$ were only slightly larger than 1 (and less than $\kappa$)?
(d) [1 mark] Perform a simulation experiment to illustrate the phenomena in (b) in the case $\Sigma_2 = I_p$. That is, sample data and plot a histogram of the eigenvalues of $S$ and compare it to the density obtained in (a). Can you see outlier eigenvalues?
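A sketch for (d), reusing the set-up above with $c = 2/3$ and $y = 1/3$; the spike strength a1 is a placeholder to be replaced by the value $\kappa + 1$ from (b), and the spike multiplicity n1 is also illustrative:

set.seed(1)
p <- 300; n <- 3 * p; m <- round(1.5 * p)       # y = p/n = 1/3, c = p/m = 2/3
n1 <- 4; a1 <- 3                                # placeholders: spike rank and strength
d1 <- c(rep(1 + a1, n1), rep(1, p - n1))        # diagonal of Sigma_1 = I_p + Delta
X <- sqrt(d1) * matrix(rnorm(p * m), nrow = p)  # rows scaled by sqrt(Sigma_1 diagonal)
Z <- matrix(rnorm(p * n), nrow = p)
S1 <- X %*% t(X) / m
S2 <- Z %*% t(Z) / n
ev <- sort(Re(eigen(solve(S2) %*% S1, only.values = TRUE)$values), decreasing = TRUE)
hist(ev, breaks = 60, freq = FALSE, main = "Spiked Fisher eigenvalues")
head(ev, n1 + 2)                                # inspect potential outliers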
(e) [1 mark] Perform a simulation experiment to empirically calculate the power of the method proposed in Section 7.1 of [B].
(f) [1 mark] Compare the results of your simulation experiment to the closed-form formula given in Theorem 7.1 of [B].
(g) [2 marks] Consider the signal detection problem where we are trying to determine the number of signals in observations of the form
$$x_i = U s_i + \varepsilon_i, \quad i = 1, \ldots, m, \qquad \text{(SD)}$$
where the $x_i$'s are $p$-dimensional observations, $s_i$ is a $k \times 1$ low-dimensional signal ($k \ll p$) with covariance $I_k$, $U$ is a $p \times k$ mixing matrix, and $(\varepsilon_i)$ is i.i.d. noise with covariance matrix $\Sigma_2$. None of the quantities on the right-hand side of (SD) are observed. In [B], they propose to estimate the number of signals $k$ by
$$\hat{k} := \max\{ i : \lambda_i \ge \beta + \log(p/p^{2/3}) \},$$
where $(\lambda_i)$ are the eigenvalues of $S$. Reproduce Table 1 in [B] for the Gaussian case for the values $p = 25, 75, 125, 175, 225, 275$.
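A sketch of one Monte Carlo draw for (g) in the Gaussian case with $\Sigma_2 = I_p$; the mixing matrix, signal strength, sample-size ratios, and the formula used for the bulk edge $\beta$ are assumptions for this illustration and should be verified against [B]:

set.seed(1)
p <- 75; m <- 4 * p; n <- 2 * p; k <- 2
y <- p / n; c <- p / m
h <- sqrt(c + y - c * y)
beta <- (1 + h)^2 / (1 - y)^2                        # assumed bulk right edge; check [B]
U <- sqrt(20) * qr.Q(qr(matrix(rnorm(p * k), p, k))) # placeholder mixing matrix
sig <- matrix(rnorm(k * m), k, m)                    # signals s_i with covariance I_k
X <- U %*% sig + matrix(rnorm(p * m), p, m)          # observations from (SD)
Z <- matrix(rnorm(p * n), p, n)                      # independent noise sample for S2
S1 <- X %*% t(X) / m
S2 <- Z %*% t(Z) / n
lam <- sort(Re(eigen(solve(S2) %*% S1, only.values = TRUE)$values), decreasing = TRUE)
khat <- sum(lam >= beta + log(p / p^(2/3)))          # estimator of k from [B]
khat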
(h) [1 mark] Comment on how the methods and theory considered in Question 3 might apply to Question 2.
Notation
$I_p$ Identity matrix of size $p \times p$.
$N_p(0, \Sigma)$ $p$-dimensional multivariate normal distribution with (vector) mean 0 and covariance $\Sigma$.
References
[A] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. Annals of Statistics, Vol. 45, No. 1.