
MAST 90138: MULTIVARIATE STATISTICAL TECHNIQUES
See Härdle and Simar, chapter 12.
6 FACTOR ANALYSIS
6.1 ORTHOGONAL FACTOR MODEL
Let X ∼ (μ, Σ) be a p-dimensional random vector. In PCA we have seen that if Σ has only q < p nonzero eigenvalues, then we can express X as

X − μ = Γ(1) Y(1)    (1)

(we subtract μ because our PCA calculations were for X of mean zero), where

Γ(1) = (γ1, . . . , γq) is p × q,   Y(1) = (Y1, . . . , Yq)T is q × 1,

the first q components of the p-vector Y = ΓT X of PCs.

 Recall that Y(1) ∼ (0, Λ1), where Λ1 = diag(λ1, . . . , λq).

 Letting Q = Γ(1) Λ1^{1/2} and F = Λ1^{-1/2} Y(1), rewrite (1) as X = QF + μ.

 Now we have

E(F) = 0,   var(F) = Λ1^{-1/2} var(Y(1)) Λ1^{-1/2} = Iq,

Σ = var(X) = Q var(F) QT = QQT = λ1 γ1 γ1T + · · · + λq γq γqT.

 In this case X is completely determined by a weighted sum of the q < p uncorrelated factors in F = (F1, . . . , Fq)T.

 Note that var(X), originally with p(p − 1)/2 + p = p(p + 1)/2 free entries, is completely explained by the factors via the loading matrix Q, which has qp entries.

 When q << p, p(p + 1)/2 is roughly of order O(p^2), whereas qp is roughly of order O(p) ⇒ we have achieved dimensionality reduction.

 [The dimension of Q is a subtle issue, not just qp . . . more on that later.]

 Often things aren’t that ideal. Generally, in an orthogonal factor model we assume the following data generating mechanism: there exists a random vector F = (F1, . . . , Fq)T of q common factors and a vector U = (U1, . . . , Up)T of p specific factors such that

X = QF + U + μ

and
• E(F) = 0
• var(F) = Iq
• E(U) = 0
• cov(Ui, Uj) = 0 if i ≠ j
• cov(F, U) = 0.

 Q is a p × q non-random matrix whose components are called loadings.

 In particular, var(U) ≡ Ψ ≡ diag(ψ1, . . . , ψp) is a diagonal matrix.

 Now for Σ := var(X), we have

Σ = var(QF + U) = var(QF) + var(U) = QQT + Ψ.

 Is dimensionality reduction achieved if q << p? Yes: there are only qp + p entries in Q and Ψ together, much fewer than p(p + 1)/2.

 [Again, more on the dimension of Q later.]

 In a sense, the correlations among the p components of X are entirely explained by the factors via the loadings, and the Ui’s add a bit of noise specific to each component.

 Some history: the factor analysis model finds prominent applications in psychology, pioneered by Spearman (the namesake of “Spearman’s ρ” in other statistics courses).

 If q = 1, F is an unobserved attribute like “intelligence”, and the Xj’s may be scores obtained in different cognitive tasks.

 See p. 363 of Härdle and Simar.

6.2 INTERPRETING THE FACTORS

We can interpret the factors using similar tools as in PCA.

 We can express the model X = QF + U + μ component by component: for j = 1, . . . , p,

Xj = qj1 F1 + · · · + qjq Fq + Uj + μj,

where qjl is the (j, l)th element of Q.

 Recalling cov(U, F) = 0, var(F) = Iq and var(U) = diag(ψ1, . . . , ψp), we deduce that

var(Xj) = qj1^2 + · · · + qjq^2 + ψj,

where qj1^2 + · · · + qjq^2 is called the communality and ψj is called the specific variance (or uniqueness in some textbooks).

 Then

(qj1^2 + · · · + qjq^2) / var(Xj)    (2)

is the proportion of variance of Xj explained by the q factors: the closer it is to 1, the better the variance of Xj is explained by the q factors.

 We can connect the factors to the Xj’s through their correlation. Since X = QF + U + μ, we have

cov(X, F) = cov(QF + U, F) = cov(QF, F) = Q cov(F, F) = Q,

and since var(Xj) = σjj and var(Fj) = 1, we deduce that

corr(X, F) = D^{-1/2} Q,   where D = diag(σ11, . . . , σpp).
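As a quick numerical check of the formulas above, here is a minimal numpy sketch (not part of the original notes; the loading matrix Q and the specific variances ψj below are made up) verifying that var(Xj) equals the communality plus ψj and computing corr(X, F) = D^{-1/2} Q:

    import numpy as np

    rng = np.random.default_rng(0)
    p, q = 5, 2

    # Made-up loading matrix Q (p x q) and specific variances psi_1, ..., psi_p.
    Q = rng.normal(size=(p, q))
    psi = rng.uniform(0.2, 1.0, size=p)

    # Covariance implied by the orthogonal factor model: Sigma = Q Q^T + Psi.
    Sigma = Q @ Q.T + np.diag(psi)

    # Communality of variable j is q_j1^2 + ... + q_jq^2, and var(X_j) = communality_j + psi_j.
    communality = np.sum(Q**2, axis=1)
    print(np.allclose(np.diag(Sigma), communality + psi))      # True

    # Proportion of var(X_j) explained by the q factors, cf. (2).
    print(communality / np.diag(Sigma))

    # corr(X, F) = D^{-1/2} Q with D = diag(sigma_11, ..., sigma_pp).
    corr_XF = np.diag(1.0 / np.sqrt(np.diag(Sigma))) @ Q
    print(corr_XF)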
 Just like we did in PCA, by analysing these correlations we can see which Xj’s are strongly correlated with each factor and deduce from there an interpretation of the factors.

6.3 SCALING THE DATA

 What happens if we change the scale of the Xj’s? Suppose we use Y = CX instead of X, where C = diag(c1, . . . , cp). Recalling that X = QF + U + μ, we deduce that

Y = QY F + UY + μY,

where UY = CU, var(UY) = ΨY, and where we define

QY = CQ,   ΨY = C Ψ CT,   μY = Cμ.    (3)

 The factors F have not changed and the new model is still an orthogonal factor model:
• E(F) = 0, var(F) = Iq
• E(UY) = 0, cov(UY,i, UY,j) = 0 if i ≠ j
• cov(F, UY) = 0.

 Since scaling does not influence F, in many applications the search for the loadings is done through the scaled and centered data, i.e. through

Y = D^{-1/2}(X − μ)   (recall D = diag(σ11, . . . , σpp)).

That is, we search for QY and ΨY in the model

Y = QY F + UY

under the same assumptions as before:
• E(F) = 0, var(F) = Iq
• E(UY) = 0, cov(UY,i, UY,j) = 0 if i ≠ j
• cov(F, UY) = 0.

Also, var(Y) = corr(Y) = QY QYT + ΨY.

 Let qY,jl denote the (j, l)th element of QY. Then for j = 1, . . . , p,

qY,j1^2 + · · · + qY,jq^2 + ψY,j = var(Yj) = 1.

 If the communality qY,j1^2 + · · · + qY,jq^2 is close to 1, this means that the q factors explain the jth variable Xj well.

 Recall that in the non-scaled case we found in (2) that

(qj1^2 + · · · + qjq^2) / var(Xj) = (qj1^2 + · · · + qjq^2) / σjj

is the proportion of variance of Xj explained by the q factors.

 Since qY,jl = qjl / √var(Xj),

qY,j1^2 + · · · + qY,jq^2 = (qj1^2 + · · · + qjq^2) / σjj

is in fact the proportion of variance of Xj explained by the q factors.

 To interpret the factors we can also compute the correlation matrix

corr(Y, F) = cov(Y, F) = cov(QY F + UY, F) = QY.

 In particular, this correlation is also the correlation with the original X:

corr(X, F) = D^{-1/2} Q = QY = corr(Y, F).

 To summarise, calculations are easier with the scaled data and they provide the same interpretation.

6.4 NON-UNIQUENESS OF THE MATRIX Q

 (Caveat: this is the most challenging part of this topic.)

 Is the matrix Q of factor loadings unique? No. If we have the factor model

X = QF + U + μ,

then for any orthogonal q × q matrix G we have (since G GT = Iq)

X = Q G GT F + U + μ,

so that

X = QG FG + U + μ

holds with QG := Q · G and FG := GT F.

 This is still a valid factor model:
• E(FG) = 0
• var(FG) = GT var(F) G = GT G = Iq
• E(U) = 0
• cov(Ui, Uj) = 0 if i ≠ j
• cov(FG, U) = GT cov(F, U) = 0.

 More importantly,

Σ = Q QT + Ψ = QG (QG)T + Ψ.

 From an inference point of view this is tricky, since one cannot uniquely recover the loading matrix Q from the covariance Σ (Q and Ψ are the underlying parameters of our model).

 From a numerical point of view it is bad too, since it is difficult for an algorithm to settle on a solution.

 Intuitively, if we don’t impose restrictions on Q, the possible solution set for Q is at least of dimension (or degrees of freedom) q(q − 1)/2, since that is the dimension of the set of all q × q orthogonal matrices.

 Think about how many “degrees of freedom” we have when writing down a q × q orthogonal matrix: any such matrix has q^2 entries, but it needs to satisfy these algebraic constraints: all q columns must have unit length, and every pair of columns must have zero dot product.
 There are q + q(q − 1)/2 = q(q + 1)/2 such constraints (q unit-length constraints, plus one orthogonality constraint for each of the q(q − 1)/2 pairs of columns), and hence the degrees of freedom is q^2 − q(q + 1)/2 = q(q − 1)/2.

 Hence one generally needs to impose q(q − 1)/2 constraints to make Q identifiable. There are different ways of doing this.

 One way is to restrict Q to be “lower triangular”, that is, all qjl = 0 for j < l. There are exactly q(q − 1)/2 zero constraints here.

 Typically, one would further restrict all the diagonal entries qjj, j = 1, . . . , q, to be positive, but this doesn’t really cut down the degrees of freedom (the whole real line and half the real line are both objects of dimension 1).

 The above set of constraints is motivated by the following version of the Cholesky decomposition theorem: for any symmetric positive semi-definite p × p matrix A of rank q ≤ p, one can find a unique p × q matrix L = (ljl) such that A = LLT, where L is lower triangular with positive diagonal entries, i.e. ljj > 0 for all j = 1, . . . , q, and ljl = 0 for all j < l.
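To illustrate the non-uniqueness, and one way of rotating the loadings to a constrained representative, here is a minimal numpy sketch (not part of the original notes; Q, Ψ and the rotation angle are made up, and the QR decomposition is just one convenient way to construct a suitable orthogonal G):

    import numpy as np

    rng = np.random.default_rng(1)
    p, q = 5, 2

    # Made-up loadings and specific variances.
    Q = rng.normal(size=(p, q))
    psi = rng.uniform(0.2, 1.0, size=p)
    Sigma = Q @ Q.T + np.diag(psi)

    # Any orthogonal q x q matrix G (here a 2 x 2 rotation by an arbitrary angle)
    # gives an equally valid loading matrix Q_G = Q G with the same Sigma.
    theta = 0.7
    G = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Q_G = Q @ G
    print(np.allclose(Sigma, Q_G @ Q_G.T + np.diag(psi)))    # True: same Sigma, different loadings

    # One identification scheme: rotate so that the top q x q block of the loadings
    # is lower triangular with a positive diagonal (q(q-1)/2 zero constraints).
    B = Q[:q, :]                          # top q x q block of Q
    G_id, R = np.linalg.qr(B.T)           # B^T = G_id R, with R upper triangular
    G_id = G_id * np.sign(np.diag(R))     # flip column signs so the diagonal becomes positive
    Q_id = Q @ G_id                       # its top q x q block is now lower triangular
    print(np.round(Q_id, 3))
    print(np.allclose(Sigma, Q_id @ Q_id.T + np.diag(psi)))  # still the same Sigma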
[Table of estimated factor loadings for the 15 variables in the example below; the numerical entries are not recoverable from this extraction.]

Factor loadings after varimax rotation
Using the hypothesis test described in Section 9.5 of that book, it was found that seven factors were needed.

Interpretation (copied from that book):
 It is very difficult to interpret the unrotated loadings but easier to interpret the rotated loadings.
 The first factor is loaded heavily on variables 5, 6, 8, 10, 11, 12, and 13 and represents perhaps an outward and salesmanlike personality.
 Factor 2, weighting variables 4 and 7, represents likeability.
 Factor 3, weighting variables 1, 9, and 15 represents experience.
 Factors 4 and 5 each represent one variable, academic ability (3), and appearance (2), respectively.
 The last two factors have little importance and variable 14 (keenness) seemed to be associated with several of the factors.
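The rotated loadings interpreted above come from a varimax rotation. As an illustration only (this code is not part of the notes), here is a minimal numpy implementation of the standard Kaiser varimax criterion; the small loading matrix Q_hat below is made up, and in practice one would plug in the estimated p × q loading matrix instead:

    import numpy as np

    def varimax(Q, gamma=1.0, max_iter=100, tol=1e-6):
        # Rotate a p x q loading matrix Q by the varimax criterion.
        # Returns Q @ R with R orthogonal, so the implied covariance
        # Q Q^T + Psi is unchanged by the rotation.
        p, q = Q.shape
        R = np.eye(q)
        d = 0.0
        for _ in range(max_iter):
            L = Q @ R
            # SVD step of the standard iterative varimax algorithm.
            u, s, vh = np.linalg.svd(
                Q.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
            )
            R = u @ vh
            d_old, d = d, np.sum(s)
            if d_old != 0 and d / d_old < 1 + tol:
                break
        return Q @ R

    # Small made-up example (NOT the loadings from the table above).
    Q_hat = np.array([[0.89, 0.33],
                      [0.79, 0.40],
                      [0.25, 0.70],
                      [0.31, 0.82]])
    print(np.round(varimax(Q_hat), 3))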

6.7 FACTOR SCORES
Unlike PCA, the factors Fj are latent and not computable, although there are techniques to “estimate” them (since they are random variables, we instead speak of “predicting” them). The predicted values of the factors are called factor scores.
 At the population level, our factor model was X = QF + U + μ
where X ∈ Rp, F ∈ Rq and U ∈ Rp are random vectors.
 At the sample X1,…,Xn level, for i = 1,…,n we write
Xi = QFi + Ui + μ
where, this time, Xi = (Xi1,…,Xip)T, Fi = (Fi1,…,Fiq)T and Ui = (Ui1,…,Uip)T.

 It is not possible to compute the Fij’s but there are techniques to predict them: for each individual i, we can predict (“estimate”) the jth factor Fij.
 We won’t see here how this is done, but note that the predictors Fˆij
are not equal to the true latent, unobservable Fij’s.

6.8 FACTOR ANALYSIS VERSUS PCA
 In PCA, our goal was to explicitly find linear combinations of the components of X. This is how we constructed the PCs Y1, . . . , Yp and this doesn’t depend on any model.
 In factor analysis, the factors are not directly computable; they are latent. They appear after we model the data by putting a structure on (modelling) the covariance matrix. The whole analysis depends on the factor model we assume: if the model is wrong, then the analysis will be spurious.
 In factor analysis, the factors are not necessarily linear combinations of the original variables; they are factors in their own right, which often represent characteristics shared by groups of variables.
 In fact, in factor analysis, instead of taking the factors F to be functions of X, we express X as a function of F (it goes in the other direction).

 The latent aspect of the factors is particularly useful when we want to extract traits that are not directly measurable. Example: an anxiety factor is not really something we can measure directly, but it could be connected to measurable variables such as sweating, success in exams, or difficulty talking to strangers.
 In PCA the first few PCs explain the largest variability of the data (this is how we construct them).
 In factor analysis, the q factors are often those that are the most easily interpretable (this is how we construct them).
 Sometimes PCA and factor analysis give similar results, and we can understand why: as seen earlier, if only q < p eigenvalues of Σ are nonzero, then we can write a factor model using the first q PCs.
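To make the last remark concrete, here is a small numpy sketch (not part of the original notes; the rank-q covariance below is made up): when Σ has only q nonzero eigenvalues, the loadings Q = Γ(1) Λ1^{1/2} built from the first q PCs reproduce Σ exactly, i.e. an exact factor model with Ψ = 0.

    import numpy as np

    rng = np.random.default_rng(2)
    p, q = 5, 2

    # Made-up covariance matrix Sigma of rank q < p.
    A = rng.normal(size=(p, q))
    Sigma = A @ A.T

    # Spectral decomposition Sigma = Gamma Lambda Gamma^T, eigenvalues sorted
    # in decreasing order; only the first q eigenvalues are nonzero.
    lam, Gamma = np.linalg.eigh(Sigma)
    order = np.argsort(lam)[::-1]
    lam, Gamma = lam[order], Gamma[:, order]

    # Loadings from the first q PCs: Q = Gamma_(1) Lambda_1^{1/2}.
    Q = Gamma[:, :q] @ np.diag(np.sqrt(lam[:q]))
    print(np.allclose(Sigma, Q @ Q.T))    # True: exact factor model with Psi = 0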