
5.7 PCA WITH REGRESSION (PCR)
Hastie, Tibshirani and Friedman (ESL), Section 3.5
• The PCs are also often used as a dimension reduction tool for performing other statistical analyses.
• For example, they are frequently used in regression analysis, and also in classification and clustering.
• In those problems, instead of applying the technique to the original data, we apply it to some of the PCs of the data.

• E.g. Consider a linear regression problem for a variable Z and a p-vector X:
Z = m(X) + ε,
where m(x) = E(Z|X = x) = α + β^T x is the regression function, E(ε) = 0, and X and ε are independent.
• Assume for simplicity that Z and X are centered:
Z = m(X) + ε,   m(x) = E(Z|X = x) = β^T x,   (1)
where β ∈ R^p is unknown.
⇒ no intercept α in the model.

• Suppose we observe i.i.d. data (X1, Z1), …, (Xn, Zn) from (1).
• We want to estimate β
⇒ this provides an estimator of the regression function m
⇒ and lets us predict Z for new values of X.
• The least squares (LS) estimator of β is
β̂_LS = argmin_{b ∈ R^p} ∑_{i=1}^n (Zi − b^T Xi)² = (X^T X)^{−1} X^T Z,
where X is the n × p data matrix and Z = (Z1, …, Zn)^T is an n-vector.
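• A minimal numerical sketch of this estimator (with simulated centered data; the dimensions and coefficient values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5

# Simulate centred data from the model Z = beta^T X + eps
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                      # centre the columns of X
Z = X @ beta + rng.normal(scale=0.5, size=n)
Z -= Z.mean()                            # centre Z

# LS estimator: beta_hat_LS = (X^T X)^{-1} X^T Z
# (lstsq is numerically safer than forming the inverse explicitly)
beta_hat_LS, *_ = np.linalg.lstsq(X, Z, rcond=None)
print(beta_hat_LS)                       # should be close to beta
```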

• When p is large, X^T X can be non-invertible. E.g. if p > n:
• rank(X) ≤ min(n, p).
• rank(X^T X) = rank(X) ≤ min(n, p) = n < p.
• Since X^T X is a p × p matrix, it cannot be of full rank ⇒ it is not invertible.
• Even if p < n, this can also happen when some variables can be written as a linear combination of others. In that case X^T X is non-invertible.
• Generally: as long as one variable can be approximately written as a linear combination of the others, X^T X will be nearly non-invertible, so that its inverse is very hard to estimate well in practice.
• In those cases
β̂_LS = (X^T X)^{−1} X^T Z
can be a very bad estimator of β.
• Then β̂_LS is not very useful: if it is far from β, all it does is give us misleading information about the relationship between X and Z.
• We cannot estimate the linear model at (1) in a reliable way, so we take a modified version of this model which approximates it well and which we can estimate reasonably well from data.
• Suggestion: perform the linear regression on some PCs of X instead of on X itself.
• To understand why, note that if some of the variables can be written as a linear combination of the others, then some of the eigenvalues of Σ will be equal to zero.
• For example, if X = (X1, X2)^T and X2 = cX1 where c is a constant, then
Σ = [ var(X1)      c · var(X1)  ]
    [ c · var(X1)  c² · var(X1) ],
so that the second column of Σ is equal to c times the first column. Thus |Σ| = 0 and Σ has one zero eigenvalue.
• More generally, suppose that only the first q eigenvalues of Σ are non-zero, and recall that the variances of the PCs of X satisfy var(Yj) = λj for j = 1, ..., p, so that
var(Yj) = 0 for j = q + 1, ..., p.
• Now recall that when E(X) = 0, we can write the p-vector of PCs as Y = Γ^T X. With Σ as above, since the Yj's are uncorrelated we have
var(Y) = Λ,
where
Λ = [ Λ1  0 ]
    [ 0   0 ],
the 0's above are matrices, and
Λ1 = diag(λ1, ..., λq).
In other words, Y ∼ (0, Λ), so that E[Y] = 0 and the last p − q components of Y have variance 0.
• If we write Y as
Y^T = (Y(1)^T, Y(2)^T),
where
Y(1) = (Y1, ..., Yq)^T and Y(2) = (Yq+1, ..., Yp)^T,
then
Y(1) ∼ (0, Λ1),   Y(2) ∼ (0, 0).
In other words, Y(2) is a vector with mean and variance zero ⇒ it degenerates to 0.
• Now since
Y = Γ^T X   (2)
and Γ is orthogonal, we also have
X = ΓY = Γ(1)Y(1) + Γ(2)Y(2),
where
Γ(1) = (γ1, ..., γq),   Γ(2) = (γq+1, ..., γp)
and
Γ(2)Y(2) ∼ (0, 0).
Thus X can just be expressed as X = Γ(1)Y(1).
⇒ X contains a lot of redundant information and we can just work with Y(1), the vector of the first q PCs.
• Getting back to our linear regression problem
Z = m(X) + ε = β^T X + ε,
we were interested in estimating m in the case where Z and X had mean zero and where there was near collinearity among the components of X.
• In the case of exact collinearity, we saw that if Σ has only q non-zero eigenvalues then X = Γ(1)Y(1), where Y(1) is the vector of the first q PCs. There we can write
Z = mPC(Y(1)) + ε,   (3)
where
mPC(Y(1)) = βPC^T Y(1) and βPC^T = β^T Γ(1).
• In other words, if Σ has only q < p non-zero eigenvalues, then instead of solving a linear model with the p-vector X we can solve a linear model with the q-vector Y(1).
• Indeed, we have just seen that
β^T X = m(X) = mPC(Y(1)) = βPC^T Y(1).
mPC is easier to estimate because we only need to estimate a q-dimensional parameter βPC instead of the p-dimensional parameter β.
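• Before moving on to estimation, here is a small numerical check of the decomposition X ≈ Γ(1)Y(1) described above (a sketch with made-up data in which the last p − q variables are nearly linear combinations of the first q; all names are illustrative): the trailing PCs have essentially zero variance and X is recovered from the first q PCs alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, p = 200, 3, 6

# Build an X whose last p - q variables are (nearly) linear combinations
# of the first q ones, so Sigma has p - q (near-)zero eigenvalues.
B = rng.normal(size=(q, p - q))
X1 = rng.normal(size=(n, q))
X = np.hstack([X1, X1 @ B + 1e-3 * rng.normal(size=(n, p - q))])
X -= X.mean(axis=0)                       # centre

S = np.cov(X, rowvar=False)               # sample covariance matrix
lam, Gamma = np.linalg.eigh(S)            # eigenvalues in ascending order
order = np.argsort(lam)[::-1]             # reorder to descending
lam, Gamma = lam[order], Gamma[:, order]

Y = X @ Gamma                             # PC scores: row i is Gamma^T X_i
print(np.round(Y.var(axis=0, ddof=1), 6)) # last p - q variances are ~ 0

# X is (approximately) recovered from the first q PCs alone:
X_approx = Y[:, :q] @ Gamma[:, :q].T      # Gamma_(1) Y_(1)
print(np.max(np.abs(X - X_approx)))       # small reconstruction error
```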
• To estimate βPC we can use the least squares estimator
β̂PC = argmin_{b ∈ R^q} ∑_{i=1}^n (Zi − b^T Y(1),i)² = (Y(1)^T Y(1))^{−1} Y(1)^T Z,
where Y(1),i = (Yi1, ..., Yiq)^T denotes the version of Y(1) for the ith individual and
Y(1) = [ Y11 ... Y1q ]
       [  ⋮        ⋮ ]
       [ Yn1 ... Ynq ].
• More often, we only have near collinearity among some components of X, so that the eigenvalues of Σ may all be non-zero but some may be near zero.
• Suppose that λq+1, ..., λp are small, and let
Y^T = (Y(1)^T, Y(2)^T),   Y(1) = (Y1, ..., Yq)^T,   Y(2) = (Yq+1, ..., Yp)^T.
Then
var(Yq+1) = λq+1, ..., var(Yp) = λp
are small.
• In that case, getting rid of the PCs that have small eigenvalues only gives an approximation:
X = ΓY = Γ(1)Y(1) + Γ(2)Y(2) ≈ Γ(1)Y(1).
• Full picture: For centered Z and X, we want to approximate the regression model
Z = β^T X + ε
  = β^T ΓY + ε
  = β^T Γ(1) Y(1) + β^T Γ(2) Y(2) + ε
(where Γ(1) is p × q, Y(1) is q × 1, Γ(2) is p × (p − q) and Y(2) is (p − q) × 1) by the model
Z = βPC^T Y(1) + η,
where E(η) = 0 and
mPC(y(1)) = E(Z|Y(1) = y(1)) = βPC^T y(1),
the model treating the variability coming from Y(2) as negligible (X ≈ Γ(1)Y(1)).
• mPC is easier to estimate than m because βPC has lower dimension than β. However, since here we only have X ≈ Γ(1)Y(1), we only have m(x) ≈ mPC(y(1)).
• As above, the least squares estimator of βPC is equal to
β̂PC = argmin_{b ∈ R^q} ∑_{i=1}^n (Zi − b^T Y(1),i)² = (Y(1)^T Y(1))^{−1} Y(1)^T Z.
We hope that the approximation is good, so that we can estimate m(x) by
m̂PC(y(1)) = β̂PC^T y(1).
• One can then declare the "PCR estimate" of β to be
β̂pcr(q) = Γ̂(1) β̂PC,
where Γ̂(1) is the estimated version of Γ(1) used in practice.
• Recall that
β̂_LS = (X^T X)^{−1} X^T Z
can have large (co)variance if X^T X is invertible but has small eigenvalues.
• It can be shown that when q < p,
Cov(β̂_LS) − Cov(β̂pcr(q)) ≥ 0
(recall that for matrices this means the difference is positive semi-definite). See https://en.wikipedia.org/wiki/Principal_component_regression for details.
• But be aware that β̂pcr(q) is biased, unlike β̂_LS.
• So we have reduced variance in exchange for some bias.
• In practice, of course, one can only get "empirical versions" of Γ and Y, based on the sample covariance matrix S.
• How to choose q? Choosing it to minimise the least squares criterion does not work (overfitting problem: it selects too large a q).
• A criterion that is often used to choose the number q of PCs is cross-validation (CV). In its simplest form, it consists in choosing
q = argmin_{k=1,...,p} CV(k),
where
CV(k) = ∑_{i=1}^n (Zi − (β̂PC,k^{−i})^T Y(1),i)²
and β̂PC,k^{−i} denotes the LS estimator of βPC computed as in the formula given earlier with q = k, except that the ith individual (or sample) is removed from Z and Y(1). (A small code sketch of this procedure is given below.)
• This is often used in practice: do the linear regression on a small number q of PCs instead of on the large number p of original variables.
• PROBLEM: In regression, the goal is not to explain just X but to explain the relationship between Z and X, so selecting the first few PCs is not necessarily the best thing to do.
• All we know is that the first few PCs explain most of the variability of X, but our goal is rather to explain the relationship between Z and X.
• If the goal is to do regression, the first few PCs might not be the best choice.
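• Here is a minimal sketch of the PCR procedure just described, with q chosen by leave-one-out CV (made-up data; the helper names pcr_fit and loo_cv are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 10

# Simulated centred data with strongly correlated columns (near collinearity)
F = rng.normal(size=(n, 3))
X = F @ rng.normal(size=(3, p)) + 0.05 * rng.normal(size=(n, p))
X -= X.mean(axis=0)
beta_true = rng.normal(size=p)
Z = X @ beta_true + 0.3 * rng.normal(size=n)
Z -= Z.mean()

# Empirical PCs: eigendecomposition of the sample covariance matrix S
lam, Gamma = np.linalg.eigh(np.cov(X, rowvar=False))
Gamma = Gamma[:, np.argsort(lam)[::-1]]         # columns sorted by eigenvalue
Y = X @ Gamma                                   # all p PC scores


def pcr_fit(Yq, Z):
    """LS estimate of beta_PC from the first q PC scores Yq (n x q)."""
    b, *_ = np.linalg.lstsq(Yq, Z, rcond=None)
    return b


def loo_cv(k):
    """Leave-one-out CV criterion CV(k) for PCR with k components."""
    Yk = Y[:, :k]
    err = 0.0
    for i in range(len(Z)):
        keep = np.arange(len(Z)) != i           # drop the ith individual
        b = pcr_fit(Yk[keep], Z[keep])
        err += (Z[i] - Yk[i] @ b) ** 2
    return err


q = min(range(1, p + 1), key=loo_cv)            # q = argmin_k CV(k)
beta_pcr = Gamma[:, :q] @ pcr_fit(Y[:, :q], Z)  # beta_pcr(q) = Gamma_(1) beta_PC_hat
print(q, np.round(beta_pcr, 3))
```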
• It can still be a good idea to work with a few PCs, but not necessarily the first few. Instead, we could choose the few that explain most of the relationship between Z and X.
• It is possible to do this and some people do, but isn't it possible to reduce the dimension of X in a better way?

5.8 PARTIAL LEAST SQUARES (ESL 3.5.2)
• Consider the linear regression model with E(X) = 0, E(Z) = 0:
Z = m(X) + ε,   m(x) = E(Z|X = x) = β^T x,
where β, X = (X1, ..., Xp)^T and x are p-vectors and β is unknown.
• In PCA, we construct particular linear combinations of the Xj's:
Yj = γj^T X,   Y = Γ^T X,
where the weight matrix Γ is constructed only from Σ = var(X).
• We hope that the first few Yj's contain most of the information we need about X, and then we take the approximation
m(x) = β^T x ≈ mPC(y(1)) = βPC^T y(1).
• PROBLEM: This decomposition into successive PCs Y1, ..., Yp is based only on the eigenvectors of var(X) and ignores Z completely.
• In partial least squares (PLS), we choose a decomposition of X into successive linear projections
T1, T2, ..., Tp
which try to keep most of the information about the relationship between Z and X.
• Since our goal is to estimate the relationship between X and Z, this seems a better idea.
• In PCA, the first component Y1 is constructed by taking
Y1 = γ1^T X,
where ∥γ1∥ = 1 and var(Y1) is as large as possible.
• In PLS, the first component T1 is constructed by taking
T1 = φ1^T X,
where ∥φ1∥ = 1 and |cov(Z, T1)| is as large as possible.
Note: Z and T1 are of dimension 1 and φ1 is a p-vector.
• Then for k = 2, ..., p, the kth component Tk is constructed by taking
Tk = φk^T X,
where ∥φk∥ = 1, φk^T Σ φj = 0 for j = 1, ..., k − 1, and |cov(Z, Tk)| is as large as possible.
Note: Z and Tk are of dimension 1 and φk is a p-vector.
• The projections of X onto the Tj's are constructed to keep the linear combinations of X that explain most of the linear relationship between Z and X.
• Let Φ = (φ1, ..., φp), a p × p matrix. We hope that the first few (q, say) components of the p-vector
T = (T1, ..., Tp)^T = Φ^T X
contain most of the information about X that is useful to estimate m(x), the linear relationship between X and Z.
• Then we do the regression on T1, ..., Tq instead of on X1, ..., Xp.
• Specifically, recall that for x, X and β ∈ R^p, our linear model was
m(x) = E(Z|X = x) = β^T x.
• Define the p-vector t as t = (t1, ..., tp)^T = Φ^T x, and for q < p define the q-vectors T(1) = (T1, ..., Tq)^T and t(1) = (t1, ..., tq)^T.
• We use the approximation
m(x) = β^T x ≈ E(Z|T(1) = t(1)) ≡ mPLS(t(1)) = βPLS^T t(1).
• The least squares estimator of βPLS is equal to
β̂PLS = argmin_{b ∈ R^q} ∑_{i=1}^n (Zi − b^T T(1),i)² = (T(1)^T T(1))^{−1} T(1)^T Z,
where Z = (Z1, ..., Zn)^T is an n-vector, T(1),i = (Ti1, ..., Tiq)^T is a q-vector corresponding to the ith individual, and
T(1) = [ T11 ... T1q ]
       [  ⋮        ⋮ ]
       [ Tn1 ... Tnq ].
• As for PCA, we can choose q by cross-validation:
q = argmin_{k=1,...,p} CV(k),
where
CV(k) = ∑_{i=1}^n {Zi − (β̂PLS,k^{−i})^T T(1),i}²
and β̂PLS,k^{−i} denotes the LS estimator of βPLS computed as in the formula above with q = k, except that the ith individual is removed from Z and T(1).
• We have so far presented everything at the population level. In practice there is no way to get the φ1 that maximizes |cov(Z, T1)|.
• We instead compute an empirical version φ̂1 as follows, given the data vector Z (n × 1) and the data matrix X (n × p):
1. Let Z̄ = n^{−1} ∑_i Zi and Z̃ = Z − (Z̄, ..., Z̄)^T, the centered version of Z.
2. Similarly, let X̃ be the centered version of X, i.e. each column of X̃, denoted X̃_j, is centered analogously.
3. For any candidate projection vector φ1 = (φ11, ..., φ1p)^T,
n^{−1} ⟨Z̃, X̃ φ1⟩ = n^{−1} (⟨Z̃, X̃_1⟩ φ11 + ··· + ⟨Z̃, X̃_p⟩ φ1p)
is the empirical analogue of cov(Z, φ1^T X).
4. Notice that the empirical covariance in this display is itself a dot product between φ1 and the vector (⟨Z̃, X̃_1⟩, ..., ⟨Z̃, X̃_p⟩)^T. By the Cauchy–Schwarz inequality, over unit-length φ1 the empirical covariance is maximized if we pick φ̂1 to be (⟨Z̃, X̃_1⟩, ..., ⟨Z̃, X̃_p⟩)^T rescaled to unit length.
• Read Algorithm 3.3 in ESL (p. 81) for how to get the other directions φ̂2, ..., if you want to know more (caveat: their notation is different).
• Note: When X̃ has orthogonal columns, i.e. when X̃^T X̃ = I, PLS coincides with ordinary linear regression. (ESL p. 81)

5.9 DISCUSSION
• PCA and PLS are also used for non-linear regression models. There too we approximate m(x) = E(Z|X = x) by mPLS(t(1)) = E(Z|T(1) = t(1)) or mPC(y(1)) = E(Z|Y(1) = y(1)).
• For regression estimation, dimension reduction by PLS as described above often works better than dimension reduction by PCA.
• Typically, either mPLS(t(1)) approximates m(x) better than mPC(y(1)) does, or, if they achieve the same approximation, the number q of components needed by PLS is smaller than the one needed by PCA ⇒ more computationally efficient.
• Example: Phoneme data available at www-stat.stanford.edu/ElemStatLearn.
• The Xi's in R^256 represent log-periodograms constructed from recordings of the phonemes "aa" as in "dark" and "ao" as in "water".
• For each i we simulated Zi values by taking
Zi = m(Xi) + εi = β^T Xi + εi
with εi ∼ N(0, σ²), for various p-vectors β constructed so that all the interaction between X and Z depends on only 5 PCs (the others play no role).
• Remark: Recall Xi = ΓYi, so by choosing β appropriately, β^T Γ can be made to have only five non-zero coordinates.
Case (i): Information contained in the first 5 PCs.
Case (ii): Information contained in PCs 6 to 10.
Case (iii): Information contained in PCs 11 to 15.
Case (iv): Information contained in PCs 16 to 20.
• These data have N = 1717 observations. We randomly created 200 subsamples of size n = 30 or n = 100, and we computed the PLS regression estimators and the PCA regression estimators from each of those subsamples.
• Then, in each case, for the remaining N − n observations not used to compute the regression estimators, we computed the predictors
Ẑi^PLS = m̂PLS(T(1),i) or Ẑi^PC = m̂PC(Y(1),i)
for q = 1, 2, ..., 10.
• For Ẑi denoting Ẑi^PLS or Ẑi^PC, we computed the average squared prediction error
PE = (1/(N − n)) ∑_{i=1}^{N−n} (Zi − Ẑi)².
Cases (i) (top) and (ii) (bottom) when n = 30 (left) or n = 100 (right): [Figure: boxplots of the PE's for q = 1 to 10 from left to right, PLS vs PCA.]
Cases (iii) (top) and (iv) (bottom) when n = 30 (left) or n = 100 (right):
[Figure: boxplots of the prediction error using the first q PLS components (first group of boxes) and the first q PCA components (last group of boxes), calculated from 200 subsamples of size n = 30 (first column) or n = 100 (second column), generated from the phoneme data.]
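• As a closing illustration of the empirical PLS construction from Section 5.8, here is a minimal sketch on made-up data; only the first direction φ̂1 and the corresponding one-component (q = 1) fit are computed, and further directions would follow ESL's Algorithm 3.3.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 8

# Made-up data for illustration
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
Z = X @ beta + 0.5 * rng.normal(size=n)

# Steps 1-2: centre Z and the columns of X
Z_t = Z - Z.mean()
X_t = X - X.mean(axis=0)

# Steps 3-4: stack the empirical covariances <Z~, X~_j> into a vector;
# by Cauchy-Schwarz, the unit vector maximizing the empirical covariance
# n^{-1} <Z~, X~ phi_1> is this vector rescaled to unit length.
c = X_t.T @ Z_t
phi1_hat = c / np.linalg.norm(c)

# First PLS component scores T_1 = phi_1^T X for each observation,
# and a simple regression of Z on T_1 (q = 1 here).
T1 = X_t @ phi1_hat
beta_pls1 = (T1 @ Z_t) / (T1 @ T1)
print(np.round(phi1_hat, 3), round(float(beta_pls1), 3))
```

For more than one component, a standard implementation such as sklearn.cross_decomposition.PLSRegression could be used instead; note that its scaling conventions may differ from the notation in these notes.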