VIII Lecture
Principal Components Analysis
8.1. Introduction. The basics of the procedure
Principal components analysis is applied mainly as a variable reduction procedure. It is usually applied when data have been obtained on a possibly large number of variables which may be highly correlated. The goal is to try to “condense” the information. This is done by summarising the data in a (small) number of transformations of the original variables. Our motivation for doing so is the belief that there is some redundancy in the way the original set of variables presents the information, since, e.g., many of these variables may be measuring the same construct. In that case we try to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variability in the observed variables. For simplicity, these new artificial variables are constructed as linear combinations of the (optimally weighted) observed variables. If one linear combination is not enough, we can choose to construct two, three, etc. such combinations. Note also that principal components analysis may be just an intermediate step in a much larger investigation. The principal components obtained can be used, for example, as inputs in a regression analysis or in a cluster analysis procedure. They are also a basic method for extracting factors in factor analysis.
8.2. Precise mathematical formulation.
Let X ∼ Np(µ, Σ) where p is assumed to be relatively large. To perform a reduction, we look for a linear combination α′1X with α1 ∈ Rp suitably chosen so that it maximizes the variance of α′1X subject to the reasonable norming constraint ‖α1‖² = α′1α1 = 1. Since Var(α′1X) = α′1Σα1, we need to choose α1 to maximize α′1Σα1 subject to α′1α1 = 1. Since this is an optimization with respect to a vector argument, some simple differentiation rules should be recalled:
For a vector variable y ∈ Rp, a vector of constants a ∈ Rp and a symmetric p × p matrix A it holds:
• ∂/∂y (y′Ay) = 2Ay (more generally, ∂/∂y (y′Ay) = Ay + A′y if A is not necessarily symmetric);
• ∂/∂y (y′y) = 2y;
• ∂/∂y (y′a) = a.
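These matrix differentiation rules are easy to verify numerically. The following is a minimal sketch (not part of the original derivation) that checks them with finite differences; the dimension p and the randomly generated A, a and y are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.normal(size=(p, p)); A = (A + A.T) / 2   # make A symmetric
a = rng.normal(size=p)
y = rng.normal(size=p)

def num_grad(f, y, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at the point y."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y); e[i] = eps
        g[i] = (f(y + e) - f(y - e)) / (2 * eps)
    return g

# d/dy (y'Ay) = 2Ay for symmetric A
print(np.allclose(num_grad(lambda v: v @ A @ v, y), 2 * A @ y, atol=1e-5))
# d/dy (y'y) = 2y
print(np.allclose(num_grad(lambda v: v @ v, y), 2 * y, atol=1e-5))
# d/dy (y'a) = a
print(np.allclose(num_grad(lambda v: v @ a, y), a, atol=1e-5))
```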
Performing the optimization requires applying Lagrange’s procedure for optimization under constraints:
i) construct the Lagrange function

Lag(α1, λ) = α′1Σα1 + λ(1 − α′1α1)

where λ ∈ R1 is the Lagrange multiplier;
ii) take the partial derivative with respect to α1 and equate it to zero:

2Σα1 − 2λα1 = 0 −→ (Σ − λIp)α1 = 0    (8.1)
From (8.1) we see that α1 must be an eigenvector of Σ and, since we know from the first lecture what the maximal value of α′Σα / α′α is, we conclude that α1 should be the eigenvector that corresponds to the largest eigenvalue λ̄1 of Σ. The random variable α′1X is called the first principal component.
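To make the construction concrete, here is a minimal numerical sketch in Python (the 3 × 3 matrix Σ below is an arbitrary illustrative choice, not taken from the notes): the weights of the first principal component are the normed eigenvector of Σ belonging to its largest eigenvalue, and the variance of the component equals that eigenvalue.

```python
import numpy as np

# Illustrative 3x3 covariance matrix (symmetric, positive definite)
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
alpha1 = eigvecs[:, -1]    # normed eigenvector of the largest eigenvalue
lambda1 = eigvals[-1]

print("alpha_1 =", alpha1)                            # weights of the first principal component
print("Var(alpha_1' X) =", alpha1 @ Sigma @ alpha1)   # equals lambda_1
print("largest eigenvalue =", lambda1)
```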
For the second principal component α′2X we require that it be normed according to α′2α2 = 1, uncorrelated with the first component, and that it give the maximal variance of a linear combination of the components of X under these constraints. To find it, we construct the Lagrange function:

Lag1(α2, λ1, λ2) = α′2Σα2 + λ1(1 − α′2α2) + λ2α′1Σα2

Its partial derivative w.r.t. α2 gives

2Σα2 − 2λ1α2 + λ2Σα1 = 0    (8.2)

Multiplying (8.2) by α′1 from the left and using the two constraints α′2α2 = 1 and α′2Σα1 = 0 gives:

−2λ1α′1α2 + λ2α′1Σα1 = 0 −→ λ2 = 0
(WHY? Keep in mind that α1 was an eigenvector of Σ.) But then (8.2) also implies that α2 ∈ Rp must be an eigenvector of Σ (it has to satisfy (Σ − λ1Ip)α2 = 0). Since it has to be different from α1, and having in mind that we aim at variance maximization, we see that α2 has to be the normed eigenvector that corresponds to the second largest eigenvalue λ̄2 of Σ. The process can be continued further. The third principal component should be uncorrelated with the first two, should be normed and should give the maximal variance of a linear combination of the components of X under these constraints. One can easily realize then that the vector α3 ∈ Rp in the formula α′3X should be the normed eigenvector that corresponds to the third largest eigenvalue λ̄3 of the matrix Σ, and so on.
Note that if we extract all possible p principal components then ∑_{i=1}^p Var(α′iX) will just equal the sum of all eigenvalues of Σ and hence

∑_{i=1}^p Var(α′iX) = tr(Σ) = σ11 + . . . + σpp

Therefore, if we only take a small number k of principal components instead of the total possible number p, we can interpret their inclusion as one that explains

[Var(α′1X) + . . . + Var(α′kX)] / (σ11 + . . . + σpp) · 100% = (λ̄1 + . . . + λ̄k) / (σ11 + . . . + σpp) · 100%

of the total population variance σ11 + . . . + σpp.
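A small sketch illustrating these identities (again with an arbitrary illustrative Σ): the variances of the p components add up to tr(Σ), and the percentage of total variance explained by the first k components is the ratio of the leading eigenvalues to the trace.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

eigvals = np.linalg.eigvalsh(Sigma)[::-1]            # eigenvalues in decreasing order
print(np.isclose(eigvals.sum(), np.trace(Sigma)))    # sum of eigenvalues equals tr(Sigma)

k = 1
explained = eigvals[:k].sum() / np.trace(Sigma) * 100
print(f"first {k} component(s) explain {explained:.1f}% of the total variance")
```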
8.3. Estimation of the Principal Components
In practice, Σ is unknown and has to be estimated. The principal components are
derived from the normed eigenvectors of the estimated covariance matrix.
Note also that extracting principal components from the (estimated) covariance matrix has the drawback that it is influenced by the scale of measurement of each variable Xi, i = 1, . . . , p. A variable with a large variance will necessarily be a large component in the first principal component (recall the goal of explaining the bulk of the variability by using the first principal component). Yet the large variance of the variable may be just an artifact of the measurement scale used for this variable. Therefore, an alternative practice is sometimes adopted: to extract the principal components from the correlation matrix ρ instead of the covariance matrix Σ.
Example (Eigenvalues obtained from Covariance and Correlation Matrices; see page 437 of Johnson and Wichern). It demonstrates the great effect standardization may have on the principal components. The relative magnitudes of the weights after standardization (i.e. obtained from ρ) may become in direct opposition to the weights attached to the same variables in the principal component obtained from Σ.
For the reasons mentioned above, variables are often standardized before sample principal components are extracted. Standardization is accomplished by calculating the vectors

Zi = ( (X1i − X̄1)/√s11, (X2i − X̄2)/√s22, . . . , (Xpi − X̄p)/√spp )′, i = 1, . . . , n.

The standardized observations matrix

Z = [Z1, Z2, . . . , Zn] =
    [ Z11 Z12 . . . Z1n ]
    [ Z21 Z22 . . . Z2n ]
    [ . . . . . . . . . ]
    [ Zp1 Zp2 . . . Zpn ]  ∈ Mp,n

gives the sample mean vector Z̄ = (1/n) Z 1n = 0 and the sample covariance matrix Sz = (1/(n − 1)) ZZ′ = R (the correlation matrix of the original observations). The principal components are now extracted from R in the usual way.
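A minimal sketch of this recipe in Python (the data matrix below is simulated purely for illustration; observations are kept in rows rather than in columns as in the notation above): standardize each variable, form R, and take its normed eigenvectors as the principal component weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
# Simulated correlated data: n observations (rows) on p variables (columns)
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Standardize: subtract the sample means, divide by the sample standard deviations
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = (Z.T @ Z) / (n - 1)                 # equals the sample correlation matrix of X

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                    # principal component scores of the n observations
print("eigenvalues of R:", np.round(eigvals, 3))
print("variance of the first PC score:", round(scores[:, 0].var(ddof=1), 3))  # equals the largest eigenvalue
```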
8.4. Deciding how many principal components to include. To reduce the dimensionality (which is the motivating goal), we should restrict attention to the first k principal components and, ideally, k should be kept much smaller than p. There is, however, a trade-off to be made here, since we would also like the proportion

ψk = (λ̄1 + . . . + λ̄k) / (λ̄1 + . . . + λ̄p)

to be close to one. How could a reasonable trade-off be made? The following methods are most widely used (a small computational sketch follows the list):
• The “scree plot”: basically, it is a graphical method of plotting the ordered λ̄k against k and deciding visually when the plot has flattened out. Typically, the initial part of the plot is like the side of a mountain, while the flat portion, where each λ̄k is just slightly smaller than λ̄k−1, is like the rough scree at the bottom. This motivates the name of the plot. The task here is to find where “the scree begins”.
• Choose an arbitrary constant c ∈ (0, 1) and choose k to be the smallest value with the property ψk ≥ c. Usually, c = 0.9 is used, but note the arbitrariness of this choice.
• Kaiser’s rule: it is applied when extracting the components from the correlation matrix and suggests that, of all p principal components, only those should be retained whose variances are greater than unity or, equivalently, only those components which, individually, explain at least (1/p) · 100% of the total variance. (This is the same as excluding all principal components with eigenvalues less than the overall average.) This criterion has a number of positive features that have contributed to its popularity but cannot be defended on firm theoretical grounds.
• Formal tests of significance. Note that it actually does not make sense to test whether λ̄k+1 = . . . = λ̄p = 0, since if such a hypothesis were true then the population distribution would be contained entirely within a k-dimensional subspace and the same would be true for any sample from this distribution; hence the estimated λ̄ values for indices k + 1, . . . , p would also be equal to zero with probability one! What seems reasonable to do instead is to test H0 : λ̄k+1 = . . . = λ̄p (without requiring the common value to be zero). This is a more quantitative variant of the scree test. A test of this hypothesis is based on the arithmetic and geometric means a0 = arithmetic mean of the last p − k estimated eigenvalues and g0 = geometric mean of the last p − k estimated eigenvalues, from which one constructs −2 log λ = n(p − k) log(a0/g0). The asymptotic distribution of this statistic under the null hypothesis is χ² with ν = (p − k + 2)(p − k − 1)/2 degrees of freedom. The interested student can find more details about this test in the monograph of Mardia, Kent and Bibby. We should note, however, that the last result holds under a multivariate normality assumption and is only valid as stated for the covariance-based (not the correlation-based) version of the principal component analysis. In practice, many data analysts are reluctant to make a multivariate normality assumption at the early stage of the descriptive data analysis and hence distrust the above quantitative test, preferring the simple Kaiser criterion.
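As promised above, here is a small computational sketch illustrating how the proportion-of-variance rule, Kaiser's rule and the likelihood-ratio test could be coded. The eigenvalues, the sample size and the cut-off c = 0.9 are arbitrary illustrative choices; the likelihood-ratio statistic is computed mechanically as given above and, strictly speaking, applies to the covariance-based version under multivariate normality.

```python
import numpy as np
from scipy.stats import chi2

# Eigenvalues of an illustrative correlation matrix (decreasing order) and sample size
lam = np.array([2.6, 0.9, 0.3, 0.2])
n = 100

# Proportion-of-variance rule: smallest k with psi_k >= c (c = 0.9 is an arbitrary choice)
c = 0.9
psi = np.cumsum(lam) / lam.sum()
k_prop = int(np.argmax(psi >= c)) + 1

# Kaiser's rule (correlation-based PCA): keep components whose eigenvalue exceeds
# the overall average, i.e. exceeds 1 for a correlation matrix
k_kaiser = int(np.sum(lam > lam.mean()))

# Likelihood-ratio test of H0: the last p - k eigenvalues are all equal
# (strictly valid for covariance-based PCA under multivariate normality)
def lr_test(lam, n, k):
    tail = lam[k:]
    a0 = tail.mean()                      # arithmetic mean of the last p - k eigenvalues
    g0 = np.exp(np.log(tail).mean())      # geometric mean of the last p - k eigenvalues
    stat = n * (len(lam) - k) * np.log(a0 / g0)
    df = (len(lam) - k + 2) * (len(lam) - k - 1) / 2
    return stat, chi2.sf(stat, df)

print("k by proportion rule:", k_prop)
print("k by Kaiser's rule:", k_kaiser)
print("LR statistic and p-value for k = 1:", lr_test(lam, n, 1))
```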
8.5. Numerical example. The Crime Rates example will be discussed at the lecture. The data give crime rates per 100,000 people in seven categories for each of the 50 states of the USA in 1997. Principal components are used to summarize the 7-dimensional data in 2 or 3 dimensions only and to help visualize and interpret the data.
Basically, principal components analysis can be performed in SAS by using either the PRINCOMP or the FACTOR procedure. Principal components can serve as a method for initial factor extraction in exploratory factor analysis. But one should mention here that principal component analysis is not factor analysis. The main difference is that in factor analysis (to be studied later in this course) one assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on the observed variables. Factor analysis is used when it is believed that certain latent factors exist and one hopes to explore the nature and number of these factors. In contrast, in principal component analysis there is no prior assumption about an underlying causal model. The goal here is just variable reduction.
8.6. Example from finance: portfolio optimization. Many other problems in multivariate statistics lead to optimization problems that are similar in spirit to the principal components analysis problem. Here we shall illustrate the efficient portfolio choice problem.
Assume that a p-dimensional vector X of returns of the p assets is given. Then the return of a portfolio that holds these assets in proportions (c1, c2, . . . , cp) (with ∑_{i=1}^p ci = 1) is Q = c′X and the mean return is c′µ (here we assume that EX = µ, D(X) = Σ). The risk of the portfolio is c′Σc. Further, assume that a pre-specified mean return µ̄ is to be achieved. The question is how to choose the weights c so that the risk of a portfolio that achieves the pre-specified mean return is as small as possible.
Mathematically, this is equivalent to the requirement to find the solution of an optimization problem under two constraints. The Lagrange function is:

Lag(c, λ1, λ2) = c′Σc + λ1(µ̄ − c′µ) + λ2(1 − c′1p)    (8.3)

where 1p is a p-dimensional vector of ones. Differentiating (8.3) with respect to c we get
the first order conditions for a minimum:
2Σc− λ1µ− λ21p = 0 (8.4)
To simplify derivations, we shall consider the so-called case of non-existence of a riskless
asset with a fixed (non-random) return. Then it makes sense to assume that Σ is positive
definite and hence Σ−1 exists. We then get from (8.4):

c = (1/2) Σ−1(λ1µ + λ21p)    (8.5)
After multiplying both sides of the equality by 1′p from the left, we get:

1 = (1/2) 1′pΣ−1(λ1µ + λ21p)    (8.6)

We can get λ2 from (8.6) as λ2 = (2 − λ1 1′pΣ−1µ) / (1′pΣ−11p) and then substitute it in the formula for c to end up with:
c = (1/2) λ1 ( Σ−1µ − [1′pΣ−1µ / (1′pΣ−11p)] Σ−11p ) + Σ−11p / (1′pΣ−11p)    (8.7)
In a similar way, if we multiply both sides of (8.5) by µ′ from the left and use the restriction µ′c = µ̄, we obtain one more relationship between λ1 and λ2: λ1 = (2µ̄ − λ2 µ′Σ−11p) / (µ′Σ−1µ). The linear system of 2 equations with respect to λ1 and λ2 can then be solved and the values substituted in (8.7) to get the final expression for c using µ, µ̄ and Σ. (Do it!)
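Instead of carrying the algebra through by hand, the optimal weights can also be obtained numerically by solving the first-order condition (8.4) together with the two constraints as one linear system. Below is a minimal sketch; the covariance matrix Σ, the mean vector µ and the target return µ̄ are arbitrary illustrative values.

```python
import numpy as np

# Illustrative inputs: covariance of asset returns, mean returns, target mean return
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.06]])
mu = np.array([0.05, 0.07, 0.06])
mu_bar = 0.065
p = len(mu)
ones = np.ones(p)

# First-order conditions and constraints as one linear system:
#   2 Sigma c - lambda1 mu - lambda2 1_p = 0,   mu'c = mu_bar,   1_p'c = 1
KKT = np.zeros((p + 2, p + 2))
KKT[:p, :p] = 2 * Sigma
KKT[:p, p] = -mu
KKT[:p, p + 1] = -ones
KKT[p, :p] = mu
KKT[p + 1, :p] = ones
rhs = np.concatenate([np.zeros(p), [mu_bar, 1.0]])

sol = np.linalg.solve(KKT, rhs)
c, lam1, lam2 = sol[:p], sol[p], sol[p + 1]
print("optimal weights c:", np.round(c, 4))
print("mean return:", round(mu @ c, 4), " risk c'Sigma c:", round(c @ Sigma @ c, 6))
```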
One special case is of particular interest. This is the so-called variance-efficient portfolio (as opposed to the mean-variance efficient portfolio considered above). For the variance-efficient portfolio, there is no pre-specified mean return, that is, there is no restriction on the mean. It is only required to minimize the variance. Obviously, we have λ1 = 0 then and from (8.7) we get the optimal weights for the variance-efficient portfolio:

copt = Σ−11p / (1′pΣ−11p)
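A short sketch of this closed form (reusing the illustrative Σ from the previous block):

```python
import numpy as np

Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.06]])
ones = np.ones(Sigma.shape[0])

w = np.linalg.solve(Sigma, ones)      # Sigma^{-1} 1_p
c_opt = w / (ones @ w)                # normalize so that the weights sum to one
print("variance-efficient weights:", np.round(c_opt, 4))
print("weights sum to:", round(c_opt.sum(), 6))
```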