IX Lecture
Canonical Correlation Analysis
9.1. Introduction. The basics of the procedure
Assume we are interested in the association between two sets of random variables.
Typical examples include: the relation between a set of governmental policy variables and a set
of economic goal variables; the relation between college “performance” variables (like grades
in courses in five different subject-matter areas) and precollege “achievement” variables
(like high-school grade-point averages for junior and senior years, or the number of high-school
extracurricular activities); etc.
The way the above problem of measuring association is solved in Canonical Correlation
Analysis is to consider the largest possible correlation between a linear combination
of the variables in the first set and a linear combination of the variables in the second
set. The pair of linear combinations obtained through this maximization process is called
the first pair of canonical variables, and their correlation is called the first canonical correlation.
The process can be continued (similarly to the principal components procedure) to find
a second pair of linear combinations having the largest correlation among all pairs that
are uncorrelated with the initially selected pair. This gives us the second pair of
canonical variables with their second canonical correlation, etc. The maximization
performed at each step reflects our wish (again, as in principal components
analysis) to concentrate the initially high-dimensional relationship between the two sets of
variables into just a few pairs of canonical variables. Often even a single pair is considered.
The rationale in canonical correlation analysis is that when the number of variables
is large, interpreting the whole set of correlation coefficients between pairs of variables
from the two sets is hopeless, and in that case one should concentrate on a few carefully chosen
representative correlations. Finally, we should note that the traditional (simple) correlation
coefficient and the multiple correlation coefficient (Lecture V) are special cases of the
canonical correlation in which one or both sets contain a single variable.
9.2. Application in testing for independence of sets of variables
Besides being interesting in its own right (see 9.1), calculating canonical correlations
turns out to be important for testing independence of sets of random variables. Let us
remember that testing for independence and testing for uncorrelatedness are equivalent
problems in the multivariate normal case. Assume now that X ∼ Np(µ,Σ). Furthermore,
let X be partitioned into r and q components (r + q = p), with X^(1) ∈ R^r, X^(2) ∈ R^q,
and correspondingly, the covariance matrix

Σ = E(X − µ)(X − µ)′ =
    ( σ11 σ12 . . . σ1p )
    ( σ21 σ22 . . . σ2p )
    ( . . . . . . . . . )
    ( σp1 σp2 . . . σpp )  ∈ Mp,p

has also been partitioned accordingly into

Σ = ( Σ11  Σ12 )
    ( Σ21  Σ22 ).

We shall assume for simplicity that the matrices Σ, Σ11 and Σ22 are nonsingular. To test
H0 : Σ12 = 0 against a general alternative, a sensible way to go would be the following: for fixed vectors
a ∈ R^r, b ∈ R^q, let Z1 = a′X^(1) and Z2 = b′X^(2), giving

ρ_{a,b} = Corr(Z1, Z2) = a′Σ12b / √(a′Σ11a · b′Σ22b).
H0 is equivalent to H0 : ρ_{a,b} = 0 for all a ∈ R^r, b ∈ R^q. For a particular pair a, b, H0
would be accepted if

|r_{a,b}| = |a′S12b| / √(a′S11a · b′S22b) ≤ k

for a certain positive constant k. (Here the Sij are the corresponding data-based estimators
of the Σij.) Hence an appropriate acceptance region for H0 would be given in the form
{X ∈ Mp,n : max_{a,b} r_{a,b}^2 ≤ k^2}. But maximizing r_{a,b}^2 means finding the maximum
of (a′S12b)^2 under the constraints a′S11a = 1, b′S22b = 1, and this is exactly the data-based
version of the optimization problem to be solved in 9.1. For the goals in 9.1 and 9.2 to be
achieved, we need to solve problems of the following type:
9.3. Precise mathematical formulation and solution to the problem.
Canonical variables are the variables Z1 = a′X^(1) and Z2 = b′X^(2), where a ∈ R^r, b ∈ R^q
are obtained by maximizing (a′Σ12b)^2 under the constraints a′Σ11a = 1, b′Σ22b = 1. To
solve this maximization problem, we construct the Lagrangian

Lag(a, b, λ1, λ2) = (a′Σ12b)^2 + λ1(a′Σ11a − 1) + λ2(b′Σ22b − 1).
Partial differentiation with respect to the vectors a and b gives:
2(a′Σ12b)Σ12b + 2λ1Σ11a = 0 ∈ R^r    (9.1)
2(a′Σ12b)Σ21a + 2λ2Σ22b = 0 ∈ R^q    (9.2)
We multiply (9.1) by the vector a′ from the left and equation (9.2) by b′ from the left;
subtracting the two resulting equations, we get λ1 = λ2 = −(a′Σ12b)^2 = −µ^2. Hence:
Σ12b = µΣ11a (9.3)
and
Σ21a = µΣ22b (9.4)
hold. Now we first multiply (9.3) by Σ21Σ11^{-1} from the left, then both sides of (9.4) by the
scalar µ, and combine the two resulting equations to get:

(Σ21Σ11^{-1}Σ12 − µ^2Σ22)b = 0    (9.5)

For the homogeneous system (9.5) to have a nontrivial solution with respect to b,

|Σ21Σ11^{-1}Σ12 − µ^2Σ22| = 0    (9.6)
must hold. Then, of course,

|Σ22^{-1/2}| · |Σ21Σ11^{-1}Σ12 − µ^2Σ22| · |Σ22^{-1/2}| = |Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} − µ^2 Iq| = 0

must hold as well. This means that µ^2 has to be an eigenvalue of the matrix
Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2}.
Also, b = Σ22^{-1/2}b̂, where b̂ is the eigenvector of Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} corresponding to this
eigenvalue (WHY (?!)). (Note, however, that this representation is useful mainly for theoretical
purposes, the main advantage being that one is dealing with the eigenvalues of a
symmetric matrix. If doing calculations by hand, it is usually easier to calculate b directly
as a solution of the linear system (9.5), i.e. to find the largest eigenvalue of the (non-symmetric)
matrix Σ22^{-1}Σ21Σ11^{-1}Σ12 and then find the eigenvector b that corresponds to it.)
Besides, we also see from the definition of µ that µ^2 = (a′Σ12b)^2 holds. Since we wanted
to maximize the right-hand side, it is obvious that µ^2 must be chosen to be the largest
eigenvalue of the matrix Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} (or, which is the same thing, the largest
eigenvalue of the matrix Σ21Σ11^{-1}Σ12Σ22^{-1}). Finally, we can obtain the vector a from (9.3):
a = (1/µ)Σ11^{-1}Σ12b. That way, the first canonical variables Z1 = a′X^(1) and Z2 = b′X^(2) are
determined, and the value of the first canonical correlation is just µ. The orientation of
the vector b is chosen so that the sign of µ is positive.
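To make the above recipe concrete, here is a minimal SAS/IML sketch of the computation,
using a small hypothetical covariance matrix with r = q = 2 (the numbers are made up purely
for illustration). The eigenvalue problem is solved in the symmetric form via the Cholesky
factor of Σ22^{-1}, as discussed in 9.5 below:

proc iml;
/* hypothetical covariance blocks, r = q = 2, for illustration only */
s11 = {1.0 0.4, 0.4 1.0};
s22 = {1.0 0.2, 0.2 1.0};
s12 = {0.5 0.3, 0.6 0.4};
/* u`*u = inv(s22); then a_ = u*M*u` is symmetric with the same
   eigenvalues as inv(s22)*M, where M = s21*inv(s11)*s12 (cf. 9.5) */
u  = root(inv(s22));
a_ = u * t(s12) * inv(s11) * s12 * t(u);
call eigen(lam, vec, a_);          /* eigenvalues in decreasing order */
mu = sqrt(lam[1]);                 /* first canonical correlation     */
b  = t(u) * vec[,1];               /* back-transform the eigenvector  */
a  = (1/mu) * inv(s11) * s12 * b;  /* a from (9.3)                    */
print mu, a, b;
quit;

Note that b = U′b̂ here (rather than Σ22^{-1/2}b̂) because the Cholesky factor replaces the
symmetric square root; the normalizations a′Σ11a = b′Σ22b = 1 still hold.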
Now it is easy to see that if we want to extract a second pair of canonical variables,
we need to repeat the same process, starting from the second largest eigenvalue µ^2 of
the matrix Σ22^{-1/2}Σ21Σ11^{-1}Σ12Σ22^{-1/2} (or of the matrix Σ22^{-1}Σ21Σ11^{-1}Σ12). This will automatically
ensure that the second pair of canonical variables is uncorrelated with the first pair. The
process can theoretically be continued until the number of pairs of canonical variables
equals the number of variables in the smaller group. But in practice far fewer canonical
variables will be needed. Each canonical variable is uncorrelated with all the other
canonical variables of either set except for the one corresponding canonical variable in the
opposite set.
Note. It is important to point out that, by its very definition, the first canonical correlation
is at least as large as the multiple correlation between any single variable and the opposite set
of variables. It is in fact possible for the first canonical correlation to be very large while
all the multiple correlations of each separate variable with the opposite set of variables
are small. This once again underlines the importance of Canonical Correlation
analysis.
9.4. Estimating and testing canonical correlations
The way to estimate the canonical variables and the canonical correlation coefficients
is based on the plug-in technique: one follows the steps outlined in 9.3, each time
substituting Sij for Σij.
Let us now discuss the independence testing issue outlined in 9.2. The acceptance
region of the test of H0 in 9.2 would be {X ∈ Mp,n : largest eigenvalue of
S22^{-1/2}S21S11^{-1}S12S22^{-1/2} ≤ kα}, where kα has been worked out and is given in the so-called Heck's
charts. This distribution depends on three parameters: s = min(r, q), m = (|r − q| − 1)/2,
N = (n − r − q − 2)/2, n being the sample size. Besides using the charts, one can also use good
F-distribution-based approximations for (transformations of) this distribution, like Wilks'
lambda, Pillai's trace, the Lawley-Hotelling trace and Roy's greatest root. Here we shall only mention
that all these statistics and their P-values (using suitable F-distribution-based approximations)
are readily available in the output of the SAS procedure CANCORR, so that
performing the test is really easy: one can read the p-value directly from the SAS
output.
9.5. Some important computational issues
Note that calculating X^{-1/2} and X^{1/2} for a symmetric positive definite matrix X according
to the theoretically attractive spectral decomposition method may be numerically
unstable. This is especially the case when some of the eigenvalues are close to zero. In
many numerical packages, including SAS, the so-called Cholesky decomposition is
calculated instead. For a symmetric positive definite matrix X ∈ Mp,p, its Cholesky
decomposition is defined as an upper triangular matrix U such that U′U = X holds. Note
that in SAS/IML, when you write e.g. xroot = root(x), you do not get the
theoretical X^{1/2} as a result but rather the matrix U defined above. Looking back at (9.6), we
see that if U′U = Σ22^{-1} gives the Cholesky decomposition of the matrix Σ22^{-1}, then µ^2 is
an eigenvalue of the matrix A = UΣ21Σ11^{-1}Σ12U′. Indeed, multiplying (9.6) by |U| from the
left and by |U′| from the right, we get:

|A − µ^2UΣ22U′| = 0

But UΣ22U′ = U(U′U)^{-1}U′ = UU^{-1}(U′)^{-1}U′ = I holds.
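A minimal SAS/IML sketch of this identity, with a hypothetical Σ22 block (the numbers are
for illustration only):

proc iml;
s22 = {1.0 0.2, 0.2 1.0};     /* hypothetical Sigma22 block        */
u = root(inv(s22));           /* Cholesky: u`*u = inv(s22)         */
check1 = t(u)*u - inv(s22);   /* should be (numerically) zero      */
check2 = u*s22*t(u);          /* should be the identity matrix     */
print u, check1, check2;
quit;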
9.6. Numerical examples
The SAS procedure for performing Canonical Correlation Analysis is called CANCORR.
The following numerical examples will be discussed at lectures:
• Canonical Correlation Analysis of the Fitness Club Data. Three physiological
and three exercise variables were measured on twenty middle-aged men in a
fitness club. The CANCORR procedure is used to determine whether the physiological
variables are related in any way to the exercise variables.
• Example 10.4, p. 552 in Johnson and Wichern. Studying canonical correlations
between leg and head bone measurements: X1, X2 are skull length and skull
breadth, respectively; X3, X4 are leg bone measurements: femur and tibia length,
respectively. Observations have been taken on n = 276 White Leghorn chickens.
The example is chosen to also illustrate how a canonical correlation analysis can be
performed when the original data are not given but the empirical correlation matrix
(or empirical covariance matrix) is available. The following SAS file implements the
calculations.
options linesize=70; title ’Canonical Correlation Analysis’;
data chicken (type=corr);
input _type_ $ _name_ $ x1 x2 x3 x4;
cards;
n . 276 276 276 276
CORR x1 1.0 . . .
CORR x2 .505 1.0 . .
CORR x3 .569 .422 1.0 .
CORR x4 .602 .467 .926 1.0
;
proc cancorr data=chicken vprefix=head wprefix=leg;
var x1 x2; with x3 x4;
run;
If a covariance matrix were available instead and the analysis were to be performed
on the covariance matrix, then the modification of the above example would be to
write COV instead of CORR and to enter the elements of the covariance matrix
instead. After running the example, we get that U1 = .781Z1 + .345Z2, V1 =
.060Z3 + .944Z4 is the first pair of canonical variables. Here Zi, i = 1, 2, 3, 4, are the
standardized versions of Xi, i = 1, 2, 3, 4. Their correlation (the first canonical
correlation) is .631 and is quite significant.
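As a check, the first canonical correlation can also be computed directly in SAS/IML from
the correlation blocks entered above, using the symmetric eigenvalue form from 9.5; the
sketch below reproduces the value .631:

proc iml;
/* correlation blocks of the chicken data: head (x1,x2) vs leg (x3,x4) */
r11 = {1.0 0.505, 0.505 1.0};
r22 = {1.0 0.926, 0.926 1.0};
r12 = {0.569 0.602, 0.422 0.467};
u  = root(inv(r22));                     /* u`*u = inv(r22)          */
a_ = u * t(r12) * inv(r11) * r12 * t(u); /* symmetric form, cf. 9.5  */
lam = eigval(a_);
print (sqrt(lam[1]))[label="first canonical correlation"]; /* = .631 */
quit;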
Lecture IX-supplement
Computations used in the MANOVA tests
In standard (univariate) Analysis of Variance, with the usual normality assumptions on
the errors, testing for the effects of the factors involved in the model is based
on the F test. The F tests are derived from the ANOVA decomposition SST = SSA + SSE.
The argument goes as follows:
i) SSE and SSA are independent and (up to constant factors involving the variance σ^2 of the
errors) χ^2-distributed;
ii) after proper norming to account for degrees of freedom, SSE and SSA yield
statistics with the following behaviour: the normed SSE always delivers an unbiased
estimator of σ^2, no matter whether the hypothesis or the alternative is true; the normed SSA
delivers an unbiased estimator of σ^2 under the hypothesis but delivers an unbiased estimator of a
“larger” quantity under the alternative.
The above observation is crucial and motivates the F-testing: F statistics are (suitably
normed to account for degrees of freedom) ratios SSA/SSE. When taking the
ratio, the factors involving σ^2 cancel out, and σ^2 does not play any role in the distribution
of the ratio. Under H0 this distribution is F. When the null hypothesis is violated, the
same statistics will tend to have “larger” values than in the case when H0
is true. Hence significant (with respect to the corresponding F-distribution) values of the statistic
lead to rejection of H0.
Aiming at generalising these ideas to the Multivariate ANOVA (MANOVA) case, we
should note that instead of χ^2 distributions we now have to deal with Wishart distributions,
and we need to properly define (a proper functional of) the SSA/SSE ratio, which
would now be a “ratio” of matrices. Obviously, there is more than one way to define suitable
statistics in this context! It turns out that such functionals are related to the eigenvalues
of the (properly normed) Wishart-distributed matrices that enter the decomposition
SST = SSA + SSE in the multivariate case.
II.1. Roots distributions.
Let Yi, i = 1, 2, . . . , n, ∼ Np(µi, Σ). Then the data matrix

Y = Y(n×p) = (Y′1, Y′2, . . . , Y′n)′ = (Yij)_{i=1,...,n; j=1,...,p} = [U1|U2| . . . |Up]

is an n×p matrix containing the n p-dimensional (transposed) vectors as its rows;
U1, . . . , Up are its columns. Denote E(Y) = M, Var(vec(Y)) = Σ ⊗ In.
Let A, B be projectors such that Q1 = Y′AY and Q2 = Y′BY are two independent
Wishart matrices, Wp(v, Σ) and Wp(q, Σ), respectively. Although the theory is general, to
keep you on track, you could always think of a linear model example:
Y = Xβ + E,  Ŷ = Xβ̂,  A = In − X(X′X)^{-}X′,  B = X(X′X)^{-}X′ − 1n(1′n1n)^{-1}1′n,

where 1n denotes the n-vector of ones, and the corresponding decomposition

Y′[In − 1n(1′n1n)^{-1}1′n]Y = Y′BY + Y′AY = Q2 + Q1
of SST = SSA + SSE = Q2 + Q1, where Q2 is the “hypothesis matrix” and Q1 is the
“error matrix”.
Lemma II.1. Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then
the roots of the determinant equation |Q2 − θ(Q1 + Q2)| = 0 are related to the roots of
the equation |Q2 − λQ1| = 0 by λi = θi/(1 − θi) (or θi = λi/(1 + λi)).

Lemma II.2. Let Q1, Q2 ∈ Mp,p be two positive definite symmetric matrices. Then
the roots of the determinant equation |Q1 − v(Q1 + Q2)| = 0 are related to the roots of
the equation |Q2 − λQ1| = 0 by λi = (1 − vi)/vi (or vi = 1/(1 + λi)).
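A quick numerical check of Lemma II.1 in SAS/IML, with hypothetical positive definite
matrices Q1, Q2 (the symmetric-form trick from 9.5 is used so that eigval returns real
eigenvalues in decreasing order):

proc iml;
q1 = {2 1, 1 2};  q2 = {3 1, 1 1};   /* hypothetical p.d. matrices   */
/* roots of |Q2 - lambda*Q1| = 0: the eigenvalues of inv(Q1)*Q2      */
u1 = root(inv(q1));
lam = eigval(u1*q2*t(u1));
/* roots of |Q2 - theta*(Q1+Q2)| = 0                                 */
u12 = root(inv(q1+q2));
theta = eigval(u12*q2*t(u12));
/* Lemma II.1: the two columns printed below should coincide         */
print lam (theta/(1-theta))[label="theta/(1-theta)"];
quit;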
We can employ the above two lemmas to see the following: if λi, vi, θi are the roots of

|Q2 − λQ1| = 0,  |Q1 − v(Q1 + Q2)| = 0,  |Q2 − θ(Q1 + Q2)| = 0,

respectively, then

Λ = |Q1(Q1 + Q2)^{-1}| = ∏_{i=1}^{p} (1 + λi)^{-1}    (Wilks' criterion statistic), or

|Q2Q1^{-1}| = ∏_{i=1}^{p} λi = ∏_{i=1}^{p} (1 − vi)/vi = ∏_{i=1}^{p} θi/(1 − θi), or

|Q2(Q1 + Q2)^{-1}| = ∏_{i=1}^{p} θi = ∏_{i=1}^{p} λi/(1 + λi) = ∏_{i=1}^{p} (1 − vi),

and other functional transformations of these products of (random) roots would have a
distribution that depends only on p, q, v. There are various ways to choose such
functional transformations (statistics), and many have been suggested, like the following
(a SAS/IML sketch computing them is given after the list):
• Λ (Wilks' lambda)
• tr(Q2Q1^{-1}) = tr(Q1^{-1}Q2) = ∑_{i=1}^{p} λi (Lawley-Hotelling trace)
• max_i λi (Roy's criterion)
• V = tr[Q2(Q1 + Q2)^{-1}] = ∑_{i=1}^{p} λi/(1 + λi) (Pillai statistic)
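The following SAS/IML sketch computes all four statistics from hypothetical error and
hypothesis matrices Q1 and Q2 via the eigenvalues λi of Q1^{-1}Q2 (the numbers are made
up for illustration):

proc iml;
q1 = {4 1, 1 3};   /* hypothetical error matrix (E)         */
q2 = {2 1, 1 2};   /* hypothetical hypothesis matrix (H)    */
u  = root(inv(q1));
lam = eigval(u*q2*t(u));    /* eigenvalues of inv(Q1)*Q2     */
wilks  = (1/(1+lam))[#];    /* product of (1+lambda_i)^{-1}  */
lh     = lam[+];            /* sum of the lambda_i           */
roy    = lam[1];            /* largest eigenvalue            */
pillai = (lam/(1+lam))[+];  /* sum of lambda_i/(1+lambda_i)  */
wilks2 = det(q1)/det(q1+q2);  /* cross-check of Wilks' lambda */
print wilks wilks2 lh roy pillai;
quit;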
Tables and charts for their exact or approximate distributions are available. Also, P-values
for these statistics are readily calculated in SAS, SPSS, Minitab and other statistical
packages. In these applications, the meaning of Q1 is that of the “error matrix” (sometimes
also denoted by E) and the meaning of Q2 is that of the “hypothesis matrix” (sometimes
also denoted by H). The distribution of the statistics defined above depends on the following
three parameters:
• p = the number of responses
• q = νh = degrees of freedom for the hypothesis
• v = νe = degrees of freedom for the error
Based on these, the following quantities are calculated:

s = min(p, q),  m = 0.5(|p − q| − 1),  n = 0.5(v − p − 1),  r = v − 0.5(p − q + 1),
u = 0.25(pq − 2).

Moreover, we define t = √((p^2q^2 − 4)/(p^2 + q^2 − 5)) if p^2 + q^2 − 5 > 0, and t = 1 otherwise.
Let us order the eigenvalues of E^{-1}H = Q1^{-1}Q2 according to λ1 ≥ λ2 ≥ . . . ≥ λp. Then
the following distribution results are true (note: the F-statistics quoted below are exact
if s = 1 or 2; otherwise the F-distribution is an approximation; a SAS/IML sketch of
Rao's F computation is given after the list):
• Wilks' test. The test statistic, Wilks' lambda, is Λ = |E|/|E + H| = ∏_{i=1}^{p} 1/(1 + λi).
Then it holds that

F = [(1 − Λ^{1/t})/Λ^{1/t}] · (rt − 2u)/(pq) ∼ F_{pq, rt−2u} df    (Rao's F).
• Lawley-Hotelling trace test. The Lawley-Hotelling statistic is U = tr(E^{-1}H) =
λ1 + . . . + λp, and

F = 2(sn + 1)U/[s^2(2m + s + 1)] ∼ F_{s(2m+s+1), 2(sn+1)} df.
• Pillai's test. The test statistic, Pillai's trace, is V = tr[H(H + E)^{-1}] =
λ1/(1 + λ1) + . . . + λp/(1 + λp), and

F = [(2n + s + 1)/(2m + s + 1)] · V/(s − V) ∼ F_{s(2m+s+1), s(2n+s+1)} df.
• Roy's maximum root criterion. The test statistic is just the largest eigenvalue λ1.
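As an illustration, here is a SAS/IML sketch that computes the quantities above and Rao's
F approximation for Wilks' lambda; the values of p, q, v and of the observed Λ are
hypothetical:

proc iml;
p = 3;  q = 2;  v = 20;       /* hypothetical p, nu_h = q, nu_e = v  */
lambda = 0.4;                 /* hypothetical observed Wilks' lambda */
s = min(p,q);
m = 0.5*(abs(p-q)-1);
n = 0.5*(v-p-1);
r = v - 0.5*(p-q+1);
u = 0.25*(p*q-2);
if p*p + q*q - 5 > 0 then tt = sqrt((p*p*q*q-4)/(p*p+q*q-5));
else tt = 1;
lam_t = lambda##(1/tt);       /* Lambda^{1/t}                        */
F = ((1-lam_t)/lam_t) * (r*tt-2*u)/(p*q);
df1 = p*q;  df2 = r*tt - 2*u;
pvalue = 1 - probf(F, df1, df2);
print F df1 df2 pvalue;
quit;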
Finally, we shall mention one historically older and very universal approximation to
the distribution of the Λ statistic, due to Bartlett (1927). It holds that the critical level of

−[νe − (p − νh + 1)/2] ln Λ

equals c(p, νh, M) times the corresponding critical level of χ^2_{pνh}, where the constant
c(p, νh, M = νe − p + 1) is given in accompanying tables. Such tables are prepared for
levels α = 0.10, 0.05, 0.025, etc.
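A sketch of the corresponding calculation in SAS/IML, taking the correction constant c = 1
(i.e. the plain chi-square approximation); p, νh, νe and Λ are hypothetical values:

proc iml;
p = 3;  nu_h = 2;  nu_e = 20;  /* hypothetical dimensions and df       */
lambda = 0.4;                  /* hypothetical observed Wilks' lambda  */
stat = -(nu_e - (p - nu_h + 1)/2) * log(lambda);  /* Bartlett statistic */
pvalue = 1 - probchi(stat, p*nu_h);  /* chi-square with p*nu_h df      */
print stat pvalue;
quit;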
In the context of testing the hypothesis about significance of the first
canonical correlation, we have:

E = S22 − S21S11^{-1}S12,  H = S21S11^{-1}S12.

The Wilks statistic becomes |S|/(|S11||S22|) (!). We also see that in this case, if the µi^2
are the squared canonical correlations, then µ1^2 was defined as the maximal eigenvalue
of S22^{-1}H, that is, it is a solution of |(E + H)^{-1}H − µ1^2 I| = 0. However, setting
λ1 = µ1^2/(1 − µ1^2), we see that

|(E + H)^{-1}H − µ1^2 I| = 0 → |H − µ1^2(E + H)| = 0 → |H − (µ1^2/(1 − µ1^2))E| = 0 → |E^{-1}H − λ1I| = 0

holds, and λ1 is an eigenvalue of E^{-1}H. Similarly you can argue for the remaining
λi = µi^2/(1 − µi^2) values.
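These identities are easy to verify numerically; the following SAS/IML sketch does so for
the chicken correlation matrix from 9.6 (here S is the correlation matrix, since that
analysis was run on correlations):

proc iml;
r11 = {1.0 0.505, 0.505 1.0};
r22 = {1.0 0.926, 0.926 1.0};
r12 = {0.569 0.602, 0.422 0.467};
r   = (r11 || r12) // (t(r12) || r22);   /* full 4x4 matrix S        */
h = t(r12) * inv(r11) * r12;             /* hypothesis matrix H      */
e = r22 - h;                             /* error matrix E           */
wilks1 = det(e)/det(e+h);
wilks2 = det(r)/(det(r11)*det(r22));     /* should agree with wilks1 */
u   = root(inv(r22));
mu2 = eigval(u*h*t(u));                  /* squared canonical corrs  */
u2  = root(inv(e));
lam = eigval(u2*h*t(u2));                /* eigenvalues of inv(E)*H  */
/* lambda1 should equal mu1^2/(1-mu1^2)                              */
print wilks1 wilks2, (mu2[1]/(1-mu2[1]))[label="mu1sq/(1-mu1sq)"]
      (lam[1])[label="lambda1"];
quit;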
II.1.2. Comparisons, applications.
Of all the statistics discussed, Wilks' lambda has been the most widely applied. One
important reason for this is that this statistic has the virtue of being convenient to use
and, more importantly, of being related to the likelihood ratio test! Despite the above,
the fact that so many different statistics exist for the same hypothesis testing problem
indicates that there is no universally best test. Power comparisons of the above tests are
almost lacking, since the distribution of the statistics under alternatives is hardly known.