CS5487 Problem Set 7
Linear Dimensionality Reduction
Antoni Chan
Department of Computer Science
City University of Hong Kong

Dimensionality

Problem 7.1 Shell of a hypersphere
Consider a hypersphere $S_1$ of radius $r$ in $\mathbb{R}^d$, and its shell of thickness $\epsilon$. The shell can be defined as the region between $S_1$ and another hypersphere $S_2$ with radius $r - \epsilon$. Hence, the volume of the shell is

$V(\text{shell}) = V(S_1) - V(S_2) = \left(1 - \frac{V(S_2)}{V(S_1)}\right) V(S_1),$   (7.1)

where the volume of a $d$-dimensional hypersphere of radius $r$ is

$V(S) = \frac{\pi^{d/2} r^d}{\Gamma(\frac{d}{2} + 1)}.$   (7.2)

(a) Show that the ratio between the volumes of the two spheres is

$\frac{V(S_2)}{V(S_1)} = \left(1 - \frac{\epsilon}{r}\right)^d,$   (7.3)
and hence, for any $0 < \epsilon < r$,

$\lim_{d \to \infty} \frac{V(S_2)}{V(S_1)} = 0.$   (7.4)

This shows that for high dimension $V(\text{shell}) \approx V(S_1)$. In other words, all the volume of the hypersphere is in its shell!
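The result is easy to verify numerically. The following short Python sketch (not part of the original problem; $r$ and $\epsilon$ are arbitrary example values) evaluates the ratio in (7.3) for increasing $d$:

```python
# Numerical check of (7.3)-(7.4): the fraction of volume NOT in the shell
# vanishes as d grows. r and eps are arbitrary example values.
r, eps = 1.0, 0.01
for d in [1, 10, 100, 1000, 10000]:
    ratio = (1.0 - eps / r) ** d          # V(S2)/V(S1) from (7.3)
    print(f"d={d:6d}  V(S2)/V(S1)={ratio:.4e}  V(shell)/V(S1)={1.0 - ratio:.4f}")
```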
.........

Principal Component Analysis

Problem 7.2 PCA as minimizing reconstruction error
In this problem, we will prove that PCA finds a representation that minimizes the reconstruction error of the data points. Let $X = \{x_1, \cdots, x_n\}$ be the data points, with $x_i \in \mathbb{R}^d$. Let the new representation (the PCA coefficients) be $Z = \{z_1, \cdots, z_n\}$, with $z_i \in \mathbb{R}^k$. Denote $\Phi = [\phi_1, \cdots, \phi_k] \in \mathbb{R}^{d \times k}$ as a matrix of basis vectors $\phi_j$, and $c \in \mathbb{R}^d$ as the offset. The basis vectors are constrained to be orthonormal, i.e., $\phi_j^T \phi_j = 1$ and $\phi_j^T \phi_i = 0$ for $i \neq j$, or succinctly $\Phi^T \Phi = I$. The reconstruction of a data point from its lower-dimensional representation is

$\hat{x}_i = c + \Phi z_i = c + \sum_{j=1}^k \phi_j z_{ij},$   (7.5)

where $z_{ij}$ is the $j$th element of $z_i$.

The total reconstruction error is given by

$J(Z, \Phi, c) = \sum_{i=1}^n \|x_i - \hat{x}_i\|^2.$   (7.6)

Hence, the goal is to find the optimal coefficients $Z$, basis matrix $\Phi$, and offset $c$ that minimize $J$,

$\{Z^*, \Phi^*, c^*\} = \operatorname{argmin}_{Z, \Phi, c} J(Z, \Phi, c) \quad \text{s.t. } \Phi^T \Phi = I.$   (7.7)

(a) Note that only $Z$ depends on $X$, hence we can first solve for the optimal $Z$ as a function of $\{c, \Phi\}$. Show that the optimal coefficients $Z$ that minimize $J$ are

$z_i^* = \Phi^T (x_i - c).$   (7.8)

(b) Substitute (7.8) into $J$ to obtain a new objective function $\hat{J}(c, \Phi)$,

$\hat{J}(c, \Phi) = \sum_{i=1}^n (x_i - c)^T (I - \Phi \Phi^T)(x_i - c).$   (7.9)

(c) Show that the offset $c$ that minimizes $\hat{J}$ is

$c^* = \frac{1}{n} \sum_{i=1}^n x_i,$   (7.10)

i.e., the sample mean.

(d) To find the optimal $\Phi^*$, first show that the optimization of $\hat{J}$ can be rewritten as

$\Phi^* = \operatorname{argmin}_{\Phi} \hat{J}(c^*, \Phi) \quad \text{s.t. } \Phi^T \Phi = I$   (7.11)

$= \operatorname{argmax}_{\Phi} \operatorname{tr}(\Phi^T \Sigma \Phi) \quad \text{s.t. } \Phi^T \Phi = I$   (7.12)

$= \operatorname{argmax}_{\Phi} \sum_{j=1}^k \phi_j^T \Sigma \phi_j \quad \text{s.t. } \Phi^T \Phi = I,$   (7.13)

where $\Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - c^*)(x_i - c^*)^T$ is the sample covariance matrix.

(e) Use Lagrange multipliers to show that the stationary point for $\phi_j$ of the constrained optimization problem in (7.13) is an eigenvector of $\Sigma$, i.e.,

$\Sigma \phi_j = \lambda_j \phi_j,$   (7.14)

where $\lambda_j$ is the corresponding eigenvalue. Finally, using this fact and (7.12), show that

$\Phi^* = \operatorname{argmax}_{\Phi} \operatorname{tr}(\Lambda) = \operatorname{argmax}_{\Phi} \sum_{j=1}^k \lambda_j,$   (7.15)

where $\Lambda = \operatorname{diag}(\lambda_1, \cdots, \lambda_k)$ collects the corresponding eigenvalues, which shows that the optimal $\Phi^*$ are the eigenvectors corresponding to the $k$ largest eigenvalues of $\Sigma$.
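For reference, a minimal NumPy sketch of the training procedure implied by (7.8)-(7.15), assuming the data points are stored as the columns of a $d \times n$ array (function and variable names are illustrative, not from the problem set):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the sample covariance.

    X : (d, n) array whose columns are the data points x_i.
    Returns (Phi, c, Z): basis (d, k), offset c = sample mean (d,),
    and coefficients Z (k, n) with z_i = Phi^T (x_i - c), as in (7.8)/(7.10).
    """
    d, n = X.shape
    c = X.mean(axis=1)                      # optimal offset (7.10)
    Xc = X - c[:, None]                     # mean-subtracted data
    Sigma = (Xc @ Xc.T) / n                 # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)    # ascending eigenvalues
    Phi = evecs[:, ::-1][:, :k]             # top-k eigenvectors, cf. (7.15)
    Z = Phi.T @ Xc                          # optimal coefficients (7.8)
    return Phi, c, Z

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Phi, c, Z = pca_eig(X, k=2)
X_hat = c[:, None] + Phi @ Z                # reconstruction (7.5)
print("reconstruction error:", np.sum((X - X_hat) ** 2))
```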
.........
Problem 7.3 PCA as maximizing variance
In this problem, we will show that PCA maximizes the variance of the projected data. Let $X = \{x_1, \cdots, x_n\}$ be the data points, with $x_i \in \mathbb{R}^d$. Define the set of projection directions as $\Phi = [\phi_1, \cdots, \phi_k] \in \mathbb{R}^{d \times k}$, where the $\phi_i$ are orthonormal, i.e., $\phi_j^T \phi_j = 1$ and $\phi_j^T \phi_i = 0$ for $i \neq j$, or succinctly $\Phi^T \Phi = I$. Let $\bar{x} = \frac{1}{n} \sum_i x_i$ be the sample mean of $X$. The projected data is $Z = \{z_1, \cdots, z_n\}$, with $z_i \in \mathbb{R}^k$, where

$z_i = \Phi^T (x_i - \bar{x}).$   (7.16)

The goal is to maximize the variance of the projected data $Z$ by selecting the optimal $\Phi$,

$\Phi^* = \operatorname{argmax}_{\Phi} \operatorname{tr}\left(\frac{1}{n} \sum_{i=1}^n (z_i - \bar{z})(z_i - \bar{z})^T\right) \quad \text{s.t. } \Phi^T \Phi = I,$   (7.17)

where $\bar{z} = \frac{1}{n} \sum_i z_i$ is the sample mean of $Z$.

(a) Show that $\bar{z} = 0$.

(b) Use (a) to show that (7.17) is equivalent to (7.12). Hence, maximizing the variance of the projected data is equivalent to minimizing the reconstruction error, leading to the PCA formulation.
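The equivalence can be sanity-checked numerically: by (a) and (b), the objective in (7.17) evaluated at the optimal $\Phi$ equals the sum of the $k$ largest eigenvalues of the sample covariance. A small sketch on toy data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 500))               # toy data; columns are data points
xbar = X.mean(axis=1, keepdims=True)
Sigma = (X - xbar) @ (X - xbar).T / X.shape[1]

k = 3
evals, evecs = np.linalg.eigh(Sigma)
Phi = evecs[:, ::-1][:, :k]                 # top-k eigenvectors of Sigma

Z = Phi.T @ (X - xbar)                      # projected data (7.16); Z.mean(axis=1) ~ 0 (part a)
proj_var = np.trace(Z @ Z.T / X.shape[1])   # objective in (7.17), since z_bar = 0
print(proj_var, evals[::-1][:k].sum())      # the two numbers agree
```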
.........

Problem 7.4 PCA implementation using SVD
In this problem, we will consider implementing PCA using the singular value decomposition (SVD, see Problem 1.17). Let $X = [x_1, \cdots, x_n]$ be the matrix of data points with sample mean and covariance,
$\mu = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T.$   (7.18)

Let $\bar{X}$ be the matrix of mean-subtracted points,

$\bar{X} = [x_1 - \mu, \cdots, x_n - \mu],$   (7.19)

and construct the SVD of $\bar{X}$,

$\bar{X} = U S V^T.$   (7.20)

(a) Show that $\bar{X}$ can be written as

$\bar{X} = X\left(I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^T\right),$   (7.21)

where $\mathbf{1}$ is the vector of ones.
(b) Show that $\{u_i, \tfrac{1}{n} s_i^2\}$ is an eigenvector/eigenvalue pair of $\Sigma$, where $u_i$ is the $i$th column of $U$ and $s_i$ is the $i$th singular value.
(c) Rewrite the PCA training algorithm to use the SVD.
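One possible implementation sketch for part (c) (not the official solution), using NumPy's SVD and assuming the data points are the columns of a $d \times n$ array:

```python
import numpy as np

def pca_svd(X, k):
    """PCA training via the SVD of the mean-subtracted data matrix (7.20).

    X : (d, n) array whose columns are the data points.
    Returns (Phi, c, Z); by part (b), the columns of U are eigenvectors of
    the sample covariance with eigenvalues s_i^2 / n.
    """
    c = X.mean(axis=1)
    Xbar = X - c[:, None]                       # X_bar = X (I - 11^T / n), eq. (7.21)
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    Phi = U[:, :k]                              # principal directions
    Z = Phi.T @ Xbar                            # PCA coefficients
    return Phi, c, Z
```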
.........
Problem 7.5 PCA and classification
Consider a classification problem with two classes $y \in \{1, 2\}$ with class probabilities,

$p(y = 1) = p(y = 2) = 1/2,$   (7.22)

and Gaussian class conditionals,

$p(x|y = j) = \mathcal{N}(x | \mu_j, \Sigma_j).$   (7.23)

(a) The random variable $x$ is not Gaussian, but we can still compute its mean and covariance. Show that

$\mu_x = \mathbb{E}[x] = \tfrac{1}{2}(\mu_1 + \mu_2),$   (7.24)

and

$\Sigma_x = \mathbb{E}[(x - \mu_x)(x - \mu_x)^T] = \tfrac{1}{2}(\Sigma_1 + \Sigma_2) + \tfrac{1}{4}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T.$   (7.25)
(b) Consider the 2-dimensional case, where

$\mu_1 = -\mu_2 = \mu = \begin{bmatrix} \alpha \\ 0 \end{bmatrix},$   (7.26)

with $\alpha > 0$, and

$\Sigma_1 = \Sigma_2 = \Sigma = \begin{bmatrix} 1 & 0 \\ 0 & \sigma^2 \end{bmatrix}.$   (7.27)

Use the principal component analysis of $x$ to determine the best 1-dimensional subspace, i.e., the transformation

$z = \phi^T x,$   (7.28)

where $\phi$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma_x$. For this problem, is the PCA approach to dimensionality reduction a good one from the classification point of view? Explain why.
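A small simulation of part (b) can help build intuition (a sketch only; the particular values of $\alpha$ and $\sigma$ are arbitrary choices with $\sigma^2 > 1 + \alpha^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, sigma = 0.5, 2.0                      # arbitrary values with sigma^2 > 1 + alpha^2
mu1, mu2 = np.array([alpha, 0.0]), np.array([-alpha, 0.0])
Sig = np.diag([1.0, sigma ** 2])

n = 2000
X1 = rng.multivariate_normal(mu1, Sig, size=n)
X2 = rng.multivariate_normal(mu2, Sig, size=n)
X = np.vstack([X1, X2])

Sigma_x = np.cov(X, rowvar=False, bias=True)  # sample version of Sigma_x, cf. (7.25)
evals, evecs = np.linalg.eigh(Sigma_x)
phi = evecs[:, -1]                            # top PCA direction
print("PCA direction:", phi)                  # ~ [0, 1] up to sign: the noise axis,
                                              # not the class-separating axis [1, 0]
```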
………
Linear Discriminant Analysis
Problem 7.6 Fisher’s linear discriminant
Suppose we have feature vectors from two classes, where the sample mean and scatter matrix are given by $\{\mu_1, S_1\}$ for class 1, and $\{\mu_2, S_2\}$ for class 2. Fisher's linear discriminant (FLD) finds the optimal projection that maximizes the ratio of the "between-class" scatter and the "within-class" scatter,

$w^* = \operatorname{argmax}_w J(w), \qquad J(w) = \frac{w^T S_B w}{w^T S_W w},$   (7.29)

where $S_B$ and $S_W$ are the between- and within-class scatter matrices,

$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T, \qquad S_W = S_1 + S_2.$   (7.30)

In this problem, we will derive the optimal $w^*$.

(a) Because (7.29) is maximizing a ratio in $J(w)$, it suffices to fix the denominator to some value and maximize the numerator. Use Lagrange multipliers to rewrite (7.29) as a constrained optimization problem where $w^T S_W w = 1$.

(b) Show that the stationary point of the constrained optimization problem is given by the generalized eigenvalue problem,

$S_B w = \lambda S_W w,$   (7.31)

where $\lambda$ is the Lagrange multiplier.

(c) Finally, assuming that $S_W$ is invertible, show that the optimal $w^*$ is given by

$w^* \propto S_W^{-1}(\mu_1 - \mu_2).$   (7.32)

Hint: we are only interested in the direction of the best projection $w$ to maximize $J(w)$; the length of $w$ does not affect the ratio.
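For reference, the closed-form direction in (7.32) is straightforward to compute; a minimal illustrative sketch (the function name and data layout are assumptions, not from the problem set):

```python
import numpy as np

def fld_direction(X1, X2):
    """Fisher's linear discriminant direction w* ~ S_W^{-1}(mu1 - mu2), eq. (7.32).

    X1, X2 : (n1, d) and (n2, d) arrays of samples from the two classes.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # class scatter matrices
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)      # direction only; scale is irrelevant
    return w / np.linalg.norm(w)
```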
………
Problem 7.7 Fisher’s linear discriminant as least-squares regression
In this problem, we will show that Fisher's linear discriminant is equivalent to least-squares regression for a particular set of target values. Let $X = \{x_1, \cdots, x_n\}$ be the data points with $x_i \in \mathbb{R}^d$, and $C_1$ be the set of points in class 1 (with size $n_1$), and likewise for class 2 ($C_2$, $n_2$). Define the following sample statistics of the data for $j \in \{1, 2\}$,

$\mu_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i, \qquad S_j = \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^T.$   (7.33)

We want to learn a linear projection of the data to separate the two classes,

$z_i = w^T x_i + b,$   (7.34)

where $z_i \in \mathbb{R}$ is the projected data point, $w \in \mathbb{R}^d$ is the projection vector and $b \in \mathbb{R}$ is the bias term. We will formulate this as a regression problem. For each data point $x_i$, we define the target value as the reciprocal of the class probability,

$t_i = \begin{cases} \frac{n}{n_1}, & y_i = 1, \\ -\frac{n}{n_2}, & y_i = 2. \end{cases}$   (7.35)

The best projection is then obtained by minimizing the least-squares error between the projection $z_i$ and the target $t_i$,

$\{w^*, b^*\} = \operatorname{argmin}_{w, b} E(w, b),$   (7.36)

$E(w, b) = \frac{1}{2} \sum_{i=1}^n (z_i - t_i)^2 = \frac{1}{2} \sum_{i=1}^n (w^T x_i + b - t_i)^2.$   (7.37)

This is a linear regression problem where the target values $t_i$ are based on the class of each point.
(a) Show that $\sum_{i=1}^n t_i = 0$.

(b) Show that the optimal $b^*$ can be expressed as a function of $w$,

$b = -w^T \mu, \qquad \mu = \frac{1}{n} \sum_{i=1}^n x_i = \frac{1}{n}(n_1 \mu_1 + n_2 \mu_2).$   (7.38)

(c) Using (b), show that a stationary point for $w$ satisfies

$\left(S_W + \frac{n_1 n_2}{n} S_B\right) w = n(\mu_1 - \mu_2),$   (7.39)

where $S_W$ and $S_B$ are the within- and between-class matrices,

$S_W = S_1 + S_2, \qquad S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T.$   (7.40)

(d) Finally, show that the direction of the optimal $w$ can be written as

$w^* \propto S_W^{-1}(\mu_2 - \mu_1).$   (7.41)

Hint: note that $S_B w$ is always in the direction of $\mu_2 - \mu_1$.
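The claimed equivalence can be checked numerically on toy data (a sketch, not part of the problem statement): the least-squares solution with targets (7.35) should be parallel to $S_W^{-1}(\mu_2 - \mu_1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, d = 60, 40, 4
X1 = rng.normal(size=(n1, d)) + 1.0         # toy class-1 samples
X2 = rng.normal(size=(n2, d)) - 1.0         # toy class-2 samples
X = np.vstack([X1, X2]); n = n1 + n2
t = np.concatenate([np.full(n1, n / n1), np.full(n2, -n / n2)])  # targets (7.35)

A = np.hstack([X, np.ones((n, 1))])         # design matrix for [w; b]
wb, *_ = np.linalg.lstsq(A, t, rcond=None)  # minimizes (7.37)
w_ls = wb[:d]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w_fld = np.linalg.solve(Sw, mu2 - mu1)

# The two directions are parallel: cosine similarity ~ +/- 1.
cos = w_ls @ w_fld / (np.linalg.norm(w_ls) * np.linalg.norm(w_fld))
print("cosine similarity:", cos)
```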
………
Probabilistic PCA and Factor Analysis
Problem 7.8 Probabilistic PCA: the model
In this problem, we will consider a more probabilistic version of PCA, called probabilistic PCA (PPCA). Let $x \in \mathbb{R}^d$ be the observation and $z \in \mathbb{R}^k$ the lower-dimensional representation (called a latent variable). We assume the observation $x$ is generated from latent variable $z$ using a linear transformation and mean offset, with some Gaussian observation noise,

$x = W z + \mu + \epsilon,$   (7.42)

where $W \in \mathbb{R}^{d \times k}$ is the linear transformation, $\mu \in \mathbb{R}^d$ the mean vector, and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ the observation noise. (Note that we do not impose an orthonormal constraint on the columns of $W$.) Hence, the conditional distribution of the observation $x$ given $z$ is

$p(x|z) = \mathcal{N}(x | W z + \mu, \sigma^2 I).$   (7.43)

To complete the probabilistic model, we assume a Gaussian prior distribution on the latent variables,

$p(z) = \mathcal{N}(z | 0, I).$   (7.44)

Note that this probabilistic PCA is a slightly different formulation than standard PCA, in that we are directly modeling the generation of the data from the latent space representation. In contrast, PCA learns the mapping from observations to latent space.

In this problem we will derive the marginal, joint likelihood, and posterior of the PPCA model. In the next problem, we will consider learning the model using maximum likelihood.
(a) Show that the marginal distribution of $x$ is

$p(x) = \int p(x|z)\, p(z) \, dz = \mathcal{N}(x | \mu, W W^T + \sigma^2 I).$   (7.45)

Hint: use Problem 1.9.

(b) Looking at $p(x)$, what is the major difference between the models of PPCA and PCA?

(c) Show that the joint likelihood of $x$ and $z$ is

$p(x, z) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} W W^T + \sigma^2 I & W \\ W^T & I \end{bmatrix} \right).$   (7.46)

Hint: define $a = [x^T, z^T]^T$ and find its mean and covariance.

(d) Show that the posterior likelihood of $z$ given $x$ is

$p(z|x) = \mathcal{N}(z | \mu_{z|x}, \Sigma_{z|x}),$   (7.47)

$\mu_{z|x} = W^T (W W^T + \sigma^2 I)^{-1} (x - \mu),$   (7.48)

$\Sigma_{z|x} = I - W^T (W W^T + \sigma^2 I)^{-1} W.$   (7.49)

Hint: complete the square.

(e) Use the matrix inverse identities in Problem 1.15 to show that

$\mu_{z|x} = M^{-1} W^T (x - \mu), \qquad \Sigma_{z|x} = \sigma^2 M^{-1}, \qquad M = W^T W + \sigma^2 I.$   (7.50)

Hence, calculation of $p(z|x)$ requires inverting only a $k \times k$ matrix, rather than a $d \times d$ matrix.
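A quick numerical check of the identities in part (e), with randomly generated stand-in parameters (a sketch; `W`, `mu`, and `sigma2` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 8, 3
W = rng.normal(size=(d, k))                  # linear transformation
mu = rng.normal(size=d)
sigma2 = 0.5
x = rng.normal(size=d)                       # an arbitrary observation

# Direct form, (7.48)-(7.49): inverts a d x d matrix.
C = W @ W.T + sigma2 * np.eye(d)
mu_z_direct = W.T @ np.linalg.solve(C, x - mu)
Sig_z_direct = np.eye(k) - W.T @ np.linalg.solve(C, W)

# Identity form, (7.50): inverts only the k x k matrix M.
M = W.T @ W + sigma2 * np.eye(k)
mu_z = np.linalg.solve(M, W.T @ (x - mu))
Sig_z = sigma2 * np.linalg.inv(M)

print(np.allclose(mu_z, mu_z_direct), np.allclose(Sig_z, Sig_z_direct))  # True True
```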
.........

Problem 7.9 Probabilistic PCA: EM learning

Given a set of data points $X = \{x_1, \cdots, x_n\}$, the goal is to learn the parameters $\{W, \mu, \sigma^2\}$ of the probabilistic PCA model (see Problem 7.8) that maximize the log-likelihood of the data,

$\log p(X) = \sum_{i=1}^n \log p(x_i) = \sum_{i=1}^n \log \mathcal{N}(x_i | \mu, W W^T + \sigma^2 I).$   (7.51)

In this case taking the derivative and solving for the parameters will be quite complex (it is possible though). Alternatively, because our model has latent variables $Z = \{z_1, \cdots, z_n\}$ that are unobserved, we can also find the maximum likelihood parameters using the EM algorithm.
(a) Show that the Q function can be written as

$Q(\theta, \hat{\theta}) = \mathbb{E}_{Z|X,\hat{\theta}}[\log p(X, Z | \theta)]$   (7.52)

$= -\sum_{i=1}^n \left\{ \tfrac{d}{2} \log \sigma^2 + \tfrac{1}{2} \operatorname{tr}(\hat{P}_i) + \tfrac{1}{2\sigma^2} \|x_i - \mu\|^2 - \tfrac{1}{\sigma^2} \hat{z}_i^T W^T (x_i - \mu) + \tfrac{1}{2\sigma^2} \operatorname{tr}(\hat{P}_i W^T W) \right\},$   (7.53)
where the required expectations for the E-step are

$\hat{z}_i = \mathbb{E}_{Z|X,\hat{\theta}}[z_i] = M^{-1} W^T (x_i - \mu),$   (7.54)

$\hat{P}_i = \mathbb{E}_{Z|X,\hat{\theta}}[z_i z_i^T] = \sigma^2 M^{-1} + \hat{z}_i \hat{z}_i^T,$   (7.55)

and the parameters used to calculate these expectations are from the old model $\hat{\theta}$.

(b) Show that the M-step is given by

$\mu^* = \frac{1}{n} \sum_{i=1}^n x_i, \qquad W^* = \left[ \sum_{i=1}^n (x_i - \mu) \hat{z}_i^T \right] \left[ \sum_{i=1}^n \hat{P}_i \right]^{-1},$   (7.56)

$\sigma^{2*} = \frac{1}{nd} \sum_{i=1}^n \left\{ \|x_i - \mu\|^2 - 2 \hat{z}_i^T W^{*T} (x_i - \mu) + \operatorname{tr}(\hat{P}_i W^{*T} W^*) \right\}.$   (7.57)

(c) Taking $\sigma^2 \to 0$ yields the standard PCA model. Let $\hat{Z} = [\hat{z}_1, \cdots, \hat{z}_n]$ be the matrix of estimated latent variables, and $\bar{X} = [x_1 - \mu, \cdots, x_n - \mu]$ the matrix of mean-subtracted data points. Show that the resulting EM algorithm for $\sigma^2 \to 0$ is,

PCA E-step: $\hat{Z} = (W^T W)^{-1} W^T \bar{X}$   (7.58)

PCA M-step: $W = \bar{X} \hat{Z}^T (\hat{Z} \hat{Z}^T)^{-1}$   (7.59)

Intuitively, what are the E- and M-steps doing in this algorithm?

(d) Analyze the complexity of the EM algorithm for PCA versus using eigen-decomposition for PCA. Recall that computing $k$ eigenvalues of a $d \times d$ matrix is $O(kd^2)$; the inverse of a $k \times k$ matrix is $O(k^3)$; and the outer product of two vectors (in $\mathbb{R}^a$ and $\mathbb{R}^b$) is $O(ab)$. When will it be advantageous to use EM over the standard formulation?
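For intuition, here is a minimal sketch of the limiting algorithm (7.58)-(7.59) on toy data (initialization, iteration count, and the data-generating choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 10, 2, 500
# Toy data with clearly different variances per coordinate; columns are points.
X = rng.normal(size=(d, n)) * np.arange(1.0, d + 1.0)[:, None]
mu = X.mean(axis=1, keepdims=True)
Xbar = X - mu                                 # mean-subtracted data matrix

W = rng.normal(size=(d, k))                   # arbitrary initialization
for _ in range(100):
    Z = np.linalg.solve(W.T @ W, W.T @ Xbar)  # PCA E-step (7.58)
    W = Xbar @ Z.T @ np.linalg.inv(Z @ Z.T)   # PCA M-step (7.59)

# span(W) should match the top-k eigenvector subspace of the sample covariance
# (W itself need not be orthonormal).
evals, evecs = np.linalg.eigh(Xbar @ Xbar.T / n)
Q, _ = np.linalg.qr(W)                        # orthonormal basis for span(W)
print(np.linalg.svd(Q.T @ evecs[:, -k:], compute_uv=False))  # ~ [1, 1]
```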
………
Problem 7.10 Factor Analysis
Factor analysis (FA) is closely related to probabilistic PCA (Problem 7.8). The main difference is that the observation noise is assumed to have a diagonal covariance matrix, rather than an isotropic form, i.e., $\epsilon \sim \mathcal{N}(0, \Psi)$, where $\Psi$ is a diagonal covariance matrix. The conditional distribution of $x$ is then

$p(x|z) = \mathcal{N}(x | W z + \mu, \Psi).$   (7.60)

In essence, FA is also capturing correlations in the data using $W$, while modeling the observation noise of individual features as independent, but with possibly different variances. In FA literature, the columns of $W$ are called factor loadings, and the diagonal elements of $\Psi$ are called uniquenesses.

Given a set of samples $X = \{x_1, \cdots, x_n\}$, the parameters of the FA model can be learned using the EM algorithm, similar to probabilistic PCA.

(a) Show that the E-step for FA is to calculate:

$G = (I + W^T \Psi^{-1} W)^{-1},$   (7.61)

$\hat{z}_i = G W^T \Psi^{-1} (x_i - \mu),$   (7.62)

$\hat{P}_i = G + \hat{z}_i \hat{z}_i^T.$   (7.63)

(b) Show that the M-step for FA is:

$\mu^* = \frac{1}{n} \sum_{i=1}^n x_i, \qquad W^* = \left[ \sum_{i=1}^n (x_i - \mu) \hat{z}_i^T \right] \left[ \sum_{i=1}^n \hat{P}_i \right]^{-1},$   (7.64)

$\Psi^* = \operatorname{diag}\left\{ \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T - W^* \frac{1}{n} \sum_{i=1}^n \hat{z}_i (x_i - \mu)^T \right\},$   (7.65)

where the diag operator sets all non-diagonal elements of a matrix to zero.
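A compact sketch of these FA updates on toy data (a sketch under the notation above; initialization and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 6, 2, 1000
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # toy correlated data, rows are points
mu = X.mean(axis=0)
Xc = X - mu

W = rng.normal(size=(d, k))
Psi = np.ones(d)                                  # diagonal of Psi

for _ in range(200):
    # E-step (7.61)-(7.63); W / Psi[:, None] is Psi^{-1} W since Psi is diagonal
    G = np.linalg.inv(np.eye(k) + W.T @ (W / Psi[:, None]))
    Zhat = Xc @ (W / Psi[:, None]) @ G            # rows are zhat_i^T
    Psum = n * G + Zhat.T @ Zhat                  # sum_i Phat_i
    # M-step (7.64)-(7.65)
    W = (Xc.T @ Zhat) @ np.linalg.inv(Psum)
    S = Xc.T @ Xc / n
    Psi = np.diag(S - W @ (Zhat.T @ Xc) / n)

print("estimated uniquenesses:", np.round(Psi, 3))
```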
.........

Canonical Correlation Analysis

Problem 7.11 Canonical correlation analysis (CCA)
Canonical correlation analysis (CCA) is a dimensionality-reduction method similar to PCA. While PCA deals with only one data space, CCA is a method for jointly reducing the dimension across two spaces. Each dimension of the lower-dimensional representation corresponds to correlated directions in the data spaces (i.e., the two directions describe the same thing).

Formally, let $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^D$ be a pair of observation random variables (with different dimensions), with covariance matrices
$\Sigma_{xx} = \mathbb{E}[(x - \mu_x)(x - \mu_x)^T],$   (7.66)

$\Sigma_{yy} = \mathbb{E}[(y - \mu_y)(y - \mu_y)^T],$   (7.67)

$\Sigma_{xy} = \mathbb{E}[(x - \mu_x)(y - \mu_y)^T] = \Sigma_{yx}^T,$   (7.68)

where $\Sigma_{xy}$ is the covariance between the two variables. The covariance matrices can be computed using sample data pairs $\{x_i, y_i\}_{i=1}^n$. Let $z_x$ and $z_y$ be linear projections of $x$ and $y$,

$z_x = w_x^T x, \qquad z_y = w_y^T y,$   (7.69)

where $w_x \in \mathbb{R}^d$ and $w_y \in \mathbb{R}^D$ are the projection directions. The goal is to find $\{w_x, w_y\}$ that maximizes the correlation between $z_x$ and $z_y$,

$\{w_x^*, w_y^*\} = \operatorname{argmax}_{w_x \neq 0,\, w_y \neq 0} \rho,$   (7.70)

$\rho = \frac{\operatorname{cov}(z_x, z_y)}{\sqrt{\operatorname{var}(z_x) \operatorname{var}(z_y)}}.$   (7.71)

In this problem, we will derive the solution using the SVD.

(a) Show that the covariances and variances in (7.71) are:

$\operatorname{cov}(z_x, z_y) = w_x^T \Sigma_{xy} w_y,$   (7.72)

$\operatorname{var}(z_x) = w_x^T \Sigma_{xx} w_x,$   (7.73)

$\operatorname{var}(z_y) = w_y^T \Sigma_{yy} w_y.$   (7.74)

Hence, the correlation function is

$\rho = \frac{w_x^T \Sigma_{xy} w_y}{\sqrt{(w_x^T \Sigma_{xx} w_x)(w_y^T \Sigma_{yy} w_y)}}.$   (7.75)
(b) Perform a change of basis on (7.75),

$\hat{w}_x = \Sigma_{xx}^{1/2} w_x, \qquad \hat{w}_y = \Sigma_{yy}^{1/2} w_y,$   (7.76)

where $A^{1/2}$ is the matrix square-root ($A = A^{1/2} A^{1/2}$), and derive an equivalent optimization problem to (7.70),

$\{u^*, v^*\} = \operatorname{argmax}_{u, v} \; u^T \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} v \quad \text{s.t. } u^T u = 1, \; v^T v = 1,$   (7.77)

$w_x^* = \Sigma_{xx}^{-1/2} u^*, \qquad w_y^* = \Sigma_{yy}^{-1/2} v^*.$   (7.78)

(c) Consider the SVD decomposition

$\Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2} = U S V^T.$   (7.79)

Show that (7.77) is maximized with $u^* = u_1$ and $v^* = v_1$, where $u_1$ and $v_1$ are the left and right singular vectors corresponding to the largest singular value (i.e., the first columns of $U$ and $V$).
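A minimal NumPy sketch of the resulting CCA procedure (illustrative only; it assumes the sample covariance matrices are invertible and computes matrix square roots by eigendecomposition):

```python
import numpy as np

def inv_sqrt(A):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    evals, evecs = np.linalg.eigh(A)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def cca_first_pair(X, Y):
    """First pair of canonical directions via the SVD, eqs. (7.76)-(7.79).

    X : (n, d) and Y : (n, D) arrays of paired samples (rows are pairs).
    Returns (w_x, w_y, rho) for the top canonical correlation.
    """
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy = Xc.T @ Xc / n, Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)     # Sigma_xx^{-1/2} Sigma_xy Sigma_yy^{-1/2}
    U, s, Vt = np.linalg.svd(T)
    w_x = inv_sqrt(Sxx) @ U[:, 0]               # back-transform, eq. (7.78)
    w_y = inv_sqrt(Syy) @ Vt[0, :]
    return w_x, w_y, s[0]                       # s[0] is the top canonical correlation
```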
………