CS5487 Problem Set 7
Linear Dimensionality Reduction
Antoni of Computer Science
Copyright By PowCoder代写 加微信 powcoder
City University of
Dimensionality
Problem 7.1 Shell of a hypersphere
Consider a hypersphere S1 of radius r in Rd, and its shell of thickness �. The shell can be defined
as the region between S1 and another hypersphere S2 with radius r − �. Hence, the volume of the
V (shell) = V (S1)− V (S2) =
V (S1), (7.1)
where the volume of a d-dimensional hypersphere of radius r is
(a) Show that the ratio between the volumes of the two sphere is
and hence, for any 0 < � < r, = 0. (7.4) This shows that for high dimension V (shell) = V (S1). In other words, all the volume of the hypersphere is in its shell! . . . . . . . . . Principal Component Analysis Problem 7.2 PCA as minimizing reconstruction error In this problem, we will prove that PCA finds a representation that minimizes the reconstruction error of the data points. Let X = {x1, · · · , xn} be the data points, with xi ∈ Rd. Let the new representation (the PCA coefficients) be Z = {z1, · · · , zn}, with zi ∈ Rk. Denote Φ = [φ1, · · · , φk] ∈ Rd×k as a matrix of basis vectors φj , and c ∈ Rd as the offset. The basis vectors are constrained to be orthonormal, i.e., φTj φj = 1 and φ j φi = 0 for i 6= j, or succinctly ΦTΦ = I. The reconstruction of a data point from its lower-dimensional representation is x̂i = c+ Φzi = c+ φjzij , (7.5) where zij is the jth element of zi. The total reconstruction error is given by J(Z,Φ, c) = ‖xi − x̂i‖2 . (7.6) Hence, the goal is to find the optimal coefficients Z, basis matrix Φ, and offset c that minimize J , {Z∗,Φ∗, c∗} = argmin J(Z,Φ, c) s.t. ΦTΦ = I. (7.7) (a) Note that only Z depends on X, hence we can first solve for the optimal Z as a function of {c,Φ}. Show that the optimal coefficients Z that minimize J are T (xi − c). (7.8) (b) Substitute (7.8) into J to obtain a new objective function Ĵ(c,Φ), (xi − c)T (I − ΦΦT )(xi − c). (7.9) (c) Show that the offset c that minimizes Ĵ is xi, (7.10) i.e., the sample mean. (d) To find the optimal Φ∗, first show that the optimization of Ĵ can be rewritten as Φ∗ = argmin Ĵ(c∗,Φ) s.t. ΦTΦ = I (7.11) tr(ΦTΣΦ) s.t. ΦTΦ = I (7.12) φTi Σφi s.t. Φ TΦ = I, (7.13) where Σ = 1 i=1(xi − c∗)(xi − c∗)T is the sample covariance matrix. (e) Use Langrange multipliers to show that the stationary point for φj of the constrained optimiza- tion problem in (7.13) is an eigenvector of Σ, i.e. Σφj = λjφj , (7.14) where λj is the corresponding eigenvalue. Finally, using this fact and (7.12), show that Φ∗ = argmax tr(Λ) = argmax λi, (7.15) which shows that the optimal Φ∗ are the eigenvectors corresponding to the k largest eigenvalues . . . . . . . . . Problem 7.3 PCA as maximizing variance In this problem, we will show that PCA maximizes the variance of the projected data. Let X = {x1, · · · , xn} be the data points, with xi ∈ Rd. Define the set of projection directions as Φ = [φ1, · · · , φk] ∈ Rd×k, where the φi are orthonormal, i.e., φTj φj = 1 and φTj φi = 0 for i 6= j, or succinctly ΦTΦ = I. Let x̄ = 1 i xi be the sample mean of X. The projected data is Z = {z1, · · · , zn}, with zi ∈ Rk, where T (xi − x̄). (7.16) The goal is to maximize the variance of the projected data Z by selecting the optimal Φ, Φ∗ = argmax (zi − z̄)(zi − z̄)T s.t. ΦTΦ = I, (7.17) where z̄ = 1 i zi is the sample mean of Z. (a) Show that z̄ = 0. (b) Use (a) to show that (7.17) is equivalent to (7.12). Hence, maximizing the variance of the projected data is equivalent to minimizing the reconstruction error, leading to the PCA formu- . . . . . . . . . Problem 7.4 PCA implementation using SVD In this problem, we will consider implementing PCA using the singular value decomposition (SVD, see Problem 1.17). Let X = [x1, · · · , xn] be the matrix of data points with sample mean and covariance, (xi − µ)(xi − µ)T . (7.18) Let X̄ be the matrix of mean-subtracted points, X̄ = [x1 − µ, · · · , xn − µ], (7.19) and construct the SVD of X̄, X̄ = USV T . (7.20) (a) Show that X̄ can be written as X̄ = X(I − 1 11T ), (7.21) where 1 is the vector of ones. (b) Show that {ui, s } is an eigenvector/eigenvalue pair of Σ, where ui is the ith column of U . (c) Rewrite the PCA training algorithm to use the SVD. . . . . . . . . . Problem 7.5 PCA and classification Consider a classification problem with two classes y ∈ {1, 2} with class probabilities, p(y = 1) = p(y = 2) = 1/2, (7.22) and Gaussian class conditionals, p(x|y = j) = N (x|µj ,Σj). (7.23) (a) The random variable x is not Gaussian, but we can still compute its mean and covariance. µx = E[x] = (µ1 + µ2), (7.24) Σx = E[(x− µx)(x− µx)T ] = (Σ1 + Σ2) + (µ1 − µ2)(µ1 − µ2)T (7.25) (b) Consider the 2-dimensional case, where µ1 = −µ2 = µ = with α > 0, and
Σ1 = Σ2 = Γ =
Use the principal component analysis of x to determine the best 1-dimensional subspace, i.e.
the transformation
z = φTx (7.28)
where φ is the eigenvector corresponding to the largest eigenvalue of Σx. For this problem,
is the PCA approach to dimensionality reduction a good one from the classification point of
view? Explain why.
. . . . . . . . .
Linear Discriminant Analysis
Problem 7.6 Fisher’s linear discriminant
Suppose we have feature vectors from two classes, where the sample mean and scatter matrix is
given by {µ1, S1} for class 1, and {µ2, S2} for class 2. The Fisher’s linear discriminant (FLD) finds
the optimal projection that maximizes the ratio of the “between-class” scatter and the “within-
class” scatter,
w∗ = argmax
J(w), J(w) =
where SB and SW are the between- and within-class scatter matrices,
SB = (µ1 − µ2)(µ1 − µ2)T , SW = S1 + S2. (7.30)
In this problem, we will derive the optimal w∗.
(a) Because (7.29) is maximizing a ratio in J(w), it suffices to fix the denominator to some value
and maximize the numerator. Use Lagrange multipliers to rewrite (7.29) as a constrained
optimization problem where wTSWw = 1.
(b) Show that the stationary point of the constrained optimization problem is given by the gener-
alized eigenvalue problem,
SBw = λSWw, (7.31)
where λ is the Lagrange multiplier.
(c) Finally, assuming that SW is invertible, show that the optimal w
∗ is given by
w∗ ∝ S−1W (µ1 − µ2). (7.32)
Hint: we are only interested in the direction of the best projection w to maximize J(w); the
length of w does not affect the ratio.
. . . . . . . . .
Problem 7.7 Fisher’s linear discriminant as least-squares regression
In this problem, we will show that Fisher’s linear discriminant is equivalent to least-squares regres-
sion for a particular set of target values. Let X = {x1, · · · , xn} be the data points with xi ∈ Rd,
and C1 be the set of points in class 1 (with size n1), and likewise for class 2 (C2, n2). Define the
following sample statistics of the data for j ∈ {1, 2},
(xi − µj)(xi − µj)T . (7.33)
We want to learn a linear projection of the data to separate the two classes,
Txi + b, (7.34)
where zi ∈ R is the projected data point, w ∈ Rd is the projection vector and b ∈ R is the bias
term. We will formulate this as a regression problem. For each data point xi, we define the target
value as the reciprocal of the class probability,
The best projection is then obtained by minimizing the least-squares error between the projection
zi and the target ti,
{w∗, b∗} = argmin
E(w, b), (7.36)
(zi − ti)2 =
(wTxi + b− ti)2. (7.37)
This is a linear regression problem where the target values ti are based on the class of each point.
(a) Show that
i=1 ti = 0.
(b) Show that the optimal b∗ can be expressed as a function of w,
b = −wTµ, µ = 1
(n1µ1 + n2µ2). (7.38)
(c) Using (b), show that a stationary point for w satistfies
SB)w = n(µ1 − µ2), (7.39)
where SW and SB are the within- and between-class matrices,
SW = S1 + S2, SB = (µ2 − µ1)(µ2 − µ1)T . (7.40)
(d) Finally, show that the direction of the optimal w can be written as,
w∗ ∝ S−1W (µ2 − µ1) (7.41)
Hint: note that SBw is always in the direction of µ2 − µ1.
. . . . . . . . .
Probabilistic PCA and Factor Analysis
Problem 7.8 Probabilistic PCA: the model
In this problem, we will consider a more probabilistic version of PCA, called probabilistic PCA
(PPCA). Let x ∈ Rd be the observation and z ∈ Rk the lower-dimensional representation (called
a latent variable). We assume the observation x is generated from latent variable z using a linear
transformation and mean offset, with some Gaussian observation noise,
x = Φz + µ+ �, (7.42)
where Φ ∈ Rd×k is the linear transformation, µ ∈ Rd the mean vector, and � ∼ N (0, σ2I) the
observation noise. (Note that we do not impose an orthonormal constraint on the columns of Φ).
Hence, the conditional distribution of the observation x given z is
p(x|z) = N (x|Φz + µ, σ2I). (7.43)
To complete the probabilistic model, we assume a Gaussian prior distribution on the latent variables,
p(z) = N (z|0, I). (7.44)
Note that this probabilistic PCA is a slightly different formulation than standard PCA, in that we
are directly modeling the generation of the data from the latent space representation. In contrast,
PCA learns the mapping from observations to latent space.
In this problem we will derive the marginal, joint likelihood, and posterior of the PPCA
model. In the next problem, we will consider learning the model using maximum likelihood.
(a) Show that the marginal distribution of x is
p(x|z)p(z)dz = N (x|µ,ΦΦT + σ2I). (7.45)
Hint: use Problem 1.9.
(b) Looking at p(x), what is the major difference between the models of PPCA and PCA?
(c) Show that the joint likelihood of x and z is
p(x, z) = N (
ΦΦT + σ2I ΦT
Hint: define a = [xT , zT ]T and find its mean and covariance.
(d) Show that the posterior likelihood of z given x is
p(z|x) = N (z|µz|x,Σz|x), (7.47)
T (ΦΦT + σ2I)−1(x− µ), (7.48)
Σz|x = I − ΦT (ΦΦT + σ2I)−1Φ. (7.49)
Hint: complete the square.
(e) Use the matrix inverse identities in Problem 1.15 to show that
−1ΦT (x− µ), Σz|x = σ2M−1, M = ΦTΦ + σ2I. (7.50)
Hence, calculation of p(z|x) requires inverting only a k× k matrix, rather than a d× d matrix.
. . . . . . . . .
Problem 7.9 Probabilistic PCA: EM learning
Given a set of data points X = {x1, · · · , xn}, the goal is to learn the parameters {Φ, µ, σ2} of the
probabilistic PCA model (see Problem 7.8) that maximize the log-likelihood of the data,
log p(X) =
log p(xi) =
logN (xi|µ,ΦΦT + σ2I). (7.51)
In this case taking the derivative and solving for the parameters will be quite complex (it is pos-
sible though). Alternatively, because our model has latent variables Z = {z1, · · · , zn} that are
unobserved, we can also find the maximum likelihood parameters using the EM algorithm.
(a) Show that the Q function can be written as
Q(θ, θ̂) = E
Z|X,θ̂[log p(X,Z|θ)] (7.52)
‖xi − µ‖2 −
T (xi − µ) +
where the required expectations for the E-step are
ẑi = EZ|X,θ̂[zi] = M
−1ΦT (xi − µ) (7.54)
P̂i = EZ|X,θ̂[ziz
2M−1 + ẑiẑ
i , (7.55)
and the parameters used to calculate these expectations are from the old model θ̂.
(b) Show that the M-step is given by
(xi − µ)ẑTi
‖xi − µ‖2 − 2ẑTi ΦT∗ (xi − µ) + tr(P̂iΦT∗ Φ∗)
(c) Taking σ2 → 0 yields the standard PCA model. Let Ẑ = [ẑ1, · · · , ẑn] be the matrix of estimated
latent variables, and X̄ = [x1−µ, · · · , xn−µ] the matrix of mean-subtracted data points. Show
that the resulting EM algorithm for σ → 0 is,
PCA E− step : Ẑ = (ΦTΦ)−1ΦT X̄ (7.58)
PCA M− step : Φ = X̄ẐT (ẐẐT )−1 (7.59)
Intuitively, what are the E- and M-steps doing in this algorithm?
(d) Analyze the complexity of the EM algorithm for PCA versus using eigen-decomposition for
PCA. Recall that computing: k eigenvalues of a d× d matrix is O(kd2); the inverse of a k × k
matrix is O(k3); and the outer product of two vectors (Ra and Rb) is O(ab). When will it be
advantageous to use EM over the standard formulation?
. . . . . . . . .
Problem 7.10 Factor Analysis
Factor analysis (FA) is closely related to probabilistic PCA (Problem 7.8). The main difference is
that the observation noise is assumed to have a diagonal covariance matrix, rather than an isotropic
form, i.e � ∼ N (0|Ψ,), where Ψ is a diagonal covariance matrix. The conditional distribution of x
p(x|z) = N (x|Φz + µ,Ψ). (7.60)
In essence, FA is also capturing correlations in the data using Φ, while modeling the observation
noise of individual features as independent, but with possibly different variance. In FA literature,
the columns of Φ are called factor loadings, and the diagonal elements of Ψ are called uniquenesses.
Given a set of samples X = {x1, · · · , xn}, the parameters of the FA model can be learned
using the EM algorithm, similar to probabilistic PCA.
(a) Show that the E-step for FA is to calculate:
G = (I + ΦTΨ−1Φ)−1, (7.61)
TΨ−1(xi − µ), (7.62)
P̂i = G+ ẑiẑ
i . (7.63)
(b) Show that the M-step for FA is:
(xi − µ)ẑTi
(xi − µ)(xi − µ)T − Φ∗
ẑi(xi − µ)T
where the diag operator sets all non-diagonal elements of a matrix to zero.
. . . . . . . . .
Canonical Correlation Analysis
Problem 7.11 Canonical correlation analysis (CCA)
Canonical correlation analysis (CCA) is a dimensionality-reduction method similar to PCA. While
PCA deals with only one data space, CCA is method for jointly reducing the dimension across two
spaces. Each dimension of the lower-dimensional representation corresponds to correlated directions
in the data spaces (i.e, the two directions describe the same thing).
Formally, let x ∈ Rd and y ∈ RD be a pair of observation random variables (with different
dimensions), with covariances matrices
Σxx = E[(x− µx)(x− µx)T ] (7.66)
Σyy = E[(y − µy)(y − µy)T ] (7.67)
Σxy = E[(x− µx)(y − µy)T ] = ΣTyx, (7.68)
where Σxy is the covariance between the two variables. The covariance matrices can be computed
using sample data pairs {xi, yi}ni=1. Let zx and zy be linear projections of x and y,
x x, zy = w
y y, (7.69)
where wx ∈ Rd and wy ∈ RD are the projection directions. The goal is to find {wx, wy} that
maximizes the correlation between zx and zy,
{w∗x, w∗y} = argmax
wx 6=0,wy 6=0
cov(zx, zy)√
var(zx)var(zy)
In this problem, we will derive the solution using the SVD.
(a) Show that the covariances and variances in (7.71) are:
cov(zx, zy) = w
xΣxywy, (7.72)
var(zx) = w
xΣxxwx, (7.73)
var(zy) = w
y Σyywy. (7.74)
Hence, the correlation function is
(b) Perform a change of basis on (7.75),
xx wx, ŵy = Σ
yy wy, (7.76)
where A1/2 is the matrix square-root (A = A1/2A1/2), and derive an equivalent optimization
problem to (7.70),
{u∗, v∗} = argmax
uTΣ−1/2xx ΣxyΣ
yy v, s.t. u
Tu = 1, vT v = 1. (7.77)
∗, w∗y = Σ
(c) Consider the SVD decomposition
Σ−1/2xx ΣxyΣ
T . (7.79)
Show that (7.77) is maximized with u∗ = u1 and v
∗ = v1, where u1 and v1 are the left and
right singular vectors corresponding to the largest singular value (i.e., the first columns of U
. . . . . . . . .
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com