CS5487 Problem Set 1
Probability Theory and Linear Algebra Review
Antoni Chan
Department of Computer Science, City University of Hong Kong
Probability Theory
Problem 1.1 Linear Transformation of a Random Variable
Let x be a random variable on R, and a, b ∈ R. Let y = ax + b be the linear transformation of x. Show the following properties:
E[y] = aE[x] + b,   (1.1)
var(y) = a^2 var(x).   (1.2)
Now, let x be a vector r.v. on R^d, and A ∈ R^{m×d}, b ∈ R^m. Let y = Ax + b be the linear transformation of x. Show the following properties:
E[y] = AE[x] + b,   (1.3)
cov(y) = A cov(x) A^T.   (1.4)
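Before proving (1.1)–(1.4), it can be reassuring to check them numerically. Below is a minimal Monte Carlo sketch (assuming NumPy; the distribution of x and the sizes are arbitrary choices for illustration) comparing the sample mean and covariance of y = Ax + b against AE[x] + b and A cov(x) A^T.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 3, 2, 500_000

# An arbitrary (non-Gaussian) random vector x, sampled row-wise.
x = rng.exponential(scale=2.0, size=(n, d)) + rng.normal(size=(n, d))

A = rng.normal(size=(m, d))
b = rng.normal(size=m)
y = x @ A.T + b                          # y = Ax + b applied to every sample

# E[y] vs A E[x] + b (should agree up to Monte Carlo error)
print(y.mean(axis=0))
print(A @ x.mean(axis=0) + b)

# cov(y) vs A cov(x) A^T
print(np.cov(y, rowvar=False))
print(A @ np.cov(x, rowvar=False) @ A.T)
```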
………
Problem 1.2 Properties of Independence
Let x and y be statistically independent random variables (x ⊥ y). Show the following properties:
E[xy] = E[x]E[y],   (1.5)
cov(x, y) = 0.   (1.6)
………
Problem 1.3 Uncorrelated vs Independence
Two random variables x and y are said to be uncorrelated if their covariance is 0, i.e., cov(x, y) = 0. For statistically independent random variables, their covariance is always 0 (see Problem 1.2), and hence independent random variables are always uncorrelated. However, the converse is generally not true; uncorrelated random variables are not necessarily independent.
Consider the following example. Let the pair of random variables (x,y) take values (1,0), (0, 1), (−1, 0), and (0, −1), each with equal probability (1/4).
(a) Show that cov(x, y) = 0, and hence x and y are uncorrelated.
(b) Calculate the marginal distributions, p(x) and p(y). Show that p(x, y) ≠ p(x)p(y), and thus x and y are not independent.
(c) Now consider a more general example. Assume that x and y satisfy
E[x|y] = E[x], (1.7)
i.e., the mean of x is the same regardless of whether y is known or not (the above example satisfies this property). Show that x and y are uncorrelated.
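As a numerical sanity check for parts (a)–(c) (a NumPy sketch, not a substitute for the derivations), the code below enumerates the four equally likely outcomes, computes cov(x, y) directly, compares the joint probabilities with the product of the marginals, and evaluates E[x|y] for each value of y.

```python
import numpy as np

# The four equally likely outcomes (x, y) from the example above.
pts = np.array([(1, 0), (0, 1), (-1, 0), (0, -1)], dtype=float)
p = np.full(4, 0.25)
x, y = pts[:, 0], pts[:, 1]

# (a) cov(x, y) = E[xy] - E[x]E[y] = 0, so x and y are uncorrelated
print("cov(x, y) =", p @ (x * y) - (p @ x) * (p @ y))

# (b) the joint p(x, y) differs from the product of the marginals p(x) p(y)
for vx in (-1, 0, 1):
    for vy in (-1, 0, 1):
        p_joint = p[(x == vx) & (y == vy)].sum()
        p_prod = p[x == vx].sum() * p[y == vy].sum()
        print(f"p(x={vx:+.0f}, y={vy:+.0f}) = {p_joint:.4f}  vs  p(x)p(y) = {p_prod:.4f}")

# (c) E[x | y] = E[x] = 0 for every observed value of y
for vy in (-1, 0, 1):
    mask = y == vy
    print(f"E[x | y={vy:+.0f}] =", p[mask] @ x[mask] / p[mask].sum())
```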
………
Problem 1.4 Sum of Random Variables
Let x and y be random variables (possibly dependent). Show the following property:
E[x + y] = E[x] + E[y].   (1.8)
Furthermore, if x and y are statistically independent (x ⊥ y), show that
var(x + y) = var(x) + var(y).   (1.9)
However, in general this is not the case when x and y are dependent.
………
Problem 1.5 Expectation of an Indicator Variable
Let x be an indicator variable on {0, 1}. Show that
E[x] = p(x = 1),   (1.10)
var(x) = p(x = 0) p(x = 1).   (1.11)
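A quick empirical check of (1.10)–(1.11), assuming NumPy; the value p = 0.3 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3                                    # arbitrary choice of p(x = 1)
x = rng.binomial(1, p, size=1_000_000)     # samples of the indicator variable

print(x.mean(), p)                         # E[x]   vs  p(x = 1)
print(x.var(), p * (1 - p))                # var(x) vs  p(x = 0) p(x = 1)
```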
………
Problem 1.6 Multivariate Gaussian
The multivariate Gaussian is a probability density over real vectors, x = [x_1, ···, x_d]^T ∈ R^d, which is parameterized by a mean vector μ ∈ R^d and a covariance matrix Σ ∈ S^d_+ (i.e., a d-dimensional positive-definite symmetric matrix). The density function is
p(x) = N(x|μ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) e^{-(1/2)∥x−μ∥^2_Σ},   (1.12)
where |Σ| is the determinant of Σ, and
∥x−μ∥^2_Σ = (x−μ)^T Σ^{-1} (x−μ)   (1.13)
is the Mahalanobis distance. In this problem, we will look at how different covariance matrices affect the shape of the density.
First, consider the case where Σ is a diagonal matrix, i.e., the off-diagonal entries are 0,
Σ = diag(σ_1^2, ···, σ_d^2).   (1.14)
(a) Show that with a diagonal covariance matrix, the multivariate Gaussian is equivalent to assuming that the elements of the vector are independent, and each is distributed as a univariate Gaussian, i.e.,
N(x|μ, Σ) = ∏_{i=1}^d N(x_i|μ_i, σ_i^2).   (1.15)
Hint: the following properties of diagonal matrices will be useful:
|Σ| = ∏_{i=1}^d σ_i^2,   Σ^{-1} = diag(1/σ_1^2, ···, 1/σ_d^2).   (1.16)
(b) Plot the Mahalanobis distance term and probability density function for a 2-dimensional Gaussian with μ = [0, 0]^T and Σ = [1, 0; 0, 0.25]. How is the shape of the density affected by the diagonal terms?
(c) Plot the Mahalanobis distance term and pdf when the variances of each dimension are the same, e.g., μ = [0, 0]^T and Σ = [1, 0; 0, 1]. This is sometimes called an i.i.d. (independently and identically distributed) covariance matrix, isotropic covariance matrix, or circular covariance matrix.
Next, we will consider the general case for the covariance matrix.
(d) Let {λ_i, v_i} be the eigenvalue/eigenvector pairs of Σ, i.e.,
Σ v_i = λ_i v_i,   i ∈ {1, ···, d}.   (1.17)
Show that Σ can be written as
Σ = V Λ V^T,   (1.18)
where V = [v_1, ···, v_d] is the matrix of eigenvectors, and Λ = diag(λ_1, ···, λ_d) is a diagonal matrix of the eigenvalues.
(e) Let y = V^T(x − μ). Show that the Mahalanobis distance ∥x − μ∥^2_Σ can be rewritten as ∥y∥^2_Λ, i.e., a Mahalanobis distance with a diagonal covariance matrix. Hence, in the space of y, the multivariate Gaussian has a diagonal covariance matrix. (Hint: use Problem 1.12)
(f) Consider the transformation from y to x: x = V y + μ. What is the effect of V and μ?
(g) Plot the Mahalanobis distance term and probability density function for a 2-dimensional Gaussian with μ = [0, 0]^T and Σ = [0.625, 0.375; 0.375, 0.625]. How is the shape of the density affected by the eigenvectors and eigenvalues of Σ?
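One possible way to produce the plots requested in parts (b), (c), and (g) is sketched below, using NumPy and Matplotlib (the grid range and figure layout are arbitrary choices). The Mahalanobis term (1.13) is evaluated on a grid and the density follows from (1.12); the final print shows the eigenvalues/eigenvectors of the covariance from part (g), which connect to parts (d)–(f).

```python
import numpy as np
import matplotlib.pyplot as plt

def mahalanobis_and_pdf(mu, Sigma, lim=3.0, num=200):
    """Evaluate ||x - mu||^2_Sigma and N(x|mu, Sigma) of (1.12)-(1.13) on a 2-D grid."""
    g = np.linspace(-lim, lim, num)
    X1, X2 = np.meshgrid(g, g)
    diff = np.stack([X1 - mu[0], X2 - mu[1]], axis=-1)          # shape (num, num, 2)
    maha = np.einsum('...i,ij,...j->...', diff, np.linalg.inv(Sigma), diff)
    pdf = np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    return X1, X2, maha, pdf

mu = np.zeros(2)
covariances = {
    '(b) diagonal':  np.array([[1.0, 0.0], [0.0, 0.25]]),
    '(c) isotropic': np.array([[1.0, 0.0], [0.0, 1.0]]),
    '(g) general':   np.array([[0.625, 0.375], [0.375, 0.625]]),
}

fig, axes = plt.subplots(2, 3, figsize=(12, 7), subplot_kw={'aspect': 'equal'})
for j, (name, Sigma) in enumerate(covariances.items()):
    X1, X2, maha, pdf = mahalanobis_and_pdf(mu, Sigma)
    axes[0, j].contour(X1, X2, maha)
    axes[0, j].set_title('Mahalanobis, ' + name)
    axes[1, j].contour(X1, X2, pdf)
    axes[1, j].set_title('pdf, ' + name)

# For (g): the eigenvectors/eigenvalues of Sigma give the axes of the elliptical
# contours, which relates to parts (d)-(f).
print(np.linalg.eigh(covariances['(g) general']))
plt.tight_layout()
plt.show()
```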
………
Problem 1.7 Product of Gaussian Distributions
Show that the product of two Gaussian distributions, N(x|μ_1, σ_1^2) and N(x|μ_2, σ_2^2), is a scaled Gaussian,
N(x|μ_1, σ_1^2) N(x|μ_2, σ_2^2) = Z N(x|μ_3, σ_3^2),   (1.19)
where
μ_3 = σ_3^2 (σ_1^{-2} μ_1 + σ_2^{-2} μ_2),   (1.20)
σ_3^2 = 1 / (σ_1^{-2} + σ_2^{-2}),   (1.21)
Z = 1 / √(2π(σ_1^2 + σ_2^2)) · e^{-(1/2)(μ_1 − μ_2)^2 / (σ_1^2 + σ_2^2)}   (1.22)
  = N(μ_1|μ_2, σ_1^2 + σ_2^2).
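The identity (1.19)–(1.22) can be checked pointwise before deriving it; below is a small sketch assuming SciPy is available (the parameter values are arbitrary).

```python
import numpy as np
from scipy.stats import norm

mu1, s1 = 1.0, 0.8          # arbitrary means and standard deviations
mu2, s2 = -0.5, 1.5

s3sq = 1.0 / (1.0 / s1**2 + 1.0 / s2**2)                      # (1.21)
mu3 = s3sq * (mu1 / s1**2 + mu2 / s2**2)                      # (1.20)
Z = norm.pdf(mu1, loc=mu2, scale=np.sqrt(s1**2 + s2**2))      # (1.22)

x = np.linspace(-5, 5, 1001)
lhs = norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2)
rhs = Z * norm.pdf(x, mu3, np.sqrt(s3sq))
print(np.max(np.abs(lhs - rhs)))     # maximum pointwise difference; ~ machine precision
```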
………
Problem 1.8 Product of Multivariate Gaussian Distributions
Show that the product of two d-dimensional multivariate Gaussian distributions, N(x|a, A) and N(x|b, B), is a scaled multivariate Gaussian,
N(x|a, A) N(x|b, B) = Z N(x|c, C),   (1.23)
where
c = C(A^{-1} a + B^{-1} b),   (1.24)
C = (A^{-1} + B^{-1})^{-1},   (1.25)
Z = 1 / ((2π)^{d/2} |A + B|^{1/2}) · e^{-(1/2)(a−b)^T (A+B)^{-1} (a−b)}   (1.26)
  = N(a|b, A + B).
Hint: after expanding the exponent term, apply the result from Problem 1.10 and (1.35).
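A corresponding numerical spot check of (1.23)–(1.26), assuming SciPy; the vectors and covariance matrices below are arbitrary random choices.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(2)
d = 3
a, b = rng.normal(size=d), rng.normal(size=d)
A = np.cov(rng.normal(size=(d, 5 * d)))     # random symmetric positive-definite
B = np.cov(rng.normal(size=(d, 5 * d)))     # covariance matrices

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))        # (1.25)
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))       # (1.24)
Z = mvn.pdf(a, mean=b, cov=A + B)                             # (1.26)

x = rng.normal(size=d)                                        # any test point
lhs = mvn.pdf(x, mean=a, cov=A) * mvn.pdf(x, mean=b, cov=B)
rhs = Z * mvn.pdf(x, mean=c, cov=C)
print(lhs, rhs)                                               # should agree
```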
………
Problem 1.9 Correlation between Gaussian Distributions
Using the result from Problem 1.8, show that the correlation between two multivariate Gaussian distributions is
∫ N(x|a, A) N(x|b, B) dx = N(a|b, A + B).   (1.27)
………
Linear Algebra
Problem 1.10 Completing the square
Let x, b ∈ R^n, A ∈ S^n, c ∈ R, and let f(x) be a quadratic function of x,
f(x) = x^T A x − 2 x^T b + c.   (1.28)
Show that f(x) can be rewritten in the form
f(x) = (x − d)^T A (x − d) + e,   (1.29)
where
d = A^{-1} b,   (1.30)
e = c − d^T A d = c − b^T A^{-1} b.   (1.31)
Rewriting the quadratic function in (1.28) as (1.29) is a procedure known as “completing the square”, which is very useful when dealing with products of Gaussian distributions.
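A small numerical check of (1.28)–(1.31), assuming NumPy (A is an arbitrary symmetric positive-definite matrix, and x, b, c are random):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)         # arbitrary symmetric positive-definite A
b = rng.normal(size=n)
c = rng.normal()

d = np.linalg.solve(A, b)           # d = A^{-1} b           (1.30)
e = c - b @ d                       # e = c - b^T A^{-1} b   (1.31)

x = rng.normal(size=n)
f1 = x @ A @ x - 2 * x @ b + c      # original quadratic     (1.28)
f2 = (x - d) @ A @ (x - d) + e      # completed square       (1.29)
print(f1, f2)                       # the two forms agree
```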
………
Problem 1.11 Eigenvalues
Let {λ_i} be the eigenvalues of A ∈ R^{n×n}. Derive the following properties:
tr(A) = ∑_{i=1}^n λ_i,   |A| = ∏_{i=1}^n λ_i.   (1.32)
………
Problem 1.12 Eigenvalues of an inverse matrix
Let {λ_i, x_i} be the eigenvalue/eigenvector pairs of A ∈ R^{n×n}. Show that {1/λ_i, x_i} are the eigenvalue/eigenvector pairs of A^{-1}.
………
Problem 1.13 Positive definiteness
Derive the following properties:
1. A symmetric matrix A ∈ S^n is positive definite if all its eigenvalues are greater than zero.
2. For any matrix A ∈ R^{m×n}, G = A^T A is positive semidefinite.
………
Problem 1.14 Positive definiteness of inner product and outer product matrices
Let X = [x_1, ···, x_n] ∈ R^{d×n} be a matrix of n column vectors x_i ∈ R^d. We can think of each column vector x_i as a sample in our dataset X.
(a) outer-product: Prove that Σ = XX^T is always positive semi-definite (Σ ≽ 0). When will Σ be strictly positive definite?
(b) inner-product: Prove that G = X^T X is always positive semi-definite. When will G be strictly positive definite?
Note: If {x_1, ···, x_n} are zero-mean samples, then Σ is n times the sample covariance matrix. G is sometimes called a Gram matrix or kernel matrix.
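A quick NumPy illustration (the sizes are arbitrary): the eigenvalues of both Σ = XX^T and G = X^T X come out nonnegative, and which of them are strictly positive depends on the rank of X.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 5
X = rng.normal(size=(d, n))          # n samples of dimension d, as columns

Sigma = X @ X.T                      # d x d outer-product matrix
G = X.T @ X                          # n x n inner-product (Gram) matrix

print(np.linalg.eigvalsh(Sigma))     # all > 0 here, since rank(X) = d < n
print(np.linalg.eigvalsh(G))         # all >= 0 (up to round-off); some ~0, since rank(X) = d < n
```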
………
Problem 1.15 Useful Matrix Inverse Identities
Show that the following identities are true:
(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1},   (1.33)
(A^{-1} + B^{-1})^{-1} = A(A + B)^{-1} B = B(A + B)^{-1} A,   (1.34)
(A^{-1} + B^{-1})^{-1} = A − A(A + B)^{-1} A = B − B(A + B)^{-1} B,   (1.35)
(A^{-1} + U C^{-1} V^T)^{-1} = A − A U (C + V^T A U)^{-1} V^T A.   (1.36)
The last one is called the Matrix Inversion Lemma (or Sherman-Morrison-Woodbury formula). Hint: these can be verified by multiplying each side by an appropriate matrix term.
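The identities can also be verified numerically; a sketch with NumPy on random matrices (symmetric positive-definite wherever an inverse is needed) follows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 4, 2
inv = np.linalg.inv

def spd(m):
    """A random symmetric positive-definite matrix of size m x m."""
    M = rng.normal(size=(m, m))
    return M @ M.T + m * np.eye(m)

A, B, P, R, C = spd(n), spd(n), spd(n), spd(n), spd(k)
Bmat = rng.normal(size=(n, n))     # the "B" of (1.33); need not be symmetric
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))

# (1.33)
print(np.allclose(inv(inv(P) + Bmat.T @ inv(R) @ Bmat) @ Bmat.T @ inv(R),
                  P @ Bmat.T @ inv(Bmat @ P @ Bmat.T + R)))
# (1.34) and (1.35)
lhs = inv(inv(A) + inv(B))
print(np.allclose(lhs, A @ inv(A + B) @ B),
      np.allclose(lhs, A - A @ inv(A + B) @ A))
# (1.36), the Matrix Inversion Lemma
print(np.allclose(inv(inv(A) + U @ inv(C) @ V.T),
                  A - A @ U @ inv(C + V.T @ A @ U) @ V.T @ A))
```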
………
Problem 1.16 Useful Matrix Determinant Identities
Verify that the following identities are true:
|I + A B^T| = |I + B^T A|,   (1.37)
|I + a b^T| = 1 + b^T a,   (1.38)
|A^{-1} + U V^T| = |I + V^T A U| |A^{-1}|.   (1.39)
The last one is called the Matrix Determinant Lemma.
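And a corresponding NumPy spot check of (1.37)–(1.39), again with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 4, 2
A = rng.normal(size=(n, m))
B = rng.normal(size=(n, m))
a = rng.normal(size=n)
b = rng.normal(size=n)
det = np.linalg.det

# (1.37): |I + A B^T| = |I + B^T A|
print(np.isclose(det(np.eye(n) + A @ B.T), det(np.eye(m) + B.T @ A)))
# (1.38): |I + a b^T| = 1 + b^T a
print(np.isclose(det(np.eye(n) + np.outer(a, b)), 1 + b @ a))
# (1.39), the Matrix Determinant Lemma: |A^{-1} + U V^T| = |I + V^T A U| |A^{-1}|
Asq = rng.normal(size=(n, n)) + n * np.eye(n)   # a (generically) invertible square A
U = rng.normal(size=(n, m))
V = rng.normal(size=(n, m))
print(np.isclose(det(np.linalg.inv(Asq) + U @ V.T),
                 det(np.eye(m) + V.T @ Asq @ U) * det(np.linalg.inv(Asq))))
```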
………
Problem 1.17 Singular Value Decomposition (SVD)
The singular value decomposition (SVD) of a real n×m matrix A is a set of three matrices {U, S, V }, such that
A = U S V^T,   (1.40)
where
• U ∈ R^{n×m} is an orthonormal matrix of left-singular vectors (columns of U), i.e., U^T U = I.
• S ∈ R^{m×m} is a diagonal matrix of singular values, i.e., S = diag(s_1, ···, s_m). The singular values are usually ordered s_1 ≥ s_2 ≥ ··· ≥ s_m ≥ 0.
• V ∈ R^{m×m} is an orthonormal matrix of right-singular vectors (columns of V), i.e., V^T V = I.
The SVD has an intuitive interpretation, which shows how the matrix A acts on a vector x ∈ R^m. Consider the matrix-vector product
z = Ax = U S V^T x.   (1.41)
This shows that the matrix A performs three operations on x, namely rotation by V^T, scaling along the axes via S, and another rotation by U. The SVD is also closely related to the eigen-decomposition, matrix inverse, and pseudoinverses.
(a) Show that the singular values of A are the square roots of the eigenvalues of the matrix B = AA^T, and that the left-singular vectors (columns of U) are the associated eigenvectors.
(b) Show that the singular values of A are the square roots of the eigenvalues of the matrix C = A^T A, and that the right-singular vectors (columns of V) are the associated eigenvectors.
(c) Suppose A ∈ R^{n×n} is a square matrix of rank n. Show that the inverse of A can be calculated from the SVD,
A^{-1} = V S^{-1} U^T.   (1.42)
(d) Suppose A ∈ R^{n×m} is a “fat” matrix (n < m) of rank n. The Moore-Penrose pseudoinverse of A is given by
A† = A^T (A A^T)^{-1}.   (1.43)
Likewise, for a “tall” matrix (n > m) of rank m, the pseudoinverse is
A† = (A^T A)^{-1} A^T.   (1.44)
Show that in both cases the pseudoinverse can also be calculated using the SVD,
A† = V S^{-1} U^T.   (1.45)
Note: Properties in (c) and (d) also apply for lower-rank matrices. In these cases, some of the singular values will be 0, and S^{-1} is replaced with S†, where S† replaces each non-zero diagonal element with its reciprocal.
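The relations in this problem can be explored numerically with np.linalg.svd; a sketch follows (assuming NumPy; the matrix sizes are arbitrary). Note that np.linalg.svd returns V^T rather than V.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 5, 3
A = rng.normal(size=(n, m))                        # a "tall" matrix of rank m

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U: n x m, s: (m,), Vt: m x m
S = np.diag(s)
print(np.allclose(A, U @ S @ Vt))                  # A = U S V^T            (1.40)

# (b): squared singular values are the eigenvalues of A^T A
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A))))

# (d): pseudoinverse of a tall matrix, (A^T A)^{-1} A^T, via the SVD   (1.44)-(1.45)
pinv1 = np.linalg.inv(A.T @ A) @ A.T
pinv2 = Vt.T @ np.diag(1.0 / s) @ U.T              # V S^{-1} U^T
print(np.allclose(pinv1, pinv2))

# (c): for a square invertible matrix, the same formula gives the inverse   (1.42)
Asq = rng.normal(size=(n, n)) + n * np.eye(n)
Ui, si, Vit = np.linalg.svd(Asq)
print(np.allclose(np.linalg.inv(Asq), Vit.T @ np.diag(1.0 / si) @ Ui.T))
```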
………