Machine learning lecture slides
COMS 4771 Fall 2020
Multivariate Gaussians and PCA
Outline
I Multivariate Gaussians
I Eigendecompositions and covariance matrices
I Principal component analysis
I Principal component regression and spectral regularization
I Singular value decomposition
I Examples: topic modeling and matrix completion
Multivariate Gaussians: Isotropic Gaussians
I Start with X = (X1, . . . , Xd) ∼ N(0, I), i.e., X1, . . . , Xd are
iid N(0, 1) random variables.
I Probability density function is product of (univariate) Gaussian
densities
I E(Xi) = 0
I var(Xi) = cov(Xi, Xi) = 1, cov(Xi, Xj) = 0 for i ≠ j
I Arrange in mean vector E(X) = 0, covariance matrix
cov(X) = I
Figure 1: Density function for isotropic Gaussian in R^2
Figure 2: Density function level sets for isotropic Gaussian in R^2
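As a quick sanity check of the isotropic case, here is a small numpy sketch: draw many samples with iid N(0, 1) coordinates and confirm that the empirical mean vector and covariance matrix are close to 0 and I. The dimension, sample size, and seed below are arbitrary choices.

```python
# Sketch: X = (X1, ..., Xd) with iid N(0,1) coordinates has mean 0 and covariance I.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3
X = rng.standard_normal((n, d))              # each row is one draw of X ~ N(0, I)

print(np.round(X.mean(axis=0), 3))           # approximately the zero vector
print(np.round(np.cov(X, rowvar=False), 3))  # approximately the identity matrix
```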
Affine transformations of random vectors
I Start with any random vector Z, then apply linear
transformation, followed by translation
I X := MZ + µ, for M ∈ R^{k×d} and µ ∈ R^k
I Fact: E(X) = M E(Z) + µ, cov(X) = M cov(Z) M^T
I E.g., let u ∈ R^d be a unit vector (‖u‖2 = 1), and X := u^T Z
(projection of Z along direction u). Then E(X) = u^T E(Z), and
var(X) = u^T cov(Z) u.
I Note: These transformations work for random vectors with any
distribution, not just Gaussian distributions.
I However, it is convenient to illustrate the effect of these
transformations on Gaussian distributions, since the “shape” of
the Gaussian pdf is easy to understand.
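A small numpy sketch (with an arbitrary choice of M and µ) that checks the two facts above by simulation; the projection example u^T Z is the special case k = 1.

```python
# Sketch: verify E(X) = M E(Z) + mu and cov(X) = M cov(Z) M^T by simulation,
# for X = M Z + mu with Z ~ N(0, I).  M and mu below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200_000, 3, 2
M = np.array([[1.0, 2.0, 0.0],
              [0.5, 0.0, 1.0]])              # k x d
mu = np.array([1.0, -1.0])                   # in R^k

Z = rng.standard_normal((n, d))              # rows ~ N(0, I) in R^d
X = Z @ M.T + mu                             # rows are M z_i + mu

print(np.round(X.mean(axis=0), 2))           # ~ mu, since E(Z) = 0
print(np.round(np.cov(X, rowvar=False), 2))  # ~ M M^T, since cov(Z) = I
print(np.round(M @ M.T, 2))
```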
Multivariate Gaussians: General Gaussians
I If Z ∼ N(0, I) and X = MZ + µ, we have E(X) = µ and
cov(X) = MM^T
I Assume M ∈ R^{d×d} is invertible (else we get a degenerate
Gaussian distribution).
I We say X ∼ N(µ, MM^T)
I Density function given by

p(x) = 1 / ( (2π)^{d/2} |MM^T|^{1/2} ) · exp( −(1/2) (x − µ)^T (MM^T)^{-1} (x − µ) ).

I Note: every non-singular covariance matrix Σ can be written as
MM^T for some non-singular matrix M. (We’ll see why soon.)
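A sketch evaluating the density formula above directly and, assuming scipy is available, comparing it against scipy.stats.multivariate_normal as a sanity check; M, µ, and x are arbitrary choices.

```python
# Sketch: evaluate p(x) for N(mu, Sigma) with Sigma = M M^T using the formula
# above, and compare with scipy's implementation (assumes scipy is installed).
import numpy as np
from scipy.stats import multivariate_normal

M = np.array([[2.0, 0.0],
              [1.0, 1.0]])                   # invertible, so Sigma is non-singular
mu = np.array([1.0, -1.0])
Sigma = M @ M.T

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```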
Figure 3: Density function level sets for anisotropic Gaussian in R^2
Inference with multivariate Gaussians (1)
I Bivariate case: (X1, X2) ∼ N(µ, Σ) in R^2

µ = [ µ1 ]        Σ = [ Σ1,1  Σ1,2 ]
    [ µ2 ] ,          [ Σ2,1  Σ2,2 ]
I What is the distribution of X2?
I N(µ2, Σ2,2)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
Inference with multivariate Gaussians (2)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
I OLS with X1 as input variable and X2 as output variable:
x1 ↦ m̂ x1 + θ̂
where
m̂ = cov(X1, X2) / var(X1) = Σ1,2 / Σ1,1 ,
θ̂ = E(X2) − m̂ E(X1) = µ2 − m̂ µ1.
I Therefore:
E[X2 | X1 = x1] = m̂ x1 + θ̂
               = µ2 + m̂ (x1 − µ1)
               = µ2 + (Σ1,2 / Σ1,1) (x1 − µ1)
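A simulation sketch of Miracle 2 with an arbitrary choice of µ and Σ: an ordinary least-squares fit of X2 on X1 recovers the slope Σ1,2/Σ1,1 and intercept µ2 − m̂ µ1 derived above.

```python
# Sketch: the least-squares line for predicting X2 from X1 has slope
# Sigma_{1,2} / Sigma_{1,1} and intercept mu_2 - slope * mu_1.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])               # an arbitrary valid covariance

X = rng.multivariate_normal(mu, Sigma, size=200_000)
x1, x2 = X[:, 0], X[:, 1]

slope, intercept = np.polyfit(x1, x2, deg=1)      # empirical OLS fit
print(slope, intercept)
print(Sigma[0, 1] / Sigma[0, 0],                  # theoretical slope m-hat
      mu[1] - Sigma[0, 1] / Sigma[0, 0] * mu[0])  # theoretical intercept
```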
Inference with multivariate Gaussians (3)
I What is the distribution of X2 | X1 = x1?
I Miracle 1: it is a Gaussian distribution
I Miracle 2: mean provided by linear prediction of X2 from X1
with smallest MSE
I Miracle 3: variance doesn’t depend on x1
var(X2 | X1 = x1) = E[var(X2 | X1)]               (using Miracle 3: no dependence on x1)
                  = var(X2) − var(E[X2 | X1])     (law of total variance)
                  = Σ2,2 − var(m̂ X1 + θ̂)
                  = Σ2,2 − m̂^2 var(X1)
                  = Σ2,2 − (Σ1,2^2 / Σ1,1^2) Σ1,1
                  = Σ2,2 − Σ1,2^2 / Σ1,1 .
Inference with multivariate Gaussians (4)
I Beyond bivariate Gaussians: same as above, but just writing
things properly using matrix notations
E[X2 | X1 = x1] = µ2 + Σ2,1 Σ1,1^{-1} (x1 − µ1)
cov(X2 | X1 = x1) = Σ2,2 − Σ2,1 Σ1,1^{-1} Σ1,2
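A sketch implementing these block formulas for a general partition; the function name and index-set arguments are my own choices, and the bivariate example at the end checks it against the expressions on the previous slides.

```python
# Sketch: conditional mean and covariance of X2 | X1 = x1 for (X1, X2) ~ N(mu, Sigma),
# where idx1 / idx2 are the coordinate indices of the two blocks.
import numpy as np

def gaussian_conditional(mu, Sigma, idx1, idx2, x1):
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S21 @ np.linalg.inv(S11)              # Sigma_{2,1} Sigma_{1,1}^{-1}
    cond_mean = mu2 + K @ (x1 - mu1)
    cond_cov = S22 - K @ S12
    return cond_mean, cond_cov

# Bivariate check against the previous slides' formulas:
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])
m, C = gaussian_conditional(mu, Sigma, [0], [1], np.array([0.5]))
print(m, C)   # mu_2 + (1.2/2.0)(0.5 - 1.0)  and  1.5 - 1.2^2/2.0
```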
Eigendecomposition (1)
I Every symmetric matrix M ∈ Rd×d has d real eigenvalues,
which we arrange as
λ1 ≥ · · · ≥ λd
I Can choose corresponding orthonormal eigenvectors
v1, . . . , vd ∈ Rd
I This means
Mvi = λi vi
for each i = 1, . . . , d, and
vi^T vj = 1{i=j}
Eigendecomposition (2)
I Arrange v1, . . . , vd in an orthogonal matrix V := [v1| · · · |vd]
I V^T V = I and V V^T = ∑_{i=1}^d vi vi^T = I
I Therefore,
M = M V V^T = ∑_{i=1}^d M vi vi^T = ∑_{i=1}^d λi vi vi^T
I This is our preferred way to express the eigendecomposition
I Also called spectral decomposition
I Can also write M = V Λ V^T, where Λ = diag(λ1, . . . , λd)
I The matrix V diagonalizes M:
V^T M V = Λ
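A numpy sketch of the eigendecomposition using np.linalg.eigh (which assumes a symmetric input and returns eigenvalues in ascending order, so they are reversed here to match the λ1 ≥ · · · ≥ λd convention above).

```python
# Sketch: eigendecomposition of a symmetric matrix M = sum_i lambda_i v_i v_i^T.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = B + B.T                                    # an arbitrary symmetric matrix

lam, V = np.linalg.eigh(M)                     # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                 # reorder: lambda_1 >= ... >= lambda_d

print(np.allclose(V.T @ V, np.eye(4)))         # orthonormal eigenvectors
print(np.allclose(M, V @ np.diag(lam) @ V.T))  # M = V Lambda V^T
print(np.allclose(V.T @ M @ V, np.diag(lam)))  # V diagonalizes M
```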
Covariance matrix (1)
I A ∈ R^{n×d} is the data matrix, with rows scaled by 1/√n:
A = (1/√n) [x1 | · · · | xn]^T
I Σ := A^T A = (1/n) ∑_{i=1}^n xi xi^T is the
(empirical) second-moment matrix
I If (1/n) ∑_{i=1}^n xi = 0 (data are “centered”), this is the
(empirical) covariance matrix
I For purpose of exposition, just say/write “(co)variance” even
though “second-moment” is technically correct
I For any unit vector u ∈ R^d,
u^T Σ u = (1/n) ∑_{i=1}^n (u^T xi)^2
is (empirical) variance of data along direction u
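A sketch of the identity above on synthetic data: with A = (1/√n)[x1 | · · · | xn]^T and Σ = A^T A, the quadratic form u^T Σ u equals the empirical variance of the data along the unit direction u.

```python
# Sketch: empirical covariance via the scaled data matrix, and the variance of
# the data along a (random) unit direction u.  Data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # arbitrary data
X = X - X.mean(axis=0)                    # center, so second moments = covariance

A = X / np.sqrt(n)                        # rows are x_i^T / sqrt(n)
Sigma = A.T @ A                           # = (1/n) sum_i x_i x_i^T

u = rng.standard_normal(d)
u = u / np.linalg.norm(u)                 # random unit direction
print(u @ Sigma @ u)
print(np.mean((X @ u) ** 2))              # (1/n) sum_i (u^T x_i)^2 -- same value
```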
Covariance matrix (2)
I Note: some pixels in OCR data have very little (or zero!)
variation
I These are “coordinate directions” (e.g., u = (1, 0, . . . , 0))
I Probably can/should ignore these!
Figure 4: Which pixels are likely to have very little variance?
Top eigenvector
I Σ is symmetric, so can write eigendecomposition
Σ = ∑_{i=1}^d λi vi vi^T
I In which direction is variance maximized?
I Answer: v1, corresponding to largest eigenvalue λ1
I Called the top eigenvector
I This follows from the following characterization of v1:
v1^T Σ v1 = max_{u ∈ R^d : ‖u‖2 = 1} u^T Σ u = λ1.
Figure 5: What is the direction of the top eigenvector for the covariance of this Gaussian?
Top k eigenvectors
I What about among directions orthogonal to v1?
I Answer: v2, corresponding to second largest eigenvalue λ2
I (Note: all eigenvalues of Σ are non-negative!)
I For any k, Vk := [v1 | · · · | vk] satisfies
∑_{i=1}^k vi^T Σ vi = tr(Vk^T Σ Vk) = max_{U ∈ R^{d×k} : U^T U = I} tr(U^T Σ U) = ∑_{i=1}^k λi
(the top k eigenvectors)
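A sketch of these variational characterizations on a synthetic covariance matrix: v1 attains λ1, the top-k eigenvectors attain λ1 + · · · + λk for the trace objective, and a random orthonormal U does no better.

```python
# Sketch: v_1 maximizes u^T Sigma u over unit vectors, and the top-k eigenvectors
# maximize tr(U^T Sigma U) over matrices with orthonormal columns.
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
B = rng.standard_normal((100, d))
Sigma = B.T @ B / 100                          # a PSD (covariance-like) matrix

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]                 # descending eigenvalues

print(V[:, 0] @ Sigma @ V[:, 0], lam[0])       # v_1^T Sigma v_1 = lambda_1

Vk = V[:, :k]
print(np.trace(Vk.T @ Sigma @ Vk), lam[:k].sum())   # = lambda_1 + ... + lambda_k

U, _ = np.linalg.qr(rng.standard_normal((d, k)))    # random orthonormal columns
print(np.trace(U.T @ Sigma @ U))                    # never exceeds the sum above
```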
Principal component analysis
I k-dimensional principal components analysis (PCA) mapping:
ϕ(x) = (x^T v1, . . . , x^T vk) = Vk^T x ∈ R^k
where Vk = [v1 | · · · | vk] ∈ R^{d×k}
I (Only really makes sense when λk > 0.)
I This is a form of dimensionality reduction when k < d.
Figure 6: Fraction of residual variance from projections of varying dimension k (coordinate projections vs. PCA projections)
Covariance of data upon PCA mapping
I Covariance of data upon PCA mapping:
(1/n) ∑_{i=1}^n ϕ(xi) ϕ(xi)^T = (1/n) ∑_{i=1}^n Vk^T xi xi^T Vk = Vk^T Σ Vk = Λk
where Λk is diagonal matrix with λ1, . . . , λk along diagonal.
I In particular, coordinates in ϕ(x)-representation are
uncorrelated.
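A sketch combining the PCA mapping ϕ(x) = Vk^T x with the claim above: after mapping centered (synthetic) data, the empirical covariance of the new coordinates is diag(λ1, . . . , λk), i.e., the coordinates are uncorrelated.

```python
# Sketch: k-dimensional PCA mapping phi(x) = V_k^T x, and a check that the
# mapped (centered) data have diagonal covariance Lambda_k.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)                         # center the data

Sigma = X.T @ X / n
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
Vk = V[:, :k]                                  # top-k eigenvectors

Phi = X @ Vk                                   # rows are phi(x_i) = V_k^T x_i
print(np.round(Phi.T @ Phi / n, 6))            # ~ diag(lambda_1, ..., lambda_k)
print(np.round(lam[:k], 6))
```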
PCA and linear regression
I Use k-dimensional PCA mapping ϕ(x) = V Tk x with ordinary
least squares
I (Assume rank of A is at least k, so A^T A has λk > 0)
I Data matrix is
(1/√n) [← ϕ(x1)^T →; . . . ; ← ϕ(xn)^T →] = (1/√n) [← x1^T Vk →; . . . ; ← xn^T Vk →] = A Vk ∈ R^{n×k}
I Therefore, OLS solution is
β̂ = (Vk^T A^T A Vk)^{-1} (A Vk)^T b = Λk^{-1} Vk^T A^T b
(Note: here β̂ ∈ R^k.)
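A sketch on synthetic data checking the closed form above: Λk^{-1} Vk^T A^T b agrees with running least squares directly on the mapped data A Vk.

```python
# Sketch: OLS on the k-dimensional PCA features equals Lambda_k^{-1} V_k^T A^T b.
# A and b are synthetic; A carries the 1/sqrt(n) scaling as on the slide.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 8, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
y = rng.standard_normal(n)

A = X / np.sqrt(n)                             # scaled data matrix
b = y / np.sqrt(n)                             # labels scaled the same way

lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]
Vk, Lk = V[:, :k], lam[:k]

beta_closed = (Vk.T @ A.T @ b) / Lk            # Lambda_k^{-1} V_k^T A^T b
beta_lstsq, *_ = np.linalg.lstsq(A @ Vk, b, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))    # True
```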
Principal component regression
I Use β̂ = Λk^{-1} Vk^T A^T b to predict on new x ∈ R^d:
ϕ(x)^T β̂ = (Vk^T x)^T Λk^{-1} Vk^T A^T b = x^T (Vk Λk^{-1} Vk^T)(A^T b)
I So “effective” weight vector (that acts directly on x rather
than ϕ(x)) is given by
ŵ := (Vk Λk^{-1} Vk^T)(A^T b).
I This is called principal component regression (PCR) (here, k is
hyperparameter)
I Alternative hyper-parameterization: λ > 0; same as before but
using the largest k such that λk ≥ λ.
Spectral regularization
I PCR and ridge regression are examples of
spectral regularization.
I For a function g : R → R, write g(M) to mean
g(M) = ∑_{i=1}^d g(λi) vi vi^T
where M has eigendecomposition M = ∑_{i=1}^d λi vi vi^T.
I I.e., g is applied to eigenvalues of M
I Generalizes effect of polynomials: e.g., for g(z) = z^2,
M^2 = (V Λ V^T)(V Λ V^T) = V Λ^2 V^T.
I Claim: Can write each of PCR and ridge regression as
ŵ = g(A^T A) A^T b
for appropriate function g (depending on λ).
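A sketch of applying a scalar function to the eigenvalues of a symmetric matrix, with the polynomial case g(z) = z^2 from the slide as a sanity check.

```python
# Sketch: g(M) = sum_i g(lambda_i) v_i v_i^T for a symmetric matrix M.
# For g(z) = z^2 this reproduces the ordinary matrix square M @ M.
import numpy as np

def apply_spectral(g, M):
    lam, V = np.linalg.eigh(M)                 # M is assumed symmetric
    return V @ np.diag(g(lam)) @ V.T           # V g(Lambda) V^T

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
M = B + B.T

print(np.allclose(apply_spectral(lambda z: z ** 2, M), M @ M))   # True
```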
Comparing ridge regression and PCR
I ŵ = g(A^T A) A^T b
I Ridge regression (with parameter λ): g(z) = 1/(z + λ)
I PCR (with parameter λ): g(z) = 1{z ≥ λ} · (1/z)
I Interpretation:
I PCR uses directions with sufficient variability; ignores the rest
I Ridge artificially inflates the variance in all directions
Figure 7: Comparison of ridge regression and PCR (g(z) as a function of z)
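A sketch of the spectral view ŵ = g(A^T A) A^T b on synthetic data: with g(z) = 1/(z + λ) it reproduces the usual ridge solution (A^T A + λI)^{-1} A^T b, and with g(z) = 1{z ≥ λ}/z it gives the PCR weights.

```python
# Sketch: ridge regression and PCR as spectral regularizers g(A^T A) A^T b.
import numpy as np

def apply_spectral(g, M):
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(g(lam)) @ V.T

rng = np.random.default_rng(0)
n, d = 200, 6
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = rng.standard_normal(n) / np.sqrt(n)
reg = 0.1                                      # the hyperparameter lambda

w_ridge = apply_spectral(lambda z: 1.0 / (z + reg), A.T @ A) @ A.T @ b
w_ridge_direct = np.linalg.solve(A.T @ A + reg * np.eye(d), A.T @ b)
print(np.allclose(w_ridge, w_ridge_direct))    # True

w_pcr = apply_spectral(lambda z: (z >= reg) / np.maximum(z, 1e-12), A.T @ A) @ A.T @ b
print(w_pcr)                                   # uses only directions with lambda_i >= reg
```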
Matrix factorization
I Let A = [← x1^T →; . . . ; ← xn^T →] ∈ R^{n×d} (forget the 1/√n scaling)
I Try to approximate A with BC, where B ∈ R^{n×k} and
C ∈ R^{k×d}, to minimize ‖A − BC‖_F^2.
I Here, ‖ · ‖F is a matrix norm called Frobenius norm, which
treats the n× d matrix as a vector in nd-dimensional Euclidean
space
I Think of B as the encodings of the data in A
I “Dimension reduction” when k < d
I Theorem (Schmidt, 1907; Eckart-Young, 1936): Optimal
solution is given by truncating the
singular value decomposition (SVD) of A
Singular value decomposition
I Every matrix A ∈ Rn×d—say, with rank r—can be written as
A = ∑_{i=1}^r σi ui vi^T
where
I σ1 ≥ · · · ≥ σr > 0 (singular values)
I u1, . . . , ur ∈ Rn (orthonormal left singular vectors)
I v1, . . . , vr ∈ Rd (orthonormal right singular vectors)
I Can also write as
A = U S V^T
where
I U = [u1 | · · · | ur] ∈ R^{n×r}, satisfies U^T U = I
I S = diag(σ1, . . . , σr) ∈ R^{r×r}
I V = [v1 | · · · | vr] ∈ R^{d×r}, satisfies V^T V = I
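A numpy sketch of the SVD: np.linalg.svd with full_matrices=False returns the thin factors above, and the eigenvalues of A^T A are the squared singular values.

```python
# Sketch: thin SVD A = U S V^T, and the link to the eigendecomposition of A^T A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s sorted in decreasing order
print(np.allclose(A, U @ np.diag(s) @ Vt))         # A = U S V^T
print(np.allclose(U.T @ U, np.eye(5)))             # orthonormal left singular vectors
print(np.allclose(Vt @ Vt.T, np.eye(5)))           # orthonormal right singular vectors

lam = np.linalg.eigvalsh(A.T @ A)[::-1]            # eigenvalues of A^T A, descending
print(np.allclose(lam, s ** 2))                    # squared singular values
```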
Truncated SVD
I Let A have SVD A = ∑_{i=1}^r σi ui vi^T (rank of A is r)
I Truncate at rank k (for any k ≤ r): rank-k SVD
Ak := ∑_{i=1}^k σi ui vi^T
I Can write as Ak := Uk Sk Vk^T, where
I Uk = [u1 | · · · | uk] ∈ R^{n×k}, satisfies Uk^T Uk = I
I Sk = diag(σ1, . . . , σk) ∈ R^{k×k}
I Vk = [v1 | · · · | vk] ∈ R^{d×k}, satisfies Vk^T Vk = I
I Theorem (Schmidt/Eckart-Young):
‖A − Ak‖_F^2 = min_{M : rank(M) = k} ‖A − M‖_F^2 = ∑_{i=k+1}^r σi^2
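A sketch checking the Schmidt/Eckart-Young identity numerically: the squared Frobenius error of the rank-k truncation equals the sum of the discarded squared singular values.

```python
# Sketch: rank-k truncated SVD A_k and the error formula
# ||A - A_k||_F^2 = sigma_{k+1}^2 + ... + sigma_r^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 7))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k truncation

err = np.linalg.norm(A - Ak, 'fro') ** 2
print(err, np.sum(s[k:] ** 2))                     # the two numbers agree
```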
Encoder/decoder interpretation (1)
I Encoder: x ↦ ϕ(x) = Vk^T x ∈ R^k
I Encoding rows of A: A Vk = Uk Sk
I Decoder: z ↦ Vk z ∈ R^d
I Decoding rows of Uk Sk: Uk Sk Vk^T = Ak
I Same as k-dimensional PCA mapping!
I A^T A = V S^2 V^T, so eigenvectors of A^T A are right singular
vectors of A, non-zero eigenvalues are squares of the singular
values
I PCA/SVD finds k-dimensional subspace of smallest sum of
squared distances to data points.
Figure 8: Geometric picture of PCA
Encoder/decoder interpretation (2)
I Example: OCR data, compare original image to decoding of
k-dimensional PCA encoding (k ∈ {1, 10, 50, 200})
Figure 9: PCA compression of an MNIST digit (original vs. reconstructions from k-dimensional encodings)
Application: Topic modeling (1)
I Start with n documents, represent using “bag-of-words” count
vectors
I Arrange in matrix A ∈ Rn×d, where d is vocabulary size
           aardvark   abacus   abalone   · · ·
  doc 1        3         0         0     · · ·
  doc 2        7         0         4     · · ·
  doc 3        2         4         0     · · ·
   ...        ...       ...       ...
Application: Topic modeling (2)
I Rank k SVD provides an approximate factorization
A ≈ BC
where B ∈ Rn×k and C ∈ Rk×d
I This use of SVD is called Latent Semantic Analysis (LSA)
I Interpret rows of C as “topics”
I Bi,t is “weight” of document i on topic t
I If rows of C were probability distributions, could interpret
Ct,w as probability that word w appears in topic t
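A toy LSA sketch on a small made-up count matrix (the documents, vocabulary, and counts below are invented for illustration): the rank-2 SVD gives B = Uk Sk and C = Vk^T, with rows of C playing the role of topics.

```python
# Sketch: Latent Semantic Analysis on a tiny, made-up term-count matrix.
# Rows = documents, columns = words; the data and vocabulary are invented.
import numpy as np

# vocabulary: [ball, goal, team, election, vote, senate]
A = np.array([[3, 2, 4, 0, 0, 0],      # sports-flavored doc
              [2, 3, 3, 0, 1, 0],      # sports-flavored doc
              [0, 0, 1, 4, 3, 2],      # politics-flavored doc
              [0, 1, 0, 3, 4, 3]],     # politics-flavored doc
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k]                   # document weights on the k latent "topics"
C = Vt[:k, :]                          # each row of C is a "topic" over words

print(np.round(B, 2))                  # document loadings on the 2 latent directions
print(np.round(C, 2))
print(np.round(B @ C, 1))              # rank-2 approximation of A
```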
Application: Matrix completion (1)
I Start with ratings of movies given by users
I Arrange in a matrix A ∈ Rn×d, where Ai,j is rating given by
user i for movie j.
I Netflix: n = 480000, d = 18000; on average, each user rates
200 movies
I Most entries of A are unknown
I Idea: Approximate A with low-rank matrix, i.e., find
B = [← b1^T →; . . . ; ← bn^T →] ∈ R^{n×k},   C = [c1 | · · · | cd] ∈ R^{k×d}
with goal of minimizing ‖A − BC‖_F^2
I Note: If all entries of A were observed, we could do this with
truncated SVD.
Application: Matrix completion (2)
I Need to find a low-rank approximation without all of A:
(low-rank) matrix completion
I Lots of ways to do this
I Popular way (used in Netflix competition): based on “stochastic
gradient descent” (discussed later)
I Another way: fill in missing entries with plug-in estimates (based
on a statistical model), then compute truncated SVD as usual
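A hedged sketch of the SGD idea mentioned above (not the actual method used in the Netflix competition): repeatedly pick an observed entry (i, j) and take a gradient step on b_i and c_j for the squared error (A_{i,j} − b_i^T c_j)^2. The data, rank, step size, and number of passes below are all illustrative choices.

```python
# Sketch: low-rank matrix completion by stochastic gradient descent on the
# observed entries only.  All data and hyperparameters here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 20, 3

# ground-truth low-rank matrix and a random set Omega of observed entries
A_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
mask = rng.random((n, d)) < 0.5                # ~50% of entries observed
obs = np.argwhere(mask)                        # observed (i, j) pairs

B = 0.1 * rng.standard_normal((n, k))          # factors to be learned
C = 0.1 * rng.standard_normal((k, d))
lr = 0.02

for epoch in range(200):
    rng.shuffle(obs)                           # visit observed entries in random order
    for i, j in obs:
        b_i = B[i].copy()                      # cache the current row of B
        err = A_true[i, j] - b_i @ C[:, j]     # residual on one observed entry
        B[i]    += lr * err * C[:, j]          # gradient step on b_i
        C[:, j] += lr * err * b_i              # gradient step on c_j

rmse = np.sqrt(np.mean((A_true[~mask] - (B @ C)[~mask]) ** 2))
print("RMSE on unobserved entries:", round(rmse, 3))
```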
Feature representations from matrix completion
I MovieLens data set (n = 6040 users, d = 3952 movies,
|Ω| = 800000 ratings)
I Fit B and C by using a standard matrix completion method
(based on SGD, discussed later)
I Are c1, . . . , cd ∈ Rk useful feature vectors for movies?
I Some nearest-neighbor pairs (cj ,NN(cj)):
I Toy Story (1995), Toy Story 2 (1999)
I Sense and Sensibility (1995), Emma (1996)
I Heat (1995), Carlito’s Way (1993)
I The Crow (1994), Blade (1998)
I Forrest Gump (1994), Dances with Wolves (1990)
I Mrs. Doubtfire (1993), The Bodyguard (1992)