COMP3223 Foundations of ML: Dimensionality reduction by
Principal Components Analysis (PCA)
Srinandan Dasmahapatra
Are more complex models better?
With a large number of features $p$, accidental correlations can obscure genuine patterns.
In minimising (say) the SSR you may be trying to reduce residuals to “noise” at the expense of “signal”. E.g., for the linear model $y_n = f(x_n; w) := \sum_{i=1}^{p} w_i x_{ni}$,
$$\sum_{n=1}^{N} (r_n)^2 = \sum_{n} \big(y_n - f(x_n; w)\big)^2, \qquad r_n = y_n - (w_0 + w_1 x_{n1} + w_2 x_{n2} + \cdots + w_p x_{np}).$$
In trying to minimise the residuals, you may be adjusting coefficients $w_k$ that correspond to features irrelevant to the actual signal.
This is the $p > N$ problem.
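A minimal numerical sketch (not from the slides, synthetic data) of the $p > N$ problem: with more features than samples, least squares can push the training residuals to essentially zero by loading weight onto irrelevant features, and the fit does not generalise.

```python
# Illustrative sketch of the p > N problem (synthetic data): least squares
# drives training residuals to ~0 by fitting noise through irrelevant features.
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 50                      # fewer samples than features
X = rng.normal(size=(N, p))
w_true = np.zeros(p)
w_true[0] = 2.0                    # only the first feature carries signal
y = X @ w_true + 0.1 * rng.normal(size=N)

# Minimum-norm least-squares fit (lstsq handles the rank-deficient case)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("training SSR:", np.sum((y - X @ w_hat) ** 2))                 # essentially zero
print("weight on irrelevant features:", np.linalg.norm(w_hat[1:]))   # far from zero

# Fresh data exposes the overfit
X_test = rng.normal(size=(1000, p))
y_test = X_test @ w_true + 0.1 * rng.normal(size=1000)
print("test SSR per sample:", np.mean((y_test - X_test @ w_hat) ** 2))
```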
Revision: linear regression
Linear model: $y = Xw$, with $X$ the design matrix and $w = (X^T X)^{-1} X^T y$. The intercept $w_0$ is the difference between the average output and the weighted average of the inputs:
$$0 = \sum_{n=1}^{N} r_n = \sum_n y_n - N w_0 - \sum_n \big(w_1 x_{n1} + \cdots + w_p x_{np}\big) \;\Longrightarrow\; w_0 = \langle y \rangle - \sum_{i=1}^{p} w_i \langle x_i \rangle.$$
Centering: set $\tilde{y}_n = y_n - \langle y \rangle$, $\tilde{x}_{ni} = x_{ni} - \langle x_i \rangle$ and drop the column of ones from the design matrix to form $\tilde{X}$ with matrix elements $(\tilde{X})_{nj} = x_{nj} - \langle x_j \rangle$.
Then $\frac{1}{N}\tilde{X}^T(\tilde{y} - \tilde{X}w) = 0$ implies $w = \big(\tfrac{1}{N}\tilde{X}^T\tilde{X}\big)^{-1}\big(\tfrac{1}{N}\tilde{X}^T\tilde{y}\big)$, so $w = [\operatorname{cov}(X)]^{-1}[\operatorname{cov}(X, y)]$.
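The centred-regression identities above can be checked numerically; the sketch below (synthetic data, not from the slides) verifies $w = \operatorname{cov}(X)^{-1}\operatorname{cov}(X,y)$ and $w_0 = \langle y\rangle - \sum_i w_i \langle x_i\rangle$ against an ordinary least-squares fit with an explicit intercept column.

```python
# Numerical check of the centred-regression identities (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
X = rng.normal(size=(N, p))
y = 1.5 + X @ np.array([0.5, -2.0, 1.0]) + 0.05 * rng.normal(size=N)

Xc = X - X.mean(axis=0)            # centred features (the tilde-X of the slide)
yc = y - y.mean()                  # centred targets

cov_XX = (Xc.T @ Xc) / N           # (1/N) X~^T X~
cov_Xy = (Xc.T @ yc) / N           # (1/N) X~^T y~
w = np.linalg.solve(cov_XX, cov_Xy)
w0 = y.mean() - w @ X.mean(axis=0)

# Compare with the usual least-squares fit that includes a column of ones
X1 = np.column_stack([np.ones(N), X])
w_full, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.allclose(np.r_[w0, w], w_full))   # True
```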
Covariance as trace suggests reduction of matrix size
As before, $\tilde{y}_n = y_n - \langle y \rangle$, $\tilde{x}_{ni} = x_{ni} - \langle x_i \rangle$, and the column of ones is dropped from the design matrix to form $\tilde{X}$ with matrix elements $(\tilde{X})_{nj} = x_{nj} - \langle x_j \rangle$.
An $(N \times N)$ and a $(p \times p)$ matrix can be formed out of $\tilde{X}$:
$$\big(\tilde{X}\tilde{X}^T\big)_{mn} = \sum_{j=1}^{p} (\tilde{X})_{mj}(\tilde{X}^T)_{jn} = \sum_{j=1}^{p} (x_{mj} - \langle x_j \rangle)(x_{nj} - \langle x_j \rangle),$$
$$\big(\tilde{X}^T\tilde{X}\big)_{ij} = \sum_{n=1}^{N} (\tilde{X}^T)_{in}(\tilde{X})_{nj} = \sum_{n=1}^{N} (x_{ni} - \langle x_i \rangle)(x_{nj} - \langle x_j \rangle).$$
The trace of $\tilde{X}\tilde{X}^T$ equals the trace of $\tilde{X}^T\tilde{X}$:
$$\operatorname{tr}\big(\tilde{X}\tilde{X}^T\big) = \sum_{n=1}^{N}\sum_{j=1}^{p} (\tilde{X})_{nj}(\tilde{X}^T)_{jn} = \sum_{j=1}^{p}\sum_{n=1}^{N} (\tilde{X}^T)_{jn}(\tilde{X})_{nj} = \operatorname{tr}\big(\tilde{X}^T\tilde{X}\big).$$
Work with the $p$-dimensional matrix, not the $N$-dimensional one.
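A quick check (synthetic data) that the two traces agree, so the total variance can be read off the small $(p \times p)$ matrix instead of the large $(N \times N)$ one:

```python
# tr(X~ X~^T) = tr(X~^T X~): the small (p x p) matrix carries the total variance.
import numpy as np

rng = np.random.default_rng(2)
N, p = 500, 4
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)

big = Xc @ Xc.T       # (N x N)
small = Xc.T @ Xc     # (p x p)
print(np.isclose(np.trace(big), np.trace(small)))   # True
print(big.shape, small.shape)                        # (500, 500) (4, 4)
```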
Use of SVD in low rank approximation gives PCA
The total variance of the data $X$ is the sum of the eigenvalues of $\frac{1}{N}\tilde{X}^T\tilde{X}$, i.e. $\frac{1}{N}\operatorname{tr}(\tilde{X}^T\tilde{X})$.
If $\tilde{X} = USV^T$, then $y = USV^T w$ and $w = V S^{-1} U^T y$, where $S = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min(p,N)})$ and $U$, $V$ contain singular vectors of size $N$ and $p$ respectively:
$$w = \sum_{k=1}^{\min(p,N)} \frac{u_k^T y}{\sigma_k}\, v_k.$$
The total variance is $\frac{1}{N}\operatorname{tr}(\tilde{X}^T\tilde{X}) = \frac{1}{N}\sum_{k=1}^{\min(p,N)} \sigma_k^2$.
Discard small values of $\sigma_k$ and keep the largest $r$ singular vectors. The fraction of variation in the data explained by $r$ components is
$$\Big(\sum_{k=1}^{r} \sigma_k^2\Big)\Big/\Big(\sum_{k=1}^{\min(p,N)} \sigma_k^2\Big).$$
$v_1, \ldots, v_r$ are the principal components (of variation).
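The sketch below (synthetic data, illustrative only) computes these SVD quantities with NumPy: the total variance as $\frac{1}{N}\sum_k \sigma_k^2$, the explained-variance fraction for $r$ components, and the regression weights $w = \sum_k (u_k^T y / \sigma_k)\, v_k$.

```python
# PCA quantities from the SVD of the centred data matrix (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
N, p = 300, 6
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # correlated features
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)        # Xc = U S V^T

# Total variance: (1/N) tr(Xc^T Xc) = (1/N) sum_k sigma_k^2
total_var = np.trace(Xc.T @ Xc) / N
print(np.isclose(total_var, np.sum(s**2) / N))           # True

# Fraction of variance explained by the leading r components
r = 2
print("explained fraction:", np.sum(s[:r]**2) / np.sum(s**2))

# Regression weights from the SVD: w = sum_k (u_k^T y / sigma_k) v_k
y = Xc @ rng.normal(size=p) + 0.01 * rng.normal(size=N)
w_svd = Vt.T @ ((U.T @ y) / s)
w_lsq, *_ = np.linalg.lstsq(Xc, y, rcond=None)
print(np.allclose(w_svd, w_lsq))                         # True
```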
SVD is a low rank approximation to any matrix
A single image as a matrix $M$, here $M \in \mathbb{R}^{1200 \times 1200}$.
SVD: $M = U\Sigma V^T$.
Rank-$r$ reconstruction, shown for $r = 1, \ldots, 10$:
$$M_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T.$$
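The slide's $1200 \times 1200$ image is not included here, so the sketch below uses a synthetic nearly low-rank matrix as a stand-in for $M$ and reports the relative error of the rank-$r$ reconstruction for $r = 1, \ldots, 10$.

```python
# Rank-r reconstruction via SVD; a synthetic matrix stands in for the image M.
import numpy as np

rng = np.random.default_rng(4)
n = 200
M = rng.normal(size=(n, 5)) @ rng.normal(size=(5, n)) + 0.01 * rng.normal(size=(n, n))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

for r in range(1, 11):
    M_r = (U[:, :r] * s[:r]) @ Vt[:r]        # sum_{i<=r} sigma_i u_i v_i^T
    err = np.linalg.norm(M - M_r) / np.linalg.norm(M)
    print(f"r={r:2d}  relative Frobenius error = {err:.4f}")
```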
First principal component captures greatest variation
Data (mean subtracted) $x_n \in \mathbb{R}^p$, $n = 1, \ldots, N$, arranged in data matrix $X$. Linear combinations of the data features, $z_n = \sum_{j=1}^{p} w_j x_{nj}$, are written as $Xw$, and $\operatorname{var}(Xw) = \langle w^T X^T X w \rangle = w^T S w$ where $S = \operatorname{var}(X)$.
Seek the components of the linear combination $w$ that maximise the variance $w^T S w$ of the combined data points $z_n$, $n = 1, \ldots, N$, with $w$ normalised:
$$\operatorname*{argmax}_{w}\,\{\, w^T S w - \lambda (w^T w - 1) \,\} \;\Longrightarrow\; S w = \lambda w.$$
Order the eigenvalues $(\lambda_1, \ldots, \lambda_p)$; the corresponding eigenvector $w_1$ is the first principal component.
The linear combinations $z_{n1} = w_{1,1} x_{n1} + w_{1,2} x_{n2} + \cdots + w_{1,p} x_{np}$ constitute the representation of the data in terms of the first principal component. Instead of $p$ components $(x_{n1}, \ldots, x_{np})$, one component $z_{n1}$ represents $x_n$.
The variance of $\{z_{n1}\}$ is $\lambda_1$.
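A numerical illustration (synthetic data): the leading eigenvector of the sample covariance $S$ gives scores whose variance equals $\lambda_1$, and no other unit vector does better.

```python
# First principal component as the leading eigenvector of the sample covariance.
import numpy as np

rng = np.random.default_rng(5)
N, p = 1000, 4
X = rng.normal(size=(N, p)) @ np.diag([3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)

S = (Xc.T @ Xc) / N                      # sample covariance
lam, W = np.linalg.eigh(S)               # eigh returns ascending eigenvalues
lam, W = lam[::-1], W[:, ::-1]           # reorder to descending

w1 = W[:, 0]                             # first principal component direction
z1 = Xc @ w1                             # scores along w1
print(np.isclose(z1.var(), lam[0]))      # variance of the scores equals lambda_1

# Any other unit vector gives no more variance
u = rng.normal(size=p); u /= np.linalg.norm(u)
print((Xc @ u).var() <= lam[0] + 1e-12)  # True
```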
PCA: principal components from variance decomposition
The other eigenvectors $w_k$ (from $S w_k = \lambda_k w_k$) represent combinations $z_{nk} = w_{k,1} x_{n1} + w_{k,2} x_{n2} + \cdots + w_{k,p} x_{np}$ that are decorrelated from $z_1$: $\operatorname{cov}(z_{ni}, z_{nj}) = 0$ for $i \neq j$.
The largest $r$ eigenvalues give new “co-ordinates” $(z_{n1}, \ldots, z_{nr})$ of the data and account for a fraction of the variation
$$\Big(\sum_{k=1}^{r} \lambda_k\Big)\Big/\Big(\sum_{k=1}^{p} \lambda_k\Big).$$
$w_1, \ldots, w_r$ are the principal components (of variation).
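The sketch below (synthetic data) checks that the PC scores are mutually decorrelated and computes the explained-variance fraction from the eigenvalues.

```python
# PC scores are decorrelated; leading eigenvalues give the explained fraction.
import numpy as np

rng = np.random.default_rng(6)
N, p = 2000, 5
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

S = (Xc.T @ Xc) / N
lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]

Z = Xc @ W                                   # all PC scores
C = (Z.T @ Z) / N                            # covariance of the scores
print(np.allclose(C, np.diag(lam)))          # off-diagonals ~ 0: decorrelated

r = 2
print("fraction explained:", lam[:r].sum() / lam.sum())
```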
Scale independence
To make PCA independent of the scale of the data $x_n$, the features may first have to be standardised so that all entries $x_{nk}$ have comparable range, e.g. by transforming each feature to unit variance.
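A minimal sketch of this standardisation step, assuming the usual per-feature rescaling to unit variance:

```python
# Standardise each feature before PCA so the result does not depend on units.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 0.01])   # wildly different scales

Xc = X - X.mean(axis=0)
Xs = Xc / Xc.std(axis=0)          # unit-variance columns

# PCA on the standardised data is PCA of the correlation matrix
R = (Xs.T @ Xs) / X.shape[0]
print(np.allclose(np.diag(R), 1.0))   # True
```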
Eigenfaces as features
We will express arbitrary data vectors as linear combinations $v_n = \sum_{i=1}^{r} \alpha_{ni} w_i$ of a set of eigenvectors $\{w_i\}$.
These are obtained from a certain data-dependent matrix. Often the contributions of most of the $w_i$ are small (except for a small set of ‘important’ eigenvectors). In such a case we can represent a high-dimensional data vector $v_n$ effectively by a much smaller (and more robust) set of new ‘coordinates’ $\alpha_{nj}$.
For a face recognition problem the vector $v_n$ and the dominant eigenfaces may look like this:
Mean and eigenfaces
[Figure: mean face, eigenface 1, eigenface 2, eigenface 30. Higher eigenfaces show only some random structure.]
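A sketch of the eigenface computation; the face dataset behind the figures is not provided, so random images stand in, and the eigenfaces are taken as the leading right singular vectors of the centred image matrix.

```python
# Eigenface sketch: flatten images to vectors, centre, take leading singular vectors.
# Random images stand in for the face dataset from the slides.
import numpy as np

rng = np.random.default_rng(8)
N, h, w = 100, 32, 32                      # 100 hypothetical 32x32 face images
faces = rng.normal(size=(N, h * w))        # rows: flattened images

mean_face = faces.mean(axis=0)
Fc = faces - mean_face

U, s, Vt = np.linalg.svd(Fc, full_matrices=False)
r = 30
eigenfaces = Vt[:r]                        # each row reshapes to an (h, w) image

# New coordinates: each face as r coefficients instead of h*w pixels
alphas = Fc @ eigenfaces.T                 # shape (N, r)
recon = mean_face + alphas @ eigenfaces    # rank-r reconstruction
print(alphas.shape, np.linalg.norm(faces - recon) / np.linalg.norm(faces))
```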
PCA as low-rank regression: input as target output
In regression, $y = Xw$.
Take the output $y$ to be the matrix of the $x_n$, $n = 1, \ldots, N$, themselves. Instead of a vector $w$ we have a matrix $W$ (one column for each output dimension):
$$W = (X^T X)^{-1} X^T X = I,$$
and the “predicted” output is $XW = X$.
Choose a set of $q$ orthonormal vectors $\{v_i\}$ whose linear combinations $z_n = \sum_{i=1}^{q} \alpha_{ni} v_i$ minimise $\|x_n - z_n\|^2$.
If $q < p$, this is a dimensional reduction. Instead of the basis vectors $e_k = (0, \cdots, 0, 1, 0, \ldots, 0)^T$, $k = 1, \ldots, p$, use $v_i$, $i = 1, \ldots, q$. The co-ordinates are the $\alpha_{ni}$.
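The sketch below (synthetic data) mirrors this: regressing $X$ on itself returns $W = I$, and restricting the reconstruction to the span of the $q$ leading right singular vectors gives the minimised residual.

```python
# "Input as target output": W = I for the full regression; q orthonormal
# directions give the PCA reconstruction (synthetic data).
import numpy as np

rng = np.random.default_rng(9)
N, p, q = 400, 5, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

# Full-rank "regression of X on X": W = (X^T X)^{-1} X^T X = I
W = np.linalg.solve(Xc.T @ Xc, Xc.T @ Xc)
print(np.allclose(W, np.eye(p)))            # True

# Best q orthonormal directions: leading right singular vectors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Vq = Vt[:q].T                               # (p, q), orthonormal columns
Z = Xc @ Vq @ Vq.T                          # z_n = sum_i alpha_{ni} v_i
print("mean squared residual:", np.mean(np.sum((Xc - Z) ** 2, axis=1)))
```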
Exercise
Minimise
$$\sum_{n} \left\| \begin{pmatrix} x^{(n)}_1 \\ x^{(n)}_2 \end{pmatrix} - \begin{pmatrix} v_{1;1} & v_{2;1} \\ v_{1;2} & v_{2;2} \end{pmatrix} \begin{pmatrix} \alpha^{(n)}_1 \\ \alpha^{(n)}_2 \end{pmatrix} \right\|^2 \tag{1}$$
with respect to $v_1$, $v_2$ and the coefficients $\alpha^{(n)}$ (in $v_{j;i}$, $j$ is the label for the vector and $i$ the row index), and show that at the minimum, for orthonormal $v_1$, $v_2$,
$$\begin{pmatrix} \alpha^{(n)}_1 \\ \alpha^{(n)}_2 \end{pmatrix} = \begin{pmatrix} v_{1;1} & v_{2;1} \\ v_{1;2} & v_{2;2} \end{pmatrix}^{T} \begin{pmatrix} x^{(n)}_1 \\ x^{(n)}_2 \end{pmatrix}. \tag{2}$$
Extend to the $q$-dimensional representation of $x_n \in \mathbb{R}^p$,
$$\begin{pmatrix} \alpha^{(n)}_1 \\ \vdots \\ \alpha^{(n)}_q \end{pmatrix} = \begin{pmatrix} v_{1;1} & v_{2;1} & \cdots & v_{q;1} \\ \vdots & \vdots & & \vdots \\ v_{1;p} & v_{2;p} & \cdots & v_{q;p} \end{pmatrix}^{T} \begin{pmatrix} x^{(n)}_1 \\ \vdots \\ x^{(n)}_p \end{pmatrix},$$
where the columns of the matrix $V_q$ are orthogonal vectors.
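A numerical check of the exercise's claim (illustrative, using a random orthonormal $V$): the coefficients minimising $\|x - V\alpha\|^2$ are $\alpha = V^T x$.

```python
# For orthonormal columns in V, the minimiser of ||x - V a||^2 is a = V^T x.
import numpy as np

rng = np.random.default_rng(10)
p, q = 6, 2
A = rng.normal(size=(p, q))
V, _ = np.linalg.qr(A)                  # orthonormal columns v_1, v_2

x = rng.normal(size=p)
a_closed = V.T @ x                      # claimed minimiser
a_lstsq, *_ = np.linalg.lstsq(V, x, rcond=None)   # direct least squares
print(np.allclose(a_closed, a_lstsq))   # True
```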
PCA: Low rank projection
The $q$-dimensional representation of $x_n \in \mathbb{R}^p$, with orthonormal vectors as the columns of $V_q$:
$$\begin{pmatrix} \alpha^{(n)}_1 \\ \vdots \\ \alpha^{(n)}_q \end{pmatrix} = V_q^T \begin{pmatrix} x^{(n)}_1 \\ \vdots \\ x^{(n)}_p \end{pmatrix}.$$
Since $(V_q V_q^T)(V_q V_q^T) = V_q V_q^T$, the matrix $P_q \triangleq V_q V_q^T$ is a projection onto the $q$-dimensional subspace spanned by the columns of $V_q$.
With the data vectors $x_n$ as the columns of $X$ here, $Y = (X)_q = V_q^T X$ is the matrix of the $N$ $q$-dimensional vectors $z_n \in \mathbb{R}^q$.
Among all rank-$q$ matrices $A$, the quantity $\|X - A\|^2$ is minimised by $A = V_q V_q^T X$, with $V_q$ containing the $q$ leading vectors of $V$ (in order of the singular values $\sigma_i$ in $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min(p,N)})$) in $X = U\Sigma V^T$.
$\|X - V_q V_q^T X\|^2$ is the minimum residual.
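The sketch below (synthetic data, NumPy row convention, so the projected data is $X V_q V_q^T$) verifies that $P_q$ is idempotent and that the SVD truncation beats an arbitrary rank-$q$ approximation.

```python
# P_q = V_q V_q^T is idempotent; the SVD truncation gives the smallest residual
# among rank-q approximations (data points as rows of X in this sketch).
import numpy as np

rng = np.random.default_rng(11)
N, p, q = 300, 8, 3
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Vq = Vt[:q].T
Pq = Vq @ Vq.T
print(np.allclose(Pq @ Pq, Pq))                      # idempotent: a projection

best = Xc @ Pq                                       # Eckart-Young optimum (rank q)
# Compare with some other rank-q matrix of the same shape
other = rng.normal(size=(N, q)) @ rng.normal(size=(q, p))
print(np.linalg.norm(Xc - best) < np.linalg.norm(Xc - other))   # True
```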
Compare PCA pre-processed regression with regularisation
Regularisation using the 2-norm of $w$ involves minimising the loss function $\ell(w)$:
$$w_{\text{ridge}} = \operatorname*{argmin}_{w} \|y - Xw\|^2 + \lambda \|w\|^2 = (X^T X + \lambda I)^{-1} X^T y$$
$$\Longrightarrow\; w_{\text{ridge}} = V (S^2 + \lambda I)^{-1} S\, U^T y = \sum_{k=1}^{\min(p,N)} (u_k^T y)\, \frac{\sigma_k}{\sigma_k^2 + \lambda}\, v_k.$$
PCA drops all terms with small variance contributions; ridge regression performs a “soft” re-weighting.
Lasso: a regularisation method that drops components,
$$w_{\text{lasso}} = \operatorname*{argmin}_{w} \|y - Xw\|^2 + \lambda \sum_i |w_i|.$$
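The comparison can be made concrete (synthetic data): ridge shrinks each SVD component by $\sigma_k/(\sigma_k^2+\lambda)$, while principal-components regression keeps only the leading $r$ components and drops the rest.

```python
# Ridge via the SVD vs PCA-truncated (principal components) regression.
import numpy as np

rng = np.random.default_rng(12)
N, p = 200, 8
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)
y = Xc @ rng.normal(size=p) + 0.1 * rng.normal(size=N)
y = y - y.mean()

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
uy = U.T @ y
lam = 1.0

# Ridge: w = sum_k (u_k^T y) * sigma_k / (sigma_k^2 + lambda) * v_k
w_ridge = Vt.T @ (uy * s / (s**2 + lam))
w_direct = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ y)
print(np.allclose(w_ridge, w_direct))      # True

# PCA-truncated regression: keep only the leading r components
r = 3
w_pcr = Vt[:r].T @ (uy[:r] / s[:r])
print("ridge vs PCR weights:", np.round(w_ridge, 3), np.round(w_pcr, 3))
```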