Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra, Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Part IX
Principal Component Analysis
Motivation
Eigenvectors
Singular Value Decomposition
Principal Component Analysis
Principal Component Analysis (PCA)
Independent Component Analysis
Motivation: Pre-training Deep Neural Networks
Empirical observations – pre 2006:
Deep architectures get stuck in local minima or plateaus
As architecture gets deeper, more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: Unsupervised pre-training of each layer; deeper models possible
Usually based on auto-encoders (tomorrow’s lecture)
Similar in spirit to PCA (today’s lecture)
Motivation: Exploratory Analysis
Given a dataset of numerical features:
Low-dimensional data may be easy to plot
High-dimensional data is challenging
Dimensionality reduction (e.g. PCA):
Try to explain with fewer dimensions
Enables visualisation
The new basis may yield insights
Aside: can simplify/speed up subsequent analysis e.g. regression
Supervised Learning
Given are pairs of data xi ∈ X and targets ti ∈ T in the form (xi, ti), where i = 1, …, N.
Learn a mapping between the data X and the target t which generalises well to new data.
Unsupervised Learning
Given only the data xi ∈ X.
Discover (=learn) some interesting structure inherent in the data X.
Testing – Supervised versus Unsupervised Learning
Recall: Fisher’s Linear Discriminant
Samples from two classes in a two-dimensional input space and their histogram when projected to two different one-dimensional spaces.
Eigenvectors
Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs (λ, x) satisfying
Ax = λx
where x ∈ C^n and λ ∈ C.
Example:
A = [0 1; −1 0],  λ = {−ı, ı},  with eigenvectors x = (ı, 1)^T and (−ı, 1)^T
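A quick numerical check of this example (my own NumPy sketch, not part of the lecture): np.linalg.eig recovers the complex eigenvalues ±ı.

```python
import numpy as np

# The 2x2 matrix from the example above; its eigenvalues are purely imaginary.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors
print(eigvals)                        # approximately [0.+1.j, 0.-1.j]

# Verify A x = lambda x for each eigenpair.
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)
```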
Eigenvectors
How many eigenvalue/eigenvector pairs?
Ax = λx is equivalent to (A − λI)x = 0,
which has a non-trivial solution only for det(A − λI) = 0,
a polynomial of order n in λ; hence at most n distinct solutions.
Real Eigenvalues
How can we enforce real eigenvalues?
Let’s look at matrices with complex entries A ∈ C^{n×n}. Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.
[1+ı2  3+ı4; 5+ı6  7+ı8]^H = [1−ı2  5−ı6; 3−ı4  7−ı8]
Denote the complex conjugate of a complex number λ by λ̄.
Real Eigenvalues
How can we enforce real eigenvalues? Let’s assume A ∈ C^{n×n} is Hermitian (A^H = A). Calculate
x^H A x = λ x^H x
for an eigenvector x ∈ C^n of A. Another way to calculate x^H A x:
x^H A x = x^H A^H x      (A is Hermitian)
        = (x^H A x)^H    (reverse the order)
        = (λ x^H x)^H    (eigenvalue equation)
        = λ̄ x^H x
and therefore λ̄ = λ (λ is real).
If A is Hermitian, then all eigenvalues are real.
Special case: If A has only real entries and is symmetric, then all eigenvalues are real.
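A small NumPy illustration of this result (my own sketch, not from the slides): Hermitising a random matrix forces the spectrum to be numerically real, and np.linalg.eigvalsh exploits this structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random Hermitian matrix A = B + B^H.
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B + B.conj().T

# General eigenvalue routine: imaginary parts vanish up to round-off.
lam = np.linalg.eigvals(A)
assert np.allclose(lam.imag, 0.0)

# eigvalsh is specialised for Hermitian/symmetric matrices and returns real values.
lam_real = np.linalg.eigvalsh(A)
print(np.sort(lam.real), lam_real)   # same spectrum (eigvalsh sorts ascending)
```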
Singular Value Decomposition
Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices
A = U Σ V^T
where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has non-negative numbers on the diagonal.
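A minimal NumPy sketch of this decomposition (the matrix here is made-up example data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))                    # A in R^{n x p}, here n=5, p=3

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # U: 5x5, s: singular values, Vt: 3x3

# Rebuild the n x p diagonal matrix Sigma and check the reconstruction A = U Sigma V^T.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)
assert np.all(s >= 0)                              # non-negative diagonal entries
```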
Linear Curve Fitting – Least Squares
N = 10 data points (t plotted against x):
x ≡ (x1, …, xN)^T,  t ≡ (t1, …, tN)^T
xi ∈ R, ti ∈ R, i = 1, …, N
Model: y(x, w) = w1 x + w0, with design matrix X ≡ [x 1]
Least-squares solution: w* = (X^T X)^{-1} X^T t
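A short sketch of the closed-form fit on synthetic data (the data and coefficients below are made up for illustration; np.linalg.lstsq is the numerically safer alternative to forming (X^T X)^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
x = np.linspace(0.0, 10.0, N)
t = 0.7 * x + 1.5 + 0.5 * rng.standard_normal(N)   # noisy line, made-up coefficients

X = np.column_stack([x, np.ones(N)])               # design matrix [x 1]

# Normal equations, as on the slide: w* = (X^T X)^{-1} X^T t
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Preferred in practice: a least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_normal, w_lstsq)                           # w = (w1, w0), essentially identical
```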
Inverse of a matrix
Assume a full-rank, symmetric, real matrix A. Then A = U^T Λ U, where
Λ is a diagonal matrix with the (real) eigenvalues
U contains the eigenvectors
A^{-1} = (U^T Λ U)^{-1}
       = U^{-1} Λ^{-1} U^{-T}    (inversion reverses the order)
       = U^T Λ^{-1} U            (since U^T U = I)
The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its elements.
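A numerical check of this identity (a sketch with a random symmetric full-rank matrix; note that np.linalg.eigh returns A = V diag(λ) V^T with eigenvectors in the columns of V, so the slide’s U corresponds to V^T here):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)                # symmetric and full rank (positive definite)

lam, V = np.linalg.eigh(A)             # A = V diag(lam) V^T,  V^T V = I
A_inv = V @ np.diag(1.0 / lam) @ V.T   # invert the diagonal elementwise

assert np.allclose(A_inv, np.linalg.inv(A))
```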
Dimensionality Reduction
Main goal of Principal Component Analysis:
dimensionality reduction
Many applications in visualisation, feature extraction, signal processing, data compression . . .
Example: Use hand-written digits (binary data) and place them into a larger frame (100 × 100) varying the position and the rotation angle.
Dimensionality of the data space = 10 000.
But data live on a three-dimensional manifold (x, y, and the rotation angle).
FYI only: this manifold is not linear and requires bleeding edge models like capsule networks (Hinton 2017); still we can locally approximate with PCA.
Principal Component Analysis (PCA)
Idea: Linearly project the data points onto a lower dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower-dimensional subspace, called the principal subspace.
Principal Component Analysis (PCA)
Given N observations xn ∈ RD, n = 1,…,N.
Project onto a space with dimensionality M < D while
maximising the variance.
More advanced: how to determine M from the data. Here, M is fixed.
Consider a 1-dimensional subspace spanned by some unit vector u1 ∈ R^D, with u1^T u1 = 1.
PCA - Maximise Variance
Each data point xn is then projected onto a scalar value u1^T xn.
The mean of the projected data is u1^T x̄, where x̄ is the sample mean
x̄ = (1/N) Σ_{n=1}^N xn
The variance of the projected data is then
(1/N) Σ_{n=1}^N (u1^T xn − u1^T x̄)^2 = u1^T S u1
with the covariance matrix
S = (1/N) Σ_{n=1}^N (xn − x̄)(xn − x̄)^T
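A quick sanity check of this identity on random data (my own sketch; the covariance uses the 1/N convention of the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])   # N=200 points in R^3

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)      # covariance with 1/N normalisation

u1 = np.array([1.0, 2.0, -1.0])
u1 /= np.linalg.norm(u1)                    # unit vector

proj = X @ u1                               # scalar projections u1^T x_n
assert np.isclose(proj.var(), u1 @ S @ u1)  # variance of projections equals u1^T S u1
```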
PCA - Maximise Variance
Maximising u1^T S u1 under the constraint u1^T u1 = 1 (why do we need to bound u1?) leads to the Lagrangian
L(u1, λ1) = u1^T S u1 + λ1 (1 − u1^T u1)
which has a stationary point where u1 is an eigenvector of S with eigenvalue λ1:
S u1 = λ1 u1
The variance is then u1^T S u1 = λ1.
The variance is maximised if u1 is the eigenvector of the covariance matrix S with the largest eigenvalue.
PCA - Maximise Variance
Continue maximising the variance amongst all possible directions orthogonal to those already considered.
The optimal linear projection onto an M-dimensional space for which the variance is maximised is defined by the M eigenvectors u1, …, uM of the covariance matrix S corresponding to the M largest eigenvalues λ1, …, λM.
Is this subspace always uniquely defined? Not if λM = λM+1.
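Putting the result together, a minimal PCA sketch (NumPy only, synthetic data; eigh returns eigenvalues in ascending order, so the M largest are selected explicitly):

```python
import numpy as np

def pca(X, M):
    """Return the sample mean, the top-M eigenvectors (columns) and their eigenvalues."""
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / len(X)
    lam, U = np.linalg.eigh(S)          # ascending eigenvalues, eigenvectors in columns
    idx = np.argsort(lam)[::-1][:M]     # indices of the M largest eigenvalues
    return xbar, U[:, idx], lam[idx]

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))  # correlated data
xbar, U_M, lam_M = pca(X, M=2)
Z = (X - xbar) @ U_M                    # projected (N x M) representation
print(lam_M, Z.shape)
```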
PCA - Minimise Distortion Error
The distortion between the data points xn and their projections x̃n,
J = (1/N) Σ_{n=1}^N ‖xn − x̃n‖²,
is minimised if the variance is maximised. The distortion error is then
J = Σ_{i=M+1}^D λi
where λi, i = M+1, …, D, are the smallest eigenvalues of the covariance matrix S.
In signal processing we speak of the signal space (principal subspace) and the noise space (orthogonal to the principal subspace).
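This relationship can be checked numerically (a self-contained sketch on synthetic data; the projection keeps the M eigenvectors with the largest eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((400, 6)) @ rng.standard_normal((6, 6))
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)

lam, U = np.linalg.eigh(S)               # ascending eigenvalues
M = 2
U_M = U[:, -M:]                          # principal subspace (M largest eigenvalues)

X_tilde = xbar + (X - xbar) @ U_M @ U_M.T          # projections back in R^D
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))    # average squared distortion

assert np.isclose(J, lam[:-M].sum())     # equals the sum of the D-M smallest eigenvalues
```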
PCA - Applications
The eigenvectors of the covariance matrix are elements of the original vector space ui ∈ RD.
If the input data are images, the eigenvectors are also images.
The mean and the first four eigenvectors u1, …, u4 of a set of handwritten digits of ‘three’, with eigenvalues λ1 = 3.4·10^5, λ2 = 2.8·10^5, λ3 = 2.4·10^5, λ4 = 1.6·10^5.
Blue corresponds to positive values, white to zero, and yellow to negative values.
PCA - Applications
The eigenvalues of the covariance matrix express the variance of the data set in the direction of the corresponding eigenvectors.
Plot of the eigenvalue spectrum λi (scale ×10^5) against the index i for the digits-of-three data set.
PCA - Applications
The sum of the eigenvalues of the covariance matrix over the discarded directions expresses the distortion error
J = (1/N) Σ_{n=1}^N ‖xn − x̃n‖²
Plot of the distortion error J (scale ×10^6) versus the dimension M of the subspace used for projection.
PCA - Compression
The approximated data vector x̃n can be written in the form
x̃n = x̄ + Σ_{i=1}^M (ui^T (xn − x̄)) ui
Codebook: M + 1 vectors of dimension D (x̄ and u1, …, uM).
Compressed xn: the M factors ui^T (xn − x̄).
Reconstruction of an image retaining M = 1, 10, 50, 250 principal components, compared with the original.
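A sketch of the same codebook idea using scikit-learn (scikit-learn and the synthetic stand-in data are my own assumptions, not part of the lecture): mean_ plays the role of x̄, components_ holds the ui, transform yields the M coefficients, and inverse_transform rebuilds x̃n.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.standard_normal((300, 20)) @ rng.standard_normal((20, 20))   # stand-in for images

pca = PCA(n_components=5)                # keep M = 5 principal components
codes = pca.fit_transform(X)             # compressed x_n: 5 coefficients each
X_tilde = pca.inverse_transform(codes)   # x~_n = mean_ + codes @ components_

print(codes.shape)                                   # (300, 5)
print(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))   # average distortion
```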
PCA - Data Preprocessing
Standardise certain features of a data set (for instance as a preprocessing step to subsequent algorithms expecting these features).
Usually, individual standardisation: each variable (dimension) has zero mean and unit variance. But variables are still correlated.
PCA can do more: create decorrelated data (covariance is the identity; also called whitening or sphering of the data)
Write the eigenvector equation for the covariance matrix S:
S U = U L
where L is the diagonal matrix of (positive!) eigenvalues. Transform the original data by
yn = L^{-1/2} U^T (xn − x̄)
The set {yn} has mean zero and covariance given by the identity.
PCA - Data Preprocessing
Transform the original data by
yn = L^{-1/2} U^T (xn − x̄)
Mean of the set {yn}:
(1/N) Σ_{n=1}^N yn = L^{-1/2} U^T (1/N) Σ_{n=1}^N (xn − x̄) = 0
Covariance of the set {yn}:
(1/N) Σ_{n=1}^N yn yn^T = L^{-1/2} U^T [ (1/N) Σ_{n=1}^N (xn − x̄)(xn − x̄)^T ] U L^{-1/2}
                        = L^{-1/2} U^T S U L^{-1/2}
                        = L^{-1/2} U^T U L L^{-1/2}
                        = I
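A compact sketch of this whitening transform (NumPy, 1/N covariance convention, made-up correlated data; eigh gives eigenvectors in the columns, matching S U = U L):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                               [1.0, 0.5, 0.0],
                                               [0.0, 0.3, 0.1]])   # correlated data

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)

lam, U = np.linalg.eigh(S)                  # S U = U L, columns of U are eigenvectors
Y = (X - xbar) @ U @ np.diag(lam ** -0.5)   # y_n = L^{-1/2} U^T (x_n - xbar), row-wise

cov_Y = Y.T @ Y / len(Y)
assert np.allclose(Y.mean(axis=0), 0.0)     # zero mean
assert np.allclose(cov_Y, np.eye(3))        # identity covariance
```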
PCA - The Effect of Whitening
Compare standardising and whitening of a data set.
(b) also shows the principal axes of the normalised data set, plotted as red lines over the range ±λi^{1/2}.
Panel captions: original data (note the different axes); standardising to zero mean and unit variance; whitening to achieve unit covariance.
Extensions of PCA
Kernel PCA
Use Φ(x) as features, and express in terms of kernel matrix K
The covariance matrix S and the (centered) kernel matrix K have the same eigenvalues.
Probabilistic PCA
Explicitly model latent variable z ∼ N (z|0, I).
Mean value of the observed variable is given by Wz + μ
Conditional distribution of the observed variable:
x ∼ N(x | Wz + μ, σ²I)
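A tiny sketch of the probabilistic PCA generative model as stated above (W, μ and σ below are made-up values; sample z ∼ N(0, I), then x ∼ N(Wz + μ, σ²I)):

```python
import numpy as np

rng = np.random.default_rng(9)

D, M, N = 5, 2, 1000
W = rng.standard_normal((D, M))        # made-up loading matrix
mu = rng.standard_normal(D)            # made-up mean
sigma = 0.1

Z = rng.standard_normal((N, M))                           # z ~ N(0, I)
X = Z @ W.T + mu + sigma * rng.standard_normal((N, D))    # x ~ N(Wz + mu, sigma^2 I)

# The marginal covariance of x is W W^T + sigma^2 I (up to sampling noise).
S = np.cov(X, rowvar=False)
print(np.round(S - (W @ W.T + sigma**2 * np.eye(D)), 2))
```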
Independence versus Uncorrelatedness
Independence
p(x1, x2) = p(x1) p(x2)
Uncorrelatedness (defined via a zero covariance):
E [x1x2] − E [x1] E [x2] = 0
Independence implies Uncorrelatedness (prove it!). BUT Uncorrelatedness does NOT imply Independence.
Example: Draw the pair (x1, x2) with equal probability from the set {(0, 1), (0, −1), (1, 0), (−1, 0)}.
Then x1 and x2 are uncorrelated because E[x1] = E[x2] = E[x1x2] = 0 .
But x1 and x2 are NOT independent:
p(x1 = 0, x2 = −1) = 1/4
p(x1 = 0) p(x2 = −1) = 1/2 × 1/4 = 1/8
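The example above can also be checked empirically (a small simulation sketch):

```python
import numpy as np

rng = np.random.default_rng(10)
support = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)])
samples = support[rng.integers(0, 4, size=100_000)]     # draw pairs uniformly
x1, x2 = samples[:, 0], samples[:, 1]

print(np.mean(x1 * x2) - np.mean(x1) * np.mean(x2))     # ~0: uncorrelated
print(np.mean((x1 == 0) & (x2 == -1)))                  # ~0.25
print(np.mean(x1 == 0) * np.mean(x2 == -1))             # ~0.125: hence not independent
```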
Independent Component Analysis - Overview
Assume we have K signals and K recordings, each recording containing a mixture of the signals.
‘Cocktail party’ problem: K people speak at the same time in a room, and K microphones pick up a mixture of what they say.
Given unknown source signals S ∈ RN×K and an unknown mixing matrix A, producing the observed data X ∈ RN×K
X=SA
Can we recover the original signals (Blind Source Separation)?
Yes, under the assumptions that:
at most one of the signals is Gaussian distributed;
we don’t care about the amplitude (including the sign);
we don’t care about the order of the recovered signals;
we have at least as many observed mixtures as signals, and the matrix A has full rank and can be inverted.
Independent Component Analysis - Overview
Uncorrelated variables are not necessarily independent.
ICA maximises the statistical independence of the estimated components.
Find A in such a way that the columns of S = X A^{-1}
are maximally independent.
Several definitions of statistical independence are possible.
Central Limit Theorem: The distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).
FastICA algorithm.
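A short blind-source-separation sketch using scikit-learn’s FastICA (scikit-learn, the sources and the mixing matrix here are my own assumptions, not part of the lecture; recovered signals come back in arbitrary order, sign and scale, as noted above):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(11)
n = 2000
t = np.linspace(0, 8, n)

# Two non-Gaussian sources: a sine wave and a square wave.
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])   # N x K
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                                     # unknown mixing matrix
X = S @ A                                                      # observed mixtures, X = S A

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)        # estimated sources (arbitrary order, sign and scale)
print(np.corrcoef(S.T, S_hat.T).round(2))   # each true source correlates with one estimate
```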