Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part IX
Principal Component Analysis
Motivation: Pre-training Deep Neural Networks
Empirical observations – pre-2006:
Deep architectures get stuck in local minima or plateaus.
As the architecture gets deeper, it becomes more difficult to obtain good generalisation.
Hard to initialise random weights well.
Networks with 1 or 2 hidden layers seem to perform better.
2006: unsupervised pre-training of each layer; deeper models become possible.
Usually based on auto-encoders (tomorrow’s lecture).
Similar in spirit to PCA (today’s lecture).
Motivation: Exploratory Analysis
Given a dataset of numerical features:
Low dimensional data may be easy to plot
High dimensional data is challenging
Dimensionality reduction (e.g. PCA)
Try to explain with fewer dimensions
Enables visualisation
The new basis may yield insights
Aside: can simplify/speed up subsequent analysis, e.g. regression
Supervised Learning
Given are pairs of data x_i ∈ X and targets t_i ∈ T in the form (x_i, t_i), where i = 1, . . . , N.
Learn a mapping between the data X and the target t which generalises well to new data.
Unsupervised Learning
Given only the data x_i ∈ X.
Discover (= learn) some interesting structure inherent in the data X.
Testing – Supervised versus Unsupervised Learning
Recall: Fisher’s Linear Discriminant
Samples from two classes in a two-dimensional input space, and their histograms when projected onto two different one-dimensional spaces.
Eigenvectors
Every square matrix A ∈ R^{n×n} has at least one eigenvalue/eigenvector pair satisfying
    A x = λ x
where x ∈ C^n (x ≠ 0) and λ ∈ C.
Example:
    [ 0  1 ; −1  0 ] x = λ x
has eigenvalues
    λ ∈ {−i, i}
with corresponding eigenvectors
    x ∈ { (i, 1)^T, (−i, 1)^T }
Eigenvectors
How many eigenvalue/eigenvector pairs?
    A x = λ x
is equivalent to
    (A − λI) x = 0,
which has a non-trivial solution only if det(A − λI) = 0.
This determinant is a polynomial of degree n in λ (the characteristic polynomial), so there are at most n distinct eigenvalues.
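As an illustration (not part of the original slides; assumes NumPy is available), the eigenvalues of the example matrix from the previous slide can be checked numerically:

```python
import numpy as np

# The 2x2 example matrix from the Eigenvectors slide.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

# np.linalg.eig returns complex eigenvalues/eigenvectors for real matrices
# whose characteristic polynomial has complex roots.
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # the eigenvalues i and -i (order may vary)

# Each column of eigvecs is an eigenvector: check A v = lambda v.
for k in range(2):
    v = eigvecs[:, k]
    assert np.allclose(A @ v, eigvals[k] * v)
```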
Real Eigenvalues
How can we enforce real eigenvalues?
Let’s look at matrices with complex entries, A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.
    [ 1+2i  3+4i ; 5+6i  7+8i ]^H = [ 1−2i  5−6i ; 3−4i  7−8i ]
Denote the complex conjugate of a complex number λ by λ̄.
Real Eigenvalues
How can we enforce real eigenvalues?
Let’s assume A ∈ C^{n×n} is Hermitian (A^H = A).
For an eigenvector x ∈ C^n of A with eigenvalue λ, calculate
    x^H A x = λ x^H x.
Another way to calculate x^H A x:
    x^H A x = x^H A^H x      (A is Hermitian)
            = (x^H A x)^H    (reverse order)
            = (λ x^H x)^H    (eigenvalue equation)
            = λ̄ x^H x
and therefore λ = λ̄, i.e. λ is real.
If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
Singular Value Decomposition
Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices
    A = U Σ V^T
where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has non-negative numbers on the diagonal.
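A quick numerical sanity check (not from the slides; assumes NumPy) that the factors returned by np.linalg.svd satisfy the decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # an arbitrary 5x3 matrix

# full_matrices=True gives U (5x5), V^T (3x3) and the singular values s.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 5x3 Sigma with the singular values on its diagonal.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(A, U @ Sigma @ Vt)      # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(5))     # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))   # V orthogonal
```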
Linear Curve Fitting – Least Squares
N = 10 data points
x ≡ (x_1, . . . , x_N)^T,  t ≡ (t_1, . . . , t_N)^T
x_i ∈ R, t_i ∈ R, i = 1, . . . , N
Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x  1]
Least-squares solution: w* = (X^T X)^{-1} X^T t
[Figure: scatter plot of the N = 10 data points, t versus x.]
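A minimal sketch (assumes NumPy; the data here are synthetic, not the course's) of the closed-form least-squares solution w* = (X^T X)^{-1} X^T t:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 10, N)
t = 0.7 * x + 0.5 + rng.normal(scale=0.5, size=N)   # noisy line (made-up coefficients)

# Design matrix X = [x 1]: one column for the slope, one for the intercept.
X = np.column_stack([x, np.ones(N)])

# w* = (X^T X)^{-1} X^T t -- in practice np.linalg.lstsq is preferred
# over forming an explicit inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print("slope w1 =", w_star[0], "intercept w0 =", w_star[1])
```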
Inverse of a matrix
Assume a full-rank symmetric real matrix A.
Then A = U^T Λ U, where
Λ is a diagonal matrix with the (real) eigenvalues, and
U contains the corresponding orthonormal eigenvectors (as rows), so U^T U = I.
    A^{-1} = (U^T Λ U)^{-1}
           = U^{-1} Λ^{-1} U^{-T}     (inverse reverses the order)
           = U^T Λ^{-1} U             (since U^T U = I)
The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its elements.
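A numerical check of this identity (not from the slides; assumes NumPy). Note that np.linalg.eigh returns the eigenvectors as the columns of a matrix V with A = V Λ V^T, i.e. U = V^T in the slide's convention:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)      # symmetric and full rank (positive definite)

eigvals, V = np.linalg.eigh(A)   # A = V @ diag(eigvals) @ V.T
U = V.T                          # slide convention: A = U^T Lambda U

# Inverse via the eigendecomposition: A^{-1} = U^T Lambda^{-1} U.
A_inv = U.T @ np.diag(1.0 / eigvals) @ U

assert np.allclose(A_inv, np.linalg.inv(A))
```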
Dimensionality Reduction
Main goal of Principal Component Analysis:
dimensionality reduction
Many applications in visualisation, feature extraction,
signal processing, data compression . . .
Example: take hand-written digits (binary data) and place them into a larger frame (100 × 100), varying the position and the rotation angle.
Data space dimension = 10 000.
But the data live on a three-dimensional manifold (x position, y position, and the rotation angle).
FYI only: this manifold is not linear and calls for cutting-edge models such as capsule networks (Hinton 2017); still, we can approximate it locally with PCA.
Principal Component Analysis (PCA)
Idea: linearly project the data points onto a lower-dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
We need to find this lower-dimensional subspace, called the principal subspace.
[Figure: data points x_n in the (x_1, x_2) plane and their projections x̃_n onto the one-dimensional principal subspace spanned by u_1.]
Principal Component Analysis (PCA)
Given N observations x_n ∈ R^D, n = 1, . . . , N.
Project onto a space with dimensionality M < D while maximising the variance.
More advanced treatments estimate M from the data; here M is fixed.
Consider first a 1-dimensional subspace spanned by a unit vector u_1 ∈ R^D, u_1^T u_1 = 1.
PCA - Maximise Variance
Each data point x_n is projected onto a scalar value u_1^T x_n.
The mean of the projected data is u_1^T x̄, where x̄ is the sample mean
    x̄ = (1/N) Σ_{n=1}^N x_n.
The variance of the projected data is then
    (1/N) Σ_{n=1}^N ( u_1^T x_n − u_1^T x̄ )² = u_1^T S u_1
with the covariance matrix
    S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T.
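A small numerical check (not from the slides; assumes NumPy) that the variance of the projected data equals u_1^T S u_1:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 500, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated data, rows are x_n

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N       # covariance matrix (1/N convention as on the slide)

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)                  # unit direction

proj = X @ u1                             # scalars u1^T x_n
var_proj = np.mean((proj - proj.mean()) ** 2)

assert np.allclose(var_proj, u1 @ S @ u1)
```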
PCA - Maximise Variance
Maximising u_1^T S u_1 under the constraint u_1^T u_1 = 1 (why do we need to bound u_1?) leads to the Lagrangian
    L(u_1, λ_1) = u_1^T S u_1 + λ_1 (1 − u_1^T u_1),
which has a stationary point if u_1 is an eigenvector of S with eigenvalue λ_1:
    S u_1 = λ_1 u_1.
The variance is then u_1^T S u_1 = λ_1.
The variance is maximised if u_1 is the eigenvector of the covariance S with the largest eigenvalue.
PCA - Maximise Variance
Continue maximising the variance amongst all possible directions orthogonal to those already considered.
The optimal linear projection onto an M-dimensional space, for which the variance of the projected data is maximised, is defined by the M eigenvectors u_1, . . . , u_M of the covariance matrix S corresponding to the M largest eigenvalues λ_1, . . . , λ_M.
Is this subspace always uniquely defined? Not if λ_M = λ_{M+1}.
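A minimal PCA sketch along these lines (assumes NumPy; the helper pca_project is illustrative, not the course code):

```python
import numpy as np

def pca_project(X, M):
    """Project rows of X (N x D) onto the M principal directions."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]                 # covariance matrix
    eigvals, U = np.linalg.eigh(S)             # ascending eigenvalues, columns are eigenvectors
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, U = eigvals[order], U[:, order]
    U_M = U[:, :M]                             # top-M principal directions u_1..u_M
    Z = Xc @ U_M                               # projected coordinates u_i^T (x_n - x_bar)
    return Z, U_M, eigvals, x_bar

# Example usage on synthetic data:
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U_M, eigvals, x_bar = pca_project(X, M=2)
print(Z.shape)   # (200, 2)
```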
PCA - Minimise Distortion Error
The distortion between the data points x_n and their projections x̃_n,
    J = (1/N) Σ_{n=1}^N ‖x_n − x̃_n‖²,
is minimised exactly when the variance is maximised.
The distortion error is then
    J = Σ_{i=M+1}^D λ_i
where λ_i, i = M + 1, . . . , D are the smallest eigenvalues of the covariance matrix S.
In signal processing we speak of the signal space (the principal subspace) and the noise space (orthogonal to the principal subspace).
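A quick check of this identity on synthetic data (assumes NumPy): the average squared reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / X.shape[0]

eigvals, U = np.linalg.eigh(S)
eigvals, U = eigvals[::-1], U[:, ::-1]        # sort descending

M = 2
U_M = U[:, :M]
X_tilde = x_bar + (Xc @ U_M) @ U_M.T          # projections back into R^D

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
assert np.allclose(J, eigvals[M:].sum())      # J = sum of the discarded eigenvalues
```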
PCA - Applications
The eigenvectors of the covariance matrix are elements of the original vector space, u_i ∈ R^D.
If the input data are images, the eigenvectors are also images.
[Figure: the mean and the first four eigenvectors u_1, . . . , u_4 of a set of handwritten digits ‘three’, with eigenvalues λ_1 = 3.4 · 10^5, λ_2 = 2.8 · 10^5, λ_3 = 2.4 · 10^5, λ_4 = 1.6 · 10^5. Blue corresponds to positive values, white to zero, and yellow to negative values.]
PCA - Applications
The eigenvalues of the covariance matrix express the variance of the data set in the direction of the corresponding eigenvectors.
[Figure (a): plot of the eigenvalue spectrum λ_i versus index i for the ‘three’ digits data set.]
PCA - Applications
The sum of the eigenvalues of the covariance matrix over the discarded directions expresses the distortion error
    J = (1/N) Σ_{n=1}^N ‖x_n − x̃_n‖²
[Figure (b): plot of the distortion error J versus the number of dimensions M of the subspace used for projection.]
PCA - Compression
The approximated data vector x̃_n can be written in the form
    x̃_n = x̄ + Σ_{i=1}^M ( u_i^T (x_n − x̄) ) u_i
Codebook: M + 1 vectors of dimension D (x̄ and the u_i).
Compressed x_n: the M coefficients u_i^T (x_n − x̄).
[Figure: reconstruction of an image retaining M principal components; panels show the original and M = 1, 10, 50, 250.]
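A sketch of this compression scheme (assumes NumPy; the helper names are illustrative, not the course code): store x̄, the u_i and the M coefficients per data point, then reconstruct:

```python
import numpy as np

def pca_codebook(X, M):
    """Return the codebook (mean and top-M directions) for rows of X."""
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar) / X.shape[0]
    eigvals, U = np.linalg.eigh(S)
    U_M = U[:, np.argsort(eigvals)[::-1][:M]]   # top-M eigenvectors
    return x_bar, U_M

def compress(x, x_bar, U_M):
    return U_M.T @ (x - x_bar)                  # M coefficients u_i^T (x - x_bar)

def reconstruct(coeffs, x_bar, U_M):
    return x_bar + U_M @ coeffs                 # x_tilde = x_bar + sum_i coeff_i u_i

# Example usage with random "images" flattened to vectors:
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 64))
x_bar, U_M = pca_codebook(X, M=8)
x_tilde = reconstruct(compress(X[0], x_bar, U_M), x_bar, U_M)
print(np.linalg.norm(X[0] - x_tilde))           # reconstruction error for the first point
```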
PCA - Data Preprocessing
Standardise certain features of a data set (for instance as a preprocessing step for subsequent algorithms that expect such features).
Usually, individual standardisation: each variable (dimension) has zero mean and unit variance. But the variables are still correlated.
PCA can do more: create decorrelated data whose covariance is the identity (also called whitening or sphering of the data).
Write the eigenvector equation for the covariance matrix S as
    S U = U L
where L is the diagonal matrix of (positive!) eigenvalues and the columns of U are the corresponding eigenvectors.
Transform the original data by
    y_n = L^{-1/2} U^T (x_n − x̄).
The set {y_n} has zero mean and covariance given by the identity.
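A whitening sketch (assumes NumPy; not the course code) implementing y_n = L^{-1/2} U^T (x_n − x̄):

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Whiten rows of X: returns Y with zero mean and (approximately) identity covariance."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]
    L, U = np.linalg.eigh(S)           # S U = U diag(L); columns of U are eigenvectors
    # eps guards against division by near-zero eigenvalues.
    Y = Xc @ U / np.sqrt(L + eps)      # y_n = L^{-1/2} U^T (x_n - x_bar), applied row-wise
    return Y

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
Y = whiten(X)
print(np.round(Y.T @ Y / Y.shape[0], 3))   # approximately the identity matrix
```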
PCA - Data Preprocessing
Transform the original data by
    y_n = L^{-1/2} U^T (x_n − x̄).
Mean of the set {y_n}:
    (1/N) Σ_{n=1}^N y_n = (1/N) Σ_{n=1}^N L^{-1/2} U^T (x_n − x̄)
                        = L^{-1/2} U^T (1/N) Σ_{n=1}^N (x_n − x̄)
                        = 0.
Covariance of the set {y_n}:
    (1/N) Σ_{n=1}^N y_n y_n^T = (1/N) Σ_{n=1}^N L^{-1/2} U^T (x_n − x̄)(x_n − x̄)^T U L^{-1/2}
                              = L^{-1/2} U^T S U L^{-1/2}
                              = L^{-1/2} U^T U L L^{-1/2}
                              = L^{-1/2} L L^{-1/2}
                              = I.
PCA - The Effect of Whitening
Compare standardising and whitening of a data set.
Panel (b) also shows the principal axes of the normalised data set, plotted as red lines over the range ±λ_i^{1/2}.
[Figure, three panels: (a) the original data (note the different axes); (b) standardised to zero mean and unit variance; (c) whitened to achieve unit covariance.]
Extensions of PCA
Kernel PCA
Use Φ(x) as features, and express everything in terms of the kernel matrix K.
The covariance matrix S and the (centered) kernel matrix K have the same eigenvalues.
Probabilistic PCA
Explicitly model a latent variable z ∼ N(z | 0, I).
The mean of the observed variable is given by W z + µ.
Conditional distribution of the observed variable:
    x ∼ N(x | W z + µ, σ² I)
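A tiny sketch of sampling from the probabilistic PCA generative model (assumes NumPy; W, µ and σ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
D, M, N = 5, 2, 1000
W = rng.normal(size=(D, M))        # arbitrary loading matrix (illustrative only)
mu = rng.normal(size=D)            # arbitrary mean
sigma = 0.1

z = rng.normal(size=(N, M))                           # z ~ N(0, I)
x = z @ W.T + mu + sigma * rng.normal(size=(N, D))    # x | z ~ N(W z + mu, sigma^2 I)

# The marginal covariance of x is W W^T + sigma^2 I; the difference below
# should be close to zero up to sampling noise.
emp_cov = np.cov(x, rowvar=False, bias=True)
print(np.round(emp_cov - (W @ W.T + sigma**2 * np.eye(D)), 2))
```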
Independence versus Uncorrelatedness
Independence:
    p(x_1, x_2) = p(x_1) p(x_2)
Uncorrelatedness (defined via a zero covariance):
    E[x_1 x_2] − E[x_1] E[x_2] = 0
Independence implies uncorrelatedness (prove it!).
BUT uncorrelatedness does NOT imply independence.
Example: draw the pair (x_1, x_2) with equal probability from the set {(0, 1), (0, −1), (1, 0), (−1, 0)}.
Then x_1 and x_2 are uncorrelated because
    E[x_1] = E[x_2] = E[x_1 x_2] = 0.
But x_1 and x_2 are NOT independent:
    p(x_1 = 0, x_2 = −1) = 1/4,  whereas  p(x_1 = 0) p(x_2 = −1) = 1/2 × 1/4 = 1/8.
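An empirical illustration of this example (assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(9)
points = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)])
samples = points[rng.integers(0, 4, size=100_000)]   # draw pairs uniformly
x1, x2 = samples[:, 0], samples[:, 1]

# Uncorrelated: covariance close to zero.
print("covariance:", np.mean(x1 * x2) - np.mean(x1) * np.mean(x2))

# Not independent: joint probability differs from the product of marginals.
p_joint = np.mean((x1 == 0) & (x2 == -1))            # approx 1/4
p_prod = np.mean(x1 == 0) * np.mean(x2 == -1)        # approx 1/2 * 1/4 = 1/8
print("p(x1=0, x2=-1) =", p_joint, " p(x1=0)p(x2=-1) =", p_prod)
```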
Independent Component Analysis - Overview
Assume we have K signals and K recordings, each recording containing a mixture of the signals.
‘Cocktail party’ problem: K people speak at the same time in a room, and K microphones pick up a mixture of what they say.
Given unknown source signals S ∈ R^{N×K} and an unknown mixing matrix A, producing the observed data X ∈ R^{N×K}:
    X = S A
Can we recover the original signals (Blind Source Separation)?
Yes, under the assumptions that
at most one of the signals is Gaussian distributed;
we do not care about the amplitude (including the sign);
we do not care about the order of the recovered signals;
we have at least as many observed mixtures as signals, and the matrix A has full rank and can be inverted.
Independent Component Analysis - Overview
Uncorrelated variables are not necessarily independent.
ICA maximises the statistical independence of the estimated components.
Find A in such a way that the columns of
    S = X A^{-1}
are maximally independent.
Several definitions of statistical independence are possible.
Central Limit Theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions), so a mixture of sources is typically more Gaussian than the individual sources; maximising the non-Gaussianity of the components is therefore one route to independence.
FastICA algorithm.
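A blind-source-separation sketch using scikit-learn's FastICA (assumes scikit-learn is available; the signals and mixing matrix below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(10)
N = 2000
t = np.linspace(0, 8, N)

# Two non-Gaussian source signals (columns of S).
S = np.column_stack([np.sign(np.sin(3 * t)),    # square-ish wave
                     np.sin(5 * t)])            # sinusoid

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                      # mixing matrix
X = S @ A                                       # observed mixtures, X = S A

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)    # estimated sources, up to scale, sign and order

print(S_est.shape)              # (2000, 2)
```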