Statistical Machine Learning
©2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part X
Neural Network 3
Number of layers
The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning
For a fixed number of training examples, we expect compact representations of the target function to yield better generalisation
Example representations
affine operations + sigmoid ⇒ logistic regression has depth 1, with a fixed number of units (a.k.a. neurons)
fixed kernel + affine operations ⇒ a kernel machine (e.g. SVM) has two levels, with as many units as data points
stacked layers of “linear transformation followed by a non-linearity” ⇒ a deep neural network has arbitrary depth, with an arbitrary number of units per layer
Easier to represent with more layers
An old result:
functions that can be compactly represented by a depth k
architecture might require an exponential number of
computational elements to be represented by a depth
k − 1 architecture
Example: the d-bit parity function
$$\mathrm{parity}\colon (b_1, \dots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ 0 & \text{otherwise} \end{cases}$$
Theorem: d-bit parity circuits of depth 2 have exponential size
Analogous in modern deep learning:
“Shallow networks require exponentially more parameters
for the same number of modes” — Canadian deep
learning mafia.
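A minimal sketch of this parity map in Python (illustrative only, not part of the lecture material):

```python
# d-bit parity as defined above: 1 if the number of set bits is even, else 0.
def parity(bits):
    return 1 if sum(bits) % 2 == 0 else 0

assert parity([0, 0, 0]) == 1      # zero ones: even
assert parity([1, 0, 1, 1]) == 0   # three ones: odd
```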
Recall: Multi-layer Neural Network Architecture
$$y_k(\mathbf{x}, \mathbf{w}) = g\left( \sum_{j=0}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)$$
where w now contains all weight and bias parameters.
[Figure: two-layer network diagram with inputs x_0, x_1, …, x_D, hidden units z_0, z_1, …, z_M, outputs y_1, …, y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}.]
We could add more hidden layers
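As a minimal illustration (not the lecture's reference code), the forward pass above can be written directly in NumPy; the weight shapes and the choice of tanh/identity activations are assumptions of this sketch:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, g=lambda a: a):
    """Two-layer network: y_k(x, w) = g(sum_j w2_kj h(sum_i w1_ji x_i)).
    The biases b1, b2 play the role of the weights attached to x_0 = 1 and z_0 = 1."""
    z = h(W1 @ x + b1)      # hidden activations z_1 .. z_M
    y = g(W2 @ z + b2)      # outputs y_1 .. y_K
    return y

# Example with D=3 inputs, M=4 hidden units, K=2 outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))
```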
Empirical observations – pre 2006
Deep architectures get stuck in local minima or plateaus
As the architecture gets deeper, it becomes more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: unsupervised pre-training to find a distributed representation
Deep representation – intuition
Bengio, “Learning Deep Architectures for AI”, 2009
Deep representation – practice
AlexNet / VGG-F network visualized by mNeuron.
Recall: PCA
Idea: Linearly project the data points onto a lower
dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the
principal subspace.
[Figure: two-dimensional data points x_n projected onto the principal subspace spanned by u_1, giving the projections x̃_n.]
Multiple PCA layers? – linear transforms
Principal Component Analysis is a linear transformation (because it is a projection)
The composite of two linear transformations is linear
Linear transformations $M : \mathbb{R}^m \to \mathbb{R}^n$ are matrices
Let S and T be matrices of appropriate dimensions such that ST is defined; then
$$ST(X + X') = ST(X) + ST(X')$$
and similarly for multiplication by a scalar
⇒ multiple PCA layers pointless
Multiple PCA layers? – projection
Let $X^T X = U \Lambda U^T$ be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
Define $U_k$ to be the matrix of the first k columns of U, for the k largest eigenvalues. Define $\Lambda_k$ similarly.
Consider the features formed by projecting onto the principal components
$$Z = X U_k$$
We perform PCA a second time, $Z^T Z = V \Lambda_Z V^T$.
By the definition of Z and $X^T X$, and the orthogonality of U,
$$Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U \Lambda U^T U_k = \Lambda_k$$
Hence $\Lambda_Z = \Lambda_k$ and V is the identity, therefore the second PCA has no effect
⇒ again, multiple PCA layers pointless
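A quick numerical check of this argument (a sketch assuming zero-mean data and NumPy's eigendecomposition; not part of the lecture material):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X -= X.mean(axis=0)                       # centre the data

# First PCA: eigendecomposition of X^T X, keep the top-k eigenvectors U_k.
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
k = 2
Uk = U[:, order[:k]]
Z = X @ Uk                                # projected features

# Second PCA on Z: Z^T Z is already diagonal (= Lambda_k), so its eigenvectors
# are standard basis vectors (up to sign and ordering) -- no new directions.
lamZ, V = np.linalg.eigh(Z.T @ Z)
print(np.round(Z.T @ Z, 6))               # ~diagonal
print(np.round(np.abs(V), 6))             # columns are ~standard basis vectors
```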
Autoencoder
An autoencoder is trained to encode the input x into some
representation c(x) so that the input can be reconstructed
from that representation
the target output of the autoencoder is the autoencoder
input itself
With one linear hidden layer and the mean squared error
criterion, the k hidden units learn to project the input in the
span of the first k principal components of the data
If the hidden layer is nonlinear, the autoencoder behaves
differently from PCA, with the ability to capture multimodal
aspects of the input distribution
Let f be the decoder. We want to minimise the
reconstruction error
$$\sum_{n=1}^{N} \ell\big(x_n, f(c(x_n))\big)$$
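A compact sketch of a one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error (NumPy, with an assumed sigmoid encoder and linear decoder; not the lecture's reference implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # toy data: N=200 inputs of dimension d=10
d, k, lr = X.shape[1], 3, 0.01            # k-dimensional code, learning rate

W1, b1 = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)   # encoder c(x)
W2, b2 = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)   # decoder f(.)

for _ in range(500):
    Z = sigmoid(X @ W1 + b1)              # codes c(x_n)
    Xhat = Z @ W2 + b2                    # reconstructions f(c(x_n))
    R = Xhat - X                          # residuals
    # Backprop of the summed squared reconstruction error sum_n ||x_n - f(c(x_n))||^2
    gW2, gb2 = Z.T @ R, R.sum(axis=0)
    dZ = (R @ W2.T) * Z * (1 - Z)         # through the sigmoid encoder
    gW1, gb1 = X.T @ dZ, dZ.sum(axis=0)
    W1 -= lr * gW1 / len(X); b1 -= lr * gb1 / len(X)
    W2 -= lr * gW2 / len(X); b2 -= lr * gb2 / len(X)

print("mean squared reconstruction error:", np.mean(R ** 2))
```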
Cost function
Recall: f (c(x)) is the reconstruction produced by the
network
Minimisation of the negative log likelihood of the reconstruction, given the encoding c(x):
$$RE = -\log P(x \mid c(x))$$
If x|c(x) is Gaussian, we recover the familiar squared error
If the inputs $x_i$ are either binary or considered to be binomial probabilities, then the loss function is the cross entropy
$$-\log P(x \mid c(x)) = -\sum_i \big[ x_i \log f_i(c(x)) + (1 - x_i) \log(1 - f_i(c(x))) \big]$$
where $f_i(\cdot)$ is the ith component of the decoder
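A small sketch of this Bernoulli cross-entropy reconstruction loss (NumPy; the clipping constant is an assumption added for numerical safety):

```python
import numpy as np

def cross_entropy_reconstruction(x, f_of_c, eps=1e-12):
    """-log P(x | c(x)) for binary / binomial-probability inputs:
    -sum_i [ x_i log f_i(c(x)) + (1 - x_i) log(1 - f_i(c(x))) ]."""
    f = np.clip(f_of_c, eps, 1 - eps)     # avoid log(0)
    return -np.sum(x * np.log(f) + (1 - x) * np.log(1 - f))

x = np.array([1.0, 0.0, 1.0])
print(cross_entropy_reconstruction(x, np.array([0.9, 0.1, 0.8])))  # good reconstruction: small loss
print(cross_entropy_reconstruction(x, np.array([0.1, 0.9, 0.2])))  # bad reconstruction: large loss
```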
Undercomplete Autoencoder
Consider a small number of hidden units.
c(x) is viewed as a lossy compression of x
Cannot have small loss for all x, so focus on training
examples
Hope code c(x) is a distributed representation that
captures the main factors of variation in the data
Stacking autoencoders
Let $c_j$ and $f_j$ be the encoder and corresponding decoder of the jth layer
Let $z_j$ be the representation after the encoder $c_j$
We can define multiple layers of autoencoders recursively.
For example, let $z_1 = c_1(x)$ and $z_2 = c_2(z_1)$;
the corresponding decoding is given by $f_1(f_2(z_2))$
Because of non-linear activation functions, the latent feature $z_2$ can capture more complex patterns than $z_1$.
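The recursive composition could be sketched as follows (plain Python; each encoder/decoder here is a placeholder for an already-trained layer, not the lecture's code):

```python
# Sketch: stacking autoencoder layers (c_1, f_1), (c_2, f_2), ...
def stack_encode(x, encoders):
    """z_j = c_j(c_{j-1}(... c_1(x) ...))"""
    for c in encoders:
        x = c(x)
    return x

def stack_decode(z, decoders):
    """Decode in reverse order: f_1(f_2(... z ...))"""
    for f in reversed(decoders):
        z = f(z)
    return z

# Usage: x_hat = stack_decode(stack_encode(x, [c1, c2]), [f1, f2])
```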
Higher level image features – faces
codingplayground.blogspot.com
Pre-training supervised neural networks
Latent features $z_j$ in layer j can capture high-level patterns:
$$z_j = c_j(c_{j-1}(\cdots c_2(c_1(x)) \cdots))$$
These features may also be useful for supervised learning tasks.
In contrast to the feed-forward network, the features $z_j$ are constructed in an unsupervised fashion.
Discard the decoding layers, and directly use $z_j$ with a supervised training method, such as logistic regression.
Various such pre-trained networks are available online, e.g. VGG-19.
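A toy sketch of this "encode, then classify" pipeline (Python with scikit-learn assumed available; the pre-trained encoders are replaced by fixed random nonlinear maps purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

# Stand-ins for pre-trained encoders c_1, c_2 (here just fixed random nonlinear maps).
W1, W2 = rng.normal(size=(20, 10)), rng.normal(size=(10, 5))
def encode(X):
    z1 = np.tanh(X @ W1)                  # z_1 = c_1(x)
    return np.tanh(z1 @ W2)               # z_2 = c_2(z_1)

# Discard the decoders; fit a supervised model on the codes z_2.
clf = LogisticRegression(max_iter=1000).fit(encode(X), y)
print("training accuracy:", clf.score(encode(X), y))
```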
Xavier Initialisation / ReLU
Layer-wise unsupervised pre-training helps by extracting
useful features for subsequent supervised backprop.
Pre-training also avoids saturation (large magnitude
arguments to, say, sigmoidal units).
Simpler Xavier initialization can also avoid saturation.
Let the inputs $x_i \sim \mathcal{N}(0, 1)$, weights $w_i \sim \mathcal{N}(0, \sigma^2)$ and activation $z = \sum_{i=1}^{m} x_i w_i$. Then:
$$\operatorname{VAR}[z] = \mathbb{E}[(z - \mathbb{E}[z])^2] = \mathbb{E}[z^2] = \mathbb{E}\Big[\Big(\sum_{i=1}^{m} x_i w_i\Big)^2\Big] = \sum_{i=1}^{m} \mathbb{E}[(x_i w_i)^2] = \sum_{i=1}^{m} \mathbb{E}[x_i^2]\,\mathbb{E}[w_i^2] = m\sigma^2.$$
So we set $\sigma = 1/\sqrt{m}$ to have “nice” activations.
Glorot initialization takes care to have nice
back-propagated signals — see the auto-encoder lab.
ReLU activations h(x) = max(x, 0) also help in practice.
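A quick empirical check of the $\sigma = 1/\sqrt{m}$ rule (a sketch; the layer width and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 512, 10_000
X = rng.normal(size=(n_samples, m))                  # inputs x_i ~ N(0, 1)

for sigma in (1.0, 1.0 / np.sqrt(m)):
    W = rng.normal(scale=sigma, size=(m,))           # weights w_i ~ N(0, sigma^2)
    z = X @ W                                        # activations z = sum_i x_i w_i
    print(f"sigma={sigma:.4f}  empirical VAR[z]={z.var():.3f}  (theory: {m * sigma**2:.3f})")
```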
Higher dimensional hidden layer
If there is no other constraint, then an autoencoder with d-dimensional input and an encoding of dimension at least d could potentially just learn the identity function
Avoid this by:
Regularisation
Early stopping of stochastic gradient descent
Adding noise to the encoding
A sparsity constraint on the code c(x)
Denoising autoencoder
Add noise to the input, keeping the clean example as the target output
The autoencoder tries to:
1 preserve the information of the input
2 undo the stochastic corruption process
Reconstruction log likelihood
$$-\log P(x \mid c(\hat{x}))$$
where x is the noise-free input and x̂ the corrupted version
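The corruption step is the only change relative to the plain autoencoder sketch earlier: noise is added to the input while the clean x stays as the target (a sketch; the Gaussian corruption and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(200, 10))                 # targets: the noise-free x
noise_level = 0.3                                    # assumed corruption strength
X_corrupted = X_clean + noise_level * rng.normal(size=X_clean.shape)   # inputs: x̂

# Training pairs for a denoising autoencoder: feed X_corrupted through the
# encoder/decoder, but measure the reconstruction error against X_clean,
# i.e. minimise sum_n || x_n - f(c(x̂_n)) ||^2  (or the -log P(x | c(x̂)) above).
```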
Image denoising
Images with Gaussian noise added.
Denoised using Stacked Sparse Denoising Autoencoder
Images from Xie et al., NIPS 2012
Inpainting – 1
Free a bird
Image from http://cimg.eu/greycstoration/demonstration.shtml
Inpainting – 2
Undo text over image
Image from Bach et al., ICCV tutorial 2009
Recall: Basis functions
For fixed basis functions φ(x), we use domain knowledge to encode features
Neural networks use data to learn a set of transformations.
φi(x) is the ith component of the feature vector, and is
learned by the network.
The transformations φi(·) for a particular dataset may no
longer be orthogonal, and furthermore may be minor
variations of each other.
We collect all the transformed features into a matrix Φ.
Sparse representations
Idea: Have many hidden nodes, but only a few active for a
particular code c(x).
Student t prior on codes
$\ell_1$ penalty on coefficients $\alpha$
Given bases in matrix $\Phi$, look for codes by choosing $\alpha$ such that the input signal x is reconstructed with low $\ell_2$ reconstruction error, while $\alpha$ is sparse
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
$\Phi$ is overcomplete, no longer orthogonal
Sparse ⇒ small number of non-zero $\alpha_i$.
Exact recovery under certain conditions (coherence): $\ell_1 \to \ell_0$.
$\ell_1$ regulariser ∼ Laplace prior $p(\alpha_i) = \frac{\lambda}{2} \exp(-\lambda |\alpha_i|)$.
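A minimal ISTA-style sketch for the $\ell_1$ problem above (NumPy; the step size, λ and dictionary are illustrative, and ISTA is just one of several standard solvers, not necessarily the one used in the lecture):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, Phi, lam=0.1, n_iter=500):
    """Minimise 0.5 * ||x - Phi a||_2^2 + lam * ||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))
Phi /= np.linalg.norm(Phi, axis=0)         # unit-norm columns (overcomplete dictionary)
a_true = np.zeros(50); a_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = Phi @ a_true + 0.01 * rng.normal(size=20)
a_hat = sparse_code(x, Phi)
print("non-zero coefficients:", np.flatnonzero(np.abs(a_hat) > 1e-3))
```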
The image denoising problem
$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$
Sparsity assumption
We only have noisy measurements
$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$
Given $\Phi \in \mathbb{R}^{m \times p}$, find $\alpha$ such that $\|\alpha\|_0$ is small for $x \approx \Phi\alpha$,
where $\|\cdot\|_0$ is the number of non-zero elements of $\alpha$.
Φ is not necessarily features constructed from training
data.
Minimise the reconstruction error
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$
Convex relaxation
Want to minimise the number of non-zero components
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$
but $\|\cdot\|_0$ is hard to optimise
Relax to a convex norm
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
where $\|\alpha\|_1 = \sum_i |\alpha_i|$.
In which settings does minimisation with $\ell_1$ regularisation give the same solution as minimisation with $\ell_0$ regularisation (exact recovery)?
Mutual coherence
Expect recovery to work when the columns of $\Phi$ are “not too parallel”
Assume the columns of $\Phi$ are normalised to unit norm
Let $K = \Phi^T \Phi$ be the Gram matrix; then $K(i, j)$ is the value of the inner product between $\phi_i$ and $\phi_j$.
Define the mutual coherence
$$M = M(\Phi) = \max_{i \neq j} |K(i, j)|$$
If we have an orthogonal basis, then $\Phi$ is an orthogonal matrix, hence $K(i, j) = 0$ when $i \neq j$.
However, if we have very similar columns, then $M \approx 1$.
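Computing the mutual coherence is a one-liner once the columns are normalised (a sketch, not part of the lecture material):

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max_{i != j} |<phi_i, phi_j>| for unit-norm columns phi_i."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)       # normalise columns
    K = Phi.T @ Phi                               # Gram matrix of the columns
    np.fill_diagonal(K, 0.0)                      # ignore the i = j entries
    return np.max(np.abs(K))

rng = np.random.default_rng(0)
print(mutual_coherence(np.eye(5)))                    # orthogonal basis: 0
print(mutual_coherence(rng.normal(size=(20, 50))))    # overcomplete dictionary: between 0 and 1
```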
Exact recovery conditions
If a minimiser $\alpha^*$ of the true $\ell_0$ problem satisfies
$$\|\alpha^*\|_0 < \frac{1}{M}$$
then it is the unique sparsest solution.
If $\alpha^*$ satisfies the stronger condition
$$\|\alpha^*\|_0 < \frac{1}{2}\left(1 + \frac{1}{M}\right)$$
then the minimiser of the $\ell_1$ relaxation has the same sparsity pattern as $\alpha^*$.

References

Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009
http://deeplearning.net/tutorial/
http://www.deeplearningbook.org/contents/autoencoders.html
Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Trans. Info. Theory, 2004
Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010