Statistical Machine Learning
© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part VII
Neural Networks 1
Basis Functions
[Figure: examples of basis functions plotted over the interval (−1, 1)]
Basis functions play a crucial role in the algorithms explored so far.
Previously (e.g. Linear Regression and Linear
Classification): the basis functions were fixed before learning started.
Now, for Neural Networks: the number of basis functions is fixed,
but their parameters are adaptive.
Later, in kernel methods: basis functions are centred on the
data / there is an effectively infinite number of basis functions
(e.g. Support Vector Machines).
Outline
The functional form of the network model (including
special parametrisation of the basis functions).
How to determine the network parameters within the
maximum likelihood framework?
(Solution of a nonlinear optimisation problem.)
Error backpropagation: efficiently evaluate the derivatives
of the log likelihood function with respect to the network
parameters.
Various approaches to regularise neural networks.
Feed-forward Network Functions
Same goal as before: e.g. for regression, decompose
t(x) = y(x,w) + ε
where ε is the noise.
(Generalised) Linear Model:
y(x,w) = f( ∑_{j=0}^{M} wj φj(x) )
where φ = (φ0, . . . , φM)^T is the fixed model basis and
w = (w0, . . . , wM)^T are the model parameters.
For regression: f(·) is the identity function.
For classification: f(·) is a nonlinear activation function.
Goal: Let φj(x) depend on parameters, and then adjust
these parameters together with w.
Feed-forward Network Functions
Goal: Let φj(x) depend on parameters, and then adjust
these parameters together with w.
There are many ways to do this.
Neural networks use basis functions which follow the
same form as the (generalised) linear model:
each basis function is itself a nonlinear function of an
adaptive linear combination of the inputs.
[Figure: two-layer feed-forward network with inputs x0, . . . , xD, hidden units z0, . . . , zM, outputs y1, . . . , yK, and weights w^(1), w^(2); x0 and z0 are the bias units.]
Functional Transformations
Construct M linear combinations of the input variables
x1, . . . , xD in the form
aj = ∑_{i=1}^{D} w^(1)_ji xi + w^(1)_j0,   j = 1, . . . , M
where the aj are called activations, the w^(1)_ji are the weights, and the w^(1)_j0 are the biases.
Apply a differentiable, nonlinear activation function h(·) to
get the output of the hidden units
zj = h(aj)
h(·) is typically the logistic sigmoid, tanh, or, more recently,
ReLU(x) = max(x, 0).
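For concreteness, a minimal NumPy sketch (added here, not part of the original slides) of these three common choices of h(·):

    import numpy as np

    def sigmoid(a):
        # logistic sigmoid: squashes activations into (0, 1)
        return 1.0 / (1.0 + np.exp(-a))

    def tanh(a):
        # hyperbolic tangent: squashes activations into (-1, 1)
        return np.tanh(a)

    def relu(a):
        # rectified linear unit: max(a, 0), applied elementwise
        return np.maximum(a, 0.0)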
Functional Transformations
Outputs of the hidden units are again linearly combined
ak = ∑_{j=1}^{M} w^(2)_kj zj + w^(2)_k0,   k = 1, . . . , K
Apply again a differentiable, nonlinear activation function
g(·) to get the network outputs yk
yk = g(ak)
Functional Transformations
The activation function g(·) is determined by the nature of
the data and the distribution of the target variables.
For standard regression: g(·) is the identity so yk = ak.
For multiple binary classification, g(·) is a logistic sigmoid:
yk = σ(ak) = 1 / (1 + exp(−ak))
Recall from the generative classification model perspective:
ak(x) = ln( p(x, Ck1) / p(x, Ck2) )
Functional Transformations
Combine all transformations into one formula
yk(x,w) = g( ∑_{j=1}^{M} w^(2)_kj h( ∑_{i=1}^{D} w^(1)_ji xi + w^(1)_j0 ) + w^(2)_k0 )
where w contains all weight and bias parameters.
Functional Transformations
As before, the biases can be absorbed into the weights by
introducing an extra input x0 = 1 and a hidden unit z0 = 1.
yk(x,w) = g( ∑_{j=0}^{M} w^(2)_kj h( ∑_{i=0}^{D} w^(1)_ji xi ) )
where w now contains all weight and bias parameters.
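As an illustration (not from the original slides), a minimal NumPy sketch of this forward pass in matrix form; the weight matrices W1 and W2 are assumed to hold the bias parameters in their first columns:

    import numpy as np

    def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
        # Two-layer feed-forward network y(x, w).
        # x  : input vector of length D
        # W1 : (M, D+1) first-layer weights, column 0 holds the biases w^(1)_j0
        # W2 : (K, M+1) second-layer weights, column 0 holds the biases w^(2)_k0
        # h  : hidden activation (e.g. tanh); g : output activation (identity for regression)
        x_tilde = np.concatenate(([1.0], x))   # prepend x0 = 1
        a1 = W1 @ x_tilde                      # hidden activations a_j
        z = np.concatenate(([1.0], h(a1)))     # prepend z0 = 1
        a2 = W2 @ z                            # output activations a_k
        return g(a2)                           # network outputs y_k

    # example usage with random weights: D = 3 inputs, M = 4 hidden units, K = 2 outputs
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 4))           # 4 x (3+1)
    W2 = rng.standard_normal((2, 5))           # 2 x (4+1)
    y = forward(rng.standard_normal(3), W1, W2)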
Comparison to Perceptron
A neural network looks like a multilayer perceptron.
But the perceptron's nonlinear activation function is a step
function, which is neither smooth nor differentiable:
f(a) = +1 if a ≥ 0, and f(a) = −1 if a < 0.
The activation functions h(·) and g(·) of a neural network
are smooth and differentiable.
Linear Activation Functions
If all activation functions are linear functions, then there
exists an equivalent network without hidden units.
(The composition of linear functions is again a linear function.)
But if the number of hidden units is smaller than the
number of input or output units, the resulting linear
functions are not the most general ones:
the network performs a dimensionality reduction,
cf. Principal Component Analysis (upcoming lecture).
Generally, most neural networks use nonlinear activation
functions, as the goal is to approximate a nonlinear
mapping from the input space to the outputs.
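A quick NumPy check (added for illustration, not part of the slides) that a purely linear two-layer network collapses to a single linear map, of rank at most M:

    import numpy as np

    rng = np.random.default_rng(1)
    D, M, K = 5, 2, 4                       # M < min(D, K): a rank-limited linear map
    W1 = rng.standard_normal((M, D))        # input -> hidden, linear (no activation)
    W2 = rng.standard_normal((K, M))        # hidden -> output, linear

    x = rng.standard_normal(D)
    y_two_layer = W2 @ (W1 @ x)             # network with linear hidden units
    y_single = (W2 @ W1) @ x                # equivalent single weight matrix

    assert np.allclose(y_two_layer, y_single)
    print(np.linalg.matrix_rank(W2 @ W1))   # at most M = 2: not the most general K x D map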
Neural Networks as Universal Function Approximators
Feed-forward neural networks are universal
approximators.
Example: A two-layer neural network with linear outputs
can uniformly approximate any continuous function on a
compact input domain to arbitrary accuracy if it has
enough hidden units.
This holds for a wide range of hidden unit activation functions.
Remaining big question: where do we get the appropriate
settings for the weights from? In other words, how do
we learn the weights from training examples?
Approximation Capabilities of Neural Networks
Neural network approximating
f(x) = x²
Two-layer network with 3 hidden units (tanh activation
functions) and linear outputs trained on 50 data points sampled
from the interval (−1, 1). Red: resulting output. Dashed:
Output of the hidden units.
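A rough sketch (mine, not from the slides) of how such a fit could be produced with plain gradient descent on the sum-of-squares error, using 3 tanh hidden units; the initialisation, learning rate, and iteration count are arbitrary choices, and the gradients are computed by the chain rule (backpropagation, covered later):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)           # 50 points sampled from (-1, 1)
    t = x**2                                  # targets from f(x) = x^2

    M, eta = 3, 0.05                          # 3 hidden units, learning rate (assumed)
    W1 = 0.5 * rng.standard_normal((M, 2))    # per hidden unit: [bias, input weight]
    W2 = 0.5 * rng.standard_normal(M + 1)     # output layer: [bias, w_1, ..., w_M]

    X = np.stack([np.ones_like(x), x], axis=1)                      # (50, 2), x0 = 1
    for _ in range(5000):
        A = X @ W1.T                                                # hidden activations
        Z = np.concatenate([np.ones((50, 1)), np.tanh(A)], axis=1)  # hidden outputs, z0 = 1
        y = Z @ W2                                                  # linear network outputs
        err = y - t
        gW2 = Z.T @ err                                             # gradient w.r.t. W2
        delta = (err[:, None] * W2[1:]) * (1 - np.tanh(A)**2)       # backpropagated errors
        gW1 = delta.T @ X                                           # gradient w.r.t. W1
        W2 -= eta * gW2 / len(x)
        W1 -= eta * gW1 / len(x)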
Approximation Capabilities of Neural Networks
Neural network approximating
f(x) = sin(x)
Two-layer network with 3 hidden units (tanh activation
functions) and linear outputs trained on 50 data points sampled
from the interval (−1, 1). Red: resulting output. Dashed:
Output of the hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating
f(x) = |x|
Two-layer network with 3 hidden units (tanh activation
functions) and linear outputs trained on 50 data points sampled
from the interval (−1, 1). Red: resulting output. Dashed:
Output of the hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating the Heaviside function
f(x) = 1 if x ≥ 0, and f(x) = 0 if x < 0
Two-layer network with 3 hidden units (tanh activation
functions) and linear outputs trained on 50 data points sampled
from the interval (−1, 1). Red: resulting output. Dashed:
Output of the hidden units.
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis
functions
[Figure: surface plot of z over x1, x2 ∈ (−10, 10)]
z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0, 1.0, 0.1)
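A small snippet (added for illustration) evaluating such a sigmoidal basis function on a grid, e.g. for the weight settings shown on these slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def hidden_basis(w0, w1, w2, grid=np.linspace(-10, 10, 101)):
        # z = sigma(w0 + w1*x1 + w2*x2) evaluated on a 2-D grid over (x1, x2)
        x1, x2 = np.meshgrid(grid, grid)
        return sigmoid(w0 + w1 * x1 + w2 * x2)

    z = hidden_basis(0.0, 1.0, 0.1)     # varies mainly along x1
    z = hidden_basis(0.0, 0.1, 1.0)     # varies mainly along x2
    z = hidden_basis(10.0, -0.5, 0.5)   # the bias w0 shifts the transition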
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis
functions
z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0, 0.1, 1.0)
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis
functions
z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0,−0.5, 0.5)
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis
functions
z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (10.0,−0.5, 0.5)
Approximation Capabilities of Neural Networks
Neural network for two-class classification.
2 inputs, 2 hidden units with tanh activation function, 1
output with logistic sigmoid activation function.
[Figure: two-class synthetic data with the network outputs overlaid]
Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden
unit contours. Green: Optimal decision boundary from the
known data distribution.
Weight-space Symmetries
A given set of weights w fixes a mapping from the
input space to the output space.
Does there exist another set of weights realising the same
mapping?
Assume a tanh activation function for the hidden units. As
tanh is an odd function: tanh(−a) = − tanh(a).
Change the sign of all weights into a hidden unit and of all
weights out of this hidden unit: the mapping stays the same.
Weight-space Symmetries
M hidden units, therefore 2^M equivalent weight vectors.
Furthermore, exchanging all of the weights going into and
out of a hidden unit with the corresponding weights of
another hidden unit also leaves the mapping unchanged:
M! symmetries.
Overall weight-space symmetry factor: M! 2^M
M        1   2   3    4     5      6       7
M! 2^M   2   8   48   384   3840   46080   645120
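A numerical sanity check (not in the slides) of the sign-flip symmetry: flipping the sign of all weights into a hidden unit and of all weights out of it leaves a tanh network's outputs unchanged.

    import numpy as np

    rng = np.random.default_rng(2)
    D, M, K = 3, 4, 2
    W1 = rng.standard_normal((M, D + 1))     # first layer, column 0 is the bias
    W2 = rng.standard_normal((K, M + 1))     # second layer, column 0 is the bias

    def forward(x, W1, W2):
        z = np.tanh(W1 @ np.concatenate(([1.0], x)))
        return W2 @ np.concatenate(([1.0], z))

    # flip the sign of hidden unit j = 1
    W1_flip, W2_flip = W1.copy(), W2.copy()
    W1_flip[1, :] *= -1       # all weights (and the bias) into hidden unit 1
    W2_flip[:, 2] *= -1       # all weights out of hidden unit 1 (column 0 is the bias)

    x = rng.standard_normal(D)
    assert np.allclose(forward(x, W1, W2), forward(x, W1_flip, W2_flip))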
Parameter Optimisation
Assume the error E(w) is a smooth function of the weights.
The smallest value will occur at a critical point, for which
∇E(w) = 0.
This could be a minimum, a maximum, or a saddle point.
Furthermore, because of the symmetry in weight space, each
critical point belongs to a family of at least M! 2^M equivalent
points with the same value of the error.
Parameter Optimisation
Definition (Global Minimum)
A point w∗ for which the error E(w∗) is smaller than any other
error E(w).
Definition (Local Minimum)
A point w∗ for which the error E(w∗) is smaller than any other
error E(w) in some neighbourhood of w∗.
[Figure: error surface E(w) over weight space (w1, w2), showing a local minimum wA, the global minimum wB, and the gradient ∇E at a point wC]
Parameter Optimisation
Finding the global minimum is difficult in general (one would
have to check everywhere) unless the error function
comes from a special class (e.g. smooth convex functions,
where any local minimum is also global).
Error functions for neural networks are not convex
(symmetries!).
But finding a local minimum might be sufficient.
Use iterative methods with weight vector update ∆w(τ) to
find a local minimum:
w(τ+1) = w(τ) + ∆w(τ)
Local Quadratic Approximation
Around a stationary point w∗ we can approximate
E(w) ≃ E(w∗) + ½ (w − w∗)^T H (w − w∗),
where the Hessian H is evaluated at w∗, so that
Hij = ∂²E / ∂wi ∂wj |_{w=w∗}.
Using a set {ui} of orthonormal eigenvectors of H,
H ui = λi ui,
to expand
w − w∗ = ∑_i αi ui,
we get
E(w) = E(w∗) + ½ ∑_i λi αi².
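An illustrative check (my own, not from the slides) of this identity on a toy quadratic error: expand w − w∗ in the eigenbasis of H and compare ½ ∑_i λi αi² with E(w) − E(w∗).

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 4))
    H = A.T @ A + 4 * np.eye(4)          # a positive definite toy Hessian
    w_star = rng.standard_normal(4)      # the minimum of the toy error

    def E(w):
        d = w - w_star
        return 0.5 * d @ H @ d           # exactly quadratic around w_star, E(w_star) = 0

    lam, U = np.linalg.eigh(H)           # eigenvalues lambda_i, orthonormal eigenvectors u_i
    w = w_star + rng.standard_normal(4)
    alpha = U.T @ (w - w_star)           # coefficients alpha_i of w - w_star in the eigenbasis

    assert np.allclose(E(w), 0.5 * np.sum(lam * alpha**2))
    assert np.all(lam > 0)               # at a minimum the Hessian is positive definite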
Local Quadratic Approximation
Around a minimum w∗ we can approximate
E(w) = E(w∗) + ½ ∑_i λi αi².
[Figure: elliptical contours of constant E(w) around w∗ in the (w1, w2) plane, with axes along the eigenvectors u1, u2 and lengths proportional to λ1^(−1/2), λ2^(−1/2)]
Local Quadratic Approximation
At a minimum w∗, the Hessian H evaluated at w∗ must be
positive definite.
This explains why the Laplace approximation always yields
a valid covariance matrix.
Gradient Information Improves Performance
The Hessian is symmetric and contains W(W + 1)/2
independent entries, where W is the total number of
weights in the network.
If we use function evaluations only:
we need to gather these O(W²) pieces of information using
O(W²) function evaluations, each of which costs
O(W) time, for an overall cost of order O(W³).
If we use gradients of the function:
surprisingly, the gradient ∇E also costs only O(W) time,
although it provides W pieces of information.
We now need only O(W) gradient evaluations, so the
time complexity is reduced to O(W²).
FYI only: In general we have the “cheap gradient principle”.
See (Griewank, A., 2000. Evaluating Derivatives: Principles
and Techniques of Algorithmic Differentiation, Section 5.1).
Gradient Descent Optimisation
Batch processing: update the weight vector with a small
step in the direction of the negative gradient
w(τ+1) = w(τ) − η∇E(w(τ))
where η is the learning rate.
After each step, re-evaluate the gradient ∇E at the new weight vector.
Gradient descent has problems in 'long valleys'.
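A minimal sketch of the batch update (illustrative only; the error function, learning rate, and stopping rule below are placeholder assumptions):

    import numpy as np

    def gradient_descent(grad_E, w0, eta=0.1, n_steps=1000, tol=1e-8):
        # Batch gradient descent: w <- w - eta * grad E(w), gradient over the full training set.
        w = np.asarray(w0, dtype=float)
        for _ in range(n_steps):
            g = grad_E(w)                # re-evaluate the gradient at the current weights
            w = w - eta * g
            if np.linalg.norm(g) < tol:  # stop when the gradient is (nearly) zero
                break
        return w

    # example: quadratic error E(w) = 0.5 * w^T H w with an elongated 'long valley'
    H = np.diag([1.0, 100.0])
    w_min = gradient_descent(lambda w: H @ w, w0=[1.0, 1.0], eta=0.009)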
Gradient Descent Optimisation
Gradient descent has problems in 'long valleys'.
[Figure: example of the zig-zag path of the gradient descent algorithm.]
Nonlinear Optimisation
Use conjugate gradient descent instead of gradient
descent to avoid the zig-zag behaviour.
Use the Newton method, which also calculates the inverse
Hessian in each iteration (but inverting the Hessian is
usually costly).
Use quasi-Newton methods (e.g. BFGS), which instead build
an estimate of the inverse Hessian while
iterating.
Even simpler are momentum-based strategies.
Run the algorithm from a set of starting points and keep
the smallest local minimum found.
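As an illustration of the quasi-Newton option, a small example using SciPy's BFGS implementation; the Rosenbrock 'long valley' function below is just a stand-in for a network's error function:

    import numpy as np
    from scipy.optimize import minimize

    def E(w):
        # Rosenbrock function: a classic curved 'long valley' test problem
        return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

    def grad_E(w):
        return np.array([
            -2 * (1 - w[0]) - 400 * w[0] * (w[1] - w[0]**2),
            200 * (w[1] - w[0]**2),
        ])

    # BFGS builds an approximation to the inverse Hessian from successive gradient evaluations
    result = minimize(E, x0=np.array([-1.0, 1.0]), jac=grad_E, method="BFGS")
    print(result.x)   # close to the minimum at (1, 1)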
Nonlinear Optimisation
Remaining big problem: the error function is defined over the
whole training set. Therefore, we need to process the whole
training set for each calculation of the gradient ∇E(w(τ)).
If the error function is a sum of errors over the data points,
E(w) = ∑_{n=1}^{N} En(w),
we can use on-line gradient descent (also called
sequential gradient descent or stochastic gradient
descent) to update the weights using one data point at a time:
w(τ+1) = w(τ) − η∇En(w(τ)).
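A corresponding sketch of the on-line update (illustrative; grad_En, the learning rate, and the number of epochs are assumptions). Here grad_En(w, n) is assumed to return ∇En(w) for the n-th training point; in practice small mini-batches are often used instead of single points.

    import numpy as np

    def sgd(grad_En, w0, n_points, eta=0.01, n_epochs=10, seed=0):
        # On-line (stochastic) gradient descent: one data point per weight update.
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        for _ in range(n_epochs):
            for n in rng.permutation(n_points):   # visit the data in random order
                w = w - eta * grad_En(w, n)       # step on the error E_n of data point n only
        return w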
FYI Only: NN Extensions
Add more hidden layers (deep learning). To make it work
we need many of the following tricks:
Clever weight initialisation to ensure gradients flow
through the entire network.
Skip connections: some links may additively skip over one
or more subsequent layers.
Favour ReLU over e.g. the sigmoid, to avoid vanishing
gradients.
Clever regularisation methods such as dropout.
Specific architectures, not further considered here:
Parameters may be shared, notably as in convolutional
neural networks for images.
A state space model with neural network transitions is a
recurrent neural network.
Attention mechanisms learn to focus on specific parts of an
input.