Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Part VII
Neural Networks 1
Neural Networks
Weight-space Symmetries
Parameter Optimisation
Gradient Descent Optimisation
Basis Functions
Basis functions play a crucial role in the algorithms explored so far.
Previously (e.g. Linear Regression and Linear Classification): the basis functions were fixed before learning starts.
Now for Neural Networks: the number of basis functions is fixed, but the parameters of the basis functions are adaptive.
Later in kernel methods: centre the basis functions on the data / have an infinite number of effective basis functions (e.g. Support Vector Machines).
Outline
The functional form of the network model (including special parametrisation of the basis functions).
How to determine the network parameters within the maximum likelihood framework?
(Solution of a nonlinear optimisation problem.)
Error backpropagation: efficiently evaluate the derivatives of the log likelihood function with respect to the network parameters.
Various approaches to regularise neural networks.
Feed-forward Network Functions
Same goal as before: e.g. for regression, decompose
t(x) = y(x, w) + ε,
where ε is the noise.
(Generalised) Linear Model:
y(x, w) = f( Σ_{j=0}^{M} w_j φ_j(x) ),
where φ = (φ_0, …, φ_M)^T is the fixed model basis and w = (w_0, …, w_M)^T are the model parameters.
For regression: f(·) is the identity function.
For classification: f(·) is a nonlinear activation function.
Goal: Let φ_j(x) depend on parameters, and then adjust these parameters together with w.
Feed-forward Network Functions
Goal : Let φj(x) depend on parameters, and then adjust these parameters together with w.
Many ways to do this.
Neural networks use basis functions which follow the same form as the (generalised) linear model.
EACH basis function is itself a nonlinear function of an adaptive linear combination of the inputs.
[Figure: two-layer feed-forward network with inputs x_0, …, x_D, hidden units z_0, …, z_M and outputs y_1, …, y_K; first-layer weights w^(1), second-layer weights w^(2).]
Functional Transformations
Construct M linear combinations (the activations) of the input variables x_1, …, x_D in the form
a_j = Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)},   j = 1, …, M,
where the w_{ji}^{(1)} are the weights and the w_{j0}^{(1)} are the biases.
Apply a differentiable, nonlinear activation function h(·) to get the output of the hidden units
z_j = h(a_j).
h(·) is typically sigmoid, tanh, or more recently ReLU(x) = max(x, 0).
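These common activation functions are one-liners; a minimal NumPy sketch (illustrative, not part of the lecture code):

```python
import numpy as np

def sigmoid(a):
    # logistic sigmoid: maps activations to (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # rectified linear unit: max(a, 0) elementwise
    return np.maximum(a, 0.0)

# tanh is available directly as np.tanh; all three are differentiable
# (ReLU everywhere except at 0, where a subgradient is used in practice).
```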
Functional Transformations
Outputs of the hidden units are again linearly combined
a_k = Σ_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)},   k = 1, …, K
Apply again a differentiable, nonlinear activation function g(·) to get the network outputs yk
yk = g(ak)
Functional Transformations
The activation function g(·) is determined by the nature of the data and the distribution of the target variables.
For standard regression: g(·) is the identity so yk = ak. For multiple binary classification problems, g(·) is a logistic sigmoid:
y_k = σ(a_k) = 1 / (1 + exp(−a_k))
Recall from generative classification model perspective:
a_k(x) = ln [ p(x, C_{k1}) / p(x, C_{k2}) ]
Functional Transformations
Combine all transformations into one formula
y_k(x, w) = g( Σ_{j=1}^{M} w_{kj}^{(2)} h( Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_{k0}^{(2)} ),
where w contains all weight and bias parameters.
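To make the formula concrete, here is a minimal NumPy sketch of the forward pass of a two-layer network; the weight matrices W1, W2, bias vectors b1, b2 and the function names are illustrative choices of mine, not part of the lecture material.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, g=lambda a: a):
    """Two-layer feed-forward network y_k(x, w).

    x  : input vector, shape (D,)
    W1 : first-layer weights, shape (M, D);  b1 : biases, shape (M,)
    W2 : second-layer weights, shape (K, M); b2 : biases, shape (K,)
    h  : hidden activation (default tanh);   g  : output activation (default identity)
    """
    a_hidden = W1 @ x + b1        # activations a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = h(a_hidden)               # hidden unit outputs z_j = h(a_j)
    a_out = W2 @ z + b2           # output activations a_k
    return g(a_out)               # network outputs y_k = g(a_k)

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs with random weights.
rng = np.random.default_rng(0)
y = forward(rng.normal(size=3),
            rng.normal(size=(4, 3)), rng.normal(size=4),
            rng.normal(size=(2, 4)), rng.normal(size=2))
```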
Functional Transformations
As before, the biases can be absorbed into the weights by introducing an extra input x0 = 1 and a hidden unit z0 = 1.
y_k(x, w) = g( Σ_{j=0}^{M} w_{kj}^{(2)} h( Σ_{i=0}^{D} w_{ji}^{(1)} x_i ) ),
where w now contains all weight and bias parameters.
Comparison to Perceptron
A neural network looks like a multilayer perceptron.
But the perceptron's nonlinear activation function was a step function, neither smooth nor differentiable:
f(a) = +1 if a ≥ 0, −1 if a < 0.
The activation functions h(·) and g(·) of a neural network are smooth and differentiable.
Linear Activation Functions
If all activation functions are linear functions then there exists an equivalent network without hidden units. (Composition of linear functions is a linear function; see the numerical check at the end of this list.)
But if the number of hidden units is smaller than the number of input or output units, the resulting linear functions are not the most general.
Dimensionality reduction.
c.f. Principal Component Analysis (upcoming lecture).
Generally, most neural networks use nonlinear activation functions as the goal is to approximate a nonlinear mapping from the input space to the outputs.
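A quick numerical check of the collapse argument above; the shapes, names and the rank comment are my own illustration, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 5, 2, 4                      # note M < min(D, K): a rank-2 bottleneck
W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(K, M))

x = rng.normal(size=D)
y_two_layer = W2 @ (W1 @ x)            # linear "hidden layer"
W_eff = W2 @ W1                        # equivalent single linear map
assert np.allclose(y_two_layer, W_eff @ x)

# The equivalent map has rank at most M, so it is not the most general K x D matrix.
print(np.linalg.matrix_rank(W_eff))    # 2
```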
Neural Networks as Universal Function Approximators
Feed-forward neural networks are universal approximators.
Example: A two-layer neural network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy if it has enough hidden units.
Holds for a wide range of hidden unit activation functions.
Remaining big question: Where do we get the appropriate settings for the weights from? In other words, how do we learn the weights from training examples?
Approximation Capabilities of Neural Networks
Neural network approximating
f(x) = x²
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: Output of the hidden units.
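A fit of this kind could be reproduced roughly as follows: a hedged sketch that minimises the sum-of-squares error of a 3-hidden-unit tanh network with SciPy's generic optimiser. The parameter packing and function names are assumptions of mine, not the code behind the figure.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)        # 50 points from the interval (-1, 1)
T = X**2                               # targets f(x) = x^2

M = 3                                  # number of tanh hidden units

def unpack(w):
    # parameter vector -> (W1, b1, W2, b2) for a 1-input, M-hidden, 1-output net
    return w[0:M], w[M:2*M], w[2*M:3*M], w[3*M]

def predict(w, x):
    W1, b1, W2, b2 = unpack(w)
    z = np.tanh(np.outer(x, W1) + b1)  # hidden unit outputs, shape (N, M)
    return z @ W2 + b2                 # linear (identity) output

def error(w):
    # sum-of-squares error over the training set
    return 0.5 * np.sum((predict(w, X) - T) ** 2)

res = minimize(error, rng.normal(scale=0.5, size=3*M + 1), method="BFGS")
print(error(res.x))                    # small residual error after training
```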
Approximation Capabilities of Neural Networks
Neural network approximating
f (x) = sin(x)
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: Output of the hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating
f (x) = |x|
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: Output of the hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating the Heaviside step function
f(x) = 1 if x ≥ 0, 0 if x < 0.
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: Output of the hidden units.
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions
z = σ(w0 + w1x1 + w2x2) for (w0, w1, w2) = (0.0, 1.0, 0.1)
[Surface plot of z over x1, x2 ∈ [−10, 10].]
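The surfaces on these slides can be recomputed in a few lines; the grid range is read off the axes and the function name sigma is my own choice:

```python
import numpy as np

def sigma(a):
    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-a))

# One hidden unit acting as a parametrised basis function over the input plane.
w0, w1, w2 = 0.0, 1.0, 0.1
x1, x2 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
z = sigma(w0 + w1 * x1 + w2 * x2)   # a sigmoidal ridge: (w1, w2) set its
                                    # orientation and steepness, w0 shifts it
```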
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions
z = σ(w0 + w1x1 + w2x2) for (w0, w1, w2) = (0.0, 0.1, 1.0)
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions
z = σ(w0 + w1x1 + w2x2) for (w0, w1, w2) = (0.0, −0.5, 0.5)
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions
z = σ(w0 + w1x1 + w2x2) for (w0, w1, w2) = (10.0, −0.5, 0.5)
Approximation Capabilities of Neural Networks
Neural network for two-class classification.
2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.
Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: Optimal decision boundary from the known data distribution.
Weight-space Symmetries
A given set of weights w fixes a mapping from the input space to the output space.
Does there exist another set of weights realising the same mapping?
Assume tanh activation function for the hidden units. As tanh is an odd function: tanh(−a) = − tanh(a).
Change the sign of all weights (and the bias) feeding into a hidden unit, and also the sign of the weights leading out of this hidden unit: the mapping stays the same.
Weight-space Symmetries
M hidden units, therefore 2M equivalent weight vectors.
Furthermore, exchange all of the weights going into and out of a hidden unit with the corresponding weights of another hidden unit. Mapping stays the same. M! symmetries.
Overall weight-space symmetry factor: M! 2^M (verified numerically in the sketch after the table).

M      :  1   2    3     4      5       6        7
M! 2^M :  2   8   48   384   3840   46080   645120
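Both symmetries are easy to verify numerically; a sketch with random tanh hidden units (the function and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=D)
y = net(x, W1, b1, W2, b2)

# 1) Sign-flip symmetry: negate everything into and out of hidden unit 0.
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0] *= -1; b1f[0] *= -1; W2f[:, 0] *= -1
assert np.allclose(y, net(x, W1f, b1f, W2f, b2))

# 2) Permutation symmetry: swap hidden units 0 and 1 together with their weights.
perm = [1, 0, 2, 3]
assert np.allclose(y, net(x, W1[perm], b1[perm], W2[:, perm], b2))
```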
Parameter Optimisation
Assume the error E(w) is a smooth function of the weights. Its smallest value will occur at a critical point for which
∇E(w) = 0.
This could be a minimum, maximum, or saddle point.
Furthermore, because of symmetry in weight space, there are at least M! 2^M other critical points with the same value for the error.
Parameter Optimisation
Definition (Global Minimum)
A point w∗ for which the error E(w∗) is smaller than any other error E(w).
Definition (Local Minimum)
A point w∗ for which the error E(w∗) is smaller than any other error E(w) in some neighbourhood of w∗.
[Figure: error surface E(w) over weight space (w1, w2), showing a local minimum wA, the global minimum wB, and a point wC at which the local gradient is given by the vector ∇E.]
Parameter Optimisation
Finding the global minimum is difficult in general (would have to check everywhere) unless the error function comes from a special class (e.g. smooth convex functions have only one local minimum).
Error functions for neural networks are not convex (symmetries!).
But finding a local minimum might be sufficient.
Use iterative methods with weight vector update ∆w(τ ) to find a local minimum.
w(τ+1) = w(τ) + ∆w(τ)
Local Quadratic Approximation
Around a stationary point w∗ we can approximate
E(w) ≃ E(w∗) + (1/2) (w − w∗)^T H (w − w∗),
where the Hessian H is evaluated at w∗, so that
H_ij = ∂²E(w) / ∂w_i ∂w_j |_{w=w∗}.
Using a set {u_i} of orthonormal eigenvectors of H,
H u_i = λ_i u_i,
to expand
w − w∗ = Σ_i α_i u_i,
we get
E(w) = E(w∗) + (1/2) Σ_i λ_i α_i².
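A small numerical check of this expansion on a made-up quadratic error (the Hessian, stationary point and error value below are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = 4
A = rng.normal(size=(W, W))
H = A @ A.T + W * np.eye(W)            # a positive-definite Hessian (made up)
w_star = rng.normal(size=W)            # stationary point of the toy error
E_star = 1.3                           # error value at the stationary point

def E(w):
    # toy error whose quadratic expansion around w_star is exact
    d = w - w_star
    return E_star + 0.5 * d @ H @ d

lam, U = np.linalg.eigh(H)             # H u_i = lambda_i u_i, orthonormal columns
w = w_star + rng.normal(size=W)
alpha = U.T @ (w - w_star)             # coefficients of w - w_star in the eigenbasis
E_expansion = E_star + 0.5 * np.sum(lam * alpha**2)
assert np.isclose(E(w), E_expansion)
```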
Local Quadratic Approximation
Around a minimum w∗ we can approximate
E(w) = E(w∗) + (1/2) Σ_i λ_i α_i².
[Figure: elliptical contours of constant E(w) in weight space (w1, w2), centred at w⋆; the axes are the Hessian eigenvectors u1, u2, with lengths proportional to λ1^(−1/2) and λ2^(−1/2).]
Local Quadratic Approximation
Around a minimum w∗, the Hessian H must be positive definite if evaluated at w∗.
This explains why the Laplace approximation always yields a valid covariance matrix.
Gradient Information Improves Performance
The Hessian is symmetric and contains W(W + 1)/2 independent entries, where W is the total number of weights in the network.
If we use function evaluations only:
We need to gather these O(W²) pieces of information by doing O(W²) function evaluations, each of which costs O(W) time, for an overall cost of order O(W³).
If we use gradients of the function:
Surprisingly, the gradient ∇E also costs only O(W) time, although it provides W pieces of information.
We now need only O(W) steps, so the order of time complexity is reduced to O(W²).
FYI only: In general we have the “cheap gradient principle”. See (Griewank, A., 2000. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Section 5.1).
Gradient Descent Optimisation
Batch processing : Update the weight vector with a small step in the direction of the negative gradient
w(τ+1) =w(τ) −η∇E(w(τ))
where η is the learning rate.
After each step, re-evaluate the gradient ∇E(w(τ)). Gradient Descent has problems in 'long valleys'.
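In NumPy-flavoured pseudocode, the batch update above could look as follows; grad_E, the learning rate and the stopping rule are placeholders to be supplied by the concrete model:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=1000, tol=1e-6):
    """Batch gradient descent: w <- w - eta * grad E(w), repeated n_steps times."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        g = grad_E(w)                  # gradient over the *whole* training set
        w = w - eta * g
        if np.linalg.norm(g) < tol:    # stop once the gradient is (almost) zero
            break
    return w

# Example on a simple quadratic bowl E(w) = 0.5 ||w||^2, so grad E(w) = w.
w_min = gradient_descent(lambda w: w, w0=[3.0, -2.0])
```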
Gradient Descent Optimisation
Gradient Descent has problems in ’long valleys’.
Example of the zig-zag behaviour of the Gradient Descent algorithm.
Nonlinear Optimisation
Use Conjugate Gradient Descent instead of Gradient Descent to avoid zig-zag behaviour.
Use Newton's method, which also calculates the inverse Hessian in each iteration (but inverting the Hessian is usually costly).
Use quasi-Newton methods (e.g. BFGS), which instead build up an estimate of the inverse Hessian while iterating (see the sketch after this list).
Even simpler are momentum based strategies.
Run the algorithm from a set of starting points to find the smallest local minimum.
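In practice one might lean on an off-the-shelf optimiser; a sketch combining BFGS with random restarts via scipy.optimize.minimize (the error function here is a simple non-convex stand-in, not a neural network):

```python
import numpy as np
from scipy.optimize import minimize

def E(w):
    # stand-in for a non-convex error function with several local minima
    return np.sum(w**4 - 8 * w**2 + 3 * w)

rng = np.random.default_rng(0)
results = [minimize(E, rng.uniform(-3, 3, size=2), method="BFGS")
           for _ in range(10)]         # run from 10 random starting points
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)                # keep the smallest local minimum found
```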
Nonlinear Optimisation
Remaining big problem: the error function is defined over the whole training set. Therefore, we need to process the whole training set for each calculation of the gradient ∇E(w(τ)).
If the error function is a sum of errors for each data point,
E(w) = Σ_{n=1}^{N} E_n(w),
we can use on-line gradient descent (also called sequential gradient descent or stochastic gradient descent) to update the weights by one data point at a time
w(τ+1) = w(τ) − η∇En(w(τ)).
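A minimal sketch of this on-line scheme; grad_En(w, n) is assumed to return ∇En(w) for data point n:

```python
import numpy as np

def sgd(grad_En, w0, N, eta=0.01, n_epochs=10, seed=0):
    """On-line (stochastic) gradient descent: one data point per update."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        for n in rng.permutation(N):        # visit the data points in random order
            w = w - eta * grad_En(w, n)     # w <- w - eta * grad E_n(w)
    return w
```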
FYI Only: NN Extensions
Add more hidden layers (deep learning). To make it work we need many of the following tricks:
Clever weight initialisation to ensure the gradient is flowing through the entire network.
Some links may additively skip over one or several subsequent layer(s).
Favour ReLU over e.g. the sigmoid, to avoid vanishing gradients.
Clever regularisation methods such as dropout.
Specific architectures, not further considered here:
Parameters may be shared, notably as in convolutional neural networks for images.
A state space model with neural network transitions is a recurrent neural network.
Attention mechanisms learn to focus on specific parts of an input.