Statistical Machine Learning
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part XXIII
Discussion and Summary
The Language
Problem Setting
Linear Regression and Classification
Neural Networks
Non-Factorising Distributions
Kernels
Sum Rule
Factorising Distributions
Sequential Data
Where to go from here?
Flavour of this course
Formalise intuitions about problems
Use language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Calculus to identify good parameters
Probabilistic inference
Design and analysis of algorithms
Numerical algorithms in python
Understand the choices when designing machine learning methods
Probability Theory
Frequentist vs. Bayesian approaches
Conditional Probability
Bayes' Theorem
Discrete vs. continuous random variables
Distributions (Gaussian, Bernoulli, Binomial)
Multivariate Distributions
Change of Variables
Conjugate Priors
(Chapter 2)
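As a refresher, a minimal sketch of Bayes' theorem in python; all numbers for this made-up binary diagnostic test are assumptions:

```python
# Bayes' theorem on a made-up binary example (all numbers are assumptions).
p_disease = 0.01                  # prior p(D)
p_pos_given_disease = 0.95        # likelihood p(+ | D)
p_pos_given_healthy = 0.05        # likelihood p(+ | not D)

# Sum rule for the evidence p(+).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: p(D | +) = p(+ | D) p(D) / p(+).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"p(D | +) = {p_disease_given_pos:.3f}")  # ~0.161
```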
Linear Algebra and Calculus
Vector Space
Matrix-Vector Multiplication = Linear Combination
Projection
Positive (Semi)-definite Matrix
Rank, Determinant, Trace, Inverse
Eigenvectors, Eigenvalues
Eigenvector Decomposition
Singular Value Decomposition
Gradient Calculation
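All of these operations map directly onto numpy; a small sketch (the matrix values are arbitrary) relating the eigendecomposition of XᵀX to the singular value decomposition of X:

```python
import numpy as np

# Arbitrary data matrix; A = X^T X is symmetric positive semi-definite.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
A = X.T @ X

# Eigendecomposition A = V diag(w) V^T (eigh assumes a symmetric matrix).
w, V = np.linalg.eigh(A)

# SVD of X: the squared singular values equal the eigenvalues of X^T X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.sort(s**2), np.sort(w))

# Rank, determinant and trace of the same matrix.
print(np.linalg.matrix_rank(A), np.linalg.det(A), np.trace(A))
```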
Optimisation
Gradient descent
Stochastic (On-line) Gradient Descent
Global vs. Local extremum
Lagrange Multipliers
Quadratic programming (quadratic objective function, linear constraints)
Dynamic programming (e.g. Hidden Markov Model)
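A minimal sketch of batch gradient descent on a least-squares objective; the data, step size and iteration count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.1                                # step size (arbitrary)
for _ in range(200):                     # batch gradient descent
    grad = X.T @ (X @ w - y) / len(y)    # gradient of mean squared error
    w -= eta * grad                      # (stochastic variant: use one
print(w)                                 #  random example per step)
```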
Probabilistic Graphical Models
Joint Probability factorises.
Conditional Independence
Independence Structure or “the absence of edges”
Directed, Undirected and Factor Graphs
Bayesian Network, Blocked Path and d-separation
Markov Random Field, (maximal) Cliques
Factor Graphs are Bipartite Graphs
(Chapter 8)
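A sketch of how a directed graph encodes a factorising joint distribution, using a hypothetical chain A → B → C with made-up probability tables:

```python
# Chain A -> B -> C: p(a, b, c) = p(a) p(b | a) p(c | b).
# All probability tables below are made-up for illustration.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Marginalising b and c recovers p(a) (sum rule).
total = sum(joint(0, b, c) for b in (0, 1) for c in (0, 1))
assert abs(total - p_a[0]) < 1e-12
```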
What is Machine Learning?
Definition (Mitchell, 1997)
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Learning
Examples
General Setup
Inductive Bias
Restricted Hypothesis Space
Importance of understanding the restrictions and whether they are appropriate
Do not train on the test set
(Chapter 1)
Strategy in this course
Estimate best predictor = training = learning
Given data (x1, y1), …, (xn, yn), find a predictor f_w(·).
1 Identify the type of input x and output y data
2 Propose a mathematical model for fw
3 Design an objective function or likelihood
4 Calculate the optimal parameter (w)
5 Model uncertainty using the Bayesian approach
6 Implement and compute (the algorithm in python)
7 Interpret and diagnose results
Tasks
Regression
Classification: binary and multiclass
Clustering and density estimation
Sequence prediction
Dimensionality reduction
Models for Decision Problems
Given an input space V, input data x ∈ V, and a set of classes C = {C1, C2, . . . , CK}:
Discriminant function f : V → C
Discriminative model p(Ck | x), then use decision theory
Generative model p(x, Ck), then use decision theory
Probability/Density Estimation
Maximum Likelihood (ML)
θ⋆ = argmax_θ p(D | θ)
Maximum a Posteriori (MAP)
θ⋆ = argmax_θ p(θ | D), where p(θ | D) ∝ p(D | θ) p(θ)
Bayesian
p(θ | D^(n)) ∝ p(x^(n) | θ) p(θ | D^(n−1))
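For a Bernoulli likelihood with a conjugate Beta prior all three estimators have closed forms; a sketch with assumed counts and hyperparameters:

```python
# Coin-flip data: 7 heads out of 10 (assumed numbers).
n_heads, n = 7, 10
a, b = 2.0, 2.0                 # Beta(a, b) prior hyperparameters (assumed)

theta_ml = n_heads / n                           # maximum likelihood
theta_map = (n_heads + a - 1) / (n + a + b - 2)  # mode of Beta posterior
# Bayesian: keep the whole posterior Beta(a + n_heads, b + n - n_heads),
# which can be updated sequentially, one observation at a time.
post_mean = (a + n_heads) / (a + b + n)
print(theta_ml, theta_map, post_mean)   # 0.7, 0.667, 0.643
```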
The Role of Training Data
Parametric methods: learn the model parameters from the training data, then discard the training data.
Nonparametric methods: use the training data itself for prediction
Histogram method
k-nearest neighbours
Parzen probability density model: a set of functions centred on the data points
Kernel methods: use a linear combination of functions evaluated at the training data.
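A sketch of a Parzen density estimate, which keeps the training data and centres a Gaussian kernel on each point; the data and bandwidth are assumed:

```python
import numpy as np

def parzen_density(x, data, h=0.3):
    """Gaussian-kernel density estimate: p(x) = mean of N(x | x_n, h^2)."""
    return np.mean(np.exp(-0.5 * ((x - data) / h) ** 2)
                   / (h * np.sqrt(2 * np.pi)))

data = np.array([-1.2, -0.9, 0.1, 0.8, 1.1])   # training points (assumed)
print(parzen_density(0.0, data))
```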
Likelihood and Loss Views
Bayes’ Theorem
p(θ | D) = p(D | θ) p(θ) / p(D)
Regularised Risk
∑_{n=1}^{N} ℓ(θ, D_n) + (λ/2) ‖θ‖²
Linear Regression
General Regression Setup
Closed-form solution
Maximum Likelihood and Least Squares
Geometry of Least Squares
Sequential Learning (on-line)
Choice of basis function
Regularisation
Powerful with nonlinear feature mappings
Bias-Variance Decomposition
(Chapter 3.1, 3.2, 3.3)
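A sketch of the closed-form least-squares solution with a nonlinear polynomial basis; the data and degree are assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Design matrix of polynomial basis functions phi_j(x) = x^j.
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)

# Closed-form least-squares solution of min_w ||Phi w - t||^2.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)
```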
Bayesian Linear Regression
Closed-form solution
Predictive Distribution
p(t | x, X, t) = ∫ p(t, w | x, X, t) dw = ∫ p(t | x, w) p(w | X, t) dw
(here x is a new input and X, t are the training inputs and targets)
Conjugate Prior
Limitations of Linear Basis Function Models
Curse of dimensionality
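A sketch of the closed-form posterior and predictive distribution, assuming a zero-mean isotropic Gaussian prior with precision alpha and noise precision beta; it can be applied to the Phi and t of the previous sketch:

```python
import numpy as np

def bayes_linreg(Phi, t, alpha=1.0, beta=25.0):
    """Posterior N(w | m, S) for the prior N(w | 0, alpha^{-1} I)."""
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ t
    return m, S

def predictive(phi_new, m, S, beta=25.0):
    """Predictive mean and variance at a new basis vector phi_new."""
    mean = phi_new @ m
    var = 1.0 / beta + phi_new @ S @ phi_new   # noise + parameter uncertainty
    return mean, var
```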
Linear Classification
General Classification Setup
Input space versus Feature space
Binary and Multiclass Labels
Maximum Likelihood solution
θ⋆ = argmax_θ p(X, t | θ)
Naive Bayes: the features are conditionally independent given the class
(Chapter 4)
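A sketch of a Gaussian naive Bayes classifier, where each feature is modelled independently given the class; the small smoothing constant is an assumption:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Class priors plus per-feature Gaussians (naive independence)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    stats = {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0) + 1e-9)
             for c in classes}
    return priors, stats

def log_posteriors(x, priors, stats):
    """Unnormalised log p(C_k | x): log prior + summed feature log-likelihoods."""
    return {c: np.log(priors[c])
               + np.sum(-0.5 * ((x - mu) / sd) ** 2 - np.log(sd))
            for c, (mu, sd) in stats.items()}
```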
Logistic Regression
Smooth logistic sigmoid acting on a linear function of the features
Compare to perceptron
Error as negative log likelihood (cross-entropy error)
Gradient of the error is the target deviation times the basis function
Laplace approximation
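That gradient translates directly into a gradient-descent sketch; the learning rate and iteration count are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, eta=0.1, iters=500):
    """Minimise the cross-entropy error; its gradient is Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)             # predicted class probabilities
        w -= eta * Phi.T @ (y - t) / len(t)
    return w
```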
Flexible Basis Functions
Neural Networks
Multilayer Perceptron with differentiable activation function
The basis functions can now adapt to the data.
Weight space symmetries.
Error Backpropagation.
Regularisation in Neural Networks.
(Chapter 5)
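A minimal error-backpropagation sketch for a single tanh hidden layer doing regression; biases are omitted for brevity, and the data, layer sizes and step size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 2))
t = np.sin(X[:, :1])                      # toy regression target

W1 = rng.normal(scale=0.5, size=(2, 8))   # input -> hidden
W2 = rng.normal(scale=0.5, size=(8, 1))   # hidden -> output
eta = 0.05
for _ in range(1000):
    H = np.tanh(X @ W1)                   # forward pass
    Y = H @ W2
    dY = (Y - t) / len(X)                 # dE/dY for squared error
    dW2 = H.T @ dY                        # backpropagate layer by layer
    dH = dY @ W2.T * (1 - H ** 2)         # tanh'(a) = 1 - tanh(a)^2
    dW1 = X.T @ dH
    W1 -= eta * dW1
    W2 -= eta * dW2
```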
Dimensionality Reduction – PCA
Maximise Variance
Find the eigenvectors of the covariance matrix corresponding to the largest eigenvalues.
PCA and Compression
Data Standardisation
Data Whitening
(Chapter 12.1)
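A sketch of PCA via the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

def pca(X, n_components):
    """Project centred data onto the top eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)               # centre the data
    cov = Xc.T @ Xc / (len(X) - 1)
    w, V = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = V[:, np.argsort(w)[::-1][:n_components]]
    return Xc @ top                       # compressed representation
```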
Autoencoders
For feedforward neural networks, multiple layers can be advantageous
Stacking multiple linear (PCA) layers is equivalent to a single PCA layer
To minimise the reconstruction error with a richer model, add a nonlinear hidden layer
Undercomplete autoencoder – lossy compression
Pre-training for supervised learning (with unlabelled data)
Denoising autoencoder
Overcomplete autoencoder – sparse representations
Mixture Models
The joint probability over the observed variables no longer factorises.
Introduce discrete latent variables: complex marginal distributions over the observed variables are modelled by simpler joint distributions over observed and latent variables.
K-means clustering
Data compression
Mixture of Bernoulli
Mixture of Gaussians
p(x) = ∑_z p(z) p(x | z) = ∑_z ∏_{k=1}^{K} π_k^{z_k} ∏_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k} = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)
(Chapter 9.1, 9.2)
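A sketch that evaluates the final form of the marginal p(x), for an assumed one-dimensional mixture:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Assumed mixing coefficients and component parameters (1-D case).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
var = np.array([0.5, 1.5])

def mixture_density(x):
    return np.sum(pi * gauss(x, mu, var))   # sum over the latent z

print(mixture_density(0.0))
```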
Expectation Maximisation (EM)
Evaluate the responsibilities, then maximise with respect to the parameters.
E step: find p(Z | X, θ^old).
M step: find θ^new = argmax_θ Q(θ, θ^old), where
Q(θ, θ^old) = ∑_Z p(Z | X, θ^old) ln p(X, Z | θ)
Kullback-Leibler Divergence
(Chapter 9.3, 9.4)
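A sketch of one EM sweep for a one-dimensional Gaussian mixture like the one above; in practice the step is iterated to convergence from an assumed initialisation:

```python
import numpy as np

def em_step(x, pi, mu, var):
    # E step: responsibilities gamma_{nk} = p(z_k = 1 | x_n, theta_old).
    gauss = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
             / np.sqrt(2 * np.pi * var))
    gamma = pi * gauss
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: maximise Q(theta, theta_old) in closed form.
    Nk = gamma.sum(axis=0)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    return Nk / len(x), mu_new, var_new
```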
Kernels
Inner Product → Kernel
Kernels are a kind of similarity measure
Sparse Kernel Machines
Support Vector Machines
Lagrange multipliers
How do we get to the relevant data points?
Overlapping class distribution
Outputs are decisions, not posterior probabilities.
(Chapter 6.1, 6.2 and 7.1)
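A sketch of kernel ridge regression with a Gaussian (RBF) kernel, where predictions are linear combinations of kernel evaluations at the training points; the length scale and regulariser are assumptions:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def kernel_ridge(X, t, X_new, lam=1e-2):
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)  # dual weights
    return rbf_kernel(X_new, X) @ alpha                   # predictions
```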
Sum Rule – Integration
Sum Rule
p(X) = ∑_Y p(X, Y)
Four options for computing the sum over Y:
Brute force – exhaustive enumeration
Sampling
Sum-product algorithm
Variational Inference
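Brute force is only feasible for a few discrete variables, since its cost grows exponentially with their number; a sketch marginalising a small assumed joint table:

```python
import numpy as np

# Assumed joint table p(X, Y) with 3 states of X and 4 states of Y.
rng = np.random.default_rng(3)
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()                # normalise to a valid joint

p_x = p_xy.sum(axis=1)            # sum rule: p(X) = sum_Y p(X, Y)
assert np.isclose(p_x.sum(), 1.0)
```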
Sampling
Central task: find p(Z | X, θ).
Stochastic Approximation
Sampling from the uniform distribution.
Sampling from standard distributions via inversion of the cumulative distribution function
Rejection Sampling
Adaptive Rejection Sampling
Importance Sampling: calculate the expectation of a function f(z) under a distribution p(z) using samples from a simpler q(z).
Markov Chain Monte Carlo
Metropolis Hastings Algorithm
Gibbs Sampling
(Chapter 11)
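A sketch of two of these methods: inverse-CDF sampling for an exponential distribution, and a random-walk Metropolis–Hastings chain for an unnormalised target; the proposal width and chain length are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Inversion of the CDF: u ~ U(0, 1)  ->  x = -ln(1 - u) / lam ~ Exp(lam).
lam = 2.0
x = -np.log(1.0 - rng.random(10000)) / lam

# Random-walk Metropolis-Hastings for an unnormalised target p~(z).
def p_tilde(z):
    return np.exp(-0.5 * z ** 2) + 0.5 * np.exp(-0.5 * (z - 4) ** 2)

z, samples = 0.0, []
for _ in range(20000):
    z_prop = z + rng.normal(scale=1.0)           # symmetric proposal
    if rng.random() < min(1.0, p_tilde(z_prop) / p_tilde(z)):
        z = z_prop                               # accept, else keep old z
    samples.append(z)
```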
Sum-Product Algorithm
Independence structure – Local computations
Sum-Product Algorithm
Message passing
Distributive Law
(Chapter 8.4)
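The distributive law is what makes local computation cheap; a sketch on a chain a − b − c, where pushing the sums inside the product turns one O(K³) sum into O(K²) messages:

```python
import numpy as np

# Factors on a chain: p(a, b, c) ∝ f1(a, b) f2(b, c) (assumed tables).
rng = np.random.default_rng(5)
f1 = rng.random((10, 10))
f2 = rng.random((10, 10))

# Brute force: sum over all (a, b, c) triples -- O(K^3).
brute = sum(f1[a, b] * f2[b, c]
            for a in range(10) for b in range(10) for c in range(10))

# Sum-product: distributive law moves each sum inside -- O(K^2).
msg_a_to_b = f1.sum(axis=0)                 # sum_a f1(a, b)
msg_c_to_b = f2.sum(axis=1)                 # sum_c f2(b, c)
fast = (msg_a_to_b * msg_c_to_b).sum()      # final sum over b
assert np.isclose(brute, fast)
```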
Variational Inference
Deterministic Approximation
Assume an analytical approximation to the posterior
E.g. assume the posterior factorises
Convert the integration problem into an optimisation problem over functions
Use the exponential family form; optimise the KL divergence
(Chapter 10)
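A sketch of the classic factorised-Gaussian example (Bishop 10.1.2): the optimal factorised approximation to a correlated Gaussian matches the diagonal of the precision matrix, and therefore underestimates the marginal variances; the covariance values are assumed:

```python
import numpy as np

# True posterior: correlated 2-D Gaussian (covariance assumed).
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)        # precision matrix

# Optimal factorised q(z) = q1(z1) q2(z2): var_i = 1 / Lambda_ii.
var_q = 1.0 / np.diag(Lambda)
print(var_q, np.diag(Sigma))         # q variances < true marginal variances
```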
Sequential Data
Stationary vs. Nonstationary Sequential Distributions
Markov Model of order M = 0, 1, . . .
State Space Model using latent variables
Hidden Markov Model (HMM): the latent variables are discrete
Homogeneous HMM
Left-to-right HMM
Viterbi algorithm
(Chapter 13.1, 13.2)
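A sketch of the Viterbi algorithm as max-sum in log space, for an HMM with assumed initial, transition and emission tables:

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most probable latent path; log_A[i, j] = log p(z_t = j | z_{t-1} = i)."""
    K, T = log_A.shape[0], len(obs)
    delta = log_pi + log_B[:, obs[0]]            # best log-score per state
    back = np.zeros((T, K), dtype=int)           # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (previous state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack through pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```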
The topics
The Language
Vectors, geometry, linear algebra, calculus, probability, graphical models
The Tool
Construct a model, take the gradient, set it to zero, solve
Supervised learning
Regression, classification and sequence prediction
Unsupervised learning
Clustering, density estimation and dimensionality reduction
What we did not cover (in detail)
Other learning paradigms
Neural Networks
Evolutionary Methods (e.g. Genetic Algorithms)
Frequent item mining
Expert Systems / Rule based learning
Theory
Information Theory
Convex Optimisation
Generalised Linear Models
Dynamical Systems
Reinforcement Learning
Artificial Intelligence
Applications
Natural Language Processing
Computer Vision
Computational Social Science
Robotics
What is Machine Learning?
Shameless plug: mml-book.com