Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part XXIII

Discussion and Summary

The Language
Problem Setting
Linear Regression and Classification
Neural Networks
Non-Factorising Distributions
Kernels
Sum Rule
Factorising Distributions
Sequential Data
Where to go from here?

Flavour of this course

Formalise intuitions about problems
Use language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Calculus to identify good parameters
Probabilistic inference
Design and analysis of algorithms
Numerical algorithms in python
Understand the choices when designing machine learning
methods


Probability Theory

Frequentist vs. Bayes approach
Conditional Probability
Bayes Theorem
Discrete vs. continuous random variables
Distributions (Gaussian, Bernoulli, Binomial)
Multivariate Distributions
Change of Variables
Conjugate Priors

(Chapter 2)


Linear Algebra and Calculus

Vector Space
Matrix-Vector Multiplication = Linear Combination
Projection
Positive (Semi)-definite Matrix
Rank, Determinant, Trace, Inverse
Eigenvectors, Eigenvalues
Eigenvector Decomposition
Singular Value Decomposition
Gradient Calculation
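The numpy routines below give a minimal, self-contained illustration of these operations; the matrices are made up for the example.

import numpy as np

# A symmetric positive semi-definite matrix (illustrative values).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Eigenvector decomposition: A = V diag(w) V^T for symmetric A.
w, V = np.linalg.eigh(A)
assert np.allclose(V @ np.diag(w) @ V.T, A)

# Singular value decomposition of a general (here 3x2) matrix.
B = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])
U, s, Vt = np.linalg.svd(B, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, B)

# Rank, determinant, trace, inverse of A.
print(np.linalg.matrix_rank(A), np.linalg.det(A), np.trace(A), np.linalg.inv(A))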


Optimisation

Gradient descent
Stochastic (On-line) Gradient Descent
Global vs. Local extremum
Lagrange Multipliers
Quadratic programming (quadratic objective function,
linear constraints)
Dynamic programming (e.g. Hidden Markov Model)
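A minimal sketch of the basic gradient-descent update on a made-up quadratic objective; the objective and the step size are illustrative only.

import numpy as np

def grad_f(w):
    # Gradient of the illustrative objective f(w) = 0.5 * ||w - 1||^2.
    return w - 1.0

w = np.zeros(3)          # initial parameters
eta = 0.1                # step size (learning rate)
for _ in range(100):
    w = w - eta * grad_f(w)   # batch gradient descent step

print(w)   # converges towards the minimiser [1, 1, 1]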


Probabilistic Graphical Models

Joint Probability factorises.
Conditional Independence
Independence Structure or “the absence of edges”
Directed, Undirected and Factor Graphs
Bayesian Network, Blocked Path and d-separation
Markov Random Field, (maximal) Cliques
Factor Graphs are Bipartite Graphs

(Chapter 8)


What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.


Learning

Examples
General Setup
Inductive Bias
Restricted Hypothesis Space
Importance of understanding the restrictions and whether
they are appropriate
Do not train on the test set

(Chapter 1)


Strategy in this course

Estimate the best predictor = training = learning.
Given data $(x_1, y_1), \ldots, (x_n, y_n)$, find a predictor $f_w(\cdot)$.

1. Identify the type of input x and output y data
2. Propose a mathematical model for $f_w$
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in python)
7. Interpret and diagnose results


Tasks

Regression
Classification: binary and multiclass
Clustering and density estimation
Sequence prediction
Dimensionality reduction


Models for Decision Problems

Given the input space $\mathcal{V}$, input data $x \in \mathcal{V}$, and a set of classes $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$:

Discriminant Function $f(x)$ with $f : \mathcal{V} \to \mathcal{C}$
Discriminative Model $p(C_k \mid x)$, then use decision theory
Generative Model $p(x, C_k)$, then use decision theory


Probability/Density Estimation

Maximum Likelihood (ML)

$\theta^\star = \operatorname{argmax}_{\theta}\; p(\mathcal{D} \mid \theta)$

Maximum a Posteriori (MAP)

$\theta^\star = \operatorname{argmax}_{\theta}\; p(\theta \mid \mathcal{D}), \qquad p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$

Bayesian

$p(\theta \mid \mathcal{D}^{(n)}) \propto p(x^{(n)} \mid \theta)\, p(\theta \mid \mathcal{D}^{(n-1)})$
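As a small illustration of the difference between the three estimates, here is a sketch for a Bernoulli parameter with a conjugate Beta prior; the data and the prior counts are invented for the example.

import numpy as np

# Coin-flip data (illustrative): 7 heads out of 10.
x = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
n_heads, n = x.sum(), len(x)

# Maximum likelihood: argmax_theta p(D | theta).
theta_ml = n_heads / n

# MAP with a Beta(a, b) prior (conjugate to the Bernoulli).
a, b = 2.0, 2.0
theta_map = (n_heads + a - 1) / (n + a + b - 2)

# Bayesian: keep the full posterior Beta(a + heads, b + tails),
# which can be updated sequentially as each observation arrives.
posterior = (a + n_heads, b + n - n_heads)
print(theta_ml, theta_map, posterior)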


The Role of Training Data

Parametric methods: learn the model parameters from the training data, then discard the training data.
Nonparametric methods: use the training data for prediction

Histogram method
k-nearest neighbours
Parzen probability density model: a set of functions centred on the data

Kernel methods: use a linear combination of functions evaluated at the training data.


Likelihood and Loss Views

Bayes' Theorem

$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$

Regularised Risk

$\sum_{n=1}^{N} \ell(\theta, \mathcal{D}_n) + \frac{\lambda}{2} \|\theta\|^2$
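The two views coincide when the loss is a negative log likelihood and the regulariser a negative log prior. For i.i.d. data and a Gaussian prior $p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)$ (an illustrative choice),

$-\ln p(\theta \mid \mathcal{D}) = -\ln p(\mathcal{D} \mid \theta) - \ln p(\theta) + \text{const} = \sum_{n=1}^{N} \bigl(-\ln p(\mathcal{D}_n \mid \theta)\bigr) + \frac{\lambda}{2}\|\theta\|^2 + \text{const},$

so maximising the posterior amounts to minimising a regularised risk with $\ell(\theta, \mathcal{D}_n) = -\ln p(\mathcal{D}_n \mid \theta)$.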


Linear Regression

General Regression Setup
Closed-form solution
Maximum Likelihood and Least Squares
Geometry of Least Squares
Sequential Learning (on-line)
Choice of basis function
Regularisation
Powerful with nonlinear feature mappings
Bias-Variance Decomposition

(Chapter 3.1, 3.2, 3.3)
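A minimal numpy sketch of regularised least squares with a polynomial basis; the data, the basis order and the regularisation strength are made up.

import numpy as np

# Illustrative 1-D regression data.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Polynomial basis functions phi_j(x) = x^j, j = 0..M.
M, lam = 5, 1e-3
Phi = np.vander(x, M + 1, increasing=True)     # design matrix, shape (N, M+1)

# Regularised least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.
w = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

# Predictions at new inputs.
x_new = np.array([0.25, 0.5])
print(np.vander(x_new, M + 1, increasing=True) @ w)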


Bayesian Linear Regression

Closed-form solution
Predictive Distribution

$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, \mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w} = \int p(t \mid \mathbf{w}, x)\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}$

Conjugate Prior
Limitations of Linear Basis Function Models
Curse of dimensionality
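A sketch of the Gaussian posterior and predictive distribution for linear regression, assuming a zero-mean isotropic prior with precision alpha and noise precision beta; the data and values of alpha and beta are illustrative.

import numpy as np

# Illustrative data and polynomial design matrix (as in the regression sketch).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
Phi = np.vander(x, 6, increasing=True)

# Prior precision alpha and noise precision beta (illustrative values).
alpha, beta = 2.0, 25.0

# Posterior p(w | x, t) = N(w | m_N, S_N) for prior p(w) = N(0, alpha^{-1} I).
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at a new input: Gaussian mean and variance.
phi_new = np.vander(np.array([0.5]), Phi.shape[1], increasing=True)[0]
print(phi_new @ m_N, 1.0 / beta + phi_new @ S_N @ phi_new)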


Linear Classification

General Classification Setup
Input space versus Feature space
Binary and Multiclass Labels
Maximum Likelihood solution

$\theta^\star = \operatorname{argmax}_{\theta}\; p(X, \mathbf{t} \mid \theta)$

Naive Bayes: all features conditioned on the class are independent of each other

(Chapter 4)


Logistic Regression

Smooth logistic sigmoid acting on a linear feature vector
Compare to perceptron
Error as negative log likelihood (cross-entropy error)
Gradient of error is target deviation times basis function
(linear)
Laplace approximation
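A small sketch of logistic regression trained by gradient descent, using the cross-entropy gradient (predicted probabilities minus targets, multiplied by the basis functions); the data are synthetic.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative 2-D binary classification data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

# Basis functions: just the inputs plus a bias term.
Phi = np.hstack([np.ones((100, 1)), X])

w = np.zeros(3)
eta = 0.1
for _ in range(200):
    y = sigmoid(Phi @ w)          # predicted class probabilities
    grad = Phi.T @ (y - t)        # gradient of the cross-entropy error
    w -= eta * grad / len(t)      # gradient descent step

print(w)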


Flexible Basis Functions

Neural Networks
Multilayer Perceptron with differentiable activation function
The basis functions can now adapt to the data.
Weight space symmetries.
Error Backpropagation.
Regularisation in Neural Networks.

(Chapter 5)
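A minimal numpy sketch of error backpropagation in a one-hidden-layer network with tanh units and squared error; the sizes, data and learning rate are illustrative, and biases are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))          # illustrative inputs
t = np.sin(X[:, :1])                      # illustrative targets, shape (50, 1)

# One hidden layer with tanh units; weights initialised at random.
W1 = 0.1 * rng.standard_normal((2, 8))
W2 = 0.1 * rng.standard_normal((8, 1))

eta = 0.05
for _ in range(500):
    # Forward pass.
    z = np.tanh(X @ W1)                   # hidden activations
    y = z @ W2                            # linear output unit
    # Backward pass (error backpropagation for squared error).
    delta2 = (y - t) / len(X)             # output-layer errors
    delta1 = (delta2 @ W2.T) * (1 - z**2) # hidden-layer errors via the chain rule
    W2 -= eta * z.T @ delta2
    W1 -= eta * X.T @ delta1

print(np.mean((y - t) ** 2))              # training error after the updates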


Dimensionality Reduction – PCA

Maximise Variance
Find the eigenvectors of the covariance corresponding to
the largest eigenvalues.
PCA and Compression
Data Standardisation
Data Whitening

(Chapter 12.1)
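A minimal PCA sketch: centre the data, eigendecompose the sample covariance, project onto the leading eigenvectors and reconstruct; the data are synthetic.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # illustrative data

# Centre the data (standardisation would also divide by the standard deviation).
Xc = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order

# Project onto the M eigenvectors with the largest eigenvalues.
M = 2
W = eigvecs[:, -M:]                              # principal directions
Z = Xc @ W                                       # low-dimensional representation
X_rec = Z @ W.T + X.mean(axis=0)                 # reconstruction (compression view)
print(np.mean((X - X_rec) ** 2))                 # reconstruction error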


Autoencoders

For feedforward neural networks, multiple layers can be
advantageous
Multiple PCA layers are equivalent to a single PCA
Want to minimise the reconstruction error, so add a nonlinear hidden layer
Undercomplete autoencoder – lossy compression
Pre-training for supervised learning (with unlabelled data)
Denoising autoencoder
Overcomplete autoencoder – sparse representations


Mixture Models

The joint probability over the observed variables no longer factorises.
Introduce discrete latent variables to model complex
marginal distributions over the observed variables by
simpler distributions over observed and latent variables.
K-means clustering
Data compression
Mixture of Bernoulli
Mixture of Gaussians

$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{z} \prod_{k=1}^{K} \pi_k^{z_k} \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k} = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

(Chapter 9.1, 9.2)
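A small sketch evaluating the mixture-of-Gaussians density above at a single point; the mixture parameters are made up.

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Density of N(x | mu, Sigma) for a single point x.
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Illustrative 2-component mixture in 2-D.
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

x = np.array([1.0, 1.0])
p_x = sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))
print(p_x)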


Expectation Maximisation (EM)

Evaluate the responsibilities, then maximise with respect to the parameters.
E step: find $p(Z \mid X, \theta^{\text{old}})$.
M step: find $\theta^{\text{new}} = \operatorname{argmax}_{\theta} Q(\theta, \theta^{\text{old}})$, where

$Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta)$

Kullback-Leibler Divergence

(Chapter 9.3, 9.4)
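A minimal sketch of EM for a one-dimensional mixture of two Gaussians, alternating the E and M steps above; the data and initial parameters are invented.

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative 1-D data from two clusters, and initial parameters theta_old.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
pis = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

for _ in range(50):
    # E step: responsibilities p(z | x, theta_old), shape (N, K).
    r = pis * normal_pdf(x[:, None], mus, variances)
    r /= r.sum(axis=1, keepdims=True)
    # M step: maximise Q(theta, theta_old) in closed form.
    Nk = r.sum(axis=0)
    mus = (r * x[:, None]).sum(axis=0) / Nk
    variances = (r * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
    pis = Nk / len(x)

print(pis, mus, variances)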


Kernels

Inner Product → Kernel
Kernels are a kind of similarity measure
Sparse Kernel Machines
Support Vector Machines
Lagrange multipliers
How do we get to the relevant data points?
Overlapping class distribution
Outputs are decisions, not posterior probabilities.

(Chapter 6.1, 6.2 and 7.1)
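A small sketch of the kernel-machine form: an RBF Gram matrix and a prediction that is a linear combination of kernels centred on the training points. For brevity this uses kernel ridge regression rather than a support vector machine; the data and hyperparameters are made up.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian (RBF) kernel matrix: k(x, x') = exp(-gamma * ||x - x'||^2).
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Illustrative training data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (30, 1))
t = np.sin(X[:, 0])

# Kernel ridge regression: the prediction is a linear combination
# of kernel functions evaluated at the training data.
lam = 1e-2
K = rbf_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), t)    # dual coefficients

X_new = np.array([[0.5]])
print(rbf_kernel(X_new, X) @ a)                      # prediction near sin(0.5)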


Sum Rule – Integration

Sum Rule

$p(X) = \sum_{Y} p(X, Y)$

Four possible options for computing the sum over $Y$:

Brute force – exhaustive
Sampling
Sum-product
Variational Inference
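A tiny sketch of the brute-force option on a made-up discrete joint distribution.

import numpy as np

# A made-up joint distribution p(X, Y) over 3 values of X and 4 values of Y.
rng = np.random.default_rng(0)
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()

# Sum rule by brute force: p(X) = sum_Y p(X, Y).
p_x = p_xy.sum(axis=1)
print(p_x, p_x.sum())   # marginal over X; sums to 1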


Sampling

Central task: Find p(Z |X,θ).
Stochastic Approximation
Sampling from the uniform distribution.
Sampling from standard distributions via the inversion of
the cumulative distribution function
Rejection Sampling
Adaptive Rejection Sampling
Importance Sampling: calculate the expectation of a function f(z) under a distribution p(z) using samples from some simpler q(z).
Markov Chain Monte Carlo
Metropolis Hastings Algorithm
Gibbs Sampling

(Chapter 11)
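A minimal Metropolis-Hastings sketch with a Gaussian random-walk proposal; the unnormalised target density and the step size are invented for the example.

import numpy as np

def p_tilde(z):
    # Unnormalised target density (illustrative): a mixture of two Gaussians.
    return np.exp(-0.5 * (z - 2) ** 2) + 0.5 * np.exp(-0.5 * (z + 2) ** 2)

rng = np.random.default_rng(0)
z, samples = 0.0, []
for _ in range(10000):
    z_new = z + rng.normal(0, 1.0)                 # random-walk proposal
    if rng.random() < min(1.0, p_tilde(z_new) / p_tilde(z)):
        z = z_new                                  # accept the proposal
    samples.append(z)                              # keep the current state either way

print(np.mean(samples[1000:]))                     # Monte Carlo estimate of E[z] after burn-in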


Sum-Product Algorithm

Independence structure – Local computations
Sum-Product Algorithm
Message passing
Distributive Law

(Chapter 8.4)
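A small sketch of the distributive law on a three-variable chain: the marginal of the last variable is computed once by brute force over the joint and once by passing messages along the chain; the pairwise factors are random and made up.

import numpy as np

rng = np.random.default_rng(0)
K = 4   # number of states per variable

# A chain x1 - x2 - x3 with made-up pairwise factors psi(x_i, x_{i+1}).
psi12 = rng.random((K, K))
psi23 = rng.random((K, K))

# Brute force: sum the full joint over x1 and x2 to get the marginal of x3.
joint = psi12[:, :, None] * psi23[None, :, :]        # shape (K, K, K)
p3_brute = joint.sum(axis=(0, 1))

# Sum-product: pass messages along the chain (push the sums inside).
msg_1_to_2 = psi12.sum(axis=0)                       # sum over x1
p3_msg = (msg_1_to_2[:, None] * psi23).sum(axis=0)   # sum over x2

print(np.allclose(p3_brute, p3_msg))                 # True: same unnormalised marginal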


Variational Inference

Deterministic Approximation
Assume an analytical approximation of the posterior
E.g. assume the posterior factorises
Convert the integration problem into an optimisation problem over functions
Use the exponential family form, optimise with the KL divergence

(Chapter 10)


Sequential Data

Stationary vs. Nonstationary Sequential Distributions
Markov Model of order M = 0, 1, . . .
State Space Model using latent variables
Hidden Markov Model (HMM): Latent variables are
discrete.
Homogeneous HMM
Left-to-right HMM
Viterbi algorithm

(Chapter 13.1, 13.2)
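A minimal Viterbi sketch for a small discrete HMM in log space; the initial, transition and emission probabilities and the observation sequence are made up.

import numpy as np

# Illustrative HMM: 2 hidden states, 3 observation symbols.
pi = np.array([0.6, 0.4])                       # initial state distribution
A = np.array([[0.7, 0.3],                       # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],                  # emission probabilities
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2, 1]                           # observed symbol sequence

# Viterbi in log space: dynamic programming over the most probable path.
log_delta = np.log(pi) + np.log(B[:, obs[0]])
backptr = []
for o in obs[1:]:
    scores = log_delta[:, None] + np.log(A)     # scores[i, j]: end in state j coming from i
    backptr.append(scores.argmax(axis=0))
    log_delta = scores.max(axis=0) + np.log(B[:, o])

# Backtrack the most probable state sequence.
path = [int(log_delta.argmax())]
for bp in reversed(backptr):
    path.append(int(bp[path[-1]]))
path.reverse()
print(path)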


The topics

The Language
Vectors, geometry, linear algebra, calculus, probability,
graphical models
The Tool
Construct model, take the gradient, set it to zero, solve
Supervised learning
Regression, classification and sequence prediction
Unsupervised learning
Clustering, density estimation and dimensionality
reduction


Flavour of this course

Formalise intuitions about problems
Use language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Calculus to identify good parameters
Design and analysis of algorithms
Numerical algorithms in python
Understand the choices when designing machine learning
methods


What we did not cover (in detail)

Other learning paradigms
Neural Networks
Evolutionary Methods (e.g. Genetic Algorithms)
Frequent item mining
Expert Systems / Rule based learning

Theory
Information Theory
Convex Optimisation
Generalised Linear Models
Dynamical Systems
Reinforcement Learning
Artificial Intelligence

Applications
Natural Language Processing
Computer Vision
Computational Social Science
Robotics


What is Machine Learning?

Shameless plug: mml-book.com
