Statistical Machine Learning
©2020 Ong & Walder & Webers
Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part VI
Linear Classification 2
Three Models for Decision Problems
In increasing order of complexity:
Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.
Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.
Data Generating Process
Given:
class prior p(t)
class-conditional p(x | t)
to generate data from the model we may do the following:
1 Sample the class label from the class prior p(t).
2 Sample the data features from the class-conditional
distribution p(x | t).
(more about sampling later — this is called ancestral sampling)
Thinking about the data generating process is a useful
modelling step, especially when we have more prior
knowledge.
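A minimal sketch of this two-step ancestral sampling procedure, assuming Gaussian class-conditionals with a shared covariance; the prior, means and covariance below are hypothetical values chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)

# hypothetical model: class prior p(t) and Gaussian class-conditionals p(x | t)
prior = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])          # shared covariance

def sample(n):
    # 1. sample the class label from the class prior p(t)
    t = rng.choice(len(prior), size=n, p=prior)
    # 2. sample the features from the class-conditional p(x | t)
    x = np.array([rng.multivariate_normal(means[k], cov) for k in t])
    return x, t

X, t = sample(5)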
Probabilistic Generative Models
Generative approach: model class-conditional densities
p(x | Ck) and class priors (not parameter priors!) p(Ck) to
calculate the posterior probability for class C1
p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]
          = 1 / (1 + exp(−a(x)))
          ≡ σ(a(x))
where a and the logistic sigmoid function σ(a) are given by
a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]
σ(a) = 1 / (1 + exp(−a)).
One point of this re-writing: we may learn a(x) directly as
e.g. a deep neural network.
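As a quick numerical check of this identity (a sketch only; the two Gaussian class-conditionals and the priors below are made up, using SciPy's multivariate normal density), the posterior from Bayes' theorem matches σ(a(x)) with a(x) the log ratio of the joint densities:

import numpy as np
from scipy.stats import multivariate_normal as mvn

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# hypothetical class-conditionals p(x | C1), p(x | C2) and priors p(C1), p(C2)
p1, p2 = 0.3, 0.7
g1 = mvn(mean=[0.0, 0.0], cov=np.eye(2))
g2 = mvn(mean=[1.5, 0.5], cov=np.eye(2))

x = np.array([0.7, 0.2])
posterior = g1.pdf(x) * p1 / (g1.pdf(x) * p1 + g2.pdf(x) * p2)   # Bayes' theorem
a = np.log(g1.pdf(x) * p1 / (g2.pdf(x) * p2))                    # a(x) = ln p(x,C1)/p(x,C2)
assert np.isclose(posterior, sigmoid(a))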
Logistic Sigmoid
The logistic sigmoid function is called a “squashing
function” because it squashes the real axis into a finite
interval (0, 1).
Well known properties (derive them):
Symmetry: σ(−a) = 1− σ(a)
Derivative: dσ/da = σ(a) σ(−a) = σ(a) (1 − σ(a))
The inverse of σ is called the logit function.
[Figure: left, the logistic sigmoid σ(a) = 1 / (1 + exp(−a)); right, its inverse, the logit a(σ) = ln( σ / (1 − σ) ).]
Probabilistic Generative Models – Multiclass
The normalised exponential is given by
p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)
where
ak = ln( p(x | Ck) p(Ck) ).
Usually called the softmax function as it is a smoothed version of the argmax function; in particular,
ak ≫ aj for all j ≠ k   ⇒   p(Ck | x) ≈ 1 and p(Cj | x) ≈ 0.
So, softargmax is a more descriptive though less common
name.
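A minimal, numerically stable sketch of the normalised exponential (subtracting the maximum ak before exponentiating leaves the result unchanged, since the common factor cancels in the normalisation):

import numpy as np

def softmax(a):
    # subtract the max for numerical stability; the shift cancels in the ratio
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([5.0, 1.0, -2.0])        # a_1 >> a_j for j != 1
print(softmax(a))                     # approx. [0.981, 0.018, 0.001]: close to argmax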
Probabil. Generative Model – Continuous Input
Assume class-conditional probabilities are Gaussian, with
the same covariance and different mean:
Let’s characterise the posterior probabilities.
We may separate the quadratic and linear term in x:
p(x | Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ −(1/2) (x − µk)^T Σ^{-1} (x − µk) }
          = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ −(1/2) x^T Σ^{-1} x + µk^T Σ^{-1} x − (1/2) µk^T Σ^{-1} µk }
Probabil. Generative Model – Continuous Input
For two classes
p(C1 | x) = σ(a(x))
and a(x) is linear because the quadratic terms in x cancel
(c.f. the previous slide):
a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ]
     = ln [ exp{ µ1^T Σ^{-1} x − (1/2) µ1^T Σ^{-1} µ1 } / exp{ µ2^T Σ^{-1} x − (1/2) µ2^T Σ^{-1} µ2 } ] + ln [ p(C1) / p(C2) ]
Therefore
p(C1 | x) = σ(w^T x + w0)
where
w = Σ^{-1} (µ1 − µ2)
w0 = −(1/2) µ1^T Σ^{-1} µ1 + (1/2) µ2^T Σ^{-1} µ2 + ln [ p(C1) / p(C2) ]
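A short sketch of these closed-form expressions; the means, shared covariance and priors below are hypothetical values:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# hypothetical shared-covariance Gaussian parameters and class priors
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
p1, p2 = 0.4, 0.6

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

x = np.array([0.3, -0.5])
print(sigmoid(w @ x + w0))            # p(C1 | x)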
Probabil. Generative Model – Continuous Input
Class-conditional densities for two classes (left). Posterior
probability p(C1 | x) (right). Note the logistic sigmoid of a linear
function of x.
General Case – K Classes, Shared Covariance
Use the normalised exponential
p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)
where
ak = ln( p(x | Ck) p(Ck) ),
to get a linear function of x
ak(x) = wk^T x + wk0
where
wk = Σ^{-1} µk
wk0 = −(1/2) µk^T Σ^{-1} µk + ln p(Ck).
General Case – K Classes, Different Covariance
If the class-conditional distributions have different covariances Σk, the quadratic terms −(1/2) x^T Σk^{-1} x do not cancel out.
We get a quadratic discriminant.
Parameter Estimation
Given the functional form of the class-conditional densities p(x | Ck), how can we determine the parameters µ and Σ and the class prior?
Simplest is maximum likelihood.
Given also a data set (xn, tn) for n = 1, . . . , N, using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 to class C2.
Assume the class-conditional densities to be Gaussian
with the same covariance, but different mean.
Denote the prior probability p(C1) = π, and therefore
p(C2) = 1− π.
Then
p(xn, C1) = p(C1) p(xn | C1) = π N(xn | µ1, Σ)
p(xn, C2) = p(C2) p(xn | C2) = (1 − π) N(xn | µ2, Σ)
Maximum Likelihood Solution
Thus the likelihood for the whole data set X and t is given
by
p(t, X | π, µ1, µ2, Σ) = ∏_{n=1}^N [ π N(xn | µ1, Σ) ]^{tn} [ (1 − π) N(xn | µ2, Σ) ]^{1−tn}
Maximise the log likelihood.
The term depending on π is
∑_{n=1}^N ( tn ln π + (1 − tn) ln(1 − π) )
which is maximal for (derive it)
π = (1/N) ∑_{n=1}^N tn = N1 / N = N1 / (N1 + N2)
where N1 is the number of data points in class C1.
Maximum Likelihood Solution
Similarly, we can maximise the likelihood
p(t,X |π,µ1,µ2,Σ) w.r.t. the means µ1 and µ2, to get
µ1 = (1/N1) ∑_{n=1}^N tn xn
µ2 = (1/N2) ∑_{n=1}^N (1 − tn) xn
For each class, this is the mean of all input vectors assigned to that class.
Maximum Likelihood Solution
Finally, the log likelihood ln p(t,X |π,µ1,µ2,Σ) can be
maximised for the covariance Σ resulting in
Σ = (N1 / N) S1 + (N2 / N) S2
Sk = (1/Nk) ∑_{n ∈ Ck} (xn − µk)(xn − µk)^T
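A sketch of these maximum-likelihood estimates in code; the function name and shape conventions are illustrative, with X holding the inputs row-wise and t the binary labels:

import numpy as np

def fit_generative_model(X, t):
    # X: (N, D) inputs; t: (N,) labels with tn = 1 for class C1 and tn = 0 for class C2
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                          # p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1              # mean of class C1 inputs
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2        # mean of class C2 inputs
    d1, d2 = X[t == 1] - mu1, X[t == 0] - mu2
    S1, S2 = d1.T @ d1 / N1, d2.T @ d2 / N2              # per-class scatter matrices
    Sigma = (N1 * S1 + N2 * S2) / N                      # weighted average
    return pi, mu1, mu2, Sigma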
Discrete Features – Naïve Bayes
Assume the input space consists of discrete features, in
the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
This grows exponentially with the number of features.
The Naïve Bayes assumption is that, given the class Ck,
the features are independent of each other:
p(x | Ck) = ∏_{i=1}^D p(x_i | Ck) = ∏_{i=1}^D µ_ki^{x_i} (1 − µ_ki)^{1−x_i}
Discrete Features – Naïve Bayes
With the naïve Bayes assumption
p(x | Ck) = ∏_{i=1}^D µ_ki^{x_i} (1 − µ_ki)^{1−x_i}
we can then again find the factors ak in the normalised exponential
p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)
as linear functions of the x_i:
ak(x) = ∑_{i=1}^D { x_i ln µ_ki + (1 − x_i) ln(1 − µ_ki) } + ln p(Ck).
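A minimal sketch of this computation for binary features; the Bernoulli parameters µ_ki and the priors below are hypothetical:

import numpy as np

def naive_bayes_posterior(x, mu, prior):
    # x: (D,) binary features; mu: (K, D) Bernoulli parameters mu_ki; prior: (K,) p(Ck)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(prior)
    e = np.exp(a - a.max())              # stabilised normalised exponential
    return e / e.sum()                   # p(Ck | x)

mu = np.array([[0.8, 0.1, 0.6],          # hypothetical mu_1i
               [0.3, 0.7, 0.5]])         # hypothetical mu_2i
prior = np.array([0.4, 0.6])
print(naive_bayes_posterior(np.array([1, 0, 1]), mu, prior))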
Three Models for Decision Problems
In increasing order of complexity:
Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.
Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.
Probabilistic Discriminative Models
Discriminative training: learn only to discriminate between
the classes.
Maximise a likelihood function defined through the
conditional distribution p(Ck | x) directly.
Typically fewer parameters to be determined.
As we learn the posterior p(Ck | x) directly, prediction may
be better than with a generative model where the
class-conditional density assumptions p(x | Ck) poorly
approximate the true distributions.
But: discriminative models cannot create synthetic data,
as p(x) is not modelled.
As an aside: certain theoretical analyses show that
generative models converge faster to their — albeit worse
— asymptotic classification performance and are superior
in some regimes.
Original Input versus Feature Space
So far in classification, we used the input x directly.
All classification algorithms work also if we first apply a
fixed nonlinear transformation of the inputs using a vector
of basis functions φ(x).
Example: Use two Gaussian basis functions centered at
the green crosses in the input space.
[Figure: left, the input space (x1, x2) with the two Gaussian basis function centres marked by green crosses; right, the corresponding feature space (φ1, φ2).]
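A sketch of such a fixed feature map; the two centres and the width below are hypothetical choices for illustration:

import numpy as np

centres = np.array([[-0.5, 0.5], [0.5, -0.5]])    # hypothetical basis-function centres
s = 0.5                                           # hypothetical width

def phi(X):
    # X: (N, 2) inputs -> (N, 2) features, phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * s ** 2))

print(phi(np.array([[0.0, 0.0], [1.0, 1.0]])))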
Original Input versus Feature Space
Linear decision boundaries in the feature space generally
correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input
space may become linearly separable in the feature space:
[Figure: classes that are not linearly separable in the input space (x1, x2) become linearly separable in the feature space (φ1, φ2); the nonlinear boundary in input space corresponds to a linear boundary in feature space.]
If classes overlap in input space, they will also overlap in feature space: nonlinear features φ(x) cannot remove the overlap, though they may increase it.
Original Input versus Feature Space
Fixed basis functions do not adapt to the data and
therefore have important limitations (see discussion in
Linear Regression).
Understanding of more advanced algorithms becomes
easier if we introduce the feature space now and use it
instead of the original input space.
Some applications use fixed features successfully by
avoiding the limitations.
We will therefore use φ instead of x from now on.
Logistic Regression
Two classes, where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ:
p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)
The number of adjustable parameters equals the dimension M of the feature space.
Compare this to fitting two Gaussians, which requires a number of parameters quadratic in M:
2M (means) + M(M + 1)/2 (shared covariance)
For larger M, the logistic regression model has a clear
advantage.
Logistic Regression
Determine the parameters w via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function
p(t | w) = ∏_{n=1}^N yn^{tn} (1 − yn)^{1−tn}
where yn = p(C1 | φn).
Error function: negative log likelihood, resulting in the cross-entropy error function
E(w) = − ln p(t | w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) }
Logistic Regression
Error function (cross-entropy loss)
E(w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) },   where yn = p(C1 | φn) = σ(w^T φn)
We obtain the gradient of the error function using the chain rule and the sigmoid result dσ/da = σ(1 − σ) (derive it):
∇E(w) = ∑_{n=1}^N (yn − tn) φn
For each data point, the contribution to the gradient is the product of the deviation yn − tn and the basis function vector φn.
We can now use gradient descent.
We may easily modify this to reduce over-fitting by using a regularised error function or MAP estimation (how?).
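A minimal batch gradient-descent sketch for this model (fixed step size, no regularisation; the function name and hyperparameters are illustrative):

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=1000):
    # Phi: (N, M) design matrix of features phi_n; t: (N,) targets in {0, 1}
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)            # yn = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)          # sum_n (yn - tn) phi_n
        w -= lr * grad / len(t)         # averaged gradient step
    return w

Adding a term λw to the gradient (equivalently adding (λ/2) ‖w‖^2 to E(w)) gives the regularised / MAP version mentioned above.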
Laplace Approximation
Given a continuous distribution p(z) which is not Gaussian, can we approximate it by a Gaussian q(z)?
Need to find a mode of p(z). Try to find a Gaussian with the same mode:
[Figure: left, the p.d.f. of a non-Gaussian distribution (yellow) and its Gaussian approximation (red); right, the corresponding negative log p.d.f.s.]
Laplace Approximation
Cheap and nasty but sometimes effective.
Assume p(z) can be written as
p(z) = (1/Z) f(z)
with normalisation Z = ∫ f(z) dz.
We do not even need to know Z to find the Laplace approximation.
A mode of p(z) is at a point z0 where p′(z0) = 0.
Taylor expansion of ln f(z) at z0:
ln f(z) ≃ ln f(z0) − (1/2) A (z − z0)^2
where
A = − (d²/dz²) ln f(z) |_{z=z0}
Laplace Approximation
Exponentiating
ln f(z) ≃ ln f(z0) − (1/2) A (z − z0)^2
we get
f(z) ≃ f(z0) exp{ −(A/2) (z − z0)^2 }.
And after normalisation we get the Laplace approximation
q(z) = (A / 2π)^{1/2} exp{ −(A/2) (z − z0)^2 }.
Only defined for precision A > 0 as only then p(z) has a
maximum.
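A 1-D numerical sketch of the recipe: locate a mode of a hypothetical unnormalised f(z), estimate A by a finite-difference second derivative of −ln f at the mode, and read off q(z):

import numpy as np
from scipy.optimize import minimize_scalar

# hypothetical unnormalised, non-Gaussian density f(z) = exp(-z^2/2) * sigmoid(4z)
f = lambda z: np.exp(-0.5 * z**2) / (1.0 + np.exp(-4.0 * z))
neg_log_f = lambda z: 0.5 * z**2 + np.log(1.0 + np.exp(-4.0 * z))

z0 = minimize_scalar(neg_log_f).x                 # mode of p(z)
h = 1e-4                                          # finite-difference step
A = (neg_log_f(z0 - h) - 2 * neg_log_f(z0) + neg_log_f(z0 + h)) / h**2
q = lambda z: np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)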
Laplace Approximation – Vector Space
Approximate p(z) for z ∈ R^M,
p(z) = (1/Z) f(z).
We get the Taylor expansion
ln f(z) ≃ ln f(z0) − (1/2) (z − z0)^T A (z − z0)
where the Hessian A is defined as
A = −∇∇ ln f(z) |_{z=z0}.
The Laplace approximation of p(z) is then
q(z) ∝ exp{ −(1/2) (z − z0)^T A (z − z0) }   ⇒   q(z) = N(z | z0, A^{-1})
Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable.
Why? We need to normalise a product of the prior and the likelihood, which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.
Bayesian Logistic Regression
Assume a Gaussian prior:
p(w) = N (w |m0,S0)
for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In
contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1, . . . ,N, the
posterior is given by
p(w | t) ∝ p(w)p(t |w)
where t = (t1, . . . , tN)T .
Bayesian Logistic Regression
Using our previous result for the cross-entropy error function
E(w) = − ln p(t | w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) },
we can now calculate the log of the posterior
p(w | t) ∝ p(w) p(t | w)
using the notation yn = σ(w^T φn) as
ln p(w | t) = −(1/2) (w − m0)^T S0^{-1} (w − m0) + ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) } + const.
Bayesian Logistic Regression
To obtain a Gaussian approximation to
ln p(w | t) = −(1/2) (w − m0)^T S0^{-1} (w − m0) + ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) } + const:
1 Find wMAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because yn = σ(w^T φn).)
2 Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation
SN^{-1} = −∇∇ ln p(w | t) = S0^{-1} + ∑_{n=1}^N yn(1 − yn) φn φn^T.
Nowadays the gradient and Hessian would be computed with
automatic differentiation; one need only implement ln p(w | t).
Bayesian Logistic Regression
The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now
q(w) = N(w | wMAP, SN)
where
SN^{-1} = −∇∇ ln p(w | t) = S0^{-1} + ∑_{n=1}^N yn(1 − yn) φn φn^T.
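A sketch that assembles this approximation: find wMAP by gradient ascent on ln p(w | t), then form SN from the expression above. The zero-mean isotropic prior (m0 = 0, S0 = (1/α) I), the function name and the step-size settings are assumptions for illustration only.

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, alpha=1.0, lr=0.1, iters=5000):
    # Phi: (N, M) features; t: (N,) targets in {0, 1}; assumed prior m0 = 0, S0 = (1/alpha) I
    N, M = Phi.shape
    S0_inv = alpha * np.eye(M)
    w = np.zeros(M)
    for _ in range(iters):                       # gradient ascent on ln p(w | t)
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (t - y) - S0_inv @ w      # gradient of the log posterior
        w += lr * grad / N
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])   # SN^{-1}
    return w, np.linalg.inv(SN_inv)              # wMAP and SN

The returned pair (wMAP, SN) defines the approximate posterior q(w) = N(w | wMAP, SN) above.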
Linear Classification 2
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models
Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression