
Statistical Machine Learning

©2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)

Part VI

Linear Classification 2

Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes' theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Data Generating Process

Given:
the class prior p(t)
the class-conditional p(x | t)

to generate data from the model we may do the following:
1 Sample the class label from the class prior p(t).
2 Sample the data features from the class-conditional distribution p(x | t).
(More about sampling later; this is called ancestral sampling.)

Thinking about the data generating process is a useful modelling step, especially when we have additional prior knowledge.
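
A minimal sketch of ancestral sampling in Python/NumPy, assuming a two-class prior and Gaussian class-conditionals; the parameter values and names are illustrative, not from the lecture:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: class prior p(t) and Gaussian class-conditionals p(x | t).
prior = np.array([0.3, 0.7])                        # p(t=0), p(t=1)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])            # shared covariance

def sample_one():
    t = rng.choice(2, p=prior)                      # 1. sample the class label from p(t)
    x = rng.multivariate_normal(means[t], cov)      # 2. sample the features from p(x | t)
    return x, t

samples = [sample_one() for _ in range(5)]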

Probabilistic Generative Models

Generative approach: model class-conditional densities
p(x | Ck) and class priors (not parameter priors!) p(Ck) to
calculate the posterior probability for class C1

p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]
          = 1 / (1 + exp(−a(x)))
          ≡ σ(a(x))

where a and the logistic sigmoid function σ(a) are given by

a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]
σ(a) = 1 / (1 + exp(−a)).

One point of this re-writing: we may learn a(x) directly, e.g. as a deep neural network.

Logistic Sigmoid

The logistic sigmoid function is called a “squashing
function” because it squashes the real axis into a finite
interval (0, 1).
Well known properties (derive them):

Symmetry: σ(−a) = 1 − σ(a)
Derivative: dσ/da = σ(a) σ(−a) = σ(a)(1 − σ(a))
The inverse of σ is called the logit function.

[Plot: the logistic sigmoid σ(a) = 1 / (1 + exp(−a)) as a function of a.]

[Plot: the logit a(σ) = ln( σ / (1 − σ) ) as a function of σ.]
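
A quick numerical check of these properties (a sketch, not part of the slides; the finite-difference tolerance is arbitrary):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    # Inverse of the sigmoid, defined for s in (0, 1).
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 1001)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))              # symmetry
assert np.allclose(np.gradient(sigmoid(a), a),                 # derivative, via finite differences
                   sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-3)
assert np.allclose(logit(sigmoid(a)), a)                       # logit inverts the sigmoid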

Probabilistic Generative Models – Multiclass

The normalised exponential is given by

p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

where
ak = ln( p(x | Ck) p(Ck) ).

Usually called the softmax function, as it is a smoothed version of the argmax function; in particular

ak ≫ aj for all j ≠ k   ⇒   p(Ck | x) ≈ 1 and p(Cj | x) ≈ 0.

So softargmax is a more descriptive, though less common, name.
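
A small sketch of a numerically stable softmax in NumPy (subtracting the maximum is an implementation detail, not something from the slides; it leaves the result unchanged):

import numpy as np

def softmax(a):
    # exp(a_k - c) / sum_j exp(a_j - c) equals exp(a_k) / sum_j exp(a_j) for any c;
    # choosing c = max(a) avoids overflow.
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))     # ~[0.09, 0.24, 0.67]
print(softmax([1.0, 2.0, 30.0]))    # ~[0, 0, 1]: softmax approaches a one-hot argmax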

Probabil. Generative Model – Continuous Input

Assume the class-conditional probabilities are Gaussian, with the same covariance and different means.
Let's characterise the posterior probabilities.
We may separate the quadratic and linear terms in x:

p(x | Ck) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2)(x − µk)^T Σ^{−1}(x − µk) }
          = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2) x^T Σ^{−1} x + µk^T Σ^{−1} x − (1/2) µk^T Σ^{−1} µk }

Probabil. Generative Model – Continuous Input

For two classes
p(C1 | x) = σ(a(x))

and a(x) is linear because the quadratic terms in x cancel
(c.f. the previous slide):

a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ]
     = ln [ exp{ µ1^T Σ^{−1} x − (1/2) µ1^T Σ^{−1} µ1 } / exp{ µ2^T Σ^{−1} x − (1/2) µ2^T Σ^{−1} µ2 } ] + ln [ p(C1) / p(C2) ]

Therefore
p(C1 | x) = σ(w^T x + w0)

where

w = Σ^{−1}(µ1 − µ2)
w0 = −(1/2) µ1^T Σ^{−1} µ1 + (1/2) µ2^T Σ^{−1} µ2 + ln [ p(C1) / p(C2) ]
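
A short sketch computing w and w0 from given Gaussian parameters and evaluating the posterior; the parameter values are illustrative, not from the slides:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters of the generative model (shared covariance).
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
p1, p2 = 0.4, 0.6                                   # class priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(p1 / p2))

x = np.array([0.5, 0.3])
posterior_c1 = sigmoid(w @ x + w0)                  # p(C1 | x)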

Probabil. Generative Model – Continuous Input

[Figure: class-conditional densities for two classes (left) and the posterior probability p(C1 | x) (right). Note the logistic sigmoid of a linear function of x.]

General Case – K Classes, Shared Covariance

Use the normalised exponential

p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

where
ak = ln( p(x | Ck) p(Ck) )

to get a linear function of x

ak(x) = wk^T x + wk0

where

wk = Σ^{−1} µk
wk0 = −(1/2) µk^T Σ^{−1} µk + ln p(Ck).

General Case – K Classes, Different Covariance

If the class-conditional distributions have different covariances Σk, the quadratic terms −(1/2) x^T Σk^{−1} x do not cancel out.
We get a quadratic discriminant.

Parameter Estimation

Given the functional form of the class-conditional densities p(x | Ck), how can we determine the parameters µ and Σ and the class prior?

The simplest approach is maximum likelihood.
Given also a data set (xn, tn) for n = 1, . . . , N (using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 denotes class C2).
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability p(C1) = π, and therefore p(C2) = 1 − π.
Then

p(xn, C1) = p(C1) p(xn | C1) = π N(xn | µ1, Σ)
p(xn, C2) = p(C2) p(xn | C2) = (1 − π) N(xn | µ2, Σ)

Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by

p(t, X | π, µ1, µ2, Σ) = ∏_{n=1}^N [ π N(xn | µ1, Σ) ]^{tn} × [ (1 − π) N(xn | µ2, Σ) ]^{1−tn}

Maximise the log likelihood.
The term depending on π is

∑_{n=1}^N ( tn ln π + (1 − tn) ln(1 − π) )

which is maximal for (derive it)

π = (1/N) ∑_{n=1}^N tn = N1 / N = N1 / (N1 + N2)

where N1 is the number of data points in class C1.

Maximum Likelihood Solution

Similarly, we can maximise the likelihood p(t, X | π, µ1, µ2, Σ) w.r.t. the means µ1 and µ2, to get

µ1 = (1/N1) ∑_{n=1}^N tn xn
µ2 = (1/N2) ∑_{n=1}^N (1 − tn) xn

For each class, these are the means of all input vectors assigned to that class.

Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ1, µ2, Σ) can be maximised for the covariance Σ, resulting in

Σ = (N1/N) S1 + (N2/N) S2
Sk = (1/Nk) ∑_{n ∈ Ck} (xn − µk)(xn − µk)^T
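
A compact sketch of these maximum likelihood estimates in NumPy, for an N x D data matrix X and binary targets t coded as above (the function and variable names are mine):

import numpy as np

def fit_two_class_gaussians(X, t):
    # ML estimates of pi, mu1, mu2 and the shared covariance Sigma.
    t = np.asarray(t, dtype=float)
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                        # prior p(C1)
    mu1 = (t[:, None] * X).sum(axis=0) / N1            # mean of inputs with tn = 1
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2    # mean of inputs with tn = 0
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2              # weighted average of within-class covariances
    return pi, mu1, mu2, Sigma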

Discrete Features – Naïve Bayes

Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
This grows exponentially with the number of features.
The Naïve Bayes assumption is that, given the class Ck, the features are independent of each other:

p(x | Ck) = ∏_{i=1}^D p(xi | Ck) = ∏_{i=1}^D µki^{xi} (1 − µki)^{1−xi}

Discrete Features – Naïve Bayes

With the naïve Bayes assumption

p(x | Ck) = ∏_{i=1}^D µki^{xi} (1 − µki)^{1−xi}

we can then again find the factors ak in the normalised exponential

p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

as a linear function of the xi

ak(x) = ∑_{i=1}^D { xi ln µki + (1 − xi) ln(1 − µki) } + ln p(Ck).
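
A minimal sketch of this computation for binary features, assuming estimated Bernoulli parameters mu (a K x D matrix of µki) and class priors (both names are mine):

import numpy as np

def naive_bayes_log_posteriors(x, mu, priors, eps=1e-12):
    # ak(x) for binary features x; the shared normaliser is irrelevant for classification.
    x = np.asarray(x, dtype=float)          # shape (D,)
    mu = np.clip(mu, eps, 1.0 - eps)        # avoid log(0)
    return (x * np.log(mu) + (1.0 - x) * np.log(1.0 - mu)).sum(axis=1) + np.log(priors)

def naive_bayes_posterior(x, mu, priors):
    a = naive_bayes_log_posteriors(x, mu, priors)
    e = np.exp(a - a.max())                 # normalised exponential (softmax)
    return e / e.sum()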

Probabilistic Discriminative Models

Discriminative training: learn only to discriminate between the classes.
Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly.
Typically fewer parameters to be determined.
As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as p(x) is not modelled.
As an aside: certain theoretical analyses show that generative models converge faster to their (albeit worse) asymptotic classification performance and are superior in some regimes.

Original Input versus Feature Space

So far in classification, we have used the input x directly.
All classification algorithms work also if we first apply a
fixed nonlinear transformation of the inputs using a vector
of basis functions φ(x).
Example: Use two Gaussian basis functions centered at
the green crosses in the input space.

[Figure: original input space (x1, x2) with two Gaussian basis functions centred at the green crosses (left), and the corresponding feature space (φ1, φ2) (right).]

Original Input versus Feature Space

Linear decision boundaries in the feature space generally
correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input
space may become linearly separable in the feature space:

[Figure: data in the input space (x1, x2), where the classes are not linearly separable (left), and in the feature space (φ1, φ2), where a linear boundary separates them (right).]

If classes overlap in input space, they will also overlap in feature space; nonlinear features φ(x) cannot remove the overlap, and may even increase it.

Original Input versus Feature Space

Fixed basis functions do not adapt to the data and
therefore have important limitations (see discussion in
Linear Regression).
Understanding of more advanced algorithms becomes
easier if we introduce the feature space now and use it
instead of the original input space.
Some applications use fixed features successfully by
avoiding the limitations.
We will therefore use φ instead of x from now on.

Logistic Regression

Two classes, where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector:

p(C1 | φ) = y(φ) = σ(w^T φ)
p(C2 | φ) = 1 − p(C1 | φ)

The model dimension is equal to the dimension M of the feature space.
Compare this to fitting two Gaussians, which has a number of parameters quadratic in M:

2M (means) + M(M + 1)/2 (shared covariance).

For larger M, the logistic regression model has a clear advantage.
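For example, with M = 100 basis functions logistic regression has 100 parameters, while the Gaussian generative model above needs 2 · 100 + 100 · 101/2 = 5250.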

Logistic Regression

Determine the parameters w via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function

p(t | w) = ∏_{n=1}^N yn^{tn} (1 − yn)^{1−tn}

where yn = p(C1 | φn).
Error function: the negative log likelihood, which gives the cross-entropy error function

E(w) = − ln p(t | w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) }

Logistic Regression

Error function (cross-entropy loss)

E(w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) },   yn = p(C1 | φn) = σ(w^T φn)

We obtain the gradient of the error function using the chain rule and the sigmoid result dσ/da = σ(1 − σ) (derive it):

∇E(w) = ∑_{n=1}^N (yn − tn) φn

For each data point, the contribution to the gradient is the product of the deviation yn − tn and the basis vector φn.
We can now use gradient descent, as sketched below.
We may easily modify this to reduce over-fitting by using a regularised error or MAP (how?).
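
A minimal gradient-descent sketch for this error function (the step size, iteration count and the optional quadratic regulariser are my own choices, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, lr=0.01, n_iters=2000, lam=0.0):
    # Minimise E(w) by gradient descent.
    # Phi: N x M design matrix with rows phi_n; t: targets in {0, 1}.
    # lam > 0 adds a simple quadratic regulariser (a crude MAP-style penalty).
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                 # yn = sigma(w^T phi_n)
        grad = Phi.T @ (y - t) + lam * w     # sum_n (yn - tn) phi_n, plus the penalty term
        w -= lr * grad
    return w

The step size may need tuning for a given data set; second-order schemes such as iterative reweighted least squares typically converge much faster.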

Laplace Approximation

Given a continuous distribution p(z) which is not Gaussian, can we approximate it by a Gaussian q(z)?
We need to find a mode of p(z), and then fit a Gaussian with the same mode:

[Plot: p.d.f. of a non-Gaussian distribution (yellow) and its Gaussian approximation (red).]

[Plot: negative log p.d.f. of the non-Gaussian distribution (yellow) and its Gaussian approximation (red).]

Laplace Approximation

Cheap and nasty, but sometimes effective.
Assume p(z) can be written as

p(z) = (1/Z) f(z)

with normalisation Z = ∫ f(z) dz.
We do not even need to know Z to find the Laplace approximation.
A mode of p(z) is at a point z0 where p′(z0) = 0.
Taylor expansion of ln f(z) around z0:

ln f(z) ≈ ln f(z0) − (1/2) A (z − z0)^2

where

A = − (d²/dz²) ln f(z) |_{z=z0}

Laplace Approximation

Exponentiating

ln f(z) ≈ ln f(z0) − (1/2) A (z − z0)^2

we get

f(z) ≈ f(z0) exp{ −(A/2)(z − z0)^2 }.

After normalisation we get the Laplace approximation

q(z) = (A / 2π)^{1/2} exp{ −(A/2)(z − z0)^2 }.

This is only defined for precision A > 0, since only then does p(z) have a maximum at z0.
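
A small numerical sketch of the 1-D case, using a grid search for the mode and a finite-difference second derivative; the target density is an illustrative non-Gaussian example, not prescribed by the slides:

import numpy as np

def log_f(z):
    # Unnormalised log density: f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4).
    return -0.5 * z**2 - np.logaddexp(0.0, -(20.0 * z + 4.0))

def laplace_1d(log_f, z_grid, h=1e-4):
    z0 = z_grid[np.argmax(log_f(z_grid))]    # crude mode search on the grid
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2   # A = -d^2/dz^2 ln f(z) at z0
    return z0, A

z_grid = np.linspace(-5.0, 5.0, 20001)
z0, A = laplace_1d(log_f, z_grid)
q = lambda z: np.sqrt(A / (2.0 * np.pi)) * np.exp(-0.5 * A * (z - z0)**2)   # Laplace approximation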

Laplace Approximation – Vector Space

Approximate p(z) for z ∈ R^M,

p(z) = (1/Z) f(z).

We get the Taylor expansion

ln f(z) ≈ ln f(z0) − (1/2)(z − z0)^T A (z − z0)

where the Hessian A is defined as

A = −∇∇ ln f(z) |_{z=z0}.

The Laplace approximation of p(z) is then

q(z) ∝ exp{ −(1/2)(z − z0)^T A (z − z0) }   ⇒   q(z) = N(z | z0, A^{−1})

Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.
Why? We would need to normalise the product of the prior and the likelihood, and the likelihood is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.

Bayesian Logistic Regression

Assume a Gaussian prior:

p(w) = N(w | m0, S0)

for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In
contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1, . . . ,N, the
posterior is given by

p(w | t) ∝ p(w) p(t | w)

where t = (t1, . . . , tN)^T.

Bayesian Logistic Regression

Using our previous result for the cross-entropy error function

E(w) = − ln p(t | w) = − ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) }

we can now calculate the log of the posterior

p(w | t) ∝ p(w) p(t | w)

using the notation yn = σ(w^T φn) as

ln p(w | t) = −(1/2)(w − m0)^T S0^{−1}(w − m0) + ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) } + const.

Bayesian Logistic Regression

To obtain a Gaussian approximation to

ln p(w | t) = −(1/2)(w − m0)^T S0^{−1}(w − m0) + ∑_{n=1}^N { tn ln yn + (1 − tn) ln(1 − yn) } + const.

1 Find wMAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because yn = σ(w^T φn).)
2 Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation

SN^{−1} = −∇∇ ln p(w | t) = S0^{−1} + ∑_{n=1}^N yn(1 − yn) φn φn^T.

Nowadays the gradient and Hessian would be computed with automatic differentiation; one need only implement ln p(w | t).
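
A sketch of these two steps in plain NumPy, using simple gradient ascent instead of automatic differentiation (the step size, iteration count and function names are my own choices):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, lr=0.01, n_iters=5000):
    # Laplace approximation q(w) = N(w | w_map, S_N) for Bayesian logistic regression.
    # Phi: N x M design matrix with rows phi_n; t: targets in {0, 1}.
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    # Step 1: find w_MAP by ascending grad ln p(w | t) = -S0^{-1}(w - m0) + Phi^T (t - y).
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        w += lr * (-S0_inv @ (w - m0) + Phi.T @ (t - y))
    # Step 2: S_N^{-1} = S0^{-1} + sum_n yn (1 - yn) phi_n phi_n^T, evaluated at w_MAP.
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(SN_inv)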

Bayesian Logistic Regression

The Gaussian (Laplace) approximation to the posterior distribution is then

q(w) = N(w | wMAP, SN)

where

SN^{−1} = −∇∇ ln p(w | t) = S0^{−1} + ∑_{n=1}^N yn(1 − yn) φn φn^T.
