
Statistical Machine Learning
Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra, Semester One, 2020
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part VI
Linear Classification 2
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models
Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression

Three Models for Decision Problems
In increasing order of complexity:

Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes’ theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Data Generating Process
Given:
class prior p(t)
class-conditional p(x | t)
to generate data from the model we may do the following:
1 Sample the class label from the class prior p(t).
2 Sample the data features from the class-conditional distribution p(x | t).
(More about sampling later; this is called ancestral sampling.)
Thinking about the data generating process is a useful modelling step, especially when we have more prior knowledge.
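Below is a minimal sketch of ancestral sampling from such a model, assuming two classes with Gaussian class-conditionals sharing a covariance matrix; the prior, means and covariance are made-up values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example model: class prior p(t) and Gaussian class-conditionals p(x | t)
prior = np.array([0.3, 0.7])                 # p(t=0), p(t=1)
means = np.array([[0.0, 0.0], [2.0, 1.0]])   # mu_0, mu_1
cov = np.array([[1.0, 0.3], [0.3, 1.0]])     # shared covariance

def sample(n):
    # 1. sample the class label from the class prior p(t)
    t = rng.choice(2, size=n, p=prior)
    # 2. sample the features from the class-conditional p(x | t)
    x = np.stack([rng.multivariate_normal(means[k], cov) for k in t])
    return x, t

X, t = sample(5)
print(t)
print(X)
```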

Probabilistic Generative Models
Generative approach: model the class-conditional densities p(x | Ck) and the class priors (not parameter priors!) p(Ck) to calculate the posterior probability for class C1,
$$
p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
= \frac{1}{1 + \exp(-a(x))} \equiv \sigma(a(x)),
$$
where a(x) and the logistic sigmoid function σ(a) are given by
$$
a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)},
\qquad
\sigma(a) = \frac{1}{1 + \exp(-a)}.
$$
One point of this re-writing: we may learn a(x) directly, e.g. as a deep neural network.
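A quick numerical check of this identity; the class-conditionals, priors and test point below are arbitrary illustrative values, and scipy is assumed to be available for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary example: two Gaussian class-conditionals and class priors
p1, p2 = 0.4, 0.6
pdf1 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.2], [0.2, 1.0]])
pdf2 = multivariate_normal(mean=[1.5, -0.5], cov=[[0.8, 0.0], [0.0, 1.3]])

x = np.array([0.7, 0.1])

# Posterior via Bayes' theorem directly
num = pdf1.pdf(x) * p1
post = num / (num + pdf2.pdf(x) * p2)

# Posterior via the sigmoid of the log-odds a(x)
a = np.log(pdf1.pdf(x) * p1) - np.log(pdf2.pdf(x) * p2)
sigma = 1.0 / (1.0 + np.exp(-a))

print(post, sigma)   # identical up to floating point
```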

Logistic Sigmoid
The logistic sigmoid function is called a “squashing function” because it squashes the real axis into a finite interval (0, 1).
Well-known properties (derive them):
Symmetry: $\sigma(-a) = 1 - \sigma(a)$
Derivative: $\frac{d\sigma(a)}{da} = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$
The inverse of σ is called the logit function.
[Figure: the logistic sigmoid $\sigma(a) = \frac{1}{1 + \exp(-a)}$ (left) and its inverse, the logit $a(\sigma) = \ln\frac{\sigma}{1 - \sigma}$ (right).]
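The properties above are easy to verify numerically; a small sketch (the helper functions sigmoid and logit below are defined here only for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))

a = np.linspace(-5, 5, 11)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Derivative: d sigma / da = sigma(a)(1 - sigma(a)), checked by finite differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(a) * (1.0 - sigmoid(a)), atol=1e-8)

# The logit is the inverse of the sigmoid
assert np.allclose(logit(sigmoid(a)), a)
print("all sigmoid identities hold numerically")
```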

Probabilistic Generative Models – Multiclass
The normalised exponential is given by
$$
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
= \frac{\exp(a_k)}{\sum_j \exp(a_j)}
$$
where
$$
a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big).
$$
Usually called the softmax function as it is a smoothed version of the arg max function; in particular,
$$
a_k \gg a_j \;\; \forall j \neq k
\;\Rightarrow\;
\big(p(C_k \mid x) \approx 1 \;\wedge\; p(C_j \mid x) \approx 0\big).
$$
So, softargmax is a more descriptive though less common name.
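A small sketch of the normalised exponential (softmax); subtracting the maximum a_k before exponentiating is a standard numerical-stability trick and does not change the result, since adding a constant to every a_k cancels in the ratio.

```python
import numpy as np

def softmax(a):
    # Subtract the maximum for numerical stability; the normalised exponential
    # is invariant to adding a constant to all a_k.
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
print(softmax(a))                          # smooth probabilities summing to 1

# When one a_k dominates, softmax approaches a one-hot arg max indicator
print(softmax(np.array([50.0, 1.0, 0.1])))
```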

Probabilistic Generative Model – Continuous Input
Assume the class-conditional probabilities are Gaussian, with the same covariance and different means:
$$
p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
\exp\Big\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \Big\}.
$$
Let's characterise the posterior probabilities. Separating the terms that are quadratic and linear in x,
$$
p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
\exp\Big\{ -\tfrac{1}{2} x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \Big\}.
$$

Probabilistic Generative Model – Continuous Input
For two classes, p(C1 | x) = σ(a(x)) (c.f. the previous slide). Therefore
$$
a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
= \ln \frac{\exp\big(\mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1\big)}
           {\exp\big(\mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2\big)}
+ \ln \frac{p(C_1)}{p(C_2)}
$$
and a(x) is linear because the quadratic terms in x cancel. Hence
$$
p(C_1 \mid x) = \sigma(w^T x + w_0)
$$
where
$$
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.
$$
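A numerical check of this linear-discriminant form against a direct application of Bayes' theorem; the parameters and test point are arbitrary example values, and scipy is assumed available for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary example parameters (shared covariance, different means)
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
p1, p2 = 0.3, 0.7

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

x = np.array([0.2, -0.4])
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Closed-form linear discriminant vs direct Bayes computation
lhs = sigmoid(w @ x + w0)
num = multivariate_normal(mu1, Sigma).pdf(x) * p1
rhs = num / (num + multivariate_normal(mu2, Sigma).pdf(x) * p2)
print(lhs, rhs)   # agree up to floating point
```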

Probabilistic Generative Model – Continuous Input
[Figure: class-conditional densities for two classes (left) and the posterior probability p(C1 | x) (right). Note that the posterior is a logistic sigmoid of a linear function of x.]

General Case – K Classes, Shared Covariance
Use the normalised exponential
$$
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
= \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big),
$$
to get a linear function of x,
$$
a_k(x) = w_k^T x + w_{k0},
$$
where
$$
w_k = \Sigma^{-1} \mu_k,
\qquad
w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).
$$
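A minimal sketch of the K-class case with shared covariance, computing $a_k(x) = w_k^T x + w_{k0}$ and the softmax posterior; the three-class parameters below are made-up illustrative values. (The common quadratic term $-\tfrac{1}{2}x^T\Sigma^{-1}x$ is omitted since it cancels in the softmax.)

```python
import numpy as np

def shared_cov_posteriors(x, mus, Sigma, priors):
    """Posterior p(C_k | x) for Gaussian class-conditionals with shared covariance."""
    Sinv = np.linalg.inv(Sigma)
    W = mus @ Sinv                                        # row k is w_k^T = mu_k^T Sigma^{-1}
    w0 = -0.5 * np.einsum('kd,dj,kj->k', mus, Sinv, mus) + np.log(priors)
    a = W @ x + w0                                        # a_k(x) = w_k^T x + w_k0
    a -= a.max()                                          # stabilised softmax
    e = np.exp(a)
    return e / e.sum()

# Made-up three-class example
mus = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
Sigma = np.eye(2)
priors = np.array([0.2, 0.5, 0.3])
print(shared_cov_posteriors(np.array([1.0, 1.0]), mus, Sigma, priors))
```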

General Case – K Classes, Different Covariance
If the class-conditional distributions have different covariances Σk, the quadratic terms $-\tfrac{1}{2} x^T \Sigma_k^{-1} x$ do not cancel out.
We get a quadratic discriminant.


Parameter Estimation
Given the functional form of the class-conditional densities p(x | Ck), how can we determine the parameters μ1, μ2 and Σ and the class prior?
The simplest approach is maximum likelihood.
Given also a data set (xn, tn) for n = 1,…,N, using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 denotes class C2.
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability p(C1) = π, and therefore p(C2) = 1 − π.
Then
$$
p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)
$$
$$
p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma).
$$

Maximum Likelihood Solution
Thus the likelihood for the whole data set X and t is given by
$$
p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma)
= \prod_{n=1}^{N} \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n}
  \times \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n}.
$$
Maximise the log likelihood. The term depending on π is
$$
\sum_{n=1}^{N} \big\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big\}
$$
which is maximal for (derive it)
$$
\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}
$$
where N1 is the number of data points in class C1 (and N2 the number in class C2).

Maximum Likelihood Solution
Similarly, we can maximise the likelihood p(t, X | π, μ1, μ2, Σ) w.r.t. the means μ1 and μ2, to get
$$
\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n,
\qquad
\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n.
$$
For each class, these are the means of all input vectors assigned to that class.

Maximum Likelihood Solution
Finally, the log likelihood ln p(t, X | π, μ1, μ2, Σ) can be maximised w.r.t. the covariance Σ, resulting in
$$
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,
\qquad
S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.
$$
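Putting these maximum likelihood estimates together, a minimal sketch that recovers π, μ1, μ2 and Σ from data, checked on synthetic data drawn from the model itself; all names, seeds and parameter values below are illustrative.

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """Maximum likelihood estimates of pi, mu1, mu2 and Sigma for the two-class
    generative model, with t_n = 1 for class C1 and t_n = 0 for class C2."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = X[t == 1].mean(axis=0)                      # (1/N1) sum_n t_n x_n
    mu2 = X[t == 0].mean(axis=0)                      # (1/N2) sum_n (1 - t_n) x_n
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

# Sanity check on synthetic data drawn from the model itself (made-up parameters)
rng = np.random.default_rng(1)
Sigma_true = np.array([[1.0, 0.4], [0.4, 1.5]])
t = (rng.random(2000) < 0.3).astype(int)
X = np.where(t[:, None] == 1,
             rng.multivariate_normal([2.0, 0.0], Sigma_true, size=2000),
             rng.multivariate_normal([-1.0, 1.0], Sigma_true, size=2000))
pi, mu1, mu2, Sigma = fit_shared_cov_gaussians(X, t)
print(pi, mu1, mu2, Sigma, sep="\n")
```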

Discrete Features – Naïve Bayes
Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with $2^D$ entries.
Together with the normalisation constraint, these are $2^D - 1$ independent variables.
This grows exponentially with the number of features.
The naïve Bayes assumption is that, given the class Ck, the features are independent of each other:
$$
p(x \mid C_k) = \prod_{i=1}^{D} p(x_i \mid C_k)
= \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
$$

Discrete Features – Naïve Bayes
With the naïve Bayes model
$$
p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
$$
we can then again find the factors a_k in the normalised exponential
$$
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
= \frac{\exp(a_k)}{\sum_j \exp(a_j)}
$$
as a linear function of the x_i:
$$
a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k).
$$
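A minimal sketch of the resulting Bernoulli naïve Bayes classifier, computing a_k(x) and the softmax posterior for a binary feature vector; the table of μ_{ki} values and the priors are made up for illustration.

```python
import numpy as np

def naive_bayes_posteriors(x, mu, priors):
    """Posterior over classes for binary features x under the naive Bayes model.
    mu[k, i] = p(x_i = 1 | C_k); priors[k] = p(C_k)."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    a -= a.max()                       # stabilised normalised exponential (softmax)
    e = np.exp(a)
    return e / e.sum()

mu = np.array([[0.9, 0.2, 0.5],        # class C1
               [0.3, 0.7, 0.5]])       # class C2
priors = np.array([0.4, 0.6])
x = np.array([1, 0, 1])
print(naive_bayes_posteriors(x, mu, priors))
```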

Three Models for Decision Problems
In increasing order of complexity:

Discriminant Functions
Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2 Use decision theory to assign each new x to one of the classes.

Generative Models
1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2 Also, infer the prior class probabilities p(Ck).
3 Use Bayes’ theorem to find the posterior p(Ck | x).
4 Alternatively, model the joint distribution p(x, Ck) directly.
5 Use decision theory to assign each new x to one of the classes.

Probabilistic Discriminative Models
Discriminative training: learn only to discriminate between the classes.
Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly.
Typically there are fewer parameters to be determined.
As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as p(x) is not modelled.
As an aside: certain theoretical analyses show that generative models converge faster to their (albeit worse) asymptotic classification performance, and are superior in some regimes.

Original Input versus Feature Space
So far in classification we have used the input x directly.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centred at the green crosses in the input space.
[Figure: the input space (x1, x2) with the two basis function centres (left) and the corresponding feature space (φ1, φ2) (right).]

Original Input versus Feature Space
Linear decision boundaries in the feature space generally correspond to nonlinear boundaries in the input space.
Classes which are NOT linearly separable in the input space may become linearly separable in the feature space:
[Figure: data that are not linearly separable in the input space (x1, x2) (left) become linearly separable in the feature space (φ1, φ2) (right).]
If classes overlap in input space, they will also overlap in feature space; nonlinear features φ(x) cannot remove the overlap, and may even increase it.

Original Input versus Feature Space
Fixed basis functions do not adapt to the data and therefore have important limitations (see discussion in Linear Regression).
Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use φ instead of x from now on.
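A minimal sketch of such a fixed feature map, using Gaussian basis functions with a hand-chosen width s; the centres and inputs below are illustrative values.

```python
import numpy as np

def gaussian_features(X, centres, s=1.0):
    """Map inputs X (N x D) to features phi(x) via Gaussian basis functions
    centred at the given points (an illustrative choice of fixed basis)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * s ** 2))

# Two basis functions in a 2-d input space, as in the figures above
centres = np.array([[-0.5, 0.0], [0.5, 0.0]])
X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
print(gaussian_features(X, centres))   # each row is (phi_1(x), phi_2(x))
```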

Logistic Regression
Consider two classes, where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector:
$$
p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi),
\qquad
p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi).
$$
The model dimension is equal to the dimension M of the feature space.
Compare this to fitting two Gaussians, which requires a number of parameters quadratic in M:
$$
\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}}.
$$
For larger M, the logistic regression model has a clear advantage.

Logistic Regression
Determine the parameters via maximum likelihood for data (φn, tn), n = 1,…,N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function:
$$
p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},
\qquad y_n = p(C_1 \mid \phi_n).
$$
Error function: the negative log likelihood, which gives the cross-entropy error function
$$
E(w) = -\ln p(\mathbf{t} \mid w)
= -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}.
$$

Logistic Regression
Error function (cross-entropy loss):
$$
E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
\qquad
y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n).
$$
We obtain the gradient of the error function using the chain rule and the sigmoid derivative $\frac{d\sigma}{da} = \sigma(1 - \sigma)$ (derive it):
$$
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.
$$
For each data point, the error contribution is the product of the deviation y_n − t_n and the basis function vector φ_n.
We can now use gradient descent (a minimal sketch is given below).
We may easily modify this to reduce over-fitting by using a regularised error or MAP (how?).
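A minimal sketch of maximum likelihood (optionally regularised) logistic regression by batch gradient descent, using the gradient derived above; the toy data, learning rate and step count are illustrative choices, not part of the lecture material.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.5, n_steps=5000, lam=0.0):
    """Minimise the cross-entropy error E(w) by batch gradient descent.
    lam > 0 adds a simple quadratic regulariser (a crude MAP-style way to reduce over-fitting)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w   # nabla E(w) = sum_n (y_n - t_n) phi_n (+ regulariser)
        w -= lr * grad / len(t)
    return w

# Toy data: a bias feature plus two inputs, labels drawn from a known weight vector
rng = np.random.default_rng(2)
Phi = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.random(200) < sigmoid(Phi @ w_true)).astype(float)
print(fit_logistic(Phi, t))   # roughly recovers w_true
```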

Laplace Approximation
Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?
We need to find a mode of p(x), and then try to find a Gaussian with the same mode:
[Figure: p.d.f. of a non-Gaussian distribution (yellow) and its Gaussian approximation (red) (left); the corresponding negative log p.d.f.s (right).]

Laplace Approximation
The Laplace approximation is cheap and nasty, but sometimes effective.
Assume p(z) can be written as
$$
p(z) = \frac{1}{Z} f(z)
$$
with normalisation $Z = \int f(z)\, dz$.
We do not even need to know Z to find the Laplace approximation.
A mode of p(z) is at a point z0 where p′(z0) = 0. The Taylor expansion of ln f(z) at z0 gives
$$
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2
$$
where
$$
A = -\frac{d^2}{dz^2} \ln f(z) \Big|_{z = z_0}.
$$

Laplace Approximation
Exponentiating
$$
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2
$$
we get
$$
f(z) \simeq f(z_0) \exp\big\{ -\tfrac{A}{2} (z - z_0)^2 \big\},
$$
and after normalisation we get the Laplace approximation
$$
q(z) = \Big(\frac{A}{2\pi}\Big)^{1/2} \exp\big\{ -\tfrac{A}{2} (z - z_0)^2 \big\}.
$$
It is only defined for precision A > 0, as only then does p(z) have a maximum at z0.
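A minimal sketch of the one-dimensional Laplace approximation; the unnormalised density f(z) below (a Gaussian factor times a sigmoid) is just an illustrative non-Gaussian example, the mode is found numerically with scipy, and A is estimated by finite differences.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Unnormalised density f(z); we never need the normaliser Z.
def log_f(z):
    return -0.5 * z ** 2 + np.log(sigmoid(20.0 * z + 4.0))

# 1. Find a mode z0 of p(z) by maximising ln f(z).
res = minimize_scalar(lambda z: -log_f(z), bounds=(-5.0, 5.0), method="bounded")
z0 = res.x

# 2. A = -(d^2/dz^2) ln f(z) at z0, estimated here by central finite differences.
eps = 1e-3
A = -(log_f(z0 + eps) - 2.0 * log_f(z0) + log_f(z0 - eps)) / eps ** 2

# Laplace approximation: q(z) = N(z | z0, 1/A).
print("mode z0:", z0, " precision A:", A, " variance 1/A:", 1.0 / A)
```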

Laplace Approximation – Vector Space
Approximate p(z) for $z \in \mathbb{R}^M$, where
$$
p(z) = \frac{1}{Z} f(z).
$$
We get the Taylor expansion
$$
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} (z - z_0)^T A (z - z_0)
$$
where the Hessian A is defined as
$$
A = -\nabla\nabla \ln f(z) \big|_{z = z_0}.
$$
The Laplace approximation of p(z) is then
$$
q(z) \propto \exp\big\{ -\tfrac{1}{2} (z - z_0)^T A (z - z_0) \big\}
\;\Rightarrow\;
q(z) = \mathcal{N}(z \mid z_0, A^{-1}).
$$

Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable.
Why? We need to normalise a product of the prior and the likelihood, which itself is a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable. Therefore we will use the Laplace approximation.
The predictive distribution remains intractable even under the Laplace approximation to the posterior distribution, but it can be approximated.

Bayesian Logistic Regression
Assume a Gaussian prior
$$
p(w) = \mathcal{N}(w \mid m_0, S_0)
$$
for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1,…,N, the posterior is given by
$$
p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)
$$
where $\mathbf{t} = (t_1, \dots, t_N)^T$.

Bayesian Logistic Regression
Using our previous result for the cross-entropy error function
$$
E(w) = -\ln p(\mathbf{t} \mid w)
= -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
$$
we can now calculate the log of the posterior p(w | t) ∝ p(w) p(t | w), using the notation $y_n = \sigma(w^T \phi_n)$, as
$$
\ln p(w \mid \mathbf{t})
= -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}.
$$

Bayesian Logistic Regression
To obtain a Gaussian approximation to
$$
\ln p(w \mid \mathbf{t})
= -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
+ \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}:
$$
1 Find w_MAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because $y_n = \sigma(w^T \phi_n)$.)
2 Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation,
$$
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t})
= S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
$$
Nowadays the gradient and Hessian would be computed with automatic differentiation; one need only implement ln p(w | t).

Bayesian Logistic Regression
The Gaussian approximation (via the Laplace approximation) to the posterior distribution is therefore
$$
q(w) = \mathcal{N}(w \mid w_{\mathrm{MAP}}, S_N)
$$
where
$$
S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t})
= S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
$$
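Finally, a minimal sketch tying the pieces together: a Laplace approximation $q(w) = \mathcal{N}(w \mid w_{\mathrm{MAP}}, S_N)$ for Bayesian logistic regression, with w_MAP found here by plain gradient ascent on ln p(w | t) rather than IRLS; the toy data, prior and step sizes are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, lr=0.5, n_steps=10000):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression.
    w_MAP is found by simple gradient ascent on ln p(w | t); IRLS or automatic
    differentiation could be used instead."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_steps):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)              # gradient of ln p(w | t)
        w += lr * grad / len(t)
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi     # S_N^{-1}
    return w, np.linalg.inv(SN_inv)

# Toy example: one bias feature plus one input, with a broad Gaussian prior
rng = np.random.default_rng(3)
Phi = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
t = (rng.random(100) < sigmoid(Phi @ np.array([0.5, -1.5]))).astype(float)
m0, S0 = np.zeros(2), 10.0 * np.eye(2)
w_map, S_N = laplace_posterior(Phi, t, m0, S0)
print("w_MAP:", w_map)
print("S_N:\n", S_N)
```

The resulting q(w) can then be used to approximate the predictive distribution, which, as noted above, remains intractable but can itself be approximated.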