
Statistical Machine Learning

Christian Walder + Lexing Xie

Machine Learning Research Group, CSIRO Data61
ANU Computer Science

Canberra
Semester One, 2021.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part I

Linear Regression 2


Linear Regression

Basis functions
Maximum Likelihood with Gaussian Noise
Regularisation
Bias-variance decomposition


Training and Testing:
(Non-Bayesian) Point Estimate

[Diagram: Training phase — a model with adjustable parameter w is fit to training data x and training targets t, fixing the most appropriate w*. Test phase — the model with fixed parameter w* maps test data x to a predicted test target t.]


Bayesian Regression

Bayes' Theorem

posterior = (likelihood × prior) / normalisation

p(w | t) = p(t | w) p(w) / p(t)

where we left out the conditioning on x (always assumed), and on β, which is assumed to be constant.
The i.i.d. regression likelihood for additive Gaussian noise is

p(t | w) = ∏_{n=1}^N N(t_n | y(x_n, w), β^{-1})

         = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1})

         = const × exp{ −(β/2) (t − Φw)^T (t − Φw) }

         = N(t | Φw, β^{-1} I)
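
To make this concrete, here is a minimal NumPy sketch that evaluates this log likelihood up to its additive constant. The polynomial basis, the toy data and β = 25 are illustrative assumptions, not part of the slide.

```python
import numpy as np

def design_matrix(x, degree=3):
    """Polynomial design matrix Phi with columns phi_j(x) = x**j (an assumed basis)."""
    return np.vander(x, N=degree + 1, increasing=True)   # shape (N, M)

def log_likelihood(w, x, t, beta, degree=3):
    """ln p(t | w) = -(beta/2) * ||t - Phi w||^2 + const (const omitted)."""
    Phi = design_matrix(x, degree)
    resid = t - Phi @ w
    return -0.5 * beta * resid @ resid

# toy data (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

w = np.zeros(4)                      # some candidate weight vector
print(log_likelihood(w, x, t, beta=25.0))
```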


How to choose a prior?

The choice of prior affords intuitive control over our inductive bias.
All inference schemes have such biases, and they often arise more opaquely than the prior in Bayes' rule does.
Can we find a prior for the given likelihood which

makes sense for the problem at hand?
allows us to find a posterior in a 'nice' form?

An answer to the second question:

Definition (Conjugate Prior)

A class of prior probability distributions p(w) is conjugate to a
class of likelihood functions p(x |w) if the resulting posterior
distributions p(w | x) are in the same family as p(w).


Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood           Conjugate Prior
    Bernoulli            Beta
    Binomial             Beta
    Poisson              Gamma
    Multinomial          Dirichlet

Table: Continuous likelihood distributions

    Likelihood           Conjugate Prior
    Uniform              Pareto
    Exponential          Gamma
    Normal               Normal (mean parameter)
    Multivariate normal  Multivariate normal (mean parameter)
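
As a concrete instance of the Bernoulli–Beta row, a small sketch of a conjugate update (the coin data and the Beta(2, 2) hyperparameters are made-up illustrations): the posterior stays in the Beta family, with the observed counts simply added to the prior's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli likelihood with unknown parameter mu; Beta(a0, b0) prior (values assumed)
a0, b0 = 2.0, 2.0
flips = rng.binomial(1, p=0.7, size=50)        # synthetic observations

# Conjugacy: Beta prior + Bernoulli likelihood -> Beta posterior
a_post = a0 + flips.sum()                      # add the number of ones
b_post = b0 + (len(flips) - flips.sum())       # add the number of zeros

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), "
      f"mean = {a_post / (a_post + b_post):.3f}")
```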


Conjugate Prior to a Gaussian Distribution

Example: if the likelihood function is Gaussian, choosing a Gaussian prior for the mean will ensure that the posterior distribution is also Gaussian.
Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form

p(x) = N(x | μ, Λ^{-1})

p(y | x) = N(y | Ax + b, L^{-1})

we get

p(y) = N(y | Aμ + b, L^{-1} + A Λ^{-1} A^T)

p(x | y) = N(x | Σ{A^T L (y − b) + Λμ}, Σ)

where Σ = (Λ + A^T L A)^{-1}.

Note that the covariance Σ does not involve y.
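
A quick Monte Carlo sanity check of the marginal p(y) above; the dimensions and parameter values are arbitrary choices for illustration, not from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# arbitrary small example: x in R^2, y in R^3 (all values assumed)
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])        # precision of x
A = rng.normal(size=(3, 2))
b = np.array([0.5, 0.0, -1.0])
L = np.diag([4.0, 2.0, 1.0])                    # precision of the noise on y

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
noise = rng.multivariate_normal(np.zeros(3), np.linalg.inv(L), size=n)
y = x @ A.T + b + noise                         # y = Ax + b + noise

# compare empirical moments of y with A mu + b and L^{-1} + A Lam^{-1} A^T
print(np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05))                               # expect True
print(np.allclose(np.cov(y.T), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T,
                  atol=0.1))                                                            # expect True
```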


Conjugate Prior to a Gaussian Distribution
(intuition)

Given

p(x) = N(x | μ, Λ^{-1})

p(y | x) = N(y | Ax + b, L^{-1})   ⇔   y = Ax + b + N(0, L^{-1}),

we have E[y] = E[Ax + b] = Aμ + b, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,

cov[y] = cov[Ax + b] + cov[N(0, L^{-1})] = A Λ^{-1} A^T + L^{-1},

using cov[Ax + b] = A cov[x] A^T = A Λ^{-1} A^T and cov[N(0, L^{-1})] = L^{-1}.

So y is Gaussian with

p(y) = N(y | Aμ + b, L^{-1} + A Λ^{-1} A^T).

Then letting Σ = (Λ + A^T L A)^{-1} and

p(x | y) = N(x | Σ{A^T L(y − b) + Λμ}, Σ)   ⇔   x = Σ{A^T L(y − b) + Λμ} + N(0, Σ)

yields the correct moments for x, since

E[x] = E[Σ{A^T L(y − b) + Λμ}] = Σ{A^T L(Aμ + b − b) + Λμ}

     = Σ{A^T L A μ + Λμ} = (Λ + A^T L A)^{-1} (A^T L A + Λ) μ = μ,

and a similar (but tedious; don't do it) calculation recovers cov[x] = Λ^{-1}.


Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0:

p(w) = N(w | m_0, S_0)

Same likelihood as before (here written in vector form):

p(t | w, β) = N(t | Φw, β^{-1} I)

Given N data pairs (x_n, t_n), the posterior is

p(w | t) = N(w | m_N, S_N)

where

m_N = S_N (S_0^{-1} m_0 + β Φ^T t)

S_N^{-1} = S_0^{-1} + β Φ^T Φ

(derive this with the identities on the previous slides)
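
A minimal sketch of this posterior update. Only the two formulas for m_N and S_N come from the slide; the polynomial basis, the toy data and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) for Gaussian prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi          # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return mN, SN

# toy usage with an assumed polynomial basis and made-up data
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)

m0, S0 = np.zeros(4), np.eye(4) / 2.0             # e.g. an isotropic prior with alpha = 2 (assumed)
mN, SN = posterior(Phi, t, m0, S0, beta=25.0)
print(mN)
```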


Bayesian Regression: Zero Mean, Isotropic Prior

For simplicity we proceed with m_0 = 0 and S_0 = α^{-1} I, so

p(w | α) = N(w | 0, α^{-1} I)

The posterior becomes p(w | t) = N(w | m_N, S_N) with

m_N = β S_N Φ^T t

S_N^{-1} = α I + β Φ^T Φ

For α ≪ β we get

m_N → w_ML = (Φ^T Φ)^{-1} Φ^T t

Log of posterior is sum of log likelihood and log of prior:

ln p(w | t) = −(β/2) (t − Φw)^T (t − Φw) − (α/2) w^T w + const


Bayesian Regression

Log of posterior is sum of log likelihood and log of prior:

ln p(w | t) = −(β/2) ‖t − Φw‖² − (α/2) ‖w‖² + const,

where the first term is the (scaled) sum-of-squares error and the second the (scaled) quadratic regulariser.

The maximum a posteriori estimator

w_MAP = arg max_w p(w | t)

corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient λ = α/β (see the sketch below).
The posterior is Gaussian, so mode = mean: w_MAP = m_N.
For α ≪ β we recover unregularised least squares (equivalently, MAP approaches maximum likelihood), for example in the case of

an infinitely broad prior with α → 0
an infinitely precise likelihood with β → ∞
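
A small numerical check of this correspondence, assuming a polynomial basis and toy sinusoidal data (neither is fixed by the slide): the posterior mean m_N coincides with the ridge-regression solution obtained with λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)          # assumed polynomial basis

alpha, beta = 2.0, 25.0                           # assumed hyperparameters
lam = alpha / beta                                # lambda = alpha / beta

# Bayesian posterior mean with a zero-mean isotropic prior
SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularised least squares (ridge) solution
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)

print(np.allclose(mN, w_ridge))                   # True: w_MAP = m_N = ridge solution
```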


Bayesian Inference in General:
Sequential Update of Belief

If we have not yet seen any data point (N = 0), the posterior is equal to the prior.

Sequential arrival of data points: the posterior given the data observed so far acts as the prior for future data.
This fits a sequential learning framework nicely.


Sequential Update of the Posterior

Example of a linear basis function model.
Single input x, single output t.
Linear model y(x, w) = w0 + w1 x.
True data distribution sampling procedure:

1. Choose an x_n from the uniform distribution U(x | −1, +1).
2. Calculate f(x_n, a) = a0 + a1 x_n, where a0 = −0.3, a1 = 0.5.
3. Add Gaussian noise with standard deviation σ = 0.2:

t_n ∼ N(t_n | f(x_n, a), 0.04)

Set the precision of the (zero-mean isotropic Gaussian) prior on w to α = 2.0. A sketch of the resulting sequential update follows.
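
A sketch of the sampling recipe above combined with a sequential posterior update. α = 2.0 and the noise level σ = 0.2 are from the slide; the noise precision β = 1/σ² = 25 follows from that, and the number of points is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5             # true parameters from the slide
alpha, beta = 2.0, 1 / 0.2**2  # prior precision (slide) and noise precision 1/sigma^2

# start from the prior N(w | 0, alpha^{-1} I)
m = np.zeros(2)
S = np.eye(2) / alpha

for n in range(20):
    # 1. draw x_n ~ U(-1, 1);  2. evaluate f(x_n, a);  3. add noise with sigma = 0.2
    x = rng.uniform(-1, 1)
    t = a0 + a1 * x + rng.normal(scale=0.2)

    # sequential update: the current posterior acts as the prior for this observation
    phi = np.array([1.0, x])                      # basis (1, x) for y = w0 + w1 x
    S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
    m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
    S = S_new

print(m)   # approaches (a0, a1) = (-0.3, 0.5) as more points arrive
```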


Sequential Update of the Posterior


Predictive Distribution

In the training phase, training data 𝐱 and training targets 𝐭 are provided (boldface denotes the collection of training inputs and targets).
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data 𝐱 and the training targets 𝐭,

p(t | x, 𝐱, 𝐭)

This is the Predictive Distribution (cf. the posterior distribution, which is over the parameters).


How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule:

p(t | x, 𝐱, 𝐭) = ∫ p(t, w | x, 𝐱, 𝐭) dw

              = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

The test target t depends only on the test data x and the model parameter w, but not on the training data and the training targets:

p(t | w, x, 𝐱, 𝐭) = p(t | w, x)

The model parameters w are learned from the training data 𝐱 and the training targets 𝐭 only:

p(w | x, 𝐱, 𝐭) = p(w | 𝐱, 𝐭)

Predictive Distribution:

p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x) p(w | 𝐱, 𝐭) dw
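
Whatever form the posterior takes, this integral can be approximated by sampling. A sketch under assumed placeholder values (the posterior m_N, S_N, the polynomial basis and β are illustrative, not from the slide): draw w from p(w | 𝐱, 𝐭), then t from p(t | w, x), and inspect the resulting samples of t.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume we already have a Gaussian posterior N(w | mN, SN) over the weights
# (placeholder values here; see the earlier posterior computation) and a basis phi(x).
mN = np.array([0.1, 0.4, -0.2, 0.05])
SN = 0.01 * np.eye(4)
phi = lambda x: np.array([x**j for j in range(4)])   # assumed polynomial basis
beta = 25.0                                          # assumed noise precision

# Monte Carlo approximation of p(t | x, training data):
#   draw w^(s) from the posterior, then t^(s) from p(t | w^(s), x)
x_new = 0.3
w_samples = rng.multivariate_normal(mN, SN, size=10_000)
t_samples = w_samples @ phi(x_new) + rng.normal(scale=beta**-0.5, size=10_000)

print(t_samples.mean(), t_samples.std())             # predictive mean and spread
```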


Proof of the Predictive Distribution

The predictive distribution is

p(t | x, 𝐱, 𝐭) = ∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw

because

∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ [ p(t, w, x, 𝐱, 𝐭) / p(w, x, 𝐱, 𝐭) ] [ p(w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) ] dw

                                      = ∫ p(t, w, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭) dw

                                      = p(t, x, 𝐱, 𝐭) / p(x, 𝐱, 𝐭)

                                      = p(t | x, 𝐱, 𝐭),

or simply

∫ p(t | w, x, 𝐱, 𝐭) p(w | x, 𝐱, 𝐭) dw = ∫ p(t, w | x, 𝐱, 𝐭) dw

                                      = p(t | x, 𝐱, 𝐭).


Predictive Distribution with Simplified Prior

Find the predictive distribution

p(t | 𝐭, α, β) = ∫ p(t | w, β) p(w | 𝐭, α, β) dw

(remember: conditioning on x is often suppressed to simplify the notation).
Now we know (neglecting as usual to notate conditioning on x)

p(t | w, β) = N(t | w^T φ(x), β^{-1})

and the posterior was

p(w | 𝐭, α, β) = N(w | m_N, S_N)

where

m_N = β S_N Φ^T 𝐭

S_N^{-1} = α I + β Φ^T Φ


Predictive Distribution with Simplified Prior

If we do the integral (it turns out to be the convolution of the two Gaussians), we get for the predictive distribution

p(t | x, 𝐭, α, β) = N(t | m_N^T φ(x), σ_N²(x))

where the variance σ_N²(x) is given by

σ_N²(x) = 1/β + φ(x)^T S_N φ(x).

This is more easily shown using a similar approach to the earlier "intuition" slide, again with the Bienaymé formula, now using

t = w^T φ(x) + N(0, β^{-1}).

However, this is a linear-Gaussian-specific trick; in general we need to integrate out the parameters.
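
A sketch of these two formulas, assuming Gaussian basis functions, toy sinusoidal data and α = 2, β = 25 (none of which are fixed by this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# assumed Gaussian basis functions (the slides do not fix a basis)
centres = np.linspace(0, 1, 9)
phi = lambda xs: np.exp(-0.5 * (np.atleast_1d(xs)[:, None] - centres)**2 / 0.1**2)

alpha, beta = 2.0, 25.0
Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

def predictive(x_new):
    """Mean and variance of p(t | x_new, t): m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)."""
    ph = phi(x_new)[0]
    return mN @ ph, 1 / beta + ph @ SN @ ph

print(predictive(0.5))
```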


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 1.


Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 2.


Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 4.


Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green)
and added noise. Number of data points N = 25.


Mean of the predictive distribution (red) and regions of one
standard deviation from mean (red shaded).


Predictive Distribution with Simplified Prior

Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 1.



Predictive Distribution with Simplified Prior

Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 2.



Predictive Distribution with Simplified Prior

Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 4.



Predictive Distribution with Simplified Prior

Plots of the function y(x,w) using samples from the posterior
distribution over w. Number of data points N = 25.



Limitations of Linear Basis Function Models

Basis functions φ_j(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:

Data lie close to a nonlinear manifold with intrinsic
dimension much smaller than D. Need algorithms which
place basis functions only where data are (e.g. kernel
methods / Gaussian processes).
Target variables may only depend on a few significant
directions within the data manifold. Need algorithms which
can exploit this property (e.g. linear methods or shallow
neural networks).


Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. No surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε?
The volume scales like r^D, so the volume of a sphere of radius r is V_D(r) = K_D r^D. Therefore (see the sketch below)

(V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D
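
A one-liner confirming how quickly this fraction approaches 1; the values of ε and D are just illustrative.

```python
import numpy as np

eps = 0.01
for D in (1, 2, 5, 20, 100, 1000):
    frac = 1 - (1 - eps) ** D          # fraction of the volume within eps of the surface
    print(f"D = {D:5d}: {frac:.3f}")
```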


Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε:

(V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D

[Plot: volume fraction versus ε for D = 1, 2, 5, 20.]


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.

[Plot: p(r) versus r for D = 1, 2, 20.]


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I.

N(x | 0, I) = (1/(2π)) exp{ −(1/2) x^T x } = (1/(2π)) exp{ −(1/2)(x_1² + x_2²) }

Coordinate transformation:

x_1 = r cos(φ),   x_2 = r sin(φ)

Probability density in the new coordinates:

p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J|

where |J| = r is the determinant of the Jacobian of the coordinate transformation.

p(r, φ | 0, I) = (1/(2π)) r exp{ −(1/2) r² }


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

p(r, φ | 0, I) = (1/(2π)) r exp{ −(1/2) r² }

Integrate over all angles φ:

p(r | 0, I) = ∫_0^{2π} (1/(2π)) r exp{ −(1/2) r² } dφ = r exp{ −(1/2) r² }

The same calculation in general dimension gives p(r) ∝ r^{D−1} exp{ −(1/2) r² }, whose mass moves away from the origin as D grows.

[Plot: p(r) versus r for D = 1, 2, 20. A sampling sketch follows.]
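
A sampling sketch of the same phenomenon (the dimensions and sample size are arbitrary): the radius of draws from a D-dimensional standard Gaussian concentrates in a thin shell whose distance from the origin grows with D.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (1, 2, 20, 200):
    x = rng.normal(size=(20_000, D))           # samples from N(0, I) in D dimensions
    r = np.linalg.norm(x, axis=1)              # their radii
    print(f"D = {D:3d}: mean radius {r.mean():5.2f}, std {r.std():.2f}")
```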


Summary: Linear Regression

Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias-variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality