1 Bayesian Sequential Update (?? marks)
In this section we will explore using Bayesian sequential updating for linear regression.

a) (1 mark) Suppose we estimate a weight vector w from data using a Gaussian prior and a Gaussian likelihood. Write (with appropriate definitions) the prior and posterior for w given N data points. Assume the prior is zero-mean with diagonal covariance matrix α⁻¹I for scalar α > 0.

b) (3 marks) Consider the following data generator, which returns an (xn, tn) pair, where xn ∈ R and tn ∈ R.

import numpy as np

def sim_one_example():
    '''
    Generate one single (x, t) pair,
    where x is drawn uniformly from [-1, 1),
    y(x, w) = w0 + w1 * x, and
    t is drawn from N(t | y(x, w), sigma^2).
    '''
    w0, w1, sigma = -0.2, 0.8, 0.04

    x = np.random.uniform(-1, 1)
    y = w0 + w1 * x
    t = np.random.normal(y, sigma)    # numpy's normal takes the standard deviation

    return x, t
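
For instance (a minimal usage sketch, not part of the question), a dataset of N pairs could be collected from this generator as:

# Hypothetical usage: draw 20 (x, t) pairs from the generator above.
data = [sim_one_example() for _ in range(20)]
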
The following illustration contains 8 plots arranged in 3 rows by 3 columns (the likelihood panel in the first row is empty, since no data has been observed at that point). The columns are the likelihood, the prior/posterior, and 5 samples from the posterior predictive distribution. On the rightmost plot, the data is also shown as circles. The first row depicts the situation before observing any data, and the following two rows depict the situation after observing the first and second (example, label) pair respectively.


[Figure: 3 rows × 3 columns of panels. Columns: likelihood (axes w0, w1), prior/posterior (axes w0, w1), and data space (axes x, y). Rows: before observing any data, after the first observation, and after the second observation.]

Discuss the plots, explaining why they make sense. Argue using the concrete example.

c) (1 mark) The plots below show the situation after 5 samples have been observed. Identify the most recently added data point in the right-hand figure. Explain your reasoning.

[Figure: likelihood (axes w0, w1), prior/posterior (axes w0, w1), and data space (axes x, y) panels after observing 5 data points.]

d) (2 marks) Write down the update equations for the mean vector and covariance matrix of the
(Gaussian) posterior distribution, when observing a single new example (xn, tn).
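
For reference, a minimal numeric sketch of one possible form of this update (the standard Gaussian conjugate update for Bayesian linear regression with basis φ(x) = [1, x], matching the data generator above). The names m, S, alpha and beta below are illustrative assumptions, not prescribed by the question:

import numpy as np

def sequential_update(m, S, x_n, t_n, beta):
    '''Update the posterior mean m and covariance S after observing one (x_n, t_n).'''
    phi = np.array([1.0, x_n])                        # basis phi(x) = [1, x]
    S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
    m_new = S_new @ (np.linalg.inv(S) @ m + beta * phi * t_n)
    return m_new, S_new

alpha, sigma = 2.0, 0.04                              # example prior precision and noise std
m, S = np.zeros(2), np.eye(2) / alpha                 # zero-mean prior with covariance (1/alpha) I
# for x_n, t_n in observed_pairs:
#     m, S = sequential_update(m, S, x_n, t_n, beta=1.0 / sigma**2)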


2 Logistic Regression (?? marks)
In the following questions we consider logistic regression without quadratic regularisation. For the
questions which require code, use Python 3 and include comments which specify the input and output
formats for your functions. Note that marks are allocated for documentation.

You may assume the following preamble:

import numpy as np
a) (1 mark) For binary logistic regression (as discussed in the lecture and used in the tutorial), how are the labels encoded?

b) (1 mark) There are three possible output types that a logistic regression classifier can produce, corresponding to the linear model output, the probabilistic output and the classification output. The three output types are called decision_function, predict and predict_proba in the scikit-learn interface (in no particular order). Explain (using both text and equations) the three output types.

c) (1 mark) Define and explain the purpose of a confusion matrix (using both text and equations).

d) (1 mark) Write down the mathematical definition of the sigmoid function y = σ(x).

e) (1 mark) Write down the mathematical definition of the cost function for binary logistic regression (average cross-entropy). Do not forget to define your notation.

f) (2 marks) Calculate the mathematical form of the gradient (with respect to the model parameters) of the cost function above.

g) (2 marks) Consider a single (input, target) pair. Draw graphical models which depict

i) the logistic regression model, and

ii) the naive Bayes classifier model.
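
As a minimal numeric sketch of the quantities in parts d)–f) (the array names X, w and t below are illustrative assumptions: X an N×D design matrix, w a weight vector, and t binary labels in {0, 1}):

import numpy as np

def sigmoid(a):
    '''Sigmoid y = 1 / (1 + exp(-a)).'''
    return 1.0 / (1.0 + np.exp(-a))

def cost(w, X, t):
    '''Average cross-entropy: -(1/N) sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)].'''
    y = sigmoid(X @ w)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, X, t):
    '''Gradient of the average cross-entropy with respect to w: (1/N) X^T (y - t).'''
    y = sigmoid(X @ w)
    return X.T @ (y - t) / len(t)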


3 Graphical Models – Comparison (?? marks)
(3 marks) Compare Bayesian networks and Markov random fields by describing three similarities and
three differences.


4 Graphical Models – Alarm (?? marks)
John and Mary live in a house which is equipped with a burglar alarm system. The situation can
be described by the following Bayesian network with the random variables B (a Burglar entered the house), E (an Earthquake occurred), A (the Alarm in the house rings), R (Radio news report about an earthquake), J (John calls the police), and M (Mary, who listens to the radio, calls the police).

The domain of each random variable is B = {0,1} encoding False (= 0) and True (= 1).

[Figure: Bayesian network over the random variables B, E, A, R, J and M.]

(a) (1 mark) Write out the joint distribution p(B,E,A,R,J,M) in its factored form according to
this Bayesian network structure.

Express the answers to the following three queries only in terms of marginalisations "∑_{X∈X}" and maximisations "argmax_{X∈X}", where each X is a random variable, and X the corresponding set of possible values X can take. For example, the following identity is acceptable:

∑_{B,A,R,J,M ∈ B} p(B, E = 1, A, R, J, M) = 1

Wherever possible, simplify your expressions by exploiting the structure of the above graphical model.

(b) (1 mark) The probability that the alarm will ring given no observations.

(c) (1 mark) The joint probability that Mary called the police and John did not when there was an
earthquake observed.

(d) (2 marks) The most probable values for the tuple (E,B) if both John and Mary called the police
and at least one of the events E,B did happen.

(e) (2 marks) Write down all conditional independence relations in the Bayesian network when
only A is observed.

(f) (2 marks) Write down all conditional independence relations in the Bayesian network when
only E is observed.


5 Sampling – Gaussian Mixture (?? marks)
a) (1 mark) Define the setup for and the goal of rejection sampling.

b) (2 marks) Write down all steps of the rejection sampling algorithm. You are encouraged to
provide precise mathematical formulations and/or pseudo-code where appropriate, and also to
explain your answer.

c) (1 mark) What are some limitations of rejection sampling?

d) (1 mark) Define the setup for and the goal of ancestral sampling.

e) (2 marks) Write down all steps of the ancestral sampling algorithm. You are encouraged to
provide precise mathematical formulations and/or pseudo-code where appropriate, and also to
explain your answer.

f) (1 mark) What are some limitations of ancestral sampling?

g) Consider the mixture of three Gaussians,

p(x) = (3/10) N(x | 5, 0.5) + (3/10) N(x | 9, 2) + (4/10) N(x | 2, 20),

where N(x | µ, σ) is a Gaussian distribution with mean µ and standard deviation σ.

g1) (1 mark) Discuss which sampling method (rejection or ancestral) is more appropriate for
the above distribution, and why.

g2) (1 mark) Derive, explain and write python code to draw 1000 samples from the given distribution using whichever of the two methods you deem most appropriate. Assume that your code starts with

import numpy as np
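
As an illustration only (not a prescribed solution), a minimal sketch of ancestral sampling from the mixture in g): first draw the latent component, then draw x from the selected Gaussian. Note that np.random.normal takes the standard deviation, matching the convention stated above.

import numpy as np

weights = np.array([0.3, 0.3, 0.4])                   # mixing coefficients 3/10, 3/10, 4/10
means   = np.array([5.0, 9.0, 2.0])
stds    = np.array([0.5, 2.0, 20.0])                  # standard deviations

n_samples = 1000
z = np.random.choice(3, size=n_samples, p=weights)    # step 1: sample component indices
x = np.random.normal(means[z], stds[z])               # step 2: sample x given the component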


6 Principal Component Analysis (?? marks)
(3 marks) Given is a set of data points xi ∈ R^D, i = 1, . . . , N. Project all data points onto a one-dimensional hyperplane. Derive the formula for the unit vector u representing the hyperplane in which the projected data points xi have the largest variance.

(1 mark) Discuss under which circumstances u is not uniquely defined.
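
For intuition, a minimal numpy sketch of the quantity being derived: the direction of maximal projected variance is the unit-norm eigenvector of the sample covariance matrix with the largest eigenvalue. The name X below (an N×D data matrix) is an illustrative assumption:

import numpy as np

def first_principal_direction(X):
    '''Return the unit vector u maximising the variance of the projections u^T x_i.'''
    X_centred = X - X.mean(axis=0)                    # centre the data
    S = X_centred.T @ X_centred / len(X)              # sample covariance matrix (D, D)
    eigvals, eigvecs = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
    return eigvecs[:, -1]                             # eigenvector of the largest eigenvalue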


7 Expectation Maximisation (EM) (?? marks)
Consider the Bernoulli mixture model defined by

p(z | π) = ∏_{k=1}^{K} π_k^{z_k}

p(x | z, µ1, µ2, . . . , µK) = ∏_{k=1}^{K} p(x | µk)^{z_k}

p(x | µ) = ∏_{i=1}^{D} µ_i^{x_i} (1 − µ_i)^{1−x_i}

where x ∈ {0,1}^D, z ∈ {0,1}^K, π ∈ [0,1]^K, µk ∈ [0,1]^D, and exactly one element of z is equal to one, and the rest are equal to zero. Assume we are given N observations x1, x2, . . . , xN, which are independently and identically distributed as x above.

a) (1 mark) Describe the set up for and the goal of EM.

b) (1 mark) Draw the graphical model using plate notation, shading the observed variable(s).

c) (1 mark) Derive p(x|µ,π).

d) (2 marks) Derive log p(X ,Z|π,µ1,µ2, . . . ,µK). X is an N×D matrix of observations. Z is an
N×K matrix of the corresponding latent variables, a priori distributed as z above.

e) (1 mark) State the criterion optimised by the maximisation step of the EM algorithm for inferring π, µ1, µ2, . . . , µK given X.

f) (6 marks) Derive EM updates for π,µ1,µ2, . . . ,µK given X in readily implementable form.

g) (2 marks) Give two different solutions to which the EM algorithm may converge, for the case N = 2, K = 2 with x1 ≠ x2.
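
For intuition only (not a model answer), a minimal numpy sketch of the standard EM iteration for a Bernoulli mixture: the E-step computes responsibilities γ(z_nk), the M-step re-estimates π and the µk. The names X (an N×D binary matrix), pi (shape (K,)) and mu (shape (K, D)) are illustrative assumptions:

import numpy as np

def e_step(X, pi, mu):
    '''Responsibilities gamma[n, k] = p(z_nk = 1 | x_n, pi, mu).'''
    log_px = X @ np.log(mu.T) + (1 - X) @ np.log(1 - mu.T)   # log p(x_n | mu_k), shape (N, K)
    log_joint = np.log(pi) + log_px                          # add log pi_k
    gamma = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    return gamma / gamma.sum(axis=1, keepdims=True)          # normalise over k

def m_step(X, gamma):
    '''Re-estimate pi and mu by maximising the expected complete-data log likelihood.'''
    Nk = gamma.sum(axis=0)                                   # effective counts, shape (K,)
    pi = Nk / len(X)
    mu = (gamma.T @ X) / Nk[:, None]                         # responsibility-weighted means
    return pi, mu

# for _ in range(max_iterations):
#     gamma = e_step(X, pi, mu)
#     pi, mu = m_step(X, gamma)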


8 Local quadratic approximation (?? marks)
(4 marks) Given is a smooth and twice differentiable function E(w) mapping vectors w ∈ R^D to R. Show that in the vicinity of a critical (or stationary) point w⋆ of E, the function can be approximated by a quadratic form

E(w) ≈ E(w⋆) + (1/2) ∑_{i=1}^{D} λ_i α_i²,

where λ_i and u_i are the D eigenvalues and unit-norm eigenvectors of the Hessian of E at the critical point w⋆, respectively. Each α_i measures the distance between w and w⋆ in the direction of the eigenvector u_i:

u_i^T (w − w⋆) = α_i.

Hint: Use the Taylor expansion of E.
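
A hedged outline of the key steps, in the notation above (H denotes the Hessian of E at w⋆):

\begin{align}
E(\mathbf{w}) &\approx E(\mathbf{w}^\star)
  + (\mathbf{w}-\mathbf{w}^\star)^{\mathsf T}\nabla E(\mathbf{w}^\star)
  + \tfrac{1}{2}(\mathbf{w}-\mathbf{w}^\star)^{\mathsf T}\mathbf{H}\,(\mathbf{w}-\mathbf{w}^\star)
  && \text{(Taylor; $\nabla E(\mathbf{w}^\star)=\mathbf{0}$ at a stationary point)} \\
&= E(\mathbf{w}^\star)
  + \tfrac{1}{2}\Big(\sum_i \alpha_i \mathbf{u}_i\Big)^{\mathsf T}\mathbf{H}\Big(\sum_j \alpha_j \mathbf{u}_j\Big)
  && \text{(expand $\mathbf{w}-\mathbf{w}^\star=\sum_i \alpha_i \mathbf{u}_i$)} \\
&= E(\mathbf{w}^\star) + \tfrac{1}{2}\sum_{i=1}^{D} \lambda_i \alpha_i^2
  && \text{(use $\mathbf{H}\mathbf{u}_i=\lambda_i\mathbf{u}_i$ and $\mathbf{u}_i^{\mathsf T}\mathbf{u}_j=\delta_{ij}$)}
\end{align}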
