First Semester 2019

Statistical Machine Learning

(COMP4670/8600)

Writing period: 3 hours duration
Study period: 15 minutes duration

Permitted materials: One A4 sheet, written on one side

Total marks: 60.
Answer all questions.

All questions to be completed in the script book provided.
Please submit the A4 sheet with the script book.


1 Logistic Regression (14 marks total)
In the following questions we consider logistic regression without quadratic regularisation. For the
questions which require code, use Python 3 and include comments which specify the input and output
formats for your functions. Note that marks are allocated for documentation.

You may assume the following preamble:

import numpy as np
(0.5 marks for having sufficient comments in the code)

a) (1 mark) For binary logistic regression (as discussed in the lecture and used in the tutorial), how
are the labels encoded?

b) (1.5 marks) There are three possible output types that a logistic regression classifier can produce, corresponding to the linear model, the probabilistic output and the classification. The three output types are called decision_function, predict and predict_proba in the scikit-learn interface (in no particular order). Explain (using both text and equations) the three output types.

c) (1 mark) Define and explain the purpose of a confusion matrix (using both text and equations).

d) (1 mark) Write down the mathematical definition of the sigmoid function y = σ(x).

e) (1 mark) Write down the mathematical definition of the cost function for binary logistic regres-
sion (average cross-entropy). Do not forget to define your notation.

f) (2 marks) Calculate the mathematical form of the gradient (with respect to the model parame-
ters) of the cost function above.

g) (1 mark) Write code for a function sigmoid which operates element-wise on its vector input.

h) (1 mark) Write code for a function cost calculating the cost function as defined above. Note:
ensure that the finite floating point precision of a real computer does not destroy your result.

i) (1 mark) Write code for a function costgrad calculating the gradient of the cost function as defined above.

j) (1 mark) Assume you have found the best parameter theta_best for some training data with input X and targets t using

theta_best = opt.fmin_bfgs(cost, theta, fprime=costgrad, args=(X, t))

Write code for a function predict which, given a set of test data Xtest, returns an array of predictions.

k) (2 marks) Consider a single (input, target) pair. Draw graphical models which depict

i) the logistic regression model, and

ii) the naive Bayes classifier model.
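For reference, a minimal sketch of possible answers to parts g) through j), assuming the preamble above, targets t with entries in {0, 1} and a design matrix X of shape (N, D); these shapes and names are assumptions for illustration, not given in the question:

import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid.
    Input:  x, a numpy array of any shape.
    Output: array of the same shape, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cost(theta, X, t):
    """Average cross-entropy for binary logistic regression.
    Input:  theta (D,), X (N, D), t (N,) with entries in {0, 1}.
    Output: scalar.
    With a = X @ theta, each term is log(1 + exp(a_n)) - t_n * a_n;
    logaddexp computes log(1 + exp(a)) without floating point overflow."""
    a = X @ theta
    return np.mean(np.logaddexp(0.0, a) - t * a)

def costgrad(theta, X, t):
    """Gradient of cost with respect to theta.
    Output: (D,) array equal to X.T @ (sigmoid(X @ theta) - t) / N."""
    return X.T @ (sigmoid(X @ theta) - t) / len(t)

def predict(theta_best, Xtest):
    """Input:  fitted parameters theta_best (D,), test inputs Xtest (Ntest, D).
    Output: (Ntest,) array of 0/1 label predictions (threshold at 0.5)."""
    return (sigmoid(Xtest @ theta_best) >= 0.5).astype(int)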


2 Kernel Linear Regression (14 marks total)
Consider the function f(x, w) = φ(x)ᵀw, where x ∈ R^D is the input vector, w ∈ R^K is the parameter vector and φ is a feature function which maps from R^D to R^K.

Let the kernel function be given by k(x, x′) = φ(x)ᵀφ(x′). Assume we are given N (input, target) training pairs, D = {(x_1, t_1), (x_2, t_2), . . . , (x_N, t_N)}.

For the questions which require code, use Python 3 and include comments which specify the input and output formats for your functions. You may assume the following preamble:

import numpy as np

def xvalinds(ndata, nfolds):
    """
    ndata:   the number of data points
    nfolds:  the number of cross-validation folds
    returns: a list of (train_ind, test_ind) pairs (of lists of integers)
    """
    assert ndata % nfolds == 0  # the folds must tile the data evenly
    nchunk = int(ndata / nfolds)
    itest = [list(range(i * nchunk, (i + 1) * nchunk)) for i in range(int(nfolds))]
    itrain = [sorted(set(range(ndata)).difference(set(i))) for i in itest]
    return list(zip(itrain, itest))
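For example, xvalinds(6, 3) returns [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]: each fold's test indices are one contiguous chunk, and the training indices are the rest.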

a) (1 mark) Write the sum of squared errors objective J(w), assuming that our prediction for t_n is f(x_n, w). Include the regularisation term (λ/2) wᵀw.

b) (1.5 marks) Show that the parameter vector w* which minimises J takes the form w* = ∑_{n=1}^N a_n φ(x_n), and give a_n in terms of w*.

c) (1.5 marks) Write f(x, w*) in terms of a = (a_1, a_2, . . . , a_N)ᵀ and k(·, ·), thereby eliminating φ(·).

d) (1.5 marks) Write J(w*) in terms of a and the kernel function k(·, ·) (or one or more matrices defined in terms of the kernel function), thereby eliminating φ(·) and w*.

e) (1.5 marks) Give a simplified, closed form expression for the optimal a.

f) (1 mark) Write a function fit which takes D, k and λ and returns the optimal a.

g) (1 mark) Write a function predict which takes D, k, a and a matrix Xtest of test points, and returns the values resulting from evaluating the regression function at each test point.

h) (2 marks) Write a function cross_validate, which takes D, k, λ and nfolds, and returns a cross-validated estimate (with nfolds cross-validation folds) of the expected mean squared error on an independent test set. Use the function xvalinds above.

i) (3 marks) Let f ∈ R^N. Specify a likelihood function p(f | x_1, x_2, . . . , x_N, k, a) and a corresponding prior distribution p(f | λ) for which the maximum a posteriori (MAP) f given D corresponds to the solution found above. That is, the n-th element of the MAP f should equal f(x_n, ∑_{m=1}^N a*_m φ(x_m)), where a* is the optimal solution found in part e) above. Where possible, express your answer using well known distribution(s) with standard parametrisation(s).
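For reference, a minimal sketch of possible answers to parts f), g) and h), assuming D is passed as a pair (X, t) with X an (N, D) array and t an (N,) array; this calling convention, and the closed form a = (K + λI)⁻¹t from part e), are assumptions for illustration:

def fit(D, k, lam):
    """Input:  training data D = (X, t), kernel function k, regulariser lam.
    Output: (N,) array a of dual coefficients.
    Builds the Gram matrix K[n, m] = k(x_n, x_m) and solves (K + lam*I) a = t."""
    X, t = D
    N = len(t)
    K = np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict(D, k, a, Xtest):
    """Output: predictions f(x) = sum_n a_n k(x_n, x) for each row x of Xtest."""
    X, _ = D
    return np.array([sum(a[n] * k(X[n], x) for n in range(len(a))) for x in Xtest])

def cross_validate(D, k, lam, nfolds):
    """Output: nfolds-fold cross-validated estimate of the test mean squared error."""
    X, t = D
    errs = []
    for itrain, itest in xvalinds(len(t), nfolds):
        a = fit((X[itrain], t[itrain]), k, lam)
        preds = predict((X[itrain], t[itrain]), k, a, X[itest])
        errs.append(np.mean((preds - t[itest]) ** 2))
    return np.mean(errs)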


3 Graphical Models – Alarm (10 marks total)
John and Mary live in a house which is equipped with a burglar alarm system. The situation can
be described by the following Bayesian network with the random variables B (a Burglar entered the
house), E (an Earthquake occurred), A (the Alarm in the house rings), R (Radio news report about an
earthquake), J (John calls the police), and M (Mary, who listens to the radio, calls the police).

The domain of each random variable is B = {0, 1}, encoding False (= 0) and True (= 1).

[Figure: Bayesian network over the random variables B, E, A, R, J and M]

(a) (2 marks) Write out the joint distribution p(B,E,A,R,J,M) in its factored form according to this Bayesian network structure.

Express the answers to the following three queries only in terms of marginalisations "∑_{X∈𝒳}" and maximisations "max_{X∈𝒳}", where each X is a random variable, and 𝒳 the corresponding set of possible values X can take. For example, the following identity is acceptable:

∑_{B,A,R,J,M∈B} p(B, E = 1, A, R, J, M) = 1

Wherever possible, simplify your expressions by exploiting the structure of the above graphical model.

(b) (1 mark) The probability that the alarm will ring given no observations.

(c) (1.5 marks) The joint probability that Mary called the police and John did not when there was
an earthquake observed.

(d) (1.5 marks) The most probable values for the tuple (E,B) if both John and Mary called the
police and at least one of the events E,B did happen.

(e) (2 marks) Write down all conditional independence relations in the Bayesian network when
only A is observed.

(f) (2 marks) Write down all conditional independence relations in the Bayesian network when
only E is observed.


4 Sampling – Gaussian Mixture (10 marks)
a) (1 mark) Define the setup for and the goal of rejection sampling.

b) (2 marks) Write down all steps of the rejection sampling algorithm. You are encouraged to
provide precise mathematical formulations and/or pseudo-code where appropriate, and also to
explain your answer.

c) (1 mark) What are some limitations of rejection sampling?

d) (1 mark) Define the setup for and the goal of ancestral sampling.

e) (2 marks) Write down all steps of the ancestral sampling algorithm. You are encouraged to
provide precise mathematical formulations and/or pseudo-code where appropriate, and also to
explain your answer.

f) (1 mark) What are some limitations of ancestral sampling?

g) Consider the mixture of three Gaussians,

p(x) = (3/10) N(x | 5, 0.5) + (3/10) N(x | 9, 2) + (4/10) N(x | 2, 20),

where N(x | µ, σ) is a Gaussian distribution with mean µ and standard deviation σ.

g1) (1 mark) Discuss which sampling method (rejection or ancestral) is more appropriate for
the above distribution, and why.

g2) (1 mark) Derive, explain and write code in Python 3 to draw 1000 samples from the given distribution using whichever of the two methods you deem most appropriate. Assume that your code starts with

import numpy as np
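For reference, a minimal sketch for part g2), assuming ancestral sampling is the method chosen and reading the mixture parameters from the equation above (weights 0.3, 0.3, 0.4; means 5, 9, 2; standard deviations 0.5, 2, 20):

import numpy as np

# Ancestral sampling: first draw the component index z from the mixture
# weights, then draw x conditional on z from the selected Gaussian.
weights = np.array([0.3, 0.3, 0.4])
mus = np.array([5.0, 9.0, 2.0])
sigmas = np.array([0.5, 2.0, 20.0])

z = np.random.choice(3, size=1000, p=weights)   # z ~ Categorical(weights)
samples = np.random.normal(mus[z], sigmas[z])   # x | z ~ N(mu_z, sigma_z^2)

Ancestral sampling is exact and wastes no draws here, whereas rejection sampling would need an envelope wide enough to cover the σ = 20 component.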


5 Approximate Inference (12 marks total)
a) (2.5 marks) In the context of Bayesian machine learning, describe i) the Laplace approximation and ii) variational approximations. Compare and contrast the two methods.

b) (2 marks) Consider the problem of approximating some p(Z) with the factorised distribution

q(Z) = ∏_{i=1}^N q(Z_i),

where Z = (Z_1, Z_2, . . . , Z_N).

By holding q(Z_i), i ≠ j, fixed, derive expressions for each of the following variational approximation updates:

q*_L(Z_j) ≡ argmin_{q(Z_j)} KL[ q(Z) ‖ p(Z) ]

q*_R(Z_j) ≡ argmin_{q(Z_j)} KL[ p(Z) ‖ q(Z) ]

where

KL[ p(x) ‖ q(x) ] ≡ ∫ p(x) (log p(x) − log q(x)) dx.

c) (1 mark) Describe the Expectation Propagation algorithm, using the equations from the previous
question if needed.

d) (3 marks) Consider the distribution defined by

p(A,B,C) = f_1(A,B) f_2(B,C),

where A, B and C are discrete (categorical) random variables. Assume a factorising approximation for f_1 and f_2, so that f_1(A,B) ≈ f̃_1(A,B) ≡ f̃_1A(A) f̃_1B(B), and similarly f̃_2(B,C) ≡ f̃_2B(B) f̃_2C(C). Consider the case where the first factor f̃_1(A,B) is being refined. Derive the Expectation Propagation updates using f̃_1A(A), f̃_1B(B), f̃_2B(B) and f̃_2C(C).

e) (1.5 marks) Consider the same p(A,B,C) (still with categorical A, B and C) from the previous question, but instead assume a factorising approximating distribution,

q(A,B,C) ≡ q(A) q(B) q(C).

By optimising KL[ q(A,B,C) ‖ p(A,B,C) ] with respect to each of the three factors (holding the other two fixed), derive the variational mean field updates.

f) (2 marks) In this question, we consider a Gaussian approximation to the unknown distribution p(x) on continuous random variables. By treating the covariance Σ as fixed, derive an expression for the mean µ of the (multivariate normal, or multivariate Gaussian) approximating distribution q(x) ≡ N(x | µ, Σ) which minimises the objective KL[ p(x) ‖ q(x) ].
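For reference, the standard results that parts b) and f) lead to, stated without the derivations the questions ask for (notation as above):

\ln q_L^\star(Z_j) = \mathbb{E}_{i \neq j}\left[ \ln p(Z) \right] + \text{const}, \qquad q_R^\star(Z_j) = p(Z_j), \qquad \mu^\star = \mathbb{E}_{p(x)}[x].

That is, the left KL yields the usual mean-field update, the right KL is minimised by the exact marginals, and for fixed Σ the Gaussian mean that minimises KL[p ‖ q] matches the first moment of p.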
End of exam.
