
In the following, bold symbols represent vectors. When asked to provide algorithms, you may provide equations or use numpy/MATLAB/Octave notation for matrix algebra. You may assume the availability of standard matrix functions such as inv, pinv, eig or svd. You may not assume statistical functions, including mean and cov. If you choose to use programming notation, you will not lose marks for syntactic errors, as long as your intention is clear and correct.
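For instance, since mean and cov may not be assumed, a sample mean and covariance would be written directly in matrix algebra. A minimal numpy sketch (purely illustrative, not part of any question):

```python
import numpy as np

def sample_mean_cov(X):
    """Sample mean and covariance of an (n, D) data matrix using only matrix operations."""
    n = X.shape[0]
    mu = X.sum(axis=0) / n              # mean, without np.mean
    Xc = X - mu                         # centred data
    Sigma = Xc.T @ Xc / (n - 1)         # unbiased covariance, without np.cov
    return mu, Sigma
```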
You might find it useful to consult the following table of standard distributions:
Name | Domain | Symbol | Density or probability function
Gaussian (Normal) | $\mathbb{R}$ | $x \sim \mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
Multivariate Normal | $\mathbb{R}^D$ | $x \sim \mathcal{N}(\mu, \Sigma)$ | $|2\pi\Sigma|^{-1/2}\, e^{-\frac{1}{2}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)}$
Bernoulli | $\{0, 1\}$ | $x \sim \mathrm{Bernoulli}(p)$ | $p^x (1-p)^{1-x}$
Binomial | $\{0, 1, \dots, N\}$ | $x \sim \mathrm{Binom}(p)$ | $\binom{N}{x} p^x (1-p)^{N-x}$
Multinomial | $\{0, 1, \dots, N\}^D$ | $x \sim \mathrm{Multinom}(p)$ | $\frac{N!}{x_1!\, x_2! \cdots x_D!} \prod_{d=1}^{D} p_d^{x_d}$
Poisson | $\mathbb{Z}_{\ge 0}$ | $x \sim \mathrm{Poisson}(\mu)$ | $\mu^x e^{-\mu} / x!$
Beta | $[0, 1]$ | $x \sim \mathrm{Beta}(\alpha, \beta)$ | $\frac{1}{B(\alpha, \beta)} x^{\alpha-1}(1-x)^{\beta-1}$
Exponential | $\mathbb{R}_+$ | $x \sim \mathrm{Exp}(\lambda)$ | $\lambda e^{-\lambda x}$
Gamma | $\mathbb{R}_+$ | $x \sim \mathrm{Gamma}(\alpha, \beta)$ | $\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$
Student-t | $\mathbb{R}$ | $x \sim \text{Student-t}(\alpha)$ | $\frac{\Gamma\!\left(\frac{\alpha+1}{2}\right)}{\sqrt{\pi\alpha}\,\Gamma\!\left(\frac{\alpha}{2}\right)} \left(1 + \frac{x^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}$
Inverse-Wishart | $S_p$ | $X \sim \mathrm{InvWish}(\Psi, \nu)$ | $\frac{|\Psi|^{\nu/2}}{2^{\nu p/2}\,\Gamma_p\!\left(\frac{\nu}{2}\right)}\, |X|^{-\frac{\nu+p+1}{2}} e^{-\frac{1}{2}\operatorname{Tr}[\Psi X^{-1}]}$
Dirichlet | $[0, 1]^D$ | $x \sim \mathrm{Dirichlet}(\alpha)$ | $\frac{\Gamma\!\left(\sum_{d=1}^{D} \alpha_d\right)}{\prod_{d=1}^{D} \Gamma(\alpha_d)} \prod_{d=1}^{D} x_d^{\alpha_d - 1}$

1. Mean field approximation
In this problem, we will compute a mean field approximation for an Ising model with cost function $\phi$. Given a set of random variables $X_1, \dots, X_n$ with values $X_i \in \{-1, +1\}$, the Ising model cost function is defined as
\[
\phi(x) = -\sum_{i,j=1}^{n} \alpha_{ij}\, x_i x_j \;-\; \sum_{i=1}^{n} \beta_i x_i \qquad \text{for } x = (x_1, \dots, x_n) \in \{-1, 1\}^n , \tag{1}
\]
where $\alpha_{ij}$ and $\beta_i$ for $i, j \in \{1, \dots, n\}$ are constant weights. (In the example we discussed in class, $\alpha_{ij}$ was non-zero if $i$ and $j$ were neighbors on a grid.) The distribution of the Ising model is given by
\[
P(x) = \frac{1}{Z(\alpha, \beta)}\, e^{-\phi(x)} \qquad \text{where} \qquad Z(\alpha, \beta) = \sum_{x} e^{-\phi(x)} . \tag{2}
\]
We approach the problem by approximating P by a factorial distribution
\[
Q(x) = \prod_{i=1}^{n} Q_i(x_i) \qquad \text{where} \qquad Q_i(x_i) = \begin{cases} \dfrac{1+m_i}{2} & x_i = +1 \\[4pt] \dfrac{1-m_i}{2} & x_i = -1 \end{cases} , \tag{3}
\]
each with some parameter $m_i \in [-1, 1]$.
Hints: (i) Recall that $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with inverse $\operatorname{arctanh}(x) = \frac{1}{2}\log\left(\frac{1+x}{1-x}\right)$.
(ii) Wherever possible, use the fact that $\mathbb{E}[YZ] = \mathbb{E}[Y]\,\mathbb{E}[Z]$ if $Y$ and $Z$ are independent.
(a) Recall that the mean field approximation of a distribution $P$ by $Q$ minimizes the cost function $\mathrm{KL}[Q\|P]$. Show that
\[
\operatorname*{argmin}_{m \in [-1,1]^n} \mathrm{KL}[Q\|P] = \operatorname*{argmin}_{m \in [-1,1]^n} F(m) , \qquad F(m) := -H(Q) + \mathbb{E}_Q[\phi(X)] , \tag{4}
\]
where $H(Q)$ is the entropy of $Q$ and $\mathbb{E}_Q$ denotes expectation under $Q$.
[3 marks]
(b) Show that
\[
\sum_{x \in \{-1,1\}^n} Q(x)\, f(x_i) = \sum_{x_i \in \{-1,+1\}} Q_i(x_i)\, f(x_i) \tag{5}
\]
holds for any factorial distribution $Q$ and any function $f : \{-1, 1\} \to \mathbb{R}$.
[3 marks]
(c) Show that the expected value of φ under Q is
\[
\mathbb{E}_Q[\phi(X)] = -\sum_{i,j=1}^{n} \alpha_{ij}\, m_i m_j \;-\; \sum_{i=1}^{n} \beta_i m_i \,. \tag{6}
\]
[3 marks]

(d) Show that the entropy of Q is
\[
H[Q] = -\sum_{i=1}^{n} \left( \frac{1+m_i}{2}\log\frac{1+m_i}{2} + \frac{1-m_i}{2}\log\frac{1-m_i}{2} \right) . \tag{7}
\]
[4 marks]
(e) Show that the optimal distribution is given by the parameter values
\[
m_i = \tanh\!\left( \sum_{j=1}^{n} \alpha_{ij}\, m_j + \beta_i \right) . \tag{8}
\]
[4 marks]
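For intuition only: the fixed point in (8) is usually found by iterating the update until convergence. A minimal numpy sketch, assuming a weight matrix `alpha` and bias vector `beta` as in (1):

```python
import numpy as np

def mean_field_ising(alpha, beta, n_iters=200, tol=1e-8):
    """Iterate the mean-field fixed point m_i = tanh(sum_j alpha_ij m_j + beta_i)."""
    m = np.zeros(len(beta))                 # initial magnetisations
    for _ in range(n_iters):
        m_new = np.tanh(alpha @ m + beta)   # equation (8) applied in parallel
        if np.max(np.abs(m_new - m)) < tol:
            break
        m = m_new
    return m
```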

2. Sampling algorithms
(a) Consider a distribution with a very simple “staircase” density p (the thick line in the left figure). We divide the area under p into three “slices”:
[Figure A: the staircase density $p(x)$ on $[a, b]$, with the area under $p$ divided into Slices 1–3. Figure B: the point $(x_{i-1}, t_i)$ falling inside Slice 2, and a point $(x_i, y_i)$ drawn from that slice.]
We define a Markov chain that generates samples $x_1, x_2, \dots$ as follows. Start with any $x_1 \in [a, b]$. For each $i > 1$:
(i) Choose one of the slices that overlap the location of the previous sample $x_{i-1}$ at random, as follows:
• $t_i \sim \mathrm{Uniform}[0, p(x_{i-1})]$
• Select the slice $k$ which contains $(x_{i-1}, t_i)$. (In the right figure, this would be Slice 2.)
(ii) Regard slice $k$ as a box and sample a point $(x_i, y_i)$ uniformly from this box. Discard the vertical coordinate $y_i$ and keep $x_i$ as the $i$th sample.
Is this a valid sampler for $p$? More precisely: if we assume that the previous sample $x_{i-1}$ is already distributed according to $p$ (on the grounds that the Markov chain has converged), is $x_i$ marginally distributed according to $p$?
[6 marks]
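One informal way to probe part (a) is to simulate the chain on a concrete staircase density and compare the empirical distribution of the samples with $p$. The sketch below does this for an assumed three-step decreasing density on $[0, 3]$; the step edges and heights are illustrative choices, not taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed decreasing staircase density on [a, b]: p(x) = h[0] on [0,1), h[1] on [1,2), h[2] on [2,3].
a, b = 0.0, 3.0
e = np.array([1.0, 2.0])                   # interior step edges (assumed)
h = np.array([0.5, 0.3, 0.2])              # step heights; total area 0.5 + 0.3 + 0.2 = 1

def p(x):
    return h[np.searchsorted(e, x, side="right")]

# Horizontal slices of the area under p, as in the figure:
# slice k is the box [a, right[k]] x [low[k], high[k]], numbered from the bottom up.
low   = np.array([0.0, h[2], h[1]])
high  = np.array([h[2], h[1], h[0]])
right = np.array([b,    e[1], e[0]])

def step(x_prev):
    t = rng.uniform(0.0, p(x_prev))        # vertical coordinate under p(x_prev)
    k = np.searchsorted(high, t)           # index of the slice containing (x_prev, t)
    return rng.uniform(a, right[k])        # horizontal coordinate of a uniform draw from that box

x = rng.uniform(a, b)
samples = []
for _ in range(100_000):
    x = step(x)
    samples.append(x)
# Compare a histogram of `samples` against p to judge the sampler empirically (not shown).
```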
(b) Consider a 2-dimensional Gaussian N (μ, Σ) with
\[
\mu = \begin{pmatrix} 3 \\ 3 \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} .
\]
Write down a Gibbs sampler for $\mathcal{N}(\mu, \Sigma)$; that is, given $x_i = (x_{i,1}, x_{i,2}) \in \mathbb{R}^2$, specify how to sample $x_{i+1} \in \mathbb{R}^2$. [4 marks]
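For reference, a Gibbs sweep for a bivariate Gaussian alternates draws from the two univariate conditionals $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$. The sketch below uses the standard Gaussian-conditioning formulas for a general $2 \times 2$ $\Sigma$; it illustrates the sweep structure rather than being the required written answer:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, mu, Sigma):
    """One Gibbs sweep for a 2-D Gaussian, using the standard conditional-Gaussian formulas."""
    x = x.copy()
    # x1 | x2
    m1 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x[1] - mu[1])
    v1 = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
    x[0] = rng.normal(m1, np.sqrt(v1))
    # x2 | x1, using the freshly updated x1
    m2 = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x[0] - mu[0])
    v2 = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
    x[1] = rng.normal(m2, np.sqrt(v2))
    return x

mu = np.array([3.0, 3.0])
Sigma = np.array([[1.0, 0.0], [0.0, 2.0]])   # for this diagonal case the conditionals decouple
x = np.zeros(2)
for _ in range(1000):
    x = gibbs_sweep(x, mu, Sigma)
```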
(c) Consider a directed graphical model with graph:
[Figure: a directed graph over the six variables $X_1, X_2, X_3, X_4, X_5, X_6$.]

For a set $x_1, \dots, x_n$, we write $x_{-i}$ for the set with the $i$th element removed,
\[
x_{-i} = \{x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n\} .
\]
If $i = 1$, we read $X_{i-1}$ as $X_6$. Suppose each variable $X_i$ takes values in $\{0, 1\}$, with conditional
\[
P(X_i = 1 \mid X_{i-1}) = \sigma(\theta_i X_{i-1}) \quad \text{for some } \theta_i \in [0, 1] ,
\]
where $\sigma(x) = \frac{e^x}{e^x + 1}$ is the sigmoid function. What are the full conditionals $P(X_i \mid X_{-i})$? Please express the unnormalized full conditional $\tilde P(X_i \mid X_{-i})$ in terms of $\sigma$, and explain how to determine the normalization constant for a given setting of $X_{-i}$. [7 marks]
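As a generic illustration of the final normalization step only (the unnormalized values themselves are what the question asks you to derive), a binary conditional is normalized by summing its two unnormalized values. A short numpy sketch, with the sigmoid as defined above:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid as defined in the question: e^x / (e^x + 1)."""
    return np.exp(x) / (np.exp(x) + 1.0)

def normalise_binary(p_tilde_0, p_tilde_1):
    """Turn unnormalised values for X_i = 0 and X_i = 1 into a proper conditional."""
    Z = p_tilde_0 + p_tilde_1              # normalisation constant for this setting of X_{-i}
    return p_tilde_0 / Z, p_tilde_1 / Z
```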

3. Stochastic (natural) gradients for conjugate-exponential variational Bayes
Consider fitting a conjugate-exponential (CE) latent variable model on $n$ observations $X = \{x_i\}_{i=1}^{n}$, with corresponding latents $Z = \{z_i\}_{i=1}^{n}$ and natural parameters $\theta$, by variational Bayes (VB):
\[
P(\theta) = h(\nu, \tau)\, g(\theta)^{\nu} e^{\theta^{\mathsf T}\tau}
\]
\[
P(Z, X) = \prod_{i=1}^{n} f(z_i, x_i)\, g(\theta)\, e^{\theta^{\mathsf T} T(z_i, x_i)}
\]
where ν and τ are hyperparameters and T is the joint sufficient statistic on latent and observation.
(a) Making the VB approximation $Q(\theta, Z) = Q(\theta)Q(Z)$, and assuming the general form of the VB iterative updates, derive the specific updates for the CE model. In particular, show that the update for $Q(\theta)$ depends on $\big\langle \sum_i T(z_i, x_i) \big\rangle_{Q(Z)}$.
[5 marks]
If the number of observations is very large, then storing and computing $\big\langle \sum_i T(z_i, x_i) \big\rangle_{Q(Z)}$ on every iteration may be computationally expensive. An alternative is to use a 'minibatch' stochastic gradient approach, even when a closed form update is possible in principle. Let
\[
Q(\theta; \tilde\nu, \tilde\tau) = h(\tilde\nu, \tilde\tau)\, g(\theta)^{\tilde\nu} e^{\theta^{\mathsf T}\tilde\tau} .
\]
(b) Show that this conjugate form can itself be written as an exponential family distribution on $\theta$. Identify the natural parameters $\tilde\eta_\theta$ and corresponding sufficient statistics. [2 marks]
(c) Show that the gradient of the VB free energy with respect to $\tilde\eta_\theta$ has the form
\[
\nabla_{\tilde\eta_\theta} F = \big(\nabla_{\tilde\eta_\theta} \tilde\mu_\theta\big)^{\mathsf T} \big( \eta^{0}_{\theta} + n\,\eta^{Z}_{\theta} - \tilde\eta_\theta \big)
\]
where $\eta^{0}_{\theta}$ are the natural parameters of the prior $p(\theta)$, $\eta^{Z}_{\theta}$ is a term that depends on the observations and $Q(Z)$, and $\tilde\mu_\theta$ are the mean parameters of $Q(\theta; \tilde\nu, \tilde\tau)$.
[4 marks]
(d) Thus, show that the stochastic gradient computed using a random ‘minibatch’ (i.e. subset of the data) is an unbiased estimate of the true gradient, and so appropriate for stochastic gradient ascent. [2 marks]
(e) Recall (e.g., from our discussion of independent components analysis) that the natural gradient of a function of a distribution $F(P_\eta)$ with respect to the parameters $\eta$ takes the form
\[
\check\nabla F(P_\eta) = \langle -\nabla\nabla \log P_\eta \rangle^{-1}\, \nabla F(P_\eta)
\]
where all gradients are with respect to $\eta$ and $\nabla\nabla$ is the Hessian matrix. Show that the natural gradient of $F$ with respect to $\tilde\eta_\theta$ has a particularly simple form.
[4 marks]
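For context, this natural-gradient structure is what stochastic variational inference exploits: a step of size $\rho$ becomes a convex combination of the current natural parameters and a minibatch-rescaled target. A schematic numpy sketch, in which `expected_stats` is an assumed helper returning the minibatch's contribution to $\big\langle \sum_{i \in B} T(z_i, x_i) \big\rangle_{Q(Z)}$ in natural-parameter form (it is not defined in the question):

```python
def svi_step(eta_tilde, eta_prior, expected_stats, minibatch, n, rho):
    """One stochastic natural-gradient ascent step on the natural parameters of Q(theta).

    eta_tilde:      current natural parameters of Q(theta)
    eta_prior:      natural parameters of the prior P(theta)
    expected_stats: assumed helper returning the minibatch expected sufficient statistics under Q(Z)
    n:              total number of observations; len(minibatch) is the minibatch size
    rho:            step size
    """
    scale = n / len(minibatch)                         # unbiased rescaling of the minibatch term
    eta_target = eta_prior + scale * expected_stats(minibatch)
    return (1.0 - rho) * eta_tilde + rho * eta_target  # convex combination of old and target parameters
```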

4. Modelling earthquakes
The US geological service has collected a dataset of minor earthquake events from across the country, and would like to model how the rate of these events varies across different faults and with the ambient temperature. Index the faults by $i = 1, \dots, N$ and the different (discretised) temperature bands by $t = 1, \dots, T$. Let $Q_{it}$ be the number of earthquakes on fault $i$ when the temperature was in band $t$.
As a first pass, we might model the data by assuming that each fault $i$ has an intrinsic earthquake frequency $r_i$ that is independent of the earthquake rates of other faults, and that in temperature band $t$ the earthquake rates are amplified by a factor $s_t$. This suggests the model
\[
\begin{aligned}
r_i &\sim \mathrm{Gamma}(a, b) && \text{for each fault } i \\
s_t &\sim \mathrm{Gamma}(c, d) && \text{for each band } t \\
Q_{it} &\sim \mathrm{Poisson}(r_i s_t) && \text{for each } i, t
\end{aligned}
\]
(a) Draw out the model as a directed graphical model. Use plates to make your representation as compact as possible. [3 marks]
(b) What conditional independence relationships are preserved in the posterior when all the counts $Q_{it}$ are observed? Draw the corresponding undirected graph induced on $r_1, \dots, r_N, s_1, \dots, s_T$. [3 marks]
(c) Derive a Gibbs sampler to infer all the $r_i$ and $s_t$ when the $Q_{it}$ are observed. Specify the form and parameters of each of the conditional distributions involved. How would the samples be used to obtain a Monte Carlo estimate of the mean and variance of $r_i$? [6 marks]
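A schematic of the sampler's structure and of the Monte Carlo estimates it feeds; the conditional draws `sample_r_cond` and `sample_s_cond` are left as placeholders for the distributions that part (c) asks you to derive:

```python
import numpy as np

def gibbs(Q, a, b, c, d, sample_r_cond, sample_s_cond, n_sweeps=5000, burn_in=1000):
    """Gibbs sampler skeleton for the earthquake model.

    Q: (N, T) array of observed counts.
    sample_r_cond(i, s, Q, a, b): draw r_i from its full conditional given the current s (to be derived).
    sample_s_cond(t, r, Q, c, d): draw s_t from its full conditional given the current r (to be derived).
    """
    N, T = Q.shape
    r, s = np.ones(N), np.ones(T)
    r_samples = []
    for sweep in range(n_sweeps):
        for i in range(N):
            r[i] = sample_r_cond(i, s, Q, a, b)
        for t in range(T):
            s[t] = sample_s_cond(t, r, Q, c, d)
        if sweep >= burn_in:
            r_samples.append(r.copy())
    R = np.array(r_samples)                                    # (n_kept, N)
    r_mean = R.sum(axis=0) / R.shape[0]                        # Monte Carlo estimate of E[r_i | Q]
    r_var = ((R - r_mean) ** 2).sum(axis=0) / R.shape[0]       # Monte Carlo estimate of Var[r_i | Q]
    return r_mean, r_var
```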
(d) Derive an expectation propagation (EP) algorithm to infer the $r_i$ and $s_t$. Approximate the likelihood term for $Q_{it} \sim \mathrm{Poisson}(r_i s_t)$ by
\[
L_{it} \approx r_i^{a_{it}} e^{-b_{it} r_i}\, s_t^{c_{it}} e^{-d_{it} s_t}
\]
where $a_{it}$, $b_{it}$, $c_{it}$ and $d_{it}$ are variational parameters to be fit using EP. You may assume that the KL projection step is tractable. [5 marks]

5. Loopy graphs, BP and EP
Consider an undirected graphical model over discrete-valued variables $X = \{X_i\}$ with singleton and pairwise factors:
\[
P(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j) \,.
\]
We wish to compute the marginal distributions $P(X_i)$ for all nodes $i$ and the pairwise marginals $P(X_i, X_j)$ for all edges $(ij)$.
Recall that any distribution on discrete variables can be written in exponential family form.
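For concreteness, one standard (overcomplete) choice of sufficient statistics uses indicators of node and edge configurations, so that
\[
P(X) \propto \exp\!\Big( \sum_{i}\sum_{x} \theta_{i;x}\,\mathbb{1}[X_i = x] \;+\; \sum_{(ij)}\sum_{x, x'} \theta_{ij;xx'}\,\mathbb{1}[X_i = x, X_j = x'] \Big) ,
\]
with $\theta_{i;x} = \log f_i(x)$ and $\theta_{ij;xx'} = \log f_{ij}(x, x')$.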
(a) Show that the marginal probabilities we seek are exactly the mean parameters of the distribution – that is, the expected values of the sufficient statistic functions. [3 marks]
This means that, in principle, we could find their values using the standard result for exponential families:
\[
\mu = \operatorname*{argmax}_{\mu} \big( \eta^{\mathsf T}\mu - \Psi(\mu) \big)
\]
where $\eta$ are the natural parameters and $\Psi$ is the negative entropy function, dual to the log normalising constant.
(b) Give two reasons why this approach is intractable for a general (loopy) graph.
[3 marks]
Consider an approximation given by
\[
\tilde P(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} \tilde f^{\,i}_{ij}(X_i)\, \tilde f^{\,j}_{ij}(X_j)
\]
where we have introduced approximate factors $\tilde f_{ij}(X_i, X_j) = \tilde f^{\,i}_{ij}(X_i)\, \tilde f^{\,j}_{ij}(X_j)$.
(c) Derive the expectation propagation (EP) updates for $\tilde f_{ij}$. [8 marks]
(d) Show that these are equivalent to the updates under belief propagation (BP), so that EP in this approximation corresponds to loopy BP. [2 marks]
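For reference in part (d), the loopy BP message update for a pairwise discrete model has the usual form $m_{i\to j}(x_j) \propto \sum_{x_i} f_i(x_i)\, f_{ij}(x_i, x_j) \prod_{k \in \mathrm{ne}(i)\setminus j} m_{k\to i}(x_i)$. A minimal numpy sketch of a single message, with illustrative data structures (factor tables `f`, `f_pair` and a message dictionary `msgs`):

```python
import numpy as np

def bp_message(i, j, f, f_pair, msgs, neighbours):
    """Loopy BP message m_{i->j}(x_j) for a pairwise discrete model.

    f[i]:           singleton factor table for node i, shape (K_i,)
    f_pair[(i, j)]: pairwise factor table, shape (K_i, K_j)
    msgs[(k, i)]:   current message from node k to node i, shape (K_i,)
    neighbours[i]:  list of neighbours of node i
    """
    incoming = f[i].copy()
    for k in neighbours[i]:
        if k != j:
            incoming = incoming * msgs[(k, i)]   # product of messages from all other neighbours
    m = f_pair[(i, j)].T @ incoming              # sum over x_i of f_ij(x_i, x_j) * incoming(x_i)
    return m / m.sum()                           # normalise for numerical stability
```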

6. A mixture of VAEs
In lecture we discussed the “structured variational autoencoder” (SVAE), which provides one way to learn a structured latent model combined with flexible output distributions. Here we consider another, simpler, approach.
Consider the following model for observations $x$, with continuous latent $z \in \mathbb{R}^D$ and discrete latent $s \in \{1, \dots, K\}$:
\[
\begin{aligned}
s &\sim \mathrm{Discrete}(\pi) \\
z &\sim \mathcal{N}(0, I) \\
x \mid z, s &\sim \mathcal{N}\!\left( g(z; \gamma_s),\, \sigma_s^2 I \right)
\end{aligned}
\]
The parameters are $\Theta = \{\pi, \gamma_1, \dots, \gamma_K, \sigma_1^2, \dots, \sigma_K^2\}$, with $\pi$ a standard probability vector; $\gamma_s$, for each value of $s$, a parameter vector for a function $g$ (these could be the weights of a neural network); and $\sigma_s^2$ the variance of the observation noise for the $s$th component.
(a) Write out an explicit expression for the variational free energy for this model in terms of the variational distribution q(z, s) and the parameters defined above. Simplify it to identify the sufficient statistics needed for exact learning.
[5 marks]
Now assume a variational inferential approximation based on an $s$-dependent recognition model:
\[
q(z, s) = q(z \mid s)\, q(s) = q(z; f(x; \phi_s))\, q(s)
\]
with $q(z; f(x; \phi_s))$ a normal distribution parametrised by the output of $f(x; \phi_s)$ – perhaps the means and diagonal elements of the covariances.
(b) Show that, under this assumption, the free energy has the form
\[
F(\Theta, q) = \sum_{m} r_m F_m(\Theta_m, \phi_m) - \mathrm{KL}[q(s) \| p(s)]
\]
where $r_m = q(s = m)$ are the mixture responsibilities and $\Theta_m = \{\gamma_m, \sigma_m^2\}$.
[5 marks]
(c) Given data $X = \{x_i\}$, introduce reparametrised samples $z_i^{t,m} \sim q(z_i; f(x_i, \phi_m))$, $t = 1 \dots T$, for each $i$ and $m$, to derive a learning algorithm that estimates:
• $q(s)$, $\pi$
• $\phi_m$, $\gamma_m$ and $\sigma_m^2$.
Provide explicit equations for the updates in terms of current parameters and samples. [6 marks]
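As a building block for (c), reparametrised samples are typically drawn as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. A minimal numpy sketch, assuming the recognition network $f(x; \phi_m)$ returns a mean vector and per-dimension standard deviations (an assumed interface, not specified in the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_samples(x, phi_m, f, T):
    """Draw T reparametrised samples z ~ q(z; f(x; phi_m)) = N(mu, diag(sd**2))."""
    mu, sd = f(x, phi_m)                         # recognition-network outputs (assumed interface)
    eps = rng.standard_normal((T, mu.shape[0]))  # base noise, independent of phi_m
    return mu + sd * eps                         # gradients w.r.t. phi_m flow through mu and sd
```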