CSCI-GA.2565-001 Machine Learning: Homework 3
Due 11.59 p.m. EST, Dec 19, 2022 on Gradescope
(fill in your name here)
(collaborators if any)
We encourage LaTeX-typeset submissions but will accept quality scans of hand-written pages.
1 Variational Inference and Monte Carlo Gradients
In this question, we review the details of variational inference (VI); in particular, we will implement the gradient estimators that make VI tractable.
We consider the latent variable model $p(z, x) = \prod_{i=1}^{N} p(x_i \mid z_i)\, p(z_i)$, where $x_i, z_i \in \mathbb{R}^D$. Recall that in VI, we find an approximation $q_\lambda(z)$ to $p(z \mid x)$.
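Recall that the approximation is found by maximizing the evidence lower bound (ELBO),
\[ \mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(z)}\big[\log p(z, x) - \log q_\lambda(z)\big] \le \log p(x). \]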
(A) Let $V_1$ be the set of variational approximations $\{q_\lambda : q_\lambda(z) = \prod_{i=1}^{N} q(z_i; \lambda_i)\}$, where the parameters $\lambda_i$ are learned separately for each datapoint $x_i$. Now consider a deep neural network $f_\lambda(x)$ with fixed architecture, where $\lambda$ parametrizes the network, and let $V_2 = \{q_\lambda : q_\lambda(z) = \prod_{i} q(z_i; f_\lambda(x_i))\}$. Which of the two families ($V_1$ or $V_2$) is more expressive, i.e., approximates a larger set of distributions? Prove your answer.
Will your answer change if we let $f_\lambda$ have a variable architecture, e.g., if $\lambda$ parametrizes the set of multilayer perceptrons of all sizes? Why or why not?
Solution. Write your solution for each question using the solution environment. Feel free to use style packages as you find convenient, e.g. for highlighting parts of your solution that you still need to work on. ⊓⊔
(B) For variational inference to work, we need to compute unbiased estimates of the gradient of the ELBO. In class, we learnt two such estimators: score function (REINFORCE) and pathwise (reparameterization) gradients. Let us see this in practice on a simpler inference problem.
Consider the dataset of $N = 100$ one-dimensional datapoints $\{x_i\}_{i=1}^{N}$ in data.csv. Suppose we want to minimize the following expectation with respect to a parameter $\mu$:
\[ \mathcal{L}(\mu) = \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}\left[\, \sum_{i=1}^{N} (x_i - z)^2 \right]. \]
(i) Write down the score function gradient for this problem. Then, using a suitable reparameterization, write down the reparameterization gradient for this problem.
(ii) Using PyTorch, for each of the two gradient estimators, perform gradient descent using $M \in \{1, 10, 100, 1000\}$ gradient samples for $T = 10$ trials. Plot the mean and variance of the final estimate of $\mu$ for each value of $M$ across the $T$ trials.
You should have two graphs, one per gradient estimator. Each graph should contain two plots, one for the means and one for the variances. The x-axis should be $M$; hence each of these plots will have four points.
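For reference, the two generic estimator forms from class, for an objective $\mathbb{E}_{q_\mu(z)}[f(z)]$:
\[ \nabla_\mu\, \mathbb{E}_{q_\mu(z)}[f(z)] = \mathbb{E}_{q_\mu(z)}\big[f(z)\, \nabla_\mu \log q_\mu(z)\big] \quad \text{(score function)}, \]
\[ \nabla_\mu\, \mathbb{E}_{q_\mu(z)}[f(z)] = \mathbb{E}_{\epsilon}\big[\nabla_\mu f(g(\mu, \epsilon))\big], \quad z = g(\mu, \epsilon) \quad \text{(pathwise)}. \]
A minimal PyTorch sketch of both estimators for part (ii) follows. It assumes the objective above, i.e. $z \sim \mathcal{N}(\mu, 1)$ and $f(z) = \sum_i (x_i - z)^2$; the learning rate, step count, and initialization are illustrative, not prescribed by the problem.
\begin{verbatim}
import numpy as np
import torch

# data.csv is the file referenced in the problem; assumed one value per line.
x = torch.tensor(np.loadtxt("data.csv"), dtype=torch.float32)  # shape (N,)

def f(z):
    # f(z) = sum_i (x_i - z)^2, as given in part (C).
    return ((x - z) ** 2).sum()

def score_function_grad(mu, M):
    # Score-function (REINFORCE) estimator, assuming q_mu(z) = N(mu, 1),
    # whose score is d/dmu log q_mu(z) = (z - mu).
    z = mu + torch.randn(M)                       # M samples from N(mu, 1)
    fz = torch.stack([f(zi) for zi in z])
    return (fz * (z - mu)).mean()

def pathwise_grad(mu, M):
    # Pathwise (reparameterization) estimator: z = mu + eps, eps ~ N(0, 1),
    # then differentiate through f with autograd.
    mu = mu.detach().clone().requires_grad_(True)
    eps = torch.randn(M)
    loss = torch.stack([f(mu + e) for e in eps]).mean()
    loss.backward()
    return mu.grad.detach()

# One illustrative run (learning rate, steps, and init chosen arbitrarily):
mu = torch.tensor(0.0)
for _ in range(200):
    mu = mu - 1e-4 * pathwise_grad(mu, M=10)
\end{verbatim}
The same loop with score_function_grad in place of pathwise_grad gives the second set of trials; a smaller learning rate may be needed there, since the score function estimator typically has much higher variance.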
(C) What conditions do you require on $p(z)$ and $f(z)$ (here $f(z) = \sum_{i=1}^{N} (x_i - z)^2$) for each of the two gradient estimators to be valid? Do these apply to both continuous and discrete distributions $p(z)$?
2 Bayesian Parameters versus Latent Variables
(A) Consider the model $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$, where the inverse variance is distributed $\lambda = 1/\sigma^2 \sim \mathrm{Gamma}(\alpha, \beta)$. Show that the predictive distribution $y_\star \mid w, x_\star, \alpha, \beta$ for a datapoint $x_\star$ follows a generalized T distribution
\[ T(t; \nu, \mu, \theta) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}\,\theta}\left(1 + \frac{1}{\nu}\left(\frac{t-\mu}{\theta}\right)^{2}\right)^{-\frac{\nu+1}{2}} \]
with degree $\nu = 2\alpha$, mean $\mu = w^\top x_\star$, and scale $\theta = \sqrt{\beta/\alpha}$. You may use the property $\Gamma(k) = \int_0^\infty x^{k-1} e^{-x}\, dx$.
(B) Using your expression in (A), write down the MLE objective for $w$ on $N$ arbitrary labelled datapoints $\{(x_i, y_i)\}_{i=1}^{N}$. Do not optimize this objective.
(C) Now consider the model $y_i \sim \mathcal{N}\big(f(x_i, z_i, w),\, \sigma^2\big)$, where $z_i \sim \mathcal{N}(0, I)$, $\sigma^2$ is known, and $f$ is a deep neural network parametrized by $w$.
(i) Write down an expression for the predictive distribution y⋆|X,y,x⋆, where X,y denote the training
datapoints. (You may leave your answer as an integral.)
(ii) Describe how you would approximate this distribution using variational inference and how you can
use your approximation to make a prediction for x⋆. Your answer should include the distribution
p(·) that you wish to approximate (which may or may not be the predictive distribution itself), the
distribution q(·) that is the variational approximation, as well as the variational objective.
(D) Finally, consider the model y ∼ N (w⊤x, σ2) where w ∼ N (0, I) and σ2 is known.
Derive a closed-form expression for the predictive distribution y⋆|X,y,x⋆. What are the parameters of
this predictive distribution and how do you optimize them?
(E) Of the three models defined in parts (A), (C), and (D) above, which are latent variable models and which
are not? Why? (If any are ambiguous, explain why.)
(F) Of the three models defined in parts (A), (C), and (D) above, which are Bayesian models and which are
not? Why? (If any are ambiguous, explain what is Bayesian about it and what is not.)
3 Normalizing Flows
In this question, we will review how we can use invertible transformations and the change-of-variables formula
to turn simple distributions into complex ones. Such transformations are known as normalizing flows. One
reason that flows are useful is that they can map unimodal distributions into multimodal ones, while still
allowing for a tractable density.
(A) Let z0 ∼ N (0, 1). Produce a density plot of z0 using N = 1000 samples.
Now look up "Planar Flow" (Equation 4) of The Expressive Power of a Class of Normalizing Flow Models (https://arxiv.org/pdf/2006.00392.pdf). Denote this flow as f. Choose an invertible non-linearity h and find values of w, b, u such that f(z0) is a multimodal distribution. Produce a density plot of f(z0) using the same N = 1000 samples as above.
Note that d = 1 in this question. Also, it is fine to choose an h that is only invertible on its output range, e.g. the sigmoid function on (0, 1).
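A minimal sketch of this experiment, assuming $h = \tanh$ and one illustrative parameter setting (the problem asks you to find your own values of $w$, $b$, $u$):
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
z0 = rng.standard_normal(1000)      # N = 1000 samples of z0 ~ N(0, 1)

# Planar flow in d = 1: f(z) = z + u * h(w * z + b).
# With h = tanh, f'(z) = 1 + u*w*tanh'(w*z + b), so u*w > -1 keeps f invertible.
w, b, u = 5.0, 0.0, 2.0             # illustrative choice; here f'(z) >= 1 > 0
fz = z0 + u * np.tanh(w * z0 + b)

# Large f' near z = 0 spreads the central mass out, producing two modes.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(z0, bins=50, density=True)
axes[0].set_title("z0 ~ N(0, 1)")
axes[1].hist(fz, bins=50, density=True)
axes[1].set_title("f(z0)")
plt.tight_layout()
plt.show()
\end{verbatim}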
(B) Use the change-of-variables formula and write down an explicit expression for the density of f(z0). This
depends on your choice of h.
(C) Let's generalize this to D-dimensional variables and to a sequence of invertible functions (not necessarily planar flows). We can sample $z^{(0)} \sim \mathcal{N}(0, I)$ and then transform that sample iteratively using a sequence of invertible functions $f_1, \ldots, f_K$ to finally obtain a sample
\[ z^{(K)} = f_K \circ \cdots \circ f_2 \circ f_1\big(z^{(0)}\big). \]
Using the change-of-variables formula, write down a formula for the log-density of $z^{(K)}$ in terms of the functions $\{f_k\}_{k=1}^{K}$, their inverses and Jacobians, as well as the log-density of $z^{(0)}$.
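As a reminder, the single-transformation case of the change-of-variables formula (standard from class, to be extended to the composition above): if $y = f(z)$ with $f$ invertible and $z \sim p_Z$, then
\[ \log p_Y(y) = \log p_Z\big(f^{-1}(y)\big) - \log \left| \det \frac{\partial f}{\partial z}\Big(f^{-1}(y)\Big) \right|. \]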
(D) Let us consider how we can improve variational inference using flows.
(i) How can we define an approximation qλ(z) to p(z|x) using flows?
(ii) How do we train the model, i.e. optimize λ to yield a good approximation to p(z|x)?
(iii) In what way is this more flexible than using a Gaussian approximation for qλ(z)?
4 Causal Inference: Doubly Robust Estimators
Denote unit i's treatment, outcome, and vector of covariates by $T_i$, $Y_i$, and $X_i$. Let us model the propensity score by $e(x; \hat{\theta})$ and the outcome by $f(x; \hat{\psi})$. Assume strong ignorability and positivity hold. We define the Doubly Robust Estimator (DRE) for the average outcome when treated, $\mathbb{E}[Y(1)]$, by:
\[ \hat{\mu}_{\mathrm{DR}} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{T_i Y_i}{e(X_i; \hat{\theta})} - \frac{T_i - e(X_i; \hat{\theta})}{e(X_i; \hat{\theta})}\, f(X_i; \hat{\psi}) \right]. \]
(A) Suppose the propensity model is correctly specified, i.e. e(x; θ̂) = P (Ti = 1|Xi = x). Given any function
f , show that the DRE is unbiased.
(B) Suppose the model $f(x; \hat{\psi})$ is correctly specified, i.e. $f(x; \hat{\psi}) = \mathbb{E}[Y_i(1) \mid X_i = x] := \mathbb{E}[Y_i \mid X_i = x, T_i = 1]$. Given any function $e$ taking values in $(0, 1)$, show that the DRE is unbiased.
(C) Recall that control variates improve Monte Carlo estimators by defining a new estimator with the same expectation but lower variance. For some estimator $f$, this is done by taking a function $g$ with $\mathbb{E}[g(x)] = 0$ and defining the new estimator $\hat{f}(x) = f(x) - a\, g(x)$ for some $a \in \mathbb{R}$. Here $\mathbb{E}[\hat{f}] = \mathbb{E}[f]$, but the variances are not equal. The value of $a$ that makes the variance of $\hat{f}$ smallest is $a^* = \mathrm{Cov}(f, g)/\mathrm{Var}(g)$.
When both e(x; θ̂) and f(x; ψ̂) are correctly specified, use control variates to justify the use of Doubly
Robust Estimators.
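For reference, the standard variance computation behind $a^*$:
\[ \mathrm{Var}(f - a g) = \mathrm{Var}(f) - 2a\,\mathrm{Cov}(f, g) + a^2\, \mathrm{Var}(g), \]
which is minimized at $a^* = \mathrm{Cov}(f, g)/\mathrm{Var}(g)$, yielding $\mathrm{Var}(f - a^* g) = \mathrm{Var}(f) - \mathrm{Cov}(f, g)^2/\mathrm{Var}(g) \le \mathrm{Var}(f)$.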
5 Generative Models with f-divergences
Given two distributions $P$ and $Q$ with density functions $p$ and $q$, we define the $f$-divergence with respect to a convex function $f$ as:
\[ D_f(P \,\|\, Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx. \]
(A) How can we estimate f -divergences using likelihood-ratio estimators as we learned in class?
(B) Using the estimate from your solution to (A), show how we can minimize Df (P∥Qθ) with respect to θ.
(C) Let $\mathcal{Q}$ denote the family of distributions that $Q_\theta$ lives in. Assume that $P \notin \mathcal{Q}$. Compare the KL and reverse KL divergences. What are the properties that each $f$-divergence imposes on its corresponding minimizer $Q^\star_\theta$, i.e.
\[ Q^\star_\theta = \arg\min_{Q_\theta \in \mathcal{Q}} D_f(P \,\|\, Q_\theta)? \]
Name              $D_f(P \,\|\, Q)$                          $f(u)$
Kullback-Leibler  $\int p(x) \log \frac{p(x)}{q(x)}\, dx$    $u \log u$
Reverse KL        $\int q(x) \log \frac{q(x)}{p(x)}\, dx$    $-\log u$
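As a sanity check on the table, substituting $f(u) = u \log u$ into the definition recovers the usual KL divergence:
\[ D_f(P \,\|\, Q) = \int q(x)\, \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)}\, dx = \int p(x) \log \frac{p(x)}{q(x)}\, dx = \mathrm{KL}(P \,\|\, Q), \]
and the reverse KL entry follows identically with $f(u) = -\log u$.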
6 Reinforcement Learning with Sparse Rewards
Suppose that you have a robot that should learn to move from any of a set of starting positions s0 ∈ S0 to a
goal position G. The robot can move by performing a sequence of small, low-level continuous actions, such as
rotating its joints. However, you do not know how to explicitly program a sequence of actions that will move
the robot from any s0 to G. You decide to use a reinforcement learning algorithm. Your RL algorithm uses the
policy gradient to learn a policy πθ(a|s) that maps from states s to distributions over actions a. You consider learning episodes of finite length T.
(A) You start by designing the Markov Decision Process (MDP) that defines the robot’s learning environment.
The robot receives reward R = 1000 when it reaches location G and R = 0 upon entering any other
position. What is the Monte Carlo estimate of the policy gradient when G is not reached in T steps in
any of the sample trajectories? What happens as the robot explores in this environment and what will
its learning process look like?
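For reference, the generic Monte Carlo (REINFORCE) estimate of the policy gradient from class, computed over $n$ sampled trajectories $\tau^{(j)}$ with total reward $R\big(\tau^{(j)}\big)$:
\[ \nabla_\theta J(\theta) \approx \frac{1}{n} \sum_{j=1}^{n} R\big(\tau^{(j)}\big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(j)} \mid s_t^{(j)}\big). \]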
(B) Suppose the robot must start in s0 for all learning trajectories. Name two ways you could alter the MDP
environment to improve exploration and gradients.