
Deep Learning: Coursework 3¶

Student Name: <> (Student Number: <> )
Start date: 26th March 2019
Due date: 29th April 2019, 09:00 am

How to Submit¶
When you have completed the exercises and everything has finished running, click on ‘File’ in the menu-bar and then ‘Download .ipynb’. This file must be submitted to Moodle named as studentnumber_DL_cw3.ipynb before the deadline above.
Please produce a pdf with all the results (tables and plots) as well as the answers to the questions below. For this assignment, you don’t need to include any of the code in the pdf, but answers to the questions should be self-contained and should not rely on a code reference. Page limit: 20 pg.
IMPORTANT¶
Please make sure your submission includes all results/answers/plots/tables required for grading. We should not have to re-run your code.
Credits¶
A special thank you to Mihaela Rosca, Shakir Mohamed and Andriy Mnih for their help in this coursework.

Assignment Description¶
(Latent Generative Models)

Topics and optimization techniques covered:¶
• Stochastic variational inference
• Amortized variational inference (VAEs)
• Improving amortized variational inference using KL annealing
• Improving amortized variational inference using constraint optimization
• Avoiding latent space distribution matching using GANs

Tensorflow¶

Note: Before taking on this assignment you might find it useful to take a look at the tensorflow_probability package, especially if you have not used probability distributions in TensorFlow before. In this assignment we will use only standard probability distributions (like Gaussian and Bernoulli), but it is worth taking a look at how TF handles in-graph sampling and optimizations involving distributions.
In [0]:
#@title Imports (Do not modify!)

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math

import tensorflow as tf

import numpy as np

# Plotting library.
from matplotlib import pyplot as plt
import seaborn as sns

sns.set(rc={"lines.linewidth": 2.8}, font_scale=2)
sns.set_style("whitegrid")

# Tensorflow probability utilities
import tensorflow_probability as tfp

tfd = tfp.distributions

import warnings
warnings.filterwarnings('ignore')
In [3]:
#@title Check you're using the GPU (Expand me for instructions)
# Don't forget to select GPU runtime environment in Runtime -> Change runtime type
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0

Helper Functions for visualisation¶
In [0]:
def gallery(array, ncols=10, rescale=False):
  """Data visualization code."""
  if rescale:
    array = (array + 1.) / 2
  nindex, height, width, intensity = array.shape
  nrows = nindex//ncols
  assert nindex == nrows*ncols
  # want result.shape = (height*nrows, width*ncols, intensity)
  result = (array.reshape(nrows, ncols, height, width, intensity)
            .swapaxes(1,2)
            .reshape(height*nrows, width*ncols, intensity))
  return result
In [0]:
def show_digits(axis, digits, title=''):
  axis.axis('off')
  ncols = int(np.sqrt(digits.shape[0]))
  axis.imshow(gallery(digits, ncols=ncols).squeeze(axis=2),
              cmap='gray')
  axis.set_title(title, fontsize=15)
In [0]:
def show_latent_interpolations(generator, prior, session):
  a = np.linspace(0.0, 1.0, BATCH_SIZE)
  a = np.expand_dims(a, axis=1)

  first_latents = prior.sample()[0]
  second_latents = prior.sample()[0]

  # To ensure that the interpolation is still likely under the Gaussian prior,
  # we use Gaussian interpolation - rather than linear interpolation.
  interpolations = np.sqrt(a) * first_latents + np.sqrt(1 - a) * second_latents

  ncols = int(np.sqrt(BATCH_SIZE))
  samples_from_interpolations = generator(interpolations)
  # Use the session passed in as an argument rather than the global `sess`.
  samples_from_interpolations_np = session.run(samples_from_interpolations)
  plt.gray()
  axis = plt.gca()
  show_digits(
      axis, samples_from_interpolations_np, title='Latent space interpolations')
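
The square-root weights in the interpolation helper can be motivated with a small variance calculation. The sketch below is plain Python and separate from the notebook code; the function names are illustrative, and it assumes the two endpoints are independent draws from a unit-variance Gaussian prior.

```python
import math


def interpolation_variance(weight_first, weight_second):
    """Per-coordinate variance of w1 * z1 + w2 * z2 for independent z1, z2 ~ N(0, 1)."""
    return weight_first ** 2 + weight_second ** 2


def gaussian_interp_variance(a):
    """Variance under the square-root weighting used in the helper above."""
    return interpolation_variance(math.sqrt(a), math.sqrt(1.0 - a))


def linear_interp_variance(a):
    """Variance under plain linear interpolation."""
    return interpolation_variance(a, 1.0 - a)


for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(a, gaussian_interp_variance(a), linear_interp_variance(a))
```

With the square-root weights the variance stays at 1 for every value of a, so each interpolated point has the same marginal distribution as a prior sample. With linear weights the variance drops to 0.5 at the midpoint, which pulls the interpolated latents towards the origin.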

Hyperparameters (Do not modify!)¶
These were chosen to work across all models you are going to be training. At times you will need to explore other configurations to answer the questions, but keep these as the default. Things should train nicely under these parameters; check your model and gradients if that is not the case!
In [0]:
BATCH_SIZE = 64
NUM_LATENTS = 10
TRAINING_STEPS = 10000
In [0]:
tf.reset_default_graph()

The Data¶
Handwritten Digit Recognition Dataset (MNIST)¶
We will be revisiting the MNIST digit dataset for this assignment. The setup/processing of the data will be a bit different in this assignment as for training purposes it is sometimes easier to expose the data sampling as an operation in the graph, rather than going through placeholders. This is in general a very useful way of handling data in tensorflow, especially for larger training regimes where ‘stepping out’ of the graph might be very expensive.
In the following we will walk you through how to get the data into this form. You do not need to worry about it, but it is worth making sure you understand the step, as this is something that might be useful to replicate in the future.
In [0]:
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
In [0]:
print(mnist.train.images.shape)
print(type(mnist.train.images))

Transform the data from numpy arrays to in graph tensors.¶
This allows us to use TensorFlow datasets, which ensure that a new batch from the data is being fed at each session.run. This means that we do not need to use feed_dicts to feed data to each session.
In [0]:
def make_tf_data_batch(np_data, shuffle=True):
  # Reshape the data to image size.
  images = np_data.reshape((-1, 28, 28, 1))

  # Create the TF dataset.
  dataset = tf.data.Dataset.from_tensor_slices(images)

  # Shuffle and repeat the dataset for training.
  # This is required because we want to do multiple passes through the entire
  # dataset when training.
  if shuffle:
    dataset = dataset.shuffle(100000).repeat()

  # Batch the data and return the data batch.
  one_shot_iterator = dataset.batch(BATCH_SIZE).make_one_shot_iterator()
  data_batch = one_shot_iterator.get_next()
  return data_batch
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
print(real_data.shape)

Part 1: Latent Variable models and Variational Inference¶

T1.1 Stochastic Variational Inference¶
In this first task we will consider a simple latent variable model $z \rightarrow x$. Your task is to use stochastic variational inference to train a generative model on the MNIST data. For each data point $x_i$, there is a set of variational parameters to be learned. Throughout this assessment, the posterior and the prior will be Normal random variables, with uncorrelated dimensions.
Objective – maximize: \begin{equation} \mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \log p_\theta(x|z) \right] - \mathbb{E}_{p^*(x)} \left[ KL(q(z|x)||p(z)) \right] \end{equation}
For more information, please check out:
• http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.
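
As a reference point for the two terms of the objective above, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-scale, a standard normal prior, and a Bernoulli likelihood parameterized by logits. The function names are illustrative and do not appear in the notebook; in the graph you would normally obtain these quantities from the distribution objects instead.

```python
import math


def kl_diag_gaussian_to_standard_normal(mean, log_scale):
    """KL(N(mean, exp(log_scale)**2) || N(0, I)), summed over latent dimensions."""
    total = 0.0
    for m, ls in zip(mean, log_scale):
        variance = math.exp(2.0 * ls)
        # 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2), with log sigma^2 = 2 * ls.
        total += 0.5 * (variance + m * m - 1.0) - ls
    return total


def bernoulli_log_prob_from_logits(pixels, logits):
    """log p(x | z) summed over pixels; logits are the pre-sigmoid decoder outputs."""
    total = 0.0
    for x, logit in zip(pixels, logits):
        # Stable softplus: log(1 + exp(logit)).
        softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
        # x * log sigmoid(l) + (1 - x) * log(1 - sigmoid(l)) = x * l - softplus(l).
        total += x * logit - softplus
    return total


def single_sample_elbo(pixels, logits, mean, log_scale):
    """One-sample ELBO estimate, given decoder logits for a posterior sample z."""
    return (bernoulli_log_prob_from_logits(pixels, logits)
            - kl_diag_gaussian_to_standard_normal(mean, log_scale))
```

The KL term vanishes when the posterior equals the prior (zero mean, zero log-scale) and grows as the posterior moves away from it, which is the pressure the second term of the objective puts on the variational parameters.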

Model Implementation¶
In the following, I am going to walk you through implementing this model. This will only be done for the first task, but you can use this to structure your code for all of the tasks after this. Also worth taking a look at question 1 at the end of this section before finishing the implementation – this should give you exactly what you have to implement in the update operations and training loop.

Data variable¶
We will do multiple session.run to update the variational parameters for one data batch. To ensure that the same batch is used, we define a variable for the data, and update it after updating the decoder parameters.
In [0]:
data_var = tf.Variable(
tf.ones(shape=(BATCH_SIZE, 28, 28, 1), dtype=tf.float32),
trainable=False)

data_assign_op = tf.assign(data_var, real_data)

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.

Define the decoder¶
In [0]:
DECODER_VARIABLE_SCOPE = "decoder"
In [0]:
def standard_decoder(z):
  with tf.variable_scope(DECODER_VARIABLE_SCOPE, reuse=tf.AUTO_REUSE):
    h = tf.layers.dense(z, 7 * 7 * 64, activation=tf.nn.relu)
    h = tf.reshape(h, shape=[BATCH_SIZE, 7, 7, 64])
    h = tf.layers.Conv2DTranspose(
        filters=32,
        kernel_size=5,
        strides=2,
        activation=tf.nn.relu,
        padding='same')(h)
    h = tf.layers.Conv2DTranspose(
        filters=1,
        kernel_size=5,
        strides=2,
        activation=None,  # Do not activate the last layer.
        padding='same')(h)
    return tf.distributions.Bernoulli(h)
In [0]:
decoder = standard_decoder

Define prior¶
In [0]:
def multi_normal(loc, log_scale):
  # We model the latent variables as independent
  return tfd.Independent(
      distribution=tfd.Normal(loc=loc, scale=tf.exp(log_scale)),
      reinterpreted_batch_ndims=1)

def make_prior():
  # Zero mean, unit variance prior.
  prior_mean = tf.zeros(shape=(BATCH_SIZE, NUM_LATENTS), dtype=tf.float32)
  prior_log_scale = tf.zeros(shape=(BATCH_SIZE, NUM_LATENTS), dtype=tf.float32)

  return multi_normal(prior_mean, prior_log_scale)
In [0]:
prior = make_prior()

Define variational posterior $q(z|x)$¶
Define this to be a multi-dimensional Gaussian distribution. You can use the helper function above for this, but keep in mind the parameters of this distribution (mean and variance) ought to be trained.
In [0]:
# Build the variational posterior

##################
# YOUR CODE HERE #
##################
# variational_posterior = multi_normal(….)

Define and build optimization objective (ELBO)¶
Putting things together: build the likelihood term and the KL term in the objective in (T1.1) description.
In [0]:
def bound_terms(data_batch, variational_posterior, decoder_fn):

  ##################
  # YOUR CODE HERE #
  ##################

  # Reduce mean over the batch dimensions
  likelihood_term = tf.reduce_mean(likelihood_term)

  ##################
  # YOUR CODE HERE #
  ##################

  # Reduce over the batch dimension.
  kl_term = tf.reduce_mean(kl_term)

  # Return the terms in the optimization objective in (1.1) description
  return likelihood_term, kl_term
In [0]:
# Maximize the data likelihood and minimize the KL divergence between the prior and posterior
likelihood_term, kl_term = bound_terms(data_var, variational_posterior, decoder)
train_elbo = likelihood_term - kl_term

##################
# YOUR CODE HERE #
##################
# loss = …

Build the update operations for the variational and global variables¶
In [0]:
# Variational variable optimizer
variational_vars_optimizer = tf.train.GradientDescentOptimizer(0.05)

##################
# YOUR CODE HERE #
##################
# variational_vars = … # list of variational variables

# Just to check
print('Variational vars" {}'.format(variational_vars))
variational_vars_update_op = variational_vars_optimizer.minimize(
    loss, var_list=variational_vars)

# Decoder optimizer
decoder_optimizer = tf.train.AdamOptimizer(0.001, beta1=0.9, beta2=0.9)
decoder_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, scope=DECODER_VARIABLE_SCOPE)
print('Decoder vars" {}'.format(decoder_vars))
decoder_update_op = decoder_optimizer.minimize(loss, var_list=decoder_vars)

Variational vars” [, ]
Decoder vars” [, , , , , ]
In [0]:
# Check trainable variables
tf.trainable_variables()
Out[0]:
[,
,
,
,
,
,
,
]

Training loop¶
In [0]:
# Number of SVI updates per sample
NUM_SVI_UPDATES = 10
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
losses = []
kls = []
likelihood_terms = []

# `range` works under both Python 2 and Python 3.
for i in range(TRAINING_STEPS):

  # Update the data batch.
  sess.run(data_assign_op)

  # Training (put things together based on the operations you've defined before)
  ##################
  # YOUR CODE HERE #
  ##################

  # Report the loss and the kl once in a while.
  if i % 10 == 0:
    iteration_loss, iteration_kl, iteration_likelihood = sess.run(
        [loss, kl_term, likelihood_term])
    print('Iteration {}. Loss {}. KL {}'.format(
        i, iteration_loss, iteration_kl))
    losses.append(iteration_loss)
    kls.append(iteration_kl)
    likelihood_terms.append(iteration_likelihood)

Results¶
Let us take a look at the optimization process and the resulting model

Visualize training process¶
Plot the loss and KL over the training process (number of iterations)
In [0]:
fig, axes = plt.subplots(1, 2, figsize=(2*8,5))

axes[0].plot(losses, label='Negative ELBO')
axes[0].set_title('Time', fontsize=15)
axes[0].legend()

axes[1].plot(kls, label='KL')
axes[1].set_title('Time', fontsize=15)
axes[1].legend()

Generate samples, reconstructions and latent interpolation¶
In [0]:
# Read data (just sample from the data set)
# real_data_examples

# Note: the reconstructions are only valid after the inner loop optimization has
# been performed.
# data_reconstructions = …
In [0]:
# Sample from the generative model!
# final_samples = …
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*4,4))

show_digits(axes[0], real_data_examples, 'Data')
show_digits(axes[1], data_reconstructions, 'Reconstructions')
show_digits(axes[2], final_samples, 'Samples')
In [0]:
show_latent_interpolations(lambda x: decoder(x).mean(), prior, sess)

Q1.1 VI Questions (28 pts):¶
We are going to go through some questions on the model you have just implemented. The first question here could be answered before the implementation and can act as a blueprint for how to do the training. We are going to spend a bit more time on this first method, as it is paramount that you understand the optimization process here: a lot of the other tasks build on top of this one.
Whenever a question asks for an effect/behaviour when varying one of the conditions, feel free to experiment. Both theoretical arguments and empirical plots showing the relevant behaviour will be accepted here.
1. [5 pts] Derive the variational ELBO for one data point $x$ and explain how one would update the parameters of the variational posterior $q_{\phi}(z|x)$, as well as the parameters of the generative distribution $p_{\theta}(x|z)$. Assume a Gaussian prior and a multi-dimensional Gaussian variational posterior, as well as the generative function given by the decoder in the code above. This is basically outlining the optimization you should be implementing in the Training loop section above.

2. [6 pts] In the section Build the update operations for the variational and global variables, I have defined two separate optimizers for the two sets of parameters ($\theta$ and $\phi$).
• i) How would you implement this with just one optimizer? (You just need to explain how you would do it, but not implement it.)
• ii) What happens if we change the variational variables' optimizer variational_vars_optimizer to tf.train.AdamOptimizer? (Feel free to experiment and change the learning rate accordingly.)
3. [2 pts] What are the computational considerations to think of when using SVI? What would happen if you now wanted to train this SVI model on a big dataset, such as ImageNet? Which part of this optimization process is most affected, and in which way?

4. [2 pts] What is the effect of the number of SVI updates on the ELBO and on the KL term?

5. [3 pts] What is the effect of the data batch size on the convergence speed compared to the effect of the number of SVI updates? What is the effect of the number of training steps for the decoder compared to the number of SVI steps per decoder update?

[10 pts] Model Implementation and Results

T1.2 Amortized Variational Inference¶
Reminder: The idea behind amortized inference is to replace the slow iterative optimization process we needed to do in the previous method for each data-point, with a faster non-iterative one. Check the lecture slides and/or references below for more details.
Thus, instead of learning one set of posterior variables per data point, we can use function approximation to learn the distributional variables. Specifically, the posterior parameters for $x_i$ will be the output of a learned function $f_\theta(x_i)$, where $\theta$ are parameters now shared across all data points. Can you think of why this is useful?
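
One way to make the trade-off concrete is to count parameters. The sketch below is plain Python with illustrative function names. It assumes the three-convolution encoder defined later in this section (5x5 kernels with 8, 16 and 32 filters, followed by two dense heads) and compares it with storing a separate mean and log-scale vector for every data point. The dataset sizes in the loop are hypothetical.

```python
NUM_LATENTS = 10  # same value as the notebook's hyperparameter cell


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a 2-D convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def encoder_param_count(num_latents=NUM_LATENTS):
    """Parameter count of the shared encoder; independent of the dataset size."""
    convs = conv_params(5, 1, 8) + conv_params(5, 8, 16) + conv_params(5, 16, 32)
    flat = 7 * 7 * 32  # 28x28 input, two stride-2 'same' convs, then one stride-1 conv
    heads = 2 * dense_params(flat, num_latents)  # mean head and log-scale head
    return convs + heads


def per_datapoint_param_count(num_data_points, num_latents=NUM_LATENTS):
    """One mean vector and one log-scale vector per data point."""
    return 2 * num_latents * num_data_points


for n in (1000, 100000, 10000000):  # hypothetical dataset sizes
    print(n, per_datapoint_param_count(n), encoder_param_count())
```

The per-data-point count grows linearly with the dataset, while the encoder count is fixed. The amortized posterior is also available for unseen inputs with a single forward pass, with no inner optimization loop.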

Objective – maximize: \begin{equation} \mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \log p_\theta(x|z) \right] - \mathbb{E}_{p^*(x)} \left[ KL(q(z|x)||p(z)) \right] \end{equation}
For more information, please check out:
• https://arxiv.org/abs/1312.6114
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.

Model Implementation¶
In [0]:
tf.reset_default_graph()
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
print(real_data.shape)

(?, 28, 28, 1)

Define the encoder¶
In [0]:
ENCODER_VARIABLE_SCOPE = 'encoder'
In [0]:
def encoder(x):
  with tf.variable_scope(ENCODER_VARIABLE_SCOPE, reuse=tf.AUTO_REUSE):

    h = tf.layers.Conv2D(
        filters=8,
        kernel_size=5,
        strides=2,
        activation=tf.nn.relu,
        padding='same')(x)
    h = tf.layers.Conv2D(
        filters=16,
        kernel_size=5,
        strides=2,
        activation=tf.nn.relu,
        padding='same')(h)
    h = tf.layers.Conv2D(
        filters=32,
        kernel_size=5,
        strides=1,
        activation=tf.nn.relu,
        padding='same')(h)

    # Flattened feature size: product of all non-batch dimensions.
    out_shape = 1
    for s in h.shape.as_list()[1:]:
      out_shape *= s

    h = tf.reshape(h, shape=[BATCH_SIZE, out_shape])
    mean = tf.layers.dense(h, NUM_LATENTS, activation=None)
    # This head is interpreted as the log of the scale by multi_normal.
    scale = tf.layers.dense(h, NUM_LATENTS, activation=None)
    return multi_normal(loc=mean, log_scale=scale)

Define the prior¶
In [0]:
##################
# YOUR CODE HERE #
##################
# prior = …

Define the variational posterior¶
Note: We no longer have to use a variable to store the data. We will perform one encoder update per decoder update, so it is OK for the data batch to be refreshed at each run.
In [0]:
##################
# YOUR CODE HERE #
##################
# variational_posterior = …

Define the decoder¶
We will use the same decoder as in T1.1
In [0]:
decoder = standard_decoder

Define and build optimization objective (ELBO)¶
In [0]:
# Maximize the data likelihood and minimize the KL divergence between the prior
# and posterior. We use the exact same loss as in the SVI case.

##################
# YOUR CODE HERE #
##################
# likelihood_term, kl_term = …
train_elbo = likelihood_term - kl_term
# loss =

Define optimization and the update operations¶
In [0]:
optimizer = tf.train.AdamOptimizer(0.001, beta1=0.9, beta2=0.9)

##################
# YOUR CODE HERE #
##################
# update_op = optimizer.minimize(loss, var_list=…)

Training¶
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
losses = []
kls = []
likelihood_terms = []

# `range` works under both Python 2 and Python 3.
for i in range(TRAINING_STEPS):

  ##################
  # YOUR CODE HERE #
  ##################
  # Training, use update_op

  if i % 100 == 0:
    iteration_loss, iteration_likelihood, iteration_kl = sess.run(
        [loss, likelihood_term, kl_term])
    print('Iteration {}. Loss {}. KL {}'.format(
        i, iteration_loss, iteration_kl))
    losses.append(iteration_loss)
    kls.append(iteration_kl)
    likelihood_terms.append(iteration_likelihood)

Results¶
Let us take a look at the optimization process and the resulting model

Visualize the loss in time¶
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*8,5))

axes[0].plot(losses, label='Negative ELBO')
axes[0].set_title('Time', fontsize=15)
axes[0].legend()

axes[1].plot(kls, label='KL')
axes[1].set_title('Time', fontsize=15)
axes[1].legend()

axes[2].plot(likelihood_terms, label='Likelihood Term')
axes[2].set_title('Time', fontsize=15)
axes[2].legend()

Generate samples and latent interpolations¶
In [0]:
samples = decoder(prior.sample()).mean()
samples.shape.assert_is_compatible_with([BATCH_SIZE, 28, 28, 1])

reconstructions = decoder(variational_posterior.sample()).mean()
In [0]:
real_data_vals, final_samples_vals, data_reconstructions_vals = sess.run(
[real_data, samples, reconstructions])
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*4,4))

show_digits(axes[0], real_data_vals, 'Data')
show_digits(axes[1], data_reconstructions_vals, 'Reconstructions')
show_digits(axes[2], final_samples_vals, 'Samples')
In [0]:
show_latent_interpolations(lambda x: decoder(x).mean(), prior, sess)

Q1.2 Questions about Amortized Variational Inference (15 pts)¶
1. [5 pts] What do you notice about amortized variational inference (especially as compared with stochastic variational inference)?
• i) Are there any downsides to using the amortized version?
• ii) What do you observe about sample quality and reconstruction quality?
• iii) What do you observe about the ELBO and KL term? (Here, feel free to vary parameters and compare with T1.1.)
2. [4 pts] Stochastic and amortized variational inference can be combined, leading to semi-amortized variational inference. Give an instance of an algorithm that would combine these and explain why that would be useful.
3. [1 pts] What gradient estimation method is used to compute the gradients with respect to the encoder parameters?
[5 pts] Model Implementation and Results

T1.3 KL annealing¶
In this section we are going to be looking at the same model as in T1.2: same encoder, decoder and prior. But we are going to slightly change the optimization objective, as given below.
Objective – maximize: \begin{equation} \mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \log p_\theta(x|z) \right] - \alpha \mathbb{E}_{p^*(x)} \left[ KL(q(z|x)||p(z)) \right] \end{equation}
Here $\alpha$ changes during training, so that the KL term is weighted more heavily as training progresses. In particular, for our problem consider: \begin{equation} \alpha = \frac{n_{iter}}{N} \end{equation} where $n_{iter}$ is the number of training iterations we have completed and $N$ is the total number of training iterations, TRAINING_STEPS.
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.
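
The linear schedule above can be written down outside the graph as a quick sanity check. This is a plain-Python sketch with illustrative function names, assuming the schedule runs from 0 at the first iteration to 1 after TRAINING_STEPS iterations. The in-graph version would instead keep the coefficient in a non-trainable variable and increment it after each update.

```python
TRAINING_STEPS = 10000  # same value as the notebook's hyperparameter cell


def kl_coefficient(n_iter, total_steps=TRAINING_STEPS):
    """Linear schedule alpha = n_iter / N."""
    return n_iter / float(total_steps)


def annealed_objective(likelihood_term, kl_term, n_iter, total_steps=TRAINING_STEPS):
    """Objective to maximize: likelihood minus the alpha-weighted KL."""
    return likelihood_term - kl_coefficient(n_iter, total_steps) * kl_term


# The per-iteration increment of the coefficient is constant.
kl_step = 1.0 / TRAINING_STEPS
print(kl_coefficient(0), kl_coefficient(TRAINING_STEPS // 2), kl_step)
```

At iteration 0 the objective reduces to the reconstruction term alone, and at the final iteration it coincides with the standard ELBO.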

Model Implementation¶
In [0]:
tf.reset_default_graph()
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
In [0]:
prior = make_prior()
decoder = standard_decoder
encoder = encoder

##################
# YOUR CODE HERE #
##################
# variational_posterior = … # Hint: From T1.2

Define the KL coefficient $\alpha$ and its update function¶
In [0]:
##################
# YOUR CODE HERE #
##################
#kl_coefficient = …
#kl_step = …

update_kl_coeff = tf.assign(kl_coefficient, kl_coefficient + kl_step)
In [0]:
##################
# YOUR CODE HERE #
##################
# Hint: This is very similar to what you've done in T1.2. Same model, only slightly different loss including $\alpha$
# loss = …
In [0]:
# We now perform joint optimization on the encoder and decoder variables.
optimizer = tf.train.AdamOptimizer(0.001, beta1=0.9, beta2=0.9)

##################
# YOUR CODE HERE #
##################
# Parameter update operation (as before)
# variables_update_op = …

# Ensure that a variable update is followed by an update in the KL coefficient.
with tf.control_dependencies([variables_update_op]):
  update_op = tf.identity(update_kl_coeff)

Training¶
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
losses = []
kls = []
likelihood_terms = []

# `range` works under both Python 2 and Python 3.
for i in range(TRAINING_STEPS):
  sess.run(update_op)

  if i % 100 == 0:
    iteration_loss, iteration_likelihood, iteration_kl = sess.run(
        [loss, likelihood_term, kl_term])
    print('Iteration {}. Loss {}. KL {}'.format(
        i, iteration_loss, iteration_kl))
    losses.append(iteration_loss)
    kls.append(iteration_kl)
    likelihood_terms.append(iteration_likelihood)

Results¶
Let us take a look at the optimization process and the resulting model

Visualize training process¶
Plot the loss and KL and likelihood over the training process (number of iterations)
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*8,5))

axes[0].plot(losses, label='Negative ELBO')
axes[0].set_title('Time', fontsize=15)
axes[0].legend()

axes[1].plot(kls, label='KL')
axes[1].set_title('Time', fontsize=15)
axes[1].legend()

axes[2].plot(likelihood_terms, label='Likelihood Term')
axes[2].set_title('Time', fontsize=15)
axes[2].legend()

Generate samples, reconstructions and latent interpolation¶
In [0]:
samples = decoder(prior.sample()).mean()
samples.shape.assert_is_compatible_with([BATCH_SIZE, 28, 28, 1])

reconstructions = decoder(variational_posterior.sample()).mean()
In [0]:
real_data_vals, final_samples_vals, data_reconstructions_vals = sess.run(
[real_data, samples, reconstructions])
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*4,4))

show_digits(axes[0], real_data_vals, 'Data')
show_digits(axes[1], data_reconstructions_vals, 'Reconstructions')
show_digits(axes[2], final_samples_vals, 'Samples')
In [0]:
show_latent_interpolations(lambda x: decoder(x).mean(), prior, sess)

Q1.3 Questions about KL annealing (15 pts):¶
1. [3 pts] What do you observe about the KL behaviour throughout training as opposed to amortized variational inference without any KL annealing?
2. [1 pts] How do the samples and reconstruction compare with the previous models?
3. [6 pts] Consider now a schedule where $\alpha$ decreases over time, that is, the contribution of the KL diminishes over time. When would that be a useful case? (Think about what this objective corresponds to in the optimization problem.)
[5 pts] Model Implementation and Results

T1.4 Constrained optimization¶
In this next part, instead of using KL annealing, constrained optimization is used to automatically tune the relative weight of the likelihood and KL terms. This removes the need to manually create an optimization schedule, which can be problem specific.
The objective now becomes:
\begin{equation} \text{minimize } \mathbb{E}_{p^*(x)} KL(q(z|x)||p(z)) \text{ such that } \mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ {\log p_\theta(x|z)} \right] > \alpha \end{equation}
This can be solved using Lagrange multipliers. The objective becomes:
\begin{equation} \text{minimize } \mathbb{E}_{p^*(x)} KL(q(z|x)||p(z)) + \lambda \left( \mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \alpha - \log p_\theta(x|z) \right] \right) \end{equation}
The difference compared to the KL annealing is that:
• $\lambda$ is a learned parameter: it will be learned using stochastic gradient methods, like the network parameters. The difference is that the Lagrange multiplier $\lambda$ (referred to as the lagrangian in the code) has to solve a maximization problem. You can see this intuitively: the gradient with respect to $\lambda$ in the objective above is $\mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \alpha - \log p_\theta(x|z) \right]$. If $\mathbb{E}_{p^*(x)} \mathbb{E}_{q(z|x)} \left[ \alpha - \log p_\theta(x|z) \right] > 0$, the constraint is not being satisfied, so the value of $\lambda$ needs to increase. This will be done by doing gradient ascent, instead of gradient descent. Note that for $\lambda$ to be a valid multiplier in a minimization problem, it has to be positive.
• The practitioner has to specify the hyperparameter $\alpha$, which determines the reconstruction quality of the model.
• The coefficient is in front of the likelihood term, not the KL term. This is mainly for convenience, as it is easier to specify the hyperparameter $\alpha$ for the likelihood (reconstruction loss).
For more assumptions made by this method, see the Karush–Kuhn–Tucker conditions.
For more information, see:
• http://bayesiandeeplearning.org/2018/papers/33.pdf
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.
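
The descent/ascent scheme described in the list above can be tried out on a toy problem before wiring it into the model. The sketch below is plain Python and not the coursework model: it minimizes $x^2$ subject to $x \geq 1$ using the Lagrangian $x^2 + \lambda (1 - x)$, with $\lambda$ kept positive through a softplus. The problem, step size and function names are assumptions chosen for illustration.

```python
import math


def softplus(v):
    """Numerically stable log(1 + exp(v)); keeps the multiplier positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def sigmoid(v):
    """Derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-v))


def solve_toy_problem(steps=5000, lr=0.05):
    """Minimize x**2 subject to x >= 1 via L(x, lam) = x**2 + lam * (1 - x)."""
    x = 0.0
    v = math.log(math.e - 1.0)  # chosen so that softplus(v) starts at 1
    for _ in range(steps):
        lam = softplus(v)
        grad_x = 2.0 * x - lam           # dL/dx
        grad_v = (1.0 - x) * sigmoid(v)  # dL/dlam * dlam/dv
        x -= lr * grad_x  # gradient descent on the primal variable
        v += lr * grad_v  # gradient ascent on the multiplier
    return x, softplus(v)


x_final, lam_final = solve_toy_problem()
print(x_final, lam_final)
```

While the constraint is violated ($x < 1$) the multiplier grows and pushes $x$ up. Once the constraint is met the multiplier stops growing, and the iterates settle near $x = 1$, $\lambda = 2$, the point where the gradient of the Lagrangian with respect to $x$ vanishes on the constraint boundary.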

Model Implementation¶
In [0]:
tf.reset_default_graph()
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
In [0]:
prior = make_prior()
decoder = standard_decoder
encoder = encoder

##################
# YOUR CODE HERE #
##################
# variational_posterior = … # Hint: From T1.2

Define the lagrangian variable $\lambda$.¶
Unlike in the KL annealing case, we learn the coefficient. Remember that this variable always has to be positive. To ensure this, use tf.nn.softplus. Moreover, please initialize the lagrangian such that after the softplus the coefficient is approximately 1. Check empirically that this is true when instantiating the variable.
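
One way to pick such an initial value is to invert the softplus analytically. The following plain-Python sketch uses illustrative helper names that are not part of the notebook, and checks the round trip numerically.

```python
import math


def softplus(v):
    """log(1 + exp(v)), written in a numerically stable form."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def inverse_softplus(y):
    """Value v with softplus(v) == y, valid for y > 0."""
    return math.log(math.expm1(y))


initial_value = inverse_softplus(1.0)
print(initial_value, softplus(initial_value))
```

The same check can be run in the graph by evaluating the softplus of the freshly initialized variable in a session.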
In [0]:
##################
# YOUR CODE HERE #
##################
#lagrangian_var = …

# Ensure that the lagrangian is positive and has stable dynamics.
lagrangian = tf.nn.softplus(lagrangian_var)
In [0]:
# How good do we want the reconstruction loss to be?
# We can look at previous runs to get an idea what a reasonable value would be.

##################
# YOUR VALUE HERE#
##################
# reconstruction_target = …

Define the loss¶
In [0]:
##################
# YOUR CODE HERE #
##################
#
# loss = kl_term + lagrangian * (reconstruction_target - likelihood_term)
In [0]:
# Check trainable variables (the lagrangian variable should be in here)
tf.trainable_variables()
In [0]:
lagrangian_optimizer = tf.train.GradientDescentOptimizer(0.001)

##################
# YOUR CODE HERE #
##################
# autoencoder_variables_update_op = …

# Ensure that a variable update is followed by an update to the Lagrangian.
with tf.control_dependencies([autoencoder_variables_update_op]):
  # Ensure that the lagrangian solves a maximization problem instead of a
  # minimization problem by changing the sign of the loss function.
  update_op = lagrangian_optimizer.minimize(-loss, var_list=[lagrangian_var])

Training¶
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
losses = []
kls = []
likelihood_terms = []
lagrangian_values = []

# `range` works under both Python 2 and Python 3.
for i in range(TRAINING_STEPS):
  sess.run(update_op)

  if i % 100 == 0:
    iteration_loss, iteration_likelihood, iteration_kl, lag_val = sess.run(
        [loss, likelihood_term, kl_term, lagrangian])
    print('Iteration {}. Loss {}. KL {}. Lagrangian {}'.format(
        i, iteration_loss, iteration_kl, lag_val))
    losses.append(iteration_loss)
    kls.append(iteration_kl)
    likelihood_terms.append(iteration_likelihood)
    lagrangian_values.append(lag_val)

Iteration 0. Loss [541.7995]. KL 0.0352992638946. Lagrangian [1.197807]
Iteration 100. Loss [2079.5444]. KL 23.6862182617. Lagrangian [19.252846]
Iteration 200. Loss [1809.995]. KL 44.9460372925. Lagrangian [28.361876]
Iteration 300. Loss [982.47107]. KL 54.9215202332. Lagrangian [32.87678]
Iteration 400. Loss [680.7418]. KL 60.2341461182. Lagrangian [34.964283]
Iteration 500. Loss [596.9203]. KL 56.2811889648. Lagrangian [36.20462]
Iteration 600. Loss [115.77269]. KL 58.1710243225. Lagrangian [36.948597]
Iteration 700. Loss [210.97853]. KL 57.4608154297. Lagrangian [37.513374]
Iteration 800. Loss [136.11151]. KL 51.4106674194. Lagrangian [37.803387]
Iteration 900. Loss [58.05602]. KL 53.627204895. Lagrangian [37.95315]
Iteration 1000. Loss [-51.688946]. KL 51.8700370789. Lagrangian [37.951996]
Iteration 1100. Loss [151.58333]. KL 50.7084274292. Lagrangian [37.86333]
Iteration 1200. Loss [-135.017]. KL 46.5962524414. Lagrangian [37.711273]
Iteration 1300. Loss [32.000687]. KL 49.9308853149. Lagrangian [37.478817]
Iteration 1400. Loss [-116.43298]. KL 49.2396583557. Lagrangian [37.110218]
Iteration 1500. Loss [-168.37144]. KL 49.0758171082. Lagrangian [36.737503]
Iteration 1600. Loss [-19.992836]. KL 45.9241333008. Lagrangian [36.28124]
Iteration 1700. Loss [-261.8975]. KL 43.7975120544. Lagrangian [35.83392]
Iteration 1800. Loss [-101.60282]. KL 47.7132949829. Lagrangian [35.234978]
Iteration 1900. Loss [-204.36224]. KL 45.5789337158. Lagrangian [34.702778]
Iteration 2000. Loss [-42.759132]. KL 46.159034729. Lagrangian [34.081306]
Iteration 2100. Loss [-284.24768]. KL 46.3801574707. Lagrangian [33.371338]
Iteration 2200. Loss [-183.06717]. KL 44.3533935547. Lagrangian [32.71816]
Iteration 2300. Loss [-29.749825]. KL 45.9133453369. Lagrangian [31.986]
Iteration 2400. Loss [-319.1849]. KL 43.3486213684. Lagrangian [31.233154]
Iteration 2500. Loss [-159.78786]. KL 42.0662231445. Lagrangian [30.504232]
Iteration 2600. Loss [-202.27539]. KL 42.9071578979. Lagrangian [29.715223]
Iteration 2700. Loss [-214.05893]. KL 41.4365158081. Lagrangian [28.887203]
Iteration 2800. Loss [-269.72375]. KL 42.4422607422. Lagrangian [28.050806]
Iteration 2900. Loss [-283.41443]. KL 41.647026062. Lagrangian [27.18032]
Iteration 3000. Loss [-133.74292]. KL 40.6553421021. Lagrangian [26.352976]
Iteration 3100. Loss [-152.64445]. KL 41.2285957336. Lagrangian [25.46095]
Iteration 3200. Loss [-96.81888]. KL 41.5986289978. Lagrangian [24.491226]
Iteration 3300. Loss [-196.9295]. KL 42.6475219727. Lagrangian [23.551863]
Iteration 3400. Loss [-153.52452]. KL 40.165599823. Lagrangian [22.673553]
Iteration 3500. Loss [-151.06546]. KL 37.5955734253. Lagrangian [21.687883]
Iteration 3600. Loss [-229.68317]. KL 39.9825668335. Lagrangian [20.716698]
Iteration 3700. Loss [-104.87711]. KL 38.4817276001. Lagrangian [19.743465]
Iteration 3800. Loss [-105.99178]. KL 37.782875061. Lagrangian [18.733461]
Iteration 3900. Loss [-249.00357]. KL 38.3341903687. Lagrangian [17.74827]
Iteration 4000. Loss [-68.69702]. KL 36.1678352356. Lagrangian [16.70795]
Iteration 4100. Loss [-174.97833]. KL 39.7596549988. Lagrangian [15.695165]
Iteration 4200. Loss [-155.16246]. KL 34.0775718689. Lagrangian [14.718307]
Iteration 4300. Loss [-66.7576]. KL 37.329624176. Lagrangian [13.623195]
Iteration 4400. Loss [-108.89918]. KL 35.0873794556. Lagrangian [12.51753]
Iteration 4500. Loss [-118.21915]. KL 33.5963401794. Lagrangian [11.489143]
Iteration 4600. Loss [-53.56543]. KL 34.0425567627. Lagrangian [10.457892]
Iteration 4700. Loss [-26.112885]. KL 34.0569229126. Lagrangian [9.426848]
Iteration 4800. Loss [-18.935707]. KL 32.3330039978. Lagrangian [8.370506]
Iteration 4900. Loss [-36.000717]. KL 31.3581390381. Lagrangian [7.2873397]
Iteration 5000. Loss [-12.562384]. KL 31.8118114471. Lagrangian [6.20336]
Iteration 5100. Loss [-7.5185757]. KL 29.1720752716. Lagrangian [5.173302]
Iteration 5200. Loss [-15.894596]. KL 30.4736557007. Lagrangian [4.1561613]
Iteration 5300. Loss [8.029518]. KL 26.8174228668. Lagrangian [3.1686027]
Iteration 5400. Loss [25.604906]. KL 26.1411895752. Lagrangian [2.3228507]
Iteration 5500. Loss [6.5341053]. KL 22.5977516174. Lagrangian [1.658878]
Iteration 5600. Loss [13.560015]. KL 22.0133094788. Lagrangian [1.227013]
Iteration 5700. Loss [9.776123]. KL 19.2209739685. Lagrangian [0.94283783]
Iteration 5800. Loss [17.78655]. KL 19.883392334. Lagrangian [0.7869422]
Iteration 5900. Loss [16.17637]. KL 18.150560379. Lagrangian [0.6876382]
Iteration 6000. Loss [17.821135]. KL 17.0874729156. Lagrangian [0.6272664]
Iteration 6100. Loss [15.111039]. KL 16.5076503754. Lagrangian [0.574869]
Iteration 6200. Loss [16.96215]. KL 16.7488555908. Lagrangian [0.5575547]
Iteration 6300. Loss [16.180567]. KL 16.7198677063. Lagrangian [0.54117465]
Iteration 6400. Loss [14.43391]. KL 15.754486084. Lagrangian [0.5468288]
Iteration 6500. Loss [19.180103]. KL 16.8049125671. Lagrangian [0.5451356]
Iteration 6600. Loss [12.784924]. KL 15.6518287659. Lagrangian [0.5357091]
Iteration 6700. Loss [16.608559]. KL 16.9377422333. Lagrangian [0.5367914]
Iteration 6800. Loss [16.740673]. KL 15.5889787674. Lagrangian [0.53593206]
Iteration 6900. Loss [10.609623]. KL 15.0649805069. Lagrangian [0.5185327]
Iteration 7000. Loss [16.558035]. KL 16.1625747681. Lagrangian [0.51629287]
Iteration 7100. Loss [17.005291]. KL 15.8910598755. Lagrangian [0.51144433]
Iteration 7200. Loss [16.193079]. KL 15.6455135345. Lagrangian [0.516937]
Iteration 7300. Loss [17.056091]. KL 15.51628685. Lagrangian [0.51827925]
Iteration 7400. Loss [14.588214]. KL 15.8725547791. Lagrangian [0.5180299]
Iteration 7500. Loss [14.940847]. KL 16.0428085327. Lagrangian [0.5280319]
Iteration 7600. Loss [13.2639065]. KL 14.7286262512. Lagrangian [0.5275872]
Iteration 7700. Loss [15.796695]. KL 16.0446853638. Lagrangian [0.5240154]
Iteration 7800. Loss [16.879593]. KL 15.6523971558. Lagrangian [0.5115347]
Iteration 7900. Loss [15.15191]. KL 15.8156938553. Lagrangian [0.5143325]
Iteration 8000. Loss [16.134165]. KL 16.0066623688. Lagrangian [0.513113]
Iteration 8100. Loss [13.667096]. KL 15.0862979889. Lagrangian [0.51661235]
Iteration 8200. Loss [13.566738]. KL 16.448217392. Lagrangian [0.5129573]
Iteration 8300. Loss [17.88404]. KL 15.3343410492. Lagrangian [0.5109427]
Iteration 8400. Loss [10.400753]. KL 15.0261154175. Lagrangian [0.5113629]
Iteration 8500. Loss [15.084027]. KL 15.5496006012. Lagrangian [0.5062438]
Iteration 8600. Loss [15.508322]. KL 15.0626335144. Lagrangian [0.4987896]
Iteration 8700. Loss [13.834112]. KL 14.5499210358. Lagrangian [0.49169123]
Iteration 8800. Loss [16.373087]. KL 15.8138771057. Lagrangian [0.4988923]
Iteration 8900. Loss [16.049427]. KL 15.4969863892. Lagrangian [0.4991973]
Iteration 9000. Loss [14.9651165]. KL 15.0618896484. Lagrangian [0.49420357]
Iteration 9100. Loss [14.640309]. KL 15.2749557495. Lagrangian [0.49794906]
Iteration 9200. Loss [16.911331]. KL 14.5537433624. Lagrangian [0.49588507]
Iteration 9300. Loss [15.264124]. KL 15.9398488998. Lagrangian [0.49344075]
Iteration 9400. Loss [18.282375]. KL 15.1034412384. Lagrangian [0.49902657]
Iteration 9500. Loss [11.54266]. KL 14.602973938. Lagrangian [0.49189723]
Iteration 9600. Loss [18.094093]. KL 15.3920249939. Lagrangian [0.48762178]
Iteration 9700. Loss [13.533478]. KL 14.4577045441. Lagrangian [0.49118614]
Iteration 9800. Loss [12.521185]. KL 14.5473852158. Lagrangian [0.4888521]
Iteration 9900. Loss [15.116826]. KL 15.0607624054. Lagrangian [0.49271715]

Results¶
Let us take a look at the optimization process and the resulting model.

Visualize training process¶
Plot the loss and KL over the training process (number of iterations)
In [0]:
fig, axes = plt.subplots(2, 2, figsize=(2*8, 2* 5))

axes[0, 0].plot(losses, label='Negative ELBO')
axes[0, 0].set_title('Time', fontsize=15)
axes[0, 0].legend()

axes[0, 1].plot(kls, label='KL')
axes[0, 1].set_title('Time', fontsize=15)
axes[0, 1].legend()

axes[1, 0].plot(likelihood_terms, label='Likelihood Term')
axes[1, 0].set_title('Time', fontsize=15)
axes[1, 0].legend()

axes[1, 1].plot(lagrangian_values, label='Lagrangian Values')
axes[1, 1].set_title('Time', fontsize=15)
axes[1, 1].legend()
In [0]:
samples = decoder(prior.sample()).mean()
samples.shape.assert_is_compatible_with([BATCH_SIZE, 28, 28, 1])

reconstructions = decoder(variational_posterior.sample()).mean()

Generate samples and latent interpolations¶
In [0]:
real_data_vals, final_samples_vals, data_reconstructions_vals = sess.run(
[real_data, samples, reconstructions])
In [0]:
fig, axes = plt.subplots(1, 3, figsize=(3*4,4))

show_digits(axes[0], real_data_vals, 'Data')
show_digits(axes[1], data_reconstructions_vals, 'Reconstructions')
show_digits(axes[2], final_samples_vals, 'Samples')
In [0]:
show_latent_interpolations(lambda x: decoder(x).mean(), prior, sess)

Q4.1 Questions about constrained optimization (12 pts)¶
1. [3 pts] Based on previous results, set and try varying the threshold for reconstruction (variable reconstruction_target in the code above). Describe what happens when you vary this variable. (Produce the plots to support your answer.)
2. [1 pts] What do you observe about the behaviour of the likelihood and KL terms throughout training? How is it different from Stochastic Variational Inference and Amortized Variational Inference with and without KL annealing?
3. [3 pts] What do you notice about the behaviour of the lagrangian during training? Is that what you expected?
[5 pts] Model Implementation and Results
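To build intuition for how a Lagrange multiplier behaves under gradient-based constrained optimization, the following toy sketch may help. It is not the coursework objective: the problem (minimise $x^2$ subject to $x \geq 1$), the function name `constrained_minimize`, and the step size are all assumptions chosen for illustration, and it presumes that the `lagrangian` variable in the code above plays the role of a multiplier. The update is plain gradient descent on the variable and projected gradient ascent on the multiplier.

```python
def constrained_minimize(steps=500, lr=0.1):
    """Minimise f(x) = x**2 subject to g(x) = 1 - x <= 0 (toy problem)."""
    x, lam = 0.0, 0.0
    trace = []
    for _ in range(steps):
        # Gradient descent on the Lagrangian x**2 + lam * (1 - x) w.r.t. x.
        x -= lr * (2.0 * x - lam)
        # Gradient ascent w.r.t. the multiplier, projected onto lam >= 0.
        lam = max(0.0, lam + lr * (1.0 - x))
        trace.append((x, lam))
    return x, lam, trace


x_final, lam_final, trace = constrained_minimize()
print('x = {:.4f}, multiplier = {:.4f}'.format(x_final, lam_final))
```

In this toy problem the multiplier grows while the constraint is violated and settles once it is met. Whether the coursework model shows the same pattern is something to check against your own plots.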

Part 2: Reconstruction-free Generative Models¶

T2.1 Generative Adversarial Networks¶
So far we have discussed variational inference models, which learn how to match the marginal distribution $p_\theta(x)$, learned by the model, with the true data distribution $p^\star(x)$ through the variational lower bound. This approach uses latent variables and requires that the conditional posterior distributions cover the prior space; otherwise the decoder will not be able to generalise to prior samples which are unlike what it has seen during training.
To avoid this issue, some methods directly match $p_\theta(x)$ learned by the model with the true data distribution $p^\star(x)$. Such an approach is given by generative adversarial networks (GANs).
Generative adversarial networks optimize an adversarial two-player game given by the value function: \begin{equation} \min_{G} \max_{D} \; \mathbb{E}_{p^\star(x)} \log D(x) + \mathbb{E}_{p(z)} \log (1 - D(G(z))) \end{equation} where $G$ is the generator (as before, it takes a latent sample $z$ and produces an image $x_{gen}$) and $D$ denotes the discriminator.
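As a rough numerical illustration of this value function, the sketch below estimates it from hand-picked discriminator outputs. The function name `gan_value` and the example probabilities are assumptions made for illustration; they are not part of the coursework code. The discriminator tries to increase this quantity, while the generator tries to decrease it.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities D(x) on real samples.
    d_fake: discriminator probabilities D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident and correct discriminator pushes the value towards its maximum, 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
print('confused: {:.4f}, confident: {:.4f}'.format(confused, confident))
```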
A depiction of the model can be found below:

For more information, see:
• https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
• https://arxiv.org/abs/1701.00160
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.

Q2.1 Preliminary questions (5 pts)¶
Before trying to implement this generative model, let us take a closer look at its components and how one can train them.
• [2 pts] The first thing to note is that we now have two models to train: a generative model $G$ and a discriminative model $D$. How does one train these two models in the (standard) GAN formulation? Give the update rules and losses for each of these ($D$ and $G$). Which of these losses uses the generated data and which uses the real data?
• [3 pts] [Generator loss] Instead of using the generator loss above, in practice we often use a surrogate, $- \log D(G(z))$. Why do you think that is the case? Plot the original loss and its associated gradients. Then plot the surrogate loss $- \log D(G(z))$ and its associated gradients.
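One possible starting point for the plots requested above is to express both generator losses as functions of the discriminator logit $a$, with $D(G(z)) = \sigma(a)$. The sketch below tabulates the two losses and their derivatives with respect to $a$; the helper names and the chosen logits are assumptions for illustration, and the values can be passed to any plotting routine.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_loss_and_grad(a):
    """log(1 - D) and its derivative with respect to the logit a."""
    d = sigmoid(a)
    return math.log(1.0 - d), -d


def non_saturating_loss_and_grad(a):
    """-log(D) and its derivative with respect to the logit a."""
    d = sigmoid(a)
    return -math.log(d), -(1.0 - d)


# Negative logits mean the discriminator rejects the generated sample.
for a in [-6.0, -3.0, 0.0, 3.0]:
    loss_s, grad_s = saturating_loss_and_grad(a)
    loss_n, grad_n = non_saturating_loss_and_grad(a)
    print('logit {:+.1f}: original {:.4f} (grad {:.4f}) | '
          'surrogate {:.4f} (grad {:.4f})'.format(
              a, loss_s, grad_s, loss_n, grad_n))
```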

Model Implementation¶
In [0]:
tf.reset_default_graph()

Get and rescale the data¶
Scale the data between -1 and 1. This helps training stability and improves GAN convergence.
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
real_data = 2 * real_data - 1
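The rescaling above is an affine map, so it can be undone exactly when samples need to be displayed as ordinary images. A minimal sketch, assuming the input pixels lie in [0, 1]; the helper names `to_gan_range` and `to_unit_range` are hypothetical and not part of the notebook.

```python
def to_gan_range(pixels):
    """Map values from [0, 1] to [-1, 1], as done for real_data."""
    return [2.0 * p - 1.0 for p in pixels]


def to_unit_range(values):
    """Inverse map from [-1, 1] back to [0, 1], e.g. for plotting samples."""
    return [(v + 1.0) / 2.0 for v in values]


pixels = [0.0, 0.25, 0.5, 1.0]
scaled = to_gan_range(pixels)
restored = to_unit_range(scaled)
```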

Define the discriminator and generator networks¶
We will use the same network as the VAE decoder for the generator. The only difference is that the generator here is implicit: it does not define a probability distribution over pixels. Since the input data is scaled to lie between -1 and 1, the generator uses a tanh output non-linearity so that its output range is the same.
In [0]:
DISCRIMINATOR_VARIABLE_SCOPE = 'discriminator'
GENERATOR_VARIABLE_SCOPE = 'generator'
In [0]:
def discriminator(x):
    with tf.variable_scope(DISCRIMINATOR_VARIABLE_SCOPE, reuse=tf.AUTO_REUSE):
        h = x
        h = tf.layers.Conv2D(
            filters=8,
            kernel_size=5,
            strides=2,
            activation=tf.nn.relu,
            padding='same')(h)
        h = tf.layers.Conv2D(
            filters=16,
            kernel_size=5,
            strides=1,
            activation=tf.nn.relu,
            padding='same')(h)
        h = tf.layers.Conv2D(
            filters=32,
            kernel_size=5,
            strides=2,
            activation=tf.nn.relu,
            padding='same')(h)
        h = tf.layers.Conv2D(
            filters=64,
            kernel_size=5,
            strides=1,
            activation=tf.nn.relu,
            padding='same')(h)
        h = tf.layers.Conv2D(
            filters=64,
            kernel_size=5,
            strides=2,
            activation=tf.nn.relu,
            padding='same')(h)

        # Flatten all non-batch dimensions before the final linear layer.
        out_shape = 1
        for s in h.shape.as_list()[1:]:
            out_shape *= s

        h = tf.reshape(h, shape=[BATCH_SIZE, out_shape])
        # Return unnormalised logits; no sigmoid is applied here.
        logits = tf.layers.dense(h, 1, activation=None)
        return logits

In [0]:
def generator(z):
    with tf.variable_scope(GENERATOR_VARIABLE_SCOPE, reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(z, 7 * 7 * 64, activation=tf.nn.relu)
        h = tf.reshape(h, shape=[BATCH_SIZE, 7, 7, 64])
        h = tf.layers.Conv2DTranspose(
            filters=32,
            kernel_size=5,
            strides=2,
            activation=tf.nn.relu,
            padding='same')(h)
        h = tf.layers.Conv2DTranspose(
            filters=1,
            kernel_size=5,
            strides=2,
            activation=None,  # Do not activate the last layer.
            padding='same')(h)
        # tanh keeps the output in [-1, 1], matching the rescaled data.
        return tf.nn.tanh(h)

Generate samples¶
In [0]:
##################
# YOUR CODE HERE #
##################
# samples = …

Set up the adversarial game¶

Discriminator loss¶
In [0]:
##################
# YOUR CODE HERE #
##################

# Reduce loss over batch dimension
# discriminator_loss = tf.reduce_mean(…)

Generator loss¶
In [0]:
##################
# YOUR CODE HERE #
##################
# generator_loss = tf.reduce_mean(…)

Create optimizers and training ops¶
Important: You need to pass the list of variables to each TensorFlow optimizer (via var_list); otherwise each minimize call updates all trainable variables, so both the generator and the discriminator variables would be trained on both the discriminator loss and the generator loss.
We want to freeze the discriminator when we update the generator, and vice versa.
In [0]:
discriminator_optimizer = tf.train.AdamOptimizer(0.0001, beta1=0.5, beta2=0.9)
generator_optimizer = tf.train.AdamOptimizer(0.0003, beta1=0.9, beta2=0.9)

# Optimize the discriminator.
discriminator_vars = tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES, scope=DISCRIMINATOR_VARIABLE_SCOPE)
discriminator_update_op = discriminator_optimizer.minimize(
discriminator_loss, var_list=discriminator_vars)

# Optimize the generator.
generator_vars = tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES, scope=GENERATOR_VARIABLE_SCOPE)
generator_update_op = generator_optimizer.minimize(
generator_loss, var_list=generator_vars)

Training¶

Create the tensorflow session¶
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())

Training Loop¶
We train the discriminator and generator by alternating gradient descent runs. We record the losses to plot them later.
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
disc_losses = []
gen_losses = []

for i in xrange(TRAINING_STEPS):
    sess.run(discriminator_update_op)
    sess.run(generator_update_op)

    if i % 100 == 0:
        disc_loss = sess.run(discriminator_loss)
        gen_loss = sess.run(generator_loss)

        print('Iteration: {}. Disc loss: {}. Generator loss {}'.format(
            i, disc_loss, gen_loss))

        disc_losses.append(disc_loss)
        gen_losses.append(gen_loss)
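The alternating scheme in the loop above can be observed in isolation on a toy problem. The sketch below is an illustration under stated assumptions, not a model of the MNIST GAN: it uses the bilinear game $V(d, g) = d \cdot g$, where a scalar "discriminator" $d$ ascends $V$ and a scalar "generator" $g$ descends it. The function name `alternating_updates` and its parameters are hypothetical.

```python
def alternating_updates(steps, lr=0.1, disc_steps=1, d=1.0, g=1.0):
    """Alternating gradient ascent (d) and descent (g) on V(d, g) = d * g."""
    history = []
    for _ in range(steps):
        for _ in range(disc_steps):
            d += lr * g  # dV/dd = g, the maximising player steps uphill.
        g -= lr * d      # dV/dg = d, the minimising player steps downhill.
        history.append((d, g))
    return history


history = alternating_updates(200)
radii = [d * d + g * g for d, g in history]
print('min squared radius {:.3f}, max squared radius {:.3f}'.format(
    min(radii), max(radii)))
```

With these settings the iterates circle the equilibrium at (0, 0) instead of converging to it, which is one simple way to see why the recorded losses need not decrease monotonically. The `disc_steps` argument allows experimenting with several discriminator updates per generator update.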

Results¶

Visualize the behaviour of the two losses during training¶
Note that, unlike the losses of classifiers or VAEs, GAN losses do not decrease steadily: they go up and down depending on the training dynamics between the two players.
In [0]:
figsize = (18, 4)
fig, axs = plt.subplots(1, 2, figsize=figsize)

# Plot the discriminator loss (left) and the generator loss (right).
axs[0].plot(disc_losses, '-')
axs[0].plot([np.log(2)] * len(disc_losses), 'r--', label='Discriminator is being fooled')
axs[0].legend()
axs[0].set_title('Discriminator loss', fontsize=20, y=-0.2)
axs[1].plot(gen_losses, '-')
axs[1].set_title('Generator loss', fontsize=20, y=-0.2)
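The dashed reference line at $\log 2$ in the plot above can be motivated with a small calculation. The sketch below assumes the discriminator loss is the average of the cross-entropy on real data and the cross-entropy on generated data; if your implementation sums the two terms instead, the corresponding reference value would be $2 \log 2$. The function names are hypothetical.

```python
import math


def bce_from_prob(p, label):
    """Binary cross-entropy of probability p against a 0/1 label."""
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def averaged_discriminator_loss(d_real, d_fake):
    """Average of the cross-entropy on real (label 1) and fake (label 0)."""
    real = sum(bce_from_prob(p, 1) for p in d_real) / len(d_real)
    fake = sum(bce_from_prob(p, 0) for p in d_fake) / len(d_fake)
    return 0.5 * (real + fake)


# A fooled discriminator assigns probability 0.5 to every input.
fooled = averaged_discriminator_loss([0.5] * 4, [0.5] * 4)
# A discriminator that separates the two batches well has a lower loss.
sharp = averaged_discriminator_loss([0.9] * 4, [0.1] * 4)
print('fooled: {:.4f}, sharp: {:.4f}'.format(fooled, sharp))
```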

Generate and plot samples and latent interpolations¶
In [0]:
real_data_vals, final_samples_vals = sess.run([real_data, samples])
In [0]:
fig, axes = plt.subplots(1, 2, figsize=(2*4,4))

show_digits(axes[0], real_data_vals, 'Data')
show_digits(axes[1], final_samples_vals, 'Samples')
In [0]:
show_latent_interpolations(generator, prior, sess)

Q2.2 GAN Questions (25 pts):¶
1. [3 pts] In defining the optimization procedure above (Section Create optimizers and training ops) we opted for two optimizers, one for the discriminator and one for the generator. Is this necessary? Why would this be a good/bad idea in general?

2. [3 pts] Discuss the hyperparameter sensitivity of GANs compared to that of VAEs. (What happens to the model if you use a higher learning rate for the discriminator or the generator?)

3. [2 pts] When would you want to use GANs and when would you want to use VAEs? Which of the following can be performed using VAEs, and which can be performed using GANs: density estimation, representation learning, data generation?

4. [2 pts] What do you observe about GAN samples compared to VAE samples?
5. [3 pts] What happens if you optimize the GAN discriminator 5 times per generator update? (This will become particularly relevant for next part)
6. [2 pts] What happens if you optimize the GAN generator 10 times per discriminator update?
[10 pts] Model Implementation and Results

===== END OF GRADED COURSEWORK ========¶

T2.2 [Optional] Wasserstein GAN¶
Since the proposal of the original GAN, multiple objectives have been proposed, inspired by different learning principles. In Wasserstein GAN, optimal transport is used to create the training criterion: \begin{equation} \sup_{\|f\|_{L} \leq 1} \mathbb{E}_{p^\star(x)} f(x) - \mathbb{E}_{p(z)} f(G(z)) \end{equation} where $\|f\|_{L} \leq 1$ denotes the family of 1-Lipschitz functions.
Due to the intractability of the supremum in the equation above, the WGAN value function is constructed from the optimal transport criterion using the Kantorovich-Rubinstein duality: \begin{equation} \min_{G} \max_{D} \; \mathbb{E}_{p^\star(x)} D(x) - \mathbb{E}_{p(z)} D(G(z)) \end{equation}
where $D$ is a 1-Lipschitz function. The Lipschitz constraint is imposed using gradient penalties on the discriminator.
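To make the gradient penalty concrete, here is a small pure-Python sketch of the penalty from https://arxiv.org/abs/1704.00028, which averages $(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2$ over random interpolations $\hat{x}$ between real and generated points. It is an illustration under stated assumptions: the critic is a hypothetical linear function, and the gradient is approximated with central finite differences, whereas in the TensorFlow graph it would come from automatic differentiation. All function names are made up for this example.

```python
import math
import random


def numerical_gradient(critic, x, h=1e-5):
    """Central finite-difference gradient of a scalar critic at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2.0 * h))
    return grad


def gradient_penalty(critic, real_batch, fake_batch, rng):
    """Mean of (||grad critic(x_hat)|| - 1)^2 over interpolated points."""
    penalties = []
    for real, fake in zip(real_batch, fake_batch):
        eps = rng.random()  # One interpolation coefficient per example.
        x_hat = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(gi * gi for gi in grad))
        penalties.append((norm - 1.0) ** 2)
    return sum(penalties) / len(penalties)


def linear_critic(weights):
    """Hypothetical critic f(x) = w . x, whose gradient is w everywhere."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


rng = random.Random(0)
real_batch = [[0.2, -0.4], [0.9, 0.1]]
fake_batch = [[-0.5, 0.3], [0.0, 0.7]]
unit_slope = gradient_penalty(linear_critic([0.6, 0.8]), real_batch, fake_batch, rng)
steep = gradient_penalty(linear_critic([3.0, 4.0]), real_batch, fake_batch, rng)
print('unit-norm critic: {:.6f}, steep critic: {:.6f}'.format(unit_slope, steep))
```

A critic whose gradient has unit norm incurs essentially no penalty, while a steeper critic is penalised by the squared excess of its gradient norm over 1.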
For extra reading, see:
• https://arxiv.org/abs/1701.07875
• https://arxiv.org/abs/1704.00028
Task: Implement and train this model to generate MNIST digits. Visualise the results and answer the questions at the end of the section.

Model Implementation¶
In [0]:
tf.reset_default_graph()

Define the gradient penalty¶
In [0]:
def batch_l2_norms(x, eps=1e-5):
    # Reduce over every axis except the batch axis.
    reduction_axis = list(range(1, x.get_shape().ndims))
    squares = tf.reduce_sum(tf.square(x), axis=reduction_axis)
    squares.get_shape().assert_is_compatible_with([None])
    # eps keeps the square root differentiable when the norm is zero.
    return tf.sqrt(eps + squares)

def wgan_gradient_penalty(discriminator, real_data, samples):
    """The gradient penalty loss on an interpolation of data and samples.

    Proposed by https://arxiv.org/pdf/1704.00028.pdf for Wasserstein GAN, but
    recently becoming more widely adopted, outside the Wasserstein setting.

    Args:
      discriminator: An instance of `AbstractDiscriminator`.
      real_data: A `tf.Tensor` (joint discriminator `tf.Tensor` sequences are not
        yet supported). The data treated as real by the GAN, usually from a
        dataset. Needs to be a valid input for `discriminator`.
      samples: A `tf.Tensor` or `tf.Tensor` sequence (for joint discriminators).
        Samples obtained from the model. Needs to be a valid input for
        `discriminator`.

    Returns:
      A `tf.Tensor` scalar, containing the loss.
    """

    ##################
    # YOUR CODE HERE #
    ##################

Get and rescale the data¶
Scale the data between -1 and 1. This helps training stability and improves GAN convergence.
In [0]:
real_data = make_tf_data_batch(mnist.train.images)
real_data = 2 * real_data - 1

Generate samples¶
In [0]:
##################
# YOUR CODE HERE #
##################
# samples =

Set up the adversarial game¶

Discriminator and generator loss¶
In [0]:
# The weight of the gradient penalty
GRADIENT_PENALTY_COEFF = 10
In [0]:
##################
# YOUR CODE HERE #
##################
# discriminator_loss =
# generator_loss =

Create optimizers and training ops¶
In [0]:
discriminator_optimizer = tf.train.AdamOptimizer(0.0001, beta1=0.5, beta2=0.9)
generator_optimizer = tf.train.AdamOptimizer(0.0001, beta1=0.5, beta2=0.9)

# Optimize the discriminator.
discriminator_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, scope=DISCRIMINATOR_VARIABLE_SCOPE)
discriminator_update_op = discriminator_optimizer.minimize(
    discriminator_loss, var_list=discriminator_vars)

# Optimize the generator.
generator_vars = tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES, scope=GENERATOR_VARIABLE_SCOPE)
generator_update_op = generator_optimizer.minimize(
generator_loss, var_list=generator_vars)

Training¶
We train the discriminator and generator by alternating gradient descent runs. We record the losses to plot them later.

Create the tensorflow session¶
In [0]:
sess = tf.Session()

# Initialize all variables
sess.run(tf.global_variables_initializer())
In [0]:
NUM_DISC_UPDATES_PER_GEN_UPDATE = 5
In [0]:
# %hide_pyerr # - uncomment to interrupt training without a stacktrace
disc_losses = []
gen_losses = []

for i in xrange(TRAINING_STEPS):
    # Do multiple discriminator updates per generator update.
    for _ in xrange(NUM_DISC_UPDATES_PER_GEN_UPDATE):
        sess.run(discriminator_update_op)
    sess.run(generator_update_op)

    if i % 100 == 0:
        disc_loss = sess.run(discriminator_loss)
        gen_loss = sess.run(generator_loss)

        print('Iteration: {}. Disc loss: {}. Generator loss {}'.format(
            i, disc_loss, gen_loss))
        disc_losses.append(disc_loss)
        gen_losses.append(gen_loss)

Results¶

Visualize the behaviour of the two losses during training¶
Note that, unlike the losses of classifiers or VAEs, GAN losses do not decrease steadily: they go up and down depending on the training dynamics between the two players.
In [0]:
figsize = (20, 6)
fig, axs = plt.subplots(1, 2, figsize=figsize)

# Plot the discriminator loss (left) and the generator loss (right).
axs[0].plot(disc_losses, '-')
axs[0].plot([0.] * len(disc_losses), 'r--', label='Discriminator is being fooled')
axs[0].legend()
axs[0].set_title('Discriminator loss', fontsize=20, y=-0.2)
axs[1].plot(gen_losses, '-')
axs[1].set_title('Generator loss', fontsize=20, y=-0.2)

Generate samples and latent interpolations¶
In [0]:
real_data_vals, final_samples_vals = sess.run([real_data, samples])
In [0]:
fig, axes = plt.subplots(1, 2, figsize=(2*4,4))

show_digits(axes[0], real_data_vals, 'Data')
show_digits(axes[1], final_samples_vals, 'Samples')
In [0]:
show_latent_interpolations(generator, prior, sess)

Questions about WGANs:¶
• What happens if you optimize the original GAN discriminator 5 times per generator update, as in Wasserstein GANs? What happens if you train the Wasserstein GAN with 1 discriminator update per generator update?
• Can you think of a general recipe to create a new type of GAN loss?