The University of Sydney Page 1
Deep Generative Models
Dr Chang Xu
School of Computer Science
The University of Sydney Page 2
Generative Modeling
– Density Estimation
– Sample Generation
Image credit to [Ian Goodfellow, NIPS 2016 Tutorial on Generative Adversarial Networks]
Training Sample Generated Sample
The University of Sydney Page 3
Why study generative models?
– Realistic generation tasks
– Simulate possible futures for planning (e.g. Stock Market
Prediction)
– Training generative models can also enable inference of latent
representations that can be useful as general features
[Screenshot of the first page of "Image-to-Image Translation with Conditional Adversarial Networks", Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, Berkeley AI Research (BAIR) Laboratory, University of California, Berkeley. Its Figure 1 shows input/output pairs for Labels to Facade, BW to Color, Aerial to Map, Labels to Street Scene, Edges to Photo, and Day to Night, with the caption: "Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data."]
Image credit to [Jun-Yan Zhu et al. 2017]
Image credit to [Phillip Isola et al. 2017]
The University of Sydney Page 4
Generative Adversarial Networks
The University of Sydney Page 5
Generative Adversarial Networks
The University of Sydney Page 6
Generative Adversarial Networks
– A counterfeiter-police game between two components: a
generator G and a discriminator D
– G: counterfeiter, trying to fool police with fake currency
– D: police, trying to detect the counterfeit currency
– Competition drives both to improve, until counterfeits are
indistinguishable from genuine currency
The University of Sydney Page 7
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
– G: try to fool the discriminator by generating real-looking
images
– D: try to distinguish between real and fake images
The University of Sydney Page 8
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images
The discriminator outputs a probability in (0, 1) that its input is a real image.
The University of Sydney Page 9
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images
D’s goal: maximize objective such that D(x) is close to 1 (real) and D(G(z))
is close to 0 (fake)
G’s goal: minimize objective such that D(G(z)) is close to 1 (discriminator
is fooled into thinking generated G(z) is real)
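For reference, the objective both goals refer to is the minimax game from the original GAN paper:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]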
The University of Sydney Page 10
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images
Q: What can we use to represent D and G?
A: Neural networks!
The University of Sydney Page 11
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
(Diagram: z ~ p(z) is fed to the generator to give G(z); both real samples x ~ p_data(x) and generated samples G(z) are fed to the discriminator, producing D(x) and D(G(z)).)
The University of Sydney Page 12
Generative Adversarial Networks
– A two-player game between two components: a generator G
and a discriminator D
– Alternate between:
1. Gradient ascent on D
2. Gradient descent on G
The University of Sydney Page 13
Algorithm
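The GAN training algorithm (Goodfellow et al., 2014) alternates discriminator updates with generator updates. A minimal PyTorch-style sketch of one such alternating update is below; the networks G and D, the optimizers, and z_dim are placeholders, and D is assumed to end in a sigmoid so that its output is a probability.

import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_G, opt_D, z_dim):
    # One discriminator update followed by one generator update.
    b = x_real.size(0)
    real_labels = torch.ones(b, 1, device=x_real.device)
    fake_labels = torch.zeros(b, 1, device=x_real.device)

    # Discriminator step: gradient ascent on E[log D(x)] + E[log(1 - D(G(z)))],
    # implemented as minimizing the equivalent binary cross-entropy.
    z = torch.randn(b, z_dim, device=x_real.device)
    x_fake = G(z).detach()  # do not backpropagate into G on this step
    loss_D = F.binary_cross_entropy(D(x_real), real_labels) + \
             F.binary_cross_entropy(D(x_fake), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: the commonly used non-saturating variant maximizes log D(G(z))
    # instead of directly minimizing log(1 - D(G(z))).
    z = torch.randn(b, z_dim, device=x_real.device)
    loss_G = F.binary_cross_entropy(D(G(z)), real_labels)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()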
The University of Sydney Page 14
A simple example
Generative adversarial nets are trained by simultaneously updating the discriminative distribution (blue dashed line) so that it discriminates samples from the data-generating distribution p_x (black dotted line) from those of the generative distribution p_g (green solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution p_g on the transformed samples. G contracts in regions of high density and expands in regions of low density of p_g.
Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]
The University of Sydney Page 15
A simple example
After several steps of training, if G and D have enough capacity, they will reach a point at which neither can improve because p_g = p_data. The discriminator is then unable to differentiate between the two distributions, i.e. D(x) = 1/2.
Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]
The University of Sydney Page 16
A simple example
When the data distribution is a 1-D Gaussian distribution
http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
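The linked post trains such a toy GAN in TensorFlow. A comparable PyTorch-style setup is sketched below; the network sizes and the target distribution N(4, 0.5²) are illustrative choices, not taken from the post.

import torch
import torch.nn as nn

def sample_real(batch_size, mu=4.0, sigma=0.5):
    # Samples from the 1-D Gaussian data distribution we want G to imitate.
    return mu + sigma * torch.randn(batch_size, 1)

def sample_noise(batch_size):
    return torch.rand(batch_size, 1)  # z ~ U[0, 1]

# Tiny MLPs are sufficient in one dimension.
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

Trained with the alternating updates sketched earlier, the histogram of G(sample_noise(n)) should approach that of sample_real(n).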
The University of Sydney Page 17
Generative Adversarial Networks
– Generated samples
Nearest neighbor from training set
Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]
The University of Sydney Page 18
Generative Adversarial Networks
– Generated samples (CIFAR-10)
Nearest neighbor from training set
Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]
The University of Sydney Page 19
Generative Adversarial Networks
– It is only required that G is differentiable
– So, given training data from the real data distribution p_r, what we want is a generative model that can draw samples from the generator's distribution p_g, with p_r ≈ p_g
– We don't need to write down a formula for p_g; we just learn to draw samples directly
The University of Sydney Page 20
Deep Convolutional GANs (DCGANs)
– Replace any pooling layers with strided convolutions (discriminator)
and fractional-strided convolutions (generator).
– Use batchnorm in both the generator and the discriminator.
– Remove fully connected hidden layers for deeper architectures.
– Use ReLU activation in generator for all layers except for the output,
which uses Tanh.
– Use LeakyReLU activation in the discriminator for all layers.
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
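As a concrete illustration of these guidelines, a rough PyTorch-style DCGAN generator for 64×64 RGB images might look as follows; the latent size and channel counts are common choices, not necessarily the paper's exact configuration.

import torch.nn as nn

latent_dim = 100  # size of z; the input is expected as shape (batch, latent_dim, 1, 1)

generator = nn.Sequential(
    # Fractional-strided (transposed) convolutions upsample z; no fully connected layers.
    nn.ConvTranspose2d(latent_dim, 512, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512), nn.ReLU(True),   # 4x4
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
    nn.BatchNorm2d(256), nn.ReLU(True),   # 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(True),   # 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(True),    # 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),
    nn.Tanh(),                            # 64x64 RGB output in [-1, 1]
)

The discriminator mirrors this with strided convolutions, LeakyReLU activations, and a sigmoid output.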
The University of Sydney Page 21
Deep Convolutional GANs (DCGANs)
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
Samples from the model look much better!
The University of Sydney Page 22
Wasserstein GAN
The University of Sydney Page 23
Difficulty of Generative Adversarial Networks
– Vanishing gradients
– The full objective:

L^{(D)} = -\mathbb{E}_{x \sim p_r}[\log D(x)] - \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]

L^{(G)} = \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]

– At the very early phase of training it is easy for D to confidently reject G's samples, so D(x) is almost always close to 0 for generated x; the term log(1 - D(x)) is then nearly flat and provides G almost no gradient.
In a GAN, a better discriminator leads to worse vanishing gradients for its generator!
The University of Sydney Page 24
Proof Sketch:
– Minimizing the generator loss amounts to minimizing the JS divergence when the discriminator is optimal. The discriminator loss is

L^{(D)} = -\mathbb{E}_{x \sim p_r}[\log D(x)] - \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]

– The optimal discriminator for any p_r and p_g is always

D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}

– The generator loss (after adding a term that is independent of p_g) is

\tilde{L}^{(G)} = \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]

Plugging D^* into \tilde{L}^{(G)} gives

\tilde{L}^{(G)}\big|_{D = D^*} = 2\,\mathrm{JSD}(p_r \,\|\, p_g) - 2\log 2

So, when D is optimal, minimizing the generator loss is equivalent to minimizing the JS divergence.
[Martin Arjovsky et al., Wasserstein GAN, 2017]
Jensen–Shannon divergence
https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
The University of Sydney Page 25
JS divergence:
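The definition being referred to, with the mixture M = (P + Q)/2, is:

\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)

It is symmetric and bounded by log 2, which is why non-overlapping supports give exactly log 2.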
The University of Sydney Page 26
Proof Sketch:
\tilde{L}^{(G)}\big|_{D = D^*} = 2\,\mathrm{JSD}(p_r \,\|\, p_g) - 2\log 2

– If the supports (underlying low-dimensional manifolds) of p_r and p_g have (almost) no overlap, then JSD(p_r || p_g) = log 2, and the gradient passed back to the generator vanishes.
– When p_r and p_g lie on low-dimensional manifolds, the probability that their supports have (almost) zero overlap is 1.
[Martin Arjovsky et al., Wasserstein GAN, 2017]
The University of Sydney Page 27
Preliminaries: distance measures for distributions
– JS
– Wasserstein
Π(P, Q) denotes the set of all joint distributions γ(x, y) whose marginals are P and Q, respectively.
γ(x, y) indicates a plan for how much "mass" must be transported from x to y in order to transform P into Q.
The Wasserstein (or Earth-Mover) distance is then the "cost" of the optimal transport plan.
Image credit to: sbl.inria.fr
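For reference, the Earth-Mover / Wasserstein-1 distance described above is:

W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]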
The University of Sydney Page 28
Examples
Let Z ~ U[0, 1], let P_0 be the distribution of (0, Z) and P_θ the distribution of (θ, Z) (two distributions supported on parallel vertical lines, the standard example from the WGAN paper). Then:

\mathrm{KL}(P_0 \,\|\, P_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}

\mathrm{JS}(P_0 \,\|\, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}

W(P_0, P_\theta) = |\theta| \quad \text{(smooth)}
The University of Sydney Page 29
Wasserstein GANs
– The Earth-Mover (EM) distance, or Wasserstein-1, has nicer properties when optimized than the JS divergence.
– However, the infimum over transport plans is highly intractable.
– The Wasserstein distance has a dual form (given below), in which the supremum is over all K-Lipschitz functions.
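The dual (Kantorovich–Rubinstein) form in question is:

K \cdot W(p_r, p_g) = \sup_{\lVert f \rVert_L \le K} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]

so the distance can be estimated by maximizing the right-hand side over a family of K-Lipschitz functions.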
The University of Sydney Page 30
Wasserstein GANs
– The Wasserstein distance has a dual form in which the supremum is over all K-Lipschitz functions.
– Consider a w-parameterized family of functions {f_w}_{w ∈ W} that are all K-Lipschitz, for example W = [−c, c]^l.
– To satisfy this requirement, WGAN enforces that the weights of the critic lie within a compact space [−c, c] by applying weight clipping, as in the snippet below.
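A minimal PyTorch-style sketch of that clipping step; the critic module here is a placeholder, and c = 0.01 is the default used in the WGAN paper.

import torch
import torch.nn as nn

critic = nn.Linear(2, 1)  # placeholder critic f_w; any nn.Module works
c = 0.01                  # clipping threshold

with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)   # keep every weight inside the compact set [-c, c]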
The University of Sydney Page 31
Wasserstein GANs
– The loss for the discriminator/critic f_w (written out below)
– f_w's goal: maximize the dual objective, which estimates the Wasserstein distance between the real data distribution and the generative distribution
– The loss for the generator g_θ
– g_θ's goal: minimize the (estimated) Wasserstein distance between the real data distribution and the generative distribution
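Written out, the two objectives from the WGAN paper are:

critic: \quad \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]

generator: \quad \min_{\theta} \; -\mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]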
The University of Sydney Page 32
Algorithm
compute the gradient of f_w
weight clipping, so that the functions {f_w}_{w ∈ W} are all K-Lipschitz
gradient ascent on f_w
compute the gradient of g_θ
gradient descent on g_θ
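A condensed PyTorch-style sketch of this algorithm; critic, G, the RMSProp optimizers opt_c and opt_g, and z_dim are placeholders, and for brevity the same real minibatch is reused for all critic iterations (the paper samples a fresh one each time).

import torch

def wgan_step(G, critic, x_real, opt_c, opt_g, z_dim, c=0.01, n_critic=5):
    b = x_real.size(0)

    # n_critic critic updates per generator update.
    for _ in range(n_critic):
        z = torch.randn(b, z_dim, device=x_real.device)
        # Gradient ascent on the dual objective = descent on its negation.
        loss_c = -(critic(x_real).mean() - critic(G(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # Weight clipping keeps {f_w} K-Lipschitz.
        with torch.no_grad():
            for p in critic.parameters():
                p.clamp_(-c, c)

    # Generator update: gradient descent on -E[f_w(g_theta(z))].
    z = torch.randn(b, z_dim, device=x_real.device)
    loss_g = -critic(G(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_c.item(), loss_g.item()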
The University of Sydney Page 33
Algorithm
Optimal discriminator and critic when learning to differentiate two Gaussians. As we can see, the discriminator of a minimax GAN saturates and results in vanishing gradients, whereas the WGAN critic provides very clean gradients on all parts of the space.
Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]
The University of Sydney Page 34
Meaningful loss metric
Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]
The University of Sydney Page 35
Meaningful loss metric
– Vanilla GANs
JS estimates for an MLP generator (upper left) and a DCGAN generator (upper right) trained with the
standard GAN procedure. Both had a DCGAN discriminator. Both curves have increasing error. Samples
get better for the DCGAN but the JS estimate increases or stays constant, pointing towards no significant
correlation between sample quality and loss.
Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]
The University of Sydney Page 36
Variational Auto-Encoder
The University of Sydney Page 37
Recap: Autoencoder
– Unsupervised approach for learning a lower-dimensional
feature representation from unlabeled training data
Encoder: a deep network; linear + nonlinear (sigmoid, ReLU); fully-connected or CNN.
z is usually smaller than x (dimensionality reduction).
The University of Sydney Page 38
Recap: Autoencoder
– Train such that the features can be used to reconstruct the original data
"Autoencoding": encoding the input and reconstructing it from the code.
Decoder: a deep network; linear + nonlinear (sigmoid, ReLU); fully-connected or CNN (upconv).
L2 loss function: \| x - \hat{x} \|^2
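A minimal sketch of such an autoencoder on flattened inputs; the layer sizes here are illustrative.

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # z is much smaller than x: the bottleneck forces a compressed representation.
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim))

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

Training minimizes the L2 reconstruction loss, e.g. loss = ((x - model(x)) ** 2).sum(dim=1).mean().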
The University of Sydney Page 39
Recap: Autoencoder
The encoder can be used to initialize a supervised model: throw away the decoder, attach a classifier on top of the learned features, and fine-tune the encoder jointly with the classifier.
The University of Sydney Page 40
Recap: Autoencoder
Autoencoders can reconstruct data, and can learn features to
initialize a supervised model. Features capture factors of variation in
training data.
Can we generate new images from an autoencoder?
The University of Sydney Page 41
Variational Autoencoders
– Probabilistic spin on autoencoders – will let us sample from the
model to generate data!
– Assume training data \{x^{(i)}\}_{i=1}^{N} is generated from an underlying unobserved (latent) representation z.
Sample z from the true prior p_{\theta^*}(z), then sample x from the true conditional p_{\theta^*}(x \mid z^{(i)}).
Choose the prior p(z) to be simple, e.g. Gaussian. This is reasonable for latent attributes, e.g. pose.
The University of Sydney Page 42
Variational Autoencoders
ε ~ N(0, 1)
z = σ · ε + μ
Sampling via the reparameterization trick
Encoder
Decoder
The University of Sydney Page 43
Variational Autoencoders
ε ~ N(0, 1)

Reconstruction loss, as cross entropy:

\sum_i \big[ -x_i \log \hat{x}_i - (1 - x_i) \log(1 - \hat{x}_i) \big]

or as MSE:

\sum_i (x_i - \hat{x}_i)^2

KL divergence between the encoder's N(μ, σ²) and the prior N(0, 1):

-\tfrac{1}{2} \sum_j \big( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \big)

Objective: minimize CrossEntropy + KLDivergence
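A compact sketch of these pieces; the layer sizes are illustrative, and the loss follows the cross-entropy + KL form above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # encoder outputs mu and log(sigma^2)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                 # eps ~ N(0, 1)
        z = mu + torch.exp(0.5 * logvar) * eps     # reparameterization: z = mu + sigma * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')        # cross-entropy term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL to N(0, 1)
    return recon + kld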
The University of Sydney Page 44
Variational Autoencoders
Generating Data:
Use decoder network.
Sample z from prior.
z ~ N(0, 1)
The University of Sydney Page 45
Variational Autoencoders
Generating Data:
32×32 CIFAR-10 Labeled Faces in the Wild
Image credit to [(L) Diederik Kingma et al. 2016; (R) Anders Larsen et al. 2017].
The University of Sydney Page 46
VAE + GAN