
The University of Sydney Page 1

Deep Generative Models

Dr Chang Xu

School of Computer Science

The University of Sydney Page 2

Generative Modeling
– Density Estimation

– Sample Generation

Image credit to [Ian Goodfellow, NIPS Tutorial on Generative Model 2016]

Figure: training samples vs. generated samples

The University of Sydney Page 3

Why study generative models?

– Realistic generation tasks

– Simulate possible futures for planning (e.g. Stock Market
Prediction)

– Training generative models can also enable inference of latent
representations that can be useful as general features

Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros

Berkeley AI Research (BAIR) Laboratory
University of California, Berkeley

{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu

Figure 1 panels (each an input/output pair): Labels to Facade, BW to Color, Aerial to Map, Labels to Street Scene, Edges to Photo, Day to Night

Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Abstract

We investigate conditional adversarial networks as a
general-purpose solution to image-to-image translation
problems. These networks not only learn the mapping from
input image to output image, but also learn a loss func-
tion to train this mapping. This makes it possible to apply
the same generic approach to problems that traditionally
would require very different loss formulations. We demon-
strate that this approach is effective at synthesizing photos
from label maps, reconstructing objects from edge maps,
and colorizing images, among other tasks. As a commu-
nity, we no longer hand-engineer our mapping functions,
and this work suggests we can achieve reasonable results
without hand-engineering our loss functions either.

Many problems in image processing, computer graphics,
and computer vision can be posed as “translating” an input
image into a corresponding output image. Just as a concept

may be expressed in either English or French, a scene may
be rendered as an RGB image, a gradient field, an edge map,
a semantic label map, etc. In analogy to automatic language
translation, we define automatic image-to-image translation
as the problem of translating one possible representation of
a scene into another, given sufficient training data (see Fig-
ure 1). One reason language translation is difficult is be-
cause the mapping between languages is rarely one-to-one
– any given concept is easier to express in one language
than another. Similarly, most image-to-image translation
problems are either many-to-one (computer vision) – map-
ping photographs to edges, segments, or semantic labels,
or one-to-many (computer graphics) – mapping labels or
sparse user inputs to realistic images. Traditionally, each of
these tasks has been tackled with separate, special-purpose
machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]),
despite the fact that the setting is always the same: predict
pixels from pixels. Our goal in this paper is to develop a
common framework for all these problems.

arXiv:1611.07004v1 [cs.CV] 21 Nov 2016

Image credit to [Jun-Yan Zhu et al. 2017]
Image credit to [Phillip Isola et al. 2017]

The University of Sydney Page 4

Generative Adversarial Networks

The University of Sydney Page 5

Generative Adversarial Networks

The University of Sydney Page 6

Generative Adversarial Networks
– A counterfeiter-police game between two components: a

generator G and a discriminator D

– G: counterfeiter, trying to fool police with fake currency

– D: police, trying to detect the counterfeit currency

– Competition drives both to improve, until counterfeits are
indistinguishable from genuine currency


The University of Sydney Page 7

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

– G: try to fool the discriminator by generating real-looking
images

– D: try to distinguish between real and fake images


The University of Sydney Page 8

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images

The discriminator outputs a probability in (0, 1) that the input is a real image.

The University of Sydney Page 9

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images

D’s goal: maximize objective such that D(x) is close to 1 (real) and D(G(z))
is close to 0 (fake)

G’s goal: minimize objective such that D(G(z)) is close to 1 (discriminator
is fooled into thinking generated G(z) is real)
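These two goals combine into the minimax objective from Goodfellow et al. (2014):

min_G max_D V(D, G) = E_{x~pdata(x)}[log D(x)] + E_{z~p(z)}[log(1 − D(G(z)))]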

The University of Sydney Page 10

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

– G: try to fool the discriminator by generating real-looking images
– D: try to distinguish between real and fake images

Q: What can we use to represent D and G?

A: Neural networks!

The University of Sydney Page 11

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

Diagram: z ~ p(z) → generator → G(z); x ~ pdata(x); the discriminator receives both and outputs D(x) for real samples and D(G(z)) for generated samples.

The University of Sydney Page 12

Generative Adversarial Networks

– A two-player game between two components: a generator G
and a discriminator D

– Alternate between:
1. Gradient ascent on D

2. Gradient descent on G
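Written out (following the original GAN paper), one iteration performs gradient ascent on D, maximizing

E_{x~pdata}[log D(x)] + E_{z~p(z)}[log(1 − D(G(z)))]

then gradient descent on G, minimizing

E_{z~p(z)}[log(1 − D(G(z)))]

In practice, G is often trained to maximize log D(G(z)) instead, which gives stronger gradients early in training.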

The University of Sydney Page 13

Algorithm

The University of Sydney Page 14

A simple example

Generative adversarial nets are trained by simultaneously updating the discriminative distribution (blue dashed line) so that it discriminates samples from the data-generating distribution px (black dotted line) from those of the generative distribution pg (green solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on the transformed samples. G contracts in regions of high density and expands in regions of low density of pg.

Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]

The University of Sydney Page 15

A simple example

After several steps of training, if G and D have enough capacity, they reach a point at which neither can improve, because pg = pdata. The discriminator is then unable to differentiate between the two distributions, i.e. D(x) = 1/2.

Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]

The University of Sydney Page 16

A simple example

When the data distribution is a 1-D Gaussian distribution

http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
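A minimal sketch of this 1-D demo, assuming PyTorch; the network sizes, learning rates, and the target mean/std below are illustrative choices, not taken from the linked post.

import torch
import torch.nn as nn

# Real data: a 1-D Gaussian the generator should imitate (illustrative mean/std).
def sample_real(batch_size, mean=4.0, std=1.25):
    return mean + std * torch.randn(batch_size, 1)

# Generator: maps 1-D uniform noise z to a 1-D sample G(z).
G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
# Discriminator: maps a 1-D sample to a probability of being real.
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()
batch = 128

for step in range(5000):
    # 1) Gradient ascent on D: push D(x) toward 1 and D(G(z)) toward 0.
    x = sample_real(batch)
    z = torch.rand(batch, 1)             # z sampled uniformly, as in the figure
    fake = G(z).detach()                 # do not backprop into G during the D step
    loss_D = bce(D(x), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Gradient descent on G (non-saturating form): push D(G(z)) toward 1.
    z = torch.rand(batch, 1)
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

print(G(torch.rand(1000, 1)).mean().item())  # inspect the mean of generated samples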

The University of Sydney Page 17

Generative Adversarial Networks

– Generated samples

Nearest neighbor from training set

Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]

The University of Sydney Page 18

Generative Adversarial Networks

– Generated samples (CIFAR-10)

Nearest neighbor from training set

Image credit to [Ian Goodfellow et al., Generative Adversarial Nets, NIPS 2014]

The University of Sydney Page 19

Generative Adversarial Networks

– The only requirement is that G is differentiable.

– Given training data from the real data distribution pr, we want a generative model that can draw samples from the generator's distribution pg, with pg ≈ pr.

– We don't need to write down a formula for pg; we just learn to draw samples directly.


The University of Sydney Page 20

Deep Convolutional GANs (DCGANs)

– Replace any pooling layers with strided convolutions (discriminator)
and fractional-strided convolutions (generator).
– Use batchnorm in both the generator and the discriminator.
– Remove fully connected hidden layers for deeper architectures.
– Use ReLU activation in generator for all layers except for the output,
which uses Tanh.
– Use LeakyReLU activation in the discriminator for all layers.

Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
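A minimal sketch of a generator that follows these guidelines, assuming PyTorch; the layer widths (100-d noise up to 64×64 RGB images) are illustrative, not the paper's exact configuration. The discriminator would mirror it with strided Conv2d layers and LeakyReLU, as listed above.

import torch.nn as nn

# DCGAN-style generator: fractional-strided (transposed) convolutions,
# batchnorm + ReLU in every layer except the output, which uses Tanh.
generator = nn.Sequential(
    # z: (batch, 100, 1, 1) -> (batch, 512, 4, 4)
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512), nn.ReLU(True),
    # -> (batch, 256, 8, 8)
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
    nn.BatchNorm2d(256), nn.ReLU(True),
    # -> (batch, 128, 16, 16)
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(True),
    # -> (batch, 64, 32, 32)
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(True),
    # -> (batch, 3, 64, 64), pixel values in [-1, 1] via Tanh
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),
    nn.Tanh(),
)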

The University of Sydney Page 21

Deep Convolutional GANs (DCGANs)

Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016

Samples
from the
model look
much
better!

The University of Sydney Page 22

Wasserstein GAN

The University of Sydney Page 23

Difficulty of Generative Adversarial Networks

– Vanishing gradients

– The full objective:

L(D) = −E_{x~pr}[log D(x)] − E_{x~pg}[log(1 − D(x))]

L(G) = E_{x~pg}[log(1 − D(x))]

– Very early in training, G's samples are poor, so D easily becomes confident in detecting them: D(G(z)) is almost always close to 0, and log(1 − D(G(z))) saturates.

In a GAN, a better discriminator leads to worse vanishing gradients for the generator!

The University of Sydney Page 24

Proof sketch:
– When the discriminator is optimal, minimizing the generator loss amounts to minimizing the JS divergence:

L(D) = −E_{x~pr}[log D(x)] − E_{x~pg}[log(1 − D(x))]

– The optimal D for any pr and pg is always:

D*(x) = pr(x) / (pr(x) + pg(x))

– The generator loss, after adding a term independent of pg (namely E_{x~pr}[log D(x)]), becomes:

L(G) = E_{x~pr}[log D(x)] + E_{x~pg}[log(1 − D(x))]

Plugging D* into L(G):

L(G | D*) = 2 · JS(pr || pg) − 2 log 2

So, when D is optimal, minimizing the generator loss is equivalent to minimizing the JS divergence.

[Martin Arjovsky et al., Wasserstein GAN, 2017]

Jensen–Shannon divergence

https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence

The University of Sydney Page 25

JS divergence: JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), where M = (1/2)(P + Q)

The University of Sydney Page 26

Proof sketch (continued):

L(G | D*) = 2 · JS(pr || pg) − 2 log 2

– If the supports (underlying low-dimensional manifolds) of pr and pg have (almost) no overlap, then JS(pr || pg) = log 2, a constant, and thus the gradient of L(G) with respect to the generator vanishes.

– When pr and pg are concentrated on low-dimensional manifolds, the probability that their supports have (almost) zero overlap is 1.

[Martin Arjovsky et al., Wasserstein GAN, 2017]

The University of Sydney Page 27

Preliminaries: distance measures for distributions

– JS divergence (defined above)

– Wasserstein distance:

W(P, Q) = inf_{γ ∈ Π(P, Q)} E_{(x, y)~γ}[ ||x − y|| ]

where Π(P, Q) is the set of all joint distributions γ whose marginals are P and Q, respectively.

Each γ indicates a plan for transporting "mass" from x to y in order to transform P into Q.

The Wasserstein (or Earth-Mover) distance is then the "cost" of the optimal transport plan.

Image credit to: sbl.inria.fr

The University of Sydney Page 28

Examples

Let Z ~ U[0, 1], let P0 be the distribution of (0, Z) in R², and let Pθ be the distribution of (θ, Z), i.e. two parallel line segments:

KL(P0 || Pθ) = +∞ if θ ≠ 0, and 0 if θ = 0

JS(P0 || Pθ) = log 2 if θ ≠ 0, and 0 if θ = 0

W(P0, Pθ) = |θ|   (smooth in θ)

The University of Sydney Page 29

Wasserstein GANs

– The Earth-Mover (EM) distance, or Wasserstein-1,

W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x, y)~γ}[ ||x − y|| ]

has nicer properties when optimized than JS.

– However, the infimum is highly intractable.

– The Wasserstein distance has a dual form (Kantorovich–Rubinstein duality):

W(Pr, Pg) = (1/K) · sup_{||f||_L ≤ K} ( E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)] )

where the supremum is over all K-Lipschitz functions f.

The University of Sydney Page 30

Wasserstein GANs

– The Wasserstein distance has a dual form:

W(Pr, Pg) = (1/K) · sup_{||f||_L ≤ K} ( E_{x~Pr}[f(x)] − E_{x~Pg}[f(x)] )

where the supremum is over all K-Lipschitz functions f.

– Consider a w-parameterized family of functions {fw}w∈W that are all K-Lipschitz.

For example, W = [−c, c]^l. To satisfy this requirement, WGAN forces the weights of the critic to lie within the compact space [−c, c] by applying weight clipping, as sketched below.
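A minimal sketch of the clipping step, assuming PyTorch; the critic architecture is a placeholder, and c = 0.01 is the default clipping constant reported in the WGAN paper.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # placeholder f_w
c = 0.01  # clipping constant

for p in critic.parameters():
    p.data.clamp_(-c, c)  # keep every weight inside the compact set [-c, c]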

The University of Sydney Page 31

Wasserstein GANs

– The loss for the discriminator/critic fw:

max_w  E_{x~Pr}[fw(x)] − E_{z~p(z)}[fw(gθ(z))]

– fw's goal: maximize the estimated Wasserstein distance between the real data distribution and the generative distribution.

– The loss for the generator gθ:

min_θ  −E_{z~p(z)}[fw(gθ(z))]

– gθ's goal: minimize the Wasserstein distance between the real data distribution and the generative distribution.

The University of Sydney Page 32

Algorithm

Repeat until convergence:

1. Compute the gradient of fw and take a gradient ascent step on fw (the critic).

2. Apply weight clipping so that the functions {fw}w∈W remain K-Lipschitz.

3. Compute the gradient of gθ and take a gradient descent step on gθ (the generator).

A sketch of one such iteration follows.
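A minimal sketch of this loop, assuming PyTorch; the 1-D networks and data are placeholders, while n_critic = 5, c = 0.01, and RMSProp with lr = 5e-5 follow the defaults reported in the WGAN paper.

import torch
import torch.nn as nn

# Placeholder networks for 1-D data; real uses would be conv nets.
critic = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))      # f_w (no sigmoid)
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))   # g_theta

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
n_critic, c, batch = 5, 0.01, 64

def sample_real(n):                      # stand-in for the real data distribution Pr
    return 4.0 + 1.25 * torch.randn(n, 1)

for step in range(1000):
    # Critic: gradient ascent on E[f_w(x)] - E[f_w(g_theta(z))]
    for _ in range(n_critic):
        x = sample_real(batch)
        z = torch.randn(batch, 8)
        loss_c = -(critic(x).mean() - critic(generator(z).detach()).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():    # weight clipping keeps f_w K-Lipschitz
            p.data.clamp_(-c, c)
    # Generator: gradient descent on -E[f_w(g_theta(z))]
    z = torch.randn(batch, 8)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()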

The University of Sydney Page 33

Algorithm

Optimal discriminator and critic when learning to differentiate two Gaussians. The discriminator of a minimax GAN saturates and results in vanishing gradients, whereas the WGAN critic provides very clean gradients on all parts of the space.

Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]

The University of Sydney Page 34

Meaningful loss metric

Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]

The University of Sydney Page 35

– Vanilla GANs

Meaningful loss metric

JS estimates for an MLP generator (upper left) and a DCGAN generator (upper right) trained with the standard GAN procedure; both had a DCGAN discriminator. Both curves have increasing error. Samples get better for the DCGAN, but the JS estimate increases or stays constant, pointing towards no significant correlation between sample quality and loss.

Image credit to [Martin Arjovsky et al., Wasserstein GAN, 2017]

The University of Sydney Page 36

Variational Auto-Encoder

The University of Sydney Page 37

Recap: Autoencoder

– Unsupervised approach for learning a lower-dimensional
feature representation from unlabeled training data

Encoder: a deep network (linear + nonlinear layers, e.g. sigmoid or ReLU; fully-connected or CNN)

z is usually smaller than x (dimensionality reduction)

The University of Sydney Page 38

Recap: Autoencoder
– Train such that the features can be used to reconstruct the original data

"Autoencoding" = encoding the input so that it can reproduce itself

Decoder: a deep network (linear + nonlinear layers, e.g. sigmoid or ReLU; fully-connected or CNN with up-convolutions)

L2 loss function: ||x − x̂||²
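A minimal sketch of such an autoencoder, assuming PyTorch, flattened 28×28 inputs in [0, 1], and a 32-d code; the sizes are illustrative.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))   # x -> z
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))   # z -> x_hat

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(128, 784)               # stand-in batch of flattened images in [0, 1]
for _ in range(100):
    x_hat = decoder(encoder(x))
    loss = ((x - x_hat) ** 2).mean()   # L2 reconstruction loss ||x - x_hat||^2
    opt.zero_grad(); loss.backward(); opt.step()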

The University of Sydney Page 39

Recap: Autoencoder

After training, throw away the decoder; the encoder can be used to initialize a supervised model and fine-tuned jointly with the classifier.

The University of Sydney Page 40

Recap: Autoencoder

Autoencoders can reconstruct data, and can learn features to
initialize a supervised model. Features capture factors of variation in
training data.

Can we generate new images from an autoencoder?

The University of Sydney Page 41

Variational Autoencoders

– A probabilistic spin on autoencoders that lets us sample from the model to generate data!

– Assume the training data {x^(i)}, i = 1, …, N, are generated from an underlying unobserved (latent) representation z

Sample z^(i) from the true prior pθ*(z), then sample x^(i) from the true conditional pθ*(x | z^(i)).

Choose the prior p(z) to be simple, e.g. Gaussian. This is reasonable for latent attributes, e.g. pose.

The University of Sydney Page 42

Variational Autoencoders

ε ~ N(0, 1)

z = σ · ε + μ

Sampling via reparameterization: drawing ε outside the network lets gradients flow through μ and σ.

Diagram: Encoder → (μ, σ) → reparameterized sampling → z → Decoder

The University of Sydney Page 43

Variational Autoencoders

ε ~ N(0, 1)

Reconstruction loss, either:

Cross-entropy: −Σ_i [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ]

MSE: Σ_i ( x_i − x̂_i )²

KL divergence between the approximate posterior N(μ, σ²) and the prior N(0, 1):

−0.5 · Σ ( 1 + log σ² − μ² − σ² )

Objective: minimize (cross-entropy or MSE reconstruction loss) + KL divergence
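A minimal sketch of the encoder, reparameterization, and loss above, assuming PyTorch, flattened 28×28 inputs in [0, 1], and a 20-d latent z; the architecture is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(784, 400), nn.ReLU())
fc_mu, fc_logvar = nn.Linear(400, 20), nn.Linear(400, 20)   # encoder outputs mu and log sigma^2
dec = nn.Sequential(nn.Linear(20, 400), nn.ReLU(), nn.Linear(400, 784), nn.Sigmoid())

params = list(enc.parameters()) + list(fc_mu.parameters()) + \
         list(fc_logvar.parameters()) + list(dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(128, 784)                      # stand-in batch of images in [0, 1]
for _ in range(100):
    h = enc(x)
    mu, logvar = fc_mu(h), fc_logvar(h)
    eps = torch.randn_like(mu)                # eps ~ N(0, 1)
    z = mu + torch.exp(0.5 * logvar) * eps    # reparameterization: z = mu + sigma * eps
    x_hat = dec(z)
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')        # cross-entropy term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL divergence term
    loss = recon + kld
    opt.zero_grad(); loss.backward(); opt.step()

# Generating data: sample z from the prior N(0, 1) and run only the decoder.
samples = dec(torch.randn(16, 20))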

The University of Sydney Page 44

Variational Autoencoders

Generating Data:
Use decoder network.
Sample z from prior.

z ~ N(0, 1)

The University of Sydney Page 45

Variational Autoencoders

Generating Data:

32×32 CIFAR-10 Labeled Faces in the Wild

Image credit to [(L) Durk Kingma et al. 2016; (R) Anders Larsen et al. 2017].

The University of Sydney Page 46

VAE + GAN