9b: Autoencoders and Adversarial Training
Autoencoders
The encoder networks we met in Week 2 can be seen as a simple example of a much wider class of Autoencoder Networks, consisting of an Encoder which converts each input x to a vector of latent variables z, and a Decoder which converts the latent variables z to an output x′ (an N−K−N architecture, with N inputs, K latent variables and N outputs).
If the encoder computes z = f(x) and the decoder computes x′ = g(z) = g(f(x)), then we aim to minimise some distance function between x and g(f(x)):

E = L(x, g(f(x)))
The activations normally pass through a bottleneck, so the network is forced to compress the data in
some way.
If we are working with images, the Encoder could be in the form of a Convolutional Neural Network
with the final classification layer removed so that it maps an image to a vector of latent variables
determined by the activations in the last fully connected layer.
The Decoder could also be a CNN but in reverse (sometimes called a Deconvolutional Network) such
as this architecture from (Radford, 2015).
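As a concrete illustration, here is a minimal fully connected autoencoder sketch in PyTorch. The layer sizes, the MNIST-sized input of 784 pixels, the MSE reconstruction loss and the Adam optimiser are illustrative assumptions rather than part of the lecture material.

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, n_input=784, n_latent=32):
            super().__init__()
            # Encoder f: input -> latent vector (the bottleneck)
            self.encoder = nn.Sequential(
                nn.Linear(n_input, 256), nn.ReLU(),
                nn.Linear(256, n_latent),
            )
            # Decoder g: latent vector -> reconstructed input
            self.decoder = nn.Sequential(
                nn.Linear(n_latent, 256), nn.ReLU(),
                nn.Linear(256, n_input), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)        # z = f(x)
            return self.decoder(z)     # x' = g(f(x))

    model = Autoencoder()
    loss_fn = nn.MSELoss()             # distance function L(x, g(f(x)))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)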
Unsupervised Pre-Training
Autoencoders have sometimes been used for unsupervised pre-training. First, the autoencoder is
trained to reconstruct its input. Then, the Decoder part is removed and replaced with, for example, a
classification layer, and the combined network is trained by backpropagation. The features learnt by
the autoencoder then serve as initial weights for the supervised learning task.
The process can even be repeated, with each subsequent layer trained to reproduce the activations of
the previous layer. This greedy layerwise pre-training was found to improve the performance of deep
networks in the early 2000s when sigmoid and tanh activations were still common. With recent advances such as ReLU, Weight Initialisation and Batch Normalisation, unsupervised pre-training has diminished in importance.
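For completeness, here is a hedged sketch of how the pre-trained encoder might be reused for a supervised task, building on the autoencoder sketch above; the latent size and the 10-class output layer are illustrative assumptions.

    import torch.nn as nn

    # Assumes `model` is the Autoencoder from the sketch above, already trained
    # (unsupervised) to reconstruct its input.
    n_latent, n_classes = 32, 10         # illustrative values

    classifier = nn.Sequential(
        model.encoder,                   # weights initialised by unsupervised pre-training
        nn.Linear(n_latent, n_classes),  # new classification layer replacing the decoder
    )
    # The combined network is then trained by backpropagation on labelled data,
    # e.g. with nn.CrossEntropyLoss(), fine-tuning the pre-trained encoder weights.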
Regularised Autoencoders
Often, a regularisation term is introduced, to force the latent variables either to conform to a
particular distribution, or to have some other desirable properties. Common forms of regularised
autoencoders include sparse autoencoders, contractive autoencoders, denoising autoencoders,
variational autoencoders and Wasserstein autoencoders.
Sparse Autoencoder
One way to regularise an autoencoder is to include a penalty term in the loss function, based on the
hidden unit activations. This is analogous to the weight decay term we previously used for supervised
learning. One popular choice is to penalise the sum of the absolute values of the activations in the
hidden layer:
E = L(x, g(f(x))) + λ ∑_i |z_i|

This is sometimes known as L1-regularisation (because it involves the absolute value rather than the square); it can encourage some of the hidden units to go to zero, thus producing a sparse representation.
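A possible way to add such a penalty in practice, reusing model, loss_fn and optimizer from the autoencoder sketch above; the value of λ is an illustrative assumption.

    # One training step with an L1 penalty on the latent activations (sketch).
    lam = 1e-4                               # regularisation strength (illustrative)

    def sparse_ae_step(x):                   # x: a batch of flattened inputs
        z = model.encoder(x)                 # hidden (latent) activations
        x_recon = model.decoder(z)
        loss = loss_fn(x_recon, x) + lam * z.abs().sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()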
Contractive Autoencoder
Another popular penalty term is the L2-norm of the derivatives of the hidden units with respect to the inputs:

E = L(x, g(f(x))) + λ ∑_i ∥∇_x z_i∥^2

This forces the model to learn hidden features that do not change much when the training inputs x are slightly altered.
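One straightforward (if inefficient) way to compute this penalty is with automatic differentiation, looping over the latent units. This is only a sketch: the helper name contractive_penalty is assumed, and practical implementations often use a closed-form expression for a single encoder layer instead.

    import torch

    def contractive_penalty(encoder, x):
        """Sum over latent units i of ||grad_x z_i||^2, i.e. the squared Frobenius
        norm of the encoder Jacobian, summed over the batch (sketch)."""
        x = x.clone().requires_grad_(True)
        z = encoder(x)
        penalty = x.new_zeros(())
        for i in range(z.shape[1]):
            # gradient of the i-th latent unit w.r.t. the inputs, kept in the graph
            grad_i, = torch.autograd.grad(z[:, i].sum(), x, create_graph=True)
            penalty = penalty + (grad_i ** 2).sum()
        return penalty

    # loss = loss_fn(model(x), x) + lam * contractive_penalty(model.encoder, x)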
Denoising Autoencoder
Another regularisation method, similar to the contractive autoencoder, is to add noise to the inputs, but train the network to recover the original input:

repeat:
    sample a training item x^(i)
    generate a corrupted version x̃ of x^(i)
    train to reduce E = L(x^(i), g(f(x̃)))
end
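A minimal training-loop sketch of this procedure, again reusing model, loss_fn and optimizer from the earlier autoencoder sketch; train_loader (e.g. MNIST batches) and the Gaussian noise level are assumptions.

    import torch

    noise_std = 0.3                                     # corruption level (illustrative)

    for x, _ in train_loader:
        x = x.view(x.size(0), -1)                       # flatten each training item x
        x_tilde = x + noise_std * torch.randn_like(x)   # corrupted version of x
        loss = loss_fn(model(x_tilde), x)               # reconstruct the *original* x
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()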
Generative Models
Another important application of autoencoders is image generation. After the autoencoder has been
trained, we detach the encoder and feed randomly generated vectors into the decoder, with the aim
of producing images which are similar in style to the training images, but not quite the same.
In other words, we want to be able to choose latent variables z from a standard normal distribution p(z), feed these values of z to the decoder, and have it produce a new item x which is somehow similar to the training items.

In order to improve the quality of the generated images, we try to force the distribution of the encoded images in latent space to match the standard normal distribution from which the random variables z will be chosen. One way to achieve this is with a Variational Autoencoder.
Entropy and KL-Divergence
Recall that the entropy and KL-Divergence for continuous distributions q() and p() are

H(q) = ∫_θ q(θ) (− log q(θ)) dθ

D_KL(q ∥ p) = ∫_θ q(θ) (log q(θ) − log p(θ)) dθ
The KL-Divergence between two d-dimensional multivariate Gaussian distributions q() and p() with mean μ_1, μ_2 and variance Σ_1, Σ_2, respectively, is

D_KL(q ∥ p) = (1/2) [ (μ_2 − μ_1)^T Σ_2^{-1} (μ_2 − μ_1) + Trace(Σ_2^{-1} Σ_1) + log( |Σ_2| / |Σ_1| ) − d ]

When p() is the Standard Normal Distribution with μ_2 = 0, Σ_2 = I, this reduces to:

D_KL(q ∥ p) = (1/2) [ ∥μ_1∥^2 + Trace(Σ_1) − log |Σ_1| − d ]
If the distribution q_ϕ() depends on some parameters ϕ, then minimising D_KL(q_ϕ(z) ∥ p(z)) will encourage the distribution q_ϕ() to be centered at the origin and to spread out so that it approximates p(). If Σ_1 = diag(σ_1^2, …, σ_d^2) is diagonal, we have

D_KL(q ∥ p) = (1/2) [ ∥μ_1∥^2 + ∑_{i=1}^{d} ( σ_i^2 − 2 log(σ_i) − 1 ) ]

which is minimised when μ_1 = 0 and σ_i = 1 for all i.
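This diagonal-Gaussian KL term is straightforward to compute; below is a small sketch. The function name and the use of the log-variance as the parameterisation are assumptions.

    import torch

    def kl_to_standard_normal(mu, logvar):
        """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ) for each item in the batch.

        Uses the diagonal-Gaussian formula above:
        0.5 * [ ||mu||^2 + sum_i (sigma_i^2 - 2*log(sigma_i) - 1) ],
        with logvar = log(sigma^2).
        """
        return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

    # Example: mu = 0, sigma = 1 gives zero divergence.
    print(kl_to_standard_normal(torch.zeros(1, 4), torch.zeros(1, 4)))  # tensor([0.])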
Variational Autoencoder
Instead of producing a single z for each x^(i), the encoder (with parameters ϕ) can be made to produce a mean μ_{z∣x^(i)} and standard deviation σ_{z∣x^(i)} for each input image x^(i) (Kingma & Welling, 2013). This defines a conditional (multivariate Gaussian) probability distribution q_ϕ(z ∣ x^(i)). We then train the system to maximise

E_{z∼q_ϕ(z∣x^(i))} [ log p_θ(x^(i) ∣ z) ] − D_KL( q_ϕ(z ∣ x^(i)) ∥ p(z) )

The first term enforces that any sample z drawn from the conditional distribution q_ϕ(z ∣ x^(i)) should produce something approximating x^(i) when fed to the decoder (with weights θ). The second term encourages q_ϕ(z ∣ x^(i)) to approximate p(z). In practice, the distributions q_ϕ(z ∣ x^(i)) for various x^(i) will occupy complementary regions within the overall distribution p(z).
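A compact sketch of how this could look in PyTorch, with the encoder producing a mean and log-variance and the sampling step written using the reparameterisation trick; the layer sizes, latent dimension and binary cross-entropy reconstruction loss are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, n_input=784, n_latent=20):
            super().__init__()
            self.enc = nn.Linear(n_input, 400)
            self.enc_mu = nn.Linear(400, n_latent)      # mean of q_phi(z|x)
            self.enc_logvar = nn.Linear(400, n_latent)  # log-variance of q_phi(z|x)
            self.dec = nn.Sequential(
                nn.Linear(n_latent, 400), nn.ReLU(),
                nn.Linear(400, n_input), nn.Sigmoid(),
            )

        def forward(self, x):
            h = F.relu(self.enc(x))
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I),
            # so that gradients can flow back through the sampling step.
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.dec(z), mu, logvar

    def vae_loss(x, x_recon, mu, logvar):
        # Negative of the objective above: reconstruction term plus KL term.
        recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return recon + kl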
Variational Autoencoder Digits
[image source: introducing-variational-autoencoders-in-prose-and-code]
Variational Autoencoder Faces
The variational autoencoder produces reasonable results, but the images tend to be a bit blurry. More
recent deterministic models, such as the Wasserstein autoencoder, produce somewhat better images.
But, to get really good images, we need a Generative Adversarial Network, which we will describe
after taking a brief detour into Artist-Critic Coevolution.
References
Kingma, D.P., & Welling, M. (2013). Auto-encoding variational Bayes, arXiv: 1312.6114.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep
convolutional generative adversarial networks, arXiv: 1511.06434.
Resources
https://blog.evjang.com/2016/08/variational-bayes.html
Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):
Regularized Autoencoders (14.2)
Gaussian distribution (3.9.3)
Variational Autoencoder (20.10.3)
Artist-Critic Coevolution
“All in all the creative act is not performed by the artist alone; the spectator brings the
work in contact with the external world by deciphering and interpreting its inner
qualifications and thus adds his contribution to the creative act.”
- Marcel Duchamp (1957)
One paradigm for producing computer-generated art is called Artist-Critic Coevolution.
In this paradigm, an Artist produces images, and a Critic evaluates those images along with a set of
real images. The Critic is rewarded for distinguishing real images from those generated by the Artist.
The Artist is rewarded for fooling the Critic into thinking that the generated images are real. This table
shows various choices for Artist and Critic that have been tried over the years.
Artist     Critic    Method                        Reference
Biomorph   Human     Blind Watchmaker              (Dawkins, 1986)
GP         Human     Interactive Evolution         (Sims, 1991)
CPPN       Human     PicBreeder                    (Secretan, 2011)
CA         Human     EvoEco                        (Kowaliw, 2012)
GP         SOM       Artificial Creativity         (Saunders, 2001)
GP         NN        Computational Aesthetics      (Machado, 2008)
Agents     NN        Evolutionary Art              (Greenfield, 2009)
GP         NN        Aesthetic Learning            (Li & Hu, 2010)
DCNN       DCNN      Generative Adversarial Nets   (Goodfellow, 2014)
DCNN       DCNN      Plug & Play Generative Nets   (Nguyen, 2016)
HERCL      HERCL     Co-Evolving Line Drawings     (Vickers, 2017)
HERCL      DCNN      HERCL Function/CNN            (Soderlund, 2018)
Interactive Evolution
In early work, a human played the role of the Critic in a process known as Interactive Evolution.
Typically, 15 images appear on the screen and the user is invited to click on one or more images that
they find appealing. These images are then copied, with slight mutations, into the next generation,
and the process continues. Here are some examples of resulting Biomorphs from The Blind
Watchmaker (Dawkins, 1986).
The same process was used in Sims (1991) but in this case, the Artist is a Genetic Program (GP) which
acts as a function to compute R,G,B values for each pixel location (x,y).
PicBreeder (Secretan, 2011) used a Compositional Pattern Producing Network (CPPN) for the Artist.
[image source: picbreeder.org]
It is interesting to see how each type of Artist leads to its own style of image.
Automating the Critic
The process of Interactive Evolution requires a lot of effort from the human. Machado (2008) attempted to replace the human with an automated Critic in the form of a 2-layer neural network which takes as input certain statistical features computed from the image. These are some of the images that emerged.
These images have their own kind of charm, but are not very realistic. The big breakthrough came with the advent of Generative Adversarial Networks (GAN) in (Goodfellow, 2014), where a Convolutional Neural Network is used for both the Generator (Artist) and the Discriminator (Critic).
Before describing GANs in detail, we finish this section with some recent examples of Artist-Critic
Coevolution using a special kind of Genetic Program for the Artist and a Convolutional Neural
Network for the Critic (Blair, 2019).
[image source: pickartso.com]
References
Blair, A. (2019). Adversarial evolution and deep learning—How does an artist play with our visual
system? In International conference on computational intelligence in music, sound, art and design
(EvoMusArt’19) (pp. 18–34). LNCS 11453.
Dawkins, R. (1986). The Blind Watchmaker: Why the evidence of evolution reveals a world without design.
Norton.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio,
Y., 2014. Generative adversarial nets, Advances in Neural Information Processing Systems, 2672-
2680.
Machado, P., Romero, J., & Manaris, B. (2008). Experiments in computational aesthetics: An iterative
approach to stylistic change in evolutionary art. In The art of artificial evolution: A handbook on
evolutionary art and music (pp. 381–415). Springer.
Secretan, J., Beato, N., D’Ambrosio, D. B., Rodriguez, A., Campbell, A., Folsom-Kovarik, J. T., & Stanley,
K. O. (2011). Picbreeder: A case study in collaborative evolutionary exploration of design space.
Evolutionary Computation, 19(3), 373–403.
Sims, K. (1991). Artificial evolution for computer graphics. ACM Computer Graphics, 25(4), 319–328.
Generative Adversarial Networks
Generative Adversarial Networks (Goodfellow, 2014) are a form of Artist-Critic Coevolution where the
Generator (Artist) and Discriminator (Critic) are both Deep Convolutional Neural Networks.
The Generator G_θ : z ↦ x, with parameters θ, generates an image x from latent variables z (sampled from a Standard Normal distribution). The Discriminator D_ψ : x ↦ D_ψ(x) ∈ (0, 1), with parameters ψ, takes an image x and estimates the probability of the image being real.
The Generator and Discriminator play a two-player (initially, zero-sum) game: the Discriminator tries to maximise the bracketed expression below, while the Generator tries to minimise it.

We alternate between gradient ascent on the Discriminator, using:

max_ψ ( E_{x∼p_data} [ log D_ψ(x) ] + E_{z∼p_model} [ log(1 − D_ψ(G_θ(z))) ] )

and gradient descent on the Generator, using:

min_θ E_{z∼p_model} [ log(1 − D_ψ(G_θ(z))) ]

The differentials are backpropagated from the Discriminator, through the image and into the Generator.
In practice, it is better to change the game so it is no longer zero-sum. The original formula (above)
puts too much emphasis on images with which the Generator is already successfully fooling the
Discriminator. It is better to do gradient ascent on the Generator using:
max_θ E_{z∼p_model} [ log D_ψ(G_θ(z)) ]
This puts more emphasis on the images with which the Generator is currently failing to fool the
Discriminator.
GAN algorithm
repeat:
    for k steps do
        sample minibatch of m latent samples {z^(1), …, z^(m)} from p(z)
        sample minibatch of m training items {x^(1), …, x^(m)}
        update Discriminator by gradient ascent on ψ:
            ∇_ψ (1/m) ∑_{i=1}^{m} [ log(D_ψ(x^(i))) + log(1 − D_ψ(G_θ(z^(i)))) ]
    end for
    sample new minibatch of m latent samples {z^(1), …, z^(m)} from p(z)
    update Generator by gradient ascent on θ:
        ∇_θ (1/m) ∑_{i=1}^{m} log(D_ψ(G_θ(z^(i))))
end repeat
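A minimal, self-contained PyTorch sketch of these alternating updates; the MLP architectures, learning rates and the choice of k = 1 Discriminator step per Generator step are illustrative assumptions, and the Generator update uses the non-saturating loss discussed above.

    import torch
    import torch.nn as nn

    latent_dim, n_pixels = 64, 784           # illustrative sizes (e.g. flattened MNIST)

    G = nn.Sequential(                        # Generator G_theta : z -> x
        nn.Linear(latent_dim, 256), nn.ReLU(),
        nn.Linear(256, n_pixels), nn.Sigmoid(),
    )
    D = nn.Sequential(                        # Discriminator D_psi : x -> prob(real)
        nn.Linear(n_pixels, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),
    )
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def gan_step(x_real):                     # x_real: minibatch of real images, shape (m, n_pixels)
        m = x_real.size(0)
        ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)

        # Discriminator update: gradient ascent on log D(x) + log(1 - D(G(z)))
        z = torch.randn(m, latent_dim)
        x_fake = G(z).detach()                # do not backpropagate into G here
        d_loss = bce(D(x_real), ones) + bce(D(x_fake), zeros)
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator update (non-saturating): gradient ascent on log D(G(z))
        z = torch.randn(m, latent_dim)
        g_loss = bce(D(G(z)), ones)           # minimising this maximises log D(G(z))
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()
        return d_loss.item(), g_loss.item()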
Compared to previous approaches, GANs produce images that are much more realistic!
Coevolutionary Dynamics
From the point of view of traditional machine learning, the enormous number of independent
parameters in a Deep Convolutional Neural Network (DCNN) compared to the number of training
images would suggest the risk of overfitting. In practice, DCNNs are very good at generalising to
classify new, unseen images. However, as illustrated in these examples from Szegedy (2013), if we
know the architecture and weights of a DCNN, we can take an image, add a small amount of noise to
it which is imperceptible to humans, and the network will classify it with high confidence into a totally different category.
This may perhaps explain why GAN-generated images look so realistic. If the Discriminator were
�xed, the Generator would be able to exploit weaknesses in the Discriminator and create images
analogous to those above, which look unrealistic to humans but are somehow able to fool the
Discriminator into classifying them as real. Because the Discriminator is being trained
simultaneously, it will have an opportunity to respond to the Generator and correct these
weaknesses, leading to a coevolutionary arms race.
Oscillation and Mode Collapse
The Generator network aims to produce the full range of images x, with different values for the latent variables z. Like any coevolution, GANs can sometimes oscillate or get stuck in a mediocre stable state.
Oscillation: GAN trains for a long time, generating a variety of images, but quality fails to
improve
Mode Collapse: Generator produces only a small subset of the desired range of images, or
converges to a single image (with minor variations)
Several methods have been proposed for avoiding mode collapse, including Conditioning
Augmentation, Minibatch features (�tness sharing), Unrolled GANs and others.
GAN Variants
Many variants of the original GAN algorithm have been developed. For example, this system allows a realistic image to be created according to user-specified constraints that certain parts of the image should belong to certain categories (Park, 2019).
http://nvidia-research-mingyuliu.com/gaugan/
DeepFake Technology and Computational Creativity
This video is from a talk I gave to the Australian Skeptics in February 2021.
References
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio,
Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (pp.2672–
2680).
Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., & Yosinski, J. (2017). Plug & play generative
networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE
conference on computer vision and pattern recognition (pp. 4467–4477).
Nguyen, A., Yosinski, J., & Clune, J. (2015). High con�dence predictions for unrecognizable images. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 427–436).
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv: 1511.06434.
Park, T., Liu, M.Y., Wang, T.C., & Zhu, J.Y., 2019. Semantic image synthesis with spatially-adaptive
normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(pp. 2337-2346).
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R., 2013.
Intriguing properties of neural networks. arXiv: 1312.6199.
Quiz 8: Unsupervised Learning
Question 1

What is the Energy function for each of these architectures?

Boltzmann Machine
Restricted Boltzmann Machine

Remember to define any variables you use.

Question 2

The Variational Auto-Encoder is trained to maximize

E_{z∼q_ϕ(z∣x^(i))} [ log p_θ(x^(i) ∣ z) ] − D_KL( q_ϕ(z ∣ x^(i)) ∥ p(z) )

Briefly state what each of these two terms aims to achieve.

Question 3

Generative Adversarial Networks traditionally made use of a two-player zero-sum game between a Generator G_θ and a Discriminator D_ψ, to compute min_θ max_ψ V(G_θ, D_ψ).

Give the formula for V(G_θ, D_ψ).

Explain why it may be advantageous to change the GAN algorithm so that the game is no longer zero-sum, and write the formula that the Generator would try to maximize in that case.
Week 9 Coding Exercise
In this exercise, you will practise implementing a simple autoencoder. We will train an autoencoder model and a variational autoencoder model to reconstruct images from the MNIST dataset. We will also compare the quality of images sampled from these two models.
Please download the notebook and implement/run it locally.
Week 9 Thursday video