COMP9444
Neural Networks and Deep Learning
9a. Autoencoders

Outline
Autoencoder Networks (14.1)
Regularized Autoencoders (14.2)
Stochastic Encoders and Decoders (14.4)
Generative Models
Variational Autoencoders (20.10.3)
Recall: Encoder Networks

identity mapping through a bottleneck
also called N–M–N task
used to investigate hidden unit representations

Inputs   Outputs
10000    10000
01000    01000
00100    00100
00010    00010
00001    00001

Autoencoder Networks
Textbook, Chapter 14

output is trained to reproduce the input as closely as possible
activations normally pass through a bottleneck, so the network is forced to compress the data in some way
like the RBM, Autoencoders can be used to automatically extract abstract features from the input
Autoencoder Networks

If the encoder computes z = f(x) and the decoder computes g(f(x)), then we aim to minimize some distance function between x and g(f(x)):

E = L(x, g(f(x)))
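A minimal sketch of this setup in PyTorch (the layer sizes, activations and use of mean squared error are assumptions for illustration, not part of the slides):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_input=784, n_hidden=32):
        super().__init__()
        # encoder f: input -> bottleneck code z
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Tanh())
        # decoder g: code z -> reconstruction of the input
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_input), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # z = f(x)
        return self.decoder(z)       # g(f(x))

model = Autoencoder()
loss_fn = nn.MSELoss()               # distance function L(x, g(f(x)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # dummy batch standing in for real data
loss = loss_fn(model(x), x)          # E = L(x, g(f(x)))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```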
Autoencoder as Pretraining

after an autoencoder is trained, the decoder part can be removed and replaced with, for example, a classification layer
this new network can then be trained by backpropagation
the features learned by the autoencoder then serve as initial weights for the supervised learning task

Greedy Layerwise Pretraining

Autoencoders can be used as an alternative to Restricted Boltzmann Machines, for greedy layerwise pretraining.
An autoencoder with one hidden layer is trained to reconstruct the inputs. The first layer (encoder) of this network becomes the first layer of the deep network.
Each subsequent layer is then trained to reconstruct the previous layer.
A final classification layer is then added to the resulting deep network, and the whole thing is trained by backpropagation.
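A sketch of this pretraining step, assuming the hypothetical Autoencoder from the earlier sketch has already been trained; the decoder is dropped and a new classification layer is attached to the encoder:

```python
import torch
import torch.nn as nn

# assume `model` is the Autoencoder sketched earlier, already trained
n_hidden, n_classes = 32, 10                 # assumed sizes
classifier = nn.Sequential(
    model.encoder,                           # weights initialised by the autoencoder
    nn.Linear(n_hidden, n_classes),          # new classification layer (replaces the decoder)
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
# the whole network is then fine-tuned by backpropagation on labelled data
```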
Avoiding Trivial Identity

If there are more hidden nodes than inputs (which often happens in image processing), there is a risk the network may learn a trivial identity mapping from input to output.
We generally try to avoid this by introducing some form of regularization.

Regularized Autoencoders (14.2)

autoencoders with dropout at hidden layer(s)
sparse autoencoders
contractive autoencoders
denoising autoencoders

Sparse Autoencoder (14.2.1)

One way to regularize an autoencoder is to include a penalty term in the loss function, based on the hidden unit activations. This is analogous to the weight decay term we previously used for supervised learning.
One popular choice is to penalize the sum of the absolute values of the activations in the hidden layer:

E = L(x, g(f(x))) + λ ∑_i |h_i|

This is sometimes known as L1-regularization (because it involves the absolute value rather than the square); it can encourage some of the hidden units to go to zero, thus producing a sparse representation.
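As a sketch, reusing the hypothetical model from the earlier example, the L1 penalty is simply added to the reconstruction loss (λ is an assumed value):

```python
import torch
import torch.nn.functional as F

# assume `model` is the Autoencoder sketched earlier
x = torch.rand(64, 784)                      # dummy batch standing in for real data
lam = 1e-3                                   # lambda, an assumed penalty strength
z = model.encoder(x)                         # hidden activations h_i
recon = model.decoder(z)
loss = F.mse_loss(recon, x) + lam * z.abs().sum(dim=1).mean()
loss.backward()
```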
Contractive Autoencoder (14.2.3)

Another popular penalty term is the L2-norm of the derivatives of the hidden units with respect to the inputs:

E = L(x, g(f(x))) + λ ∑_i ||∇_x h_i||²

This forces the model to learn hidden features that do not change much when the training inputs x are slightly altered.
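For a single sigmoid encoder layer h = σ(Wx + b), the derivatives have the closed form ∂h_i/∂x_j = h_i(1 − h_i) W_ij, which gives this sketch of the penalty (layer sizes and λ are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 32)                     # encoder weights W and bias b
dec = nn.Linear(32, 784)
lam = 1e-3                                   # lambda, an assumed penalty strength

x = torch.rand(64, 784)                      # dummy batch
h = torch.sigmoid(enc(x))                    # h = f(x)
recon = torch.sigmoid(dec(h))                # g(f(x))

# for sigmoid units, dh_i/dx_j = h_i (1 - h_i) W_ij, so
# sum_i ||grad_x h_i||^2 = sum_i (h_i (1 - h_i))^2 * ||W_i||^2
w_sq = (enc.weight ** 2).sum(dim=1)          # ||W_i||^2 for each hidden unit
penalty = ((h * (1 - h)) ** 2 * w_sq).sum(dim=1).mean()

loss = F.mse_loss(recon, x) + lam * penalty
loss.backward()
```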
Denoising Autoencoder (14.2.2)

Another regularization method, similar to the contractive autoencoder, is to add noise to the inputs, but train the network to recover the original input:

repeat:
    sample a training item x(i)
    generate a corrupted version x̃ of x(i)
    train to reduce E = L(x(i), g(f(x̃)))
end
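A sketch of one such training step, assuming Gaussian corruption and the hypothetical model from the earlier example:

```python
import torch
import torch.nn.functional as F

# assume `model` is the Autoencoder sketched earlier
x = torch.rand(64, 784)                      # a clean training batch
x_tilde = x + 0.3 * torch.randn_like(x)      # corrupted version x~ (assumed Gaussian noise)
recon = model(x_tilde)                       # g(f(x~))
loss = F.mse_loss(recon, x)                  # E = L(x, g(f(x~))): the target is the clean input
loss.backward()
```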
Loss Functions and Probability

We saw previously how the loss (cost) function at the output of a feedforward neural network (with parameters θ) can be seen as defining a probability distribution pθ(x) over the outputs. We then train to maximize the log of the probability of the target values.

◮ squared error assumes an underlying Gaussian distribution, whose mean is the output of the network (see the note below)
◮ cross entropy assumes a Bernoulli distribution, with probability equal to the output of the network
◮ softmax assumes a Boltzmann distribution
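To make the first of these correspondences concrete (a standard identity, not from the slides): if the network output y is treated as the mean of a Gaussian with fixed σ over the target t, then

−log P_{y,σ}(t) = (t − y)² / 2σ² + log(σ√(2π))

so maximizing the log probability of the target is the same as minimizing the squared error, up to constants.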
Stochastic Encoders and Decoders (14.4)

For autoencoders, the decoder can be seen as defining a conditional probability distribution pθ(x|z) of output x for a certain value z of the hidden or “latent” variables.
In some cases, the encoder can also be seen as defining a conditional probability distribution qφ(z|x) of latent variables z based on an input x.
Generative Models

Sometimes, as well as reproducing the training items {x(i)}, we also want to be able to use the decoder to generate new items which are of a similar “style” to the training items.
In other words, we want to be able to choose latent variables z from a standard Normal distribution p(z), feed these values of z to the decoder, and have it produce a new item x which is somehow similar to the training items.
Generative models can be:
◮ explicit (Variational Autoencoders)
◮ implicit (Generative Adversarial Networks)
We have seen an example of this with the Restricted Boltzmann Machine, where qφ(z|x) and pθ(x|z) are Bernoulli distributions.
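A sketch of this generative use, assuming a trained decoder such as the one in the earlier hypothetical Autoencoder; note that for a plain autoencoder the samples may be poor unless qφ(z|x) has been encouraged to match p(z), which is what the Variational Autoencoder below arranges:

```python
import torch

# assume `model` is the trained Autoencoder from the earlier sketch (32-dimensional code)
z = torch.randn(16, 32)          # latent variables drawn from a standard Normal p(z)
x_new = model.decoder(z)         # 16 generated items, each decoded from a sampled z
```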
Gaussian Distribution (3.9.3)

P_{μ,σ}(x) = (1 / (σ√(2π))) e^{−(x−μ)² / 2σ²}

μ = mean
σ = standard deviation

Multivariate Gaussian:

P_{μ,σ}(x) = ∏_i P_{μ_i,σ_i}(x_i)
Entropy and KL-Divergence

The entropy of a distribution q() is

H(q) = ∫_θ q(θ) (−log q(θ)) dθ

In Information Theory, H(q) is the amount of information (bits) required to transmit a random sample from distribution q().
For a Gaussian distribution, H(q) = ∑_i log σ_i (up to an additive constant).

KL-Divergence:

D_KL(q || p) = ∫_θ q(θ) (log q(θ) − log p(θ)) dθ

D_KL(q || p) is the number of extra bits we need to transmit if we designed a code for p() but the samples are drawn from q() instead.
If p(z) is a Standard Normal distribution, minimizing D_KL(qφ(z) || p(z)) encourages qφ() to center on zero and spread out to approximate p().
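For the diagonal Gaussian qφ(z|x) = N(μ, σ²) and standard Normal p(z) used by the Variational Autoencoder on the next slide, this integral has a closed form (a standard result, stated here for reference):

D_KL( N(μ, σ²) || N(0, 1) ) = ½ ∑_i ( μ_i² + σ_i² − 1 − log σ_i² )

This is the term that is computed directly during training of a Variational Autoencoder.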
Variational Autoencoder (20.10.3)

Instead of producing a single z for each x(i), the encoder (with parameters φ) can be made to produce a mean μ_{z|x(i)} and standard deviation σ_{z|x(i)}.
This defines a conditional (Gaussian) probability distribution qφ(z|x(i)).
We then train the system to maximize

E_{z∼qφ(z|x(i))} [ log pθ(x(i)|z) ] − D_KL( qφ(z|x(i)) || p(z) )

the first term enforces that any sample z drawn from the conditional distribution qφ(z|x(i)) should, when fed to the decoder, produce something approximating x(i)
the second term encourages qφ(z|x(i)) to approximate p(z)
in practice, the distributions qφ(z|x(i)) for various x(i) will occupy complementary regions within the overall distribution p(z)
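A minimal sketch of a Variational Autoencoder in PyTorch, using the reparameterization trick so that sampling from qφ(z|x(i)) remains differentiable; the layer sizes, the Bernoulli (binary cross entropy) reconstruction term and the training details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_input=784, n_latent=20):
        super().__init__()
        self.enc = nn.Linear(n_input, 400)
        self.enc_mu = nn.Linear(400, n_latent)      # mean of q_phi(z|x)
        self.enc_logvar = nn.Linear(400, n_latent)  # log sigma^2 of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(n_latent, 400), nn.ReLU(),
                                 nn.Linear(400, n_input), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so a sample from q_phi(z|x) is differentiable w.r.t. phi
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # reconstruction term: binary cross entropy, i.e. a Bernoulli p_theta(x|z)
    rec = F.binary_cross_entropy(recon, x, reduction='sum')
    # KL( q_phi(z|x) || N(0, I) ), using the closed form given above
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - 1.0 - logvar)
    return rec + kl          # minimizing this maximizes the objective above

vae = VAE()
x = torch.rand(64, 784)                  # dummy batch standing in for real data
recon, mu, logvar = vae(x)
loss = vae_loss(x, recon, mu, logvar)
loss.backward()

# once trained, new items are generated by sampling z from p(z) = N(0, I)
with torch.no_grad():
    samples = vae.dec(torch.randn(16, 20))
```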
Variational Autoencoder Digits

[Figure: digit reconstructions: Original, 1st Epoch, 9th Epoch]

Variational Autoencoder Faces

[Figure: Variational Autoencoder applied to face images]

Variational Autoencoder

Variational Autoencoder produces reasonable results
tends to produce blurry images
often ends up using only a small number of the dimensions available to z

References

http://kvfrans.com/variational-autoencoders-explained/
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture13.pdf
https://arxiv.org/pdf/1606.05908.pdf