Deep Learning – COSC2779 – Deep Unsupervised Learning
Dr. Iman Abbasnejad
September 20, 2021
Reference: Chapters 14 and 20, Ian Goodfellow et al., "Deep Learning", MIT Press, 2016.
Outline
1 AutoEncoders
2 Generative Adversarial Networks (GAN)
3 Text Generation
4 Speech Generation
Supervised Learning
Data: D = {(x^(i), y^(i))}_{i=1}^{N}
Goal: Learn a function h : x → y
Example: Classification, Regression,
Object detection, Image segmentation,
Sentiment analysis, Machine Translation,
Image captioning, . . .
Probabilistic interpretation: p(y^(i) | x_1^(i), ..., x_d^(i))
[Figure: classification example – an image labelled "Dog"]
Unsupervised Learning
Supervised Learning
Data: D = {(x^(i), y^(i))}_{i=1}^{N}
Goal: Learn a function h : x → y
Example: Classification, Regression,
Object detection, Image segmentation,
Sentiment analysis, Machine Translation,
Image captioning, . . .
Unsupervised Learning
Data: D = {(x^(i), ·)}_{i=1}^{N}
Goal: Learn some underlying hidden
structure of the data.
Example: Clustering, dimensionality
reduction, feature learning, density
estimation, . . .
Unsupervised Learning
Example: K-means clustering
Data: D = {(x^(i), ·)}_{i=1}^{N}
Goal: Assign each sample x^(i) to one of the k clusters.
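As a concrete (non-deep) illustration, a minimal k-means sketch in NumPy; the two-cluster toy data and k = 2 are arbitrary choices, and in practice sklearn.cluster.KMeans would normally be used:

```python
# Minimal k-means sketch with NumPy (illustrative only).
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k random samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each sample x^(i) to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of its assigned samples
        # (assumes no cluster becomes empty).
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.concatenate([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
labels, centroids = kmeans(X, k=2)
```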
Generative Models
A form of unsupervised learning.
Data: D = {x^(i)}_{i=1}^{N}
Goal: Given training data, generate new
samples from same distribution.
Example: Autoencoders, GAN,
One-to-Many RNN, . . .
Probabilistic interpretation: p(x_1^(i), ..., x_d^(i))
Learning: fit p_model(x) so that it approximates the data distribution.
Inference (testing): sample new data x ~ p_model(x).
Generative Models: Simple Example
Given the heights of students in a class, guess the height of the next student
entering the class.
[Figure: fitted density p_model over height]
Assumption: The heights of students are normally distributed.
p_model = p(x^(i)) = N(x^(i); µ, σ) = 1/(σ√(2π)) · exp( −(x^(i) − µ)² / (2σ²) )
We can fit the model to data using MLE – Learning.
After Learning, we can use the model to generate new data – Sampling. E.g.
Inverse CDF sampling
Generative Models: Simple Example (continued)
Assume the observations are i.i.d. with parameters Θ. The likelihood of the data is
p_model = p(x^(1), ..., x^(N) | Θ) = ∏_{i=1}^{N} p(x^(i) | Θ)
MLE: ∂/∂Θ log ∏_{i=1}^{N} p(x^(i) | Θ) = 0 ⇒ Σ_{i=1}^{N} ∂/∂Θ log p(x^(i) | Θ) = 0
µ_MLE = (1/N) Σ_{i=1}^{N} x^(i),   σ_MLE = √( (1/N) Σ_{i=1}^{N} (x^(i) − µ_MLE)² )
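A minimal NumPy sketch of this example; the height values are made up for illustration:

```python
# Fit a Gaussian to the heights by MLE, then sample new "students" from it.
import numpy as np

heights = np.array([165.0, 172.0, 158.0, 181.0, 176.0, 169.0, 174.0])  # toy data

# Closed-form MLE for a Gaussian.
mu_mle = heights.mean()
sigma_mle = np.sqrt(((heights - mu_mle) ** 2).mean())  # note: 1/N, not 1/(N-1)

# Sampling from p_model (equivalent to inverse-CDF sampling for a Gaussian).
rng = np.random.default_rng(0)
new_heights = rng.normal(loc=mu_mle, scale=sigma_mle, size=5)
print(mu_mle, sigma_mle, new_heights)
```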
Generative Models: Why?
Learn useful features for downstream tasks such as classification.
Getting insights from high-dimensional data (physics, medical imaging,
etc.)
Realistic samples for artwork, super-resolution, colorization, etc.
Modeling the physical world for simulation and planning
. . .
18 Impressive Applications of Generative Adversarial Networks (GANs)
https://machinelearningmastery.com/impressive-applications-of-generative-adversarial-networks/
Objective of the Lecture
Gain a basic understanding of deep generative models applicable to image, text and speech generation.
This will be a more conceptual lecture.
It will provide an example of training a generative model.
AutoEncoders
Example Scenario
Airbus provides several services for the operation of
the Columbus module and its payloads on the
International Space Station (ISS).
To ensure the health of the crew as well as
hundreds of systems onboard the Columbus
module, engineers have to keep track of many
telemetry data-streams, which are constantly
beamed to earth.
Airbus is interested in automated detection of
anomalies in the telemetry data-streams.
Data: Telemetry data-streams from the last 10 years result in over 5 trillion data points. However, there are very few anomalies (none for most systems).
How Airbus Detects Anomalies in ISS Telemetry Data Using TFX
https://blog.tensorflow.org/2020/04/how-airbus-detects-anomalies-iss-telemetry-data-tfx.html
Detecting Anomalies in ISS Telemetry Data
Can this be approached as a supervised learning problem?
Heavily skewed data: 99.9% of the data may be from the normal class, with very few anomalous examples (none for most systems).
Anomaly is not a pattern that we can learn; the pattern is only in the normal data.
We are therefore interested in modelling how the normal data is distributed: p_normal(x)
AutoEncoders: Intuition
[Diagram: image → Encoder → Code → Decoder → image]
Think about JPEG encoding and decoding. In that case both encoder and
decoder are predetermined functions.
Can we learn an encoding function and a decoding function from data?
Parallel idea: PCA (dimensionality reduction).
Autoencoder Basics
[Diagram: x^(i) (input) → Encoder → z^(i) (code) → Decoder → x̂^(i) (reconstruction)]
Unsupervised approach for learning a
lower-dimensional feature representation
from unlabeled training data
x is the input. Can be an image,
sequence or other feature vector.
x̂ is the prediction. Same dimension
as the input.
z is the Latent representation (code).
Usually has smaller dimensions than
the input.
Both encoder and decoder are neural
networks (MLP, CNN, RNN).
Autoencoder Training
[Diagram: x^(i) → Encoder → z^(i) → Decoder → x̂^(i)]   Loss: L = ||x − x̂||²
Unsupervised approach for learning a
lower-dimensional feature representation
from unlabeled training data
Only need x to train the network.
z has to capture information in the
input image in order to be able to
reconstruct that image.
Therefore, this training process will
generate a model that can encode
image information into a compact
vector (z)
Will only work well for test inputs that
are “similar” to the training data.
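A minimal fully-connected autoencoder sketch in Keras along these lines; the 784-d input (e.g. a flattened 28×28 image) and the 32-d code size are assumptions for illustration:

```python
# Encoder + decoder trained end-to-end with reconstruction loss L = ||x - x_hat||^2.
import tensorflow as tf
from tensorflow.keras import layers

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),       # z: the latent code
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),   # x_hat: same dimension as the input
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Only x is needed for training: the input is also the target.
# x_train = ...  # shape (N, 784), values scaled to [0, 1]
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```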
Convolutional Autoencoder for Images
[Diagram: input image → convolutional Encoder → feature map → convolutional Decoder → reconstruction]
The encoder consists of convolution layers, some with striding (or pooling) to reduce dimension.
The decoder consists of convolution layers, some with transposed convolution (or un-pooling) to increase dimension.
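A sketch of such a convolutional autoencoder for (assumed) 28×28×1 images, downsampling with strided convolutions and upsampling with transposed convolutions:

```python
import tensorflow as tf
from tensorflow.keras import layers

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),   # 14x14
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),   # 7x7 feature map
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(7, 7, 32)),
    layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),    # 14x14
    layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),  # 28x28
])

conv_ae = tf.keras.Sequential([encoder, decoder])
conv_ae.compile(optimizer="adam", loss="mse")
```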
Detecting Anomalies in ISS Telemetry Data
For sequence data, both the encoder and the decoder can be RNNs.
How Airbus Detects Anomalies in ISS Telemetry Data Using TFX
Train the AutoEncoder model only using
the normal data.
The AutoEncoder will learn to
reconstruct the normal data well – Low
reconstruction error on normal data.
The hypothesis is that it will NOT do well
on anomalous data.
Threshold the reconstruction error to
detect anomalies in test data.
https://blog.tensorflow.org/2020/04/how-airbus-detects-anomalies-iss-telemetry-data-tfx.html
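A sketch of the thresholding idea, assuming an autoencoder `ae` already trained on normal data only; the 99th-percentile threshold is an arbitrary choice for illustration:

```python
import numpy as np

def reconstruction_error(ae, x):
    """Per-sample mean squared reconstruction error."""
    x_hat = ae.predict(x)
    return np.mean((x - x_hat) ** 2, axis=tuple(range(1, x.ndim)))

# Choose a threshold from errors on held-out *normal* data ...
# errors_normal = reconstruction_error(ae, x_val_normal)
# threshold = np.percentile(errors_normal, 99)

# ... then flag test samples whose error exceeds it as anomalies.
# is_anomaly = reconstruction_error(ae, x_test) > threshold
```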
AutoEncoder for Learning Feature Extractors
[Diagram: x^(i) → trained Encoder → z^(i) → new Head]
Can use the trained encoder as a feature
extractor and do transfer learning for other
similar tasks.
A form of self-supervision.
Can also use z to cluster the dataset.
Used to be a common pre-training
technique for image classification. Not so
popular anymore as transfer learning has
taken over.
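A sketch of this kind of transfer learning, assuming the `encoder` from an earlier sketch and a hypothetical 10-class task:

```python
import tensorflow as tf
from tensorflow.keras import layers

encoder.trainable = False  # keep the self-supervised features fixed

classifier = tf.keras.Sequential([
    encoder,                                  # x -> z
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # new task-specific head
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(x_labelled, y_labelled, epochs=5)
```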
Example: De-Noising AutoEncoder
You are given a set of images from a clothing database. The task is to create a feature
extractor that can be used for classifying the images to common clothing categories. E.g.
Trousers, t-shirts, . . .
An AutoEncoder with sufficient complexity may learn the so-called "identity function", meaning that the output simply equals the input, making the AutoEncoder useless.
Solution: train the AutoEncoder with noise-added images as input and the corresponding clean training image as the target. This is called a de-noising AutoEncoder.
Example: De-Noising AutoEncoder
[Diagram: noisy image x_n^(i) → Encoder → z^(i) → Decoder → x̂_n^(i); loss L = ||x − x̂_n||² is computed against the original, noise-free image x]
At test time, input a corrupted test image and the network will generate a noise-free version of it.
The network is now forced to learn the underlying structure; it cannot take shortcuts by memorizing inputs.
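A training sketch for the de-noising setup, assuming the `autoencoder` from an earlier sketch and clean training images `x_train` scaled to [0, 1]; the noise level 0.2 is arbitrary:

```python
import tensorflow as tf

noise_std = 0.2  # arbitrary noise level for illustration

def add_noise(x):
    # Corrupt the input; keep the clean image as the target.
    x_noisy = x + noise_std * tf.random.normal(tf.shape(x))
    return tf.clip_by_value(x_noisy, 0.0, 1.0), x   # (noisy input, clean target)

# Re-corrupt the images with fresh noise on the fly via tf.data:
# ds = (tf.data.Dataset.from_tensor_slices(x_train)
#         .shuffle(10_000).batch(128).map(add_noise))
# autoencoder.fit(ds, epochs=10)

# At test time, feed a corrupted image; the output is a de-noised version:
# x_denoised = autoencoder.predict(x_test_noisy)
```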
Example: Segmentation AutoEncoder
U-Net is an architecture used for
segmentation.
With small training samples and
augmentation they obtained
state-of-the-art results on
biomedical image data.
It consists of a contracting path
(left side) and an expansive path
(right side).
Skip connections concatenate the correspondingly cropped feature maps from the contracting path with the feature maps in the expansive path.
Issues in Basic Autoencoder
[Diagram: x^(i) (input) → Encoder → z^(i) (code) → Decoder → x̂^(i) (reconstruction)]
The basic auto-encoder can memorize (over-fit) the training data. Remedies: adding noise to the input or the features, or training on in-painting tasks.
It cannot generate novel images, because the distribution of the code is unknown. Remedy: the Variational Auto-encoder.
Variational Autoencoder
[Diagram: image → Encoder → Code → Decoder → image]
Think about JPEG encoding and decoding. In that case both encoder and
decoder are predetermined functions.
Can we alter the code to get new images?
Difficult: even if we manage to generate an image, it would mostly be unrealistic.
Variational Autoencoder
[Diagram: z^(i) (code) → Decoder → x̂^(i) (reconstruction)]
1. Sample z from the prior: z^(i) ~ p(z)
2. Sample x from: x^(i) ~ p_model(x | z^(i))
We want to learn p(x_1, x_2, ..., x_d) given training data.
In the autoencoder we learn p(x_1, x_2, ..., x_d | z).
What if, for realistic images, z came from a known distribution (the prior), e.g. z^(i) ~ N(0, 1)?
Then we could sample z^(i) and use a decoder network to generate images.
Variational Autoencoder
[Diagram: x^(i) (input) → Encoder → (µ^(i), Σ^(i)) → sample z^(i) (code) → Decoder → x̂^(i) (reconstruction)]
VAE maps the input data into the
parameters of a probability distribution,
such as the mean and variance of a
Gaussian. This approach produces a
continuous, structured latent space, which
is useful for image generation.
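A minimal VAE sketch in Keras along these lines (reparameterisation trick plus reconstruction + KL loss); the 784-d input, 2-d latent space and equal loss weighting are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2

class VAE(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.enc = tf.keras.Sequential([
            layers.Dense(128, activation="relu"),
            layers.Dense(2 * latent_dim),        # outputs [mu, log_var]
        ])
        self.dec = tf.keras.Sequential([
            layers.Dense(128, activation="relu"),
            layers.Dense(784, activation="sigmoid"),
        ])

    def call(self, x):
        mu, log_var = tf.split(self.enc(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps      # reparameterisation trick
        x_hat = self.dec(z)
        # Reconstruction term + KL( q(z|x) || N(0, I) ).
        recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
        kl = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
        self.add_loss(tf.reduce_mean(recon + kl))
        return x_hat

vae = VAE()
vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=10, batch_size=128)    # only x is needed

# Generation: sample z from the prior and decode.
# new_images = vae.dec(tf.random.normal((16, latent_dim)))
```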
Variational Autoencoders
Probabilistic spin to traditional autoencoders
which allows generating data
Pros:
Principled approach to generative models.
Interpretable latent space.
Can be useful feature representation for
other tasks.
Cons:
Samples blurrier and lower quality
compared to state-of-the-art (GANs)
Image: Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014.
Generative Adversarial Networks (GAN)
Example Scenario
Assume you are hired by a startup to design
a system that takes a face image as input
and generate aged versions of that face.
It is not practical to collect a dataset containing images of the same person at different points in their life, which is what supervised learning would require.
However, it is relatively easy to collect a dataset of faces of people of different ages, e.g. a set of 40-year-olds, a set of 50-year-olds, ...
Image: Antipov, Grigory, et al., "Face aging with conditional generative adversarial networks." ICIP, 2017.
Generative Adversarial Networks: Intuition
Counterfeiter: Becomes good at generating realistic looking money by
learning about how he got caught.
Policeman: Becomes good at discriminating fake from real as the
Counterfeiter becomes more sophisticated.
A two-player game in which both players learn from each other. For our use case, the players are neural networks.
Generative Adversarial Networks: Intuition
Discriminator: tries to distinguish between real and fake images. The discriminator has a binary classification task with objective:
arg max_{θ_d} Σ_i [ log D_{θ_d}(x^(i)) + log( 1 − D_{θ_d}( G_{θ_g}(z^(i)) ) ) ]
The optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)); when the generator matches the data distribution (p_g = p_data), the discriminator cannot distinguish between real and fake, so D*(x) = 1/2.
Notation: real images x^(i); fake images G_{θ_g}(z^(i)); latent feature (code) z^(i).
Generative Adversarial Networks: Intuition
Generator: tries to fool the discriminator by generating real-looking images. Generator objective:
arg max_{θ_g} Σ_i [ log D_{θ_d}( G_{θ_g}(z^(i)) ) ]
The generator wants to maximize this objective so that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real). This is the commonly used non-saturating form of the original minimax objective.
Notation: real images x^(i); fake images G_{θ_g}(z^(i)); latent feature (code) z^(i).
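The two objectives can be written as binary cross-entropy losses; a sketch (minimising these losses is equivalent to maximising the objectives above):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Real images should be classified as 1, generated images as 0.
    return (bce(tf.ones_like(real_logits), real_logits) +
            bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # Non-saturating loss: the generator wants D(G(z)) pushed towards 1.
    return bce(tf.ones_like(fake_logits), fake_logits)
```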
GAN: Model
The generator network takes a random vector and maps it to an image. For images it is an upsampling network (transposed convolutions).
The discriminator is a classification CNN that can distinguish between real and fake images.
GAN: Algorithm
Algorithm 1: Training GANs (Goodfellow, NIPS 2014)
for number of training iterations do
    for k steps do
        Sample m code samples {z^(1), ..., z^(m)};
        Sample m real images {x^(1), ..., x^(m)} from the training data;
        Update the discriminator weights θ_d by maximizing the discriminator objective.
    end
    Sample m code samples {z^(1), ..., z^(m)};
    Update the generator weights θ_g by maximizing the generator objective.
end
After training, use the generator network to generate new images.
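A sketch of one alternating training step following Algorithm 1 (with k = 1), assuming `generator`, `discriminator` and the loss functions from the previous sketch; the latent size and learning rates are arbitrary choices:

```python
import tensorflow as tf

latent_dim = 100
g_opt = tf.keras.optimizers.Adam(2e-4)
d_opt = tf.keras.optimizers.Adam(2e-4)

@tf.function
def train_step(real_images):
    m = tf.shape(real_images)[0]
    z = tf.random.normal([m, latent_dim])          # sample m codes

    # 1) Update the discriminator.
    with tf.GradientTape() as tape:
        fake_images = generator(z, training=True)
        d_loss = discriminator_loss(discriminator(real_images, training=True),
                                    discriminator(fake_images, training=True))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Update the generator (discriminator weights held fixed).
    z = tf.random.normal([m, latent_dim])
    with tf.GradientTape() as tape:
        g_loss = generator_loss(discriminator(generator(z, training=True),
                                              training=True))
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```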
GAN: Convolutional Architectures
The original GAN networks were notoriously difficult to train.
Architecture guidelines for stable deep convolutional GANs:
Replace any pooling layers in the discriminator with strided convolutions, and any pooling layers in the generator with transposed convolutions.
Use batch-norm in both the generator and the discriminator.
Remove fully-connected hidden layers for deeper architectures.
Use ReLU activation in the generator for all layers except the output, which uses tanh.
Use LeakyReLU activation in the discriminator.
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
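A generator sketch roughly following these guidelines (transposed convolutions, batch-norm, ReLU everywhere except a tanh output); the 100-d code and 64×64×3 output size are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    layers.Dense(4 * 4 * 256, use_bias=False),
    layers.Reshape((4, 4, 256)),
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(128, 4, strides=2, padding="same", use_bias=False),   # 8x8
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(64, 4, strides=2, padding="same", use_bias=False),    # 16x16
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(32, 4, strides=2, padding="same", use_bias=False),    # 32x32
    layers.BatchNormalization(), layers.ReLU(),
    layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh"),  # 64x64x3
])
```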
GAN: Convolutional Architectures
Image: Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
GAN: Better Training and Generation
Developing better training algorithms for GANs is still a very active research
area:
LSGAN
Wasserstein GAN
Improved Wasserstein GAN
Progressive GAN
. . .
List of some GAN papers
https://github.com/nightrome/really-awesome-gan
Example Scenario (revisited)
Recall the face-aging task: we cannot collect images of the same person at different ages, but we can easily collect faces of people of different ages. Can we solve this problem with the GANs we have discussed so far?
Image: Antipov, Grigory, et al., "Face aging with conditional generative adversarial networks." ICIP, 2017.
Conditional GANs
This conveys the fundamental idea only; the actual network is slightly more complicated.
The hidden representation (code), Z, is now generated by an encoder CNN.
The generator uses the code together with age-related information (a) to generate the images.
The discriminator takes in an image and the age-related information and decides whether the pair is real or fake.
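A sketch of the conditioning idea only, not the actual face-aging network: the age group is embedded and concatenated with the code for the generator, and with the image features for the discriminator. All sizes (6 age groups, 100-d code, 16×16 toy images) are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_age_groups, latent_dim = 6, 100   # illustrative assumptions

# Generator: (z, age) -> image
z_in = tf.keras.Input(shape=(latent_dim,))
age_in = tf.keras.Input(shape=(1,), dtype="int32")
age_emb = layers.Flatten()(layers.Embedding(num_age_groups, 16)(age_in))
h = layers.Concatenate()([z_in, age_emb])
h = layers.Dense(4 * 4 * 256, activation="relu")(h)
h = layers.Reshape((4, 4, 256))(h)
h = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(h)   # 8x8
img_out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(h)  # 16x16x3
cond_generator = tf.keras.Model([z_in, age_in], img_out)

# Discriminator: (image, age) -> real/fake score (logit)
img_in = tf.keras.Input(shape=(16, 16, 3))
age_in2 = tf.keras.Input(shape=(1,), dtype="int32")
x = layers.Conv2D(64, 4, strides=2, padding="same")(img_in)
x = layers.LeakyReLU(0.2)(x)
feat = layers.Flatten()(x)
age_emb2 = layers.Flatten()(layers.Embedding(num_age_groups, 16)(age_in2))
score = layers.Dense(1)(layers.Concatenate()([feat, age_emb2]))
cond_discriminator = tf.keras.Model([img_in, age_in2], score)
```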
Examples: Conditional GANs
Live Interactive Demos by NVIDIA Research
https://www.nvidia.com/en-us/research/ai-playground/
GAN: Image to Image translation
Image: Isola et al, “Image-to-image translation with conditional adversarial nets”, CVPR 2017
Image: Zhu et al, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”, ICCV 2017
GAN: Style Transfer
Text Generation
One-to-Many
Generating text/music, e.g. generating text similar to Shakespeare's writing.
Image: Sonnet 18 in the 1609 Quarto of Shakespeare's sonnets.
Given some text from Shakespeare's writing, generate novel sentences that look similar.
One-to-Many
Generating Shakespeare’s writing.
x⟨t⟩ is a one-hot vector with size equal to the number of characters.
ŷ⟨t⟩ is a softmax output with size equal to the number of characters.
[Diagram: unrolled RNN — hidden states a⟨0⟩ → a⟨1⟩ → a⟨2⟩ → ..., with input x⟨t⟩ and output ŷ⟨t⟩ at each time step]
All weights W_aa, W_ax, W_ya are shared across the RNN cells.
Text generation with an RNN
https://www.tensorflow.org/text/tutorials/text_generation
Inference: convert ŷ⟨t⟩ to a one-hot vector by sampling from it, and feed it in as the next input x⟨t+1⟩.
Speech Generation
WaveNet by Google is one of the most popular deep networks for speech generation.
According to Google: WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.
Recommended Reading: WaveNet: A generative model for raw audio
https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
Summary
Unsupervised learning: Deep generative models
AutoEncoders: interpretable latent space; allow inference of q(z|x); can provide useful feature representations for other tasks. Samples are blurrier and of lower quality compared to the state of the art.
GANs: take a game-theoretic approach; learn to generate from the training distribution through a two-player game. Produce beautiful, state-of-the-art samples, but are trickier and more unstable to train, and cannot solve inference queries such as p(x) or p(z|x).