Introduction to Deep Learning
Angelica Sun
(adapted from Atharva Parulekar, Jingbo Yang)
Overview
Motivation for deep learning
Convolutional neural networks
Recurrent neural networks
Transformers
Deep learning tools
But we learned the multi-layer perceptron in class?
Expensive to learn. Will not generalize well.
Does not exploit the order and local relations in the data!
A 64x64x3 image has 12288 inputs, so each first-layer neuron alone needs 12288 weights
We also want many layers
What are areas of deep learning?
Convolutional NN
Image
Recurrent NN
Sequential Inputs
Deep RL
Control System
Graph NN
Networks/Relational
Transformers
Parallelized Sequential Inputs
Starting from CNN
Convolutional
Neural Network
Let us look at images in detail
Filters in traditional Computer Vision
Image credit: https://home.ttic.edu/~rurtasun/courses/CV/lecture02.pdf
Learning filters in CNN
Why not extract features using filters?
Better, why not let the data dictate what filters to use?
Learnable filters!!
Convolution on multiple channels
Images are generally RGB!
How would a filter work on an image with RGB channels?
The filter should also have 3 channels.
Now the output has a channel for every filter we have used.
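A minimal sketch of the multi-channel case, in plain Python for readability. The function name `conv2d_multichannel` and the toy values are assumptions for illustration, and the loop computes cross-correlation with no padding and stride 1, which is what deep learning libraries usually call convolution.

```python
def conv2d_multichannel(image, filters):
    """Slide every filter over every valid position of a multi-channel image.

    image:   list of C channels, each an H x W list of lists
    filters: list of F filters, each a list of C kernels of size k x k
    returns: list of F output channels, each (H-k+1) x (W-k+1)
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0][0])
    out = []
    for f in filters:
        assert len(f) == C, "a filter needs one kernel per input channel"
        channel = []
        for i in range(H - k + 1):
            row = []
            for j in range(W - k + 1):
                s = 0.0
                # Sum over all input channels and all kernel positions.
                for c in range(C):
                    for di in range(k):
                        for dj in range(k):
                            s += image[c][i + di][j + dj] * f[c][di][dj]
                row.append(s)
            channel.append(row)
        out.append(channel)
    return out

# Toy 3-channel 4x4 image: channel c is filled with the value c + 1.
image = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(3)]
# Two made-up filters, each with 3 channels of size 2x2.
all_ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
red_only = [[[1.0, 1.0], [1.0, 1.0]],
            [[0.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 0.0]]]

out = conv2d_multichannel(image, [all_ones, red_only])
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 3
print(out[0][0][0], out[1][0][0])             # 24.0 4.0
```

Two filters in, two output channels out: each filter collapses the 3 input channels into one number per position.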
Parameter Sharing
The fewer the parameters, the less computationally intensive the training. This is a win-win, since the same filter weights are reused at every position.
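To put rough numbers on the saving, here is a small count for a 64x64x3 input. The layer sizes (100 hidden units, 16 filters of size 5x5) are illustrative assumptions, not values from the slides.

```python
def dense_params(n_inputs, n_units):
    """Fully connected layer: one weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Convolutional layer: each filter is shared across all image positions."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

n_inputs = 64 * 64 * 3
print(n_inputs)                      # 12288
print(dense_params(n_inputs, 100))   # 1228900
print(conv_params(5, 5, 3, 16))      # 1216
```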
Translational invariance
Since we train filters to detect cats and then slide these filters over the data, a differently positioned cat will also be detected by the same set of filters.
Visualizing learned filters
Images that maximize filter outputs at certain layers. The images get more complex as the filters sit deeper in the network.
Deeper layers can learn more abstract embeddings: an eye is made up of multiple curves, and a face is made up of two eyes.
A typical CNN structure:
Image credit: LeCun et al. (1998)
Convolution really is just a linear operation
In fact, convolution can be written as one giant matrix multiplication.
We can flatten the 2-dimensional image into a vector and expand the conv operation into a matrix.
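A small check of this claim on a 3x3 image and a 2x2 kernel. The helper names `conv2d_valid` and `conv_as_matrix` are made up for this sketch, and the setting is the simplest one (no padding, stride 1).

```python
def conv2d_valid(img, ker):
    """Direct sliding-window convolution (cross-correlation), no padding."""
    H, W, k = len(img), len(img[0]), len(ker)
    return [[sum(img[i + di][j + dj] * ker[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]

def conv_as_matrix(ker, H, W):
    """Build the matrix M such that M @ flatten(img) equals the flattened conv output."""
    k = len(ker)
    out_h, out_w = H - k + 1, W - k + 1
    M = [[0.0] * (H * W) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            for di in range(k):
                for dj in range(k):
                    # Output pixel (i, j) reads input pixel (i+di, j+dj).
                    M[i * out_w + j][(i + di) * W + (j + dj)] = ker[di][dj]
    return M

img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
ker = [[1.0, 0.0],
       [0.0, -1.0]]

direct = conv2d_valid(img, ker)
M = conv_as_matrix(ker, 3, 3)                    # 4 rows, 9 columns
flat = [v for row in img for v in row]           # image as a length-9 vector
via_matrix = [sum(m * x for m, x in zip(row, flat)) for row in M]

print(direct)      # [[-4.0, -4.0], [-4.0, -4.0]]
print(via_matrix)  # [-4.0, -4.0, -4.0, -4.0]
```

Every row of M holds the same kernel weights at shifted positions, so the matrix is mostly zeros.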
SOTA Example – Detectron2
How do we learn?
Instead of
They are “optimizers”
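Momentum: gradient step plus a velocity term accumulated from past gradients
Nesterov: momentum, with the gradient evaluated at the look-ahead point
Adagrad: normalize by the root of the running sum of squared gradients
RMSprop: normalize by the root of a moving average of squared gradients
Adam: RMSprop + momentum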
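The update rules can be sketched on a one-parameter toy problem. The loss f(w) = (w - 3)^2, the learning rates, and the step counts below are assumptions picked for illustration, and the formulas follow one common textbook form of each optimizer (libraries differ in small details).

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=200, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)            # velocity accumulates past gradients
        w -= lr * v
    return w

def nesterov(w, steps=200, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        g = grad(w - lr * beta * v)       # gradient at the look-ahead point
        v = beta * v + g
        w -= lr * v
    return w

def adagrad(w, steps=200, lr=0.5, eps=1e-8):
    g2_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g                   # running sum of squared gradients
        w -= lr * g / (math.sqrt(g2_sum) + eps)
    return w

def rmsprop(w, steps=200, lr=0.05, rho=0.9, eps=1e-8):
    g2_avg = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_avg = rho * g2_avg + (1 - rho) * g * g   # moving average instead of sum
        w -= lr * g / (math.sqrt(g2_avg) + eps)
    return w

def adam(w, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g         # momentum-style average of gradients
        v = b2 * v + (1 - b2) * g * g     # RMSprop-style average of squared gradients
        m_hat = m / (1 - b1 ** t)         # bias correction for the zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

for name, opt in [("momentum", sgd_momentum), ("nesterov", nesterov),
                  ("adagrad", adagrad), ("rmsprop", rmsprop), ("adam", adam)]:
    print(name, opt(0.0))   # each should end up close to 3
```

RMSprop with a fixed learning rate tends to hover near the minimum rather than settle exactly on it, so expect its result to be a little noisier than the others.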
Mini-batch Gradient Descent
Expensive to compute gradient for large dataset
Memory size
Compute time
Mini-batch: takes a sample of training data
How do we sample intelligently?
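One common answer, sketched below, is to shuffle the data once per epoch and slice it into batches, so every example is seen exactly once per pass. The generator name `minibatches` and the toy data are assumptions for illustration.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover the data once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)                          # new random order each epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)                            # fixed seed so runs are repeatable
data = list(range(10))                            # stand-in for 10 training examples
batches = list(minibatches(data, 4, rng))
print([len(b) for b in batches])                  # [4, 4, 2]
```

The last batch is smaller when the dataset size is not a multiple of the batch size; some training loops drop it instead.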
Is deeper better?
Deeper networks seem to be more powerful but harder to train.
Loss of information during forward propagation
Loss of gradient info during back propagation
There are many ways to “keep the gradient going”
One Solution: skip connection
Connect the layers, create a gradient highway or information highway.
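A toy illustration of the gradient highway. Each "layer" here is just the scalar map y = w * x, which is a deliberately simplified assumption: the point is only that adding the input back changes the local derivative from w to 1 + w.

```python
def plain_chain_grad(w, depth):
    """d(output)/d(input) for `depth` stacked layers y = w * x."""
    g = 1.0
    for _ in range(depth):
        g *= w                  # chain rule: multiply local derivatives
    return g

def residual_chain_grad(w, depth):
    """Same stack, but each layer is y = x + w * x (skip connection)."""
    g = 1.0
    for _ in range(depth):
        g *= 1.0 + w            # the identity path contributes the 1
    return g

print(plain_chain_grad(0.1, 10))      # about 1e-10: the gradient has vanished
print(residual_chain_grad(0.1, 10))   # about 2.59: the gradient still flows
```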
ResNet (2015)
Image credit: He et al. (2015)
Initialization
Can we initialize all neurons to zero?
If all the weights are the same, we cannot break the symmetry of the network, and all filters end up learning the same thing.
Large numbers might knock ReLU units out.
Once a ReLU unit is knocked out and its output is zero, its gradient is also zero, so it stops learning.
We need small random numbers at initialization.
Standard deviation: 1/sqrt(n) (variance 1/n), with n the number of inputs to the neuron
Mean: 0
Popular initialization setups
(Xavier, Kaiming) (Uniform, Normal)
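A plain-Python sketch of the normal variants of both setups, using the standard formulas std = sqrt(2 / (fan_in + fan_out)) for Xavier and std = sqrt(2 / fan_in) for Kaiming with ReLU. The function names and the 256x256 layer size are assumptions for illustration; real frameworks ship their own versions.

```python
import math
import random

def xavier_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def kaiming_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with std = sqrt(2 / fan_in), meant for ReLU layers."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def empirical_std(W):
    """Sample standard deviation of all entries of a weight matrix."""
    flat = [w for row in W for w in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))

rng = random.Random(0)
W_xavier = xavier_normal(256, 256, rng)
W_kaiming = kaiming_normal(256, 256, rng)
print(empirical_std(W_xavier), math.sqrt(2.0 / 512))   # both near 0.0625
print(empirical_std(W_kaiming), math.sqrt(2.0 / 256))  # both near 0.0884
```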
Dropout
What does cutting off some network connections do?
Trains multiple smaller networks in an ensemble.
Can drop entire layer too!
Acts like a really good regularizer
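A minimal sketch of dropout on a list of activations. It uses the "inverted" form (survivors are scaled up during training so nothing changes at test time), which is a common convention and an assumption here, since the slides do not say which variant is meant.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)          # at test time the layer is the identity
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10000
out = dropout(x, 0.5, rng)
print(sum(out) / len(out))                            # close to 1.0 on average
print(dropout([1.0, 2.0, 3.0], 0.5, rng, training=False))  # [1.0, 2.0, 3.0]
```

Each training step samples a different mask, which is where the "ensemble of smaller networks" picture comes from.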
More tricks for training
Data augmentation if your data set is small. This helps the network generalize better.
Early stopping when validation loss starts rising while training loss keeps falling.
Random hyperparameter search or grid search?
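Early stopping can be sketched as a patience counter on the validation loss. The function name `early_stopping_epoch`, the patience value, and the loss numbers below are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch whose weights we would keep.

    Training stops once validation loss has failed to improve
    for `patience` epochs in a row.
    """
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                break                                   # give up, keep the best epoch
    return best_epoch

val = [0.90, 0.70, 0.60, 0.62, 0.65, 0.55]   # hypothetical validation losses
print(early_stopping_epoch(val, patience=2))  # 2
print(early_stopping_epoch(val, patience=4))  # 5
```

A small patience stops at epoch 2 and misses the later improvement; a larger one waits long enough to find it. Patience is itself a hyperparameter.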
CNN sounds like fun!
What are some other areas of deep learning?
Recurrent NN
Sequential data
Convolutional NN
Deep RL
Graph NN
We can also have 1D architectures (remember this)
CNN works on any data where there is a local pattern
We use 1D convolutions on DNA sequences, text sequences, and music notes
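A toy 1D example on a DNA string. The filter below is set by hand to match the motif "TAT" purely for illustration; in a CNN the filter weights would be learned. The helper names are assumptions.

```python
def one_hot_dna(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors (A, C, G, T)."""
    alphabet = "ACGT"
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]

def conv1d(x, kernel):
    """x: L x C sequence, kernel: k x C filter. Returns L-k+1 scores (no padding)."""
    k, C = len(kernel), len(kernel[0])
    return [sum(x[i + d][c] * kernel[d][c] for d in range(k) for c in range(C))
            for i in range(len(x) - k + 1)]

motif_filter = one_hot_dna("TAT")             # hand-set filter of width 3
scores = conv1d(one_hot_dna("GGTATACC"), motif_filter)
print(scores)                                  # [1.0, 0.0, 3.0, 0.0, 2.0, 0.0]
print(scores.index(max(scores)))               # 2, where "TAT" starts
```

The score at each position counts how many bases in the window agree with the motif, so the local pattern lights up wherever it occurs.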
But what if the time series has a causal dependency or any other kind of sequential dependency?
To address sequential dependency?
Use recurrent neural network (RNN)
Step output
Latent Output
Input at one time step
RNN Cell
Unrolling an RNN
The RNN cell (composed of Wxh and Whh in this example) is the same cell at every time step.
NOT many different cells, unlike the filters of a CNN.
How does an RNN produce a result?
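Unrolling can be sketched as a loop that calls one cell with the same Wxh and Whh at every step. The tanh nonlinearity, the bias `bh`, and all the numbers are assumptions for illustration (a vanilla RNN cell).

```python
import math

def rnn_step(x_t, h_prev, Wxh, Whh, bh):
    """One step of a vanilla RNN: h_t = tanh(Wxh x_t + Whh h_prev + bh)."""
    hidden = len(bh)
    return [math.tanh(sum(Wxh[i][j] * x_t[j] for j in range(len(x_t)))
                      + sum(Whh[i][j] * h_prev[j] for j in range(hidden))
                      + bh[i])
            for i in range(hidden)]

def rnn_unroll(xs, Wxh, Whh, bh):
    """Run the same cell over the whole sequence and return every hidden state."""
    h = [0.0] * len(bh)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, Wxh, Whh, bh)   # same weights at every time step
        states.append(h)
    return states

# Made-up weights: 2 input features, 2 hidden units.
Wxh = [[0.5, -0.3], [0.1, 0.8]]
Whh = [[0.2, 0.0], [-0.4, 0.3]]
bh = [0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_unroll(xs, Wxh, Whh, bh)
reversed_states = rnn_unroll(xs[::-1], Wxh, Whh, bh)
print(states[-1])            # summary of the sequence in its original order
print(reversed_states[-1])   # a different summary: order matters
```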
I love CS !
Result after reading full sentence
Evolving “embedding”
2 Typical RNN Cells
Long Short Term Memory (LSTM)
Gated Recurrent Unit (GRU)
Store in “long term memory”
Response to current input
Update gate
Reset gate
Response to current input
Recurrent AND deep?
Taking last value
Pay “attention” to everything
Stacking
Attention Model
Transformer – Attention is All You Need!
Originally proposed for translation.
Encoder computes hidden representations for each word in the input sentence
Applies self attention.
Decoder makes sequential predictions, similar to an RNN
At each time step, it predicts the next word based on its previous predictions (partial sentence).
Applies self attention and attention on encoder outputs.
Transformer – Attention is All You Need!
The dot product inside the softmax below measures how strongly each word of sequence 1 (Q) is influenced by every word of sequence 2 (K).
Using these importance weights, we compute a weighted sum of the information in sequence 2 (V) and use it to build the hidden representation of sequence 1.
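A plain-Python sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, as defined in the "Attention is All You Need" paper. The toy Q, K, V values are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention. Q: n x d_k, K: m x d_k, V: m x d_v."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # How much does this query word care about each key word?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[0.0, 5.0]]                 # one query, pointing along the second axis
K = [[1.0, 0.0], [0.0, 1.0]]     # two keys
V = [[10.0, 0.0], [0.0, 10.0]]   # two values
outputs, weights = attention(Q, K, V)
print(weights[0])   # most of the weight goes to the second key
print(outputs[0])   # so the output is close to the second value
```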
Transformer – Attention is All You Need!
Multiple heads!
- Similar to how you have multiple filters in a CNN
Loss of sequential order?
- Positional encoding! (often sine and cosine waves)
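A sketch of the sinusoidal encoding from the "Attention is All You Need" paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the tiny sizes are assumptions for illustration.

```python
import math

def positional_encoding(n_positions, d_model):
    """Return an n_positions x d_model table of sinusoidal position codes."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):            # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(4, 4)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0]
print(pe[1])   # a different code for position 1
```

These vectors are added to the word embeddings, so two identical words at different positions no longer look the same to the attention layers.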
Examples of attention scores from two different self-attention heads.
References:
https://arxiv.org/pdf/1706.03762.pdf
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://towardsdatascience.com/transformers-141e32e69591
https://towardsdatascience.com/transformers-explained-visually-part-2-how-it-works-step-by-step-b49fa4a64f34
SOTA Example – GPT3
SOTA Example – DALLE
More? Take CS230, CS236, CS231N, CS224N
Convolutional NN
Image
Recurrent NN
Time Series
Deep RL
Control System
Graph NN
Networks/Relational
Not today, but take CS234 and CS224W
Convolutional NN
Image
Recurrent NN
Time Series
Deep RL
Control System
Graph NN
Networks/Relational
Tools for deep learning
Popular Tools
Specialized Groups
$50 not enough! Where can I get free stuff?
Google Colab
Free (limited-ish) GPU access
Works nicely with TensorFlow
Links to Google Drive
Register a new Google Cloud account
=> Instant $300??
=> AWS free tier (limited compute)
=> Azure education account, $200?
To SAVE money
CLOSE your GPU instance
~$1 an hour
Azure Notebook
Kaggle kernel???
Amazon SageMaker?
https://colab.research.google.com/drive/1Enc-pKlP4Q3cimEBfcQv0B_6hUvjVL3o
Good luck!
Well, have fun too 😀