Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 17, 2021
This section:
• LSTMs
• Sequence to sequence models
• Transformer models
• Attention
Readings: optional but recommended
• “Dive into Deep Learning” chapters 6.6, 8-8.4, 10.3-10.7
• This book is a free download on the web, and contains running code
Recurrent Neural Nets for Sequential Data
Recurrent Networks
• Key idea: recurrent network uses (part of) its state at t as input for t+1
[Figure: recurrent network, showing the nonlinearity applied to the hidden state at the previous time step; Goodfellow et al., 2016]
• Forward Pass
Slide Credit: Piotr Mirowski
• Backward Pass
* problem: vanishing and/or exploding gradients
Slide Credit: Piotr Mirowski
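To make the forward pass concrete, here is a minimal NumPy sketch (the weight names W_xh, W_hh, W_hy are illustrative, not from the slides):

import numpy as np

def rnn_forward(x_seq, h0, W_xh, W_hh, W_hy, b_h, b_y):
    """Simple recurrent net: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = softmax(W_hy h_t + b_y)."""
    h = h0
    outputs = []
    for x_t in x_seq:                               # one step per input token
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # new state reuses the previous hidden state
        scores = W_hy @ h + b_y
        e = np.exp(scores - scores.max())
        outputs.append(e / e.sum())                 # softmax over the output vocabulary
    return outputs, h

In the backward pass, gradients flow back through every repeated multiplication by W_hh, which is why they tend to vanish or explode over long sequences.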
• Learned hidden representations of context can be useful for:
• part of speech labeling
• sentiment analysis
• information extraction
•…
• Predict label for each word, instead of predicting next word
Slide Credit: Piotr Mirowski
Example: Opinion Mining
Label opinion segments by labeling each word. o = outside
b = beginning of segment
i = inside segment
Label:  o      b     i     i   i     i    o
        Trump  [has  come  a  long  way]  from
[Irsoy & Cardie, 2014]
h summarizes earlier words
Deep Bidirectional Recurrent Network
[Irsoy & Cardie, 2014]
Two additional ideas:
• Multiple layers to compute y from x
• A left-to-right RNN, plus right-to-left RNN
Example:
• Y label values {begin, inside, outside} for each word, to label contiguous text segments indicating opinions. [Irsoy & Cardie, 2014]
Long Short Term Memory
LSTMs
[Figure, built up across several slides: a chain of LSTM cells with inputs x1, x2, x3 and hidden states h1, h2, h3; the gates are applied by element-wise multiplication]
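A minimal sketch of one LSTM cell step, following the standard gate equations (the weight and gate names here are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold parameters for the input, forget, output, and candidate blocks."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell update
    c = f * c_prev + i * g        # element-wise multiplies gate the memory cell
    h = o * np.tanh(c)            # hidden state passed to the next time step / layer
    return h, c

The element-wise multiplies in the figure are the gates; the additive update of the cell state c is what lets gradients flow across many time steps.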
Bi-directional Recurrent Neural Networks
• Key idea: processing of word at position t can depend on following words too, not just preceding words
[Goodfellow et al., 2016]
Deep Bidirectional LSTM Network
[“Hybrid Speech Recognition with Deep Bidirectional LSTM,” Graves et al., 2013]
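As a sketch, a deep bidirectional LSTM tagger of this kind can be written in PyTorch as follows (module and size names are assumptions, not taken from the paper):

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Deep bidirectional LSTM that emits one label per time step (illustrative sizes)."""
    def __init__(self, input_dim, hidden_dim, num_layers, num_labels):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # forward and backward states are concatenated, hence 2 * hidden_dim
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, x):                 # x: (batch, time, input_dim)
        h, _ = self.lstm(x)               # h: (batch, time, 2 * hidden_dim)
        return self.out(h)                # per-step label scores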
Gated Recurrent Units (GRUs)
[Figure: GRU cell; its gates apply element-wise multiplications]
GRU:
• fewer parameters than LSTM
• found equally effective in some experiments involving
• speech recognition
• music analysis
see [Chung et al., 2014]
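For comparison with the LSTM sketch above, one GRU step (gate conventions vary slightly across papers; this is one common formulation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step: two gates plus a candidate state, and no separate memory cell."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])              # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                             # element-wise interpolation

Counting parameter blocks shows why the GRU is smaller: three weight blocks (z, r, h) versus the LSTM's four (i, f, o, g), and no separate cell state.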
Optional material – won’t be on exam
Sequence to Sequence Learning
Learned Representation
Encoder
Output Sequence
Decoder
Input Sequence
• RNN Encoder-Decoders for Machine Translation (Sutskever et al. 2014; Cho et al. 2014; Kalchbrenner et al. 2013, Srivastava et.al., 2015)
Seq2Seq. Encoder-Decoder Architecture
Output sequence y1 … yT′:  X Y Z Q
[Figure: encoder-decoder; the encoder compresses the input into a single vector v, from which the decoder generates the output]
Input sequence x1 … xT:  A B C D __ X Y Z
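A minimal PyTorch sketch of the encoder-decoder idea, with the encoder's final state serving as the learned representation v (all names and sizes are illustrative; no attention yet):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses the input into its final state; decoder LSTM generates from it."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim, hidden_dim):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        _, state = self.encoder(self.src_emb(src))            # state = (h, c): the representation v
        dec_h, _ = self.decoder(self.tgt_emb(tgt_in), state)   # teacher forcing during training
        return self.out(dec_h)                                 # per-step scores over target vocabulary

During training the decoder is fed the gold target shifted by one position (teacher forcing); at test time it feeds its own predictions back in.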
Sequence to Sequence Models
• machine translation
• text summarization
• text to computer command
•…
Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks
Problem: HS3 has to encode the entire input sequence…
Maybe do this?
Attention: encoder outputs a weighted avg. of encoder states, where weights depend on state of decoder.
[Pranay Dugar https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263 ]
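A sketch of that weighted average with simple dot-product scores (NumPy; shapes and names are illustrative):

import numpy as np

def attention_context(encoder_states, decoder_state):
    """encoder_states: (T, d), one row per input position; decoder_state: (d,)."""
    scores = encoder_states @ decoder_state          # one score per encoder position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # softmax: attention weights
    return weights @ encoder_states                  # weighted average of encoder states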
Transformer Architecture
Transformer architecture uses attention in three ways:
1. By decoder: attention on encoder states, based on decoder state
2. Inside encoder: replace RNN by self-attention across input tokens
3. Inside decoder: replace RNN by self-attention across output tokens
[Pranay Dugar https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263 ]
Scaled Dot Product Attention
Given:
• Set of (key, value) pairs, where
• ki is a key vector of dim dk
• vi is a value vector of dim dv
• Query q, which is a vector of dim dk
Return:
• Vector of dim dv which is a weighted sum of the vi
• where the weights are given by the Softmax of the dot products q·ki / sqrt(dk)
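A NumPy sketch for a batch of queries stacked into a matrix Q (names are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v). Returns (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # q · ki / sqrt(dk) for every pair
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of the values vi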
Multi-Head Attention
Transformer Architecture
[Figures from Vaswani et al., 2017: scaled dot-product attention producing result A(Q,K,V), multi-head attention, and the full Transformer architecture]
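Multi-head attention runs several scaled dot-product attentions in parallel over learned linear projections of Q, K, and V, then concatenates and projects the results. As a usage sketch (PyTorch's nn.MultiheadAttention implements this pattern; the sizes below are assumptions):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)                         # (batch, sequence length, model dimension)
out, attn_weights = mha(query=x, key=x, value=x)    # self-attention: Q = K = V = the inputs
print(out.shape)                                    # torch.Size([2, 10, 512])

Here query, key, and value are all the same tensor, which is the self-attention use inside the encoder and decoder; for the decoder's attention on the encoder, the queries come from decoder states and the keys/values from encoder states.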
Yields Significant Improvement in Machine Translation
More details:
See http://nlp.seas.harvard.edu/2018/04/03/attention.html
• Multiple sine waves added as positional encoding of input tokens (see the sketch after this list)
• Dropout during training at every layer
• Layer-norm
• ADAM optimizer with learning rate warmup (warmup + exponential decay in learning rate)
• Auto-regressive decoding with beam search and length biasing
•…
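A sketch of the sinusoidal positional encoding mentioned in the list above, following the sin/cos formulation of the Transformer paper (function name is illustrative; assumes an even d_model):

import numpy as np

def positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix that is added to the token embeddings."""
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)  # different frequency per dimension
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
    return pe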
How to Think About Transformers?
Attention: encoder outputs a weighted avg. of encoder states, where weights depend on state of decoder.
General program schema
• “Decoder” outputs a sequence of tokens
• Based on its perceived input, plus what it has already output
• With attention mechanism to focus on relevant subset of its input and output
• Learned parameters define both attention and operations it performs
BERT: Bidirectional Encoder Representations from Transformers
Goal: Trained model that will produce generally useful encodings of arbitrary text
• Uses transformer architecture
• Bidirectional attention across entire input
• Accept input sequences with multiple sentences
– (e.g., question/answer pairs)
– Special token [CLS] indicates the beginning of the sequence; the output embedding for this token represents the whole sequence for classification tasks
– Special token [SEP] indicates beginning of new sentence
• Train by
– masking out 15% of words, and predicting them
– classify whether second sentence actually follows the first sentence
• True in 50% of cases
• 24 layers deep, hidden unit dim = 1024, 16 self-attention heads, 340M trained parameters (BERT-large)
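A simplified sketch of how the two pre-training tasks shape a training example (plain Python; the exact masking scheme in the BERT paper is slightly more refined):

import random

def make_mlm_example(tokens_a, tokens_b, mask_prob=0.15):
    """Build one BERT-style input: [CLS] sentence A [SEP] sentence B [SEP],
    with roughly 15% of the word tokens replaced by [MASK]."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    inputs, targets = [], []
    for tok in tokens:
        if tok not in ('[CLS]', '[SEP]') and random.random() < mask_prob:
            inputs.append('[MASK]')
            targets.append(tok)            # the model must predict the original token here
        else:
            inputs.append(tok)
            targets.append(None)           # no prediction loss at unmasked positions
    return inputs, targets

# Next-sentence prediction: with probability 0.5, tokens_b is the true next sentence
# (label 1); otherwise it is a random sentence from the corpus (label 0).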
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
* “segment embedding” is a learned embedding indicating either sentence A or sentence B
BERT: Bidirectional Encoder Representations from Transformers
Goal: Trained model that will produce generally useful encodings of words and sentences
Fine tuning for new tasks:
• Adding an output layer for new tasks (Q/A, textual entailment, sentiment analysis, equivalence of two questions, …), then fine-tuning by further training, advances state-of-the-art performance on many language tasks
• Can “freeze” the 340M trained parameters, and fine-tune by training only the new output layer
• Or, fine tune end-to-end, tuning all parameters
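A PyTorch sketch of the "freeze the pretrained parameters and train only a new output layer" option (the encoder argument stands in for the pretrained BERT model and is assumed to return a (batch, sequence, hidden) tensor; names are illustrative):

import torch.nn as nn

class FineTunedClassifier(nn.Module):
    """New output layer on top of a pretrained encoder's [CLS] embedding."""
    def __init__(self, pretrained_encoder, hidden_dim=1024, num_classes=2, freeze=True):
        super().__init__()
        self.encoder = pretrained_encoder
        if freeze:
            for p in self.encoder.parameters():
                p.requires_grad = False            # keep the 340M pretrained parameters fixed
        self.head = nn.Linear(hidden_dim, num_classes)   # the only newly trained layer

    def forward(self, tokens):
        cls_embedding = self.encoder(tokens)[:, 0, :]    # position 0 is the [CLS] token
        return self.head(cls_embedding)

Setting freeze=False and using a small learning rate gives the end-to-end fine-tuning option instead.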
COMET
Demo: https://mosaickg.apps.allenai.org/comet_atomic
[Bosselut et al., 2019]
Programming Frameworks for Deep Nets
• Pytorch (Facebook)
• TensorFlow (Google)
• TFLearn (runs on top of TensorFlow, but simpler to use)
• Theano (University of Montreal)
• CNTK (Microsoft)
• Keras (can run on top of Theano, CNTK, TensorFlow)
•…
Many support use of Graphics Processing Units (GPUs), a major factor in the dissemination of deep network technology.
TensorFlow example
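As an illustration of what such an example looks like, a minimal tf.keras sketch (a small fully connected classifier; the model and data details are assumptions, not from the slide):

import tensorflow as tf

# Small fully connected classifier on 28x28 inputs (e.g., MNIST-style data)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),   # probabilistic output layer
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5) would then train on labeled data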
Modern Deep Networks: 2021 vs 1987
• vastly more online data
• GPU’s, TPU’s
• Heterogeneous units
– ReLU, sigmoid, tanh, linear
• including units composed from other units
– LSTM, GRU, Attention, …
• many new architectures
– 100 layers deep, bidirectional LSTMs, Convolutional nets, Transformers…
• New ideas for gradient descent
– dropout, batch normalization, Adagrad, layer normalization, …
• unification with probabilistic models – train to output probabilities
• frameworks like TensorFlow
• online text with code: Dive into Deep Learning
What you should know:
• Representation learning
– Hidden layers re-represent inputs in form to predict outputs
– Autoencoders
– Sometimes reused widely (e.g., word2vec word embeddings)
• Convolutional neural networks
– Convolution provides translation invariance
– Network stages with decreasing spatial resolution, multiple channels, …
• Recurrent neural networks
– Learn to represent history in time series
– Backpropagation as unfolding in time
– LSTM memory units
• Neural architectures
– Shared parameters across multiple computations
– Layers with different structures/functions
– RNN’s, Seq2Seq, Transformer, …
– Probabilistic classification → output Softmax layer