COMP5046
Natural Language Processing
Lecture 12: Pretrained Model
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
LECTURE PLAN Lecture 12: Pretrained Model
1. The Rise of the Pre-trained Model
2. BERT
3. Post BERT
4. Multimodal Pretrained Model
The Rise of the Pre-trained Model Pre-training and Transfer Learning
Computer vision has proven the value of transfer learning:
• pre-training a neural network on a known task (e.g. ImageNet)
• using the trained neural network as the basis of a new purpose-specific model
• fine-tuning it on the target task
The Rise of the Pre-trained Model
Pre-training and Transfer Learning in NLP
Popular Pre-trained Model in NLP
Using contextual word representations (Peters et al., 2018; Devlin et al., 2018)
The Rise of the Pre-trained Model Before we started…
Word Structure and subword models
We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.
Word → Vocab mapping
computer → desk (index)          [common word]
play → cute (index)              [common word]
cooooooooooool → UNK (index)     [variation]
laern → UNK (index)              [misspelling]
Transformerify → UNK (index)     [novel item]
The Rise of the Pre-trained Model Before we started…
Word Structure and subword models
Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.
The Rise of the Pre-trained Model Before we started…
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)
Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary.
1. Start with a vocabulary containing only characters and an “end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.
The Rise of the Pre-trained Model The byte-pair encoding algorithm (1994)
The byte-pair encoding algorithm: how does it work?
A simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data
aaabdaaabac → ZabdZabac → ZYdZYac → XdXac
Replace Z = aa
Replace Y = ab
Replace X = ZY

Byte pair → Replacement
ZY → X
ab → Y
aa → Z
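Below is a minimal Python sketch of this compression loop (not from the original slides; the helper name is illustrative). When two pairs tie in frequency the merge choice is arbitrary, so the intermediate strings may differ from the slide, although the data shrinks in the same way.

```python
# Minimal sketch of byte-pair compression on the slide's example string.
from collections import Counter

def most_common_pair(symbols):
    """Return the most frequent pair of adjacent symbols."""
    counts = Counter(zip(symbols, symbols[1:]))
    return max(counts, key=counts.get)

data = list("aaabdaaabac")
for new_symbol in "ZYX":                         # three "bytes" that do not occur in the data
    a, b = most_common_pair(data)
    replaced, i = [], 0
    while i < len(data):
        if i + 1 < len(data) and (data[i], data[i + 1]) == (a, b):
            replaced.append(new_symbol)          # the pair is replaced by the new byte
            i += 2
        else:
            replaced.append(data[i])
            i += 1
    data = replaced
    print(f"Replace {new_symbol} = {a}{b} -> {''.join(data)}")
```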
The Rise of the Pre-trained Model
The byte-pair encoding algorithm in NLP
Traditional Word Encoding in NLP
Dictionary
#vocab: occurrence
low: 5, lower: 2, newest: 6, widest: 3
Vocabulary
low, lower, newest, widest
What if we have the word ‘lowest’ in the test set?
OOV Issue
The Rise of the Pre-trained Model The byte-pair encoding algorithm in NLP
Subword Segmentation
Character/unicode→Vocabulary (Bottom up style)
→The most common pair of consecutive bytes of data is replaced with a byte
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2,
n e w e s t: 6, w i d e s t: 3
Character-based segmentation
Vocabulary
l, o, w, e, r, n, s, t, i, d
The Rise of the Pre-trained Model The byte-pair encoding algorithm in NLP
Subword Segmentation – 1st round
Character/unicode→Vocabulary (Bottom up style)
→The most common pair of consecutive bytes of data is replaced with a byte
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2,
n e w e s t: 6, w i d e s t: 3
Character-based segmentation
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2, n e w es t: 6, w i d es t: 3
Vocabulary
l, o, w, e, r, n, s, t, i, d
Replace es = e s
Vocabulary
l, o, w, e, r, n, s, t, i, d, es
The Rise of the Pre-trained Model The byte-pair encoding algorithm in NLP
Subword Segmentation – 2nd round
Character/unicode→Vocabulary (Bottom up style)
→The most common pair of consecutive bytes of data is replaced with a byte
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2, n e w es t: 6, w i d es t: 3
Character-based segmentation
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2, n e w est: 6, w i d est: 3
Vocabulary
l, o, w, e, r, n, s, t, i, d, es
Replace est = es t
Vocabulary
l, o, w, e, r, n, s, t, i, d, es, est
The Rise of the Pre-trained Model The byte-pair encoding algorithm in NLP
Subword Segmentation – 3rd round
Character/unicode→Vocabulary (Bottom up style)
→The most common pair of consecutive bytes of data is replaced with a byte
Dictionary
#vocab: occurrence
l o w: 5,
l o w e r: 2, n e w est: 6, w i d est: 3
Character-based segmentation
Dictionary
#vocab: occurrence
lo w: 5,
lo w e r: 2, n e w est: 6, w i d est: 3
Vocabulary
l, o, w, e, r, n, s, t, i, d, es, est
Replace lo = l o
Vocabulary
l, o, w, e, r, n, s, t, i, d, es, est, lo
Repeat this process until the 10th round.
The Rise of the Pre-trained Model The byte-pair encoding algorithm in NLP
Subword Segmentation – Final (after 10th round)
Character/unicode→Vocabulary (Bottom up style)
→The most common pair of consecutive bytes of data is replaced with a byte
Dictionary
#vocab: occurrence
low: 5, low e r: 2, newest: 6, widest: 3
Vocabulary
l, o, w, e, r, n, s, t, i, d, es, est, lo, low, ne, new, newest, wi, wid, widest
What if we have the word ‘lowest’ in the test set? It can now be segmented as ‘low’ + ‘est’, so there is no OOV issue (see the sketch below).
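The walkthrough above can be reproduced with a short Python sketch (not part of the original slides; helper names such as get_pair_counts and segment are illustrative, in the spirit of Sennrich et al.'s subword BPE). It learns the ten merges on the toy dictionary and then segments the unseen word 'lowest' with them:

```python
# Sketch of BPE merge learning on the slide's toy corpus.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Rewrite every word so that the chosen pair becomes a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# character-based segmentation of the toy dictionary
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(10):                              # 10 merge rounds, as on the slides
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = apply_merge(best, vocab)
    merges.append(best)

# merges learned (ties broken by first occurrence):
# es, est, lo, low, ne, new, newest, wi, wid, widest

def segment(word, merges):
    """Greedily apply the learned merges to an unseen word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (a, b):
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

print(segment("lowest", merges))                 # ['low', 'est'] -> no OOV token needed
```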
The Rise of the Pre-trained Model Before we started…
Word Structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components.
In the worst case, words are split into as many subwords as they have characters.

Word → Vocab mapping
computer → computer                    [common word]
play → play                            [common word]
coooooool → coo##ooo##ool              [variation]
laern → la##ern##                      [misspelling]
Transformerify → Transformer##ify      [novel item]
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
Popular Pre-trained Model: Word Embeddings
Word embeddings (e.g. word2vec) are the basis of deep learning for NLP
king [-0.5, -0.9, 1.4, …] queen [0.7, 0.2, -0.5, 1.1, …]
Word embeddings (word2vec, GloVe) are often pre-trained on a text corpus using co-occurrence statistics.
[Figure: the embedding of 'king' is scored by inner product against its context 'The king wore a crown'; likewise 'queen' against 'The queen wore a crown']
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
Popular Pre-trained Model: Contextual Representations
Word embeddings (e.g. word2vec, fastText, GloVe) are applied in a context-free manner:
'Step up to the bat' → bat = [0.7, 0.2, -0.5, 1.1, …]
'A vampire bat' → bat = [0.7, 0.2, -0.5, 1.1, …]   (same vector in both contexts)
We instead need to train contextual representations on a text corpus:
'Step up to the bat' → bat = [1.1, -0.7, 0.8, 2.1, …]
'A vampire bat' → bat = [0.3, 0.5, -0.9, 1.3, …]   (different vector per context)
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
Early-Stage Pretraining in NLP
Semi-supervised Sequence Learning (Dai and Le, 2015)
Train LSTM Language Model
Fine-tune on a classification task (e.g. sentiment analysis)
[Diagram: an LSTM language model is trained to predict the next word in 'call me baby'; the same LSTM is then fine-tuned to classify 'very beautiful name' as POSITIVE]
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextual Word Embeddings (Peters et al., 2018)
Train separate left-to-right and right-to-left language models, then apply their hidden states as "pretrained embeddings" in a downstream classification model.
[Diagram: a forward LSTM LM and a backward LSTM LM over 'call me baby', feeding a classification model]
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextual Word Embeddings (Peters et al., 2018)
ELMo provided a significant step towards pre-training in the context of NLP. Let's dig into what ELMo's big secret is!
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextual Word Embeddings (Peters et al., 2018)
ELMo gained its language understanding from being trained to predict the next word in a sequence of words (the language modeling task). This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
The Rise of the Pre-trained Model
Pre-training and Transfer Learning in NLP
ELMo: Deep Contextual Word Embeddings (Peters et al., 2018)
We can see the hidden state of each unrolled LSTM step peeking out from behind ELMo's head. These come in handy in the embedding process after this pre-training is done.
ELMo goes a step further and trains a bi-directional LSTM, so that its language model not only has a sense of the next word, but also of the previous word.
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextual Word Embeddings (Peters et al., 2018)
ELMo produces the contextualized embedding by grouping together the hidden states (and the initial embedding) in a certain way: concatenation followed by a weighted summation.
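For reference, the combination defined in Peters et al. (2018) is a task-specific weighted sum over the L biLSTM layers plus the token embedding:

```latex
\mathrm{ELMo}_k^{\,task} \;=\; \gamma^{\,task} \sum_{j=0}^{L} s_j^{\,task}\, \mathbf{h}_{k,j}^{\,LM}
```

where h_{k,0} is the token embedding, h_{k,j} (j ≥ 1) concatenates the forward and backward LSTM states at layer j, the s_j are softmax-normalised learned weights, and γ is a learned task-specific scalar.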
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
Early-Stage Pretraining in NLP
Improving Language Understanding by Generative Pre-Training (2018)
Train Deep (12 layer) Transformer Language Model
Fine-tune on a classification task (e.g. sentiment analysis)
[Diagram: the 12-layer Transformer LM predicts the next word in 'call me baby'; after fine-tuning, 'very beautiful name' is classified as POSITIVE]
The Rise of the Pre-trained Model Transformer (Recap)
1. Encoder
A stack of N = 6 identical layers, each with two sub-layers:
1. Multi-head self-attention mechanism
2. Position-wise fully connected feed-forward network
2. Decoder
A stack of N = 6 identical layers, each with three sub-layers:
1. Masked multi-head self-attention mechanism
2. Multi-head attention over the encoder output
3. Position-wise fully connected feed-forward network
[Figure: the Transformer model architecture (encoder and decoder stacks)]
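As a reminder, the attention used throughout is the scaled dot-product / multi-head attention of Vaswani et al. (2017):

```latex
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad
\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O},
\quad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
```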
The Rise of the Pre-trained Model Transformer (Recap)
Multi-head attention
• Models context
Feed-forward layers
• Compute non-linear hierarchical features
Layer norm and residuals
• Keep training of deep networks stable
Positional embeddings
• Allow the model to learn relative positioning
[Figure: the Transformer encoder, built around multi-head self-attention]
The Rise of the Pre-trained Model Transformer (Recap)
The same encoder block components as above; blocks are repeated 6 or more times (in a vertical stack).
[Figure: the Transformer encoder]
The Rise of the Pre-trained Model Transformer vs. LSTM
1. Self-attention == no locality bias
• Long-distance context has "equal opportunity"
2. Single multiplication per layer == efficiency on TPU
• Effective batch size is the number of words, not the number of sequences
[Figure: the Transformer encoder]
BERT
BERT
Problem with Previous Approaches
Problem: Language models only use left context or right context, but language understanding is bidirectional
Unidirectional context: build the representation incrementally
Bidirectional context: words can "see themselves"
[Diagram: a unidirectional LSTM LM over 'call me baby' vs. a bidirectional LSTM in which each word can see itself]
BERT
How BERT works (a brief overview)
BERT
Pre-training and Transfer Learning in NLP
BERT: Input Representation
BERT
Word-piece Token Embedding
Word Structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components.
In the worst case, words are split into as many subwords as they have characters.

Word → Vocab mapping
computer → computer                    [common word]
play → play                            [common word]
coooooool → coo##ooo##ool              [variation]
laern → la##ern##                      [misspelling]
Transformerify → Transformer##ify      [novel item]
BERT
Pretraining for BERT
Remember GPT? Pretraining the Language Model
BERT
Pretraining: Masked Language Model (LM)
Mask out k% of the input words, and then predict the masked words (use k=15%)
The man went to the [MASK] to buy a [MASK] of milk → predict: store, gallon
• Too little masking: Too expensive to train
• Too much masking: Not enough context
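A minimal sketch of this input corruption follows (a hypothetical helper, not BERT's exact recipe: BERT additionally keeps 10% of the selected tokens unchanged and swaps 10% for random tokens; here every selected token simply becomes [MASK]).

```python
# Mask roughly 15% of the input tokens and record the prediction targets.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)   # the model must predict the original token
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)           # no loss is computed at unmasked positions
    return corrupted, targets

tokens = "the man went to the store to buy a gallon of milk".split()
print(mask_tokens(tokens))
```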
BERT
Pre-training and Transfer Learning in NLP
BERT: Masked Language Model
For the 15% of tokens selected for masking: 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are kept unchanged.
BERT
Pre-training and Transfer Learning in NLP
BERT: Next Sentence Prediction
Given two sentences A and B, predict whether B is the sentence that actually follows A in the corpus (50% of the time it is; 50% of the time B is a random sentence).
BERT
Pre-training and Transfer Learning in NLP
BERT: Input Representation
Each input token is represented as the sum of its WordPiece token embedding, a segment (sentence A/B) embedding, and a position embedding.
BERT
Model Details
• Data: Wikipedia (2.5B words) + BookCorpus (800M words)
• Batch size: 131,072 words (1,024 sequences × 128 tokens, or 256 sequences × 512 tokens)
• Training Time: 1M steps (~40 epochs)
• Optimizer: AdamW, 1e-4 learning rate, linear decay
• BERT-Base: 12-layer, 768-hidden, 12-head
• BERT-Large: 24-layer, 1024-hidden, 16-head
• Trained on 4×4 or 8×8 TPU slice for 4 days
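As a hedged illustration (assuming the Hugging Face transformers library and PyTorch, which are not part of the lecture material), the pre-trained BERT-Base checkpoint described above can be loaded and used to produce contextual vectors like so:

```python
# Load BERT-Base and encode a sentence into contextual representations.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")   # 12-layer, 768-hidden, 12-head

inputs = tokenizer("Step up to the bat", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)                   # (1, num_tokens, 768) contextual vectors
```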
BERT
Pre-training and Transfer Learning in NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT
Pre-training and Transfer Learning in NLP
BERT is developed in two steps: (1) pre-training on un-annotated data, and (2) fine-tuning for a specific task. You download the model pre-trained in step 1 and only worry about fine-tuning it in step 2.
Pre-training → Fine-tuning
BERT
Fine-Tuning Procedure
BERT
Fine-Tuning Procedure
The following shows a number of ways to use BERT for different tasks.
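A minimal fine-tuning sketch for sentence classification follows, again assuming the Hugging Face transformers library; the labels and the 2e-5 learning rate are illustrative (common BERT fine-tuning practice), not a prescribed recipe.

```python
# Fine-tune BERT-Base for binary sentence classification (one training step shown).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["call me baby", "very beautiful name"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 1])                    # e.g. 1 = POSITIVE

optimizer.zero_grad()
outputs = model(**batch, labels=labels)          # a classification head is added on [CLS]
outputs.loss.backward()                          # all parameters are fine-tuned end-to-end
optimizer.step()
```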
BERT
Accuracy / Performance
See the SQuAD leaderboard: https://rajpurkar.github.io/SQuAD-explorer/
BERT
Effect of Pre-training Task
• Masked LM (compared to left-to-right LM) is very important on some tasks, Next Sentence Prediction is important on other tasks.
• A left-to-right model does very poorly on word-level tasks (e.g. SQuAD), although this is mitigated by adding a BiLSTM on top
BERT
Effect of Model Size
• Big models help a lot
• Going from 110M → 340M parameters helps even on datasets with only 3,600 labelled examples
• Improvements have not asymptoted
BERT
Resources
Pre-training BERT requires substantial TPU resources:
• BERT-Base: 4 Cloud TPUs (16 TPU chips total) for 4 days
• BERT-Large: 16 Cloud TPUs (64 TPU chips total) for 4 days
BERT Questions
• Why did no one think of this before?
• Better Question: Why wasn’t contextual pre-training popular
before 2018 with ELMo?
• Good Results on pre-training is > 1,000 x to 100,000 more expensive than supervised training.
BERT Questions
• The model must be learning more than “contextual embeddings”
• Alternate interpretation: Predicting missing words (or next words) requires learning many types of language understanding features
– Syntax, semantics, pragmatics, coreference, etc.
• Implication: the pre-trained model is much bigger than it needs to be to solve any specific task
• Task-specific model distillation works very well
BERT
More advanced pre-trained models
Post BERT
Post BERT Pretrained Model Map
Post BERT RoBERTa
A Robustly Optimized BERT Pretraining Approach (Liu et al, University of Washington and Facebook, 2019)
• Trained BERT for more epochs and/or on more data
– Showed that more epochs alone helps, even on same data
– More data also helps
• Improved masking and pre-training data slightly
Post BERT XLNet
Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
Innovation 1: Relative Position Embedding
• Sentence: Caren ate a hot dog
• Absolute Attention: “How much should dog attend to hot (in any position), and how much should dog in position 4 attend to the word in position 3?”
• Relative Attention: “How much should dog attend to hot (in any position) and how much should dog attend to the previous word?”
Post BERT XLNet
Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
Innovation 2: Permutation Language Modeling
• In a left-to-right language model, every word is predicted based on all of the words to its left
• Instead: Randomly permute the order for every training sentence
• Equivalent to masking, but many more predictions per sentence
• Can be done efficiently with Transformers
Post BERT XLNet
Generalized Autoregressive Pretraining for Language Understanding (Yang et al, CMU and Google, 2019)
• Also used more data and bigger models, but showed that innovations improved on BERT even with same data and model size
• XLNet results:
Post BERT ALBERT
A Lite BERT for Self-supervised Learning of Language Representations (Lan et al, Google and TTI Chicago, 2019)
Innovation 1: Factorized Embedding Parameterisation
• Use small embedding size (e.g., 128) and then project it to Transformer hidden size (e.g., 1024) with parameter matrix
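A rough parameter count shows why this helps; the vocabulary size V = 30,000 below is illustrative, not taken from the slide:

```latex
\underbrace{V \times H}_{\text{BERT-style embedding}}
\quad\text{vs.}\quad
\underbrace{V \times E + E \times H}_{\text{ALBERT, factorized}}
\qquad
30{,}000 \times 1024 \approx 30.7\mathrm{M}
\quad\text{vs.}\quad
30{,}000 \times 128 + 128 \times 1024 \approx 4.0\mathrm{M}
```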
Post BERT ALBERT
Innovation 2: Cross-layer parameter sharing
• Share all parameters between Transformer layers
Results: ALBERT is light in terms of parameters, not speed.
Post BERT T5
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al, Google, 2019)
Ablated many aspects of pre-training:
• Model Size
• Amount of Training data
• Domain/cleanness of training data
• Pre-training objective details (e.g. span length of masked text)
• Ensembling
• Finetuning recipe (e.g. only allowing certain layers to finetune)
• Multi-task training
Post BERT T5
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al, Google, 2019)
Conclusions:
• Scaling up model size and amount of training data helps a lot
• The best model has 11B parameters (BERT-Large has 340M), trained on 120B words of cleaned Common Crawl text
• The exact masking/corruption strategy does not matter that much
• Mostly negative results for better finetuning and multi-task strategies
Post BERT ELECTRA
Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al, 2020)
Train a model to discriminate real tokens from locally plausible replacements: a small generator proposes replacement tokens, and the main model (the discriminator) is trained to detect which tokens were replaced.
Post BERT ELECTRA
Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al, 2020)
Difficult to match SOTA results with less compute
Post BERT Longformer
Longformer: The Long-Document Transformer (Beltagy et al., 2020)
Why?
• Traditional Transformer-based models are unable to process long sequences because their self-attention operation scales quadratically with the sequence length.
• To address this, Longformer uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer (see the comparison below).
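In rough terms, for sequence length n and a fixed attention window w (plus a handful of global-attention tokens):

```latex
\text{full self-attention: } O(n^{2})
\qquad\text{vs.}\qquad
\text{sliding-window attention: } O(n \cdot w), \quad w \ll n
```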
Post BERT
Longformer
Longformer: The Long-Document Transformer (Beltagy et al., 2020)
Post BERT
Applying Models to Production Services
• BERT and other pre-trained language models are extremely large and expensive
• How are companies applying them to low-latency production services?
The Answer is ‘Distillation’ (Model Compression)
Post BERT
Distillation (Model Compression)
The idea has been around for a long time (from 2006)
• Model Compression (Bucila et al. 2006)
• Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
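The core objective from Hinton et al. (2015), roughly: a small student is trained on the teacher's temperature-softened outputs alongside the hard labels,

```latex
\mathcal{L} \;=\; (1-\alpha)\,\mathrm{CE}\big(y,\ \sigma(z_s)\big)
\;+\; \alpha\, T^{2}\,\mathrm{CE}\big(\sigma(z_t/T),\ \sigma(z_s/T)\big)
```

where z_s and z_t are the student and teacher logits, T is the temperature, and the T² factor keeps the gradient scale comparable across temperatures.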
Multimodal Pretrained Model
Multimodal Pretrained Model Video Representation Learning
Supervised learning: large labelled data with CNNs
• Expensive to collect labelled data
• Small corresponding label vocabularies are not able to represent the nuances of actions (e.g. the difference between "sipping", "drinking" and "gulping")
• Represents only short video clips (a few seconds long)
Unsupervised learning: learning density models from video
• A single static stochastic variable, decoded into a sequence with an RNN (VAE-style or GAN-style loss)
• A temporal stochastic variable (SV2P/SVCLP) or GAN-based approaches (SAVP/MoCoGAN)
• What if we do not use an explicit stochastic latent variable?
Multimodal Pretrained Model
VideoBERT (ICCV 2019)
A Joint Model for Video and Language Representation Learning
Multimodal Pretrained Model
ViLBERT (NeurIPS 2019)
Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al., 2019)
Multimodal Pretrained Model ViLBERT (NIPS 2019)
Results across all transfer tasks
• improves performance over a single-stream model
• result in improved visiolinguistic representations
• Finetuning from ViLBERT is a powerful strategy for vision and
language tasks
Before we finish this lecture…
Before we finish the lecture… The current ML/DL-based NLP trends
1) Assume that we collect data from Royal Prince Alfred Hospital
Before we finish the lecture… The current ML/DL-based NLP trends
1) Assume that we collect data from Royal Prince Alfred Hospital
Training with RNN, Transformer, or even pretrained model…
2) Train and Test on data from the same hospital
Before we finish the lecture… The current ML/DL-based NLP trends
“Indeed, we can publish papers showing the algorithms are comparable to human medical experts in spotting certain conditions”
Before we finish the lecture… The current ML/DL-based NLP trends
What if we apply this same model to another hospital?
Before we finish the lecture… The current ML/DL-based NLP trends
Assume you take that same DL-based NLP model to St Vincent's Private Hospital, which has an older testing machine, and the technician there uses a slightly different testing protocol.
Data drift causes the performance of the DL-based NLP model to degrade significantly.
Before we finish the lecture…
The current ML/DL-based NLP trends
In contrast, a doctor can just walk down the street and diagnose patients at the other hospital.
Before we finish the lecture… The current ML/DL-based NLP trends
When a system is not performing well, many teams instinctively try to improve the code (trying a different model or component, or changing hyperparameters).
However, for many practical applications, it is more effective to focus instead on improving the data.
Before we finish the lecture… The current ML/DL-based NLP trends
"Everyone jokes that ML/DL is 80% data preparation, but no one seems to care."
Prof. Andrew Ng (March 2021)
Before we finish the lecture… The current ML/DL-based NLP trends
There is unprecedented competition around beating the benchmarks: if Google has BERT, then OpenAI has GPT-3.
However, these fancy models make up only about 20% of a business problem. What differentiates a good deployment is the quality of the data.
"Data Dispersion"
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C 2018, Natural Language Processing with Deep Learning, lecture notes, Stanford University
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
• Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
• Miller, A., Fisch, A., Dodge, J., Karimi, A. H., Bordes, A., & Weston, J. (2016). Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
• Drawings
• http://jalammar.github.io/illustrated-bert/
• http://jalammar.github.io/illustrated-transformer/