COMP5046
Natural Language Processing
Lecture 11: Advanced NLP: Machine Translation and Transformer
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
0
LECTURE PLAN
Lecture 11: Machine Translation and Transformer
1. Machine Translation
2. Statistical Machine Translation
3. Neural Machine Translation
4. Attention and Transformer for MT
5. The Rise of the Pre-trained Model
0
Assignment 2 Specification
1
What is Machine Translation?
1
Machine Translation
“translate a sentence x from one language (the source language) to a sentence y in another language (the target language).”
[Figure] The Machine translates sentence x in the source language (生命短暂) into sentence y in the target language ("life is short").
1
Machine Translation
Statistical Machine Translation
2
Statistical Machine Translation
“Learning a probabilistic model from data”
Source language (x) Target language (y)
Best translation?
How to learn the translation model?
Sentence x
生命短暂
Sentence y
life is short
Statistical Machine Translation
“Learning a probabilistic model from data”
Source language (x) Target language (y)
Sentence x
生命短暂
Bayes Rule
Best translation?
Sentence y
life is short
Translation Model (fidelity): models how words and phrases should be translated
Language Model (fluency): models how to write fluent English
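Written out (a standard formulation implied by the slide's Bayes Rule step), the best translation y* decomposes into the two models above; P(x) can be dropped because it does not depend on y:

\[
y^{*} \;=\; \arg\max_{y} P(y \mid x)
\;=\; \arg\max_{y} \frac{P(x \mid y)\,P(y)}{P(x)}
\;=\; \arg\max_{y} \underbrace{P(x \mid y)}_{\text{Translation Model (fidelity)}}\;\underbrace{P(y)}_{\text{Language Model (fluency)}}
\]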
Statistical Machine Translation
“Learning a probabilistic model from data”
Source language (x) Target language (y)
Sentence x
生命短暂
Bayes Rule
Best translation?
Sentence y
life is short
Translation Model (fidelity) Learnt from parallel data
Language Model (fluency)
Learnt from monolingual data
2
Statistical Machine Translation
How to learn the translation model from a parallel corpus?
[Figure] Noisy-channel pipeline: a Parallel Corpus is used to learn the Translation Model (fidelity), and a monolingual English Corpus is used to learn the Language Model (fluency). The source sentence 生命短暂 is passed through the Translation Model to produce candidate "broken English" translations (Short life / Life is brief / Short is life / …), which the Language Model re-ranks (via Bayes Rule) to give the best translation: "Life is short".
2
Statistical Machine Translation Parallel corpus and Alignment
How to learn translation model from the parallel corpus?
i.e. pairs of human-translated
Chinese/English sentences
http://opus.nlpl.eu/
2
Statistical Machine Translation
Parallel corpus and Alignment
How to align these sentences? (OpenSubtitles)
Statistical Machine Translation How to learn translation model?
How to learn translation model from the parallel corpus?
i.e. pairs of human-translated
Chinese/English sentences
a is the alignment
Alignment is the correspondence between particular words in the translated sentence pair. (i.e. word-level correspondence
between source sentence x and target sentence y)
2
Statistical Machine Translation What is Alignment a?
“The correspondence between particular words in the translated sentence pair”
[Alignment example] 把你的手机给我 ↔ "Give me your phone". The word 把 is a spurious word: it aligns to no word in the English sentence.
Statistical Machine Translation
What is Alignment a? Many-to-One Alignment
2
Statistical Machine Translation
What is Alignment a? One-to-Many Alignment
[Alignment example] 这结果是令人满意的 ↔ "This result is satisfying". One English word can align to several Chinese characters (e.g. satisfying ↔ 令人满意的).
Statistical Machine Translation What is Alignment a?
Many-to-Many Alignment
[Alignment example] 我了解你而你也了解我 ↔ "I know about you and vice versa": phrases align to phrases rather than single words.
Statistical Machine Translation What is Alignment a?
Some words have no single-word equivalent in English.
[Alignment example] 用尽方法 ↔ "All possible ways have been used": a single Chinese word here aligns to several English words.
Statistical Machine Translation Decoding for SMT
Translation Model (fidelity) Learnt from parallel data
• Could we enumerate every possible y and calculate its probability? Too expensive!
• Answer: use a heuristic search algorithm to find the best translation, discarding hypotheses whose probability is too low
Language Model (fluency)
Learnt from monolingual data
Statistical Machine Translation
SMT was a huge research field, and the best systems were extremely complex:
• Hundreds of important details (not covered here)
• Systems had many separately-designed subcomponents
• Lots of feature engineering
– Need to design features to capture particular language phenomena
• Required compiling and maintaining extra resources – like tables of equivalent phrases
• Lots of human effort to maintain
– Repeated effort for each language pair!
Neural Machine Translation
Neural Machine Translation Neural Machine Translation with Seq2Seq
“a way to do Machine Translation with a single neural network (NN)” • The NN architecture is called seq2seq and involves two RNNs.
[Figure] The Encoder RNN reads the source sentence 生命短暂 and produces an encoder state; the Decoder RNN generates the target sentence "Life is short".
Neural Machine Translation Neural Machine Translation with Seq2Seq
“a way to do Machine Translation with a single neural network (NN)” • The NN architecture is called seq2seq and involves two RNNs.
[Figure] Encoder: one-hot vectors for 生命短暂 → encoder embedding layer → encoder recurrent layer → encoding of the source sentence.
Decoder: one-hot vectors for the target words so far → decoder embedding layer → decoder recurrent layer → decoder output layer → "Life is short".
Neural Machine Translation Neural Machine Translation with Seq2Seq
“a way to do Machine Translation with a single neural network (NN)” • The NN architecture is called seq2seq and involves two RNNs.
[Figure: the same seq2seq architecture as above]
The Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding of the source sentence.
Neural Machine Translation
Neural Machine Translation: Greedy Decoding [Recap]
Language Model Decoding: Recap
• Generate the sentence by taking the argmax (the most probable word) at each step
• Use that word as the next input, and feed it in on the next step
• Keep going until you produce the <END> token
[Figure] The decoder producing "Life is short" one argmax at a time
Greedy decoding has no way to undo decisions! (It can produce ungrammatical, unnatural output.)
Solution? Try computing all possible sequences? (A sketch of greedy decoding follows below.)
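As a minimal sketch of greedy decoding (the decoder_step and embed functions below are hypothetical stand-ins for the decoder recurrent layer and embedding layer, not code from the lecture):

import numpy as np

def greedy_decode(decoder_step, embed, h0, bos_id, eos_id, max_len=50):
    """Greedy decoding: at each step take the argmax word and feed it back in.
    decoder_step(prev_embedding, state) -> (probs over vocab, new state)  [assumed interface]"""
    state, prev_id, output = h0, bos_id, []
    for _ in range(max_len):
        probs, state = decoder_step(embed(prev_id), state)
        prev_id = int(np.argmax(probs))          # most probable next word
        if prev_id == eos_id:                    # stop once <END> is produced
            break
        output.append(prev_id)
    return output                                # earlier argmax choices can never be revised

Because each argmax is committed immediately, a single bad choice cannot be undone later, which is exactly the weakness beam search addresses on the next slide.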
3
Neural Machine Translation
Neural Machine Translation: Beam Search Decoding [Recap]
Language Model Decoding: Recap
• A search algorithm which aims to find a high-probability sequence (not necessarily the optimal sequence, though) by tracking multiple possible sequences at once
• On each step of the decoder, keep track of the k most probable partial sequences (which we call hypotheses)
• k is the beam size (in practice around 5 to 10)
• After you reach some stopping criterion, choose the sequence with the highest probability (factoring in some adjustment for length); see the sketch below
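A minimal beam-search sketch under the same hypothetical decoder_step/embed interface as the greedy example above (simplified: log-probability scores, a basic length normalisation, no batching):

import numpy as np

def beam_search(decoder_step, embed, h0, bos_id, eos_id, k=5, max_len=50):
    """Keep the k most probable partial sequences (hypotheses) at each decoder step."""
    beams = [([bos_id], 0.0, h0, False)]              # (tokens, log-prob, state, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, logp, state, done in beams:
            if done:                                  # finished hypotheses are carried over
                candidates.append((tokens, logp, state, True))
                continue
            probs, new_state = decoder_step(embed(tokens[-1]), state)
            for w in np.argsort(probs)[-k:]:          # top-k next words for this hypothesis
                candidates.append((tokens + [int(w)],
                                   logp + float(np.log(probs[w])),
                                   new_state, int(w) == eos_id))
        # of these (up to) k^2 hypotheses, keep only the k with the highest scores
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(done for *_, done in beams):
            break
    # choose the hypothesis with the highest length-normalised score
    best = max(beams, key=lambda c: c[1] / max(len(c[0]) - 1, 1))
    return best[0][1:]                                # drop the <BOS> token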
3
Neural Machine Translation
Neural Machine Translation: Beam Search Decoding
Language Model Decoding: Recap
Assume that k (beam size) = 2
[Figure] Beam-search tree expanding candidate continuations (life / is / was / were / short / temporarily / …)
1. Find the top k next words for each hypothesis and calculate their scores
2. Of these k² hypotheses, keep only the k with the highest scores
Neural Machine Translation Evaluate Machine Translation
BLEU (Bilingual Evaluation Understudy)
“Compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:”
• n-gram precision (usually for 1 to 4-grams)
• Plus a penalty for too-short system translations
BLEU is useful but imperfect
• Many valid ways to translate a sentence
• So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
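Written out (the standard BLEU definition from Papineni et al., 2002, which the slide summarises), with modified n-gram precisions p_n, uniform weights w_n = 1/4, candidate length c and reference length r:

\[
\mathrm{BLEU} \;=\; \mathrm{BP}\cdot\exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big),
\qquad
\mathrm{BP} \;=\; \min\!\big(1,\; e^{\,1 - r/c}\big)
\]

The brevity penalty BP is what penalises too-short system translations.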
3
Neural Machine Translation However, there are still several difficulties…
• Out-of-vocabulary (OOV) words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
3
Neural Machine Translation Machine Translation is not PERFECT…
Using common sense is still hard, and NMT picks up biases in the training data
3
Neural Machine Translation Machine Translation is not PERFECT…
Uninterpretable systems do strange things
https://www.vice.com/en_us/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies
3
Neural Machine Translation Neural Machine Translation with Seq2Seq
RNN-based neural MT was sort of successful! But…
RNN
Encoder recurrent layer
Encoder embedding layer
One-hot vector
生命短暂
Source sentence
[Encoder]
3
Neural Machine Translation Neural Machine Translation with Seq2Seq
RNN-based neural MT was sort of successful! But…
RNN
The encoder RNN cannot remember all the information about the source sentence
Vanishing gradients…
Encoder recurrent layer
Encoder embedding layer
One-hot vector
生命短暂
Source sentence
[Encoder]
Neural Machine Translation Neural Machine Translation with Seq2Seq
RNN-based neural MT was successful! But…
LSTM
Encoder recurrent layer
Encoder embedding layer
One-hot vector
3
Forget/Input/Output Gate….
生命短暂
Source sentence
[Encoder]
LSTMs can remember sequences of hundreds of steps, but not thousands or tens of thousands.
3
Neural Machine Translation Neural Machine Translation with Seq2Seq
Then, how to solve the information bottleneck issue?
Attention!
Life
is short
One-hot vector
Decoder Output Layer
Decoder Recurrent Layer
Decoder embedding layer
One-hot vector
Encoder recurrent layer
Encoder embedding layer
One-hot vector
生命短暂
Source sentence
Life is
short
Forget/Input/Output Gate….
[Encoder]
3
Neural Machine Translation
Neural Machine Translation with RNN and Attention
Then, how to solve the information bottleneck issue?
Attention with RNN!
[Figure] For each decoder step, attention scores are computed over the encoder recurrent states for 生 命 短 暂, a softmax turns them into an attention distribution, and the weighted sum gives the attention output that is combined with the decoder recurrent layer.
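In equations (one standard dot-product formulation of the three steps in the figure, with encoder hidden states h_1, …, h_N and decoder state s_t at step t):

\[
e_{t,i} = s_t^{\top} h_i,
\qquad
\alpha_t = \mathrm{softmax}(e_t),
\qquad
a_t = \sum_{i=1}^{N} \alpha_{t,i}\, h_i
\]

The attention output a_t is then combined with the decoder state s_t to predict the next target word.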
3
Neural Machine Translation
Neural Machine Translation with RNN and Attention
Then, how to solve the information bottleneck issue?
Attention with RNN!
Wait…!
Attention gives us access to any state… Do we really need the RNNs?
[Figure: the same attention-over-encoder-states architecture as above]
4
Attention and Transformer for MT
Early 2018 ~
4
Attention and Transformer for MT Attention is All You Need (Vaswani et al., 2017)
Encoder-Decoder with only Attention
Core Task: Machine Translation with Parallel Corpus
• Use self-attention in the encoder, instead of RNN or CNNs
• Predict each translated word
• Final cost/error function
→ standard cross-entropy error on top of a softmax classifier
Attention is All You Need!
[Figure] 'The Transformer': the Encoder reads the input こんにちは世界 (source language) and the Decoder produces the output "Hello World" (target language).
4
Attention and Transformer for MT The Transformer
Encoder – Decoder Architecture
1. Encoder
A stack of N=6 identical layers.
Each layer with two sub-layers:
1. Multi-head self-attention mechanism
2. Position-wise fully connected feed-forward network
* Residual connection around each of the two sub-layers, followed by layer normalisation
Encoder
The transformer – model architecture
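As a minimal PyTorch-style sketch of one such encoder layer (an illustrative simplification of the structure described above, using post-layer-norm as in the original paper; not the authors' implementation):

import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the N=6 identical encoder layers: self-attention + feed-forward,
    each wrapped in a residual connection followed by layer normalisation."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # Sub-layer 1: multi-head self-attention (Q = K = V = x)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))        # residual + layer norm
        # Sub-layer 2: position-wise feed-forward network
        x = self.norm2(x + self.drop(self.ffn(x)))     # residual + layer norm
        return x

Stacking six of these layers on top of the embedding and positional-encoding step gives the encoder.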
4
Attention and Transformer for MT The Transformer
Encoder – Decoder Architecture
2. Decoder
A stack of N=6 identical layers.
Each layer with three sub-layers:
1. Masked multi-head self-attention over the previously generated outputs
2. Multi-head attention over the output of the encoder stack
3. Position-wise fully connected feed-forward network
* Residual connection around each of the three sub-layers, followed by layer normalisation
Decoder
The transformer – model architecture
4
Attention and Transformer for MT The Transformer
Encoder – Decoder Architecture
Brief Summary
Decoder
Encoder
The transformer – model architecture
4
Attention and Transformer for MT
The Transformer – Encoder
(Stage1)
We are not using an RNN any more, so there is no built-in notion of time steps or word order!
Encoder
Decoder
To make use of the order of the sequence, inject information about the position of the tokens in the sequence.
Positional Encoding
(uses sin and cos functions of the position and dimension)
Input embedding (a vector of size 512)
The positional encoding is added (+) to the input embedding of each token: こんにちは, 世界
The transformer – model architecture
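The sinusoidal encoding from Vaswani et al. (2017), for position pos and dimension index i (with d_model = 512):

\[
PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{\,2i/d_{\text{model}}}\big),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{\,2i/d_{\text{model}}}\big)
\]

This 512-dimensional vector is added element-wise to the input embedding of each token.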
4
Attention and Transformer for MT The Transformer – Encoder (Stage 2)
Encoder
Decoder
こんにちは 世界
The transformer – model architecture
4
Attention and Transformer for MT
The Transformer – Encoder
(Stage 2)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions
Encoder
Decoder
Multi-Head Attention (with self-attention)
Q = Query, K = Key, V = Value (64-dimensional per head)
The transformer – model architecture
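Each head computes the scaled dot-product attention of Vaswani et al. (2017), and the h = 8 heads are concatenated and projected back to 512 dimensions:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d_k}}\Big)V,
\qquad d_k = d_v = d_{\text{model}}/h = 64
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
\]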
4
Attention and Transformer for MT
The Transformer – Encoder
(Stage 2)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions
Encoder
Decoder
Source–target attention: the decoder's multi-head attention over the encoder output
(Q comes from the decoder; K and V come from the encoder)
The transformer – model architecture
4
Attention and Transformer for MT
The Transformer – Encoder
(Stage 2)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions
Decoder
Self-attention: multi-head attention where Q, K and V all come from the same sequence
Encoder
The transformer – model architecture
4
Attention and Transformer for MT
The Transformer – Encoder
(Stage 2)
[Figure] Each token embedding (こんにちは, 世界) is 512-dimensional; each attention head projects it down to 64-dimensional Q, K and V vectors.
The transformer – model architecture
4
Attention and Transformer for MT The Transformer – Encoder to Decoder
こんにちは 世界 Hello World
4
Attention and Transformer for MT The Transformer – Decoder
Hello
Label:  [0, 0, …, 1, 0]    Output: [0.1, 0.01, …, 0.8, 0.1]
Decoder
The transformer – model architecture
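The final cost is the standard cross-entropy between the one-hot label and the softmax output shown above; for a single target word with label vector y and predicted distribution ŷ over the vocabulary V:

\[
\mathcal{L} = -\sum_{i \in V} y_i \log \hat{y}_i = -\log \hat{y}_{\text{correct word}}
\]

For the example above this is -log 0.8 ≈ 0.22, and the loss is summed over all target positions.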
4
Attention and Transformer for MT
The Transformer with example – Encoder to Decoder
4
Attention and Transformer for MT
The Transformer with example – Decoding Phrases
5
The Rise of the Pre-trained Model
Early 2019 ~
5
The Rise of the Pre-trained Model Pre-training and Transfer Learning
Computer vision has proven the value of transfer learning:
• pre-training a neural network on a known task (e.g. ImageNet)
• then performing fine-tuning
• i.e. using the trained neural network as the basis of a new purpose-specific model
5
The Rise of the Pre-trained Model
Pre-training and Transfer Learning in NLP
Popular Pre-trained Model in NLP
(Peters et al, 2018) (Devlin et al, 2018)
Using Contextual word representations
5
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
Popular Pre-trained Model: Contextual Representations
Word embeddings (e.g. word2vec, fastText, GloVe) are applied in a context-free manner:
"Step up to the bat" → bat = [0.7, 0.2, -0.5, 1.1, …]
"A vampire bat" → bat = [0.7, 0.2, -0.5, 1.1, …] (the same vector in both contexts)
We need to train contextual representations on a text corpus:
"Step up to the bat" → bat = [1.1, -0.7, 0.8, 2.1, …]
"A vampire bat" → bat = [0.3, 0.5, -0.9, 1.3, …] (different vectors per context)
5
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextualized Word Representations (Peters et al., 2018)
ELMo provided a significant step towards pre-training in the context of NLP. Let's dig into what ELMo's big secret is!
5
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo: Deep Contextualized Word Representations (Peters et al., 2018)
ELMo gained its language understanding from being trained to predict the next word in a sequence of words (the language modelling task). This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
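More precisely, ELMo trains a bidirectional language model: the objective from Peters et al. (2018) jointly maximises the forward and backward log-likelihoods of a token sequence t_1, …, t_N:

\[
\sum_{k=1}^{N}\Big(
\log p\big(t_k \mid t_1,\dots,t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
+ \log p\big(t_k \mid t_{k+1},\dots,t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
\Big)
\]

where the token-representation parameters Θ_x and softmax parameters Θ_s are shared between the forward and backward LSTMs.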
5
The Rise of the Pre-trained Model Pre-training and Transfer Learning in NLP
ELMo and BERT
(Peters et al, 2018) (Devlin et al, 2018)
5
The Rise of the Pre-trained Model The future of NLP…
COMP5046 Natural Language Processing
What we learned in this course!
Week 1: Introduction to Natural Language Processing (NLP)
Week 2: Word Embeddings (Word Vector for Meaning)
Week 3: Word Classification with Machine Learning I
Week 4: Word Classification with Machine Learning II
NLP and Machine Learning
Week 5: Language Fundamental
Week 6: Part of Speech Tagging
Week 7: Dependency Parsing
Week 8: Language Model
NLP Techniques
Week 9: Information Extraction: Named Entity Recognition
Advanced Topic
Week 10: Advanced NLP: Attention and Reading Comprehension
Week 11: Advanced NLP: Transformer and Machine Translation
Week 12: Advanced NLP: Pretrained Model
Week 13: Future of NLP and Exam Review
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.
• Manning, C. D. (2018). Natural Language Processing with Deep Learning, lecture notes, Stanford University.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
• Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
• Miller, A., Fisch, A., Dodge, J., Karimi, A. H., Bordes, A., & Weston, J. (2016). Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
• Drawings
• http://jalammar.github.io/illustrated-bert/
• http://jalammar.github.io/illustrated-transformer/