Transformer Neural Networks
Dr Chang Xu
School of Computer Science, The University of Sydney
AI's Big Breakthroughs in 2020
GPT-3, the third version of the Generative Pre-Trained Transformer model, was released by OpenAI.
– Developing web apps: enter a sentence describing the Google home page layout, and GPT-3 generates the code for it.
– Building ML models: given plain text describing the ML model we want, GPT-3 generates the corresponding Keras code.
Self-Attention
A self-attention layer, like an RNN layer, takes an input sequence $a^1, a^2, a^3, a^4$ and produces an output sequence $b^1, b^2, b^3, b^4$. You can try to replace anything that has been done by an RNN with self-attention ("Attention Is All You Need").
Each input $x^i$ is first embedded as $a^i = W x^i$. From every $a^i$, three vectors are computed:
– query: $q^i = W^q a^i$ (to match others)
– key: $k^i = W^k a^i$ (to be matched)
– value: $v^i = W^v a^i$ (information to be extracted)
Scaled dot-product attention: the query $q^1$ is matched against every key with a dot product,
$$\alpha_{1,i} = q^1 \cdot k^i / \sqrt{d},$$
where $d$ is the dimension of $q$ and $k$.
The scores are then normalised with a softmax:
$$\hat{\alpha}_{1,i} = \exp(\alpha_{1,i}) \Big/ \sum_j \exp(\alpha_{1,j}).$$
The first output is a weighted sum of all the values, so it considers the whole sequence:
$$b^1 = \sum_i \hat{\alpha}_{1,i} \, v^i.$$
The same computation with query $q^2$ gives the second output, and in general
$$b^j = \sum_i \hat{\alpha}_{j,i} \, v^i.$$
The self-attention layer therefore maps $x^1, \dots, x^4$ to $b^1, \dots, b^4$, and $b^1, b^2, b^3, b^4$ can all be computed in parallel.
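To make the per-vector recipe above concrete, here is a minimal NumPy sketch (not from the slides; the weight matrices and toy dimensions are random placeholders, and vectors are stored as rows rather than columns):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d = 6, 4                      # toy input and query/key dimensions

x = rng.normal(size=(4, d_in))      # x^1 ... x^4 as rows
W  = rng.normal(size=(d_in, d))     # embedding:        a^i = W x^i
Wq = rng.normal(size=(d, d))        # query projection: q^i = W^q a^i
Wk = rng.normal(size=(d, d))        # key projection:   k^i = W^k a^i
Wv = rng.normal(size=(d, d))        # value projection: v^i = W^v a^i

a = x @ W                           # embeddings a^1 ... a^4
q, k, v = a @ Wq, a @ Wk, a @ Wv    # queries, keys, values (one row per position)

b = []
for j in range(4):                                      # one output b^j per query q^j
    alpha = q[j] @ k.T / np.sqrt(d)                     # scaled dot-product scores alpha_{j,i}
    alpha_hat = np.exp(alpha) / np.exp(alpha).sum()     # softmax over the positions i
    b.append(alpha_hat @ v)                             # b^j = sum_i alpha_hat_{j,i} v^i
b = np.stack(b)
print(b.shape)                      # (4, 4): the whole output sequence b^1 ... b^4
```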
Matrix form: stacking the embeddings as columns of $I = [a^1\ a^2\ a^3\ a^4]$, the per-vector projections $q^i = W^q a^i$, $k^i = W^k a^i$, $v^i = W^v a^i$ become three matrix products:
$$Q = W^q I, \qquad K = W^k I, \qquad V = W^v I.$$
Likewise, all the scores for the first query (ignoring $\sqrt{d}$ for simplicity) are obtained at once, since $\alpha_{1,i} = q^1 \cdot k^i$:
$$\begin{bmatrix} \alpha_{1,1} \\ \alpha_{1,2} \\ \alpha_{1,3} \\ \alpha_{1,4} \end{bmatrix} = \begin{bmatrix} (k^1)^\top \\ (k^2)^\top \\ (k^3)^\top \\ (k^4)^\top \end{bmatrix} q^1 = K^\top q^1.$$
Stacking all four queries gives the full score matrix
$$A = \begin{bmatrix} \alpha_{1,1} & \alpha_{2,1} & \alpha_{3,1} & \alpha_{4,1} \\ \alpha_{1,2} & \alpha_{2,2} & \alpha_{3,2} & \alpha_{4,2} \\ \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3} & \alpha_{4,3} \\ \alpha_{1,4} & \alpha_{2,4} & \alpha_{3,4} & \alpha_{4,4} \end{bmatrix} = K^\top Q, \qquad Q = [q^1\ q^2\ q^3\ q^4],$$
and applying the softmax to each column yields $\hat{A}$.
Finally, the outputs are another matrix product with the values $V = [v^1\ v^2\ v^3\ v^4]$:
$$O = [b^1\ b^2\ b^3\ b^4] = V \hat{A}, \qquad b^j = \sum_i \hat{\alpha}_{j,i} \, v^i.$$
In summary, self-attention is just a handful of matrix multiplications plus one softmax:
$$Q = W^q I, \quad K = W^k I, \quad V = W^v I, \qquad A = K^\top Q, \quad \hat{A} = \mathrm{softmax}(A), \qquad O = V \hat{A},$$
which is why the whole layer parallelises so well.
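The matrix form can be checked directly; a small sketch, again with random placeholder weights, where the columns of I are the embeddings $a^1, \dots, a^4$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
I = rng.normal(size=(d, 4))            # columns are a^1 ... a^4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = Wq @ I, Wk @ I, Wv @ I       # Q = W^q I, K = W^k I, V = W^v I
A = K.T @ Q / np.sqrt(d)               # A[i, j] = k^i . q^j / sqrt(d) = alpha_{j, i}
A_hat = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # column-wise softmax
O = V @ A_hat                          # columns of O are b^1 ... b^4

# Column j of O equals sum_i alpha_hat_{j,i} v^i, computed without any explicit loop.
print(O.shape)                         # (4, 4)
```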
Multi-head Self-Attention
(2 heads as example.) Each position $i$ still has $q^i = W^q a^i$ (and likewise $k^i$, $v^i$), but every query, key and value is further split into one part per head:
$$q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i,$$
and similarly for $k^{i,1}, k^{i,2}$ and $v^{i,1}, v^{i,2}$ (the same holds at every other position $j$). Head 1 attends using only the head-1 queries, keys and values and produces $b^{i,1}$; head 2 does the same with the head-2 vectors and produces $b^{i,2}$. The per-head outputs are then concatenated and projected back to a single output:
$$b^i = W^O \begin{bmatrix} b^{i,1} \\ b^{i,2} \end{bmatrix}.$$
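A sketch of the two-head case. Here the per-head projections $W^{q,1}, W^{q,2}$ (and their key/value counterparts) are realised by simply splitting each vector in half, which is a common implementation choice; all matrices are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, heads = 4, 8, 2
d_head = d // heads

a = rng.normal(size=(n, d))                       # a^1 ... a^4 as rows
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, d))                      # W^O, mixes the concatenated heads

q, k, v = a @ Wq, a @ Wk, a @ Wv                  # shared q^i, k^i, v^i
qh = q.reshape(n, heads, d_head)                  # q^{i,1}, q^{i,2} per position
kh = k.reshape(n, heads, d_head)
vh = v.reshape(n, heads, d_head)

outs = []
for h in range(heads):                            # each head attends independently
    scores = qh[:, h] @ kh[:, h].T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    outs.append(weights @ vh[:, h])               # b^{i,h} for every position i
b = np.concatenate(outs, axis=1) @ Wo             # concatenate heads, then project with W^O
print(b.shape)                                    # (4, 8)
```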
Positional Encoding
– Self-attention has no position information.
– In the original paper, each position $i$ has a unique positional vector $e^i$ (not learned from data) that is added to the embedding, so the layer's input is $a^i + e^i$.
– In other words, appending a one-hot position vector $p^i$ (1 in the $i$-th dimension, 0 elsewhere) to $x^i$ is equivalent: splitting the embedding matrix as $W = [\,W^x \ \ W^p\,]$ gives
$$W \begin{bmatrix} x^i \\ p^i \end{bmatrix} = W^x x^i + W^p p^i = a^i + e^i.$$
– The original Transformer uses sinusoidal encodings with values in $[-1, 1]$:
$$e_i(2j) = \sin\!\left(i / 10000^{2j/d}\right), \qquad e_i(2j+1) = \cos\!\left(i / 10000^{2j/d}\right),$$
where $i$ is the position and $j$ indexes the dimension.
(Source of image: http://jalammar.github.io/illustrated-transformer/)
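A sketch of the sinusoidal encoding above; the model dimension and the number of positions are arbitrary:

```python
import numpy as np

def positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """e_i(2j) = sin(i / 10000^(2j/d)),  e_i(2j+1) = cos(i / 10000^(2j/d))."""
    pe = np.zeros((num_positions, d_model))
    positions = np.arange(num_positions)[:, None]            # position index i
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)      # 10000^(2j/d)
    pe[:, 0::2] = np.sin(positions / div)                    # even dimensions
    pe[:, 1::2] = np.cos(positions / div)                    # odd dimensions
    return pe                                                # values lie in [-1, 1]

e = positional_encoding(50, 16)
print(e.shape)        # (50, 16); row i is e^i, added to a^i before self-attention
```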
Seq2seq with Attention
In a seq2seq model, the encoder turns the input $x^1, \dots, x^4$ into representations $h^1, \dots, h^4$, and the decoder consumes context vectors $c^1, c^2, c^3$ to produce the outputs $o^1, o^2, o^3$. The recurrent layers in both the encoder and the decoder can be replaced with self-attention layers.
Transformer
The Transformer is an encoder-decoder architecture. Using Chinese-to-English translation as an example, the encoder reads 神经网络 and the decoder generates "Neural Network".
Transformer
Inside each block, the attention output $b$ is added to its input $a$ (a residual connection) and then layer-normalised. In the decoder, the first self-attention is masked: it can only attend over the already-generated sequence, while the following encoder-decoder attention attends over the input sequence. Layer norm normalises each example across its own features to $\mu = 0, \sigma = 1$; batch norm instead normalises each feature across the batch dimension.
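A toy sketch contrasting the two normalisations mentioned above (the learnable scale/shift and the small ε term are omitted for brevity):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 5))  # batch of 8, 5 features

# Layer norm: normalise each example over its own features (the Transformer's choice).
layer_norm = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# Batch norm: normalise each feature over the batch dimension.
batch_norm = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)

print(layer_norm.mean(axis=1).round(6))   # ~0 for every example
print(batch_norm.mean(axis=0).round(6))   # ~0 for every feature
```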
Attention Visualization
https://arxiv.org/abs/1706.03762
Attention Visualization
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a
Transformer trained on English to French translation (one of eight attention heads).
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Multi-head Attention
BERT
A Transformer uses its encoder stack to model the input and its decoder stack to model the output (using input information from the encoder side). If we are only interested in training a language model of the input for some other task, we do not need the decoder of the Transformer; keeping only the encoder gives us BERT.
1-of-N Encoding vs. Word Embedding
With 1-of-N encoding every word is an independent one-hot vector, e.g. apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1], so no similarity between words is captured. Word classes group words coarsely (e.g. animals such as dog, cat, bird; actions such as ran, jumped, walk; plants such as flower, tree, apple). Word embeddings go further: related words (dog, cat, rabbit; jump, run; flower, tree) end up close to each other in a continuous vector space.
A word can have multiple senses.
– Have you paid that money to the bank yet?
– It is safest to deposit your money in the bank.
– The victim was found lying dead on the river bank.
– They stood on the river bank to fish.
– The hospital has its own blood bank.
The first two "bank"s share one sense and the next two share another; is the blood bank a third sense or not?
Contextualized Word Embedding
Instead of one vector per word type, each token gets an embedding that depends on its context: "money in the bank", "river bank", and "blood bank" each yield a different contextualized embedding for "bank".
Embeddings from Language Model (ELMO)
– RNN-based language models (trained from lots of sentences).
– E.g. given "five little monkeys jumping on the bed", an RNN language model is trained to predict each next token ("five" → "little" → "monkeys" → "jumping" → ...); the RNN hidden state at each step is used as the contextual embedding of that token.
ELMO
A forward RNN (reading "five little ...") and a backward RNN (reading the sentence in reverse) are combined, and the RNNs are stacked into a deep bidirectional LSTM. Each layer of the deep LSTM generates its own latent representation $h^1, h^2, \dots$ for every token, so which one should we use?
ELMO
Use them all: the representations from the different layers are combined with a weighted sum,
$$\text{ELMO embedding} = \alpha_1 h^1 + \alpha_2 h^2 (+ \dots),$$
where the weights $\alpha_1, \alpha_2$ (each can turn out large or small) are learned together with the downstream task.
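A sketch of the weighted combination for the two-layer case; the layer representations and the weights $\alpha_1, \alpha_2$ below are placeholders, since in ELMO the weights are learned jointly with the downstream task:

```python
import numpy as np

h1 = np.random.default_rng(1).normal(size=(7, 512))   # layer-1 representations for 7 tokens
h2 = np.random.default_rng(2).normal(size=(7, 512))   # layer-2 representations

# alpha_1 and alpha_2 would be learned with the downstream task;
# fixed values are used here only to show the combination.
alpha1, alpha2 = 0.3, 0.7
elmo_embedding = alpha1 * h1 + alpha2 * h2             # one contextual vector per token
print(elmo_embedding.shape)                            # (7, 512)
```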
Bidirectional Encoder Representations from Transformers (BERT)
– BERT = the encoder of the Transformer.
– Given a sentence such as "five little monkeys jumping ...", BERT outputs a contextualized embedding for every token. It is learned from a large amount of text without annotation.
Training of BERT
– Approach 1: Masked LM. Some input tokens (15% in the original recipe) are replaced with a special [MASK] token, e.g. "five [MASK] monkeys jumping ...". A linear multi-class classifier over the vocabulary, applied to BERT's output at the masked position, is trained to predict the masked word ("little").
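A sketch of how Masked-LM training pairs can be built. The 15% rate follows the original BERT recipe; BERT additionally keeps or randomly replaces some of the selected tokens, which is omitted here, and `mask_tokens` is an illustrative helper, not part of any library:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with [MASK]; return the corrupted input and the targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(mask_token)
            targets[i] = tok            # the classifier must predict this word
        else:
            corrupted.append(tok)
    return corrupted, targets

print(mask_tokens("five little monkeys jumping on the bed".split()))
```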
Training of BERT
– Approach 2: Next Sentence Prediction. Two sentences are packed into one input, with [SEP] marking the boundary between them and [CLS] at the position whose output is used for classification. A linear binary classifier on the [CLS] output predicts whether the second sentence really follows the first, e.g. "[CLS] ... jumping ... bed [SEP] one fell off ..." → yes.
– Approaches 1 and 2 are used at the same time.
For a pair that is not consecutive, e.g. "[CLS] ... jumping ... bed [SEP] Deep neural network ...", the binary classifier should output no.
How to use BERT – Case 1
– Input: a single sentence; output: a class.
– A linear classifier on the [CLS] output is trained from scratch, while BERT itself is fine-tuned.
– Examples: sentiment analysis, document classification.
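As one concrete (non-slide) realisation of Case 1, the Hugging Face `transformers` library puts exactly such a linear classifier on top of the [CLS] output; a minimal fine-tuning sketch, assuming the library and the pretrained `bert-base-uncased` checkpoint are available:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["this movie was great", "a complete waste of time"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])                     # e.g. positive / negative sentiment

outputs = model(**batch, labels=labels)           # classifier head is trained from scratch,
loss = outputs.loss                               # BERT itself is fine-tuned
loss.backward()                                   # an optimiser step would follow
print(loss.item())
```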
How to use BERT – Case 2
– Input: a single sentence; output: a class for each word.
– A linear classifier is applied to BERT's output at every token position.
– Example: slot filling.
How to use BERT – Case 3
– Input: two sentences, "[CLS] sentence 1 [SEP] sentence 2"; output: a class, predicted from the [CLS] output by a linear classifier.
– Example: Natural Language Inference. Given a "premise", determine whether a "hypothesis" is true, false, or unknown.
How to use BERT – Case 4
– Extraction-based Question Answering (QA), e.g. SQuAD.
– Document: $D = \{d_1, d_2, \dots, d_N\}$; Query: $Q = \{q_1, q_2, \dots, q_M\}$. The QA model takes $D$ and $Q$ and outputs two integers $(s, e)$; the answer is the span $A = \{d_s, \dots, d_e\}$ copied from the document.
– E.g. $s = 17, e = 17$ selects the 17th document token, and $s = 77, e = 79$ selects tokens 77 to 79.
How to use BERT – Case 4 (continued)
– The question and document are fed to BERT as "[CLS] q1 q2 [SEP] d1 d2 d3".
– A start vector, learned from scratch, is dot-multiplied with BERT's output at every document token, and a softmax over the resulting scores selects the start position (e.g. scores 0.3, 0.5, 0.2 → s = 2).
– A second vector, also learned from scratch, selects the end position in the same way (e.g. scores 0.1, 0.2, 0.7 → e = 3). With s = 2 and e = 3, the answer is "d2 d3".
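A sketch of the span selection just described; the BERT outputs are random placeholders, and the two vectors stand in for the start/end vectors that would be learned from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_outputs = rng.normal(size=(3, 768))   # BERT outputs for document tokens d1, d2, d3
start_vec = rng.normal(size=768)          # stand-in for the learned start vector
end_vec = rng.normal(size=768)            # stand-in for the learned end vector

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

start_scores = softmax(doc_outputs @ start_vec)   # e.g. [0.3, 0.5, 0.2] -> s = 2
end_scores = softmax(doc_outputs @ end_vec)       # e.g. [0.1, 0.2, 0.7] -> e = 3

s = int(start_scores.argmax()) + 1                # 1-based indices, as in the slide
e = int(end_scores.argmax()) + 1
print(s, e)                                       # the answer span is d_s ... d_e
```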
What does BERT learn?
Lower layers of a language model encode more local syntax, while higher layers capture more complex semantics.
https://arxiv.org/abs/1905.05950
https://openreview.net/pdf?id=SJzSgnRcKX
Multilingual BERT
Trained on 104 languages. With task-specific training data available only in English (labelled English examples for classes 1, 2 and 3), the fine-tuned model can still be tested on task-specific Chinese data whose labels are unknown: zero-shot cross-lingual transfer.
https://arxiv.org/abs/1904.09077
GPT
A Transformer uses its encoder stack to model the input and its decoder stack to model the output (using input information from the encoder side). Dropping the decoder and keeping only the encoder as a language model of the input gives us BERT. Conversely, if there is no separate input and we just want to model the "next word", we can get rid of the encoder side and output the next word one token at a time; this gives us GPT.
Generative Pre-Training (GPT)
GPT is essentially the decoder of the Transformer. For scale, ELMO has about 94M parameters and BERT about 340M; GPT-2 is larger still (see the parameter counts below).
(Source of image: https://huaban.com/pins/1714071707/)
The now-famous unicorn example
Generative Pre-Training (GPT)
GPT generates text autoregressively with masked self-attention. Given the tokens produced so far, each new position attends only to the earlier positions (e.g. $\hat{\alpha}_{2,1}, \hat{\alpha}_{2,2}$ produce $b^2$), the result passes through many layers, and the model predicts the next word ("little"). That word is appended to the input and the process repeats ($\hat{\alpha}_{3,1}, \hat{\alpha}_{3,2}, \hat{\alpha}_{3,3}$ produce $b^3$, predicting "monkeys"), one token at a time.
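A sketch of the autoregressive loop. The `language_model` function is only a stand-in for the stack of masked self-attention layers, returning toy scores; a real GPT would return a distribution over its full vocabulary:

```python
import random

def language_model(prefix):
    """Stand-in for GPT: return scores over the next word given the prefix."""
    vocab = ["five", "little", "monkeys", "jumping", "on", "the", "bed", "<eos>"]
    rng = random.Random(len(prefix))              # deterministic toy scores
    return {w: rng.random() for w in vocab}

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        scores = language_model(tokens)           # attends only to already-generated tokens
        next_word = max(scores, key=scores.get)   # greedy choice of the next word
        if next_word == "<eos>":
            break
        tokens.append(next_word)                  # feed it back in and continue
    return " ".join(tokens)

print(generate("five little"))
```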
Improving Language Understanding by Generative Pre-Training (GPT-1)
Supervised models have two major limitations:
– They need a large amount of annotated data to learn a particular task, which is often not easily available.
– They fail to generalize to tasks other than the ones they were trained for.
The GPT-1 paper proposed learning a generative language model on unlabeled data and then fine-tuning it with examples of specific downstream tasks such as classification, sentiment analysis, and textual entailment.
Improving Language Understanding by Generative Pre-Training (GPT-1)
– Unsupervised language modelling (pre-training): a standard language-model objective is used over the unsupervised token sequence $\{t_1, \dots, t_n\}$, where $k$ is the size of the context window and the neural network parameters $\theta$ are trained with stochastic gradient descent.
– Supervised fine-tuning: maximise the likelihood of observing label $y$ given the features (tokens) $x_1, \dots, x_n$, summed over the labelled dataset $C$ of training examples.
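Reconstructed in the slide's notation (the formulas themselves appear only as images in the original deck), the two objectives from the GPT-1 paper are:
$$L_1(T) = \sum_i \log P\!\left(t_i \mid t_{i-k}, \dots, t_{i-1}; \theta\right)$$
$$L_2(C) = \sum_{(x, y) \in C} \log P\!\left(y \mid x_1, \dots, x_n\right)$$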
Improving Language Understanding by Generative Pre-Training (GPT-1)
– Task-specific input transformations: to keep changes to the model architecture minimal during fine-tuning, the inputs of each downstream task are transformed into ordered sequences (for example, by concatenating sentence pairs with a delimiter token).
Language Models are Unsupervised Multitask Learners (GPT-2)
The developments in GPT-2 were mostly a larger dataset (from about 6GB to 40GB of text) and more parameters, to learn an even stronger language model. The released model sizes were 117M, 345M, 762M, and 1542M parameters.
Language Models are Few-Shot Learners (GPT-3)
GPT-3 keeps the same architecture as GPT and GPT-2: a vanilla Transformer with small upgrades accumulated over time, but with far more parameters than GPT-1, GPT-2, or BERT. Its training reportedly used about 45TB of data, 285,000 CPU cores, 10,000 GPUs, and roughly $12,000,000 of compute.
(Figure slides: further GPT-3 examples and results.)
Thank you!