
The University of Sydney Page 1

Transformer Neural
Networks

Dr Chang Xu

School of Computer Science

The University of Sydney Page 2

AI’s Big Breakthroughs in 2020

GPT-3, the third version of the Generative Pre-trained Transformer model, released by OpenAI.

Developing web apps: given a sentence describing the layout of the Google home page, GPT-3 generates the code for it.

Building ML models: given a plain-text description of the ML model we want, GPT-3 generates the corresponding Keras code.

The University of Sydney Page 3

[Figure: an input sequence a^1, a^2, a^3, a^4 mapped to outputs b^1, b^2, b^3, b^4 by a Self-Attention Layer.]

Self-Attention Layer

You can replace anything that has been done by an RNN with self-attention.

Self-Attention

Attention is all
you need.

The University of Sydney Page 4

[Figure: each input x^i is embedded as a^i, from which a query q^i, a key k^i and a value v^i are computed.]

q: query (to match others)
k: key (to be matched)
v: value (information to be extracted)

a^i = W x^i
q^i = W^q a^i
k^i = W^k a^i
v^i = W^v a^i

Self-Attention

The University of Sydney Page 5

[Figure: the query q^1 is matched against every key k^i by a dot product, giving attention scores α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4}.]

Scaled Dot-Product Attention:

α_{1,i} = q^1 · k^i / √d, where d is the dimension of q and k.

Self-Attention

The University of Sydney Page 6

[Figure: the scores α_{1,1}, ..., α_{1,4} are passed through a soft-max to give normalized weights α̂_{1,1}, ..., α̂_{1,4}.]

Soft-max:

α̂_{1,i} = exp(α_{1,i}) / Σ_j exp(α_{1,j})

Self-Attention

The University of Sydney Page 7

[Figure: b^1 is the weighted sum of all the values v^i, with weights α̂_{1,i}.]

Considering the whole sequence:

b^1 = Σ_i α̂_{1,i} v^i
Self-Attention
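To make the three steps above concrete, here is a minimal NumPy sketch for one query in a toy 4-token sequence; the dimensions, random inputs and weight matrices are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # dimension of q and k
A = rng.normal(size=(d, 4))                # columns are a^1 ... a^4

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = W_q @ A, W_k @ A, W_v @ A        # q^i, k^i, v^i stacked as columns

# Scaled dot-product scores for query 1: alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1 = (Q[:, 0] @ K) / np.sqrt(d)

# Soft-max: alpha_hat_{1,i} = exp(alpha_{1,i}) / sum_j exp(alpha_{1,j})
alpha_hat_1 = np.exp(alpha_1) / np.exp(alpha_1).sum()

# Output: b^1 = sum_i alpha_hat_{1,i} v^i
b_1 = V @ alpha_hat_1
print(b_1.shape)                           # (8,)
```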

The University of Sydney Page 8

[Figure: b^2 is computed in the same way, using weights α̂_{2,1}, ..., α̂_{2,4}.]

b^2 = Σ_i α̂_{2,i} v^i
Self-Attention

The University of Sydney Page 9

[Figure: the Self-Attention Layer maps the inputs x^1, ..., x^4 (via a^1, ..., a^4) to outputs b^1, ..., b^4.]

Self-Attention Layer

b^1, b^2, b^3, b^4 can be computed in parallel.
Self-Attention

The University of Sydney Page 10

[Figure: the projections for all positions computed as matrix products.]

q^i = W^q a^i,  k^i = W^k a^i,  v^i = W^v a^i

Stacking a^1, a^2, a^3, a^4 as the columns of a matrix I:

Q = [q^1 q^2 q^3 q^4] = W^q I
K = [k^1 k^2 k^3 k^4] = W^k I
V = [v^1 v^2 v^3 v^4] = W^v I

Self-Attention

The University of Sydney Page 11

α_{1,1} = q^1 · k^1
α_{1,2} = q^1 · k^2      (ignore √d for simplicity)
α_{1,3} = q^1 · k^3
α_{1,4} = q^1 · k^4

Collecting these into one matrix-vector product, with the keys stacked as the rows of K^T:

(α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4})^T = K^T q^1

Self-Attention

The University of Sydney Page 12

b^i = Σ_j α̂_{i,j} v^j

Doing this for all queries at once:

A = K^T Q, where the entry in row j, column i is α_{i,j} = q^i · k^j

Â = softmax(A), applied to each column: α̂_{i,j} = exp(α_{i,j}) / Σ_k exp(α_{i,k})

Self-Attention

The University of Sydney Page 13

b^i = Σ_j α̂_{i,j} v^j

In matrix form, all the outputs at once:

O = [b^1 b^2 b^3 b^4] = V Â

Self-Attention

The University of Sydney Page 14

Putting it all together (I contains a^1, ..., a^4 as columns, O contains b^1, ..., b^4 as columns):

Q = W^q I,  K = W^k I,  V = W^v I

A = K^T Q,  Â = softmax(A)

O = V Â
Self-Attention
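The whole layer is therefore a handful of matrix multiplications. Below is a minimal NumPy sketch of the matrix form on this slide; the toy sizes and random weights are illustrative assumptions:

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Matrix form from the slides: I has one column per input position."""
    d = W_q.shape[0]                              # dimension of q and k
    Q, K, V = W_q @ I, W_k @ I, W_v @ I           # Q = W^q I, K = W^k I, V = W^v I
    A = (K.T @ Q) / np.sqrt(d)                    # A = K^T Q (scaled)
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)     # column-wise softmax
    return V @ A_hat                              # O = V A_hat

rng = np.random.default_rng(0)
d_model, n = 8, 4
I = rng.normal(size=(d_model, n))                 # columns are a^1 ... a^4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
O = self_attention(I, W_q, W_k, W_v)              # columns are b^1 ... b^4
print(O.shape)                                    # (8, 4)
```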

The University of Sydney Page 15

(2 heads as example)

[Figure: at positions i and j, each of q, k, v is split into two heads: q^{i,1}, q^{i,2}, k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2} (and likewise at position j). Head 1 of position i attends only to head 1 of the other positions, producing b^{i,1}.]

q^i = W^q a^i
q^{i,1} = W^{q,1} q^i
q^{i,2} = W^{q,2} q^i

(and similarly for k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2})

Multi-head Self-Attention

The University of Sydney Page 16

(2 heads as example)

[Figure: head 2 of position i attends only to head 2 of the other positions, producing b^{i,2} alongside b^{i,1}.]

q^i = W^q a^i
q^{i,1} = W^{q,1} q^i
q^{i,2} = W^{q,2} q^i

Multi-head Self-Attention

The University of Sydney Page 17

(2 heads as example)

[Figure: the per-head outputs b^{i,1} and b^{i,2} are concatenated and projected to give the final output b^i.]

b^i = W^O [b^{i,1} ; b^{i,2}]

Multi-head Self-Attention
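A minimal two-head sketch in NumPy; splitting Q, K, V by rows stands in for the per-head matrices W^{q,1}, W^{q,2}, etc., and the sizes and the output projection W_o are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention in matrix form: O = V softmax(K^T Q)."""
    A = (K.T @ Q) / np.sqrt(Q.shape[0])
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)
    return V @ A_hat

rng = np.random.default_rng(0)
d_model, n, n_heads = 8, 4, 2
d_head = d_model // n_heads

I = rng.normal(size=(d_model, n))                           # a^1 ... a^4 as columns
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = W_q @ I, W_k @ I, W_v @ I

# Split q, k, v into two heads and attend within each head separately.
heads = []
for h in range(n_heads):
    rows = slice(h * d_head, (h + 1) * d_head)              # plays the role of W^{q,h}, W^{k,h}, W^{v,h}
    heads.append(attention(Q[rows], K[rows], V[rows]))       # b^{i,h} for every position i

# Concatenate the head outputs and project: b^i = W^O [b^{i,1} ; b^{i,2}]
W_o = rng.normal(size=(d_model, n_heads * d_head))
B = W_o @ np.vstack(heads)                                   # b^1 ... b^4 as columns
print(B.shape)                                               # (8, 4)
```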

The University of Sydney Page 18

Positional Encoding

– No position information in self-attention.
– Original paper: each position has a unique positional vector e^i (not learned from data).
– Equivalently: append a one-hot position vector p^i to each x^i.

[Figure: the positional vector e^i is added to the embedding a^i before computing q^i, k^i, v^i.]

p^i = (0, ..., 0, 1, 0, ..., 0)^T, with the 1 in the i-th dimension.

Appending p^i to x^i and multiplying by W = [W^I, W^P] gives

W [x^i ; p^i] = W^I x^i + W^P p^i = a^i + e^i,

so adding e^i to a^i is equivalent to concatenating a one-hot position vector to the input.

The University of Sydney Page 19

source of image: http://jalammar.github.io/illustrated-transformer/

[Figure: visualization of the positional encoding matrix, with values between -1 and 1.]

e(i, 2j)   = sin(i / 10000^(2j/d))
e(i, 2j+1) = cos(i / 10000^(2j/d))

where i is the position and j indexes the dimension.
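A short NumPy sketch of this sinusoidal encoding; the sequence length and dimension are illustrative:

```python
import numpy as np

def positional_encoding(n_positions, d):
    """e(i, 2j) = sin(i / 10000^(2j/d)), e(i, 2j+1) = cos(i / 10000^(2j/d))."""
    E = np.zeros((n_positions, d))
    positions = np.arange(n_positions)[:, None]          # i
    rates = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)   # 1 / 10000^(2j/d)
    E[:, 0::2] = np.sin(positions * rates)
    E[:, 1::2] = np.cos(positions * rates)
    return E

E = positional_encoding(50, 16)   # one row e^i per position
print(E.shape)                    # (50, 16)
```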

The University of Sydney Page 20

Seq2seq with Attention

[Figure: a seq2seq model in which both the encoder (x^1, ..., x^4 → h^1, ..., h^4) and the decoder (producing o^1, o^2, o^3 from context vectors c^1, c^2, c^3) are built from self-attention layers instead of RNNs.]

The University of Sydney Page 21

[Figure: the Transformer encoder-decoder architecture. Using Chinese-to-English translation as an example, the encoder reads 神经网络 and the decoder outputs "Neural Network".]

Transformer

The University of Sydney Page 22

[Figure: the Transformer block. Each sub-layer output is added to its input through a residual connection (b' = a + b) and then passed through Layer Norm. In the decoder, the masked self-attention attends only on the already generated sequence, while the encoder-decoder attention attends on the input sequence.]

Batch Norm normalizes each feature across the batch to μ = 0, σ = 1; Layer Norm normalizes all the features of one example (one layer) to μ = 0, σ = 1.

Transformer
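A minimal NumPy sketch contrasting the two normalizations on a (batch, features) array; the shapes are illustrative and the learnable scale/shift parameters of a real Layer Norm are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))   # (batch, features)

# Batch Norm: normalize each feature over the batch dimension.
batch_norm = (x - x.mean(axis=0)) / x.std(axis=0)

# Layer Norm: normalize each example over its own features.
layer_norm = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

print(batch_norm.mean(axis=0).round(6))   # ~0 for every feature
print(layer_norm.mean(axis=1).round(6))   # ~0 for every example
```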

The University of Sydney Page 23

Attention Visualization

https://arxiv.org/abs/1706.03762

The University of Sydney Page 24

Attention Visualization

The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a
Transformer trained on English to French translation (one of eight attention heads).

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

The University of Sydney Page 25

Multi-head Attention

The University of Sydney Page 26

BERT

A Transformer uses an encoder stack to model the input and a decoder stack to model the output (using information from the encoder side).

If we are only interested in training a language model of the input for some other task, we do not need the decoder of the Transformer; keeping only the encoder gives us BERT.

The University of Sydney Page 27

1-of-N Encoding:

apple    = [1 0 0 0 0]
bag      = [0 1 0 0 0]
cat      = [0 0 1 0 0]
dog      = [0 0 0 1 0]
elephant = [0 0 0 0 1]

Word Class:

class 1: dog, cat, bird;  class 2: ran, jumped, walk;  class 3: flower, tree, apple

Word Embedding:

[Figure: words plotted in a continuous embedding space, with related words (dog, cat, rabbit), (jump, run) and (flower, tree) close to each other.]

The University of Sydney Page 28

A word can have multiple senses.

Have you paid that money to the bank yet ?
It is safest to deposit your money in the bank .

The victim was found lying dead on the river bank .
They stood on the river bank to fish.

The hospital has its own blood bank.

Is this a third sense of "bank", or not?

The University of Sydney Page 29

[Figure: the word "bank" in "money in the bank", "river bank" and "blood bank" receives a different contextualized word embedding in each sentence.]

Contextualized Word Embedding

The University of Sydney Page 30

Embeddings from Language Model (ELMO)

– RNN-based language models (trained from lots of sentences)
e.g. given “five little monkeys jumping on the bed”

[Figure: RNN language models trained to predict the next word (forward) and the previous word (backward) in "five little monkeys jumping on the bed"; the RNN hidden states serve as contextualized embeddings of each token.]

The University of Sydney Page 31

ELMO

[Figure: a deep bidirectional LSTM language model; every layer produces its own hidden representations h^1, h^2, ... for each token.]

Each layer in a deep LSTM can generate a latent representation. Which one should we use?

The University of Sydney Page 32

ELMO

[Figure: for each token of "five little monkeys jumping ...", ELMO combines the representations from the different layers into a single embedding.]

ELMO embedding = α_1 h^1 + α_2 h^2

The weights α_1, α_2 (large or small) are learned together with the downstream task.

The University of Sydney Page 33

Bidirectional Encoder Representations from
Transformers (BERT)

– BERT = Encoder of Transformer

[Figure: BERT is the encoder of the Transformer; it takes the sequence "five little monkeys jumping ..." and outputs one embedding per token.]

Learned from a large amount of text without annotation.

The University of Sydney Page 34

Training of BERT

– Approach 1:
Masked LM

[Figure: the input "five little monkeys jumping ..." with "little" replaced by [MASK] is fed into BERT; the output at the masked position goes through a linear multi-class classifier (one class per word in the vocabulary) that predicts the masked word.]
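A tiny NumPy sketch of this prediction head; the hidden size, vocabulary size and the random "BERT output" are illustrative placeholders, not real BERT values:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 768, 30000

h_mask = rng.normal(size=hidden)          # BERT output at the [MASK] position
W, b = rng.normal(size=(vocab, hidden)), np.zeros(vocab)

logits = W @ h_mask + b                   # linear multi-class classifier
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # distribution over the vocabulary
predicted_id = int(probs.argmax())        # id of the predicted masked word
```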

The University of Sydney Page 35

Training of BERT

Approach 2: Next Sentence Prediction

[Figure: two sentences, "... jumping on the bed" and "one fell off ...", are concatenated as "[CLS] ... [SEP] ...", fed into BERT, and the output at [CLS] goes through a linear binary classifier that answers "yes" (the second sentence follows the first).]

[CLS]: the position that outputs the classification result.
[SEP]: the boundary between the two sentences.

Approaches 1 and 2 are used at the same time.

The University of Sydney Page 36

Training of BERT (Approach 2: Next Sentence Prediction)

[Figure: with "[CLS] ... jumping ... bed [SEP] Deep neural network ..." as input, the linear binary classifier at [CLS] answers "no" (the second sentence does not follow the first).]

[CLS]: the position that outputs the classification result.
[SEP]: the boundary between the two sentences.

Approaches 1 and 2 are used at the same time.

The University of Sydney Page 37

How to use BERT – Case 1

[Figure: the input "[CLS] w1 w2 w3" is fed into BERT; the output at [CLS] goes through a linear classifier that predicts a class. The linear classifier is trained from scratch, while BERT is fine-tuned.]

Input: a single sentence; output: a class.

Examples: sentiment analysis, document classification.
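As one concrete way to set this up (not shown in the slides), the Hugging Face transformers library provides BERT with a linear classification head on [CLS]; the model name, number of labels and example sentence below are illustrative choices:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # linear classifier on [CLS], trained from scratch

inputs = tokenizer("this movie was great", return_tensors="pt")
labels = torch.tensor([1])               # e.g. 1 = positive sentiment

outputs = model(**inputs, labels=labels) # BERT itself is fine-tuned end to end
loss, logits = outputs.loss, outputs.logits
loss.backward()                          # one step of fine-tuning (optimizer omitted)
```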

The University of Sydney Page 38

How to use BERT – Case 2

[Figure: the input "[CLS] w1 w2 w3" is fed into BERT; the output at each word position goes through a linear classifier that predicts a class for that word.]

Input: a single sentence; output: a class for each word.

Example: slot filling.

The University of Sydney Page 39

How to use BERT – Case 3

[Figure: the input "[CLS] w1 w2 [SEP] w3 w4 w5" (Sentence 1, then Sentence 2) is fed into BERT; the output at [CLS] goes through a linear classifier that predicts a class.]

Input: two sentences; output: a class.

Example: Natural Language Inference. Given a "premise", determine whether a "hypothesis" is true, false, or unknown.

The University of Sydney Page 40

How to use BERT – Case 4

– Extraction-based Question Answering (QA), e.g. SQuAD

Document:  D = {d_1, d_2, ..., d_N}
Query:     Q = {q_1, q_2, ..., q_M}

The QA model takes D and Q as input and outputs two integers (s, e); the answer is the span of the document

Answer:    A = {d_s, ..., d_e}

[Figure: SQuAD examples in which the answer is token 17 of the document (s = 17, e = 17) or tokens 77 to 79 (s = 77, e = 79).]

The University of Sydney Page 41

How to use BERT – Case 4

[Figure: the input "[CLS] q1 q2 [SEP] d1 d2 d3" (question, then document) is fed into BERT. A vector learned from scratch is dot-producted with the output embedding of every document token; a softmax over the resulting scores puts the highest probability (0.5) on d2, which selects the start position s = 2.]

The answer is "d2 d3": s = 2, e = 3.
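A small NumPy sketch of this span-selection step; the hidden size, the random "BERT outputs" and the two learned vectors are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, doc_len = 768, 3                      # document tokens d1 d2 d3

H_doc = rng.normal(size=(doc_len, hidden))    # BERT outputs at d1, d2, d3
start_vec = rng.normal(size=hidden)           # learned from scratch
end_vec = rng.normal(size=hidden)             # learned from scratch

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

start_probs = softmax(H_doc @ start_vec)      # dot product + softmax over d1..d3
end_probs = softmax(H_doc @ end_vec)

s = int(start_probs.argmax()) + 1             # 1-indexed start position
e = int(end_probs.argmax()) + 1               # 1-indexed end position
print(s, e)                                   # the answer span is d_s ... d_e
```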

The University of Sydney Page 42

How to use BERT – Case 4

[Figure: a second vector, also learned from scratch, is dot-producted with the output embedding of every document token; the softmax over its scores puts the highest probability (0.7) on d3, which selects the end position e = 3.]

The answer is "d2 d3": s = 2, e = 3.

The University of Sydney Page 43

What does BERT learn?

https://arxiv.org/abs/1905.05950
https://openreview.net/pdf?id=SJzSgnRcKX

Lower layers of a language model encode more local syntax, while higher layers capture more complex semantics.

The University of Sydney Page 44

Multilingual BERT
https://arxiv.org/abs/1904.09077

Trained on 104 languages

[Figure: task-specific training data is available only in English (English sentences labelled Class 1, Class 2, Class 3), while the task-specific test data is in Chinese (Chinese sentences whose classes are unknown); multilingual BERT can be fine-tuned on the English data and applied to the Chinese data.]

The University of Sydney Page 45

GPT

A Transformer uses an encoder stack to model the input and a decoder stack to model the output (using information from the encoder side).

If we are only interested in training a language model of the input for some other task, we do not need the decoder of the Transformer; keeping only the encoder gives us BERT.

But if there is no separate input and we just want to model the "next word", we can get rid of the encoder side of the Transformer and output the next word one by one. This gives us GPT.

The University of Sydney Page 46

Generative Pre-Training (GPT)

Source of image: https://huaban.com/pins/1714071707/

[Figure: parameter counts of ELMO (94M), BERT (340M) and GPT, which is a stack of Transformer decoders.]

The University of Sydney Page 47

The now-famous unicorn example

The University of Sydney Page 48

[Figure: given the tokens generated so far, GPT self-attends over them (weights α̂_{2,1}, α̂_{2,2}), passes b^2 through many layers, and predicts the next word "little" after "five".]

Generative Pre-Training (GPT)

The University of Sydney Page 49

[Figure: the generated word "little" is fed back in; GPT attends over all the tokens so far (weights α̂_{3,1}, α̂_{3,2}, α̂_{3,3}), computes b^3 through many layers, and predicts the next word "monkeys".]

Generative Pre-Training (GPT)

The University of Sydney Page 50

Improving Language Understanding by
Generative Pre-training (GPT-1):

Supervised models have two major limitations:
– They need a large amount of annotated data to learn a particular task, and such data is often not easily available.
– They fail to generalize to tasks other than the ones they have been trained for.

This paper proposed learning a generative language model on unlabeled data and then fine-tuning it on examples of specific downstream tasks such as classification, sentiment analysis and textual entailment.

The University of Sydney Page 51

Unsupervised Language Modelling (Pre-training): for unsupervised learning, the standard language-model objective L_1 was used (reproduced below), where T = {t_1, ..., t_n} is the set of tokens in the unlabeled data, k is the size of the context window, and θ are the parameters of the neural network, trained using stochastic gradient descent.

Supervised Fine-Tuning: this part maximizes the likelihood L_2 of observing the label y given the tokens x_1, ..., x_n (reproduced below), where C is the labeled dataset of training examples.
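The two objectives referred to above were lost in the extracted slide text; as written in the GPT-1 paper, they are:

```latex
% Unsupervised pre-training: standard language-model objective
\[ L_1(\mathcal{T}) = \sum_i \log P\left(t_i \mid t_{i-k}, \ldots, t_{i-1}; \theta\right) \]

% Supervised fine-tuning: likelihood of the label y given tokens x_1, ..., x_n
\[ L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x_1, \ldots, x_n\right) \]
```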

Improving Language Understanding by
Generative Pre-training (GPT-1):

The University of Sydney Page 52

Improving Language Understanding by
Generative Pre-training (GPT-1):

Task-Specific Input Transformations: in order to make minimal changes to the model architecture during fine-tuning, the inputs to the specific downstream tasks were transformed into ordered sequences.

The University of Sydney Page 53

Language Models are unsupervised multitask learners
(GPT-2)

The developments in GPT-2 were mostly about using a larger dataset and adding more parameters to the model, to learn an even stronger language model.

#parameters of the four GPT-2 sizes: 117M, 345M, 762M, 1542M

Training data: 6 GB -> 40 GB

The University of Sydney Page 54

Language Models are Few-Shot Learners (GPT-3):

GPT-3 uses the same architecture as GPT-1 and GPT-2: a vanilla Transformer with small upgrades accumulated over time.

[Figure: the number of parameters of GPT-1, BERT, GPT-2 and GPT-3 compared.]

45 TB of data
285,000 CPUs
10,000 GPUs
$12,000,000

The University of Sydney Page 55

GPT-3

The University of Sydney Page 56

GPT-3

The University of Sydney Page 57

GPT-3

The University of Sydney Page 58

Thank you !