Sequence-to-sequence models
▶ Sequence goes in, sequence comes out
▶ Sequence-to-sequence models are a powerful learning framework that has found success in a wide range of applications:
▶ Automatic Speech Recognition (ASR): a sound stream goes in, text comes out
▶ Machine Translation (MT): a source-language sentence goes in, a target-language sentence comes out
▶ Image captioning: an image goes in, a caption comes out
▶ Text summarization: a whole text goes in, a summary comes out
▶ Automatic email responses: generating automatic replies to incoming emails
▶ etc.
Translation
The encoder-decoder architecture
▶ The encoder network converts the source sentence into a vector or matrix representation; the decoder network then converts the encoding into a sentence in the target language:

  $z = \text{Encode}(w^{(s)})$
  $w^{(t)} \mid w^{(s)} \sim \text{Decode}(z)$

  where the second line means that the decoder defines the conditional probability $P(w^{(t)} \mid w^{(s)})$.
▶ The decoder can be a recurrent neural network (e.g., an LSTM) that generates one word at a time while recurrently updating a hidden state, or it can be a transformer.
▶ The encoder and decoder networks are trained end-to-end from parallel sentences (a minimal sketch of this interface follows below).
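The NumPy sketch below illustrates the split between the two networks: encode() maps the source token ids to a single vector z, and decode_step() uses z and the previously generated word to produce a distribution over the next target word. All names, sizes, and parameter values here are illustrative placeholders, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, D = 12, 10, 8                  # toy vocabulary sizes and hidden size
E_src = 0.1 * rng.normal(size=(V_src, D))    # source word embeddings
E_tgt = 0.1 * rng.normal(size=(V_tgt, D))    # target word embeddings (also used as output vectors)
W_s, U_s = 0.1 * rng.normal(size=(D, D)), 0.1 * rng.normal(size=(D, D))
W_t, U_t = 0.1 * rng.normal(size=(D, D)), 0.1 * rng.normal(size=(D, D))

def encode(src_ids):
    """Run a simple tanh RNN over the source and return its final state as z."""
    h = np.zeros(D)
    for w in src_ids:
        h = np.tanh(W_s @ E_src[w] + U_s @ h)
    return h

def decode_step(h, prev_word):
    """One decoder step: update the hidden state, return P(next word | ...)."""
    h = np.tanh(W_t @ E_tgt[prev_word] + U_t @ h)
    scores = E_tgt @ h                       # one score per target word
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()

BOS, EOS = 0, 1                              # assumed special token ids
z = encode([5, 3, 7])                        # a "source sentence" of token ids
h, word = z, BOS                             # z initializes the decoder state
for _ in range(10):                          # greedy decoding
    h, p = decode_step(h, word)
    word = int(np.argmax(p))
    if word == EOS:
        break
```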
Encoder-decoder

[Figure: an encoder-decoder network. The encoder reads the source sentence "women shi xuesheng" and produces a context vector C; the decoder uses C to generate the target sentence "we are students".]
Training objective
If the output layer of the decoder is a softmax, the entire network can be trained to maximize the conditional log-likelihood (equivalently, to minimize the negative log-likelihood):

  $\log P(w^{(t)} \mid w^{(s)}) = \sum_{m=1}^{M^{(t)}} \log P(w^{(t)}_m \mid w^{(t)}_{1:m-1}, w^{(s)})$

  $w^{(t)}_m \mid w^{(t)}_{1:m-1}, w^{(s)} \sim \text{SoftMax}\left(\beta \cdot h^{(t)}_{m-1}\right)$

where $h^{(t)}_{m-1}$ is a recurrent function of the previously generated text $w^{(t)}_{1:m-1}$ and the encoding $z$, and $\beta \in \mathbb{R}^{V^{(t)} \times K}$ is the matrix of output word vectors for the $V^{(t)}$ words in the target-language vocabulary.
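A minimal NumPy sketch of this objective, assuming the decoder states and the output word vectors $\beta$ have already been computed; the function name and shapes are illustrative, not from any particular library.

```python
import numpy as np

def neg_log_likelihood(H_dec, beta, target_ids):
    """Negative conditional log-likelihood of one target sentence.

    H_dec:      (M, K) decoder states h^(t)_{m-1} used to predict word m
    beta:       (V_tgt, K) output word vectors, one row per target word
    target_ids: length-M list of gold target word indices
    """
    scores = H_dec @ beta.T                          # (M, V_tgt)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

# toy check: 3 decoding steps, a vocabulary of 5 words
rng = np.random.default_rng(1)
H = rng.normal(size=(3, 4))
beta = rng.normal(size=(5, 4))
print(neg_log_likelihood(H, beta, [2, 0, 4]))
```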
The LSTM variant
▶ In the LSTM variant, the encoding is set to the final hidden state of an LSTM run over the source sentence:

  $h^{(s)}_m = \text{LSTM}(x^{(s)}_m, h^{(s)}_{m-1})$
  $z \triangleq h^{(s)}_{M^{(s)}}$

  where $x^{(s)}_m$ is the embedding of the source-language word $w^{(s)}_m$.
▶ The encoding then provides the initial hidden state for the decoder LSTM:

  $h^{(t)}_0 = z$
  $h^{(t)}_m = \text{LSTM}(x^{(t)}_m, h^{(t)}_{m-1})$

  where $x^{(t)}_m$ is the embedding of the target-language word $w^{(t)}_m$ (see the sketch below).
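A rough NumPy sketch of this hand-off, using a standard (peephole-free) LSTM cell with randomly initialized placeholder parameters; the helper names are invented for illustration. The only point is that the encoder's final state becomes z and initializes the decoder's hidden state.

```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, P):
    """One LSTM step; P holds the input, forget, output and cell-update weights."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    def gate(name, act):
        W, U, b = P[name]
        return act(W @ x + U @ h_prev + b)
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    c = f * c_prev + i * gate("g", np.tanh)      # new memory cell
    h = o * np.tanh(c)                           # new hidden state
    return h, c

def make_params(d_in, d_h, rng):
    return {k: (0.1 * rng.normal(size=(d_h, d_in)),
                0.1 * rng.normal(size=(d_h, d_h)),
                np.zeros(d_h)) for k in ("i", "f", "o", "g")}

rng = np.random.default_rng(0)
D, K = 6, 8                                      # embedding and hidden sizes
enc_P, dec_P = make_params(D, K, rng), make_params(D, K, rng)

# encoder: run the LSTM over the source embeddings x^(s)_1..M
X_src = 0.1 * rng.normal(size=(5, D))
h, c = np.zeros(K), np.zeros(K)
for x in X_src:
    h, c = lstm_cell(x, h, c, enc_P)
z = h                                            # z = h^(s)_{M^(s)}

# decoder: z provides the initial hidden state h^(t)_0
X_tgt = 0.1 * rng.normal(size=(4, D))
h, c = z, np.zeros(K)
for x in X_tgt:
    h, c = lstm_cell(x, h, c, dec_P)
```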
Tweaking the encoder-decoder network
▶ Adding layers: the encoder and decoder can be implemented as deep LSTMs with multiple layers of hidden states
▶ Adding attention: give more weight to a particular word or words in the source language when generating a word in the target language
Multi-layered LSTMs
Each hidden state $h^{(s,i)}_m$ at layer $i$ is treated as the input to an LSTM at layer $i+1$:

  $h^{(s,1)}_m = \text{LSTM}(x^{(s)}_m, h^{(s,1)}_{m-1})$
  $h^{(s,i+1)}_m = \text{LSTM}(h^{(s,i)}_m, h^{(s,i+1)}_{m-1}), \quad \forall i \geq 1$
[Figure 18.5: a deep bidirectional LSTM encoder. Each layer's hidden states $h^{(s,i)}_m$ feed the layer above, and the top-layer states form the encoding.]
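A sketch of the stacking idea, with a plain tanh RNN cell standing in for the LSTM cell of each layer; the parameter values and names are placeholders. The hidden state produced at layer i becomes the input to layer i+1 at the same time step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, LAYERS = 6, 8, 3                        # embedding size, hidden size, depth

# one (W, U, b) triple per layer; layer 0 reads the word embedding,
# higher layers read the hidden state of the layer below
params = [(0.1 * rng.normal(size=(K, D if i == 0 else K)),
           0.1 * rng.normal(size=(K, K)),
           np.zeros(K)) for i in range(LAYERS)]

X_src = 0.1 * rng.normal(size=(5, D))         # source word embeddings x^(s)_1..5
h = [np.zeros(K) for _ in range(LAYERS)]      # one recurrent state per layer

for x in X_src:                               # left to right over the source
    inp = x
    for i, (W, U, b) in enumerate(params):
        # layer i+1 reads the hidden state produced by layer i at this step
        h[i] = np.tanh(W @ inp + U @ h[i] + b)
        inp = h[i]

z = h[-1]                                     # encoding: top-layer final state
```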
Neural attention
▶ Attention can be thought of as using a query to select from a memory of key-value pairs, with the keys, values, and queries all being vectors
▶ For each key $n$ in the memory, we compute a score with respect to the query $m$; the score measures the "compatibility" between the key and the query
▶ The scores are passed through an activation function, typically the softmax, which yields a vector of non-negative numbers $[\alpha_{m \to 1}, \alpha_{m \to 2}, \cdots, \alpha_{m \to N}]$ whose length $N$ equals the size of the memory
▶ Multiplying each value $n$ in the memory by the attention weight $\alpha_{m \to n}$ and summing gives the output of the attention
▶ The attention output is typically concatenated with the decoder hidden state to predict the target word (a sketch follows below)
[Figure: a general view of neural attention. A query is scored against each key in the memory; the scores pass through an activation function to give the attention weights $\alpha$, which are applied to the values to produce the output. The compatibility score can be computed with a small feedforward network.]
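A small NumPy sketch of this view of attention, using a plain dot product as the compatibility score (the next slide uses a small feedforward network instead); all names and values are illustrative placeholders.

```python
import numpy as np

def attend(query, keys, values):
    """Soft lookup: score each key against the query, softmax the scores,
    and return the attention-weighted sum of the values."""
    scores = keys @ query                         # one compatibility score per key
    scores -= scores.max()                        # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum() # [a_{m->1}, ..., a_{m->N}]
    return alpha @ values, alpha

rng = np.random.default_rng(0)
N, K = 4, 5                                       # memory size and vector dimension
keys, values = rng.normal(size=(N, K)), rng.normal(size=(N, K))
query = rng.normal(size=K)
output, alpha = attend(query, keys, values)
print(alpha.round(3), alpha.sum())                # non-negative weights summing to 1
```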
Step by step computation of attention
▶ Computing compatibility with a two-layer feedforward network:

  $\psi_{\alpha}(m, n) = v_{\alpha} \cdot \tanh\left(\Theta_{\alpha} [h^{(t)}_m; h^{(s)}_n]\right)$

▶ Softmax attention:

  $\alpha_{m \to n} = \frac{\exp \psi_{\alpha}(m, n)}{\sum_{n'=1}^{M^{(s)}} \exp \psi_{\alpha}(m, n')}$

▶ Compute the weighted average over the columns of $Z$:

  $c_m = \sum_{n=1}^{M^{(s)}} \alpha_{m \to n} z_n$

▶ Incorporate the context vector into the decoding model (see the sketch below):

  $\tilde{h}^{(t)}_m = \tanh\left(\Theta_c [h^{(t)}_m; c_m]\right)$

  $P(w^{(t)}_{m+1} \mid w^{(t)}_{1:m}, w^{(s)}) \propto \exp\left(\beta_{w^{(t)}_{m+1}} \cdot \tilde{h}^{(t)}_m\right)$
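The same steps in NumPy, using the feedforward scoring function from the first bullet; $\Theta_{\alpha}$, $v_{\alpha}$, $\Theta_c$ and the vectors below are random placeholders chosen only so the sketch runs.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M_src = 4, 3                               # hidden size, source length
Theta_a = 0.1 * rng.normal(size=(K, 2 * K))   # feedforward scoring layer
v_a     = 0.1 * rng.normal(size=K)
Theta_c = 0.1 * rng.normal(size=(K, 2 * K))   # combines state and context

Z     = rng.normal(size=(M_src, K))           # source-side vectors z_1..z_M(s)
h_t_m = rng.normal(size=K)                    # decoder state h^(t)_m

# 1. compatibility psi(m, n) via a two-layer feedforward network
psi = np.array([v_a @ np.tanh(Theta_a @ np.concatenate([h_t_m, z_n])) for z_n in Z])

# 2. softmax attention weights alpha_{m->n}
alpha = np.exp(psi - psi.max())
alpha /= alpha.sum()

# 3. context vector: weighted average of the source vectors
c_m = alpha @ Z

# 4. fold the context into the decoding state
h_tilde = np.tanh(Theta_c @ np.concatenate([h_t_m, c_m]))
```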
Seq2seq: Initialization
▶ Word embeddings: each word of the source vocabulary {women, shi, xuesheng} and of the target vocabulary {START, we, are, students, EOS} is mapped to a 2-dimensional vector, e.g.

  $E^{(s)} = \text{embed}(\text{women}; \text{shi}; \text{xuesheng}) = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \qquad E^{(t)}[\text{are}] = [0.2, 0.1], \quad E^{(t)}[\text{EOS}] = [1, 1], \ldots$

▶ RNN parameters:

  $W^{(s)} = W^{(t)} = \begin{bmatrix} 0.3 & 0 \\ 0 & 0.3 \end{bmatrix}, \quad U^{(s)} = U^{(t)} = \begin{bmatrix} 0.1 & 0.1 \\ 0.1 & 0.1 \end{bmatrix}, \quad b^{(s)} = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}, \quad b^{(t)} = \begin{bmatrix} 0.8 \\ 0.2 \end{bmatrix}$
Encoder
Recurrence: $h^{(s)}_n = \tanh(W^{(s)} x_n + U^{(s)} h^{(s)}_{n-1} + b^{(s)})$, with the encoding $z \triangleq h^{(s)}_{M^{(s)}}$

  $h^{(s)}_1 = \tanh(W^{(s)} E[\text{women}] + U^{(s)} \cdot 0 + b^{(s)}) = \begin{bmatrix} 0.3364 \\ 0.5717 \end{bmatrix}$

  $h^{(s)}_2 = \tanh(W^{(s)} E[\text{shi}] + U^{(s)} h^{(s)}_1 + b^{(s)}) = \begin{bmatrix} 0.3153 \\ 0.7289 \end{bmatrix}$

  $h^{(s)}_3 = \tanh(W^{(s)} E[\text{xuesheng}] + U^{(s)} h^{(s)}_2 + b^{(s)}) = \begin{bmatrix} 0.0086 \\ 0.5432 \end{bmatrix}$

  $C = h^{(s)}_3 = \begin{bmatrix} 0.0086 \\ 0.5432 \end{bmatrix}$

where $C$ is the context vector.
Decoder
Recurrence: $h^{(t)}_m = \tanh(W^{(t)} x_m + U^{(t)} h^{(t)}_{m-1} + b^{(t)})$

  $h^{(t)}_1 = \tanh(W^{(t)} E[\text{START}] + U^{(t)} C + b^{(t)}) = \begin{bmatrix} 0.6929 \\ 0.2482 \end{bmatrix}$

  $h^{(t)}_2 = \tanh(W^{(t)} E[\text{we}] + U^{(t)} h^{(t)}_1 + b^{(t)}) = \begin{bmatrix} 0.6203 \\ 0.2123 \end{bmatrix}$

  $h^{(t)}_3 = \tanh(W^{(t)} E[\text{are}] + U^{(t)} h^{(t)}_2 + b^{(t)}) = \begin{bmatrix} 0.6039 \\ 0.1870 \end{bmatrix}$

  $h^{(t)}_4 = \tanh(W^{(t)} E[\text{students}] + U^{(t)} h^{(t)}_3 + b^{(t)}) = \begin{bmatrix} 0.6220 \\ 0.0980 \end{bmatrix}$
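A sketch of the toy encoder and decoder recurrences in NumPy. The embedding and parameter values below are placeholders in the spirit of the slide's toy numbers and are not guaranteed to reproduce the exact values above; the computation pattern is the point.

```python
import numpy as np

# Placeholder toy parameters (illustrative values, not the slide's exact numbers)
E_src = {"women": np.array([0.5, 0.5]), "shi": np.array([0.5, 0.5]),
         "xuesheng": np.array([0.5, 0.5])}
E_tgt = {"START": np.array([0.0, 0.0]), "we": np.array([0.1, 0.2]),
         "are": np.array([0.2, 0.1]), "students": np.array([0.1, 0.1]),
         "EOS": np.array([1.0, 1.0])}
W_s = W_t = np.array([[0.3, 0.0], [0.0, 0.3]])
U_s = U_t = np.array([[0.1, 0.1], [0.1, 0.1]])
b_s, b_t = np.array([0.2, 0.8]), np.array([0.8, 0.2])

def rnn_step(W, U, b, x, h):
    return np.tanh(W @ x + U @ h + b)

# encoder: h^(s)_n = tanh(W^(s) x_n + U^(s) h^(s)_{n-1} + b^(s))
h = np.zeros(2)
for w in ["women", "shi", "xuesheng"]:
    h = rnn_step(W_s, U_s, b_s, E_src[w], h)
C = h                                           # context vector = final state

# decoder: h^(t)_m = tanh(W^(t) x_m + U^(t) h^(t)_{m-1} + b^(t)), started from C
dec_states = []
h = C
for w in ["START", "we", "are", "students"]:
    h = rnn_step(W_t, U_t, b_t, E_tgt[w], h)
    dec_states.append(h)
```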
Softmax over similarities between decoder hidden states and target embeddings
At each decoding step $m$, score every target word by its dot product with the decoder hidden state and normalize with a softmax:

  $\text{score}_m = \text{softmax}\begin{pmatrix} h^{(t)}_m \cdot E[\text{we}] \\ h^{(t)}_m \cdot E[\text{are}] \\ h^{(t)}_m \cdot E[\text{students}] \\ h^{(t)}_m \cdot E[\text{EOS}] \end{pmatrix}$

  $\text{score}_1 = \begin{bmatrix} 0.1913 \\ 0.2000 \\ 0.1733 \\ 0.4354 \end{bmatrix}, \quad \text{score}_2 = \begin{bmatrix} 0.1999 \\ 0.2070 \\ 0.1826 \\ 0.4116 \end{bmatrix}, \quad \text{score}_3 = \begin{bmatrix} 0.2012 \\ 0.2098 \\ 0.1867 \\ 0.4023 \end{bmatrix}, \quad \text{score}_4 = \begin{bmatrix} 0.2037 \\ 0.2147 \\ 0.1959 \\ 0.3857 \end{bmatrix}$

The probability of the reference translation is the product of the probabilities assigned to the reference words (we, are, students, EOS) at the four steps:

  $P(Y \mid X) = \text{score}_1[\text{we}] \times \text{score}_2[\text{are}] \times \text{score}_3[\text{students}] \times \text{score}_4[\text{EOS}]$
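The corresponding scoring step in NumPy, again with placeholder numbers rather than the slide's exact values: each decoder state is dotted with every target embedding, the scores are softmaxed, and the probabilities of the reference words are multiplied to give P(Y|X).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative decoder states and target embeddings (placeholder values);
# rows of E_tgt correspond to we, are, students, EOS.
dec_states = np.array([[0.69, 0.25], [0.62, 0.21], [0.60, 0.19], [0.62, 0.10]])
E_tgt = np.array([[0.1, 0.2],      # we
                  [0.2, 0.1],      # are
                  [0.1, 0.1],      # students
                  [1.0, 1.0]])     # EOS
vocab = ["we", "are", "students", "EOS"]
reference = ["we", "are", "students", "EOS"]    # the gold target sequence

p_y_given_x = 1.0
for h, gold in zip(dec_states, reference):
    score = softmax(E_tgt @ h)                  # distribution over the 4 words
    p_y_given_x *= score[vocab.index(gold)]     # probability of the gold word
print(p_y_given_x)
```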
Attention
▶ The idea: a different context vector $C_m$ is used when generating the target word at each time step, e.g.,

  $C_1 = 0.98 \times h^{(s)}_1 + 0.01 \times h^{(s)}_2 + 0.01 \times h^{(s)}_3$
  $C_2 = 0.01 \times h^{(s)}_1 + 0.98 \times h^{(s)}_2 + 0.01 \times h^{(s)}_3$
  $C_3 = 0.01 \times h^{(s)}_1 + 0.01 \times h^{(s)}_2 + 0.98 \times h^{(s)}_3$

  $C_m = \sum_n \alpha_{m \to n} \times h^{(s)}_n$

  $\alpha_{m \to n} = \frac{\exp\left(\text{score}(h^{(t)}_m, h^{(s)}_n)\right)}{\sum_{n'=1}^{M^{(s)}} \exp\left(\text{score}(h^{(t)}_m, h^{(s)}_{n'})\right)}$

  $\text{score}(h^{(t)}_m, h^{(s)}_n) = h^{(t)\top}_m h^{(s)}_n$

▶ Other scoring variants exist
Computing attention
[Figure: computing the context vector $C_2$ when generating the target word "are": attention weights $a_{2,1}, a_{2,2}, a_{2,3}$ over the source hidden states $h^{src}_1$ (women), $h^{src}_2$ (shi), $h^{src}_3$ (xuesheng) are combined with the decoder state $h^{tgt}_2$ to form $h'_2$.]
Other attention variants
▶ Additive attention:

  $\psi_{\alpha}(m, n) = v_{\alpha} \cdot \tanh\left(\Theta_{\alpha} \left[h^{(t)}_m + h^{(s)}_n\right]\right)$

▶ Multiplicative attention:

  $\psi_{\alpha}(m, n) = h^{(t)\top}_m \Theta_{\alpha} h^{(s)}_n$

All three scoring functions are sketched below.
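A small sketch of the dot-product, multiplicative (bilinear), and additive scoring functions as written on these slides; the matrices and vector below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
h_t, h_s = rng.normal(size=K), rng.normal(size=K)    # decoder / encoder states
Theta = 0.1 * rng.normal(size=(K, K))                # bilinear form
Theta_add = 0.1 * rng.normal(size=(K, K))
v = 0.1 * rng.normal(size=K)

def score_dot(h_t, h_s):
    """Dot-product scoring: h_t . h_s"""
    return h_t @ h_s

def score_multiplicative(h_t, h_s):
    """Multiplicative (bilinear) scoring: h_t^T Theta h_s"""
    return h_t @ Theta @ h_s

def score_additive(h_t, h_s):
    """Additive scoring: v . tanh(Theta (h_t + h_s)), as written on the slide."""
    return v @ np.tanh(Theta_add @ (h_t + h_s))

print(score_dot(h_t, h_s), score_multiplicative(h_t, h_s), score_additive(h_t, h_s))
```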
Drawbacks of RNNs
▶ For RNNs, the input is processed as a sequence: the computation of each state $h_i$ depends on the previous state $h_{i-1}$.
▶ This prevents computing all tokens of the input sequence in parallel, making it difficult to take full advantage of modern compute architectures to increase speed.
▶ We can instead imagine a network in which each token in the sequence interacts with every other token. Conceptually, this can be viewed as a fully-connected graph in which each token is a node, and the computation of its hidden state depends on all other tokens in the graph.
▶ With this approach, the computation of a hidden state $h_i$ does not depend on the computation of any other hidden state; it depends only on the input sequence, so all positions can be computed in parallel.
▶ With this approach, order information has to be captured separately, with positional encodings (see the sketch below).
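For contrast, a minimal NumPy sketch of scaled dot-product self-attention with sinusoidal positional encodings: every position is computed at once from the full input, with no dependence on a previous hidden state. All parameters here are random placeholders, and this is a single attention head without the rest of a transformer block.

```python
import numpy as np

def positional_encoding(n_pos, d):
    """Sinusoidal position encodings, so order information is injected explicitly."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X, Wq, Wk, Wv):
    """Every token attends to every other token; all positions computed at once."""
    Q, K_, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K_.T / np.sqrt(K_.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # one attention row per token
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                       # sequence length, model size
X = rng.normal(size=(n, d)) + positional_encoding(n, d)
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)                 # no h_{i-1} dependence: parallel
```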