Sequence modeling in language
Language modeling
I Language modeling is the task of assigning a probability to a sequence (for evaluating translation candidates).
I It is also the task of assigning a probability to a word following a sequence of words (for language generation).
I Perfect performance in language modeling means being able to predict the next word in a sentence with no more guesses than a human participant would need.
Language Modeling
Formally, the language model can be formulated as:
P(w_{1:M}) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_{1:2}) P(w_4 \mid w_{1:3}) \cdots P(w_M \mid w_{1:M-1})
With Markov assumption:
P(w_{1:M}) \approx \prod_{m=1}^{M} P(w_m \mid w_{m-n+1:m-1})
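The chain-rule factorization and its Markov approximation above can be made concrete with a tiny sketch. This is my own illustration, not from the slides: the bigram probabilities below are hypothetical, and unseen bigrams fall back to a small floor value.

```python
# A minimal sketch: scoring a sentence under a bigram (n = 2) Markov model.
from functools import reduce

# Hypothetical bigram probabilities P(w_m | w_{m-1}); "<s>" marks the sentence start.
bigram_prob = {
    ("<s>", "the"): 0.4,
    ("the", "cat"): 0.1,
    ("cat", "sleeps"): 0.3,
}

def sentence_prob(words, probs, unseen=1e-6):
    """P(w_1:M) ~= product over m of P(w_m | w_{m-1})."""
    history = ["<s>"] + words[:-1]
    return reduce(lambda p, pair: p * probs.get(pair, unseen),
                  zip(history, words), 1.0)

print(sentence_prob(["the", "cat", "sleeps"], bigram_prob))  # 0.4 * 0.1 * 0.3 = 0.012
```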
Text generated by GPT-2
The fact that I work at the university makes me really nervous about my ability to find more people. That’s one of the things that we’re doing. My university is a very small community. Those are things that are going to have the effect, that go all these years and have the impact that it has on my life.” That’s one of the things that makes me really nervous about being a student. If you’re a computer programmer, there’s almost always somebody who you want to study. This is not how I work. Because I’m learning, I have to be very focused. If I am going to get a good job, if I want to get a career, I don’t have to go out into the kitchen, you know. Now, you know, I think if I don’t get the job, then I just don’t have the time. Maybe I feel like I can’t do much, and all of these things are going to be my own business and I want to be able to get a job. I’d certainly be looking at that. I’m at a different time. I’m going to write about my business, my business, my business and I want to talk to you about the life that I’ve been working for the last 30 years or so. So that’s why I’ve been able to be here for the last 15 years through the university. My life has been extremely productive and I’ve had a lot of great things, so I love it.”
Try it yourself: https://gpt2.ai-demo.xyz
Noisy Channel Model for MT
I The Noisy channel model is the general framework for Statistical Machine Translation, with many variants
I Language model is a crucial component of the SMT model that is used to “select” the best translation:
P_{e|s}(w^{(e)} \mid w^{(s)}) \propto P_{s,e}(w^{(e)}, w^{(s)}) = P_{s|e}(w^{(s)} \mid w^{(e)}) \times P_e(w^{(e)})
where w^{(e)} is generated from a language model P_e(w^{(e)}), and w^{(s)} is a Spanish sentence generated from a translation model P_{s|e}(w^{(s)} \mid w^{(e)})
Perplexity: a metric for evaluating language models
I Given a text corpus of M words (where M could be in the millions) w1, w2, w3, · · · wM , a language model function LM assigns a probability to a word based on its history.
\ell(w) = \sum_{m=1}^{M} \log_2 P(w_m \mid w_{m-1}, w_{m-2}, \cdots, w_1)
I The perplexity of the LM with respect to the corpus is:
Perplex(w) = 2^{-\frac{1}{M} \ell(w)}
What counts as a good language model?
I Good language models will assign high probability to the events in the corpus, resulting in low perplexity.
I Perplexities are corpus-specific, and perplexities for two language models are only comparable with respect to the same evaluation corpus.
Extreme Cases of Perplexity
I In the limit of a perfect language model, probability 1 is assigned to the held-out corpus, with
Perplex(w) = 2^{-\frac{1}{M} \log_2 1} = 2^0 = 1
I In the opposite limit, probability zero is assigned to the held-out corpus, which corresponds to infinite perplexity:
Perplex(w) = 2^{-\frac{1}{M} \log_2 0} = 2^{\infty} = \infty
I Assume a uniform, unigram model in which p(w_i) = 1/V for all words in the vocabulary. Then
\ell(w) = \sum_{m=1}^{M} \log_2 \frac{1}{V} = -\sum_{m=1}^{M} \log_2 V = -M \log_2 V
Perplex(w) = 2^{\frac{1}{M} M \log_2 V} = 2^{\log_2 V} = V
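The uniform-model derivation can be checked numerically. The sketch below is my own; the vocabulary size V and corpus length M are hypothetical values, and the helper name perplexity is not from the slides.

```python
# A minimal check that a uniform unigram model with vocabulary size V
# has perplexity exactly V, using the definitions above.
import math

def perplexity(log2_probs):
    """Perplex(w) = 2^(-(1/M) * sum_m log2 P(w_m | history))."""
    M = len(log2_probs)
    return 2 ** (-sum(log2_probs) / M)

V = 10_000                        # hypothetical vocabulary size
M = 500                           # hypothetical corpus length
uniform = [math.log2(1.0 / V)] * M
print(perplexity(uniform))        # ~10000.0: the perplexity equals V
```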
Traditional approaches to language modeling
I Based on an n-order Markov property: P(w_{m+1} \mid w_{1:m}) \approx P(w_{m+1} \mid w_{m-n:m})
I The estimates are usually derived from corpus counts
I The role of the language model is to provide good estimates of \hat{P}(w_{m+1} \mid w_{m-n:m})
I The maximum likelihood (MLE) estimate of \hat{P}(w_{m+1} \mid w_{m-n:m}) is then:
\hat{P}_{MLE}(w_{m+1} \mid w_{m-n:m}) = \frac{\#(w_{m-n:m+1})}{\#(w_{m-n:m})}
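The MLE estimate is just a ratio of corpus counts. Here is a minimal sketch of that idea on a toy corpus; the function names and the example sentence are mine, not the slides', and k denotes the context length (k = 1 gives a bigram model).

```python
# A minimal sketch of the MLE n-gram estimate:
# P_hat_MLE(next | context) = #(context, next) / #(context).
from collections import Counter

def ngram_counts(tokens, k):
    """Counts of length-k contexts and of (context, next-word) tuples."""
    ctx, ctx_next = Counter(), Counter()
    for i in range(k, len(tokens)):
        context = tuple(tokens[i - k:i])
        ctx[context] += 1
        ctx_next[context + (tokens[i],)] += 1
    return ctx, ctx_next

def p_mle(context, word, ctx, ctx_next):
    c = ctx[tuple(context)]
    return ctx_next[tuple(context) + (word,)] / c if c else 0.0

tokens = "the cat sat on the mat the cat slept".split()
ctx, ctx_next = ngram_counts(tokens, k=1)        # bigram model: 1-word context
print(p_mle(("the",), "cat", ctx, ctx_next))     # 2/3
```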
Addressing the zero count problem
I A zero count for any of the n-grams results in zero probability for the entire corpus, meaning infinite perplexity!
I Add-\alpha smoothing (a count-based sketch follows this list):
\hat{p}_{add\text{-}\alpha}(w_{m+1} \mid w_{m-n:m}) = \frac{\#(w_{m-n:m+1}) + \alpha}{\#(w_{m-n:m}) + \alpha |V|}
I Another technique is to back off to a lower-order n-gram where there is a count. The Jelinek-Mercer interpolated smoothing:
\hat{P}_{int}(w_{m+1} \mid w_{m-n:m}) = \lambda_{w_{m-n:m}} \frac{\#(w_{m-n:m+1})}{\#(w_{m-n:m})} + (1 - \lambda_{w_{m-n:m}}) \hat{P}_{int}(w_{m+1} \mid w_{m-(n-1):m})
Notice this is a recursive formulation.
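Here is a minimal sketch of add-\alpha smoothing on bigram counts; the toy corpus, function name, and \alpha value are my own choices, not the slides'.

```python
# Add-alpha smoothing: (#(context, word) + alpha) / (#(context) + alpha * |V|)
from collections import Counter

def p_add_alpha(context, word, ctx, ctx_next, vocab_size, alpha=0.1):
    return (ctx_next[context + (word,)] + alpha) / (ctx[context] + alpha * vocab_size)

tokens = "the cat sat on the mat the cat slept".split()
ctx, ctx_next = Counter(), Counter()
for i in range(1, len(tokens)):                      # bigram counts
    ctx[(tokens[i - 1],)] += 1
    ctx_next[(tokens[i - 1], tokens[i])] += 1

V = len(set(tokens))
print(p_add_alpha(("the",), "dog", ctx, ctx_next, V))   # unseen bigram: small but nonzero
print(p_add_alpha(("the",), "cat", ctx, ctx_next, V))   # seen bigram: close to the MLE 2/3
```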
Limitations of smoothed MLE based models
I Smoothing is based on backoff to lower-order events, and the sequential nature of the backoff makes it hard to scale to large n-grams.
I MLE-based language models suffer from lack of generalization across contexts
Neural language models
I Treat word prediction as a discriminative learning task, with the goal of computing the probability P(w \mid u), where w \in V is a word, and u is the context that depends on previous words
I Parametrize the probability P(w \mid u) as a function of two K-dimensional dense vectors, \beta_w \in \mathbb{R}^K and u \in \mathbb{R}^K:
P(w \mid u) = \frac{\exp(\beta_w \cdot u)}{\sum_{w' \in V} \exp(\beta_{w'} \cdot u)}
The vector of probabilities can be computed by applying the SoftMax transformation to the vector of dot products:
P(\cdot \mid u) = \text{SoftMax}([\beta_{w_1} \cdot u, \beta_{w_2} \cdot u, \cdots, \beta_{w_V} \cdot u])
I The word vectors \beta_w are parameters of the model and can be estimated directly, e.g., using the negative log-likelihood of the training corpus as the objective
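A minimal numpy sketch of the softmax-over-dot-products computation above; the vocabulary size, embedding dimension, and parameter values are hypothetical, and the variable names beta and u mirror the notation on the slide.

```python
# P(. | u) = SoftMax([beta_w1 . u, beta_w2 . u, ..., beta_wV . u])
import numpy as np

rng = np.random.default_rng(0)
V, K = 5, 4                        # toy vocabulary size and vector dimension
beta = rng.normal(size=(V, K))     # output word vectors beta_w, one row per word
u = rng.normal(size=K)             # context vector

scores = beta @ u                              # dot products beta_w . u
probs = np.exp(scores - scores.max())          # subtract max for numerical stability
probs /= probs.sum()                           # SoftMax: probabilities sum to 1
print(probs, probs.sum())
```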
Computing the context vector
I There are different ways to compute the context vector, and one effective way is to use a Recurrent Neural Network (RNN). The basic idea is to recurrently update the context vector while moving through the sequence.
I Let h_m represent the contextual information at position m in the sequence. An RNN language model can be defined as:
x_m \triangleq \phi_{w_m}
h_m = \text{RNN}(x_m, h_{m-1})
P(w_{m+1} \mid w_1, w_2, \cdots, w_m) = \frac{\exp(\beta_{w_{m+1}} \cdot h_m)}{\sum_{w' \in V} \exp(\beta_{w'} \cdot h_m)}
where \phi is a matrix of word embeddings, and x_m is the word embedding for w_m
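A minimal sketch of the recurrence above, assuming a simple Elman-style cell as the RNN(·) update; all parameter values and the toy word-id sequence are hypothetical.

```python
# RNN language model: h_m = RNN(x_m, h_{m-1}); P(w_{m+1} | w_1..w_m) = softmax(beta . h_m)
import numpy as np

rng = np.random.default_rng(0)
V, K = 5, 8                                    # toy vocabulary and hidden size
phi = rng.normal(size=(V, K)) * 0.1            # input word embeddings: x_m = phi[w_m]
beta = rng.normal(size=(V, K)) * 0.1           # output word vectors
W_x = rng.normal(size=(K, K)) * 0.1            # input-to-hidden weights
W_h = rng.normal(size=(K, K)) * 0.1            # hidden-to-hidden weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x, h_prev):
    return np.tanh(W_x @ x + W_h @ h_prev)     # one simple-RNN update

h = np.zeros(K)
for w in [0, 3, 1]:                            # a toy word-id sequence
    h = rnn_step(phi[w], h)
    p_next = softmax(beta @ h)                 # P(w_{m+1} | w_1..w_m)
print(p_next)
```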
Sequence-to-sequence models
Sequence-to-sequence models
I Sequence goes in, sequence comes out
I Sequence-to-sequence models are a powerful learning framework that has found success in a wide range of applications
I Automatic Speech Recognition (ASR): sound stream goes in, text comes out
I Machine Translation (MT): source language sentence goes in, target language sentence comes out
I Image captioning: image goes in, caption comes out
I Text summarization: whole text goes in, summary comes out
I Automatic email responses: incoming email goes in, automatic response comes out
I etc. etc.
The encoder decoder architecture
I The encoder network converts the source sentence into a vector or a matrix representation; the decoder network then converts the encoding into a sentence in the target language
z = \text{Encode}(w^{(s)})
w^{(t)} \mid w^{(s)} \sim \text{Decode}(z)
where the second line means that the decoder defines the conditional probability P(w^{(t)} \mid w^{(s)}).
I The decoder is typically a recurrent neural network (e.g., LSTM) that generates one word at a time, while recurrently updating a hidden state.
I The encoder decoder networks are trained end-to-end from parallel sentences.
Encoder decoder
[Figure: an encoder-decoder network. The encoder reads the source sentence "women shi xuesheng" and produces a context vector C; the decoder then generates the target sentence "we are students" one word at a time.]
Training objective
If the output layer of the decoder is a logistic function, then the entire network can be trained to maximize the conditional log-likelihood (or minimize the negative log-likelihood):
\log P(w^{(t)} \mid w^{(s)}) = \sum_{m=1}^{M^{(t)}} \log P(w^{(t)}_m \mid w^{(t)}_{1:m-1}, z)
w^{(t)}_m \mid w^{(t)}_{1:m-1}, w^{(s)} \sim \text{SoftMax}\left(\beta \cdot h^{(t)}_{m-1}\right)
where h^{(t)}_{m-1} is a recurrent function of the previously generated text w^{(t)}_{1:m-1} and the encoding z, and \beta \in \mathbb{R}^{V^{(t)} \times K} is the matrix of output word vectors for the V^{(t)} words in the target language vocabulary
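A minimal numeric sketch of the conditional log-likelihood objective above; the per-step decoder scores and the reference word ids below are made-up values for illustration.

```python
# log P(w^(t) | w^(s)) = sum_m log P(w^(t)_m | w^(t)_{1:m-1}, z)
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical decoder scores beta . h^(t)_{m-1} for a 4-word target vocabulary,
# one row per target position, plus the gold target word ids.
step_scores = np.array([[2.0, 0.5, 0.1, -1.0],
                        [0.3, 1.8, 0.2,  0.0],
                        [0.1, 0.4, 2.2,  0.5]])
gold = [0, 1, 2]                                   # reference target word ids

log_lik = sum(np.log(softmax(s)[g]) for s, g in zip(step_scores, gold))
print("log-likelihood:", log_lik, "  NLL loss:", -log_lik)
```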
The LSTM variant
I In the LSTM variant of the encoder network, the encoding is set to the final hidden state of an LSTM run over the source sentence:
h^{(s)}_m = \text{LSTM}(x^{(s)}_m, h^{(s)}_{m-1})
z \triangleq h^{(s)}_{M^{(s)}}
where x^{(s)}_m is the embedding of the source language word w^{(s)}_m.
I The encoding then provides the initial hidden state for the decoder LSTM:
h^{(t)}_0 = z
h^{(t)}_m = \text{LSTM}(x^{(t)}_m, h^{(t)}_{m-1})
where x^{(t)}_m is the embedding of the target language word w^{(t)}_m
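A minimal sketch of this LSTM encoder-decoder, assuming PyTorch (the slides do not prescribe a library); the class and parameter names (Seq2Seq, emb_dim, hid_dim) are mine. Here both the hidden and cell state of the encoder are carried over to initialize the decoder, which is one common choice.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hid_dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode: the final hidden/cell state of the source LSTM is the encoding z
        _, (h, c) = self.encoder(self.src_emb(src))
        # Decode: z initializes the decoder's state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), (h, c))
        return self.out(dec_out)            # logits over the target vocabulary per step

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))        # batch of 2 source sentences, length 7
tgt_in = torch.randint(0, 1200, (2, 5))     # shifted target input, length 5
print(model(src, tgt_in).shape)             # torch.Size([2, 5, 1200])
```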
Tweaking the encoder decoder network
I Adding layers: The encoder and decoder network can be implemented as deep LSTMs with multiple layers of hidden state
I Adding attention to give more weight to particular words in the source sentence when generating a word in the target language
Multi-layered LSTMs
Each hidden state h(s,i) at layer i is treated as the input to an LSTM at layer i + 1:
h^{(s,1)}_m = \text{LSTM}(x^{(s)}_m, h^{(s,1)}_{m-1})
h^{(s,i+1)}_m = \text{LSTM}(h^{(s,i)}_m, h^{(s,i+1)}_{m-1}), \quad \forall i \ge 1
[Figure 18.5: a deep bidirectional LSTM encoder, with stacked hidden states h^{(s,1)}_m, h^{(s,2)}_m, \ldots, h^{(s,D)}_m at each source position m.]
Neural attention
I Attention can be thought of as using a query to select from a memory of key-value pairs, with the keys, values and queries all being vectors
I For each key n in the memory, we compute a score with respect to the query m, which measures the “compatibility” between the key and the query
I The scores are passed through an activation function, typically softmax, which results in a vector of non-negative numbers of length N, equal to the size of the memory: [\alpha_{m \to 1}, \alpha_{m \to 2}, \cdots, \alpha_{m \to N}]
I Multiplying each value n in the memory by the attention weight \alpha_{m \to n} and summing them up gives the output of the attention (a minimal sketch follows below)
I The attention output is typically concatenated with the decoder hidden state to predict the target word
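A minimal numpy sketch of attention as querying a memory of key-value pairs; the keys, values, and query are random placeholders, and a plain dot-product score is used for brevity (the slides also mention a small feedforward network as one way to score compatibility).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, keys, values):
    """Score each key against the query, softmax the scores into weights
    alpha_{m->n}, and return the weighted sum of the values."""
    scores = keys @ query                  # one compatibility score per key
    alpha = softmax(scores)                # non-negative weights, length N
    return alpha @ values, alpha           # output = sum_n alpha_n * value_n

rng = np.random.default_rng(0)
N, d = 4, 3                                # memory size and vector dimension
keys, values = rng.normal(size=(N, d)), rng.normal(size=(N, d))
query = rng.normal(size=d)
out, alpha = attend(query, keys, values)
print(alpha, out)
```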
[Figure: a general view of neural attention. For each key n in the memory, a compatibility score with the query m is computed (e.g., using a small feedforward network); the scores are passed through an activation function to produce attention weights, and the output is the correspondingly weighted combination of the values.]
Step by step computation of attention
I Computing compatibility with a two-layer feedforward network:
\psi_\alpha(m, n) = \upsilon_\alpha \cdot \tanh(\Theta_\alpha [h^{(t)}_m ; h^{(s)}_n])
I Softmax attention:
\alpha_{m \to n} = \frac{\exp \psi_\alpha(m, n)}{\sum_{n'=1}^{M^{(s)}} \exp \psi_\alpha(m, n')}
I Compute the weighted average over the columns of Z:
c_m = \sum_{n=1}^{M^{(s)}} \alpha_{m \to n} z_n
I Incorporate the context vector into the decoding model:
\tilde{h}^{(t)}_m = \tanh\left(\Theta_c [h^{(t)}_m ; c_m]\right)
P(w^{(t)}_{m+1} \mid w^{(t)}_{1:m}, w^{(s)}) \propto \exp\left(\beta_{w^{(t)}_{m+1}} \cdot \tilde{h}^{(t)}_m\right)
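A minimal numpy sketch of the steps above. The variable names Theta_a, v_a, Theta_c, and beta mirror \Theta_\alpha, \upsilon_\alpha, \Theta_c, and \beta; all values are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, M_src, V_tgt = 4, 5, 6
h_src = rng.normal(size=(M_src, K))        # source hidden states (the memory Z)
h_t = rng.normal(size=K)                   # current decoder hidden state h^(t)_m
Theta_a = rng.normal(size=(K, 2 * K))      # compatibility network parameters
v_a = rng.normal(size=K)
Theta_c = rng.normal(size=(K, 2 * K))      # parameters combining [h^(t)_m; c_m]
beta = rng.normal(size=(V_tgt, K))         # output word vectors

# 1) two-layer feedforward compatibility scores psi(m, n)
psi = np.array([v_a @ np.tanh(Theta_a @ np.concatenate([h_t, h_n])) for h_n in h_src])
alpha = softmax(psi)                       # 2) attention weights alpha_{m->n}
c_m = alpha @ h_src                        # 3) weighted average over the memory
h_tilde = np.tanh(Theta_c @ np.concatenate([h_t, c_m]))   # 4) combined state
p_next = softmax(beta @ h_tilde)           # 5) P(w^(t)_{m+1} | ...)
print(alpha, p_next)
```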
Seq2seq: Initialization
I Word embeddings: each source word (women, shi, xuesheng) and each target word (START, we, are, students, EOS) is mapped to a 2-dimensional embedding vector E[\cdot].
I RNN parameters: the encoder uses W^{(s)}, U^{(s)}, b^{(s)} and the decoder uses W^{(t)}, U^{(t)}, b^{(t)}; in this toy example they are 2 \times 2 matrices and 2-dimensional bias vectors initialized to small fixed values.
Encoder
h^{(s)}_n = \tanh(W^{(s)} x_n + U^{(s)} h_{n-1} + b^{(s)}), \quad z \triangleq h^{(s)}_{M^{(s)}}
h^{(s)}_1 = \tanh(W^{(s)} E[\text{women}] + U^{(s)} \cdot 0 + b^{(s)}) = [0.3364, 0.5717]
h^{(s)}_2 = \tanh(W^{(s)} E[\text{shi}] + U^{(s)} h^{(s)}_1 + b^{(s)}) = [0.3153, 0.7289]
h^{(s)}_3 = \tanh(W^{(s)} E[\text{xuesheng}] + U^{(s)} h^{(s)}_2 + b^{(s)}) = [0.0086, 0.5432]
C = h^{(s)}_3 = [0.0086, 0.5432]
where C is the context vector
Decoder
h^{(t)}_m = \tanh(W^{(t)} x_m + U^{(t)} h_{m-1} + b^{(t)})
h^{(t)}_1 = \tanh(W^{(t)} E[\text{START}] + U^{(t)} C + b^{(t)}) = [0.6929, 0.2482]
h^{(t)}_2 = \tanh(W^{(t)} E[\text{we}] + U^{(t)} h^{(t)}_1 + b^{(t)}) = [0.6203, 0.2123]
h^{(t)}_3 = \tanh(W^{(t)} E[\text{are}] + U^{(t)} h^{(t)}_2 + b^{(t)}) = [0.6039, 0.1870]
h^{(t)}_4 = \tanh(W^{(t)} E[\text{students}] + U^{(t)} h^{(t)}_3 + b^{(t)}) = [0.6220, 0.0980]
Softmax over similarities between hidden layers and target embeddings
score_1 = \text{softmax}([h^{(t)}_1 \cdot E[\text{we}],\ h^{(t)}_1 \cdot E[\text{are}],\ h^{(t)}_1 \cdot E[\text{students}],\ h^{(t)}_1 \cdot E[\text{EOS}]]) = [0.1913, 0.2000, 0.1733, 0.4354]
score_2 = \text{softmax}([h^{(t)}_2 \cdot E[\text{we}],\ h^{(t)}_2 \cdot E[\text{are}],\ h^{(t)}_2 \cdot E[\text{students}],\ h^{(t)}_2 \cdot E[\text{EOS}]]) = [0.1999, 0.2070, 0.1826, 0.4116]
score_3 = \text{softmax}([h^{(t)}_3 \cdot E[\text{we}],\ h^{(t)}_3 \cdot E[\text{are}],\ h^{(t)}_3 \cdot E[\text{students}],\ h^{(t)}_3 \cdot E[\text{EOS}]]) = [0.2012, 0.2098, 0.1867, 0.4023]
score_4 = \text{softmax}([h^{(t)}_4 \cdot E[\text{we}],\ h^{(t)}_4 \cdot E[\text{are}],\ h^{(t)}_4 \cdot E[\text{students}],\ h^{(t)}_4 \cdot E[\text{EOS}]]) = [0.2037, 0.2147, 0.1959, 0.3857]
P(Y \mid X) = score_1 \times score_2 \times score_3 \times score_4, taking from each score vector the entry of the actual next target word (we, are, students, EOS)
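A minimal sketch of this scoring procedure. The hidden states and target embeddings below are made-up values, not the ones used in the slides' worked example, so the printed probability will differ from the slides.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical 2-d target embeddings (NOT the values from the slides).
E_tgt = {"we": np.array([0.3, 0.1]),
         "are": np.array([0.2, 0.4]),
         "students": np.array([0.5, 0.2]),
         "EOS": np.array([0.1, 0.1])}
vocab = list(E_tgt)

# Hypothetical decoder hidden states h^(t)_1 ... h^(t)_4.
H = [np.array([0.69, 0.25]), np.array([0.62, 0.21]),
     np.array([0.60, 0.19]), np.array([0.62, 0.10])]

target = ["we", "are", "students", "EOS"]
p = 1.0
for h, gold in zip(H, target):
    scores = softmax(np.array([h @ E_tgt[w] for w in vocab]))
    p *= scores[vocab.index(gold)]         # probability of the reference word
print("P(Y|X) =", p)
```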
Attention
I The idea: different context vectors C_m are used when generating target words at different time steps, e.g.,
C_1 = 0.98 \times h^{(s)}_1 + 0.01 \times h^{(s)}_2 + 0.01 \times h^{(s)}_3
C_2 = 0.01 \times h^{(s)}_1 + 0.98 \times h^{(s)}_2 + 0.01 \times h^{(s)}_3
C_3 = 0.01 \times h^{(s)}_1 + 0.01 \times h^{(s)}_2 + 0.98 \times h^{(s)}_3
C_m = \sum_n \alpha_{m \to n} \times h^{(s)}_n
\alpha_{m \to n} = \frac{\exp(\text{score}(h^{(t)}_m, h^{(s)}_n))}{\sum_{n'=1}^{M^{(s)}} \exp(\text{score}(h^{(t)}_m, h^{(s)}_{n'}))}
\text{score}(h^{(t)}_m, h^{(s)}_n) = h^{(t)\top}_m h^{(s)}_n
I Other scoring variants exist
Computing attention
[Figure: computing attention at decoder step 2. The weights a_{2,1}, a_{2,2}, a_{2,3} over the source hidden states h^{src}_1, h^{src}_2, h^{src}_3 (for "women shi xuesheng") give the context vector C_2, which is combined with the decoder state h^{tgt}_2 to form h'_2 for predicting the next target word.]
Other attention variants
I Additive attention:
\psi_\alpha(m, n) = \upsilon_\alpha \cdot \tanh(\Theta_\alpha [h^{(t)}_m + h^{(s)}_n])
I Multiplicative attention:
\psi_\alpha(m, n) = h^{(t)\top}_m \Theta_\alpha h^{(s)}_n
Drawbacks of RNNs
I For RNNs, the input is processed as a sequence. The computation of each state h_i depends on the previous state h_{i-1}.
I This prevents computing all tokens in the input sequence in parallel, making it difficult to take full advantage of modern computing architectures to increase speed.
I We can imagine a network in which each token in the sequence interacts with any other token in the sequence. Conceptually, this can be viewed as a fully-connected graph where each token is a node in the graph, and the computation of its hidden state depends on all other tokens in the graph.
I With this approach, the computation of a hidden state h_i does not depend on the computation of any other hidden state; it depends only on the input sequence.
I With this approach, the order information would have to be captured separately, with position encoding.
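One common way to supply that order information is a sinusoidal position encoding added to the token embeddings. The slides do not specify which encoding is meant, so this is an assumption; the sketch below is a generic illustration with hypothetical sizes.

```python
# Sinusoidal position encodings (one common choice; an assumption here):
# PE[pos, 2i] = sin(pos / 10000^(2i/dim)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
import numpy as np

def positional_encoding(num_positions, dim):
    pos = np.arange(num_positions)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(num_positions=8, dim=16)
print(pe.shape)        # (8, 16); added to the token embeddings before attention
```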