Accelerated Natural Language Processing Week 2/Unit 2
N-gram models, entropy
Sharon Goldwater
(some slides based on those by Alex Lascarides and Philipp Koehn)
Video 1: Introduction and noisy channel model
Recap: Language models
• Language models tell us P(w⃗) = P(w1 …wn): How likely to occur is this sequence of words?
Roughly: Is this sequence of words a “good” one in my language?
Example uses of language models
• Machine translation: reordering, word choice.
  Plm(the house is small) > Plm(small the is house)
  Plm(I am going home) > Plm(I am going house)
  Plm(We’ll start eating) > Plm(We shall commence consuming)
• Speech recognition: word choice:
  Plm(morphosyntactic analyses) > Plm(more faux syntactic analyses)
  Plm(I put it on today) > Plm(I putted onto day)

But: How do systems use this information?
This unit:
• What is the Noisy Channel framework and what are some example uses?
• What is a language model?
• What is an n-gram model, what is it for, and what independence assumptions does it make?
• What are entropy and perplexity and what do they tell us?
• What’s wrong with using MLE in n-gram models?
Noisy channel framework
• Concept from Information Theory, used widely in NLP
• We imagine that the observed data (output sequence) was generated as:

      Y (symbol sequence)  --->  [noisy/errorful encoding]  --->  X (output sequence)
            P(Y)                        P(X|Y)                          P(X)

• Example applications:

  Application           Y (symbol sequence)   X (output sequence)
  Speech recognition    true words            acoustic signal
  Machine translation   words in L1           words in L2
  Spelling correction   true words            typed words

Example: spelling correction
• P(Y): Distribution over the words (sequences) the user intended to type. A language model.
• P(X|Y): Distribution describing what the user is likely to type, given what they meant. Could incorporate information about common spelling errors, key positions, etc. Call it a noise model.
• P(X): Resulting distribution over what we actually see.
• Given some particular observation x (say, effert), we want to recover the most probable y that was intended.
Noisy channel as probabilistic inference
• Mathematically, what we want is argmax_y P(y|x).
  – Read as “the y that maximizes P(y|x)”
• Rewrite using Bayes’ Rule:

  argmax_y P(y|x) = argmax_y P(x|y)P(y) / P(x) = argmax_y P(x|y)P(y)

Noisy channel as probabilistic inference
So to recover the best y, we will need
• a language model P(Y): relatively task-independent.
• a noise model P(X|Y), which depends on the task.
  – acoustic model, translation model, misspelling model, etc.
  – won’t discuss here; see courses on ASR, MT.
Both are normally trained on corpus data.
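To make the argmax concrete, here is a minimal sketch of noisy-channel decoding for spelling correction (not from the lecture): the candidate list and all probability values are invented placeholders, standing in for a real language model P(Y) and noise model P(X|Y).

```python
import math

# Toy language model P(y) and noise model P(x|y); the numbers are made up.
lm = {"effort": 1e-4, "effect": 2e-4, "desert": 5e-5}
noise = {
    ("effert", "effort"): 0.01,
    ("effert", "effect"): 0.001,
    ("effert", "desert"): 1e-6,
}

def correct(x, candidates):
    """Return argmax_y P(x|y) * P(y), computed in log space for stability."""
    def score(y):
        # Unseen pairs/words get a tiny floor probability instead of zero.
        return math.log(noise.get((x, y), 1e-12)) + math.log(lm.get(y, 1e-12))
    return max(candidates, key=score)

print(correct("effert", ["effort", "effect", "desert"]))  # -> effort
```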
You may be wondering
If we can train P(X|Y), why can’t we just train P(Y|X)? Who needs Bayes’ Rule?
• Answer 1: sometimes we do train P(Y|X) directly. Stay tuned…
• Answer 2: training P(X|Y) or P(Y|X) requires input/output pairs, which are often limited:
– Misspelled words with their corrections; transcribed speech;
translated text
But LMs can be trained on huge unannotated corpora: a better model. Can help improve overall performance.
Video 2: Unigram model
Estimating a language model
• Y is really a sequence of words w⃗ = w1 … wn.
• So we want to know P(w1 … wn) for big n (e.g., a sentence).
• What will not work: try to directly estimate the probability of each full sentence.
  – Say, using MLE (relative frequencies): C(w⃗)/(total # sentences).
  – For nearly all w⃗ (grammatical or not), C(w⃗) = 0.
  – A sparse data problem: not enough observations to estimate probabilities well.
A first attempt to solve the problem
Perhaps the simplest model of sentence probabilities: a unigram model.
• Generative process: choose each word in the sentence independently.
• Resulting model: P̂(w⃗) = ∏_{i=1}^{n} P(wi)
• So, P(the cat slept quietly) = P(the quietly cat slept)
  – Not a good model, but still a model.
• Of course, P(wi) also needs to be estimated!
MLE for unigrams
• How to estimate P(w), e.g., P(the)?
• Remember that MLE is just relative frequencies:
  PML(w) = C(w) / W
  – C(w) is the token count of w in a large corpus
  – W = Σ_x′ C(x′) is the total number of word tokens in the corpus.
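As a minimal sketch of these estimates (not from the lecture), the snippet below computes PML(w) = C(w)/W over a tiny invented corpus and uses the result as a unigram model of sentence probability; note that any reordering of a sentence gets the same probability, as on the previous slide.

```python
from collections import Counter

# Tiny invented corpus; real estimates use millions to billions of tokens.
corpus = "the cat slept the dog slept the cat ran".split()

counts = Counter(corpus)      # C(w)
W = sum(counts.values())      # W: total number of word tokens

def p_ml(word):
    """P_ML(w) = C(w) / W"""
    return counts[word] / W

def unigram_sentence_prob(sentence):
    """Unigram model: P(w1 ... wn) = product of P(wi); word order is ignored."""
    prob = 1.0
    for w in sentence.split():
        prob *= p_ml(w)
    return prob

print(p_ml("the"))                              # 3/9
print(unigram_sentence_prob("the cat slept"))   # same value as ...
print(unigram_sentence_prob("the slept cat"))   # ... any reordering
```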
Unigram models in practice
• Seems like a pretty bad model of language: the probability of a word obviously does depend on context.
• Yet unigram (or bag-of-words) models are surprisingly useful for some applications.
  – Can model “aboutness”: topic of a document, semantic usage of a word
  – Applications: lexical semantics (disambiguation), information retrieval, text classification. (See later in this course)
  – But for now, we will focus on models that capture at least some syntactic information.

Video 3: N-gram model
General N-gram language models
Step 1: rewrite using chain rule:
P(w⃗) = P(w1 …wn)
     = P(wn|w1, w2, …, wn−1) P(wn−1|w1, w2, …, wn−2) … P(w1)
• Example: w⃗ = the cat slept quietly yesterday.
  P(the, cat, slept, quietly, yesterday) =
    P(yesterday|the, cat, slept, quietly) · P(quietly|the, cat, slept) · P(slept|the, cat) · P(cat|the) · P(the)
• But for long sequences, many of the conditional probs are also too sparse!
General N-gram language models
Step 2: make an independence assumption:
P(w⃗) = P(w1 … wn)
     = P(wn|w1, w2, …, wn−1) P(wn−1|w1, w2, …, wn−2) … P(w1)
     ≈ P(wn|wn−2, wn−1) P(wn−1|wn−3, wn−2) … P(w1)
• Markov assumption: only a finite history matters.
• Here, two word history (trigram model):
  wi is cond. indep. of w1 … wi−3 given wi−1, wi−2.
  P(the, cat, slept, quietly, yesterday) ≈
    P(yesterday|slept, quietly) · P(quietly|cat, slept) · P(slept|the, cat) · P(cat|the) · P(the)
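The sketch below (not the lecture’s code) spells out this trigram factorization: each word’s probability is looked up given its two-word history, with None used as padding so the first two factors reduce to P(the) and P(cat|the), as in the example above; all probability values are invented placeholders.

```python
# Toy trigram table: (w_{i-2}, w_{i-1}, w_i) -> P(w_i | w_{i-2}, w_{i-1}).
# None marks "no history yet"; the numbers are made up.
p3 = {
    (None, None, "the"): 0.1,                 # P(the)
    (None, "the", "cat"): 0.02,               # P(cat | the)
    ("the", "cat", "slept"): 0.05,            # P(slept | the, cat)
    ("cat", "slept", "quietly"): 0.2,         # P(quietly | cat, slept)
    ("slept", "quietly", "yesterday"): 0.1,   # P(yesterday | slept, quietly)
}

def trigram_sentence_prob(words):
    """P(w1 ... wn) ~= product over i of P(w_i | w_{i-2}, w_{i-1})."""
    padded = [None, None] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p3[(padded[i - 2], padded[i - 1], padded[i])]
    return prob

print(trigram_sentence_prob("the cat slept quietly yesterday".split()))
```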
Trigram independence assumption
• Put another way, a trigram model assumes these are all equal:
  – P(slept|the cat)
  – P(slept|after lunch the cat)
  – P(slept|the dog chased the cat)
  – P(slept|except for the cat)
  because all are estimated as P(slept|the cat)
• Not always a good assumption! But it does reduce the sparse data problem.

Another example: bigram model
• Bigram model assumes one word history:
  P(w⃗) = P(w1) ∏_{i=2}^{n} P(wi|wi−1)
• But consider these sentences:

      w1     w2    w3     w4
  (1) the    cats  slept  quietly
  (2) feeds  cats  slept  quietly
  (3) the    cats  slept  on

• What’s wrong with (2) and (3)? Does the model capture these problems?

Video 5: Entropy for Measuring Uncertainty
(No slides for Video 4)
Example: bigram model
• To capture behaviour at beginning/end of sentence, we need to augment the input:

      w0    w1     w2    w3     w4       w5
  (1) <s>   the    cats  slept  quietly  </s>
  (2) <s>   feeds  cats  slept  quietly  </s>
  (3) <s>   the    cats  slept  on       </s>

• That is, assume w0 = <s> and wn+1 = </s>, so we have:
  P(w⃗) = P(w0) ∏_{i=1}^{n+1} P(wi|wi−1) = ∏_{i=1}^{n+1} P(wi|wi−1)

Estimating N-Gram Probabilities
• Maximum likelihood (relative frequency) estimation for bigrams:
  – How many times we saw w2 following w1, out of all the times we saw anything following w1:
  PML(w2|w1) = C(w1, w2) / C(w1, ·) = C(w1, w2) / C(w1)
• Similarly for trigrams:
  PML(w3|w1, w2) = C(w1, w2, w3) / C(w1, w2)
• Collect counts over a large text corpus
  – Millions to billions of words are usually easy to get
  – (trillions of English words available on the web)
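A minimal sketch (not from the lecture) of these MLE estimates over an invented three-sentence corpus, with <s>/</s> padding as above; real systems collect counts from far larger corpora.

```python
from collections import Counter

# Invented toy corpus; real counts come from millions of sentences.
sentences = [
    "the cats slept quietly",
    "the cats slept on the mat",
    "feeds cats slept quietly",
]

bigram_counts = Counter()
unigram_counts = Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]        # sentence-boundary padding
    unigram_counts.update(words)                   # C(w1)
    bigram_counts.update(zip(words, words[1:]))    # C(w1, w2)

def p_ml(w2, w1):
    """P_ML(w2 | w1) = C(w1, w2) / C(w1)"""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def bigram_sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= p_ml(w2, w1)
    return prob

print(p_ml("cats", "the"))                            # C(the, cats) / C(the) = 2/3
print(bigram_sentence_prob("the cats slept quietly"))
```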
Evaluating a language model
• Intuitively, a trigram model captures more context than a bigram model, so should be a “better” model.
• That is, more accurately predict the probabilities of sentences.
• But how can we measure this?
Entropy
• Definition of the entropy of a random variable X:
  H(X) = Σ_x −P(x) log2 P(x)
• Intuitively: a measure of uncertainty/disorder
• Also: the expected value of −log2 P(X)

Reminder: logarithms
log_a x = b  iff  a^b = x

Entropy Example
One event (outcome): P(a) = 1
  H(X) = −1 log2 1 = 0

Entropy Example
2 equally likely events: P(a) = 0.5, P(b) = 0.5
  H(X) = −0.5 log2 0.5 − 0.5 log2 0.5
       = −log2 0.5
       = 1
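These hand computations (and the ones on the following slides) can be checked with a few lines of Python; this is just a sanity check of the definition, not part of the slides.

```python
import math

def entropy(dist):
    """H(X) = sum over x of -P(x) * log2 P(x); terms with P(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([1.0]))                  # one event:             0.0
print(entropy([0.5, 0.5]))             # 2 equally likely:      1.0
print(entropy([0.25] * 4))             # 4 equally likely:      2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))   # next slide's example:  ~1.357
```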
Entropy Example
4 equally likely events: P(a) = 0.25, P(b) = 0.25, P(c) = 0.25, P(d) = 0.25
  H(X) = −0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25 − 0.25 log2 0.25
       = −log2 0.25
       = 2

Entropy Example
3 equally likely events and one more likely than the others: P(a) = 0.7, P(b) = 0.1, P(c) = 0.1, P(d) = 0.1
  H(X) = −0.7 log2 0.7 − 0.1 log2 0.1 − 0.1 log2 0.1 − 0.1 log2 0.1
       = −0.7 log2 0.7 − 0.3 log2 0.1
       = −(0.7)(−0.5146) − (0.3)(−3.3219)
       = 0.36020 + 0.99658
       = 1.35678

Entropy Example
3 equally likely events and one much more likely than the others: P(a) = 0.97, P(b) = 0.01, P(c) = 0.01, P(d) = 0.01
  H(X) = −0.97 log2 0.97 − 0.01 log2 0.01 − 0.01 log2 0.01 − 0.01 log2 0.01
       = −0.97 log2 0.97 − 0.03 log2 0.01
       = −(0.97)(−0.04394) − (0.03)(−6.6439)
       = 0.04262 + 0.19932
       = 0.24194

Entropy take-aways
Entropy of a uniform distribution over N outcomes is log2 N:
  e.g., H(X) = 0 (1 outcome), H(X) = 1 (2 outcomes), H(X) = 2 (4 outcomes), H(X) = 2.585 (6 outcomes), H(X) = 3 (8 outcomes)
Entropy take-aways
Any non-uniform distribution over N outcomes has lower entropy than the corresponding uniform distribution:
  e.g., H(X) = 2 for the uniform distribution over 4 outcomes, vs. H(X) = 1.35678 and H(X) = 0.24194 for the two non-uniform examples above.

Entropy as y/n questions
How many yes-no questions (bits) do we need to find out the outcome?
• Uniform distribution with 2^n outcomes: n q’s.
• Other cases: entropy is the average number of questions per outcome in a (very) long sequence of outcomes, where questions can consider multiple outcomes at once.
Entropy as encoding sequences
• Assume that we want to encode a sequence of events X
• Each event is encoded by a sequence of bits
• For example
  – Coin flip: heads = 0, tails = 1
  – 4 equally likely events: a = 00, b = 01, c = 10, d = 11
  – 3 events, one more likely than others: a = 0, b = 10, c = 11
  – Morse code: e has a shorter code than q
• Average number of bits needed to encode X ≥ entropy of X
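A small check of this bound (not from the slides): for the 4-equally-likely-events code the average length equals the entropy exactly; for the 3-event code the distribution is not given on the slide, so P(a) = 0.5, P(b) = P(c) = 0.25 below is an assumed example, under which that code also matches the entropy.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def avg_code_length(dist, code):
    """Expected number of bits: sum over x of P(x) * len(code[x])."""
    return sum(p * len(code[x]) for x, p in dist.items())

# 4 equally likely events with 2-bit codes: average length = entropy = 2 bits.
uniform4 = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
code4 = {"a": "00", "b": "01", "c": "10", "d": "11"}
print(entropy(uniform4), avg_code_length(uniform4, code4))   # 2.0 2.0

# 3-event code a=0, b=10, c=11, with an ASSUMED distribution (not on the slide).
skewed3 = {"a": 0.5, "b": 0.25, "c": 0.25}
code3 = {"a": "0", "b": "10", "c": "11"}
print(entropy(skewed3), avg_code_length(skewed3, code3))     # 1.5 1.5
```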
Video 6: Cross-entropy and Perplexity
The Entropy of English
How good is the LM?
• Given the start of a text, can we guess the next word?
• Humans do pretty well: the entropy is only about 1.3.
• But what about N-gram models?
  – Ideal language model would match the true entropy of English.
  – The closer we get to that, the better the model.
  – Put another way, a good model assigns high prob to real sentences (and low prob to everything else).
Perplexity
• Cross entropy measures how well model M predicts the data.
• For data w1 … wn with large n, well approximated by:
  HM(w1 … wn) = −(1/n) log2 PM(w1 … wn)
  – Avg neg log prob our model assigns to each word we saw
• Or, perplexity:
  PPM(w⃗) = 2^HM(w⃗)
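A minimal sketch (not from the lecture) of both quantities; the per-word probabilities are invented placeholders standing in for whatever PM assigns to each word of the data.

```python
import math

def cross_entropy(word_probs):
    """H_M = -(1/n) * log2 P_M(w1 ... wn) = average of -log2 P_M(w_i | context)."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """PP_M = 2 ** H_M."""
    return 2 ** cross_entropy(word_probs)

# Invented per-word model probabilities, just to show the computation:
probs = [0.1, 0.2, 0.05, 0.3]
print(cross_entropy(probs))   # ~2.93 bits per word
print(perplexity(probs))      # ~7.6
```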
Example: trigram (Europarl)
  prediction                       PML       −log2 PML
  PML(i | </s> <s>)                0.109     3.197
  PML(would | <s> i)               0.144     2.791
  PML(like | i would)              0.489     1.031
  PML(to | would like)             0.905     0.144
  PML(commend | like to)           0.002     8.794
  PML(the | to commend)            0.472     1.084
  PML(rapporteur | commend the)    0.147     2.763
  PML(on | the rapporteur)         0.056     4.150
  PML(his | rapporteur on)         0.194     2.367
  PML(work | on his)               0.089     3.498
  PML(. | his work)                0.290     1.785
  PML(</s> | work .)               0.99999   0.000014
  average                                    2.634

• On paper, there’s a simpler expression for perplexity:
  PPM(w⃗) = 2^HM(w⃗)
          = 2^(−(1/n) log2 PM(w1 … wn))
          = PM(w1 … wn)^(−1/n)
  – 1 over the geometric average of the probabilities of each wi.
• But in practice, when computing perplexity for long sequences, we use the version with logs (see week 3 lab for reasons…)
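As a quick check on the table above (not part of the slides), averaging the −log2 PML column and exponentiating gives the trigram perplexity that also appears in the comparison table on the next slide.

```python
neg_log2_probs = [3.197, 2.791, 1.031, 0.144, 8.794, 1.084,
                  2.763, 4.150, 2.367, 3.498, 1.785, 0.000014]

avg = sum(neg_log2_probs) / len(neg_log2_probs)
print(avg)        # ~2.634, the "average" row above
print(2 ** avg)   # ~6.2, matching the trigram perplexity in the next table
```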
Comparison: 1–4-Gram
Per-word −log2 PML under each model, with the average and resulting perplexity in the last two rows:

  word         unigram   bigram   trigram   4-gram
  i            6.684     3.197    3.197     3.197
  would        8.342     2.884    2.791     2.791
  like         9.129     2.026    1.031     1.290
  to           5.081     0.402    0.144     0.113
  commend      15.487    12.335   8.794     8.633
  the          3.885     1.402    1.084     0.880
  rapporteur   10.840    7.319    2.763     2.350
  on           6.765     4.140    4.150     1.862
  his          10.678    7.316    2.367     1.978
  work         9.993     4.816    3.498     2.394
  .            4.896     3.020    1.785     1.510
  </s>         4.828     0.005    0.000     0.000
  average      8.051     4.072    2.634     2.251
  perplexity   265.136   16.817   6.206     4.758
Unseen N-Grams
• What happens when I try to compute P(consuming|shall commence)?
  – Assume we have seen shall commence in our corpus
  – But we have never seen shall commence consuming in our corpus
  → P(consuming|shall commence) = 0
• Any sentence with shall commence consuming will be assigned probability 0:
  The guests shall commence consuming supper
  Green inked shall commence consuming garden the

The problem with MLE
• MLE estimates probabilities that make the observed data maximally probable
• by assuming anything unseen cannot happen (and also assigning too much probability to low-frequency observed events).
• It over-fits the training data.
• We tried to avoid zero-probability sentences by modelling with smaller chunks (n-grams), but even these will sometimes have zero prob under MLE.

Next time: smoothing methods, which reassign probability mass from observed to unobserved events, to avoid overfitting/zero probs.
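A minimal sketch of the zero-probability problem (not from the lecture; the tiny corpus is invented): a single unseen bigram forces the MLE probability of the whole sentence to 0.

```python
from collections import Counter

# Invented toy corpus: "shall commence" is seen, "shall commence consuming" is not.
corpus = "we shall commence the meeting now . they shall commence voting .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_prob_ml(sentence):
    """Product of MLE bigram probabilities; returns 0 as soon as anything is unseen."""
    words = sentence.split()
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        if unigrams[w1] == 0 or bigrams[(w1, w2)] == 0:
            return 0.0                       # unseen history or unseen bigram
        prob *= bigrams[(w1, w2)] / unigrams[w1]
    return prob

print(bigrams[("shall", "commence")])                             # 2: seen
print(bigrams[("commence", "consuming")])                         # 0: unseen
print(sentence_prob_ml("they shall commence voting ."))           # > 0
print(sentence_prob_ml("they shall commence consuming supper"))   # 0.0
```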
Questions for review:
• What is the Noisy Channel framework and what are some example uses?
• What is a language model?
• What is an n-gram model, what is it for, and what independence
assumptions does it make?
• What are entropy and perplexity and what do they tell us?
• What’s wrong with using MLE in n-gram models?