N-gram Language Models
COMP90042: Natural Language Processing, Lecture 3
Copyright 2020, The University of Melbourne
Language Models
• NLP is about explaining language
  ‣ Why some sentences are more fluent than others
• We measure ‘goodness’ using probabilities estimated by language models
• E.g. in speech recognition:
  recognise speech > wreck a nice beach
• Language models can also be used for generation
Language Models
• Useful for
  ‣ Speech recognition
  ‣ Spelling correction
  ‣ Query completion
  ‣ Optical character recognition
• Other generation tasks
  ‣ Machine translation
  ‣ Summarisation
  ‣ Dialogue systems
Outline
• Deriving n-gram language models
• Smoothing to deal with sparsity
• Generating language
• Evaluating language models
Probabilities: Joint to Conditional
• Our goal is to get a probability for an arbitrary sequence of m words:
  P(w_1, w_2, \ldots, w_m)
• First step is to apply the chain rule to convert the joint probability to conditional ones:
  P(w_1, w_2, \ldots, w_m) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \ldots P(w_m \mid w_1, \ldots, w_{m-1})
The Markov Assumption
• Still intractable, so make a simplifying assumption:
  P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-n+1} \ldots w_{i-1})
  for some small n.
• When n = 1, a unigram model:
  P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i)
• When n = 2, a bigram model:
  P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})
• When n = 3, a trigram model:
  P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2} w_{i-1})
Maximum Likelihood Estimation
• How do we calculate the probabilities? Estimate based on counts in our corpus.
• For unigram models (M = total number of word tokens in the corpus):
  P(w_i) = \frac{C(w_i)}{M}
• For bigram models:
  P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
• For n-gram models generally:
  P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_i)}{C(w_{i-n+1} \ldots w_{i-1})}
Book-ending Sequences
• Special tags are used to denote the start and end of a sequence:
  ‣ <s> = sentence start (□ in E18)
  ‣ </s> = sentence end (■ in E18)
Trigram Example
Corpus (padded with start/end tags):
  <s> <s> yes no no no no yes </s>
  <s> <s> no no no yes yes yes no </s>
What is the probability of “yes no no yes” under a trigram language model?
P(sent = yes no no yes)
  = P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no) × P(yes | no no) × P(</s> | no yes)
  = 1/2 × 1 × 1/2 × 2/5 × 1/2 = 0.05
(P(yes | no no) = 2/5 because the corpus contains 5 “no no” bigrams, 2 of which are followed by “yes”.)
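To make the counting concrete, here is a minimal Python sketch (not part of the original slides) that estimates MLE trigram probabilities from this toy corpus, pads each sentence with <s> and </s>, and recomputes the probability above; variable and function names are illustrative.

    from collections import defaultdict

    # Toy corpus from the slide
    corpus = [
        "yes no no no no yes".split(),
        "no no no yes yes yes no".split(),
    ]

    tri_counts = defaultdict(int)  # C(w_{i-2}, w_{i-1}, w_i)
    ctx_counts = defaultdict(int)  # C(w_{i-2}, w_{i-1})

    for sent in corpus:
        padded = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(padded)):
            ctx = tuple(padded[i - 2:i])
            tri_counts[ctx + (padded[i],)] += 1
            ctx_counts[ctx] += 1

    def trigram_prob(sentence):
        """MLE trigram probability; assumes every trigram in the sentence was seen (no smoothing)."""
        padded = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for i in range(2, len(padded)):
            ctx = tuple(padded[i - 2:i])
            p *= tri_counts[ctx + (padded[i],)] / ctx_counts[ctx]
        return p

    print(trigram_prob("yes no no yes"))  # 0.05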
Several Problems
• Language has long-distance effects, so we need large n
  ‣ The lecture/s that took place last week was/were on preprocessing.
• Resulting probabilities are often very small
  ‣ Use log probabilities to avoid numerical underflow
• No probabilities for unseen words
  ‣ Use a special symbol to represent them (e.g. <UNK>)
• Words in new contexts
  ‣ By default, zero count for any n-gram we’ve never seen before means zero probability for the whole sentence
  ‣ Need to smooth the LM!
Smoothing
Smoothing
• Basic idea: give events you’ve never seen before some probability
• Must be the case that P(everything) = 1
• Many different kinds of smoothing:
  ‣ Laplacian (add-one) smoothing
  ‣ Add-k smoothing
  ‣ Jelinek-Mercer interpolation
  ‣ Katz backoff
  ‣ Absolute discounting
  ‣ Kneser-Ney
  ‣ And others…
Laplacian (Add-one) Smoothing
• Simple idea: pretend we’ve seen each n-gram once more than we did.
• For unigram models (V = the vocabulary):
  P_{add1}(w_i) = \frac{C(w_i) + 1}{M + |V|}
• For bigram models:
  P_{add1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}
Add-one Example
Corpus: <s> the rat ate the cheese </s>
V = {the, rat, ate, cheese, </s>}
• What’s the bigram probability P(ate | rat) under add-one smoothing?
  = \frac{C(rat\ ate) + 1}{C(rat) + |V|} = \frac{2}{6}
• What’s the bigram probability P(ate | cheese) under add-one smoothing?
  = \frac{C(cheese\ ate) + 1}{C(cheese) + |V|} = \frac{1}{6}
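A minimal sketch of add-one smoothing on this toy corpus, assuming the same vocabulary as the slide; names are illustrative.

    from collections import defaultdict

    # Toy corpus from the slide, padded with sentence boundaries
    tokens = ["<s>", "the", "rat", "ate", "the", "cheese", "</s>"]
    V = {"the", "rat", "ate", "cheese", "</s>"}   # vocabulary used on the slide

    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        context_counts[prev] += 1

    def p_add1(word, context):
        """Add-one smoothed bigram probability P(word | context)."""
        return (bigram_counts[(context, word)] + 1) / (context_counts[context] + len(V))

    print(p_add1("ate", "rat"))      # 2/6 ≈ 0.333
    print(p_add1("ate", "cheese"))   # 1/6 ≈ 0.167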
Add-k Smoothing
• Adding one is often too much
• Instead, add a fraction k
• AKA Lidstone smoothing:
  P_{addk}(w_i \mid w_{i-1}, w_{i-2}) = \frac{C(w_{i-2}, w_{i-1}, w_i) + k}{C(w_{i-2}, w_{i-1}) + k|V|}
• Have to choose k
  ‣ the number of “classes” (n-grams) is huge, so it can have a big effect
Lidstone Smoothing (worked example)
Context = alleged
• 5 observed bigram types, 2 unobserved bigram types
• Counts for this context sum to 20; the probabilities sum to 1.0
• With k = 0.1 and 7 word types in the toy vocabulary:
  ‣ an observed bigram with count 8 gets probability (8 + 0.1) / (20 + 7 × 0.1) ≈ 0.391
  ‣ each unobserved bigram gets probability (0 + 0.1) / (20 + 7 × 0.1) ≈ 0.005
  ‣ the corresponding effective count is the smoothed probability × 20, e.g. 0.391 × 20 ≈ 7.8
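A small sketch of the same Lidstone computation; the individual counts below are hypothetical (the slide only gives the totals, one observed count of 8, and the two zero counts), chosen so that they sum to 20.

    def lidstone(count, context_total, vocab_size, k=0.1):
        """Lidstone (add-k) smoothed probability for one n-gram."""
        return (count + k) / (context_total + k * vocab_size)

    # Hypothetical bigram counts for the context "alleged" (5 observed, 2 unobserved, total 20)
    counts = {"impropriety": 8, "offense": 5, "damage": 4, "deficiencies": 2,
              "outbreak": 1, "infirmity": 0, "cephalopods": 0}

    total = sum(counts.values())           # 20
    V = len(counts)                        # 7
    probs = {w: lidstone(c, total, V) for w, c in counts.items()}

    print(probs["impropriety"])            # (8 + 0.1) / (20 + 0.7) ≈ 0.391
    print(probs["cephalopods"])            # (0 + 0.1) / (20 + 0.7) ≈ 0.005
    print(sum(probs.values()))             # 1.0 (the smoothed probabilities still normalise)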
Absolute Discounting
• ‘Borrows’ a fixed probability mass from observed n-gram counts
• Redistributes it to unseen n-grams
Absolute Discounting (worked example)
Context = alleged
• 5 observed n-gram types, 2 unobserved n-gram types
• With discount D = 0.1:
  ‣ each observed count is reduced by D, e.g. 8 becomes 8 − 0.1
  ‣ total amount of discounted probability mass: (0.1 × 5) / 20 = 0.025
  ‣ this mass is shared equally, so each unseen n-gram receives an effective count of (0.1 × 5) / 2 = 0.25
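The corresponding redistribution, as a short sketch with the same hypothetical counts:

    def absolute_discount(counts, D=0.1):
        """Subtract D from each observed count and share the freed mass equally
        among the unseen events (as on this slide; Katz backoff instead shares it
        according to a lower-order model)."""
        total = sum(counts.values())
        observed = {w: c for w, c in counts.items() if c > 0}
        unseen = [w for w, c in counts.items() if c == 0]
        freed_mass = D * len(observed) / total               # (0.1 * 5) / 20 = 0.025
        probs = {w: (c - D) / total for w, c in observed.items()}
        for w in unseen:
            probs[w] = freed_mass / len(unseen)               # 0.0125 each
        return probs

    counts = {"impropriety": 8, "offense": 5, "damage": 4,
              "deficiencies": 2, "outbreak": 1, "infirmity": 0, "cephalopods": 0}
    probs = absolute_discount(counts)
    print(sum(probs.values()))   # 1.0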
Backoff
• Absolute discounting redistributes the probability mass equally across all unseen n-grams
• Katz backoff: redistributes the mass based on a lower-order model (e.g. unigram)

  P_{katz}(w_i \mid w_{i-1}) =
  \begin{cases}
    \frac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\
    \alpha(w_{i-1}) \times \frac{P(w_i)}{\sum_{w_j : C(w_{i-1}, w_j) = 0} P(w_j)} & \text{otherwise}
  \end{cases}

where:
  ‣ α(w_{i-1}) is the amount of probability mass that has been discounted for context w_{i-1} ((0.1 × 5) / 20 in the previous example)
  ‣ P(w_i) is the unigram probability of w_i
  ‣ the sum in the denominator is over the unigram probabilities of all words that do not co-occur with context w_{i-1}
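A minimal bigram Katz backoff sketch, assuming bigram and unigram count dictionaries are given; edge cases (e.g. unseen contexts) are ignored and all names are illustrative.

    def katz_bigram(bigram_counts, unigram_counts, D=0.1):
        """Return a function p(word, context) implementing bigram Katz backoff.
        bigram_counts: {(context, word): count}; unigram_counts: {word: count}."""
        M = sum(unigram_counts.values())          # total word tokens
        vocab = set(unigram_counts)

        def p_unigram(w):
            return unigram_counts[w] / M

        def p(word, context):
            c_bi = bigram_counts.get((context, word), 0)
            c_ctx = unigram_counts[context]
            if c_bi > 0:
                return (c_bi - D) / c_ctx
            # alpha: the probability mass discounted from this context's observed bigrams
            n_observed = sum(1 for w in vocab if bigram_counts.get((context, w), 0) > 0)
            alpha = D * n_observed / c_ctx
            # redistribute alpha according to the unigram probabilities of the
            # words that never followed this context
            unseen_mass = sum(p_unigram(w) for w in vocab
                              if bigram_counts.get((context, w), 0) == 0)
            return alpha * p_unigram(word) / unseen_mass
        return p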
Issues with Katz Backoff
• I can’t see without my reading ___
• C(reading, glasses) = C(reading, Francisco) = 0
• C(Francisco) > C(glasses)
• Katz backoff will therefore give a higher probability to “Francisco”, even though “glasses” is the more plausible continuation
Kneser-Ney Smoothing
• Redistributes probability mass based on how many different contexts the word w has appeared in
• This measure is called the “continuation probability”
Kneser-Ney Smoothing
  P_{KN}(w_i \mid w_{i-1}) =
  \begin{cases}
    \frac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0 \\
    \alpha(w_{i-1})\, P_{cont}(w_i) & \text{otherwise}
  \end{cases}

where
  P_{cont}(w_i) = \frac{|\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}{\sum_{w_i} |\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}

• High continuation counts for “glasses” (men’s glasses, black glasses, buy glasses, prescription glasses, etc.)
• Low continuation counts for “Francisco” (only “San Francisco”)
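A short sketch of the continuation probability computed from bigram counts; the counts are hypothetical and this illustrates only the P_cont component, not a full Kneser-Ney implementation.

    from collections import defaultdict

    def continuation_probs(bigram_counts):
        """P_cont(w): the number of distinct contexts w follows, normalised over all words."""
        contexts_of = defaultdict(set)
        for (prev, w), c in bigram_counts.items():
            if c > 0:
                contexts_of[w].add(prev)
        total = sum(len(ctxs) for ctxs in contexts_of.values())
        return {w: len(ctxs) / total for w, ctxs in contexts_of.items()}

    # "glasses" follows many different words; "Francisco" essentially only follows "San"
    bigram_counts = {("men's", "glasses"): 3, ("black", "glasses"): 2,
                     ("buy", "glasses"): 4, ("prescription", "glasses"): 1,
                     ("San", "Francisco"): 100}
    p_cont = continuation_probs(bigram_counts)
    print(p_cont["glasses"])    # 4/5 = 0.8 (4 distinct contexts out of 5 in total)
    print(p_cont["Francisco"])  # 1/5 = 0.2 (one distinct context, despite the high count)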
Interpolation
• A better way to combine different order n-gram models
• Weighted sum of probabilities across progressively shorter contexts
• Interpolated trigram model:
  P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, P_3(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P_2(w_i \mid w_{i-1}) + \lambda_1\, P_1(w_i)
  where the λs sum to 1 and are learned based on held-out data
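A small sketch of the interpolated estimate, assuming the component models p_uni, p_bi, p_tri and the λ weights are already available (in practice the weights are tuned on held-out data); names are illustrative.

    def p_interp(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
        """Interpolated trigram probability P(w | prev2, prev1).
        p_uni, p_bi, p_tri are the component (e.g. MLE) models;
        lambdas = (λ1, λ2, λ3) are weights that sum to 1."""
        l1, l2, l3 = lambdas
        return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)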
Interpolated Kneser-Ney Smoothing
• Interpolation instead of back-off:
  P_{IKN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) - D}{C(w_{i-1})} + \beta(w_{i-1})\, P_{cont}(w_i)
• where β(w_{i-1}) is a normalising constant so that P_{IKN}(w_i \mid w_{i-1}) sums to 1.0
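A sketch of the interpolated Kneser-Ney estimate; the slide leaves β(w_{i-1}) abstract, so this assumes the common choice β(context) = D × (number of distinct continuations of the context) / C(context), and the names are illustrative.

    def p_ikn(word, context, bigram_counts, unigram_counts, p_cont, D=0.75):
        """Interpolated Kneser-Ney bigram estimate (a sketch of the slide's formula;
        max(., 0) keeps the discounted count non-negative for unseen bigrams)."""
        c_ctx = unigram_counts[context]
        c_bi = bigram_counts.get((context, word), 0)
        # beta: the discounted mass for this context
        n_types = sum(1 for (prev, w), c in bigram_counts.items() if prev == context and c > 0)
        beta = D * n_types / c_ctx
        return max(c_bi - D, 0) / c_ctx + beta * p_cont[word]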
In Practice
• Commonly used Kneser-Ney language models use 5-grams as the maximum order
• They use different discount values for each n-gram order
Generating Language
Generation
• Given an initial word, draw the next word according to the probability distribution defined by the language model
• Include (n−1) start tokens for an n-gram model, to provide context to generate the first word
  ‣ never generate <s>
  ‣ generating </s> terminates the sequence
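A minimal sampler along these lines; the bigram probability table and the numbers in it are illustrative, not taken from the slides.

    import random

    def generate(bigram_probs, max_len=20):
        """Sample a sentence from a bigram LM given as {context: {word: prob}}.
        Starts from <s>, stops when </s> is generated (or at max_len)."""
        sentence = ["<s>"]
        while len(sentence) < max_len:
            dist = bigram_probs[sentence[-1]]
            words, probs = zip(*dist.items())
            next_word = random.choices(words, weights=probs)[0]
            sentence.append(next_word)
            if next_word == "</s>":
                break
        return sentence

    # Toy bigram distributions (illustrative numbers)
    bigram_probs = {
        "<s>":   {"a": 0.5, "cow": 0.2, "grass": 0.2, "eats": 0.1},
        "a":     {"cow": 0.7, "grass": 0.3},
        "cow":   {"eats": 0.6, "grass": 0.4},
        "eats":  {"grass": 0.8, "a": 0.2},
        "grass": {"</s>": 0.6, "a": 0.4},
    }
    print(generate(bigram_probs))   # e.g. ['<s>', 'a', 'cow', 'eats', 'grass', '</s>']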
Generation (Bigram LM)
• Step by step, drawing each next word from the bigram distribution:
  ‣ Sentence = <s>; sample from P(? | <s>) → “a”
  ‣ Sentence = <s> a; sample from P(? | a) → “cow”
  ‣ Sentence = <s> a cow; sample from P(? | cow) → “eats”
  ‣ Sentence = <s> a cow eats; sample from P(? | eats) → “grass”
  ‣ Sentence = <s> a cow eats grass; sample from P(? | grass) → “</s>”
  ‣ Sentence = <s> a cow eats grass </s>. Done!
How to Select the Next Word?
• Argmax: take the highest probability word each turn
  ‣ Greedy search
• Beam search decoding:
  ‣ Keeps track of the top-N highest probability words each turn
  ‣ Produces sentences with near-optimal sentence probability
• Randomly sample from the distribution (e.g. temperature sampling)
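A compact beam-search sketch over the same kind of bigram table as the sampler above (a simplified illustration; real decoders also handle details like length normalisation).

    import math

    def beam_search(bigram_probs, beam_size=3, max_len=20):
        """Keep the top-N partial sequences by log-probability at each step."""
        beams = [(["<s>"], 0.0)]                  # (sequence, log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for seq, logp in beams:
                for word, p in bigram_probs.get(seq[-1], {}).items():
                    candidates.append((seq + [word], logp + math.log(p)))
            candidates.sort(key=lambda x: x[1], reverse=True)
            beams = []
            for seq, logp in candidates[:beam_size]:
                (finished if seq[-1] == "</s>" else beams).append((seq, logp))
            if not beams:
                break
        best = max(finished or beams, key=lambda x: x[1])
        return best[0]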
Evaluating Language Models
Evaluation
• Extrinsic
  ‣ E.g. spelling correction, machine translation
• Intrinsic
  ‣ Perplexity on a held-out test set
Perplexity
• Inverse probability of the entire test set
  ‣ Normalised by the number of word tokens (including </s>)
• The lower, the better
  PP(w_1, w_2, \ldots, w_m) = \sqrt[m]{\frac{1}{P(w_1, w_2, \ldots, w_m)}}
  or equivalently
  PP(w_1, w_2, \ldots, w_m) = 2^{-\frac{\log_2 P(w_1, w_2, \ldots, w_m)}{m}}
• Unknown (OOV) words are a problem (typically omitted from the computation)
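A minimal perplexity computation in log space, assuming some model function lm_prob(word, context) and a tokenised test set; names are illustrative.

    import math

    def perplexity(test_sentences, lm_prob, n=2):
        """Perplexity of a test set under an n-gram model.
        lm_prob(word, context) should return P(word | context) for an (n-1)-word context."""
        log_prob, num_tokens = 0.0, 0
        for sent in test_sentences:
            padded = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(n - 1, len(padded)):
                context = tuple(padded[i - n + 1:i])
                log_prob += math.log2(lm_prob(padded[i], context))
                num_tokens += 1          # counts real tokens plus </s>, but not <s>
        return 2 ** (-log_prob / num_tokens)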
Example Perplexity Scores
• Corpus: Wall Street Journal
• Train partition: 38 million word tokens, almost 20K word types (vocabulary)
• Test partition: 1.5 million word tokens
A Final Word
• N-gram language models are a simple but effective way to capture the predictability of language
• Information can be derived in an unsupervised fashion, and scales to large corpora
• They require smoothing to be effective, due to sparsity
Reading
• E18, Chapter 6 (skip 6.3)