N-gram Language Models
COMP90042
Natural Language Processing
Lecture 3
Semester 1 2021 Week 2 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Language Models
• One application of NLP is explaining language
  ‣ Why are some sentences more fluent than others?
• We measure 'goodness' using probabilities estimated by language models
• E.g. in speech recognition:
  recognise speech > wreck a nice beach
• Language models can also be used for generation
Language Models
• Useful for:
  ‣ Query completion
  ‣ Optical character recognition
• Other generation tasks:
  ‣ Machine translation
  ‣ Summarisation
  ‣ Dialogue systems
• Nowadays pretrained language models are the backbone of modern NLP systems
Outline
• Deriving n-gram language models
• Smoothing to deal with sparsity
• Generating language
Probabilities: Joint to Conditional
• Our goal is to get a probability for an arbitrary sequence of m words:
  P(w1, w2, …, wm)
• The first step is to apply the chain rule to convert joint probabilities to conditional ones:
  P(w1, w2, …, wm) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wm | w1, …, wm−1)
The Markov Assumption
• Still intractable, so we make a simplifying assumption:
  P(wi | w1 … wi−1) ≈ P(wi | wi−n+1 … wi−1)   for some small n
• When n = 1, a unigram model:
  P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi)
• When n = 2, a bigram model:
  P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi | wi−1)
• When n = 3, a trigram model:
  P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi | wi−2, wi−1)
(The slide illustrates each model on the example sentence "the dog barks".)
Maximum Likelihood Estimation
• How do we calculate the probabilities? Estimate them from counts in our corpus.
• For unigram models (M = total number of word tokens):
  P(wi) = C(wi) / M            e.g. P(barks) = C(barks) / M
• For bigram models:
  P(wi | wi−1) = C(wi−1 wi) / C(wi−1)            e.g. P(barks | dog) = C(dog barks) / C(dog)
• For n-gram models generally:
  P(wi | wi−n+1 … wi−1) = C(wi−n+1 … wi) / C(wi−n+1 … wi−1)
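As a concrete illustration (not from the slides), here is a minimal Python sketch of these MLE estimates; the toy corpus and function names are my own.

```python
from collections import Counter

# Tiny invented corpus; each sentence is book-ended with <s> and </s> (next slide).
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "meows", "</s>"]]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

M = sum(unigram_counts.values())  # total number of tokens

def p_unigram(w):
    return unigram_counts[w] / M                            # C(w) / M

def p_bigram(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]  # C(prev w) / C(prev)

print(p_bigram("dog", "the"))  # C(the dog) / C(the) = 1/2
```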
Book-ending Sequences
• Special tags are used to denote the start and end of a sequence:
  ‣ <s> = sentence start
  ‣ </s> = sentence end
Trigram Example
Corpus:
  <s> <s> yes no no no no yes </s>
  <s> <s> no no no yes yes yes no </s>

What is the probability of
  yes no no yes
under a trigram language model?

P(yes no no yes) = P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no) × P(yes | no no) × P(</s> | no yes)

(We need to predict </s> because it is the end of the sentence.)
Corpus:
  <s> <s> yes no no no no yes </s>
  <s> <s> no no no yes yes yes no </s>

Compute: P(yes no no yes)

P(wi | wi−2, wi−1) = C(wi−2, wi−1, wi) / C(wi−2, wi−1)

  P(yes | <s> <s>) = 1/2    ("<s> <s>" occurs twice, "<s> <s> yes" once)
  P(no | <s> yes)  = 1/1    ("<s> yes" occurs once, "<s> yes no" once)
  P(no | yes no)   = 1/2    ("yes no" occurs twice, "yes no no" once)
  P(yes | no no)   = 2/5    ("no no" occurs five times, "no no yes" twice)
  P(</s> | no yes) = 1/2    ("no yes" occurs twice, "no yes </s>" once)

P(yes no no yes) = 1/2 × 1 × 1/2 × 2/5 × 1/2 = 0.05
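A small sketch (my own, not part of the slides) that reproduces this calculation by counting bigrams and trigrams over the book-ended corpus:

```python
from collections import Counter

corpus = [["<s>", "<s>", "yes", "no", "no", "no", "no", "yes", "</s>"],
          ["<s>", "<s>", "no", "no", "no", "yes", "yes", "yes", "no", "</s>"]]

bigrams, trigrams = Counter(), Counter()
for sent in corpus:
    bigrams.update(zip(sent, sent[1:]))
    trigrams.update(zip(sent, sent[1:], sent[2:]))

def p(w, u, v):
    # P(w | u, v) = C(u, v, w) / C(u, v)
    return trigrams[(u, v, w)] / bigrams[(u, v)]

sentence = ["<s>", "<s>", "yes", "no", "no", "yes", "</s>"]
prob = 1.0
for u, v, w in zip(sentence, sentence[1:], sentence[2:]):
    prob *= p(w, u, v)
print(prob)  # 1/2 * 1 * 1/2 * 2/5 * 1/2 = 0.05
```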
Several Problems
• Language has long-distance effects, so we need a large n
  ‣ The lecture/s that took place last week was/were on preprocessing.
• Resulting probabilities are often very small
  ‣ Use log probabilities to avoid numerical underflow
• What about unseen words?
  ‣ Use a special symbol to represent them (e.g. <UNK>)
• Unseen n-grams?
  ‣ P(w1, w2, …, wm) = P(w1 | <s>) × P(w2 | w1) × …, so the whole term = 0 if any factor such as P(w2 | w1) = 0
  ‣ Need to smooth the LM!
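To make the underflow point concrete, a quick sketch (my own) of working in log space:

```python
import math

# Multiplying many small probabilities risks underflow; summing their logs does not.
probs = [0.5, 1.0, 0.5, 0.4, 0.5]           # the trigram example above
log_prob = sum(math.log(p) for p in probs)  # about -3.0
print(math.exp(log_prob))                   # back to ~0.05
```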
Smoothing
Smoothing
• Basic idea: give events you've never seen before some probability
• Must be the case that P(everything) = 1
• Many different kinds of smoothing:
  ‣ Laplacian (add-one) smoothing
  ‣ Add-k smoothing
  ‣ Absolute discounting
  ‣ Kneser-Ney
  ‣ And others…
Laplacian (Add-one) Smoothing
• Simple idea: pretend we've seen each n-gram once more than we actually did.
• For unigram models (V = the vocabulary):
  Padd1(wi) = (C(wi) + 1) / (M + |V|)
• For bigram models:
  Padd1(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + |V|)
Add-one Example
Corpus: <s> the rat ate the cheese </s>
• What's the bigram probability P(ate | rat) under add-one smoothing?
  Padd1(ate | rat) = (C(rat ate) + 1) / (C(rat) + |V|) = 2/6,   with V = {the, rat, ate, cheese, </s>}
• What's the bigram probability P(ate | cheese) under add-one smoothing?
  Padd1(ate | cheese) = (C(cheese ate) + 1) / (C(cheese) + |V|) = 1/6
• <s> is not part of the vocabulary because we never need to infer its conditional probability (e.g. P(<s> | …))
  Recall: P(yes no no yes) = P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no) × P(yes | no no) × P(</s> | no yes)
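A small sketch (my own) that checks these two add-one values on the toy corpus:

```python
from collections import Counter

sent = ["<s>", "the", "rat", "ate", "the", "cheese", "</s>"]
unigrams = Counter(sent)
bigrams = Counter(zip(sent, sent[1:]))

# <s> is excluded from V: we never need to predict it as a next word.
V = {w for w in sent if w != "<s>"}   # {the, rat, ate, cheese, </s>}

def p_add1(w, prev):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(V))

print(p_add1("ate", "rat"))     # (1 + 1) / (1 + 5) = 2/6
print(p_add1("ate", "cheese"))  # (0 + 1) / (1 + 5) = 1/6
```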
Add-k Smoothing
• Adding one is often too much
• Instead, add a fraction k
• AKA Lidstone smoothing:
  Paddk(wi | wi−1, wi−2) = (C(wi−2, wi−1, wi) + k) / (C(wi−2, wi−1) + k|V|)
• Have to choose k
Lidstone Smoothing
• Context = alleged, which occurs 20 times
• Word types: impropriety, offense, damage, deficiencies, outbreak, infirmity, alleged (|V| = 7)
• 5 observed bigrams, 2 unobserved bigrams
• With k = 0.1:
  ‣ an unobserved bigram: (0 + 0.1) / (20 + 7 × 0.1)
  ‣ an observed bigram with count 8: (8 + 0.1) / (20 + 7 × 0.1) ≈ 0.391, i.e. an effective count of 0.391 × 20 ≈ 7.8
(The full slide shows these calculations as a table over the seven words.)
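A quick arithmetic check (my own) of the two Lidstone calculations above:

```python
# Add-k (Lidstone) smoothing for the "alleged" context, with k = 0.1 and |V| = 7.
k, context_count, V_size = 0.1, 20, 7
print((8 + k) / (context_count + k * V_size))  # observed count 8 -> ~0.391
print((0 + k) / (context_count + k * V_size))  # unseen bigram    -> ~0.005
```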
Absolute Discounting
• 'Borrows' a fixed probability mass from observed n-gram counts
• Redistributes it to unseen n-grams
Absolute Discounting
• Context = alleged, which occurs 20 times; 5 observed bigrams, 2 unobserved bigrams
• Subtract a fixed discount D = 0.1 from each observed count (e.g. 8 becomes 8 − 0.1 = 7.9)
• The borrowed mass, 0.1 × 5 = 0.5, is shared equally among the 2 unseen bigrams:
  ‣ effective count for each unseen bigram = (0.1 × 5) / 2 = 0.25
  ‣ probability for each unseen bigram = 0.25 / 20
(The full slide shows the discounted counts for all seven words as a table.)
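A sketch (my own) of absolute discounting over this single context; the individual observed counts are assumed, chosen only so that there are 5 observed types summing to 20 with one count of 8, as on the slide:

```python
# Absolute discounting for one context ("alleged"), discount D = 0.1.
observed = {"impropriety": 8, "offense": 5, "damage": 4,
            "deficiencies": 2, "outbreak": 1}        # assumed counts, total 20
unseen = ["infirmity", "alleged"]                    # the 2 unobserved continuations
D = 0.1

total = sum(observed.values())        # 20 occurrences of the context
reserved = D * len(observed)          # 0.1 x 5 = 0.5 of borrowed count mass

def p_abs(word):
    if word in observed:
        return (observed[word] - D) / total       # e.g. (8 - 0.1) / 20
    return (reserved / len(unseen)) / total       # 0.25 / 20 for each unseen word

# The smoothed distribution still sums to 1 over the seven words.
assert abs(sum(p_abs(w) for w in list(observed) + unseen) - 1.0) < 1e-9
print(p_abs("infirmity"))  # 0.0125
```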
Backoff
• Absolute discounting redistributes the probability mass equally across all unseen n-grams
• Katz backoff: redistributes the mass based on a lower-order model (e.g. unigram)

  Pkatz(wi | wi−1) =
    (C(wi−1, wi) − D) / C(wi−1),                             if C(wi−1, wi) > 0
    α(wi−1) × P(wi) / Σ_{wj : C(wi−1, wj) = 0} P(wj),        otherwise

  where:
  ‣ α(wi−1) = the amount of probability mass that has been discounted for context wi−1 ((0.1 × 5) / 20 in the previous slide)
  ‣ P(wi) = the unigram probability for wi, e.g. P(infirmity)
  ‣ Σ_{wj : C(wi−1, wj) = 0} P(wj) = the sum of unigram probabilities for all words that do not co-occur with context wi−1, e.g. P(infirmity) + P(alleged)
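For contrast with the equal split above, a sketch (my own) of how Katz backoff would reallocate the same discounted mass using unigram probabilities; the unigram values here are invented:

```python
# Katz backoff for one context ("alleged"), discount D = 0.1.
D, context_count = 0.1, 20
observed = {"impropriety": 8, "offense": 5, "damage": 4,
            "deficiencies": 2, "outbreak": 1}                  # assumed counts
unigram_p = {"impropriety": 0.02, "offense": 0.03, "damage": 0.05,
             "deficiencies": 0.02, "outbreak": 0.03,
             "infirmity": 0.01, "alleged": 0.04}               # invented unigram probs

alpha = D * len(observed) / context_count                      # discounted mass: 0.5 / 20
unseen = [w for w in unigram_p if w not in observed]           # infirmity, alleged
unseen_mass = sum(unigram_p[w] for w in unseen)                # P(infirmity) + P(alleged)

def p_katz(w):
    if w in observed:
        return (observed[w] - D) / context_count
    return alpha * unigram_p[w] / unseen_mass                  # back off to unigram

assert abs(sum(p_katz(w) for w in unigram_p) - 1.0) < 1e-9     # still a distribution
```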
Issues with Katz Backoff
• Example: "I can't see without my reading ___"
• C(reading, glasses) = C(reading, Francisco) = 0
• But C(Francisco) > C(glasses)
• So Katz backoff will give a higher probability to Francisco than to glasses
Kneser-Ney Smoothing
• Redistribute probability mass based on the versatility of the lower-order n-gram
• AKA "continuation probability"
• What is versatility?
  ‣ High versatility -> co-occurs with a lot of unique words, e.g. glasses
    (men's glasses, black glasses, buy glasses, etc.)
  ‣ Low versatility -> co-occurs with few unique words, e.g. francisco
    (san francisco)
Kneser-Ney Smoothing

  PKN(wi | wi−1) =
    (C(wi−1, wi) − D) / C(wi−1),      if C(wi−1, wi) > 0
    β(wi−1) Pcont(wi),                otherwise

  where β(wi−1) is the amount of probability mass that has been discounted for context wi−1, and

  Pcont(wi) = |{wi−1 : C(wi−1, wi) > 0}| / Σ_{wi} |{wi−1 : C(wi−1, wi) > 0}|

• Intuitively, the numerator of Pcont counts the number of unique wi−1 that co-occur with wi
• High continuation counts for glasses
• Low continuation counts for Francisco
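A sketch (my own) of the continuation probability: count how many distinct left contexts each word appears after. The bigram types here are invented.

```python
from collections import defaultdict

# Invented bigram types; "glasses" follows many different words, "francisco" only one.
bigram_types = {("reading", "glasses"), ("black", "glasses"), ("buy", "glasses"),
                ("san", "francisco")}

left_contexts = defaultdict(set)
for prev, w in bigram_types:
    left_contexts[w].add(prev)

total = sum(len(ctxs) for ctxs in left_contexts.values())   # denominator of Pcont

def p_cont(w):
    return len(left_contexts[w]) / total

print(p_cont("glasses"), p_cont("francisco"))  # 0.75 vs 0.25
```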
Interpolation
• A better way to combine different orders of n-gram models
• Weighted sum of probabilities across progressively shorter contexts
• Interpolated trigram model:
  PIN(wi | wi−2, wi−1) = λ3 P3(wi | wi−2, wi−1) + λ2 P2(wi | wi−1) + λ1 P1(wi)
  with λ1 + λ2 + λ3 = 1, and the λs learned based on held-out data
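A minimal sketch (my own) of the interpolated trigram estimate; the component models are passed in as functions and the lambda values are placeholders, since in practice they are tuned on held-out data:

```python
def p_interp(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Weighted sum of unigram, bigram and trigram estimates of P(w | u, v)."""
    l1, l2, l3 = lambdas                       # must sum to 1
    return l3 * p_tri(w, u, v) + l2 * p_bi(w, v) + l1 * p_uni(w)

# Dummy component models, just to show the mechanics:
print(p_interp("barks", "the", "dog",
               p_uni=lambda w: 0.01,
               p_bi=lambda w, v: 0.1,
               p_tri=lambda w, u, v: 0.4))     # 0.5*0.4 + 0.3*0.1 + 0.2*0.01 = 0.232
```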
Interpolated Kneser-Ney Smoothing
• Interpolation instead of back-off:

  PIKN(wi | wi−1) = (C(wi−1, wi) − D) / C(wi−1) + γ(wi−1) Pcont(wi)

  where γ(wi−1) is a normalising constant such that Σ_{wi ∈ V} PIKN(wi | wi−1) = 1.0
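A sketch (my own) for a single context, using the common max(C − D, 0) form so that unseen words fall back entirely on Pcont; the counts and continuation probabilities are invented:

```python
# Interpolated Kneser-Ney for one context, discount D = 0.1.
D = 0.1
bigram_counts = {"glasses": 3, "francisco": 1}   # C(prev, w) for this context (invented)
context_count = sum(bigram_counts.values())      # C(prev) = 4
p_cont = {"glasses": 0.75, "francisco": 0.2, "monocle": 0.05}  # invented, sums to 1

# gamma(prev) equals the mass discounted from the observed bigrams, so things normalise.
gamma = D * len(bigram_counts) / context_count   # 0.1 * 2 / 4 = 0.05

def p_ikn(w):
    discounted = max(bigram_counts.get(w, 0) - D, 0) / context_count
    return discounted + gamma * p_cont[w]

assert abs(sum(p_ikn(w) for w in p_cont) - 1.0) < 1e-9
```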
In Practice
• Commonly used Kneser-Ney language models use 5-grams as the maximum order
• They have different discount values for each n-gram order
How can we use an n-gram language model to fix spelling errors?
• Build a character-level LM and fix it character by character
• Build a word-level LM and use left/right context words to find the most appropriate substituting word
• Build a LM using common typos as training data
• N-gram LMs can't fix spelling errors

PollEv.com/jeyhanlau569
Generating Language
Generation
• Given an initial word, draw the next word according to the probability distribution produced by the language model.
• Include (n−1) <s> tokens for an n-gram model, to provide context to generate the first word
  ‣ never generate <s>
  ‣ generating </s> terminates the sequence
Generation (Bigram LM)
• Sentence = <s>
  P(? | <s>) is highest for "a" (0.5)
• Sentence = <s> a
  P(? | a) is highest for "cow" (0.7)
• Sentence = <s> a cow
  P(? | cow) is highest for "eats" (0.6)
• Sentence = <s> a cow eats
  P(? | eats) is highest for "grass" (0.8)
• Sentence = <s> a cow eats grass
  P(? | grass) is highest for "</s>" (0.6)
• Sentence = <s> a cow eats grass </s>. Done!
(Each step on the original slides shows a bar chart of the distribution over {a, cow, eats, grass, </s>}; only the selected word and its probability are reproduced here.)
How to Select Next Word?
• Argmax: take the highest probability word each turn
  ‣ Greedy search
• Randomly sample from the distribution
• Beam search decoding (lecture 17):
  ‣ Keep track of the top-N highest probability words each turn
  ‣ Select the sequence of words that produces the best sentence probability
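A sketch (my own) of the generation loop with the first two strategies, greedy argmax and random sampling, over an invented bigram table that roughly follows the walkthrough above; beam search is left to lecture 17:

```python
import random

# Invented bigram distributions P(next | prev); each row sums to 1.
bigram_lm = {
    "<s>":   {"a": 0.5, "cow": 0.2, "grass": 0.2, "eats": 0.1},
    "a":     {"cow": 0.7, "grass": 0.2, "eats": 0.1},
    "cow":   {"eats": 0.6, "grass": 0.3, "</s>": 0.1},
    "eats":  {"grass": 0.8, "a": 0.1, "</s>": 0.1},
    "grass": {"</s>": 0.6, "a": 0.2, "eats": 0.2},
}

def generate(strategy="greedy", max_len=10):
    words = ["<s>"]                          # (n-1) = 1 start token for a bigram model
    while len(words) < max_len:
        dist = bigram_lm[words[-1]]
        if strategy == "greedy":             # argmax: highest-probability word each turn
            nxt = max(dist, key=dist.get)
        else:                                # randomly sample from the distribution
            nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(nxt)
        if nxt == "</s>":                    # generating </s> terminates the sequence
            break
    return " ".join(words)

print(generate("greedy"))   # <s> a cow eats grass </s>
print(generate("sample"))   # varies from run to run
```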
A Final Word
• N-gram language models are a simple but effective way to capture the predictability of language
• They can be trained in an unsupervised fashion and scale to large corpora
• They require smoothing to be effective
• Modern language models use neural networks (lecture 8)
Reading
• E18 Chapter 6 (skip 6.3)