COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 3
Semester 1 2021 Week 2
Jey Han Lau
N-gram Language Models
Language Models
• One application of NLP is explaining language
‣ Why some sentences are more fluent than others
• E.g. in speech recognition:
• recognise speech > wreck a nice beach
• We measure ‘goodness’ using probabilities
estimated by language models
• Language models can also be used for generation
Language Models
• Useful for
‣ Query completion
‣ Optical character recognition
• Other generation tasks
‣ Machine translation
‣ Summarisation
‣ Dialogue systems
• Nowadays pretrained language models are the
backbone of modern NLP systems
Outline
• Deriving n-gram language models
• Smoothing to deal with sparsity
• Generating language
Probabilities: Joint to Conditional
Our goal is to get a probability for an arbitrary sequence of m words:

P(w1, w2, …, wm)

First step is to apply the chain rule to convert joint probabilities to conditional ones:

P(w1, w2, …, wm) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wm | w1, …, wm−1)
The Markov Assumption
Still intractable, so make a simplifying assumption:

P(wi | w1 … wi−1) ≈ P(wi | wi−n+1 … wi−1)    for some small n

When n = 1, a unigram model:
P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi)

When n = 2, a bigram model:
P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi | wi−1)

When n = 3, a trigram model:
P(w1, w2, …, wm) = ∏_{i=1}^{m} P(wi | wi−2, wi−1)

(The slide illustrates each case on the sentence "the dog barks", showing how much of the preceding context is used to predict each word wi.)
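A minimal sketch in Python of the bigram (n = 2) decomposition, with probability values invented purely to make the product of conditionals concrete:

# Bigram decomposition of P(the dog barks); the probabilities are made up.
bigram_prob = {
    ("<s>", "the"): 0.30,
    ("the", "dog"): 0.05,
    ("dog", "barks"): 0.20,
    ("barks", "</s>"): 0.40,
}

def bigram_sentence_prob(words):
    # P(w1..wm) ≈ product of P(wi | wi-1), with <s>/</s> book-ends
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob[(prev, cur)]
    return prob

print(bigram_sentence_prob(["the", "dog", "barks"]))  # 0.3 * 0.05 * 0.2 * 0.4 = 0.0012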
Maximum Likelihood Estimation
How do we calculate the probabilities? Estimate based on counts in our
corpus:
For unigram models,
P(wi) = C(wi) / M
e.g. P(barks) = C(barks) / M, where M is the total number of word tokens in the corpus

For bigram models,
P(wi | wi−1) = C(wi−1 wi) / C(wi−1)
e.g. P(barks | dog) = C(dog barks) / C(dog)

For n-gram models generally,
P(wi | wi−n+1 … wi−1) = C(wi−n+1 … wi) / C(wi−n+1 … wi−1)
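A small sketch of these MLE estimates on a tiny invented corpus (for simplicity, M counts all tokens including the book-end tags):

import collections

corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "meows", "</s>"]]

unigram = collections.Counter(w for sent in corpus for w in sent)
bigram = collections.Counter(
    (prev, cur) for sent in corpus for prev, cur in zip(sent, sent[1:]))
M = sum(unigram.values())   # total number of tokens

def p_unigram(w):
    return unigram[w] / M                      # C(w) / M

def p_bigram(w, prev):
    return bigram[(prev, w)] / unigram[prev]   # C(prev w) / C(prev)

print(p_unigram("the"))          # 2 / 10
print(p_bigram("barks", "dog"))  # C(dog barks) / C(dog) = 1 / 1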
Book-ending Sequences
• Special tags used to denote start and end of sequence
‣ <s> = sentence start
‣ </s> = sentence end
Trigram example
Corpus:
yes no no no no yes
no no no yes yes yes no
What is the probability of
yes no no yes
under a trigram language model?
P(yes no no yes) = P(yes | <s> <s>)
× P(no | <s> yes)
× P(no | yes no)
× P(yes | no no)
× P(</s> | no yes)

Need to predict </s> because it's the end of the sentence!
Working through each term, using

P(wi | wi−2, wi−1) = C(wi−2, wi−1, wi) / C(wi−2, wi−1)

with the corpus padded as "<s> <s> yes no no no no yes </s>" and "<s> <s> no no no yes yes yes no </s>":

P(yes | <s> <s>) = 1/2    context occurrences: <s> <s> yes, <s> <s> no
P(no | <s> yes) = 1/1    context occurrence: <s> yes no
P(no | yes no) = 1/2    context occurrences: yes no no, yes no </s>
P(yes | no no) = 2/5    context occurrences: no no no, no no no, no no yes, no no no, no no yes
P(</s> | no yes) = 1/2    context occurrences: no yes </s>, no yes yes

P(yes no no yes) = ½ × 1 × ½ × ⅖ × ½ = 0.05
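The whole calculation can be checked with a short script (a sketch: it pads each sentence with two <s> tokens and one </s>, as in the example above):

import collections

corpus = [["yes", "no", "no", "no", "no", "yes"],
          ["no", "no", "no", "yes", "yes", "yes", "no"]]
padded = [["<s>", "<s>"] + sent + ["</s>"] for sent in corpus]

tri, bi = collections.Counter(), collections.Counter()
for sent in padded:
    for u, v, w in zip(sent, sent[1:], sent[2:]):
        tri[(u, v, w)] += 1    # C(wi-2, wi-1, wi)
        bi[(u, v)] += 1        # C(wi-2, wi-1), counted as trigram contexts

def p(w, u, v):
    return tri[(u, v, w)] / bi[(u, v)]

query = ["<s>", "<s>", "yes", "no", "no", "yes", "</s>"]
prob = 1.0
for u, v, w in zip(query, query[1:], query[2:]):
    prob *= p(w, u, v)
print(prob)   # 0.5 * 1.0 * 0.5 * 0.4 * 0.5 = 0.05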
Several Problems
• Language has long distance effects — need large n
‣ The lecture/s that took place last week was/were on
preprocessing.
• Resulting probabilities are often very small
‣ Use log probability to avoid numerical underflow
• What about unseen words?
‣ Special symbol to represent them (e.g. <UNK>)
• Unseen n-grams?
‣ If any n-gram in the sentence is unseen, the whole sentence gets probability 0:
P(w1, w2, …, wm) = P(w1 | <s>) × P(w2 | w1) × … , and the whole term = 0 if P(w2 | w1) = 0
‣ Need to smooth the LM!
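A tiny sketch of why log probabilities matter (values invented): multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays finite.

import math

probs = [0.01] * 200          # invented per-word probabilities of a long text

product = 1.0
for p in probs:
    product *= p              # underflows: 0.01**200 is below the float range

log_prob = sum(math.log(p) for p in probs)
print(product)                # 0.0
print(log_prob)               # approx -921.03 (= 200 * log 0.01)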
Smoothing
Smoothing
• Basic idea: give events you’ve never seen before
some probability
• Must be the case that P(everything) = 1
• Many different kinds of smoothing
‣ Laplacian (add-one) smoothing
‣ Add-k smoothing
‣ Absolute discounting
‣ Kneser-Ney
‣ And others…
Laplacian (Add-one) Smoothing
• Simple idea: pretend we’ve seen each n-gram
once more than we did.
For unigram models (V = the vocabulary),
P_add1(wi) = (C(wi) + 1) / (M + |V|)

For bigram models,
P_add1(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + |V|)
Add-one Example
the rat ate the cheese
What’s the bigram probability P(ate | rat) under add-one
smoothing?
= = V = { the, rat, ate, cheese, }
What’s the bigram probability P(ate | cheese) under add-one
smoothing?
= =
𝐶(𝑟𝑎𝑡 𝑎𝑡𝑒) + 1
𝐶(𝑟𝑎𝑡) + |𝑽 |
2
6
𝐶(𝑐h𝑒𝑒𝑠𝑒 𝑎𝑡𝑒) + 1
𝐶(𝑐h𝑒𝑒𝑠𝑒) + |𝑽 |
1
6
is not part of vocabulary because we never need to
infer its conditional probability (e.g. P( | …))
Recall: P(yes no no yes) =
P(yes | ) P(no | yes) | no yes)
P(no | yes no) P(yes | no no)
P(
× ×
× ×
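A short sketch reproducing the two numbers above (assuming the single-sentence corpus and the vocabulary shown on the slide):

import collections

sent = ["<s>", "the", "rat", "ate", "the", "cheese", "</s>"]
V = {"the", "rat", "ate", "cheese", "</s>"}     # <s> excluded, as on the slide

unigram = collections.Counter(sent)
bigram = collections.Counter(zip(sent, sent[1:]))

def p_add1(w, prev):
    # (C(prev w) + 1) / (C(prev) + |V|)
    return (bigram[(prev, w)] + 1) / (unigram[prev] + len(V))

print(p_add1("ate", "rat"))      # (1 + 1) / (1 + 5) = 2/6
print(p_add1("ate", "cheese"))   # (0 + 1) / (1 + 5) = 1/6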
Add-k Smoothing
• Adding one is often too much
• Instead, add a fraction k
• AKA Lidstone Smoothing
• Have to choose k
P_addk(wi | wi−1, wi−2) = (C(wi−2, wi−1, wi) + k) / (C(wi−2, wi−1) + k|V|)
Lidstone Smoothing
Context = alleged
• 5 observed bigrams (counts summing to 20), e.g. "alleged impropriety" with count 8
• 2 unobserved bigrams (continuations infirmity and alleged)
• candidate next words: impropriety, offense, damage, deficiencies, outbreak, infirmity, alleged (|V| = 7)

With k = 0.1:
• observed, e.g. impropriety: (8 + 0.1) / (20 + 7 × 0.1) ≈ 0.391, effective count 0.391 × 20 ≈ 7.83
• unobserved: (0 + 0.1) / (20 + 7 × 0.1) ≈ 0.005
• the counts still sum to 20 and the probabilities still sum to 1.0
(The slide shows these as a per-word count/probability table.)
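A quick sketch checking the Lidstone figures above (only the count of 8 for impropriety, the total of 20, |V| = 7 and k = 0.1 are taken from the slide):

k, V_size, context_total = 0.1, 7, 20    # context = "alleged"

def p_lidstone(count):
    return (count + k) / (context_total + k * V_size)

print(round(p_lidstone(8), 3))                   # 0.391  (observed: impropriety)
print(round(p_lidstone(0), 4))                   # 0.0048 (unobserved)
print(round(p_lidstone(8) * context_total, 2))   # 7.83   effective count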
Absolute Discounting
• ‘Borrows’ a fixed probability mass from observed
n-gram counts
• Redistributes it to unseen n-grams
Absolute Discounting
Context = alleged
• 5 observed bigrams, 2 unobserved bigrams (continuations infirmity and alleged)
With an absolute discount of 0.1:
• each observed count is reduced by 0.1, e.g. impropriety: 8 − 0.1 = 7.9
• the borrowed mass is shared equally by the unseen bigrams: (0.1 × 5) / 2 = 0.25 effective counts each, i.e. probability 0.25 / 20
• the counts still sum to 20 and the probabilities still sum to 1.0
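The same redistribution as a short sketch (assuming D = 0.1, 5 observed and 2 unseen bigram types, and a context total of 20, as above):

D, n_observed, n_unseen, context_total = 0.1, 5, 2, 20

borrowed = D * n_observed               # 0.5 counts taken from the observed bigrams
effective_unseen = borrowed / n_unseen  # 0.25 effective counts per unseen bigram

p_impropriety = (8 - D) / context_total       # 7.9 / 20 = 0.395
p_unseen = effective_unseen / context_total   # 0.25 / 20 = 0.0125
print(p_impropriety, p_unseen)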
Backoff
• Absolute discounting redistributes the probability
mass equally for all unseen n-grams
• Katz Backoff: redistributes the mass based on a
lower order model (e.g. unigram)
Pkatz(wi | wi−1) =
    (C(wi−1, wi) − D) / C(wi−1),   if C(wi−1, wi) > 0
    α(wi−1) × P(wi) / Σ_{wj : C(wi−1, wj) = 0} P(wj),   otherwise

where
• α(wi−1) = the amount of probability mass that has been discounted for context wi−1 ((0.1 × 5) / 20 in the previous slide)
• P(wi) = the unigram probability for wi, e.g. P(infirmity)
• Σ_{wj : C(wi−1, wj) = 0} P(wj) = the sum of unigram probabilities for all words that do not co-occur with context wi−1, e.g. P(infirmity) + P(alleged)
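A minimal sketch of this backoff rule (a simplification: one fixed discount D and hypothetical toy counts, not the full Katz estimator):

import collections

def make_katz(bigram_counts, unigram_counts, D=0.1):
    context_total = collections.Counter()
    for (u, w), c in bigram_counts.items():
        context_total[u] += c
    N = sum(unigram_counts.values())

    def p_uni(w):
        return unigram_counts[w] / N

    def p_katz(w, u):
        if bigram_counts.get((u, w), 0) > 0:
            return (bigram_counts[(u, w)] - D) / context_total[u]
        seen = {v for (c, v) in bigram_counts if c == u}
        alpha = D * len(seen) / context_total[u]           # discounted mass for context u
        unseen_mass = sum(p_uni(v) for v in unigram_counts if v not in seen)
        return alpha * p_uni(w) / unseen_mass              # share alpha by unigram prob
    return p_katz

# Hypothetical counts for illustration only.
bigrams = {("alleged", "impropriety"): 8, ("alleged", "offense"): 12}
unigrams = {"impropriety": 8, "offense": 12, "infirmity": 3, "alleged": 20}
p = make_katz(bigrams, unigrams)
print(p("impropriety", "alleged"))   # observed: (8 - 0.1) / 20
print(p("infirmity", "alleged"))     # unseen: alpha * P(infirmity) / (P(infirmity) + P(alleged))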
Issues with Katz Backoff
• I can’t see without my reading ___
• C(reading, glasses) = C(reading, Francisco) = 0
• C(Francisco) > C(glasses)
• Katz backoff will give higher probability to
Francisco
Kneser-Ney Smoothing
• Redistribute probability mass based on the versatility of the
lower order n-gram
• AKA “continuation probability”
• What is versatility?
‣ High versatility -> co-occurs with a lot of unique words,
e.g. glasses
– men’s glasses, black glasses, buy glasses, etc
‣ Low versatility -> co-occurs with few unique words, e.g.
francisco
– san francisco
Kneser-Ney Smoothing
PKN(wi | wi−1) =
    (C(wi−1, wi) − D) / C(wi−1),   if C(wi−1, wi) > 0
    β(wi−1) Pcont(wi),   otherwise

where β(wi−1) is the amount of probability mass that has been discounted for context wi−1, and

Pcont(wi) = |{wi−1 : C(wi−1, wi) > 0}| / Σ_{wi} |{wi−1 : C(wi−1, wi) > 0}|

• Intuitively the numerator of Pcont counts the number of unique wi−1 that co-occur with wi
• High continuation counts for glasses
• Low continuation counts for Francisco
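A small sketch of the continuation probability (the set of bigram types is invented):

import collections

# Hypothetical set of observed bigram types (left word, right word).
bigram_types = {("men's", "glasses"), ("black", "glasses"), ("buy", "glasses"),
                ("reading", "glasses"), ("san", "francisco")}

# Number of unique left contexts each word appears with.
continuation = collections.Counter(w for (_, w) in bigram_types)
total = sum(continuation.values())

def p_cont(w):
    return continuation[w] / total

print(p_cont("glasses"))     # 4/5: versatile, seen after many different words
print(p_cont("francisco"))   # 1/5: only ever seen after "san"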
Interpolation
• A better way to combine different orders of n-gram models
• Weighted sum of probabilities across progressively shorter
contexts
• Interpolated trigram model:
PIN(wi | wi−2, wi−1) = λ3 P3(wi | wi−2, wi−1) + λ2 P2(wi | wi−1) + λ1 P1(wi)
where λ1 + λ2 + λ3 = 1, and the λs are learned on held-out data
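A one-line sketch of the weighted sum (component probabilities and λ values invented; in practice the λs are tuned on held-out data):

# Hypothetical component estimates for a single prediction.
p3, p2, p1 = 0.002, 0.010, 0.0001     # trigram, bigram, unigram
lam3, lam2, lam1 = 0.6, 0.3, 0.1      # must sum to 1

p_interp = lam3 * p3 + lam2 * p2 + lam1 * p1
print(p_interp)    # 0.6*0.002 + 0.3*0.010 + 0.1*0.0001 = 0.00421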
Interpolated Kneser-Ney Smoothing
• Interpolation instead of back-off
PIKN(wi | wi−1) = (C(wi−1, wi) − D) / C(wi−1) + γ(wi−1) Pcont(wi)

• where γ(wi−1) is a normalising constant such that Σ_{wi ∈ V} PIKN(wi | wi−1) = 1.0
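A compact sketch of interpolated Kneser-Ney for bigrams (assumptions: a toy corpus, one fixed discount D, and the usual max(count − D, 0) clamp):

import collections

corpus = [["<s>", "san", "francisco", "has", "reading", "glasses", "</s>"],
          ["<s>", "buy", "reading", "glasses", "in", "san", "francisco", "</s>"]]

bigram = collections.Counter(p for sent in corpus for p in zip(sent, sent[1:]))
context = collections.Counter()
for (u, w), c in bigram.items():
    context[u] += c
continuation = collections.Counter(w for (_, w) in bigram)  # unique left contexts per word
total_types = len(bigram)                                   # total number of bigram types
D = 0.75

def p_ikn(w, u):
    p_cont = continuation[w] / total_types
    distinct_after_u = sum(1 for (c, _) in bigram if c == u)
    gamma = D * distinct_after_u / context[u]     # normaliser for context u
    return max(bigram[(u, w)] - D, 0) / context[u] + gamma * p_cont

print(p_ikn("francisco", "san"))   # observed bigram plus continuation term
print(p_ikn("glasses", "san"))     # unseen bigram: continuation term only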
In Practice
• Commonly used Kneser-Ney language models
use 5-grams as max order
• Different discount values are used for each n-gram order
How can we use an n-gram language model to fix spelling errors?
PollEv.com/jeyhanlau569
• Build a character-level LM and fix it character by character
• Build a word-level LM and use left/right context words to find the most appropriate substituting word
• Build a LM using common typos as training data
• N-gram LM can't fix spelling errors
Generating Language
Generation
• Given an initial word, draw the next word according to
the probability distribution produced by the language
model.
• Include (n−1) <s> tokens for an n-gram model, to provide context to generate the first word
‣ never generate <s>
‣ generating </s> terminates the sequence
Generation (Bigram LM)
Starting from <s>, repeatedly pick the next word from the bigram distribution conditioned on the most recent word (each step on the original slides shows a bar chart over the candidate words; the exact probabilities are omitted here):
• Sentence = <s>;  P(? | <s>) → "a"
• Sentence = <s> a;  P(? | a) → "cow"
• Sentence = <s> a cow;  P(? | cow) → "eats"
• Sentence = <s> a cow eats;  P(? | eats) → "grass"
• Sentence = <s> a cow eats grass;  P(? | grass) → "</s>"
• Sentence = <s> a cow eats grass </s>. Done!
How to Select Next Word?
• Argmax: takes highest probability word each turn
‣ Greedy search
• Beam search decoding (lecture 17):
‣ Keeps track of top-N highest probability words
each turn
‣ Selects the sequence of words that produces the best sentence probability
• Sampling: randomly draws the next word from the distribution
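A small sketch contrasting argmax and sampling on a bigram LM (the probability table is invented to loosely match the a/cow/eats/grass walkthrough above):

import random

next_word = {
    "<s>":   {"a": 0.5, "cow": 0.2, "eats": 0.2, "grass": 0.1},
    "a":     {"cow": 0.7, "grass": 0.2, "eats": 0.1},
    "cow":   {"eats": 0.6, "grass": 0.3, "a": 0.1},
    "eats":  {"grass": 0.8, "a": 0.2},
    "grass": {"</s>": 0.6, "a": 0.4},
}

def generate(greedy=True, max_len=10):
    sent, word = [], "<s>"
    while len(sent) < max_len:
        dist = next_word[word]
        if greedy:                                   # argmax each turn
            word = max(dist, key=dist.get)
        else:                                        # sample from the distribution
            word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        sent.append(word)
    return " ".join(sent)

print(generate(greedy=True))    # a cow eats grass
print(generate(greedy=False))   # varies from run to run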
A Final Word
• N-gram language models are a simple but
effective way to capture the predictability of
language
• Can be trained in an unsupervised fashion,
scalable to large corpora
• Require smoothing to be effective
• Modern language models use neural networks (lecture 8)
Reading
• E18 Chapter 6 (skip 6.3)