LECTURE 4
Advanced Language Models: Generalisation and Zeros
Arkaitz Zubiaga, 22nd January, 2018
2
Generalisation of Language Models and Zeros
Smoothing Approaches for Generalisation
● Laplace smoothing.
● Interpolation, backoff and web-scale LMs.
● Good-Turing Smoothing.
● Kneser-Ney Smoothing.
LECTURE 4: CONTENTS
3
Language models: P(I, am, fine) → high
P(fine, am, I) → low or 0
Remember: language models work best on data that looks similar to
their training data.
If we train on social media data, that may not work for novels.
OMG I’m ROTFL! → may be frequent on social media, unlikely in
novels!
GENERALISATION OF LANGUAGE MODELS
4
Limitation: We’re assuming that all n-grams in new, unseen data
will have been observed in the training data.
Is this the reality though?
We need to consider the possibility of new n-grams in unseen
data.
GENERALISATION OF LANGUAGE MODELS
5
Choose a random starting bigram
(<s>, w) based on its probability.
While x != </s>:
Choose the next random bigram
(w, x) based on its probability.
Concatenate all bigrams.
THE SHANNON VISUALISATION METHOD
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
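A minimal sketch of this sampling loop in Python (the `bigram_probs` table and the `<s>`/`</s>` marker names are illustrative choices, not from the slide):

```python
import random

def shannon_sample(bigram_probs, max_len=50):
    """Generate a sentence by repeatedly sampling the next word from
    P(w | previous word), starting at <s> and stopping at </s>."""
    sentence = []
    prev = "<s>"
    while len(sentence) < max_len:
        words, probs = zip(*bigram_probs[prev].items())
        nxt = random.choices(words, weights=probs)[0]
        if nxt == "</s>":
            break
        sentence.append(nxt)
        prev = nxt
    return " ".join(sentence)

# Toy distribution consistent with the example above (illustrative probabilities).
bigram_probs = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}
print(shannon_sample(bigram_probs))  # -> "I want to eat Chinese food"
```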
6
SHANNON VIS. METHOD FOR SHAKESPEARE
7
Significant improvement as we train on longer n-grams.
But it’s a tradeoff:
Longer n-grams → more accurate.
Longer n-grams → more sparse.
We need to find a balance.
SHANNON VIS. METHOD FOR SHAKESPEARE
8
Shakespeare used:
N = 884,647 tokens.
V = 29,066 types → V² ≈ 845M possible bigrams!
Shakespeare only produced ~300,000 bigram types (0.04% of all
possible bigrams!)
Other bigrams may be possible, but we haven’t observed
them.
SHAKESPEARE’S BIGRAMS
9
Training set:
… found a penny
… found a solution
… found a tenner
… found a book

Test set:
… found a fiver
… found a card

P(card | found a) = 0
ZEROS
10
We have sparse statistics:
P(w | “found a”)
3 → penny
2 → solution
1 → tenner
1 → book
7 → total
THE INTUITION OF SMOOTHING
[Bar chart: observed counts of words following “found a” — penny, solution, tenner, book; fiver, card and other unseen words get zero.]
11
We’d like to improve the distribution:
P(w | “found a”)
3 → penny → 2.5
2 → solution → 1.5
1 → tenner → 0.5
1 → book → 0.5
other → 2
7 → total
THE INTUITION OF SMOOTHING
[Bar chart: the smoothed distribution shifts some probability mass to previously unseen words (e.g. fiver, card).]
12
We’ll see 4 possible solutions:
1) Laplace smoothing.
2) Interpolation, backoff and web-scale LMs.
3) Good-Turing Smoothing.
4) Kneser-Ney Smoothing.
FOUR POSSIBLE SOLUTIONS
LAPLACE SMOOTHING
14
Pretend we have seen each word one more time than we
actually did.
MLE estimate:
Add-one estimate:
LAPLACE SMOOTHING: ADD-ONE ESTIMATION
$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
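As a quick illustration, a minimal add-one bigram estimator in Python (the counter names and the `<s>`/`</s>` padding are choices for this sketch):

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_add1(w_prev, w, unigrams, bigrams):
    """P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    V = len(unigrams)  # vocabulary size in word types (includes <s>, </s> here)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

unigrams, bigrams = train_bigram_counts([["I", "am", "fine"], ["I", "am", "happy"]])
print(p_add1("am", "fine", unigrams, bigrams))   # seen bigram, smoothed
print(p_add1("am", "tired", unigrams, bigrams))  # unseen bigram, still non-zero
```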
15
Add 1 to ALL values in the matrix.
LAPLACE SMOOTHED BIGRAM COUNTS
16
LAPLACE SMOOTHED BIGRAM PROBABILITIES
17
RECONSTITUTED COUNTS
18
Original:
Smoothed:
COMPARING ORIGINAL VS SMOOTHED
19
Original:
Smoothed:
COMPARING ORIGINAL VS SMOOTHED
some counts end up divided by 10!
20
MORE GENERAL LAPLACE SMOOTHING: ADD-k
$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$
21
It’s generally not a good solution for language models.
With such a sparse matrix, we’re replacing too many 0’s with
1’s.
It’s still often used for other NLP tasks with fewer 0’s,
e.g. text classification using unigrams.
LAPLACE SMOOTHING IS NOT IDEAL FOR LMs
INTERPOLATION, BACKOFF AND
WEB-SCALE LMs
23
For rare cases → you may choose to use less context.
Backoff:
use a trigram if you have good evidence,
otherwise a bigram, otherwise a unigram.
Interpolation:
mix unigram, bigram and trigram.
Interpolation works better.
BACKOFF AND INTERPOLATION
24
Simple interpolation:
Lambdas conditional on context:
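The two formulas were shown as images on the original slides; the standard trigram formulation (as in Jurafsky & Martin) is:

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \sum_i \lambda_i = 1$$

and, with lambdas conditional on the context,

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n)$$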
LINEAR INTERPOLATION
25
Use a held-out (or development) corpus:
Fix the n-gram probabilities (on the training data).
Then search for the λs that give the largest probability to the held-out set:
HOW DO WE SET THOSE LAMBDAS?
$$\log P(w_1 \ldots w_n \mid M(\lambda_1 \ldots \lambda_k)) = \sum_i \log P_{M(\lambda_1 \ldots \lambda_k)}(w_i \mid w_{i-1})$$
[Figure: the corpus is split into Training Data, Held-Out Data and Test Data.]
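A minimal sketch of that search, assuming bigram and unigram estimates already fixed on the training split (a simple grid over a single λ; EM is another common choice — the function names and grid are illustrative):

```python
import math

def heldout_logprob(heldout_bigrams, p_bigram, p_unigram, lam):
    """Log-probability of held-out (w_prev, w) pairs under the interpolated model
    P(w | w_prev) = lam * P_bigram(w | w_prev) + (1 - lam) * P_unigram(w)."""
    total = 0.0
    for w_prev, w in heldout_bigrams:
        p = lam * p_bigram(w_prev, w) + (1 - lam) * p_unigram(w)
        total += math.log(p)  # assumes p_unigram is itself smoothed, so p > 0
    return total

def best_lambda(heldout_bigrams, p_bigram, p_unigram):
    """Grid-search for the lambda that maximises held-out log-probability."""
    grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda lam: heldout_logprob(
        heldout_bigrams, p_bigram, p_unigram, lam))
```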
26
Two scenarios:
1) We know ALL words (including those in the test set) in advance.
We have a fixed vocabulary V – closed vocabulary tasks.
2) We may find new, unseen words (generally the case).
Out-Of-Vocabulary (OOV) words – open vocabulary tasks.
OPEN VS CLOSED VOCABULARY TASKS
27
We can use a new token <UNK> for all unknown words.
How to train:
Create a fixed lexicon L of size V.
While preprocessing the text, change words not in L to <UNK>.
Train the LM, treating <UNK> like any other word.
In test: new words are assigned the probability of <UNK>.
HOW DO WE DEAL WITH OOV WORDS?
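A minimal preprocessing sketch (the count-based threshold is one common way to build the lexicon L; it is an assumption of this sketch, not prescribed by the slide):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(sentences, min_count=2):
    """Keep words seen at least min_count times; everything else becomes <UNK>."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(sentences, lexicon):
    """Map out-of-lexicon words to <UNK> before training the LM."""
    return [[w if w in lexicon else UNK for w in sent] for sent in sentences]

train = [["i", "found", "a", "penny"], ["i", "found", "a", "book"], ["i", "found", "a", "tenner"]]
lexicon = build_lexicon(train)
print(replace_oov(train, lexicon))  # rare words appear as <UNK>
```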
28
“Stupid backoff” (Brants et al. 2007)
No discounting, just use relative frequencies.
SMOOTHING FOR VERY LARGE CORPORA
$$S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\[6pt] 0.4\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases}$$

$$S(w_i) = \frac{\mathrm{count}(w_i)}{N}$$
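A minimal recursive sketch of stupid backoff, assuming a single dict of n-gram counts keyed by word tuples, with the total token count N stored under the empty tuple (a representation chosen for this sketch):

```python
def stupid_backoff(ngram_counts, words, i, k, alpha=0.4):
    """Score S(w_i | w_{i-k+1} .. w_{i-1}) following the formula above.
    ngram_counts maps word tuples of any order to counts; ngram_counts[()] = N."""
    if k == 1:
        return ngram_counts.get((words[i],), 0) / ngram_counts[()]  # unigram relative frequency
    history = tuple(words[i - k + 1:i])
    full = history + (words[i],)
    if ngram_counts.get(full, 0) > 0:
        return ngram_counts[full] / ngram_counts[history]
    return alpha * stupid_backoff(ngram_counts, words, i, k - 1, alpha)  # back off one order
```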
GOOD TURING SMOOTHING
30
To estimate counts of unseen things
→ use the count of things we’ve seen once.
INTUITION OF GOOD TURING SMOOTHING
31
N_c = the number of word types observed c times (the frequency of frequency c).
e.g. I am happy and I am fne and I have to go
I → 3
am → 2
and → 2
happy → 1
fine → 1
have → 1
to → 1
go → 1
FREQUENCY OF FREQUENCY “C”
N_1 = 5, N_2 = 2, N_3 = 1
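These frequency-of-frequency counts are easy to compute; a quick check of the example above:

```python
from collections import Counter

tokens = "I am happy and I am fine and I have to go".split()
word_counts = Counter(tokens)                 # c(w) for each word type
freq_of_freq = Counter(word_counts.values())  # N_c = number of types seen c times
print(freq_of_freq)                           # Counter({1: 5, 2: 2, 3: 1})
```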
32
You are fishing, and have caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish is a trout?
1/18
INTUITION OF GOOD TURING SMOOTHING
33
You are fishing, and have caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish is new, e.g. a haddock?
Use our estimate of things-we-saw-once → 3/18
OK, but then we need to revise all the other probabilities.
INTUITION OF GOOD TURING SMOOTHING
34
GOOD TURING CALCULATIONS
Good-Turing estimates, where:

$$P^*_{GT}(\text{unseen}) = \frac{N_1}{N}, \qquad c^* = \frac{(c+1)\,N_{c+1}}{N_c}$$

You are fishing, and have caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
P*_GT(next fish is new) = N_1 / N = 3/18
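Continuing the same example with a worked step not shown on the slide: the revised estimate for trout (seen once) follows from the c* formula,

$$c^*(\text{trout}) = (1+1)\,\frac{N_2}{N_1} = 2 \cdot \frac{1}{3} = \frac{2}{3}, \qquad P^*_{GT}(\text{trout}) = \frac{c^*}{N} = \frac{2/3}{18} \approx 0.037$$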
35
Problem: what about the word “and”? (say c = 12,156)
● For small c, N_c > N_{c+1}.
● For large c, N_{12157} will likely be zero.
GOOD TURING ISSUES
[Plot of N_c against c: N_1, N_2, N_3 are populated, but N_c is zero for most large c (???).]
36
Solution: avoid zeros for large values of c.
● Simple Good-Turing [Gale and Sampson]:
replace N_c with a best-fit power law for unreliable counts.
GOOD TURING ISSUES
37
Corpus with 22 million words of AP Newswire
GOOD TURING NUMBERS
$$c^* = \frac{(c+1)\,N_{c+1}}{N_c}$$
Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
KNESER-NEY SMOOTHING
39
Corpus with 22 million words of AP Newswire
For c >= 2:
● c* ≈ (c – .75)
GOOD TURING NUMBERS
$$c^* = \frac{(c+1)\,N_{c+1}}{N_c}$$
Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
40
Save ourselves some time and just subtract 0.75 (or some d)!
But should we really just use the regular unigram P(w)?
ABSOLUTE DISCOUNTING INTERPOLATION
$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \underbrace{\frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})}}_{\text{discounted bigram}} + \underbrace{\lambda(w_{i-1})}_{\text{interpolation weight}} \underbrace{P(w)}_{\text{unigram}}$$
41
A better estimate for the probabilities of lower-order unigrams!
● Shannon game: I can’t see without my reading _______?
● We have never seen “reading glasses” in the training data.
● P(“Kingdom”) > P(“glasses”)
→ but “Kingdom” always comes after “United”
KNESER-NEY SMOOTHING
42
A better estimate for the probabilities of lower-order unigrams!
● Shannon game: I can’t see without my reading _______?
Instead of P(w): “How likely is w?”
● P_continuation(w): “How likely is w to appear as a novel
continuation?”
For each unigram, count the number of bigram types it
completes → i.e. is it always paired with one specific word, or does it
appear in many different bigrams?
KNESER-NEY SMOOTHING
43
How many times does w appear as a novel continuation?
Normalised by the total number of bigram types:
KNESER-NEY SMOOTHING
$$P_{\text{CONTINUATION}}(w) \propto \left|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\right|$$

$$P_{\text{CONTINUATION}}(w) = \frac{\left|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\right|}{\left|\{\, (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \,\}\right|}$$
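A minimal sketch of this computation (the function name is illustrative); it reproduces the 3/5 and 1/5 figures on the next few slides:

```python
from collections import defaultdict

def continuation_probs(bigram_types):
    """P_continuation(w) = (# distinct left contexts of w) / (# distinct bigram types).
    bigram_types: iterable of (w_prev, w) pairs observed at least once."""
    bigram_types = set(bigram_types)
    left_contexts = defaultdict(set)
    for w_prev, w in bigram_types:
        left_contexts[w].add(w_prev)
    total_types = len(bigram_types)
    return {w: len(ctx) / total_types for w, ctx in left_contexts.items()}

# The "I am / We are / You are / They are / She is" example from the next slides:
pairs = [("I", "am"), ("We", "are"), ("You", "are"), ("They", "are"), ("She", "is")]
print(continuation_probs(pairs))  # {'am': 0.2, 'are': 0.6, 'is': 0.2}
```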
44
In other words, P_continuation is:
● the number of word types seen to precede w,
● normalised by the total number of bigram types.

I am
We are
You are
They are
She is
KNESER-NEY SMOOTHING
45
In other words, P_continuation is:
● the number of word types seen to precede w,
● normalised by the total number of bigram types.

I am
We are
You are      →  P_continuation(“are”) = 3/5
They are
She is
KNESER-NEY SMOOTHING
46
In other words, P_continuation is:
● the number of word types seen to precede w,
● normalised by the total number of bigram types.

I am         →  P_continuation(“am”) = 1/5
We are
You are
They are
She is
KNESER-NEY SMOOTHING
47
In other words, P_continuation is:
● the number of word types seen to precede w,
● normalised by the total number of bigram types.

I am
We are
You are
They are
She is       →  P_continuation(“is”) = 1/5
KNESER-NEY SMOOTHING
48
KNESER-NEY SMOOTHING
λ is a normalising constant: the probability mass we’ve discounted.

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{\, w : c(w_{i-1}, w) > 0 \,\}\right|$$

Here d / c(w_{i-1}) is the normalised discount, and |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1}
= the number of word types we discounted
= the number of times we applied the normalised discount.
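A compact sketch of interpolated Kneser-Ney for bigrams, matching the formula above (the fixed d = 0.75, the `<s>`/`</s>` padding and all names are choices for this sketch; it also assumes the history word was seen in training):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, d=0.75):
    """Train an interpolated Kneser-Ney bigram model and return a function
    p(w, w_prev) computing P_KN(w | w_prev) as in the formula above."""
    bigrams = Counter()                # c(w_prev, w)
    history_totals = Counter()         # c(w_prev)
    left_contexts = defaultdict(set)   # distinct word types preceding w
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w_prev, w in zip(tokens, tokens[1:]):
            bigrams[(w_prev, w)] += 1
            history_totals[w_prev] += 1
            left_contexts[w].add(w_prev)

    total_bigram_types = len(bigrams)
    # |{w : c(w_prev, w) > 0}|: number of distinct word types following w_prev
    followers = Counter(w_prev for (w_prev, _w) in bigrams)

    def p(w, w_prev):
        p_cont = len(left_contexts.get(w, ())) / total_bigram_types
        lam = (d / history_totals[w_prev]) * followers[w_prev]
        discounted = max(bigrams[(w_prev, w)] - d, 0) / history_totals[w_prev]
        return discounted + lam * p_cont

    return p
```

For any seen history, the discounted terms and λ(w_prev) · P_continuation sum to 1 over the vocabulary, so the sketch defines a proper conditional distribution.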
49
Laplace smoothing:
● OK for text classification, not for language modelling.
Kneser-Ney is the most widely used.
Stupid backoff for very large collections (e.g. the Web).
SUMMARY
50
RESOURCES
KenLM Language Model Toolkit:
https://kheafield.com/code/kenlm/
CMU Statistical Language Modeling Toolkit:
http://www.speech.cs.cmu.edu/SLM/toolkit.html
Google Books N-gram Counts:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
51
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. 3rd edition. Chapters
4.3–4.6.