Language Modeling
Lizhen Qu
Recap
Sentence Splitting
Tokenization
Stemming
POS Tagging
Overview of the NLP Lectures
§ Introduction to natural language processing (NLP).
§ Regular expressions, sentence splitting,
tokenization, part-of-speech tagging.
§ Language models.
§ Vector semantics.
§ Parsing.
§ Compositional semantics.
Language Models
§ Goal: assign a probability to a word sequence.
– Speech recognition:
• P(I ate a cherry) > P(Eye eight Jerry)
– Spelling correction:
• P(Australian National University) > P(Australian National Univerisity)
– Collocation error correction:
• P(high wind) > P(large wind)
– Machine Translation:
• P(The magic of Usain Bolt on show…) > P(The magic of Usain Bolt
at the show …)
– Question-answering, summarization, etc.
Probabilistic Language Modeling
§ A language model computes the probability of a
sequence of words.
– A vocabulary V.
– p(x_1, x_2, \dots, x_l) \ge 0
– \sum_{(x_1, x_2, \dots, x_l) \in V^*} p(x_1, x_2, \dots, x_l) = 1
§ Related task: probability of an upcoming word.
– p(x_4 | x_1, x_2, x_3)
§ LM: either p(x_1, x_2, \dots, x_l) or p(x_4 | x_1, x_2, x_3).
How to Compute p(x_1, x_2, \dots, x_l)
§ Apply chain rule:
P(x_1, x_2, \dots, x_l) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) \dots P(x_l | x_1, \dots, x_{l-1})
§ Compute:
P(All Blacks’ hotel room bugged) = P(All) P(Blacks | All) P(’ | All Blacks)
… P(bugged | All Blacks’ hotel room)
Estimate the Probabilities
§ P(bugged | All Blacks’ hotel room) =
Count(All Blacks’ hotel room bugged) / Count(All Blacks’ hotel room)
§ Not enough data!
Markov Assumption
§ Simplification:
– P(bugged | All Blacks’ hotel room) = P(bugged | room)
– or P(bugged | All Blacks’ hotel room) = P(bugged | hotel room)
§ First-order Markov assumption:
P(x_1, x_2, \dots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i | x_1, \dots, x_{i-1})
                        \approx P(x_1) \prod_{i=2}^{l} P(x_i | x_{i-1})
Unigram Model
§ Zero-order Markov assumption.
§ Examples generated from a unigram model.
P(x_1, x_2, \dots, x_l) = \prod_{i=1}^{l} P(x_i)
Months the my and issue of year foreign new exchange’s
september were recession exchange new endorsed a acquire
to six executives
Bigram Model
§ First-order assumption.
§ P(I want to eat Chinese food.) = ?
§ Estimate bigram probabilities from a training corpus.
P(x_1, x_2, \dots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i | x_{i-1})
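To make the counting step behind such a model concrete, here is a minimal Python sketch (the tiny corpus and all variable names are my own, not the lecture's) that collects the unigram and bigram counts a bigram model needs; the restaurant-corpus table a few slides below is exactly this kind of count data.

# Minimal sketch: collect the unigram and bigram counts a bigram model needs.
# The toy corpus below is an illustrative stand-in, not the lecture's corpus.
from collections import Counter

corpus = [
    ["i", "want", "to", "eat", "chinese", "food"],
    ["i", "want", "chinese", "food"],
    ["i", "want", "to", "eat", "lunch"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))  # consecutive word pairs

print(unigram_counts["want"])        # 3
print(bigram_counts[("i", "want")])  # 3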
Trigram Models
§ Second-order Markov assumption:
P(x_1, x_2, \dots, x_l) = P(x_1) P(x_2 | x_1) \prod_{i=3}^{l} P(x_i | x_{i-1}, x_{i-2})
§ But n-grams still miss long-distance dependencies of language:
– “The iPhone which I bought one week ago does not stand the cold.”
§ We can extend to 4-grams, 5-grams, …
Restaurant Corpus
Bigram counts, count(row word, column word):

            I      want   to     eat    Chinese  food   lunch
  I         8      1087   0      13     0        0      0
  want      3      0      786    0      6        8      6
  to        3      0      10     860    3        0      12
  eat       0      0      2      0      19       2      52
  Chinese   2      0      0      0      0        120    0
  food      19     0      17     0      0        0      0
  lunch     4      0      0      0      0        1      0

Unigram counts:

  I         want   to     eat    Chinese  food   lunch
  3437      1215   3256   938    213      1506   459

Total number of word tokens: 11024
Compute Bigram Probabilities
§ Maximum likelihood estimation:
P(x_i | x_{i-1}) = count(x_{i-1}, x_i) / count(x_{i-1})
§ Bigram probabilities:
– P(want | I) = 1087 / 3437 = 0.32
§ Log probabilities:
– log P(I want to eat Chinese food) = log P(I) + log P(want | I) + log P(to | want) + log P(eat | to) + log P(Chinese | eat) + log P(food | Chinese)
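A minimal sketch of the maximum likelihood estimate and the log-probability sum above, assuming count dictionaries like the ones built in the earlier sketch; how the first factor P(I) is estimated (plain unigram vs. a start-of-sentence symbol) is a choice the slide leaves implicit, and the unigram version is used here.

# Minimal sketch of the MLE bigram estimate P(x_i | x_{i-1}) = count(x_{i-1}, x_i) / count(x_{i-1})
# and of summing log probabilities over a sentence. Assumes count dictionaries like the
# ones built in the earlier sketch; the first word is scored with its unigram probability.
import math

def bigram_prob(prev, word, bigram_counts, unigram_counts):
    # With the restaurant-corpus counts: bigram_prob("I", "want", ...) = 1087 / 3437 ≈ 0.32
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

def sentence_logprob(words, bigram_counts, unigram_counts, total_tokens):
    logp = math.log(unigram_counts[words[0]] / total_tokens)  # log P(first word)
    for prev, word in zip(words, words[1:]):
        p = bigram_prob(prev, word, bigram_counts, unigram_counts)
        logp += math.log(p)  # math.log(0) fails for unseen bigrams: the zero-count problem below
    return logp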
Sequence Generation
§ Compute conditional probabilities.
– P(want | I) = 0.32
– P(to | I) = 0
– P(eat | I) = 0.004
– P(Chinese | I) = 0
– P(I | I) = 0.002
– …
§ Generate a random number in [0,1].
§ See which region it falls into.
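The two bullets above describe sampling from the cumulative bigram distribution; a minimal sketch, with function and variable names of my own:

# Minimal sketch of sampling the next word from a bigram model: lay the conditional
# probabilities P(word | prev) out as consecutive regions of [0, 1], draw a uniform
# random number, and return the word whose region it falls into.
import random

def sample_next(prev, bigram_counts, unigram_counts):
    r = random.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    candidates = [(w2, c) for (w1, w2), c in bigram_counts.items() if w1 == prev]
    for word, count in candidates:
        cumulative += count / unigram_counts[prev]  # MLE P(word | prev)
        if r < cumulative:
            return word
    # Guard against rounding error or a context with no observed continuation.
    return candidates[-1][0] if candidates else None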
Approximating Shakespeare
§ Generate sentences from a unigram model.
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
§ Generate sentences from a bigram model.
– What means, sir. I confess she? then all sorts, he is trim,
captain.
– Why dost stand forth thy canopy, forsooth; he is this
palpable hit the King Henry.
§ Generate sentences from a trigram model.
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it
empty.
The Perils of Overfitting
§ P(I want to eat Chinese lunch.) = ? when
count(Chinese lunch) = 0.
§ In real life, the test corpus is often different from the
training corpus.
§ Unknown words!
Generalization by avoiding zeros!
Interpolation
§ Key idea: mix of lower-order n-gram probabilities.
§ For bigram model:
\hat{P}(x_i | x_{i-1}) = \lambda_1 P(x_i | x_{i-1}) + \lambda_2 P(x_i),
with \lambda_1 \ge 0, \lambda_2 \ge 0 and \lambda_1 + \lambda_2 = 1.
§ For trigram model:
\hat{P}(x_i | x_{i-1}, x_{i-2}) = \lambda_1 P(x_i | x_{i-1}, x_{i-2}) + \lambda_2 P(x_i | x_{i-1}) + \lambda_3 P(x_i),
with \lambda_1 \ge 0, \lambda_2 \ge 0, \lambda_3 \ge 0 and \sum_i \lambda_i = 1.
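A minimal sketch of linear interpolation for a trigram model; the component probability functions and the example lambda values are illustrative assumptions, not values from the lecture.

# Minimal sketch of interpolation: mix trigram, bigram and unigram estimates with
# non-negative weights that sum to one. The component probability functions are
# assumed to exist (e.g. MLE estimates as in the earlier sketches).
def interpolated_trigram_prob(word, prev2, prev1, p_trigram, p_bigram, p_unigram,
                              lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(sum(lambdas) - 1.0) < 1e-9
    return (l1 * p_trigram(word, prev2, prev1)
            + l2 * p_bigram(word, prev1)
            + l3 * p_unigram(word))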
How to Set the Lambdas?
§ Estimate \lambda_i on the held-out data.
[Corpus split: training data | held-out data | test data]
§ One simple estimation (Collins et al.):
\lambda_1 = count(x_{i-1}, x_{i-2}) / (count(x_{i-1}, x_{i-2}) + \gamma)
\lambda_2 = (1 - \lambda_1) \times count(x_{i-1}) / (count(x_{i-1}) + \gamma)
\lambda_3 = 1 - \lambda_1 - \lambda_2
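A sketch of the count-based recipe above; gamma is a hyper-parameter tuned on the held-out data, and the default value below is only a placeholder of mine.

# Minimal sketch of the count-based lambda recipe: the weights depend on how often the
# conditioning context was seen, so frequent contexts trust the higher-order estimate
# more. gamma is tuned on held-out data; 5.0 is just a placeholder value.
def context_lambdas(prev2, prev1, bigram_counts, unigram_counts, gamma=5.0):
    c_context_bigram = bigram_counts.get((prev2, prev1), 0)  # count of the trigram context
    c_context_unigram = unigram_counts.get(prev1, 0)         # count of the bigram context
    lam1 = c_context_bigram / (c_context_bigram + gamma)
    lam2 = (1 - lam1) * c_context_unigram / (c_context_unigram + gamma)
    lam3 = 1 - lam1 - lam2
    return lam1, lam2, lam3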
Absolute Discounting Interpolation
§ Absolute discounting:
P_{AbsDiscount}(x_i | x_{i-1}) = (count(x_{i-1}, x_i) - d) / count(x_{i-1}) + \lambda(x_{i-1}) P(x_i)
§ Often use d = 0.75 in practice.
§ Is it sufficient to use the plain unigram P(x_i)?
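A minimal sketch of absolute discounting for bigrams under the formula above, with the max(·, 0) guard written out as on the later Kneser-Ney slide; data structures follow the earlier sketches.

# Minimal sketch of absolute discounting: subtract a fixed discount d from every observed
# bigram count and give the saved mass to an interpolated unigram term. lambda_weight is
# the mass removed by the discounting, so the result still sums to one over the vocabulary.
def abs_discount_prob(prev, word, bigram_counts, unigram_counts, total_tokens, d=0.75):
    c_prev = unigram_counts[prev]
    discounted = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    n_types = sum(1 for (w1, _), c in bigram_counts.items() if w1 == prev and c > 0)
    lambda_weight = d * n_types / c_prev
    p_unigram = unigram_counts[word] / total_tokens
    return discounted + lambda_weight * p_unigram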
Kneser-Ney Smoothing (i)
§ Better estimate for probabilities of lower-order unigrams!
– Shannon game: I can’t see without my reading ___________?
– “Francisco” is more common than “glasses”
– … but “Francisco” always follows “San”.
§ P_continuation(x): How likely is a word to appear as a
novel continuation?
– For each word x, count the number of bigram types it
completes:
P_{continuation}(x_i) \propto |{x_{i-1} : count(x_{i-1}, x_i) > 0}|
Kneser-Ney Smoothing (ii)
§ Example:
|{x_{i-1} : count(x_{i-1}, Francisco) > 0}| << |{x_{i-1} : count(x_{i-1}, glasses) > 0}|
§ Normalized by the total number of bigram types:
P_{continuation}(x_i) = |{x_{i-1} : count(x_{i-1}, x_i) > 0}| / |{(x_{j-1}, x_j) : count(x_{j-1}, x_j) > 0}|
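A minimal sketch of the continuation probability, assuming a dictionary that stores only observed (non-zero) bigrams, as in the earlier sketches:

# Minimal sketch of the continuation probability: for each word, count how many distinct
# left contexts it follows (bigram types it completes), normalised by the total number of
# bigram types in the corpus. Assumes bigram_counts only contains observed bigrams.
def continuation_prob(word, bigram_counts):
    completes = sum(1 for (_, w2) in bigram_counts if w2 == word)  # |{x_{i-1} : count(x_{i-1}, word) > 0}|
    total_bigram_types = len(bigram_counts)                        # |{(x_{j-1}, x_j) : count > 0}|
    return completes / total_bigram_types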
Kneser-Ney Smoothing (iv)
§ Definition for bigrams:
P_{KN}(x_i | x_{i-1}) = max(count(x_{i-1}, x_i) - d, 0) / count(x_{i-1}) + \lambda(x_{i-1}) P_{continuation}(x_i)
where
\lambda(x_{i-1}) = (d / count(x_{i-1})) \cdot |{x : count(x_{i-1}, x) > 0}|
https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
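Putting the pieces together, a sketch of the bigram Kneser-Ney estimate above, again assuming a dictionary of observed bigram counts and d = 0.75 as suggested on the absolute-discounting slide:

# Minimal sketch of bigram Kneser-Ney: a discounted MLE term plus the left-over mass
# lambda(prev), spread according to the continuation probability rather than the unigram.
def kneser_ney_prob(prev, word, bigram_counts, unigram_counts, d=0.75):
    c_prev = unigram_counts[prev]
    discounted = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    n_types_after_prev = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = d * n_types_after_prev / c_prev                 # left-over probability mass
    completes = sum(1 for (_, w2) in bigram_counts if w2 == word)
    p_cont = completes / len(bigram_counts)               # continuation probability
    return discounted + lam * p_cont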
Evaluation of Language Models
§ Extrinsic evaluation:
– Put each model in a task
• Spelling correction, machine translation etc.
– Time consuming.
– Task dependent.
§ Intrinsic evaluation:
– perplexity.
– Useful in pilot experiments.
Perplexity of a test sequence X = x_1 x_2 \dots x_L:
PP(X) = P(x_1 x_2 \dots x_L)^{-1/L}
      = \sqrt[L]{1 / P(x_1 x_2 \dots x_L)}
      = \sqrt[L]{\prod_{i=1}^{L} 1 / P(x_i | x_{i-1})}   (for a bigram model)

              Unigram   Bigram   Trigram
Perplexity    962       170      109
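A minimal sketch of perplexity for a bigram model, computed in log space to avoid underflow; prob_fn can be any of the smoothed estimates sketched earlier, and the start symbol "<s>" is my own convention for scoring the first word.

# Minimal sketch of perplexity: the inverse probability of the test sequence,
# normalised by its length L. Lower perplexity means the model fits the test data better.
import math

def perplexity(words, prob_fn):
    L = len(words)
    log_prob = 0.0
    prev = "<s>"                              # assumed start-of-sequence context
    for word in words:
        log_prob += math.log(prob_fn(prev, word))
        prev = word
    return math.exp(-log_prob / L)            # equals P(x_1 ... x_L)^(-1/L)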
Google N-Gram Release
§ serve as the incoming 92
§ serve as the incubator 99
§ serve as the independent 794
§ serve as the index 223
§ serve as the indication 72
§ serve as the indicator 120
§ serve as the indicators 45
§ serve as the indispensable 111
§ serve as the indispensible 40
§ serve as the individual 234
http://ngrams.googlelabs.com/
Smoothing for Web-scale N-grams
§ “Stupid backoff” (Brants et al. 2007)
§ No discounting, just use relative frequencies
S(w_i | w_{i-k+1}^{i-1}) =
    count(w_{i-k+1}^{i}) / count(w_{i-k+1}^{i-1})    if count(w_{i-k+1}^{i}) > 0
    0.4 \cdot S(w_i | w_{i-k+2}^{i-1})               otherwise

S(w_i) = count(w_i) / N
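A minimal recursive sketch of stupid backoff, assuming a single dictionary (of my own devising) that maps n-gram tuples of every order, including single words, to their counts:

# Minimal sketch of stupid backoff: use the relative frequency of the highest-order
# n-gram that has been seen, otherwise back off to the shorter context with a fixed
# factor of 0.4. Scores are not normalised probabilities.
def stupid_backoff(word, context, ngram_counts, total_tokens, alpha=0.4):
    if not context:                                   # base case: S(w_i) = count(w_i) / N
        return ngram_counts.get((word,), 0) / total_tokens
    full = ngram_counts.get(tuple(context) + (word,), 0)
    hist = ngram_counts.get(tuple(context), 0)
    if full > 0 and hist > 0:
        return full / hist
    # drop the leftmost context word and back off
    return alpha * stupid_backoff(word, context[1:], ngram_counts, total_tokens, alpha)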
Tools
§ SRILM
– http://www.speech.sri.com/projects/srilm/
§ Berkeley LM
– https://code.google.com/archive/p/berkeleylm/
§ KenLM
– https://kheafield.com/code/kenlm/
§ Available LM models
– http://www.keithv.com/software/csr/