LECTURE 3
Introduction to Language Models
Arkaitz Zubiaga, 17th January, 2018
2
LECTURE 3: CONTENTS
Statistical language models.
N-grams.
Estimating probabilities of n-grams.
Evaluation and perplexity.
3
N-GRAMS
N-gram: sequence of n words.
e.g. I want to go to the cinema.
2-grams (bigrams): I want, want to, to go, go to, to the,…
3-grams (trigrams): I want to, want to go, to go to,…
4-grams: I want to go, want to go to, to go to the,…
…
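As a minimal illustration (not from the slides), n-grams can be extracted from a tokenised sentence with a few lines of Python; simple whitespace tokenisation is an assumption kept for brevity.

```python
# Minimal sketch: extract n-grams from a sentence.
# Whitespace tokenisation is an assumption for illustration only.
def ngrams(sentence, n):
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I want to go to the cinema"
print(ngrams(sentence, 2))  # bigrams: ('I', 'want'), ('want', 'to'), ...
print(ngrams(sentence, 3))  # trigrams: ('I', 'want', 'to'), ...
```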
4
STATISTICAL LANGUAGE MODELLING
Statistical language model: a probability distribution over sequences of words.
* I want to * → high probability, common sequence in English
* want I to * → low probability or zero
5
STATISTICAL LANGUAGE MODELLING
How are these probability distributions useful?
Machine translation: P(“please, call me”) > P(“please, call I”)
Spelling correction:
“its 5pm now” → correct to “it’s 5pm now”, higher P
Speech recognition:
P(“I saw a van”) >> P(“eyes awe of an”)
And in many other NLP tasks!
6
STATISTICAL LANGUAGE MODELLING
Probability of a sequence of words (W):
P(W) = P(w1, w2, w3, …, wn)
Also, we often look at the probability of an upcoming word:
P(w4 | w1, w2, w3)
Both of the above are known as language models.
7
HOW DO WE COMPUTE P(W)?
How likely is the following sequence?
P(I, have, seen, my, friend, in, the, library)
We can rely on the Chain Rule of Probability.
8
THE CHAIN RULE OF PROBABILITY
Definiton of the rule:
More variables:
Generalisaton of the rule:
9
COMPUTING P(W) USING THE CHAIN RULE
P(I, found, two, pounds, in, the, library) =
P(I) *
P(found | I) *
P(two | I found) *
P(pounds | I found two) *
P(in | I found two pounds) *
P(the | I found two pounds in) *
P(library | I found two pounds in the)
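As a sketch of the decomposition above (cond_prob(word, history) is a hypothetical placeholder for a model of P(word | history), not something defined on the slides), the chain rule turns the joint probability into a running product:

```python
# Sketch: P(W) = P(w1) * P(w2 | w1) * ... * P(wn | w1 ... wn-1)
# cond_prob(word, history) is a hypothetical placeholder model.
def sentence_probability(words, cond_prob):
    p = 1.0
    for i, word in enumerate(words):
        p *= cond_prob(word, tuple(words[:i]))  # P(w_i | w_1 ... w_{i-1})
    return p

# Toy usage with a made-up uniform model over a 1,000-word vocabulary:
uniform = lambda word, history: 1.0 / 1000
print(sentence_probability("I found two pounds in the library".split(), uniform))
```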
10
AND HOW DO WE COMPUTE PROBABILITIES?
Dividing the number of occurrences?
P(library | I found two pounds in the) = count(I found two pounds in the library) / count(I found two pounds in the)
There are so many different sequences, we won’t observe enough instances in our data!
11
MARKOV ASSUMPTION
Approximate the probability by simplifying it:
P(library | I found two pounds in the) ≈ P(library | the)
Or:
P(library | I found two pounds in the) ≈ P(library | in the)
It’s much more likely that we’ll observe “in the library” in our training data.
12
MARKOV ASSUMPTION
Which we can generalise as:
P(wi | w1, w2, …, wi-1) ≈ P(wi | wi-k, wi-k+1, …, wi-1)
i.e., we will only look at the last k words.
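In code, the Markov assumption simply truncates the history passed to the model; a sketch using a hypothetical cond_prob(word, history) placeholder for P(word | history):

```python
# Sketch: k-th order Markov approximation of the chain rule.
# cond_prob(word, history) is a hypothetical placeholder model;
# only the last k words of the history are used.
def markov_sentence_probability(words, cond_prob, k):
    p = 1.0
    for i, word in enumerate(words):
        history = tuple(words[max(0, i - k):i])  # keep only the last k words
        p *= cond_prob(word, history)            # P(w_i | w_{i-k} ... w_{i-1})
    return p

# e.g. k=1 gives a bigram model, k=2 a trigram model.
```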
13
LANGUAGE MODELS
We can go with bigrams, trigrams,…
Even extend it to 4-grams, 5-grams,… but the longer we pick, the sparser our counts.
14
LANGUAGE MODELS
Note: it’s not a perfect solution, e.g.:
My request for using the department’s supercomputer to run experiments next month is … approved.
“month is approved” → unlikely
“request … approved” → more likely, not captured by n-grams
N-grams are however often a good solution.
ESTIMATING N-GRAM PROBABILITIES
16
ESTIMATING BIGRAM PROBABILITIES
Maximum Likelihood Estimate (MLE):
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
e.g.
17
AN EXAMPLE OF MLE COMPUTATION
For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here
18
AN EXAMPLE OF MLE COMPUTATION
For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here
P(am | I) = count(“I am”) / count(“I”) = 2 / 3 = 0.667
19
AN EXAMPLE OF MLE COMPUTATION
For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here
P(I | <s>) = count(“<s> I”) / count(“<s>”) = 1 / 3 = 0.333
20
AN EXAMPLE OF MLE COMPUTATION
For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here
P(now | right) = count(“right now”) / count(“right”) = 1 / 1 = 1
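The three worked examples above can be reproduced with a short counting sketch; the <s> sentence-start marker and whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus from the slides, with an assumed <s> sentence-start marker
# and simple whitespace tokenisation.
corpus = [
    "<s> Yes, I am watching the TV right now".split(),
    "<s> The laptop I gave you is very good".split(),
    "<s> I am very happy to be here".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def p_mle(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("am", "I"))       # 2 / 3 ≈ 0.667
print(p_mle("I", "<s>"))      # 1 / 3 ≈ 0.333
print(p_mle("now", "right"))  # 1 / 1 = 1.0
```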
21
EXAMPLE OF BIGRAM COUNTS
Counts extracted from a corpus of 9,222 sentences.
22
INFERRING BIGRAM PROBABILITIES
Divide values by unigram counts.
Outcome.
23
BIGRAM ESTIMATE OF SENTENCE PROBABILITY
P(I want to eat chinese food) =
P(want | I) [0.33]
* P(to | want) [0.66]
* P(eat | to) [0.28]
* P(chinese | eat) [0.021]
* P(food | chinese) [0.52]
= 0.000665945
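As a quick check of the arithmetic (the individual bigram probabilities are the ones quoted above):

```python
# Multiply the bigram probabilities quoted on the slide:
# P(want|I), P(to|want), P(eat|to), P(chinese|eat), P(food|chinese)
probs = [0.33, 0.66, 0.28, 0.021, 0.52]

p = 1.0
for prob in probs:
    p *= prob
print(p)  # ≈ 0.000666
```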
24
PRACTICAL SOLUTION
0.000665945 is a very low probability.
We may end up with a floating point underflow!
Solution:
Use log space instead.
Sum instead of multiplication.
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
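A small sketch of the same computation in log space, reusing the bigram probabilities from the sentence example above: a sum of logs replaces the product, and the result can be exponentiated back if the raw probability is ever needed.

```python
import math

# Same bigram probabilities as in the previous example.
probs = [0.33, 0.66, 0.28, 0.021, 0.52]

log_p = sum(math.log(p) for p in probs)  # sum of logs instead of a product
print(log_p)                             # ≈ -7.31 (natural log)
print(math.exp(log_p))                   # ≈ 0.000666, matches the direct product
```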
25
LANGUAGE MODELLING TOOLKITS
The CMU Statistical Language Modeling (SLM) Toolkit:
http://www.speech.cs.cmu.edu/SLM_info.html
SRILM – The SRI Language Modeling Toolkit:
http://www.speech.sri.com/projects/srilm/
26
COLLECTIONS OF LANGUAGE MODELS
Collection of Google N-grams:
28
COLLECTIONS OF LANGUAGE MODELS
Google N-grams is the best collection of language models today.
Caveat: it has a cost, $150.
https://catalog.ldc.upenn.edu/LDC2006T13
29
COLLECTIONS: FREE ALTERNATIVE
Google Books N-grams:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
EVALUATION AND PERPLEXITY
31
EVALUATION OF OUR MODEL
Does our language model prefer good sentences to bad ones?
i.e. does it assign higher probability:
to “real” or “frequent” sentences (e.g. I want to)
than “ungrammatical” or “rarely observed” sentences (e.g. want I to)?
32
EVALUATION OF OUR MODEL
Evaluation:
Train model on training data.
Test model’s performance on new, unseen data.
We need an evaluation metric → which measures how well our model matches the sentences in the test set.
33
EVALUATION APPROACHES
Two different evaluation approaches:
Extrinsic or in-vivo evaluation.
Intrinsic or in-vitro evaluation.
34
EXTRINSIC EVALUATION
Best for comparing models A and B.
Test A and B in an NLP task (MT, spell corrector, etc.).
How well do we perform with A? And with B?
Which corrects spelling better?
Which produces better translations?
35
DISADVANTAGE OF EXTRINSIC EVALUATION
Extrinsic evaluation can take very long to run.
Alternatively: intrinsic evaluation.
Computation of perplexity.
Rough approximation.
Only valid if tested on similar data.
36
INTRINSIC EVALUATION: PERPLEXITY
Perplexity:
given a language model, on average:
how difficult is it to predict the next word?
e.g. I always order pizza with cheese and ____ → ???
37
INTRINSIC EVALUATION: PERPLEXITY
The Shannon Game:
How well can we predict the next word?
I always order pizza with cheese and ____
A good model:
the one that gives higher probability to the actual next word.
mushrooms 0.1
pepperoni 0.1
jalapeños 0.01
….
biscuits 0.000001
38
INTRINSIC EVALUATION: PERPLEXITY
To compute the perplexity,
we will use an unseen test set.
(different from the training data used to build the model)
A good language model, i.e. low perplexity,
will be the one that maximises P(sentence)
39
INTRINSIC EVALUATION: PERPLEXITY
Perplexity of a sequence of words W, PP(W):
PP(W) = P(w1w2…wN)^(-1/N) = (1 / P(w1w2…wN))^(1/N)
40
INTRINSIC EVALUATION: PERPLEXITY
PP(W) = P(w1w2…wN)^(-1/N) = (1 / P(w1w2…wN))^(1/N)
Chain rule: PP(W) = (∏i 1 / P(wi | w1w2…wi-1))^(1/N)
Bigrams: PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)
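A minimal sketch of the bigram version of this formula, assuming a dictionary of bigram probabilities and a pre-tokenised test sequence (smoothing and sentence boundaries are ignored here); the log-space trick from earlier avoids underflow:

```python
import math

def bigram_perplexity(words, bigram_prob):
    # PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed in log space.
    n = len(words) - 1  # number of bigram predictions made
    log_sum = sum(math.log(bigram_prob[(prev, w)]) for prev, w in zip(words, words[1:]))
    return math.exp(-log_sum / n)

# Toy usage with made-up probabilities (an assumption, not real corpus counts):
toy_probs = {("I", "want"): 0.33, ("want", "to"): 0.66, ("to", "eat"): 0.28}
print(bigram_perplexity(["I", "want", "to", "eat"], toy_probs))  # ≈ 2.54
```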
41
INTRINSIC EVALUATION: PERPLEXITY
Perplexity is the weighted equivalent of the branching factor.
i.e. on average, how many options in each case?
For perplexity, we weight them according to probabilities.
42
PERPLEXITY AS A BRANCHING FACTOR
Suppose we have a sentence consisting of random digits [0-9].
All digits have the same probability: 1/10.
What is the perplexity?
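A quick worked check under this uniform model: each of the N digits has probability 1/10, so the perplexity is 10 regardless of N.

```python
# PP(W) = P(w1...wN)^(-1/N) = ((1/10)**N)**(-1/N) = 10 for any N.
N = 5
pp = ((1 / 10) ** N) ** (-1 / N)
print(pp)  # ≈ 10.0
```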
43
INTERPRETING PERPLEXITY
Lower perplexity (i.e. higher probability), better model!
e.g. WSJ corpus: 38 million words for training, 1.5 million for testing.
Unigram Bigram Trigram
Perplexity 962 170 109
44
LANGUAGE MODELS
Remember: language models will work best on similar-looking data.
If we train on social media data, that may not work for novels.
OMG I’m ROTFL! → may be frequent in SM, unlikely in
novels!
45
LANGUAGE MODELS
Limitation: We’re assuming that all n-grams in new, unseen data
will have been observed in the training data.
Is this the reality though?
There are different approaches for dealing with zeros and to enable generalisation of trained models.
In the next lecture.
46
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 3rd edition. Chapters 4.1-4.2.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2, Section 2.