
lecture03.pptx

LECTURE 3

Introduction to Language Models

Arkaitz Zubiaga, 17th January, 2018

2

LECTURE 3: CONTENTS

 Statistical language models.

 N-grams.

 Estimating probabilities of n-grams.

 Evaluation and perplexity.

3

N-GRAMS

 N-gram: sequence of n words.

 e.g. I want to go to the cinema.

 2-grams (bigrams): I want, want to, to go, go to, to the,…

 3-grams (trigrams): I want to, want to go, to go to,…

 4-grams: I want to go, want to go to, to go to the,…

 …

4

STATISTICAL LANGUAGE MODELLING

 Statistical language model: a probability distribution over
sequences of words.

 * I want to * → high probability, common sequence in English

 * want I to * → low probability or zero

5

STATISTICAL LANGUAGE MODELLING

 How are these probability distributions useful?

 Machine translation: P(“please, call me”) > P(“please, call I”)

 Spelling correction:
“its 5pm now” → correct to “it’s 5pm now”, higher P

 Speech recognition:
P(“I saw a van”) >> P(“eyes awe of an”)

 And in many other NLP tasks!

6

STATISTICAL LANGUAGE MODELLING

 Probability of a sequence of words (W):
P(W) = P(w1, w2, w3, …, wn)

 Also, we often look at the probability of an upcoming word:
P(w4 | w1, w2, w3)

 Both of the above are known as language models.

7

HOW DO WE COMPUTE P(W)?

 How likely is the following sequence?

P(I, have, seen, my, friend, in, the, library)

 We can rely on the Chain Rule of Probability.

8

THE CHAIN RULE OF PROBABILITY

 Definition of the rule: P(A, B) = P(A) P(B | A)

 More variables: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

 Generalisation of the rule: P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

9

COMPUTING P(W) USING THE CHAIN RULE

 P(I, found, two, pounds, in, the, library) =

P(I) *
P(found | I) *
P(two | I found) *
P(pounds | I found two) *
P(in | I found two pounds) *
P(the | I found two pounds in) *
P(library | I found two pounds in the)
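
A minimal sketch of this decomposition in Python, assuming a hypothetical cond_prob(word, history) function that returns P(word | history):

def sentence_probability(words, cond_prob):
    # Chain rule: P(w1, ..., wn) = product over i of P(wi | w1, ..., wi-1)
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, tuple(words[:i]))
    return prob

# e.g. sentence_probability("I found two pounds in the library".split(), cond_prob)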

10

AND HOW DO WE COMPUTE PROBABILITIES?

 Dividing numbers of occurrences?

P(library | I found two pounds in the) =
count(“I found two pounds in the library”) / count(“I found two pounds in the”)

 There are so many different sequences that we won’t observe
enough instances in our data!

11

MARKOV ASSUMPTION

 Approximate the probability by simplifying it:

P(library | I found two pounds in the) ≈ P(library | the)

 Or:

P(library | I found two pounds in the) ≈ P(library | in the)

 It’s much more likely that we’ll observe “in the library” in our
training data.

12

MARKOV ASSUMPTION

 Which we can generalise as:

P(wi | w1, w2, …, wi-1) ≈ P(wi | wi-k, wi-k+1, …, wi-1)

i.e., we will only look at the last k words.
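
As a small illustration (a sketch, not from the slides), the Markov assumption amounts to truncating the history to its last k words before the lookup:

def truncate_history(history, k):
    # keep only the last k words: (w_{i-k}, ..., w_{i-1})
    return tuple(history[-k:])

# truncate_history(("I", "found", "two", "pounds", "in", "the"), 2)  ->  ("in", "the")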

13

LANGUAGE MODELS

 We can go with bigrams, trigrams,…

 Even extend it to 4-grams, 5-grams,… but the longer the n-grams
we pick, the sparser our counts.

14

LANGUAGE MODELS

 Note: it’s not a perfect solution, e.g.:

My request for using the department’s supercomputer to run
experiments next month is … approved.

“month is approved” → unlikely
“request … approved” → more likely, not captured by n-grams

 N-grams are, however, often a good solution.

ESTIMATING N-GRAM PROBABILITIES

16

ESTIMATING BIGRAM PROBABILITIES

 Maximum Likelihood Estimate (MLE):

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

e.g. P(am | I) = count(“I am”) / count(“I”)

17

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

18

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(am | I) = count(“I am”) / count(“I”) = 2 / 3 = 0.667

19

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(I | <s>) = count(“<s> I”) / count(“<s>”) = 1 / 3 = 0.333

20

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(now | right) = count(“right now”) / count(“right”) = 1 / 1 = 1
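
A small sketch that reproduces these three estimates with collections.Counter, marking sentence starts with <s> as in the P(I | <s>) example (plain whitespace tokenisation, so punctuation stays attached to words):

from collections import Counter

sentences = [
    "Yes, I am watching the TV right now",
    "The laptop I gave you is very good",
    "I am very happy to be here",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in sentences:
    tokens = ["<s>"] + sentence.split()           # mark the sentence start
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(w, prev):
    # MLE estimate: count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p("am", "I"))       # 2 / 3 = 0.667
print(p("I", "<s>"))      # 1 / 3 = 0.333
print(p("now", "right"))  # 1 / 1 = 1.0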

21

EXAMPLE OF BIGRAM COUNTS

 Counts extracted from a corpus of 9,222 sentences.

22

INFERRING BIGRAM PROBABILITIES

 Divide the bigram counts by the unigram counts.

 Outcome.

23

BIGRAM ESTIMATE OF SENTENCE PROBABILITY

 P(I want to eat chinese food) =

 P(want | I) [0.33]
* P(to | want) [0.66]
* P(eat | to) [0.28]

 * P(chinese | eat) [0.021]

 * P(food | chinese) [0.52]

 = 0.000665945

24

PRACTICAL SOLUTION

 0.000665945 is a very low probability.

 We may end up with a floating point underflow!

 Solution:

 Use log space instead.

 Sum instead of multiplication.

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
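
In code (a minimal sketch), the same computation in log space sums the log probabilities and avoids the underflow:

import math

bigram_probs = [0.33, 0.66, 0.28, 0.021, 0.52]
log_p = sum(math.log(prob) for prob in bigram_probs)  # log P(I want to eat chinese food)
print(log_p)            # about -7.31
print(math.exp(log_p))  # back to ≈ 0.000666, only if the raw probability is ever needed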

25

LANGUAGE MODELLING TOOLKITS

 The CMU Statistical Language Modeling (SLM) Toolkit:
http://www.speech.cs.cmu.edu/SLM_info.html

 SRILM – The SRI Language Modeling Toolkit:
http://www.speech.sri.com/projects/srilm/

26

COLLECTIONS OF LANGUAGE MODELS

 Collection of Google N-grams:

27

COLLECTIONS OF LANGUAGE MODELS

 Collection of Google N-grams:

28

COLLECTIONS OF LANGUAGE MODELS

 Google N-grams is the best collection of language models available today.

 Caveat: it has a cost, $150.

https://catalog.ldc.upenn.edu/LDC2006T13

29

COLLECTIONS: FREE ALTERNATIVE

 Google Books N-grams:

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

EVALUATION AND PERPLEXITY

31

EVALUATION OF OUR MODEL

 Does our language model prefer good sentences to bad ones?

 i.e. does it assign higher probability:

 to “real” or “frequent” sentences (e.g. I want to)

 than to “ungrammatical” or “rarely observed” sentences?
(e.g. want I to)

32

EVALUATION OF OUR MODEL

 Evaluation:

 Train model on training data.

 Test model’s performance on new, unseen data.

 We need an evaluation metric → which measures how well our
model matches the sentences in the test set.

33

EVALUATION APPROACHES

 Two different evaluation approaches:

 Extrinsic or in-vivo evaluation.

 Intrinsic or in-vitro evaluation.

34

EXTRINSIC EVALUATION

 Best for comparing models A and B.

 Test A and B in an NLP task (MT, spell corrector, etc.).

 How well do we perform with A? And with B?

 Which corrects spelling better?

 Which produces better translations?

35

DISADVANTAGE OF EXTRINSIC EVALUATION

 Extrinsic evaluation can take very long to run.

 Alternatively: intrinsic evaluation.

 Computation of perplexity.

 Rough approximation.

 Only valid if tested on similar data.

36

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity:

given a language model, on average:
how difficult is it to predict the next word?

e.g. I always order pizza with cheese and ____ → ???

37

INTRINSIC EVALUATION: PERPLEXITY

 The Shannon Game:

 How well can we predict the next word?

I always order pizza with cheese and ____

 A good model:
the one that gives higher probability to the actual next word.

mushrooms 0.1

pepperoni 0.1

jalapeños 0.01

….

biscuits 0.000001

38

INTRINSIC EVALUATION: PERPLEXITY

 To compute the perplexity,
we will use an unseen test set.

(diferent from the training data used to build the model)

 A good language model, i.e. one with low perplexity,
will be the one that maximises P(sentence).

39

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity of a sequence of words W, PP(W):

PP(W) = P(w1 w2 … wN)^(-1/N)
      = N-th root of ( 1 / P(w1 w2 … wN) )
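
A minimal sketch of this computation for a bigram model, assuming a p(w, prev) function such as the MLE estimate sketched earlier (a real implementation would also need to handle unseen bigrams, covered next lecture):

import math

def perplexity(words, p):
    # PP(W) = P(w1 ... wN) ** (-1/N), computed in log space for numerical stability
    tokens = ["<s>"] + words
    log_prob = sum(math.log(p(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / len(words))

# e.g. perplexity("I am very happy to be here".split(), p)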

40

INTRINSIC EVALUATION: PERPLEXITY

PP(W) = P(w1 w2 … wN)^(-1/N)

Chain rule: PP(W) = ( ∏ i=1..N  1 / P(wi | w1 … wi-1) )^(1/N)

Bigrams: PP(W) = ( ∏ i=1..N  1 / P(wi | wi-1) )^(1/N)

41

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity is the weighted equivalent of the branching factor,
i.e. on average, how many options are there in each case?

 For perplexity, we weight the options according to their probabilities.

42

PERPLEXITY AS A BRANCHING FACTOR

 Suppose we have a sentence consisting of random digits [0-9].

 All digits have the same probability: 1/10.

 What is the perplexity?
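
A quick numerical check (a sketch; the sequence length N does not change the answer):

N = 12                      # length of the digit sequence; any value gives the same result
p_sentence = (1 / 10) ** N  # every digit has probability 1/10
print(p_sentence ** (-1 / N))  # 10.0 (up to floating point rounding)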

43

INTERPRETING PERPLEXITY

 Lower perplexity (i.e. higher probability) means a better model!

 e.g. WSJ corpus: 38 million words for training, 1.5 million for
testing.

             Unigram   Bigram   Trigram
Perplexity       962      170       109

44

LANGUAGE MODELS

 Remember: language models will work best on similar-looking
data.

 If we train on social media data, that may not work for novels.

 OMG I’m ROTFL! → may be frequent on social media, unlikely in
novels!

45

LANGUAGE MODELS

 Limitation: we’re assuming that all n-grams in new, unseen data
will have been observed in the training data.

 Is this the reality though?

 There are different approaches for dealing with zeros and to
enable generalisation of trained models.

 In the next lecture.

46

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and
Language Processing: An Introduction to Natural Language
Processing, Speech Recognition, and Computational Linguistics.
3rd edition. Chapters 4.1-4.2.

 Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2,
Section 2.