

LECTURE 3

Introduction to Language Models

Arkaitz Zubiaga, 17th January, 2018

2

LECTURE 3: CONTENTS

 Statistical language models.

 N-grams.

 Estimating probabilities of n-grams.

 Evaluation and perplexity.

3

N-GRAMS

 N-gram: sequence of n words.

 e.g. I want to go to the cinema.

 2-grams (bigrams): I want, want to, to go, go to, to the,…

 3-grams (trigrams): I want to, want to go, to go to,…

 4-grams: I want to go, want to go to, to go to the,…

 …

4

STATISTICAL LANGUAGE MODELLING

 Statistical language model: a probability distribution over
sequences of words.

 * I want to * → high probability, common sequence in English

 * want I to * → low probability or zero

5

STATISTICAL LANGUAGE MODELLING

 How are these probability distributions useful?

 Machine translation: P(“please, call me”) > P(“please, call I”)

 Spelling correction:
“its 5pm now” → correct to “it’s 5pm now”, higher P

 Speech recognition:
P(“I saw a van”) >> P(“eyes awe of an”)

 And in many other NLP tasks!

6

STATISTICAL LANGUAGE MODELLING

 Probability of a sequence of words (W):

P(W) = P(w1, w2, w3, …, wn)

 Also, we often look at the probability of an upcoming word:

P(w4 | w1, w2, w3)

 Both of the above are known as language models.

7

HOW DO WE COMPUTE P(W)?

 How likely is the following sequence?

P(I, have, seen, my, friend, in, the, library)

 We can rely on the Chain Rule of Probability.

8

THE CHAIN RULE OF PROBABILITY

 Definition of the rule:

P(A, B) = P(A) P(B | A)

 More variables:

P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

 Generalisation of the rule:

P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

9

COMPUTING P(W) USING THE CHAIN RULE

 P(I, found, two, pounds, in, the, library) =

P(I) *
P(found | I) *
P(two | I found) *
P(pounds | I found two) *
P(in | I found two pounds) *
P(the | I found two pounds in) *
P(library | I found two pounds in the)

10

AND HOW DO WE COMPUTE PROBABILITIES?

 Dividing the number of occurrences?

P(library | I found two pounds in the) =
count(“I found two pounds in the library”) / count(“I found two pounds in the”)

 There are so many different possible sequences that we won’t observe
enough instances in our data!

11

MARKOV ASSUMPTION

 Approximate the probability by simplifying it:

P(library | I found two pounds in the) ≈ P(library | the)

 Or:

P(library | I found two pounds in the) ≈ P(library | in the)

 It’s much more likely that we’ll observe “in the library” in our
training data.

12

MARKOV ASSUMPTION

 Which we can generalise as:

P(wi | w1, w2, …, wi-1) ≈ P(wi | wi-k, wi-k+1, …, wi-1)

i.e., we will only look at the last k words.
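A minimal sketch of what this approximation means in code: the full history is truncated to its last k words before any probability lookup (the function name is hypothetical, chosen for illustration):

```python
# Sketch: under a k-th order Markov assumption we only condition on the last k words.
def markov_context(history, k):
    """Truncate the full history w1 .. w_{i-1} to its last k words."""
    return tuple(history[-k:])

history = ["I", "found", "two", "pounds", "in", "the"]
print(markov_context(history, 1))  # ('the',)       -> bigram model
print(markov_context(history, 2))  # ('in', 'the')  -> trigram model
```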

13

LANGUAGE MODELS

 We can go with bigrams, trigrams,…

 Even extend it to 4-grams, 5-grams,… but the longer the n-grams we
pick, the sparser our counts.

14

LANGUAGE MODELS

 Note: it’s not a perfect solution, e.g.:

My request for using the department’s supercomputer to run
experiments next month is … approved.

“month is approved” → unlikely
“request … approved” → more likely, not captured by n-grams

 N-grams are however often a good solution.

ESTIMATING N-GRAM
PROBABILITIES

16

ESTIMATING BIGRAM PROBABILITIES

 Maximum Likelihood Estimate (MLE):

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

e.g. P(am | I) = count(“I am”) / count(“I”)

17

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

18

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(am | I) = count(“I am”) / count(“I”) = 2 / 3 = 0.667

19

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(I | <s>) = count(“<s> I”) / count(“<s>”) = 1 / 3 = 0.333
(where <s> is the sentence-start marker)

20

AN EXAMPLE OF MLE COMPUTATION

 For the following sentences:
Yes, I am watching the TV right now
The laptop I gave you is very good
I am very happy to be here

 P(now | right) = count(“right now”) / count(“right”) = 1 / 1 = 1
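A small sketch that reproduces the three estimates above (assuming naive whitespace tokenisation and <s> / </s> boundary markers; the helper names are made up for illustration):

```python
from collections import Counter

# Toy corpus from the slides.
sentences = [
    "Yes, I am watching the TV right now",
    "The laptop I gave you is very good",
    "I am very happy to be here",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s.split() + ["</s>"]   # add sentence boundary markers
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram probability: P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("am", "I"))       # 2 / 3 ≈ 0.667
print(p_mle("I", "<s>"))      # 1 / 3 ≈ 0.333
print(p_mle("now", "right"))  # 1 / 1 = 1.0
```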

21

EXAMPLE OF BIGRAM COUNTS

 Counts extracted from a corpus of 9,222 sentences.

22

INFERRING BIGRAM PROBABILITIES

 Divide the bigram counts by the unigram counts.

 Outcome: a table of bigram probabilities.

23

BIGRAM ESTIMATE OF SENTENCE PROBABILITY

 P(I want to eat chinese food) =

 P(want | I) [0.33]
 * P(to | want) [0.66]
 * P(eat | to) [0.28]
 * P(chinese | eat) [0.021]
 * P(food | chinese) [0.52]

 = 0.000665945

24

PRACTICAL SOLUTION

 0.000665945 is a very low probability.

 We may end up with a floating point underflow!

 Solution:

 Use log space instead.

 Sum instead of multiplication.

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
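A minimal sketch of the same computation done in log space, which avoids underflow for long sentences:

```python
import math

probs = [0.33, 0.66, 0.28, 0.021, 0.52]

# Sum log-probabilities instead of multiplying raw probabilities.
log_p = sum(math.log(p) for p in probs)
print(log_p)            # ≈ -7.31 (natural log)
print(math.exp(log_p))  # back to ≈ 0.000665945; only safe for short sequences
```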

25

LANGUAGE MODELLING TOOLKITS

 The CMU Statistical Language Modeling (SLM) Toolkit:
http://www.speech.cs.cmu.edu/SLM_info.html

 SRILM – The SRI Language Modeling Toolkit:
http://www.speech.sri.com/projects/srilm/

26

COLLECTIONS OF LANGUAGE MODELS

 Collection of Google N-grams:


28

COLLECTIONS OF LANGUAGE MODELS

 Google N-grams is one of the largest collections of n-gram counts available today.

 Caveat: it comes at a cost: $150.

https://catalog.ldc.upenn.edu/LDC2006T13

29

COLLECTIONS: FREE ALTERNATIVE

 Google Books N-grams:

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html


EVALUATION AND PERPLEXITY

31

EVALUATION OF OUR MODEL

 Does our language model prefer good sentences to bad ones?

 i.e. does it assign higher probability:

 to “real” or “frequent” sentences (e.g. I want to)

 than to “ungrammatical” or “rarely observed” sentences?
(e.g. want I to)

32

EVALUATION OF OUR MODEL

 Evaluation:

 Train the model on training data.

 Test the model’s performance on new, unseen data.

 We need an evaluation metric that measures how well our
model matches the sentences in the test set.

33

EVALUATION APPROACHES

 Two different evaluation approaches:

 Extrinsic or in-vivo evaluation.

 Intrinsic or in-vitro evaluation.

34

EXTRINSIC EVALUATION

 Best for comparing models A and B.

 Test A and B in an NLP task (MT, spell corrector, etc.).

 How well do we perform with A? And with B?

 Which corrects spelling better?

 Which produces better translations?

35

DISADVANTAGE OF EXTRINSIC EVALUATION

 Extrinsic evaluation can take a very long time to run.

 Alternative: intrinsic evaluation.

 Computation of perplexity.

 A rough approximation.

 Only valid if tested on similar data.

36

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity:

given a language model, on average:
how difficult is it to predict the next word?

e.g. I always order pizza with cheese and ____ → ???

37

INTRINSIC EVALUATION: PERPLEXITY

 The Shannon Game:

 How well can we predict the next word?

I always order pizza with cheese and ____

 A good model:
the one that gives higher probability to the actual next word.

mushrooms 0.1

pepperoni 0.1

jalapeños 0.01

….

biscuits 0.000001
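A sketch of how a bigram model plays the Shannon game, reusing the hypothetical bigrams and unigrams counters from the MLE sketch earlier (the toy corpus only gives a tiny candidate list, of course):

```python
# Rank candidate next words for a given previous word under the toy bigram model.
def next_word_distribution(prev):
    cands = {w: c / unigrams[prev] for (p, w), c in bigrams.items() if p == prev}
    return sorted(cands.items(), key=lambda kv: kv[1], reverse=True)

print(next_word_distribution("very"))  # [('good', 0.5), ('happy', 0.5)]
```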

38

INTRINSIC EVALUATION: PERPLEXITY

 To compute the perplexity,
we will use an unseen test set.

(different from the training data used to build the model)

 A good language model, i.e. low perplexity,
will be the one that maximises P(sentence)

39

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity of a sequence of words W, PP(W):

PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

40

INTRINSIC EVALUATION: PERPLEXITY

PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:
PP(W) = (∏i 1 / P(wi | w1, …, wi-1))^(1/N)

Bigrams:
PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)
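A sketch of perplexity under the bigram approximation, reusing the hypothetical p_mle estimator from the MLE sketch (this only works if every bigram in the test sequence was seen in training; unseen bigrams are the subject of the next lecture):

```python
import math

def perplexity(tokens, p_bigram):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed in log space."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(p_bigram(w, prev)) for prev, w in pairs)
    return math.exp(-log_sum / len(pairs))

# Example on a sentence whose bigrams all appear in the toy corpus:
tokens = ["<s>", "I", "am", "very", "happy", "to", "be", "here", "</s>"]
print(perplexity(tokens, p_mle))  # ≈ 1.44
```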

41

INTRINSIC EVALUATION: PERPLEXITY

 Perplexity is the weighted equivalent of the branching factor.
i.e. on average, how many options in each case?

 For perplexity, we weight them according to probabilities.

42

PERPLEXITY AS A BRANCHING FACTOR

 Suppose we have a sentence consisting of random digits [0-9].

 All digits have the same probability: 1/10.

 What is the perplexity?
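Working it through: P(W) = (1/10)^N for any sequence of N digits, so PP(W) = ((1/10)^N)^(-1/N) = 10. A quick numerical check (N chosen arbitrarily):

```python
# Uniform distribution over 10 digits: perplexity is 10, whatever the length.
N = 50
p_sequence = (1 / 10) ** N
print(p_sequence ** (-1 / N))  # 10.0 (up to floating-point error)
```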

43

INTERPRETING PERPLEXITY

 Lower perplexity (i.e. higher probability), better model!

 e.g. WSJ corpus: 38 million words for training, 1.5 million for
testing.

             Unigram   Bigram   Trigram
Perplexity       962      170       109

44

LANGUAGE MODELS

 Remember: language models work best on data that looks similar to
their training data.

 If we train on social media data, it may not work well for novels.

 OMG I’m ROTFL! → may be frequent on social media, unlikely in
novels!

45

LANGUAGE MODELS

 Limitation: we’re assuming that all n-grams in new, unseen data
will have been observed in the training data.

 Is this the reality though?

 There are different approaches for dealing with zeros and for
enabling generalisation of trained models.

 In the next lecture.

46

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and
Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition.
3rd edition. Chapters 4.1-4.2.

 Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 2,
Section 2.
