
LECTURE 4

Advanced Language Models: Generalisation and Zeros

Arkaitz Zubiaga, 22nd January, 2018

 Generalisation of Language Models and Zeros

 Smoothing Approaches for Generalisation

● Laplace smoothing.

● Interpolation, backoff and web-scale LMs.

● Good-Turing smoothing.

● Kneser-Ney smoothing.

LECTURE 4: CONTENTS

 Language models: P(I, am, fine) → high
P(fine, am, I) → low or 0

 Remember: language models work best on data that looks similar to the
training data.

 If we train on social media data, the model may not work well for novels.

 “OMG I’m ROTFL!” → may be frequent on social media, unlikely in
novels!

GENERALISATION OF LANGUAGE MODELS

 Limitation: we’re assuming that all n-grams in new, unseen data
will have been observed in the training data.

 Is this the reality though?

 We need to consider the possibility of new n-grams in unseen
data.

GENERALISATION OF LANGUAGE MODELS

 Choose a random starting bigram
 (<s>, w) according to its probability.

 While x != </s>:

 Choose the next random bigram
 (w, x) according to its probability.

 Concatenate all bigrams (a code sketch follows the example below).

THE SHANNON VISUALISATION METHOD

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
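Below is a minimal sketch of this sampling loop, assuming the bigram model is stored as a dictionary mapping each word to a Counter of its observed followers (the `bigram_counts` structure and the toy counts are illustrative, not taken from the lecture):

```python
import random
from collections import Counter

# Toy bigram counts; in practice these come from a training corpus.
bigram_counts = {
    "<s>":     Counter({"I": 3, "You": 1}),
    "I":       Counter({"want": 2, "am": 1}),
    "You":     Counter({"want": 1}),
    "want":    Counter({"to": 3}),
    "to":      Counter({"eat": 2}),
    "eat":     Counter({"Chinese": 1}),
    "Chinese": Counter({"food": 1}),
    "food":    Counter({"</s>": 1}),
    "am":      Counter({"</s>": 1}),
}

def generate_sentence(bigram_counts):
    """Sample a sentence by repeatedly drawing the next word in
    proportion to its bigram count, until </s> is drawn."""
    word, sentence = "<s>", []
    while True:
        followers = bigram_counts[word]
        word = random.choices(list(followers), weights=followers.values())[0]
        if word == "</s>":
            return " ".join(sentence)
        sentence.append(word)

print(generate_sentence(bigram_counts))  # e.g. "I want to eat Chinese food"
```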


SHANNON VIS. METHOD FOR SHAKESPEARE

 Significant improvement as we train on longer n-grams.

 But it’s a trade-off:

 Longer n-grams → more accurate.

 Longer n-grams → more sparse.

 We need to find a balance.

SHANNON VIS. METHOD FOR SHAKESPEARE

 Shakespeare used:

 N = 884,647 tokens.

 V = 29,066 types → V² ≈ 845M possible bigrams!

 Shakespeare only produced ~300,000 bigram types (0.04% of all
possible bigrams!)

 Other bigrams may be possible, but we haven’t observed
them.

SHAKESPEARE’S BIGRAMS

 Training set:

 … found a penny

 … found a solution

 … found a tenner

 … found a book

 Test set:

 … found a fiver

 … found a card

 P(card | found a) = 0

ZEROS

 We have sparse statistics:
P(w | “found a”)
3 → penny
2 → solution
1 → tenner
1 → book

7 → total

THE INTUITION OF SMOOTHING

[Histogram: observed counts for penny, solution, tenner, book; zero counts for fiver, card, …]

 We’d like to improve the distribution:
P(w | “found a”)
3 → penny → 2.5
2 → solution → 1.5
1 → tenner → 0.5
1 → book → 0.5

other → 2

7 → total

THE INTUITION OF SMOOTHING

[Histogram: smoothed counts, with some probability mass shifted to unseen words such as fiver and card]

 We’ll see 4 possible solutions:

1) Laplace smoothing.

2) Interpolation, backoff and web-scale LMs.

3) Good-Turing smoothing.

4) Kneser-Ney smoothing.

FOUR POSSIBLE SOLUTIONS

LAPLACE SMOOTHING

 Pretend we have seen each word one more time than we
actually did.

LAPLACE SMOOTHING: ADD-ONE ESTIMATION

 MLE estimate:

P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

 Add-one estimate:

P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
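A minimal sketch of the two estimates above, assuming bigram and unigram counts collected into Counters over a toy corpus (the corpus and all names are illustrative):

```python
from collections import Counter

corpus = "I want to eat Chinese food".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_mle(prev, word):
    # Unsmoothed maximum-likelihood estimate: zero for unseen bigrams.
    return bigrams[(prev, word)] / unigrams[prev]

def p_add_one(prev, word):
    # Add-one (Laplace) estimate: every bigram count is incremented by 1,
    # so the denominator grows by V to keep the distribution normalised.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_mle("want", "to"))        # 1.0
print(p_add_one("want", "to"))    # (1 + 1) / (1 + 6) ≈ 0.286
print(p_add_one("want", "food"))  # unseen bigram now gets 1/7 ≈ 0.143
```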

 Add 1 to ALL values in the bigram count matrix.

LAPLACE SMOOTHED BIGRAM COUNTS

LAPLACE SMOOTHED BIGRAM PROBABILITIES

RECONSTITUTED COUNTS

 Original:

 Smoothed:

COMPARING ORIGINAL VS SMOOTHED

 Original:

 Smoothed:

COMPARING ORIGINAL VS SMOOTHED

(some of the reconstituted counts end up divided by 10!)

MORE GENERAL LAPLACE SMOOTHING: ADD-k

P_{Add\text{-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}

 It’s generally not a good solution for language models.

 With such a sparse matrix, we’re replacing too many 0’s with
1’s.

 It’s still often used for other NLP tasks with fewer 0’s,

 e.g. text classification using unigrams.

LAPLACE SMOOTHING IS NOT IDEAL FOR LMs

INTERPOLATION, BACKOFF AND
WEB-SCALE LMs

 For rare cases → you may choose to use less context.

 Backoff:

 use the trigram if you have good evidence,

 otherwise the bigram, otherwise the unigram.

 Interpolation:

 mix unigram, bigram and trigram estimates.

 Interpolation works better.

BACKOFF AND INTERPOLATION

 Simple interpolation:

\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \textstyle\sum_j \lambda_j = 1

 Lambdas conditional on context:

\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1(w_{i-2}^{i-1}) P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2(w_{i-2}^{i-1}) P(w_i \mid w_{i-1}) + \lambda_3(w_{i-2}^{i-1}) P(w_i)

LINEAR INTERPOLATION
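A minimal sketch of simple linear interpolation with fixed lambdas, assuming precomputed unigram, bigram and trigram probability tables (the tables, weights and names here are illustrative):

```python
# Toy probability tables; in practice these come from MLE counts on training data.
p_uni = {"food": 0.05}
p_bi  = {("Chinese", "food"): 0.40}
p_tri = {("eat", "Chinese", "food"): 0.70}

# Fixed interpolation weights; they must sum to 1.
lambdas = (0.5, 0.3, 0.2)  # trigram, bigram, unigram

def p_interp(w2, w1, w):
    """Mix trigram, bigram and unigram estimates with fixed weights."""
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((w2, w1, w), 0.0)
            + l2 * p_bi.get((w1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

print(p_interp("eat", "Chinese", "food"))  # 0.5*0.7 + 0.3*0.4 + 0.2*0.05 = 0.48
```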

 Use a held-out (or development) corpus:

 Fix the N-gram probabilities (on the training data).

 Then search for the λs that give the largest probability to the held-out set:

HOW DO WE SET THOSE LAMBDAS?

\log P(w_1 \dots w_n \mid M(\lambda_1 \dots \lambda_k)) = \sum_i \log P_{M(\lambda_1 \dots \lambda_k)}(w_i \mid w_{i-1})

[Diagram: the corpus is split into Training Data, Held-Out Data and Test Data]
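A minimal sketch of setting the λs by coarse grid search on held-out data, assuming probability tables like the ones in the previous sketch (the helper names, grid step and data layout are illustrative, not a method prescribed in the lecture):

```python
import itertools
import math

def heldout_log_prob(lambdas, heldout_trigrams, p_tri, p_bi, p_uni):
    """Sum of interpolated log-probabilities over the held-out trigrams."""
    l3, l2, l1 = lambdas
    total = 0.0
    for w2, w1, w in heldout_trigrams:
        p = (l3 * p_tri.get((w2, w1, w), 0.0)
             + l2 * p_bi.get((w1, w), 0.0)
             + l1 * p_uni.get(w, 0.0))
        total += math.log(p) if p > 0 else float("-inf")
    return total

def search_lambdas(heldout_trigrams, p_tri, p_bi, p_uni, step=0.1):
    """Try all weight triples on a coarse grid and keep the best one."""
    best, best_lp = None, float("-inf")
    grid = [round(step * i, 10) for i in range(int(1 / step) + 1)]
    for l3, l2 in itertools.product(grid, repeat=2):
        l1 = 1.0 - l3 - l2
        if l1 < 0:
            continue
        lp = heldout_log_prob((l3, l2, l1), heldout_trigrams, p_tri, p_bi, p_uni)
        if best is None or lp > best_lp:
            best, best_lp = (l3, l2, l1), lp
    return best
```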

 Two scenarios:

1) We know ALL words (including those in the test set) in advance.

 We have a fixed vocabulary V – a closed-vocabulary task.

2) We may find new, unseen words (generally the case).

 Out-Of-Vocabulary (OOV) words – an open-vocabulary task.

OPEN VS CLOSED VOCABULARY TASKS

 We can use a new token, <UNK>, for all unknown words.

 How to train (a preprocessing sketch follows below):

 Create a fixed lexicon L of size V.

 While preprocessing the text, change words not in L to <UNK>.

 Train the LM, where <UNK> is just another token.

 In test: new words are assigned the probability of <UNK>.
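A minimal sketch of this preprocessing, assuming the lexicon L is simply the V most frequent training words (the function names, the cut-off and the toy data are illustrative):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_tokens, V):
    """Keep the V most frequent training words as the fixed lexicon L."""
    return {w for w, _ in Counter(train_tokens).most_common(V)}

def replace_oov(tokens, lexicon):
    """Map every word outside the lexicon to the <UNK> token."""
    return [w if w in lexicon else UNK for w in tokens]

train = "I want to eat Chinese food I want tea".split()
lexicon = build_lexicon(train, V=5)
print(replace_oov("I want to drink tea".split(), lexicon))
# words outside L (such as 'drink') are mapped to <UNK>
```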

HOW DO WE DEAL WITH OOV WORDS?

 “Stupid backoff” (Brants et al. 2007):

 No discounting, just use relative frequencies:

SMOOTHING FOR VERY LARGE CORPORA

S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\text{count}(w_{i-k+1}^{i})}{\text{count}(w_{i-k+1}^{i-1})} & \text{if count}(w_{i-k+1}^{i}) > 0 \\[4pt]
0.4 \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}

S(w_i) = \frac{\text{count}(w_i)}{N}
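A minimal sketch of stupid backoff, assuming all n-gram counts (of every order) live in a single Counter keyed by word tuples (the toy counts and variable names are illustrative; the 0.4 back-off weight is the constant quoted above):

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off weight used by Brants et al. (2007)

def stupid_backoff(ngram, counts, total_tokens):
    """Relative frequency of the n-gram, backing off to shorter contexts
    (scaled by ALPHA) whenever the count is zero.
    `counts` maps tuples of words of any length to their corpus counts."""
    if len(ngram) == 1:
        return counts[ngram] / total_tokens
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * stupid_backoff(ngram[1:], counts, total_tokens)

# Toy counts; in practice these come from a very large corpus.
counts = Counter({
    ("to", "eat", "food"): 0,
    ("eat", "food"): 2,
    ("eat",): 10,
    ("food",): 5,
})
print(stupid_backoff(("to", "eat", "food"), counts, total_tokens=1000))
# count("to eat food") = 0, so back off: 0.4 * count("eat food")/count("eat") = 0.4 * 0.2 = 0.08
```

Note that S is a score rather than a true probability: the backed-off values are not renormalised, which is exactly why this is cheap enough for web-scale corpora.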

GOOD TURING SMOOTHING

 To estimate the counts of unseen things,

→ use the count of things we’ve seen once.

INTUITION OF GOOD TURING SMOOTHING

 N_c = the number of word types observed exactly c times (the “frequency of frequency c”)

 e.g. I am happy and I am fine and I have to go

I → 3
am → 2
and → 2
happy → 1
fine → 1
have → 1
to → 1
go → 1

FREQUENCY OF FREQUENCY “C”

N_1 = 5
N_2 = 2
N_3 = 1
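A minimal sketch of computing these frequency-of-frequency counts for the example sentence (variable names are illustrative):

```python
from collections import Counter

tokens = "I am happy and I am fine and I have to go".split()

word_counts = Counter(tokens)                 # e.g. I → 3, am → 2, happy → 1, ...
freq_of_freq = Counter(word_counts.values())  # N_c: how many types occur exactly c times

print(freq_of_freq[1], freq_of_freq[2], freq_of_freq[3])  # 5 2 1
```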

 You are fishing, and have caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 How likely is it that the next fish is a trout?

 1/18

INTUITION OF GOOD TURING SMOOTHING

 You are fishing, and have caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 How likely is it that the next fish is new, e.g. a haddock?

 Use our estimate of things-we-saw-once → 3/18

 OK, but then we need to revise all the other probabilities.

INTUITION OF GOOD TURING SMOOTHING

GOOD TURING CALCULATIONS

 Good-Turing discounts the observed counts and gives the freed-up mass to unseen events:

P^{*}_{GT}(\text{unseen}) = \frac{N_1}{N}, \qquad c^{*} = \frac{(c+1)\,N_{c+1}}{N_c}

where N is the total number of observed tokens and N_c is the frequency of frequency c.

 You are fishing, and have caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 P(next fish is a new species) = N_1 / N = 3/18
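A short sketch applying the two formulas above to the fishing counts as a sanity check (the dictionary layout is illustrative; the discounted count for trout follows directly from c* = 2/3):

```python
from fractions import Fraction

N = 18                            # total fish caught
N_c = {1: 3, 2: 1, 3: 1, 10: 1}   # frequency of frequency: 3 species seen once, etc.

# Probability of an entirely new species = N_1 / N
p_unseen = Fraction(N_c[1], N)
print(p_unseen)              # 3/18 = 1/6

# Discounted count and probability for trout (seen c = 1 time):
c = 1
c_star = Fraction((c + 1) * N_c[c + 1], N_c[c])   # 2 * 1 / 3 = 2/3
p_trout = c_star / N
print(c_star, p_trout)       # 2/3 and 1/27
```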

 Problem: what about the word “and”? (say c = 12,156)

● For small c, N_c > N_{c+1}.

● For large c? N_{12157} will likely be zero.

GOOD TURING ISSUES

[Plot: N_c against c — N_1, N_2, N_3 are large, but N_c is zero for most large c]

 Solution: avoid zeros for large values of c.

● Simple Good-Turing [Gale and Sampson]:
replace N_c with a best-fit power law once the counts become unreliable.

GOOD TURING ISSUES

[Plot: N_c against c, with a fitted power law smoothing the unreliable tail]

 Corpus with 22 million words of AP Newswire:

GOOD TURING NUMBERS

c^{*} = \frac{(c+1)\,N_{c+1}}{N_c}

Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

KNESER-NEY SMOOTHING

 Corpus with 22 million words of AP Newswire.

 For c >= 2:

● c* ≈ c − 0.75

GOOD TURING NUMBERS

c^{*} = \frac{(c+1)\,N_{c+1}}{N_c}

Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

 Save ourselves some time and just subtract 0.75 (or some d)!

 But should we really just use the regular unigram P(w)?

ABSOLUTE DISCOUNTING INTERPOLATION

P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w)

(first term: discounted bigram; \lambda(w_{i-1}): interpolation weight; P(w): unigram)
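A minimal sketch of absolute discounting for bigrams, assuming Counters of bigram and unigram counts (the discount d = 0.75, the toy counts and the function name are illustrative):

```python
from collections import Counter

def p_absolute_discounting(prev, word, bigrams, unigrams, total_tokens, d=0.75):
    """Subtract a fixed discount d from every observed bigram count and
    give the freed-up mass to the unigram distribution."""
    discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
    # lambda(prev): the discounted mass, spread over the unigram estimate
    num_followers = len([b for b in bigrams if b[0] == prev and bigrams[b] > 0])
    lam = (d / unigrams[prev]) * num_followers
    p_unigram = unigrams[word] / total_tokens
    return discounted + lam * p_unigram

bigrams = Counter({("found", "a"): 7, ("a", "penny"): 3, ("a", "solution"): 2,
                   ("a", "tenner"): 1, ("a", "book"): 1})
unigrams = Counter({"found": 7, "a": 7, "penny": 3, "solution": 2,
                    "tenner": 1, "book": 1, "card": 1})
print(p_absolute_discounting("a", "card", bigrams, unigrams, total_tokens=22))
# the unseen bigram "a card" now gets a small non-zero probability (≈ 0.019)
```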

 Better estimate for the probabilities of lower-order unigrams!

● Shannon game: I can’t see without my reading _______?

● We have never seen “reading glasses” in the training data.

● P(“Kingdom”) > P(“glasses”)
→ but “Kingdom” always comes after “United”.

KNESER-NEY SMOOTHING

 Better estimate for the probabilities of lower-order unigrams!

● Shannon game: I can’t see without my reading _______?

 Instead of P(w): “How likely is w?”

● P_continuation(w): “How likely is w to appear as a novel
continuation?”

 For each unigram, count the number of bigram types it
completes → i.e. is it always paired with one specific word, or does it appear in
many different bigrams?

KNESER-NEY SMOOTHING

 How many times does w appear as a novel continuation:

P_{CONTINUATION}(w) \propto \left|\{ w_{i-1} : c(w_{i-1}, w) > 0 \}\right|

 Normalised by the total number of bigram types:

P_{CONTINUATION}(w) = \frac{\left|\{ w_{i-1} : c(w_{i-1}, w) > 0 \}\right|}{\left|\{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \}\right|}

KNESER-NEY SMOOTHING

 In other words, P_continuation is:

● the number of word types seen to precede w,

● normalised by the number of word types preceding all words (i.e. the total number of bigram types).

 Example bigrams:

I am
We are
You are
They are
She is

P_continuation(“are”) = 3/5
P_continuation(“am”) = 1/5
P_continuation(“is”) = 1/5

KNESER-NEY SMOOTHING

KNESER-NEY SMOOTHING

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{CONTINUATION}(w_i)

λ is a normalising constant: the probability mass we’ve discounted,

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\left|\{ w : c(w_{i-1}, w) > 0 \}\right|

where d / c(w_{i-1}) is the normalised discount, and |{w : c(w_{i-1}, w) > 0}| is the number of word
types that can follow w_{i-1} = the number of word types we discounted = the number of times we
applied the normalised discount.
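A minimal sketch of interpolated Kneser-Ney for bigrams, following the two formulas above, assuming all bigram counts sit in a single Counter (the discount d = 0.75 and all names are illustrative):

```python
from collections import Counter

def kneser_ney_bigram(prev, word, bigrams, d=0.75):
    """Interpolated Kneser-Ney probability P_KN(word | prev) for bigrams.
    `bigrams` is a Counter over (w_prev, w) pairs."""
    c_prev = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    # Discounted bigram estimate
    discounted = max(bigrams[(prev, word)] - d, 0) / c_prev
    # lambda(prev): discounted mass, proportional to how many types follow prev
    followers = len({w2 for (w1, w2), c in bigrams.items() if w1 == prev and c > 0})
    lam = (d / c_prev) * followers
    # Continuation probability: how many types precede `word`,
    # normalised by the total number of bigram types
    preceders = len({w1 for (w1, w2), c in bigrams.items() if w2 == word and c > 0})
    total_types = len([1 for c in bigrams.values() if c > 0])
    p_cont = preceders / total_types
    return discounted + lam * p_cont

bigrams = Counter({("I", "am"): 1, ("We", "are"): 1, ("You", "are"): 1,
                   ("They", "are"): 1, ("She", "is"): 1})
print(kneser_ney_bigram("You", "are", bigrams))
# (1 - 0.75)/1 + (0.75/1)*1 * (3/5) = 0.25 + 0.45 = 0.70
```

Note that the continuation term counts bigram types rather than tokens, which is what keeps frequent-but-predictable words (like “Kingdom” after “United”) from getting an inflated unigram weight.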

 Laplace smoothing:

● OK for text classification, not for language modelling.

 Kneser-Ney smoothing is the most widely used.

 Stupid backoff is used for very large collections (e.g. the Web).

SUMMARY


RESOURCES

 KenLM Language Model Toolkit:
https://kheafield.com/code/kenlm/

 CMU Statistical Language Modeling Toolkit:
http://www.speech.cs.cmu.edu/SLM/toolkit.html

 Google Books N-gram Counts:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html



ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition. Sections 4.3–4.6.
