
LECTURE 4

Advanced Language Models: Generalisation and Zeros

Arkaitz Zubiaga, 22nd January, 2018

2

 Generalisation of Language Models and Zeros

 Smoothing Approaches for Generalisation

● Laplace smoothing.

● Interpolation, backoff and web-scale LMs.

● Good Turing Smoothing.

● Kneser-Ney Smoothing.

LECTURE 4: CONTENTS

3

 Language models: P(I, am, fine) → high
P(fine, am, I) → low or 0

 Remember: language models work best on data that looks similar to the data they were trained on.

 If we train on social media data, the model may not work well on novels.

 OMG I’m ROTFL! → may be frequent in social media, but unlikely in novels!

GENERALISATION OF LANGUAGE MODELS

4

 Limitation: We’re assuming that all n-grams in new, unseen data will have been observed in the training data.

 Is this the reality though?

 We need to consider the possibility of new n-grams in unseen
data.

GENERALISATION OF LANGUAGE MODELS

5

 Choose a random starting bigram (<s>, w) based on its probability.

 While x != </s>:

 Choose the next random bigram (w, x) based on its probability.

 Concatenate all bigrams.

THE SHANNON VISUALISATION METHOD

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
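A minimal Python sketch of this sampling procedure, assuming the bigram probabilities are stored in a dictionary mapping each word to its possible successors and their probabilities (the toy bigram_probs table below is illustrative, not the lecture's data):

import random

# Toy bigram model: bigram_probs[w][x] = P(x | w).
# "<s>" and "</s>" mark sentence start and end, as on the slide.
bigram_probs = {
    "<s>":     {"I": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate_sentence(bigram_probs):
    """Sample a sentence by repeatedly drawing the next bigram (w, x)."""
    w, words = "<s>", []
    while True:
        successors = bigram_probs[w]
        # Choose x with probability P(x | w).
        x = random.choices(list(successors), weights=list(successors.values()))[0]
        if x == "</s>":
            return " ".join(words)
        words.append(x)
        w = x

print(generate_sentence(bigram_probs))  # "I want to eat Chinese food"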

6

SHANNON VIS. METHOD FOR SHAKESPEARE

7

 Significant improvement as we train on longer n-grams.

 But it’s a tradeoff:

 Longer n-grams → more accurate.

 Longer n-grams → more sparse.

 We need to find a balance.

SHANNON VIS. METHOD FOR SHAKESPEARE

8

 Shakespeare used:

 N = 884,647 tokens.

 V = 29,066 types → V² ≈ 845M possible bigrams!

 Shakespeare only produced ~300,000 bigram types (≈ 0.04% of all possible bigrams!)

 Other bigrams may be possible, but we haven’t observed
them.

SHAKESPEARE’S BIGRAMS

9

 Training set:

 … found a penny

 … found a solution

 … found a tenner

 … found a book

 P(card | found a) = 0

ZEROS

 Test set:

 … found a fiver

 … found a card

10

 We have sparse statistics:
P(w | “found a”)
3 → penny
2 → solution
1 → tenner
1 → book

7 → total

THE INTUITION OF SMOOTHING

[Figure: bar chart of the observed counts following “found a”: penny, solution, tenner and book have non-zero counts; fiver, card and other unseen words are at zero.]

11

 We’d like to improve the distribution:
P(w | “found a”)
3 → penny → 2.5
2 → solution → 1.5
1 → tenner → 0.5
1 → book → 0.5

other → 2

7 → total

THE INTUITION OF SMOOTHING

[Figure: bar chart of the smoothed counts: penny, solution, tenner and book are discounted slightly, and previously unseen words such as fiver and card now receive a small count.]

12

 We’ll see 4 possible solutions:

1) Laplace smoothing.

2) Interpolation, backoff and web-scale LMs.

3) Good Turing Smoothing.

4) Kneser-Ney Smoothing.

FOUR POSSIBLE SOLUTIONS

LAPLACE SMOOTHING

14

 Pretend we have seen each word one more time than we actually did.

 MLE estimate:

 Add-one estimate:

LAPLACE SMOOTHING: ADD-ONE ESTIMATION

P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
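A minimal Python sketch of the two estimates, assuming unigram and bigram counts collected from a (toy) training corpus; the helper names train_counts, p_mle and p_add1 are illustrative:

from collections import Counter

def train_counts(corpus):
    """Collect unigram and bigram counts from a list of tokenised sentences."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def p_mle(w, w_prev, unigram_counts, bigram_counts):
    # P_MLE(w | w_prev) = c(w_prev, w) / c(w_prev); zero here for unseen histories.
    c_prev = unigram_counts[w_prev]
    return bigram_counts[(w_prev, w)] / c_prev if c_prev else 0.0

def p_add1(w, w_prev, unigram_counts, bigram_counts, V):
    # P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

uni, bi = train_counts([["I", "want", "to", "eat", "Chinese", "food"]])
V = len(uni)
print(p_mle("card", "want", uni, bi), p_add1("card", "want", uni, bi, V))  # 0.0 vs. a non-zero value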

15

 Add 1 to ALL values in the matrix.

LAPLACE SMOOTHED BIGRAM COUNTS

16

LAPLACE SMOOTHED BIGRAM PROBABILITIES

17

RECONSTITUTED COUNTS

18

 Original:

 Smoothed:

COMPARING ORIGINAL VS SMOOTHED

19

 Original:

 Smoothed:

COMPARING ORIGINAL VS SMOOTHED

dividing by 10!

20

MORE GENERAL LAPLACE SMOOTHING: ADD-k

P_Add-k(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + k) / (c(w_{i-1}) + kV)
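As a worked illustration with hypothetical numbers: suppose c(w_{i-1}) = 100, V = 10,000, and w_{i-1} has 50 observed continuation types. Add-one leaves the observed bigrams only (100 + 50·1)/(100 + 10,000) ≈ 0.015 of the probability mass, whereas add-k with k = 0.05 leaves them (100 + 50·0.05)/(100 + 0.05·10,000) ≈ 0.17, so a small k moves far less mass onto unseen bigrams.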

21

 It’s generally not a good solution for language models.

 With such a sparse matrix, we’re replacing too many 0’s with 1’s.

 It’s still often used for other NLP tasks, with fewer 0’s.

 e.g. text classification using unigrams.

LAPLACE SMOOTHING IS NOT IDEAL FOR LMs

INTERPOLATION, BACKOFF AND
WEB-SCALE LMs

23

 For rare cases → you may choose to use less context

 Backoff:

 use trigram if you have good evidence,

 otherwise bigram, otherwise unigram

 Interpolation:

 mix unigram, bigram, trigram

 Interpolation works better

BACKOFF AND INTERPOLATION

24

 Simple interpolation: mix the unigram, bigram and trigram estimates with weights λ_i (where Σ_i λ_i = 1):

P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 P(w_i | w_{i-2}, w_{i-1}) + λ_2 P(w_i | w_{i-1}) + λ_3 P(w_i)

 Lambdas conditional on context: each λ can additionally be conditioned on the preceding words, i.e. λ_i(w_{i-2}, w_{i-1}).

LINEAR INTERPOLATION

25

 Use a held-out (or development) corpus:

 Fix the N-gram probabilities (on the training data)

 Then search for λs that give largest probability to held-out set:

HOW DO WE SET THOSE LAMBDAS?

log P(w_1 … w_n | M(λ_1 … λ_k)) = Σ_i log P_{M(λ_1…λ_k)}(w_i | w_{i-1})

[Diagram: corpus split into Training Data | Held-Out Data | Test Data]
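A minimal sketch of this search, assuming trigram, bigram and unigram probability functions have already been estimated on the training data (p_trigram, p_bigram and p_unigram are hypothetical helpers) and that the λs are found by a simple grid search over the held-out sentences:

import itertools
import math

def heldout_log_prob(sentences, lambdas, p_unigram, p_bigram, p_trigram):
    """log P(held-out data) under the interpolated model with the given lambdas."""
    l1, l2, l3 = lambdas
    total = 0.0
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(tokens)):
            w, prev1, prev2 = tokens[i], tokens[i - 1], tokens[i - 2]
            # Assumed signatures: p_trigram(w, u, v) ~ P(w | u, v), p_bigram(w, v) ~ P(w | v).
            p = (l1 * p_trigram(w, prev2, prev1)
                 + l2 * p_bigram(w, prev1)
                 + l3 * p_unigram(w))
            total += math.log(p) if p > 0 else float("-inf")
    return total

def best_lambdas(heldout, p_unigram, p_bigram, p_trigram, step=0.1):
    """Grid-search lambdas (summing to 1) that maximise the held-out log probability."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    candidates = [(l1, l2, round(1 - l1 - l2, 10))
                  for l1, l2 in itertools.product(grid, grid)
                  if round(1 - l1 - l2, 10) >= 0]
    return max(candidates,
               key=lambda ls: heldout_log_prob(heldout, ls, p_unigram, p_bigram, p_trigram))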

26

 Two scenarios:

1) We know ALL words (including those in the test) in advance.

 We have a fixed vocabulary V – closed vocabulary task

2) We may find new, unseen words (generally the case)

 Out Of Vocabulary (OOV) words – open vocabulary task

OPEN VS CLOSED VOCABULARY TASKS

27

 We can use a new token <UNK> for all unknown words:

 How to train:

 Create a fixed lexicon L of size V.

 While preprocessing text, change words not in L to <UNK>.

 Train the LM, where <UNK> is just another token.

 In test: new words are assigned the probability of <UNK>.

HOW DO WE DEAL WITH OOV WORDS?
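A minimal sketch of this preprocessing, assuming the lexicon L is taken to be the V most frequent training words (build_lexicon and replace_oov are illustrative names):

from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_sentences, V):
    """Keep the V most frequent training words as the fixed lexicon L."""
    counts = Counter(w for sentence in train_sentences for w in sentence)
    return {w for w, _ in counts.most_common(V)}

def replace_oov(sentence, lexicon):
    """Map any word outside L to <UNK> (applied at both training and test time)."""
    return [w if w in lexicon else UNK for w in sentence]

lexicon = build_lexicon([["I", "am", "fine"], ["I", "am", "happy"]], V=3)
print(replace_oov(["I", "am", "ecstatic"], lexicon))  # ['I', 'am', '<UNK>']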

28

 “Stupid backoff” (Brants et al. 2007)

 No discounting, just use relative frequencies

SMOOTHING FOR VERY LARGE CORPORA

S(w_i | w_{i-k+1}^{i-1}) =
  count(w_{i-k+1}^{i}) / count(w_{i-k+1}^{i-1})   if count(w_{i-k+1}^{i}) > 0
  0.4 · S(w_i | w_{i-k+2}^{i-1})                  otherwise

S(w_i) = count(w_i) / N
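A minimal sketch of stupid backoff for a k-gram model, assuming the n-gram counts are kept in dictionaries keyed by tuples (the data layout is illustrative, not from Brants et al.):

def stupid_backoff(word, context, ngram_counts, total_tokens, alpha=0.4):
    """S(word | context) with the fixed back-off weight 0.4 from the slide.

    ngram_counts[n] maps n-gram tuples to their counts; context is a tuple of
    preceding words (possibly empty); total_tokens is N, the corpus size."""
    if not context:
        return ngram_counts[1].get((word,), 0) / total_tokens   # S(w) = count(w) / N
    full = context + (word,)
    num = ngram_counts[len(full)].get(full, 0)
    den = ngram_counts[len(context)].get(context, 0)
    if num > 0 and den > 0:
        return num / den                                        # relative frequency, no discounting
    # Back off to a shorter context, scaled by 0.4.
    return alpha * stupid_backoff(word, context[1:], ngram_counts, total_tokens, alpha)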

GOOD TURING SMOOTHING

30

 To estimate the counts of unseen things

→ use the count of things we’ve seen once

INTUITION OF GOOD TURING SMOOTHING

31

 N_C = number of word types observed C times

 e.g. I am happy and I am fine and I have to go

I → 3
am → 2
and → 2
happy → 1
fine → 1
have → 1
to → 1
go → 1

FREQUENCY OF FREQUENCY “C”

N_1 = 5

N_2 = 2

N_3 = 1
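A minimal Python sketch computing these frequency-of-frequency values, together with the Good-Turing adjusted count c* = (c+1) N_{c+1} / N_c used on the following slides (variable names are illustrative):

from collections import Counter

tokens = "I am happy and I am fine and I have to go".split()
word_counts = Counter(tokens)                  # I -> 3, am -> 2, and -> 2, happy -> 1, ...

# N_c = number of word types observed exactly c times.
freq_of_freq = Counter(word_counts.values())   # {1: 5, 2: 2, 3: 1}

def gt_adjusted_count(c, freq_of_freq):
    """c* = (c + 1) * N_{c+1} / N_c (only reliable while N_c and N_{c+1} are non-zero)."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c] if freq_of_freq[c] else 0.0

print(freq_of_freq[1], freq_of_freq[2], freq_of_freq[3])   # 5 2 1
print(gt_adjusted_count(1, freq_of_freq))                  # 2 * 2 / 5 = 0.8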

32

 You are fishing, and caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 How likely is it that the next fish is a trout?

 1/18

INTUITION OF GOOD TURING SMOOTHING

33

 You are fishing, and caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 How likely is it that the next fish is new, e.g. haddock?

 Use our estimate of things-we-saw-once → 3/18

 OK, but then we need to revise all the other probabilities.

INTUITION OF GOOD TURING SMOOTHING

34

GOOD TURING CALCULATIONS

 P*_GT(unseen, i.e. c = 0) = N_1 / N, and for seen items the adjusted count is c* = (c+1) N_{c+1} / N_c, where:

 You are fishing, and caught:

 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

 P(next fish is unseen) = N_1 / N = 3/18
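Continuing the example for a fish we saw once: trout has c = 1, so its adjusted count is c* = (1+1) · N_2 / N_1 = 2 · (1/3) = 2/3, giving P*_GT(trout) = (2/3) / 18 ≈ 1/27, which is less than the MLE 1/18 because some probability mass has been set aside for unseen fish.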

35

 Problem: what about the word “and”? (say c = 12156)

● For small c, N_c > N_{c+1}

● For large c? N_{12157} will likely be zero.

GOOD TURING ISSUES


36

 Solution: avoid zeros for large values of c.

● Simple Good-Turing [Gale and Sampson]:
replace N_c with a best-fit power law where the counts become unreliable

GOOD TURING ISSUES


37

 Corpus with 22 million words of AP Newswire

GOOD TURING NUMBERS

c* = (c+1) N_{c+1} / N_c

Count c    Good Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

KNESER-NEY SMOOTHING

39

 Corpus with 22 million words of AP Newswire

 For c >= 2:

● c* ≈ (c – .75)

GOOD TURING NUMBERS

c* = (c+1) N_{c+1} / N_c

Count c    Good Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

40

 Save ourselves some time and just subtract 0.75 (or some d)!

 But should we really just use the regular unigram P(w)?

ABSOLUTE DISCOUNTING INTERPOLATION

P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) − d) / c(w_{i-1}) + λ(w_{i-1}) P(w)

where the first term is the discounted bigram, λ(w_{i-1}) is the interpolation weight, and P(w) is the unigram.

41

 Better estimate for the probabilities of lower-order unigrams!

● Shannon game: I can’t see without my reading _______?

● We have never seen “reading glasses” in training data.

● P(“Kingdom”) > P(“glasses”)
→ but “Kingdom” always comes after “United”

KNESER-NEY SMOOTHING

42

 Better estimate for the probabilities of lower-order unigrams!

● Shannon game: I can’t see without my reading _______?

 Instead of P(w): “How likely is w?”

● P_continuation(w): “How likely is w to appear as a novel continuation?”

 For each unigram, count the number of bigram types it completes → i.e. is it always paired with a specific word, or is it likely in many bigrams?

KNESER-NEY SMOOTHING

43

 How many times does w appear as a novel continuation:

 Normalised by the total number of bigram types:

KNESER-NEY SMOOTHING

P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|

P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|

44

 In other words, P_continuation is:

● The number of word types seen to precede w,

● normalised by the number of word types preceding all words (i.e. the total number of bigram types).

I am
We are
You are
They are
She is

KNESER-NEY SMOOTHING

45

 In other words, P_continuation is:

● The number of word types seen to precede w,

● normalised by the number of word types preceding all words.

I am
We are
You are          P_continuation(“are”) = 3/5
They are
She is

KNESER-NEY SMOOTHING

46

 In other words, P_continuation is:

● The number of word types seen to precede w,

● normalised by the number of word types preceding all words.

I am
We are
You are          P_continuation(“am”) = 1/5
They are
She is

KNESER-NEY SMOOTHING

47

 In other words, P_continuation is:

● The number of word types seen to precede w,

● normalised by the number of word types preceding all words.

I am
We are
You are          P_continuation(“is”) = 1/5
They are
She is

KNESER-NEY SMOOTHING
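A minimal Python sketch that reproduces these continuation probabilities from the five example bigrams above (data structures are illustrative):

from collections import defaultdict

bigrams = [("I", "am"), ("We", "are"), ("You", "are"), ("They", "are"), ("She", "is")]

# For each word w, collect the set of distinct word types seen to precede it.
preceding_types = defaultdict(set)
for w_prev, w in bigrams:
    preceding_types[w].add(w_prev)

total_bigram_types = len(set(bigrams))

def p_continuation(w):
    return len(preceding_types[w]) / total_bigram_types

print(p_continuation("are"), p_continuation("am"), p_continuation("is"))  # 0.6 0.2 0.2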

48

KNESER-NEY SMOOTHING

P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) − d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)

λ(w_{i-1}) = ( d / c(w_{i-1}) ) · |{w : c(w_{i-1}, w) > 0}|

λ is a normalising constant: the probability mass we’ve discounted.
d / c(w_{i-1}) is the normalised discount.
|{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1}
= the number of word types we discounted
= the number of times we applied the normalised discount.
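A minimal sketch of interpolated Kneser-Ney for bigrams, put together directly from the two formulas above; it assumes plain Python counters for the counts and is an illustration rather than a reference implementation:

from collections import Counter, defaultdict

def train_kn(sentences, d=0.75):
    unigram_counts, bigram_counts = Counter(), Counter()
    followers, preceders = defaultdict(set), defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigram_counts.update(tokens)
        for w_prev, w in zip(tokens, tokens[1:]):
            bigram_counts[(w_prev, w)] += 1
            followers[w_prev].add(w)      # word types that can follow w_prev
            preceders[w].add(w_prev)      # word types seen to precede w
    total_bigram_types = len(bigram_counts)

    def p_continuation(w):
        return len(preceders[w]) / total_bigram_types

    def p_kn(w, w_prev):
        c_prev = unigram_counts[w_prev]
        if c_prev == 0:
            return p_continuation(w)      # unseen history: fall back to the continuation probability
        discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
        lam = (d / c_prev) * len(followers[w_prev])
        return discounted + lam * p_continuation(w)

    return p_kn

p_kn = train_kn([["I", "am", "fine"], ["I", "am", "happy"]])
print(p_kn("am", "I"))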

49

 Laplace smoothing:

● OK for text classification, not for language modelling

 Kneser-Ney is most widely used.

 Stupid backoff for very large collections (e.g. the Web)

SUMMARY

50

RESOURCES

 KenLM Language Model Toolkit:
https://kheafield.com/code/kenlm/

 CMU Statistical Language Modeling Toolkit:
http://www.speech.cs.cmu.edu/SLM/toolkit.html

 Google Books N-gram Counts:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

51

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters
4.3-4.6.