
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 3
Semester 1 2021 Week 2

Jey Han Lau

N-gram Language Models


Language Models

• One application of NLP is explaining language

‣ Why some sentences are more fluent than others

• E.g. in speech recognition:

• recognise speech > wreck a nice beach

• We measure ‘goodness’ using probabilities
estimated by language models

• Language models can also be used for generation


Language Models

• Useful for
‣ Query completion
‣ Optical character recognition

• Other generation tasks
‣ Machine translation
‣ Summarisation
‣ Dialogue systems

• Nowadays pretrained language models are the
backbone of modern NLP systems


Outline

• Deriving n-gram language models

• Smoothing to deal with sparsity

• Generating language


Probabilities: Joint to Conditional
Our goal is to get a probability for an arbitrary sequence of m words:

P(w1, w2, …, wm)

The first step is to apply the chain rule to convert the joint probability into conditional ones:

P(w1, w2, …, wm) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wm | w1, …, wm−1)


The Markov Assumption
Still intractable, so make a simplifying assumption:

P(wi | w1 … wi−1) ≈ P(wi | wi−n+1 … wi−1)

for some small n.

When n = 1, a unigram model:

P(w1 w2 … wm) = ∏_{i=1}^{m} P(wi)

When n = 2, a bigram model:

P(w1 w2 … wm) = ∏_{i=1}^{m} P(wi | wi−1)

When n = 3, a trigram model:

P(w1 w2 … wm) = ∏_{i=1}^{m} P(wi | wi−2 wi−1)

[Slide figure: sliding context windows over "the dog barks", showing which words condition wi under each model]
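A minimal Python sketch of the bigram factorisation (assuming a hypothetical bigram_prob(prev, word) function and the <s> / </s> book-end tags introduced below):

from math import prod

def sentence_prob(words, bigram_prob):
    """P(w1 ... wm) under a bigram model: a product of P(wi | wi-1) terms.

    `bigram_prob(prev, word)` is an assumed probability function; the
    sentence is padded with <s> / </s> book-end tags (see below).
    """
    padded = ["<s>"] + list(words) + ["</s>"]
    return prod(bigram_prob(prev, w) for prev, w in zip(padded, padded[1:]))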


Maximum Likelihood Estimation
How do we calculate the probabilities? Estimate based on counts in our corpus (M = total number of word tokens in the corpus).

For unigram models:

P(wi) = C(wi) / M          e.g. C(barks) / M

For bigram models:

P(wi | wi−1) = C(wi−1 wi) / C(wi−1)          e.g. C(dog barks) / C(dog)

For n-gram models generally:

P(wi | wi−n+1 … wi−1) = C(wi−n+1 … wi) / C(wi−n+1 … wi−1)
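A sketch of MLE bigram estimation from counts, assuming tokenised sentences and the <s> / </s> padding introduced on the next slide:

from collections import Counter

def train_bigram_mle(sentences):
    """Unsmoothed MLE bigram estimates: P(wi | wi-1) = C(wi-1 wi) / C(wi-1).

    `sentences` is a list of token lists; each sentence is padded with
    <s> / </s> book-end tags.
    """
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        unigram_counts.update(padded)
        bigram_counts.update(zip(padded, padded[1:]))

    def prob(prev, word):
        # Zero for unseen bigrams -- exactly the sparsity problem smoothing fixes.
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    return prob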


Book-ending Sequences

• Special tags used to denote the start and end of a sequence:

‣ <s> = sentence start
‣ </s> = sentence end


Trigram example
Corpus:

<s> <s> yes no no no no yes </s>
<s> <s> no no no yes yes yes no </s>

What is the probability of

yes no no yes

under a trigram language model, where P(wi | wi−2, wi−1) = C(wi−2 wi−1 wi) / C(wi−2 wi−1)?

P(yes no no yes) = P(yes | <s> <s>)
                 × P(no | <s> yes)
                 × P(no | yes no)
                 × P(yes | no no)
                 × P(</s> | no yes)

Need to predict </s> because it's the end of the sentence!


Corpus:
<s> <s> yes no no no no yes </s>
<s> <s> no no no yes yes yes no </s>

Compute: P(yes no no yes) = 1/2 × 1 × 1/2 × 2/5 × 1/2 = 0.05

P(yes | <s> <s>)   1/2    (<s> <s> yes … ; <s> <s> no …)

P(no | <s> yes)    1/1    (<s> yes no)

P(no | yes no)     1/2    (yes no no ; yes no </s>)

P(yes | no no)     2/5    (no no no ; no no no ; no no yes ; no no no ; no no yes)

P(</s> | no yes)   1/2    (no yes </s> ; no yes yes)
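A small script (not part of the original slides) that reproduces this computation by counting trigrams over the padded corpus; it prints 0.05:

from collections import Counter
from math import prod

corpus = [["yes", "no", "no", "no", "no", "yes"],
          ["no", "no", "no", "yes", "yes", "yes", "no"]]

trigram_counts, context_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent + ["</s>"]      # two <s> tokens for a trigram model
    trigram_counts.update(zip(padded, padded[1:], padded[2:]))
    context_counts.update(zip(padded, padded[1:]))

def p(w, u, v):
    """P(w | u v) = C(u v w) / C(u v)."""
    return trigram_counts[(u, v, w)] / context_counts[(u, v)]

query = ["<s>", "<s>", "yes", "no", "no", "yes", "</s>"]
print(prod(p(w, u, v) for u, v, w in zip(query, query[1:], query[2:])))   # ~0.05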


Several Problems
• Language has long distance effects — need large n

‣ The lecture/s that took place last week was/were on
preprocessing.

• Resulting probabilities are often very small
‣ Use log probability to avoid numerical underflow

• What about unseen words?
‣ Special symbol to represent them (e.g. <UNK>)

• Unseen n-grams?
‣ Need to smooth the LM!

P(w1, w2, …, wm) = P(w1 | <s>) × P(w2 | w1) × …

the whole product = 0 if P(w2 | w1) = 0
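A sketch of the log-probability trick mentioned above, again assuming a hypothetical bigram_prob function:

import math

def sentence_logprob(words, bigram_prob):
    """Sum log probabilities instead of multiplying probabilities.

    Avoids numerical underflow for long sequences; returns -inf as soon as
    a bigram has zero probability, which is why smoothing is needed.
    """
    padded = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for prev, w in zip(padded, padded[1:]):
        p = bigram_prob(prev, w)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total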


Smoothing


Smoothing

• Basic idea: give events you’ve never seen before
some probability

• Must be the case that P(everything) = 1
• Many different kinds of smoothing

‣ Laplacian (add-one) smoothing
‣ Add-k smoothing
‣ Absolute discounting
‣ Kneser-Ney
‣ And others…


Laplacian (Add-one) Smoothing

• Simple idea: pretend we’ve seen each n-gram
once more than we did.

For unigram models (V = the vocabulary):

Padd1(wi) = (C(wi) + 1) / (M + |V|)

For bigram models:

Padd1(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + |V|)


Add-one Example
<s> the rat ate the cheese </s>

What's the bigram probability P(ate | rat) under add-one smoothing?

= (C(rat ate) + 1) / (C(rat) + |V|) = 2/6

What's the bigram probability P(ate | cheese) under add-one smoothing?

= (C(cheese ate) + 1) / (C(cheese) + |V|) = 1/6

V = { the, rat, ate, cheese, </s> }

<s> is not part of the vocabulary because we never need to infer its conditional probability (e.g. P(<s> | …))

Recall: P(yes no no yes) = P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no)
                         × P(yes | no no) × P(</s> | no yes)
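A sketch of add-one estimation that reproduces the 2/6 and 1/6 values above (vocabulary handling as described on this slide):

from collections import Counter

def train_bigram_add_one(sentences):
    """Add-one smoothed bigram model:
    P_add1(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + |V|).

    V includes </s> but not <s>, since we never predict <s>.
    """
    unigram_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        unigram_counts.update(padded)
        bigram_counts.update(zip(padded, padded[1:]))
        vocab.update(padded[1:])          # everything except the leading <s>

    V = len(vocab)

    def prob(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

    return prob

# On "the rat ate the cheese": |V| = 5, so
# prob("rat", "ate") == 2/6 and prob("cheese", "ate") == 1/6, as above.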


Add-k Smoothing

• Adding one is often too much

• Instead, add a fraction k

• AKA Lidstone Smoothing

• Have to choose k

Paddk(wi | wi−1, wi−2) = (C(wi−2 wi−1 wi) + k) / (C(wi−2 wi−1) + k|V|)


Lidstone Smoothing

Example (slide figure): context = "alleged", with 5 observed bigrams (impropriety, offense, damage, deficiencies, outbreak) and 2 unobserved bigrams (including infirmity). C(alleged) = 20 and there are 7 candidate continuation words, so with k = 0.1:

‣ an observed bigram with count 8: (8 + 0.1) / (20 + 7 × 0.1) ≈ 0.391,
  i.e. an effective count of 0.391 × 20 ≈ 7.8

‣ an unobserved bigram: (0 + 0.1) / (20 + 7 × 0.1) ≈ 0.005

The smoothed probabilities still sum to 1.0 (effective counts sum to 20).
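The same figures can be reproduced directly; the counts below are the ones assumed in the example:

# Reproducing the figures on this slide (assumed counts from the example).
k, V, total = 0.1, 7, 20          # Lidstone k, candidate word types, C(alleged)

p_seen   = (8 + k) / (total + k * V)   # observed bigram with count 8  -> ~0.391
p_unseen = (0 + k) / (total + k * V)   # unobserved bigram             -> ~0.005

print(p_seen, p_seen * total)          # probability and effective count (~7.83)
print(p_unseen, p_unseen * total)      # probability and effective count (~0.097)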


Absolute Discounting

• ‘Borrows’ a fixed probability mass from observed
n-gram counts

• Redistributes it to unseen n-grams


Absolute Discounting

Continuing the "alleged" example (5 observed bigrams, 2 unobserved bigrams, C(alleged) = 20) with discount D = 0.1:

‣ each observed count is reduced by 0.1, e.g. 8 → 7.9

‣ the borrowed mass 0.1 × 5 = 0.5 is shared equally by the 2 unseen bigrams,
  giving each an effective count of (0.1 × 5) / 2 = 0.25, i.e. probability 0.25 / 20

The probabilities again sum to 1.0 (effective counts sum to 20).
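And the corresponding absolute-discounting figures, under the same assumed counts:

# Reproducing the figures on this slide (same assumed "alleged" counts).
D, total = 0.1, 20                 # discount and C(alleged)
n_seen, n_unseen = 5, 2            # observed / unobserved bigram types

p_count8 = (8 - D) / total                     # observed count 8 -> 7.9 / 20
p_unseen = (D * n_seen) / n_unseen / total     # each unseen bigram gets 0.25 / 20

print(p_count8, p_unseen)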


Backoff

• Absolute discounting redistributes the probability
mass equally for all unseen n-grams

• Katz Backoff: redistributes the mass based on a
lower order model (e.g. unigram)

Pkatz(wi | wi−1) =

    (C(wi−1, wi) − D) / C(wi−1),                              if C(wi−1, wi) > 0

    α(wi−1) × P(wi) / Σ_{wj : C(wi−1, wj) = 0} P(wj),         otherwise

where:
‣ α(wi−1) is the amount of probability mass that has been discounted for
  context wi−1 ((0.1 × 5) / 20 in the previous slide)
‣ P(wi) is the unigram probability for wi (e.g. P(infirmity))
‣ the denominator sums the unigram probabilities of all words that do not
  co-occur with context wi−1 (e.g. P(infirmity) + P(alleged))
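A rough sketch of the backoff case analysis, assuming pre-computed Counters and a unigram distribution:

def katz_bigram_prob(prev, word, bigram_counts, unigram_counts, unigram_prob, D=0.1):
    """A sketch of Katz backoff for bigrams (assumed count/probability inputs).

    bigram_counts / unigram_counts are Counters; unigram_prob maps word -> P(word).
    """
    if bigram_counts[(prev, word)] > 0:
        return (bigram_counts[(prev, word)] - D) / unigram_counts[prev]

    # alpha(prev): total probability mass discounted from this context
    seen = [w for w in unigram_prob if bigram_counts[(prev, w)] > 0]
    alpha = D * len(seen) / unigram_counts[prev]

    # share alpha among unseen words in proportion to their unigram probability
    unseen_mass = sum(p for w, p in unigram_prob.items() if bigram_counts[(prev, w)] == 0)
    return alpha * unigram_prob[word] / unseen_mass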


Issues with Katz Backoff

• I can’t see without my reading ___

• C(reading, glasses) = C(reading, Francisco) = 0

• C(Francisco) > C(glasses)

• Katz backoff will give higher probability to
Francisco

Pkatz(wi | wi−1) =

    (C(wi−1, wi) − D) / C(wi−1),                              if C(wi−1, wi) > 0

    α(wi−1) × P(wi) / Σ_{wj : C(wi−1, wj) = 0} P(wj),         otherwise


Kneser-Ney Smoothing
• Redistribute probability mass based on the versatility of the lower order n-gram

• AKA “continuation probability”

• What is versatility?

‣ High versatility -> co-occurs with a lot of unique words,
e.g. glasses

– men’s glasses, black glasses, buy glasses, etc

‣ Low versatility -> co-occurs with few unique words, e.g.
francisco

– san francisco


Kneser-Ney Smoothing

PKN(wi | wi−1) =

    (C(wi−1, wi) − D) / C(wi−1),        if C(wi−1, wi) > 0

    β(wi−1) Pcont(wi),                  otherwise

where β(wi−1) is the amount of probability mass that has been discounted for context wi−1, and

Pcont(wi) = |{wi−1 : C(wi−1, wi) > 0}| / Σ_{wi} |{wi−1 : C(wi−1, wi) > 0}|

• Intuitively the numerator of Pcont counts the number of unique wi−1 that co-occur with wi

• High continuation counts for glasses

• Low continuation counts for Francisco
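A sketch of the continuation probability, computed from a Counter of bigram counts (an assumed input):

from collections import Counter

def continuation_prob(bigram_counts):
    """P_cont(w) = |{v : C(v, w) > 0}| / sum over w' of |{v : C(v, w') > 0}|.

    bigram_counts is a Counter over (prev, word) pairs; a word's "versatility"
    is the number of distinct left contexts it has been seen with.
    """
    n_contexts = Counter(word for (prev, word), c in bigram_counts.items() if c > 0)
    total = sum(n_contexts.values())
    return {w: n / total for w, n in n_contexts.items()}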


Interpolation
• A better way to combine different orders of n-gram models

• Weighted sum of probabilities across progressively shorter
contexts

• Interpolated trigram model:

PIN(wi | wi−2, wi−1) = λ3 P3(wi | wi−2, wi−1) + λ2 P2(wi | wi−1) + λ1 P1(wi)

where λ1 + λ2 + λ3 = 1, and the λs are learned on held-out data
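A sketch of the interpolated trigram probability; p1, p2, p3 and the lambda values are placeholders, not tuned:

def interpolated_trigram_prob(u, v, w, p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
    """P_IN(w | u, v) = l3*P3(w | u, v) + l2*P2(w | v) + l1*P1(w).

    p1, p2, p3 are assumed callables for unigram, bigram and trigram
    probabilities; the lambdas (summing to 1) would normally be tuned
    on held-out data rather than fixed here.
    """
    l1, l2, l3 = lambdas
    return l3 * p3(w, u, v) + l2 * p2(w, v) + l1 * p1(w)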


Interpolated Kneser-Ney Smoothing

• Interpolation instead of back-off


PIKN(wi | wi−1) = (C(wi−1, wi) − D) / C(wi−1) + γ(wi−1) Pcont(wi)

• where γ(wi−1) is a normalising constant such that

Σ_{wi ∈ V} PIKN(wi | wi−1) = 1.0
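A sketch combining the discounted term with the continuation probability (assumed count inputs; a simplified version, not the full production recipe):

def ikn_bigram_prob(prev, word, bigram_counts, unigram_counts, p_cont, D=0.1):
    """Interpolated Kneser-Ney for bigrams (sketch, assumed count inputs).

    Discounted MLE term plus gamma(prev) * P_cont(word), where gamma(prev)
    is the mass removed from this context's observed bigrams.
    """
    discounted = max(bigram_counts[(prev, word)] - D, 0.0)
    seen_types = sum(1 for (p, w), c in bigram_counts.items() if p == prev and c > 0)
    gamma = D * seen_types / unigram_counts[prev]
    return discounted / unigram_counts[prev] + gamma * p_cont.get(word, 0.0)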


In Practice

• Commonly used Kneser-Ney language models
use 5-grams as max order

• Different discount values are used for each n-gram order


PollEv.com/jeyhanlau569

How can we use an n-gram language model to fix spelling errors?

• Build a character-level LM and fix it character by character

• Build a word-level LM and use left/right context words to find the most appropriate substituting word

• Build a LM using common typos as training data

• N-gram LMs can't fix spelling errors


Generating Language


Generation

• Given an initial word, draw the next word according to
the probability distribution produced by the language
model.

• Include (n−1) <s> tokens for an n-gram model, to provide context to generate the first word

‣ never generate <s>

‣ generating </s> terminates the sequence


Generation (Bigram LM)

• Sentence = <s>

• P(? | <s>) = “a”

[bar chart of the bigram distribution P(? | <s>) over the vocabulary]


Generation (Bigram LM)

• Sentence = <s> a

• P(? | a) = “cow”

[bar chart of P(? | a) over the vocabulary]


Generation (Bigram LM)

• Sentence = <s> a cow

• P(? | cow) = “eats”

[bar chart of P(? | cow) over the vocabulary]


Generation (Bigram LM)

• Sentence = <s> a cow eats

• P(? | eats) = “grass”

[bar chart of P(? | eats) over the vocabulary]


Generation (Bigram LM)

• Sentence = <s> a cow eats grass

• P(? | grass) = “</s>”

[bar chart of P(? | grass) over the vocabulary]


Generation (Bigram LM)

• Sentence = <s> a cow eats grass </s>

• Done!


How to Select Next Word?

• Argmax: takes highest probability word each turn

‣ Greedy search

• Beam search decoding (lecture 17):

‣ Keeps track of top-N highest probability words
each turn

‣ Select sequence of words that produce the best
sentence probability

• Random sampling: draw the next word from the distribution each turn
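A sketch of generation from a bigram LM, with sampling as the default and greedy decoding as a commented alternative (bigram_prob and vocab are assumed inputs):

import random

def generate(bigram_prob, vocab, max_len=20):
    """Generate a sentence from a bigram LM (assumed bigram_prob function).

    Samples the next word from P(? | previous word); swap the sampling line
    for the max() line to get greedy (argmax) decoding instead.
    """
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates = list(vocab) + ["</s>"]
        weights = [bigram_prob(prev, w) for w in candidates]
        word = random.choices(candidates, weights=weights, k=1)[0]
        # word = max(candidates, key=lambda w: bigram_prob(prev, w))   # greedy
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)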


A Final Word

• N-gram language models are a simple but
effective way to capture the predictability of
language

• Can be trained in an unsupervised fashion,
scalable to large corpora

• Require smoothing to be effective

• Modern language models use neural networks (lecture 8)


Reading

• E18 Chapter 6 (skip 6.3)