
N-gram Language Models
COMP90042
Natural Language Processing
Lecture 3
Semester 1 2021 Week 2 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

Language Models
• One application of NLP is explaining language
‣ Why are some sentences more fluent than others?
• We measure 'goodness' using probabilities estimated by language models
‣ E.g. in speech recognition: recognise speech > wreck a nice beach
• Language models can also be used for generation


Language Models
• Useful for:
‣ Query completion
‣ Optical character recognition
• Other generation tasks:
‣ Machine translation
‣ Summarisation
‣ Dialogue systems
• Nowadays pretrained language models are the backbone of modern NLP systems

Outline
• Deriving n-gram language models
• Smoothing to deal with sparsity
• Generating language

Probabilities: Joint to Conditional
• Our goal is to get a probability for an arbitrary sequence of m words:
P(w1, w2, …, wm)
• First step is to apply the chain rule to convert joint probabilities to conditional ones:
P(w1, w2, …, wm) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wm | w1, …, wm−1)

The Markov Assumption
• Still intractable, so we make a simplifying assumption:
P(wi | w1 … wi−1) ≈ P(wi | wi−n+1 … wi−1), for some small n
• When n = 1, a unigram model:
P(w1, w2, …, wm) = ∏(i = 1 to m) P(wi)
‣ e.g. for "the dog barks": P(the) P(dog) P(barks)
• When n = 2, a bigram model:
P(w1, w2, …, wm) = ∏(i = 1 to m) P(wi | wi−1)
‣ e.g. for "the dog barks": P(the) P(dog | the) P(barks | dog)
• When n = 3, a trigram model:
P(w1, w2, …, wm) = ∏(i = 1 to m) P(wi | wi−2, wi−1)
‣ e.g. for "the dog barks": P(the) P(dog | the) P(barks | the, dog)

Maximum Likelihood Estimation
• How do we calculate the probabilities? Estimate based on counts in our corpus (M = total number of word tokens).
• For unigram models:
P(wi) = C(wi) / M,   e.g. C(barks) / M
• For bigram models:
P(wi | wi−1) = C(wi−1, wi) / C(wi−1),   e.g. C(dog barks) / C(dog)
• For n-gram models generally:
P(wi | wi−n+1 … wi−1) = C(wi−n+1 … wi) / C(wi−n+1 … wi−1)
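As a rough illustration of these maximum likelihood estimates (not part of the original slides), here is a minimal Python sketch over a made-up two-sentence corpus; the sentences and helper names are purely illustrative:

    from collections import Counter

    # Toy corpus, book-ended with <s> and </s> (made up for illustration)
    corpus = [
        "<s> the dog barks </s>".split(),
        "<s> the cat meows </s>".split(),
    ]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
    M = sum(unigrams.values())  # total number of word tokens

    p_unigram = lambda w: unigrams[w] / M
    p_bigram = lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

    print(p_unigram("barks"))        # C(barks) / M = 1/10
    print(p_bigram("barks", "dog"))  # C(dog barks) / C(dog) = 1/1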

Book-ending Sequences
• Special tags are used to denote the start and end of a sequence:
‣ <s> = sentence start
‣ </s> = sentence end

Trigram example
Corpus:
<s> <s> yes no no no no yes </s>
<s> <s> no no no yes yes yes no </s>
• What is the probability of
yes no no yes
under a trigram language model?
P(yes no no yes) =
P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no) × P(yes | no no) × P(</s> | no yes)
• We need to predict </s> because it's the end of the sentence!

Compute: P(yes no no yes), using
P(wi | wi−2, wi−1) = C(wi−2, wi−1, wi) / C(wi−2, wi−1)

P(yes | <s> <s>) = 1/2   matching contexts: <s> <s> yes, <s> <s> no
P(no | <s> yes) = 1/1    matching context: <s> yes no
P(no | yes no) = 1/2     matching contexts: yes no no, yes no </s>
P(yes | no no) = 2/5     matching contexts: no no no, no no no, no no yes, no no no, no no yes
P(</s> | no yes) = 1/2   matching contexts: no yes </s>, no yes yes

P(yes no no yes) = 1/2 × 1 × 1/2 × 2/5 × 1/2 = 0.05
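The same calculation can be checked mechanically. Below is a small Python sketch (not from the slides) that builds trigram and bigram counts from the two padded sentences and multiplies out the conditional probabilities; the variable names are illustrative:

    from collections import Counter

    # The worked-example corpus, padded with two <s> tokens and a closing </s>
    corpus = [
        "<s> <s> yes no no no no yes </s>".split(),
        "<s> <s> no no no yes yes yes no </s>".split(),
    ]

    trigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        for i in range(2, len(sent)):
            trigrams[tuple(sent[i - 2:i + 1])] += 1  # (w_{i-2}, w_{i-1}, w_i)
            bigrams[tuple(sent[i - 2:i])] += 1       # context (w_{i-2}, w_{i-1})

    def p(word, context):
        """MLE trigram probability P(word | context)."""
        return trigrams[context + (word,)] / bigrams[context]

    query = "<s> <s> yes no no yes </s>".split()
    prob = 1.0
    for i in range(2, len(query)):
        prob *= p(query[i], tuple(query[i - 2:i]))
    print(prob)  # ≈ 0.05 (= 1/2 × 1 × 1/2 × 2/5 × 1/2)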

Several Problems
• Language has long-distance effects: we need large n
‣ The lecture/s that took place last week was/were on preprocessing.
• Resulting probabilities are often very small
‣ Use log probabilities to avoid numerical underflow (see the sketch after this list)
• What about unseen words?
‣ Use a special symbol to represent them (e.g. <UNK>)
• Unseen n-grams?
‣ P(w1, w2, …, wm) = P(w1 | <s>) × P(w2 | w1) × …, so the whole product = 0 if P(w2 | w1) = 0
‣ Need to smooth the LM!
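To make the log-probability point concrete, here is a tiny sketch (not from the slides, values made up): multiplying many small probabilities underflows to zero, while summing their logs stays stable:

    import math

    probs = [1e-5] * 100  # e.g. 100 word probabilities of 1e-5 each

    product = 1.0
    for p in probs:
        product *= p
    print(product)  # 0.0 -- underflow, the true value 1e-500 is too small for a float

    log_prob = sum(math.log(p) for p in probs)
    print(log_prob)  # about -1151.3, still usable for comparing sequences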

Smoothing

Smoothing
• Basic idea: give events you've never seen before some probability
• Must be the case that P(everything) = 1
• Many different kinds of smoothing:
‣ Laplacian (add-one) smoothing
‣ Add-k smoothing
‣ Absolute discounting
‣ Kneser-Ney
‣ And others…

Laplacian (Add-one) Smoothing
• Simple idea: pretend we've seen each n-gram once more than we did.
• For unigram models (V = the vocabulary):
Padd1(wi) = (C(wi) + 1) / (M + |V|)
• For bigram models:
Padd1(wi | wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + |V|)

Add-one Example
Corpus: <s> the rat ate the cheese </s>
• What's the bigram probability P(ate | rat) under add-one smoothing?
= (C(rat ate) + 1) / (C(rat) + |V|) = 2/6,   with V = {the, rat, ate, cheese, </s>}
• What's the bigram probability P(ate | cheese) under add-one smoothing?
= (C(cheese ate) + 1) / (C(cheese) + |V|) = 1/6
• <s> is not part of the vocabulary because we never need to infer its conditional probability (e.g. P(<s> | …))
‣ Recall: P(yes no no yes) = P(yes | <s> <s>) × P(no | <s> yes) × P(no | yes no) × P(yes | no no) × P(</s> | no yes)
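A minimal sketch of this add-one calculation in Python (not from the slides); the function name p_add1 is just for illustration:

    from collections import Counter

    # The add-one example corpus, book-ended with <s> and </s>
    tokens = "<s> the rat ate the cheese </s>".split()
    vocab = {"the", "rat", "ate", "cheese", "</s>"}  # <s> excluded: it is never predicted

    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)

    def p_add1(word, prev):
        """Add-one smoothed bigram probability P(word | prev)."""
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    print(p_add1("ate", "rat"))     # (1 + 1) / (1 + 5) = 2/6
    print(p_add1("ate", "cheese"))  # (0 + 1) / (1 + 5) = 1/6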

Add-k Smoothing
• Adding one is often too much
• Instead, add a fraction k
• AKA Lidstone smoothing
Paddk(wi | wi−2, wi−1) = (C(wi−2, wi−1, wi) + k) / (C(wi−2, wi−1) + k|V|)
• Have to choose k

Lidstone Smoothing
Example: context = alleged; continuation words: impropriety, offense, damage, deficiencies, outbreak, infirmity, alleged
• 5 observed bigrams, 2 unobserved bigrams; the context occurs 20 times and there are 7 word types
• With k = 0.1:
‣ observed, e.g. impropriety (count 8): (8 + 0.1) / (20 + 7 × 0.1) ≈ 0.391, effective count 0.391 × 20
‣ unobserved: (0 + 0.1) / (20 + 7 × 0.1) ≈ 0.005
• Counts still sum to 20 and probabilities to 1.0 after smoothing
[Full count/probability table from the slide not reproduced.]
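A small sketch of the add-k arithmetic above (not from the slides); only the context total (20), the vocabulary size (7), the count for impropriety (8) and k = 0.1 are taken from the example:

    def p_add_k(count_ngram, count_context, vocab_size, k=0.1):
        """Add-k (Lidstone) smoothed probability for one n-gram."""
        return (count_ngram + k) / (count_context + k * vocab_size)

    # Numbers from the "alleged" example: context count 20, 7 word types, k = 0.1
    print(p_add_k(8, 20, 7))  # observed bigram ("alleged impropriety"): ~0.391
    print(p_add_k(0, 20, 7))  # unobserved bigram: ~0.005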

Absolute Discounting
• 'Borrows' a fixed probability mass from observed n-gram counts
• Redistributes it to unseen n-grams

Example: context = alleged; continuation words: impropriety, offense, damage, deficiencies, outbreak, infirmity, alleged
• 5 observed bigrams, 2 unobserved bigrams
• With discount D = 0.1, each observed count is reduced by 0.1 (e.g. impropriety: 8 − 0.1)
• The borrowed mass is shared equally among the unseen bigrams: effective count (0.1 × 5) / 2 each, i.e. probability 0.25 / 20
• Counts still sum to 20 and probabilities to 1.0
[Full count/probability table from the slide not reproduced.]
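A rough sketch of absolute discounting over a single context (not the lecture's code). Only impropriety's count (8), the total (20) and D = 0.1 come from the slide; the remaining counts are assumed for the demo:

    from collections import Counter

    def absolute_discount(counts, vocab, d=0.1):
        """Discount each observed count by d and share the borrowed mass
        equally among the unseen words of `vocab` (single-context version)."""
        total = sum(counts.values())
        unseen = [w for w in vocab if w not in counts]
        probs = {w: (c - d) / total for w, c in counts.items()}
        borrowed = d * len(counts) / total    # mass taken from the seen words
        for w in unseen:
            probs[w] = borrowed / len(unseen)  # shared equally among the unseen
        return probs

    # Assumed counts for words following "alleged" (they sum to 20)
    counts = Counter({"impropriety": 8, "offense": 5, "damage": 4,
                      "deficiencies": 2, "outbreak": 1})
    vocab = set(counts) | {"infirmity", "alleged"}
    probs = absolute_discount(counts, vocab)
    print(probs["impropriety"])            # (8 - 0.1) / 20 ≈ 0.395
    print(probs["infirmity"])              # (0.1 * 5 / 2) / 20 ≈ 0.0125
    print(round(sum(probs.values()), 10))  # 1.0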

Backoff
• Absolute discounting redistributes the probability mass equally over all unseen n-grams
• Katz Backoff: redistributes the mass based on a lower order model (e.g. unigram)

Pkatz(wi | wi−1) =
‣ (C(wi−1, wi) − D) / C(wi−1),   if C(wi−1, wi) > 0
‣ α(wi−1) × P(wi) / ∑{wj : C(wi−1, wj) = 0} P(wj),   otherwise

where
‣ α(wi−1) = the amount of probability mass that has been discounted for context wi−1 ((0.1 × 5) / 20 in the previous slide)
‣ P(wi) = unigram probability for wi, e.g. P(infirmity)
‣ the denominator sums the unigram probabilities of all words that do not co-occur with context wi−1, e.g. P(infirmity) + P(alleged)
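A sketch of bigram Katz backoff with absolute discounting (not the lecture's implementation); the toy sentences and the discount D = 0.1 are assumptions for the demo:

    from collections import Counter

    def katz_backoff(word, context, bigrams, unigrams, vocab, d=0.1):
        """Katz backoff for a bigram model: use the discounted bigram estimate
        if the bigram was seen, otherwise back off to the unigram distribution
        over words never seen after `context`."""
        if bigrams[(context, word)] > 0:
            return (bigrams[(context, word)] - d) / unigrams[context]
        seen = {w for (c, w) in bigrams if c == context}
        alpha = d * len(seen) / unigrams[context]  # discounted mass for this context
        unseen = [w for w in vocab if w not in seen]
        total = sum(unigrams.values())
        p_uni = lambda w: unigrams[w] / total
        return alpha * p_uni(word) / sum(p_uni(w) for w in unseen)

    sents = ["<s> a cow eats grass </s>".split(), "<s> a cow eats hay </s>".split()]
    unigrams = Counter(w for s in sents for w in s)
    bigrams = Counter((s[i], s[i + 1]) for s in sents for i in range(len(s) - 1))
    vocab = set(unigrams) - {"<s>"}
    print(katz_backoff("hay", "eats", bigrams, unigrams, vocab))   # seen bigram: (1 - 0.1) / 2
    print(katz_backoff("grass", "cow", bigrams, unigrams, vocab))  # backs off to unigrams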

Issues with Katz Backoff
• I can't see without my reading ___
• C(reading, glasses) = C(reading, Francisco) = 0
• C(Francisco) > C(glasses)
• Katz backoff will give a higher probability to Francisco

Kneser-Ney Smoothing
• Redistribute probability mass based on the versatility of the lower order n-gram
• AKA "continuation probability"
• What is versatility?
‣ High versatility -> co-occurs with a lot of unique words, e.g. glasses
– men's glasses, black glasses, buy glasses, etc.
‣ Low versatility -> co-occurs with few unique words, e.g. Francisco
– San Francisco

Kneser-Ney Smoothing
PKN(wi | wi−1) =
‣ (C(wi−1, wi) − D) / C(wi−1),   if C(wi−1, wi) > 0
‣ β(wi−1) Pcont(wi),   otherwise

where β(wi−1) is the amount of probability mass that has been discounted for context wi−1, and
Pcont(wi) = |{wi−1 : C(wi−1, wi) > 0}| / ∑wi |{wi−1 : C(wi−1, wi) > 0}|

• Intuitively, the numerator of Pcont counts the number of unique wi−1 that co-occur with wi
• High continuation counts for glasses
• Low continuation counts for Francisco
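A minimal sketch of the continuation probability (not from the slides); the toy bigram counts are made up so that "glasses" follows many different words while "Francisco" follows only one:

    from collections import Counter

    def continuation_prob(bigrams):
        """Pcont(w): number of unique left contexts of w, normalised by the
        total number of unique bigram types. `bigrams` maps (w_prev, w) to counts."""
        continuations = Counter(w for (_, w) in bigrams)  # unique contexts per word
        total_types = sum(continuations.values())         # = number of bigram types
        return {w: c / total_types for w, c in continuations.items()}

    bigrams = Counter({("reading", "glasses"): 5, ("black", "glasses"): 2,
                       ("buy", "glasses"): 1, ("San", "Francisco"): 8})
    pcont = continuation_prob(bigrams)
    print(pcont["glasses"])    # 3/4: versatile continuation
    print(pcont["Francisco"])  # 1/4: despite the higher raw count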

Interpolation
• A better way to combine different orders of n-gram models
• Weighted sum of probabilities across progressively shorter contexts
• Interpolated trigram model:
PIN(wi | wi−2, wi−1) = λ3 P3(wi | wi−2, wi−1) + λ2 P2(wi | wi−1) + λ1 P1(wi)
where λ1 + λ2 + λ3 = 1, and the λs are learned based on held-out data
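A short sketch of the interpolated trigram formula (not from the slides); the component models and λ values here are hypothetical stand-ins:

    def p_interpolated(word, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
        """Interpolated trigram probability P(word | w1 w2); lambdas = (λ1, λ2, λ3)
        must sum to 1 and would normally be tuned on held-out data."""
        l1, l2, l3 = lambdas
        return l3 * p_tri(word, w1, w2) + l2 * p_bi(word, w2) + l1 * p_uni(word)

    # Hypothetical component models, for illustration only
    p_uni = lambda w: {"grass": 0.1}.get(w, 0.01)
    p_bi  = lambda w, prev: 0.5 if (prev, w) == ("eats", "grass") else 0.0
    p_tri = lambda w, w1, w2: 0.8 if (w1, w2, w) == ("cow", "eats", "grass") else 0.0

    print(p_interpolated("grass", "cow", "eats", p_uni, p_bi, p_tri))
    # ≈ 0.57 (0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 0.1)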

Interpolated Kneser-Ney Smoothing
• Interpolation instead of back-off:
PIKN(wi | wi−1) = (C(wi−1, wi) − D) / C(wi−1) + γ(wi−1) Pcont(wi)
where γ(wi−1) is a normalising constant such that ∑wi∈V PIKN(wi | wi−1) = 1.0

In Practice
• Commonly used Kneser-Ney language models use 5-grams as the max order
• They have different discount values for each n-gram order

How can we use an n-gram language model to fix spelling errors?
• Build a character-level LM and fix it character by character
• Build a word-level LM and use left/right context words to find the most appropriate substituting word
• Build a LM using common typos as training data
• N-gram LMs can't fix spelling errors
PollEv.com/jeyhanlau569


Generating Language

Generation
• Given an initial word, draw the next word according to the probability distribution produced by the language model.
• Include (n−1) <s> tokens for an n-gram model, to provide context to generate the first word
‣ never generate <s>
‣ generating </s> terminates the sequence

Generation (Bigram LM)
• Sentence = <s>
‣ P(? | <s>) = "a"
• Sentence = <s> a
‣ P(? | a) = "cow"
• Sentence = <s> a cow
‣ P(? | cow) = "eats"
• Sentence = <s> a cow eats
‣ P(? | eats) = "grass"
• Sentence = <s> a cow eats grass
‣ P(? | grass) = "</s>"
• Sentence = <s> a cow eats grass </s>
‣ Done!
[Each step on the original slides shows a bar chart of the next-word distribution over cow, grass, eats, a and </s>; charts not reproduced.]
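A minimal sketch of this sampling loop (not the lecture's code); the bigram probability table below is made up to roughly match the example:

    import random

    # Hypothetical bigram distributions for the "a cow eats grass" example
    bigram_probs = {
        "<s>":   {"a": 0.5, "cow": 0.2, "grass": 0.2, "eats": 0.1},
        "a":     {"cow": 0.7, "grass": 0.3},
        "cow":   {"eats": 0.6, "grass": 0.4},
        "eats":  {"grass": 0.8, "a": 0.2},
        "grass": {"</s>": 0.6, "a": 0.4},
    }

    def generate(max_len=20):
        """Sample a sentence from the bigram model, starting from <s>."""
        sent = ["<s>"]
        while sent[-1] != "</s>" and len(sent) < max_len:
            dist = bigram_probs[sent[-1]]
            words, weights = zip(*dist.items())
            sent.append(random.choices(words, weights=weights)[0])
        return " ".join(w for w in sent if w not in ("<s>", "</s>"))

    print(generate())  # e.g. "a cow eats grass"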

How to Select Next Word?
• Argmax: takes the highest probability word each turn (see the small sketch below)
‣ Greedy search
• Randomly sample from the distribution
• Beam search decoding (lecture 17):
‣ Keeps track of the top-N highest probability words each turn
‣ Selects the sequence of words that produces the best sentence probability
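As a tiny illustration of the argmax option (not from the slides), with a hypothetical next-word distribution:

    def greedy_next(dist):
        """Argmax (greedy) choice: pick the highest-probability continuation."""
        return max(dist, key=dist.get)

    # Hypothetical distribution over next words after the context "a"
    print(greedy_next({"cow": 0.7, "grass": 0.2, "eats": 0.1}))  # "cow"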

A Final Word
• N-gram language models are a simple but effective way to capture the predictability of language
• Can be trained in an unsupervised fashion, scalable to large corpora
• Require smoothing to be effective
• Modern language models use neural networks (lecture 8)

Reading
• E18 Chapter 6 (skip 6.3)