
Language Modeling

Lizhen Qu

Recap

2

Sentence Splitting

Tokenization

Stemming

POS Tagging

Overview of the NLP Lectures

§  Introduction to natural language processing (NLP).

§  Regular expressions, sentence splitting,
tokenization, part-of-speech tagging.

§  Language models.

§  Vector semantics.

§  Parsing.

§  Compositional semantics.
3

Language Models

§  Goal: assign a probability to a word sequence.
–  Speech recognition:

•  P(I ate a cherry) > P(Eye eight Jerry)
–  Spelling correction:

•  P(Australian National University) > P(Australian National Univerisity)
–  Collocation error correction:

•  P(high wind) > P(large wind)
–  Machine Translation:

•  P(The magic of Usain Bolt on show…) > P(The magic of Usain Bolt
at the show …)

–  Question-answering, summarization, etc.

4

Probabilistic Language Modeling

§  A language model computes the probability of a
sequence of words.
–  A vocabulary $V$.
–  $p(x_1, x_2, \ldots, x_l) \ge 0$
–  $\sum_{(x_1, x_2, \ldots, x_l) \in V^*} p(x_1, x_2, \ldots, x_l) = 1$

§  Related task: probability of an upcoming word.
–  $p(x_4 \mid x_1, x_2, x_3)$

§  LM: Either $p(x_4 \mid x_1, x_2, x_3)$ or $p(x_1, x_2, \ldots, x_l)$.

5

How to Compute $p(x_1, x_2, \ldots, x_l)$

§  Apply chain rule:
–  $P(x_1, x_2, \ldots, x_l) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) \ldots P(x_l \mid x_1, \ldots, x_{l-1})$

§  Compute:
–  P(All Blacks’ hotel room bugged) = P(All) P(Blacks | All) P(’ | All Blacks)
… P(bugged | All Blacks’ hotel room)

6

Estimate the Probabilities

§  P(bugged | All Blacks’ hotel room) =
   Count(All Blacks’ hotel room bugged) / Count(All Blacks’ hotel room)

§  Not enough data!

7

Markov Assumption

§  Simplification:
–  P(bugged | All Blacks’ hotel room) = P(bugged | room)
–  or P(bugged | All Blacks’ hotel room) = P(bugged | hotel room)

§  First-order Markov assumption:
–  $P(x_1, x_2, \ldots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_1, \ldots, x_{i-1})$
–  $\qquad\qquad\qquad\;\; = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})$

8

Unigram Model

§  Zero-order Markov assumption.

§  Examples generated from a unigram model.

9

$P(x_1, x_2, \ldots, x_l) = \prod_{i=1}^{l} P(x_i)$

Months the my and issue of year foreign new exchange’s
september were recession exchange new endorsed a acquire
to six executives
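As a concrete illustration of the zero-order factorization, the sketch below (not from the slides; the toy corpus and function names are my own) estimates unigram probabilities by maximum likelihood and scores a sentence as a product of per-word probabilities.

```python
from collections import Counter

def train_unigram(corpus_tokens):
    """MLE unigram model: P(x) = count(x) / total number of tokens."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_prob(sentence_tokens, p):
    """P(x_1, ..., x_l) = prod_i P(x_i) under the zero-order Markov assumption."""
    prob = 1.0
    for w in sentence_tokens:
        prob *= p.get(w, 0.0)  # unseen words get probability 0 (no smoothing yet)
    return prob

# Toy usage (illustrative data, not the lecture corpus):
corpus = "i want to eat chinese food i want food".split()
p = train_unigram(corpus)
print(unigram_prob("i want food".split(), p))
```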

Bigram Model

§  First-order assumption.

§  P(I want to eat Chinese food.) = ?

§  Estimate bigram probabilities from a training corpus.

10

$P(x_1, x_2, \ldots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})$

Trigram Models

§  Second-order Markov assumption.

§  N-gram models cannot capture long-distance dependencies of language (see the example below).

§  We can extend to 4-grams, 5-grams, …

11

$P(x_1, x_2, \ldots, x_l) = P(x_1) P(x_2 \mid x_1) \prod_{i=3}^{l} P(x_i \mid x_{i-1}, x_{i-2})$

“The iphone which I bought one week ago does not stand the cold.”

Restaurant Corpus

Bigram counts (row = first word, column = second word):

            I    want    to    eat   Chinese   food   lunch
I           8    1087     0     13         0      0       0
want        3       0   786      0         6      8       6
to          3       0    10    860         3      0      12
eat         0       0     2      0        19      2      52
Chinese     2       0     0      0         0    120       0
food       19       0    17      0         0      0       0
lunch       4       0     0      0         0      1       0

Unigram counts:

            I    want    to    eat   Chinese   food   lunch
         3437     809  1265   3256       938    213     459

Total: 11024

Compute Bigram Probabilities

§  Maximum likelihood estimation:

§  Bigram probabilities:

–  P(want|I) = 1087 / 3437 = 0.32

§  Log probabilities:
–  logP(I want to eat Chinese food) = logP(I) + logP(want | I) + logP(to | want) + logP(eat | to) + logP(Chinese | eat) + logP(food | Chinese)

$P(x_i \mid x_{i-1}) = \dfrac{\text{count}(x_{i-1}, x_i)}{\text{count}(x_{i-1})}$
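A minimal sketch of the maximum-likelihood bigram estimate and the log-probability sum above; the toy corpus and function names are illustrative assumptions, not the restaurant corpus itself.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """MLE bigram model: P(x_i | x_{i-1}) = count(x_{i-1}, x_i) / count(x_{i-1})."""
    unigram = Counter(corpus_tokens)
    bigram = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigram, bigram

def bigram_prob(prev, word, unigram, bigram):
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]

def sentence_logprob(tokens, unigram, bigram, total_tokens):
    """log P(x_1) + sum_i log P(x_i | x_{i-1}); returns -inf if any factor is zero."""
    logp = math.log(unigram[tokens[0]] / total_tokens) if unigram[tokens[0]] else float("-inf")
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word, unigram, bigram)
        logp += math.log(p) if p > 0 else float("-inf")
    return logp

# Toy usage (illustrative corpus):
corpus = "i want to eat chinese food i want chinese food".split()
uni, bi = train_bigram(corpus)
print(sentence_logprob("i want chinese food".split(), uni, bi, len(corpus)))
```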

Sequence Generation

§  Compute conditional probabilities.
–  P(want | I) = 0.32
–  P(to| I) = 0
–  P(eat|I) = 0.004
–  P(Chinese|I) = 0
–  P (I | I) = 0.002
–  …

§  Generate a random number in [0,1].

§  See which region it falls into.

14
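The random-number idea on this slide is inverse-CDF sampling: partition [0, 1] into regions whose widths are the conditional probabilities and see where the draw lands. A minimal sketch, reusing the bigram and unigram counters from the earlier sketch (the toy corpus is again an illustrative assumption):

```python
import random
from collections import Counter

def sample_next(prev, unigram_counts, bigram_counts):
    """Sample the next word: draw r in [0, 1) and see which probability
    region of P(. | prev) it falls into."""
    candidates = [(w, c / unigram_counts[prev])
                  for (p, w), c in bigram_counts.items() if p == prev]
    r = random.random()
    cumulative = 0.0
    for word, prob in candidates:
        cumulative += prob
        if r < cumulative:
            return word
    return candidates[-1][0] if candidates else None  # guard against rounding / unseen context

# Toy usage:
corpus = "i want to eat chinese food i want chinese food".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
print(sample_next("i", uni, bi))
```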

Approximating Shakespeare

§  Generate sentences from a unigram model.
–  Every enter now severally so, let
–  Hill he late speaks; or! a more to leg less first you enter

§  From a bigram model:
–  What means, sir. I confess she? then all sorts, he is trim, captain.
–  Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.

§  From a trigram model:
–  Sweet prince, Falstaff shall die.
–  This shall forbid it should be branded, if renown made it empty.

15

The Perils of Overfitting

§  P(I want to eat Chinese lunch.) = ? when
count(Chinese lunch) = 0.

§  In real life, the test corpus is often different from the training corpus.

§  Unknown words!

16

Generalization by avoiding zeros!

Interpolation

§  Key idea: mix in lower-order n-gram probabilities.

§  For the bigram model:
–  $\hat{P}(x_i \mid x_{i-1}) = \lambda_1 P(x_i \mid x_{i-1}) + \lambda_2 P(x_i)$
–  $\lambda_1 \ge 0, \; \lambda_2 \ge 0, \; \lambda_1 + \lambda_2 = 1$

§  For the trigram model:
–  $\hat{P}(x_i \mid x_{i-1}, x_{i-2}) = \lambda_1 P(x_i \mid x_{i-1}, x_{i-2}) + \lambda_2 P(x_i \mid x_{i-1}) + \lambda_3 P(x_i)$
–  $\lambda_1 \ge 0, \; \lambda_2 \ge 0, \; \lambda_3 \ge 0, \quad \sum_i \lambda_i = 1$

17
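A minimal sketch of bigram interpolation under the constraints above; the default values lam1 = 0.7, lam2 = 0.3 are placeholders (in practice they are set on held-out data, as the next slide shows).

```python
def interpolated_bigram(prev, word, unigram, bigram, total, lam1=0.7, lam2=0.3):
    """P_hat(x_i | x_{i-1}) = lam1 * P(x_i | x_{i-1}) + lam2 * P(x_i),
    with lam1, lam2 >= 0 and lam1 + lam2 = 1."""
    p_bigram = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    p_unigram = unigram[word] / total
    return lam1 * p_bigram + lam2 * p_unigram
```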

How to Set the Lambdas?

§  Estimate $\lambda_i$ on the held-out data.
–  Data split: training data | held-out data | test data.

§  One simple estimation (Collins et al.):
–  $\lambda_1 = \dfrac{\text{count}(x_{i-1}, x_{i-2})}{\text{count}(x_{i-1}, x_{i-2}) + \gamma}$
–  $\lambda_2 = (1 - \lambda_1) \times \dfrac{\text{count}(x_{i-1})}{\text{count}(x_{i-1}) + \gamma}$
–  $\lambda_3 = 1 - \lambda_1 - \lambda_2$

18
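A sketch of the count-based lambda scheme above; gamma = 5.0 is an arbitrary placeholder that would be tuned on the held-out data.

```python
def collins_lambdas(count_ctx2, count_ctx1, gamma=5.0):
    """Count-based lambdas for trigram interpolation (the simple scheme on the slide).
    count_ctx2 = count(x_{i-1}, x_{i-2}), count_ctx1 = count(x_{i-1});
    frequent contexts put more weight on the higher-order estimate."""
    lam1 = count_ctx2 / (count_ctx2 + gamma)
    lam2 = (1 - lam1) * count_ctx1 / (count_ctx1 + gamma)
    lam3 = 1 - lam1 - lam2
    return lam1, lam2, lam3

# Example: a frequent context trusts the trigram more than a rare one.
print(collins_lambdas(count_ctx2=100, count_ctx1=500))
print(collins_lambdas(count_ctx2=0, count_ctx1=3))
```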

Absolute Discounting Interpolation

§  Absolute discounting:
–  $P_{\text{AbsDiscount}}(x_i \mid x_{i-1}) = \dfrac{\text{count}(x_{i-1}, x_i) - d}{\text{count}(x_{i-1})} + \lambda(x_{i-1}) P(x_i)$

§  Often use d = 0.75 in practice.

§  Is it sufficient to use the unigram $P(x_i)$?

19

Kneser-Ney Smoothing (i)

§  Better estimate for probabilities of lower-order unigrams!
–  Shannon game: I can’t see without my reading ___________?
–  “Francisco” is more common than “glasses”
–  … but “Francisco” always follows “San”

§  $P_{\text{continuation}}(x)$: How likely is a word to appear as a novel continuation?
–  For each word x, count the number of bigram types it completes:
–  $P_{\text{continuation}}(x_i) \propto |\{x_{i-1} : \text{count}(x_{i-1}, x_i) > 0\}|$

20

Kneser-Ney Smoothing (ii)

§  Example:
–  $|\{x_{i-1} : \text{count}(x_{i-1}, \text{Francisco}) > 0\}| \ll |\{x_{i-1} : \text{count}(x_{i-1}, \text{glasses}) > 0\}|$

§  Normalized by the total number of bigram types:
–  $P_{\text{continuation}}(x_i) = \dfrac{|\{x_{i-1} : \text{count}(x_{i-1}, x_i) > 0\}|}{|\{(x_{j-1}, x_j) : \text{count}(x_{j-1}, x_j) > 0\}|}$

21

Kneser-Ney Smoothing (iv)

§  Definition for bigrams:
–  $P_{\text{KN}}(x_i \mid x_{i-1}) = \dfrac{\max(\text{count}(x_{i-1}, x_i) - d, \, 0)}{\text{count}(x_{i-1})} + \lambda(x_{i-1}) P_{\text{continuation}}(x_i)$

where
–  $\lambda(x_{i-1}) = \dfrac{d}{\text{count}(x_{i-1})} \, |\{x : \text{count}(x_{i-1}, x) > 0\}|$

22

https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
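A sketch of interpolated Kneser-Ney for bigrams, combining the continuation probability from slides (i)–(ii) with the discounted bigram term and lambda above. The toy corpus is an illustrative assumption, and count(x_{i-1}) is approximated here by the unigram count.

```python
from collections import Counter

def kneser_ney_bigram(prev, word, bigram_counts, unigram_counts, d=0.75):
    """P_KN(x_i | x_{i-1}) = max(count(x_{i-1}, x_i) - d, 0) / count(x_{i-1})
                             + lambda(x_{i-1}) * P_continuation(x_i)."""
    # Continuation probability: number of distinct left contexts preceding `word`,
    # normalized by the total number of bigram types.
    total_bigram_types = len(bigram_counts)
    continuation = len({p for (p, w) in bigram_counts if w == word}) / total_bigram_types
    # lambda(x_{i-1}) = (d / count(x_{i-1})) * |{x : count(x_{i-1}, x) > 0}|
    followers = len({w for (p, w) in bigram_counts if p == prev})
    lam = d / unigram_counts[prev] * followers
    discounted = max(bigram_counts[(prev, word)] - d, 0) / unigram_counts[prev]
    return discounted + lam * continuation

# Toy usage (illustrative corpus):
corpus = "i want to eat chinese food i want chinese food".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
print(kneser_ney_bigram("want", "chinese", bi, uni))
```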

Evaluation of Language Models

§  Extrinsic evaluation:
–  Put each model in a task

•  Spelling correction, machine translation etc.
–  Time-consuming.
–  Task-dependent.

§  Intrinsic evaluation:
–  Perplexity.
–  Useful in pilot experiments.

23

$PP(X) = P(x_1 x_2 \ldots x_L)^{-\frac{1}{L}} = \sqrt[L]{\dfrac{1}{P(x_1 x_2 \ldots x_L)}} = \sqrt[L]{\prod_{i=1}^{L} \dfrac{1}{P(x_i \mid x_{i-1})}}$  (the last step assumes a bigram model)

              Unigram   Bigram   Trigram
Perplexity        962      170       109
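A sketch of perplexity as the exponentiated average negative log-probability per token; the `prob_fn` callback and the sentence-start symbol `<s>` are my assumptions (any smoothed conditional model can be plugged in).

```python
import math

def perplexity(sentences, prob_fn):
    """PP = P(x_1 ... x_L)^(-1/L), computed as exp of the average negative
    log-probability per token. prob_fn(prev, word) returns P(x_i | x_{i-1})."""
    log_sum = 0.0
    num_tokens = 0
    for tokens in sentences:
        prev = "<s>"                    # assumed sentence-start symbol
        for word in tokens:
            p = prob_fn(prev, word)
            log_sum += math.log(p)      # assumes p > 0, i.e. a smoothed model
            prev = word
            num_tokens += 1
    return math.exp(-log_sum / num_tokens)

# Toy usage with the interpolated bigram sketch from earlier (hypothetical names):
# pp = perplexity(test_sentences,
#                 prob_fn=lambda prev, w: interpolated_bigram(prev, w, uni, bi, len(corpus)))
```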

Google N-Gram Release

24

§  serve as the incoming 92
§  serve as the incubator 99
§  serve as the independent 794
§  serve as the index 223
§  serve as the indication 72
§  serve as the indicator 120
§  serve as the indicators 45
§  serve as the indispensable 111
§  serve as the indispensible 40
§  serve as the individual 234

http://ngrams.googlelabs.com/

Smoothing for Web-scale N-grams

§  “Stupid backoff” (Brants et al. 2007)
§  No discounting, just use relative frequencies

25

$S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\text{count}(w_{i-k+1}^{i})}{\text{count}(w_{i-k+1}^{i-1})} & \text{if count}(w_{i-k+1}^{i}) > 0 \\[6pt]
0.4 \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}$

$S(w_i) = \dfrac{\text{count}(w_i)}{N}$
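A recursive sketch of stupid backoff; the dictionary-of-tuples count store and the toy counts are illustrative assumptions, and S is a score rather than a true probability.

```python
def stupid_backoff(words, counts, total_tokens, alpha=0.4):
    """Stupid backoff score S (Brants et al. 2007).
    `words` is the n-gram (context + current word) as a tuple;
    `counts` maps n-gram tuples of any order to their corpus counts
    (assumes the context count is present whenever the full n-gram is)."""
    if len(words) == 1:
        return counts.get(words, 0) / total_tokens       # S(w_i) = count(w_i) / N
    if counts.get(words, 0) > 0:
        return counts[words] / counts[words[:-1]]        # relative frequency, no discounting
    return alpha * stupid_backoff(words[1:], counts, total_tokens, alpha)  # back off, scaled by 0.4

# Toy usage (illustrative counts):
counts = {("new", "york"): 5, ("new",): 8, ("york",): 6}
print(stupid_backoff(("new", "york"), counts, total_tokens=100))
print(stupid_backoff(("san", "york"), counts, total_tokens=100))
```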

Tools

§  SRILM
–  http://www.speech.sri.com/projects/srilm/

§  Berkeley LM
–  https://code.google.com/archive/p/berkeleylm/

§  KenLM
–  https://kheafield.com/code/kenlm/

§  Available LM models
–  http://www.keithv.com/software/csr/

26