
COMP9414 Language Models 1

Probabilistic Language Models

• Based on statistics derived from a large corpus of text/speech

◮ Brown Corpus (1960s) – 1 million words

◮ Penn Treebank (1980s) – 7 million words

◮ North American News (1990s) – 350 million words

◮ IBM – 1 billion words

◮ Google & Facebook – Trillions of words

• Contrary to the view that language ability is based on (innate) knowledge

• Idea is that language ability can be learnt ... with enough data ...


COMP9414: Artificial Intelligence

Lecture 5b: Language Models

Wayne Wobcke

e-mail:w. .au

UNSW ©W. Wobcke et al. 2019–2021

COMP9414 Language Models 3

Penn Treebank Tagset


COMP9414 Language Models 2

This Lecture

• Part of Speech Tagging

◮ n-gram Models

◮ Hidden Markov Models

◮ Viterbi Algorithm

• Word Sense Disambiguation

◮ Mutual Information

◮ Class-Based Models


COMP9414 Language Models 5

Why is this Hard?

Ambiguity, e.g. back

• earnings growth took a back/JJ seat

• a small building in the back/NN

• a clear majority of senators back/VBP the bill

• Dave began to back/VB toward the door

• enable the country to buy back/RP about debt

• I was twenty-one back/RB then


COMP9414 Language Models 4

Part of Speech Tagging

• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

• There/EX are/VBP 70/CD children/NNS there/RB

• Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.


COMP9414 Language Models 7

Unigram Model

Maximize P(w1, ..., wn | t1, ..., tn) · P(t1, ..., tn)

• Apply independence assumptions

◮ P(w1, ..., wn | t1, ..., tn) = P(w1|t1) · ... · P(wn|tn)

◮ Probability of word w generated by tag t independent of context

◮ P(t1, ..., tn) = P(t1) · ... · P(tn)

◮ Probability of each tag independent of the other tags (and of their order)

• Estimate probabilities

◮ P(w|t) = #(w occurs with tag t) / #(words with tag t)

◮ P(t) = #(words with tag t) / #words

• Choose tag sequence that maximizes ΠP(ti|wi)

◮ Equivalent, since P(wi|ti)·P(ti) = P(ti|wi)·P(wi) and P(wi) is fixed by the sentence

◮ Chooses most common tag for each word

• Accuracy around 90% – but still ≈1 word wrong in every sentence!
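
Choosing the most common tag per word is simple enough to sketch directly. Below is a minimal unigram tagger in Python, assuming a tiny hand-built tagged corpus; the words, tags and the fallback tag are illustrative, not from the lecture.

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs; real estimates need a large treebank.
corpus = [("the", "DT"), ("back", "NN"), ("back", "VB"), ("back", "NN"),
          ("senators", "NNS"), ("back", "VBP"), ("the", "DT"), ("bill", "NN")]

tag_for_word = defaultdict(Counter)       # tag_for_word[w][t] = #(w occurs with tag t)
for w, t in corpus:
    tag_for_word[w][t] += 1

def unigram_tag(word):
    """Pick the tag maximizing P(t|w), i.e. the word's most common tag."""
    if word not in tag_for_word:
        return "NN"                       # crude fallback for unseen words
    return tag_for_word[word].most_common(1)[0][0]

print(unigram_tag("back"))                # NN - its most frequent tag in the toy corpus
```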


COMP9414 Language Models 6

Probabilistic Formulation

• Events: occurrence of word w, occurrence of a word with tag t

• Given a sequence of words w1, ..., wn, choose t1, ..., tn so that P(t1, ..., tn | w1, ..., wn) is maximized

• Apply Bayes’ Rule

◮ P(t1, ..., tn | w1, ..., wn) = P(w1, ..., wn | t1, ..., tn) · P(t1, ..., tn) / P(w1, ..., wn)

◮ The denominator P(w1, ..., wn) is the same for every tag sequence, so it suffices to maximize P(w1, ..., wn | t1, ..., tn) · P(t1, ..., tn)


COMP9414 Language Models 9

Hidden Markov Models

• Bayesian network

◮ P(S0) specifies initial conditions

◮ P(Si+1|Si) specifies dynamics

◮ P(Oi|Si) specifies “observations”

• Independence Assumptions

◮ P(Si+1|S0, ..., Si) = P(Si+1|Si) (Markov Chain)

◮ P(Oi|S0, ..., Si−1, Si, O0, ..., Oi−1) = P(Oi|Si)

◮ Observations (words) depend only on current state (tag)
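
Taken together, the two assumptions factorise the joint distribution of tags and words (a one-line consequence of the chain rule):

P(S0, ..., Sn, O0, ..., On) = P(S0)·P(O0|S0) · Π_{i=1,...,n} P(Si|Si−1)·P(Oi|Si)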


COMP9414 Language Models 8

Markov Chain

• Bayesian network

◮ P(S0) specifies initial conditions

◮ P(Si+1|Si) specifies dynamics (stationary if same for each i)

• Independence assumptions

◮ P(Si+1|S0, ..., Si) = P(Si+1|Si)

◮ Transition probabilities depend only on the current state Si – independent of the history S0, ..., Si−1 by which that state was reached

◮ The future is independent of the past, given the present


COMP9414 Language Models 11

Markov Model for POS Tagging

Transition probabilities define stationary distribution


COMP9414 Language Models 10

Bigram Model

Maximize P(w1, ..., wn | t1, ..., tn) · P(t1, ..., tn)

• Apply independence assumptions (Markov assumptions)

◮ P(w1, ..., wn | t1, ..., tn) = ΠP(wi|ti)

◮ Observations (words) depend only on states (tags)

◮ P(t1, ..., tn) = P(tn|tn−1) · ... · P(t1|φ), where φ = start

◮ Bigram model: state (tag) depends only on previous state (tag)

• Estimate probabilities

◮ P(ti|tj) = #(tj ti occurs) / #(tj starts a bigram)

◮ Choose tag sequence that maximizes ΠP(wi|ti)·P(ti|ti−1)

◮ Parts of speech generated by a finite state machine
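
A minimal sketch of the count-based estimates in Python, assuming tagged sentences are given as lists of (word, tag) pairs; the function and variable names are illustrative.

```python
from collections import Counter, defaultdict

START = "<s>"   # the "start" pseudo-tag (phi on the slide)

def estimate(tagged_sentences):
    """Relative-frequency estimates of P(w|t) and P(t_i|t_{i-1})."""
    emit = defaultdict(Counter)    # emit[t][w]       = #(w occurs with tag t)
    trans = defaultdict(Counter)   # trans[t_prev][t] = #(bigram t_prev, t occurs)
    for sent in tagged_sentences:
        prev = START
        for w, t in sent:
            emit[t][w] += 1
            trans[prev][t] += 1
            prev = t
    P_emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
              for t, ws in emit.items()}
    P_trans = {p: {t: c / sum(ts.values()) for t, c in ts.items()}
               for p, ts in trans.items()}
    return P_emit, P_trans

def score(words, tags, P_emit, P_trans):
    """Probability of one candidate tag sequence: prod of P(w_i|t_i) * P(t_i|t_{i-1})."""
    p, prev = 1.0, START
    for w, t in zip(words, tags):
        p *= P_emit.get(t, {}).get(w, 0.0) * P_trans.get(prev, {}).get(t, 0.0)
        prev = t
    return p
```

Here score evaluates ΠP(wi|ti)·P(ti|ti−1) for a single candidate tag sequence; the Viterbi sketch later in these notes finds the best sequence without enumerating all of them.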


COMP9414 Language Models 13

Computing Probabilities


COMP9414 Language Models 12

Hidden Markov Model for POS Tagging


COMP9414 Language Models 15

Viterbi Algorithm

1. Sweep forward (one word at a time), saving only the most likely sequence (and its probability) for each tag ti of wi

2. Select highest probability final state

3. Follow chain backwards to extract tag sequence
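
A minimal Python sketch of these three steps, assuming the P_emit / P_trans dictionaries from the earlier estimation sketch; the names are illustrative and zero probabilities are left unsmoothed.

```python
def viterbi(words, tagset, P_emit, P_trans, start="<s>"):
    """Most probable tag sequence for words under the bigram (HMM) model."""
    # best[i][t] = (prob of best path ending in tag t at word i, previous tag on that path)
    best = [dict() for _ in words]
    for t in tagset:
        best[0][t] = (P_trans.get(start, {}).get(t, 0.0)
                      * P_emit.get(t, {}).get(words[0], 0.0), None)
    for i in range(1, len(words)):                     # 1. sweep forward
        for t in tagset:
            e = P_emit.get(t, {}).get(words[i], 0.0)
            prob, prev = max((best[i - 1][tp][0] * P_trans.get(tp, {}).get(t, 0.0) * e, tp)
                             for tp in tagset)
            best[i][t] = (prob, prev)
    last = max(tagset, key=lambda t: best[-1][t][0])   # 2. highest-probability final state
    path = [last]
    for i in range(len(words) - 1, 0, -1):             # 3. follow the chain backwards
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```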


COMP9414 Language Models 14

Example

w        P(w|ART)   P(w|N)   P(w|P)   P(w|V)
a        0.36       0        0        0
flies    0          0.025    0        0.076
flower   0          0.063    0        0.05
like     0          0.012    0        0.1

P(flies/N like/V a/ART flower/N)
= P(N|start)·P(flies|N)·P(V|N)·P(like|V)·P(ART|V)·P(a|ART)·P(N|ART)·P(flower|N)
= 0.29 × 0.025 × 0.43 × 0.1 × 0.65 × 0.36 × 1 × 0.063 ≈ 4.6 × 10^-6

Most likely sequence, even though P(flies|V) > P(flies|N)
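
A quick check of the arithmetic on this slide (the transition probabilities 0.29, 0.43, 0.65 and 1 come from the tagging network in the lecture figures):

```python
from math import prod

# P(N|start) P(flies|N) P(V|N) P(like|V) P(ART|V) P(a|ART) P(N|ART) P(flower|N)
factors = [0.29, 0.025, 0.43, 0.1, 0.65, 0.36, 1.0, 0.063]
print(prod(factors))  # ≈ 4.6e-06
```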


COMP9414 Language Models 17

Windows

• Consider co-occurrences in a window of words w1, ..., wn around w

• Senses of a word should co-occur with classes of “related” words

• Choose sense s of w to maximize P(w as s | w1, ..., wn)

• Apply Bayes’ Rule

◮ Maximize P(w1, ..., wn | w as s) · P(w as s) / P(w1, ..., wn)

• Apply independence assumptions

◮ P(w1, ..., wn | w as s) = ΠP(wi | w as s)

• Estimate probabilities: P(wi | w as s)

◮ #(wi in n-word window around w as s) / #(windows on w as s)
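
A minimal sketch of this naive-Bayes style sense chooser in Python, assuming per-sense window counts collected from a sense-tagged corpus; the class and method names are illustrative, and no smoothing is applied.

```python
from collections import Counter, defaultdict

class WindowWSD:
    """Choose the sense s of w maximizing P(w as s) * prod_i P(w_i | w as s)."""

    def __init__(self):
        self.sense_windows = Counter()               # #(windows on w as s)
        self.context_counts = defaultdict(Counter)   # #(w_i in windows around w as s)

    def observe(self, sense, window_words):
        """Record one sense-tagged occurrence of w and its window."""
        self.sense_windows[sense] += 1
        for w in window_words:
            self.context_counts[sense][w] += 1

    def disambiguate(self, window_words):
        """Return the sense with the highest (unsmoothed) naive-Bayes score."""
        total = sum(self.sense_windows.values())
        best_sense, best_score = None, -1.0
        for sense, n in self.sense_windows.items():
            score = n / total                        # P(w as s)
            for w in window_words:
                # P(w_i | w as s) estimated exactly as on the slide
                score *= self.context_counts[sense][w] / n
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense
```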


COMP9414 Language Models 16

Word Sense Disambiguation

Example

I should have changed that stupid lock and made you leave your key, if I’d known for just one second you’d be back to bother me.

lock = · · ·

leave = · · ·

second = · · ·

back = · · ·


COMP9414 Language Models 19

Mutual Information

MI(x, y) = log2 [ P(x, y) / (P(x)·P(y)) ]

MI(sense(w1), w2) = log2 [ N·#(sense(w1), w2) / (#(sense(w1))·#(w2)) ]

where N is the size of the corpus

• MI = 0: sense(w1) and w2 occur independently

• MI < 0: sense(w1) and w2 occur together less than randomly

• MI > 0: sense(w1) and w2 occur together more than randomly

• Adding mutual information over the window words is equivalent to assuming their independence

• Choose sense s for w = argmax_{s ∈ senses(w)} Σ_{wi ∈ window(w)} MI(s, wi)
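
A small Python sketch of choosing a sense by summed mutual information, using the estimate above; the count and joint dictionaries, the corpus size and the bank/river vs. bank/finance senses are made-up illustrations, not lecture data.

```python
from math import log2

def mutual_information(n_joint, n_sense, n_word, N):
    """MI(sense(w1), w2) = log2( N * #(sense(w1), w2) / (#(sense(w1)) * #(w2)) )."""
    return log2(N * n_joint / (n_sense * n_word))

def choose_sense(senses, window_words, joint, count, N):
    """Pick the sense with the largest summed MI over the window (zero-count pairs skipped)."""
    def total_mi(sense):
        return sum(mutual_information(joint[(sense, w)], count[sense], count[w], N)
                   for w in window_words if joint.get((sense, w), 0) > 0)
    return max(senses, key=total_mi)

# Illustrative counts from a hypothetical corpus of N = 1,000,000 tokens.
N = 1_000_000
count = {"bank/river": 500, "bank/finance": 5000, "water": 2000, "loan": 1500}
joint = {("bank/river", "water"): 40, ("bank/finance", "loan"): 300,
         ("bank/finance", "water"): 2}
print(choose_sense(["bank/river", "bank/finance"], ["water"], joint, count, N))  # bank/river
```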


COMP9414 Language Models 18

Simple (Made Up) Example

Word         bridge/structure   bridge/dental   any window
teeth        1                  10              300
suspension   200                1               2000
the          5500               180             500 000
dentist      2                  35              900
TOTAL        5651               194             501 500

P(bridge/structure) = 5651 / 501 500 = 0.0113

P(bridge/dental) = 194 / 501 500 = 3.87 × 10^-4

P(teeth|bridge/structure) = 1/5651 = 1.77 × 10^-4

P(teeth|bridge/dental) = 10/194 = 0.052

bridge/dental preferred if window contains ‘teeth’
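
The preference can be checked directly from the table by comparing P(sense)·P(teeth|sense) for the two senses (a short sketch using the slide's counts):

```python
counts = {"bridge/structure": 5651, "bridge/dental": 194}    # windows on each sense
teeth = {"bridge/structure": 1, "bridge/dental": 10}         # 'teeth' in those windows
total_windows = 501_500                                      # any window (from the table)

for sense in counts:
    prior = counts[sense] / total_windows        # P(sense)
    likelihood = teeth[sense] / counts[sense]    # P(teeth | sense)
    print(sense, prior * likelihood)
# bridge/structure ≈ 2.0e-06, bridge/dental ≈ 2.0e-05  ->  bridge/dental preferred
```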


COMP9414 Language Models 21

Conclusion

• Statistical (and neural network) models perform well on many tasks

◮ Part-of-speech tagging

◮ Word sense disambiguation

◮ Control of traditional parser

◮ Probabilistic parsing

• Problems

◮ Unrealistic simplifying assumptions (that seem to work)

◮ Requirement for very large amount of (labelled) text

◮ Sparsity of word occurrences in (even large) text corpora

◮ Changes in word usage over time (e.g. Senator Obama)


COMP9414 Language Models 20

Class-Based Methods

• Use predefined “sense classes”, e.g. WordNet, Wikipedia

◮ lock → Mechanical Devices ← tool, crank, cog, ...

◮ lock → Body Part ← hair, eyes, hands, ...

• Calculate counts for word senses by adding those for words

• Advantages

◮ Reduces space and time complexity

◮ Reduces data sparsity

◮ Allows unsupervised learning
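
A minimal sketch of pooling window counts by sense class, assuming a small hand-built class lexicon; the class memberships beyond those on the slide (e.g. 'key', 'door', 'curl') are illustrative additions, not taken from WordNet.

```python
from collections import Counter

# Illustrative sense classes for "lock" (extended slightly beyond the slide's examples).
sense_classes = {
    "lock/Mechanical Devices": {"tool", "crank", "cog", "key", "door"},
    "lock/Body Part": {"hair", "eyes", "hands", "curl"},
}

def class_counts(window_words):
    """Count window words per sense class by adding the counts of member words."""
    counts = Counter()
    for w in window_words:
        for sense, members in sense_classes.items():
            if w in members:
                counts[sense] += 1
    return counts

window = ["changed", "that", "stupid", "lock", "key", "door"]
print(class_counts(window).most_common(1))  # [('lock/Mechanical Devices', 2)]
```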
