COMP9414 Language Models 1
Probabilistic Language Models
• Based on statistics derived from large corpus of text/speech
  ◮ Brown Corpus (1960s) – 1 million words
  ◮ Penn Treebank (1980s) – 7 million words
  ◮ North American News (1990s) – 350 million words
  ◮ IBM – 1 billion words
  ◮ Google & Facebook – trillions of words
• Contrary to view that language ability based on (innate) knowledge
• Idea is language ability can be learnt . . . with enough data . . .
COMP9414: Artificial Intelligence
Lecture 5b: Language Models
Wayne Wobcke
COMP9414 Language Models 3
Penn Treebank Tagset
COMP9414 Language Models 2
This Lecture
• Part of Speech Tagging
  ◮ n-gram Models
  ◮ Hidden Markov Models
  ◮ Viterbi Algorithm
• Word Sense Disambiguation
  ◮ Mutual Information
  ◮ Class-Based Models
COMP9414 Language Models 5
Why is this Hard?
Ambiguity, e.g. back
• earnings growth took a back/JJ seat
• a small building in the back/NN
• a clear majority of senators back/VBP the bill
• Dave began to back/VB toward the door
• enable the country to buy back/RP about debt
• I was twenty-one back/RB then
COMP9414 Language Models 4
Part of Speech Tagging
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• There/EX are/VBP 70/CD children/NNS there/RB
• Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.
COMP9414 Language Models 7
Unigram Model
Maximize P(w1, · · · ,wn|t1, · · · ,tn) . P(t1, · · · ,tn)
• Apply independence assumptions
  ◮ P(w1, · · · ,wn|t1, · · · ,tn) = P(w1|t1) . · · · . P(wn|tn)
  ◮ Probability of word w being generated by tag t is independent of context
  ◮ P(t1, · · · ,tn) = P(t1) . · · · . P(tn)
  ◮ Probability of each tag is independent of the other tags
• Estimate probabilities
  ◮ P(w|t) = #(w occurs with tag t)/#(words with tag t)
  ◮ P(t) = #(words with tag t)/#(words)
• Choose tag sequence that maximizes ΠP(ti|wi)
  ◮ Equivalent, since P(wi|ti) . P(ti) = P(ti|wi) . P(wi) and P(wi) is fixed
  ◮ Chooses the most common tag for each word
  ◮ Accuracy around 90% – but still ≈1 word wrong in every sentence!
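As a concrete illustration, here is a minimal Python sketch of such a unigram (most-frequent-tag) tagger; the tiny corpus and the NN fallback for unknown words are made-up assumptions, not part of the lecture:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    # Count how often each word occurs with each tag.
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    # argmax_t P(t|w) for each word independently is just
    # "pick the tag this word was seen with most often".
    return {w: counts.most_common(1)[0][0] for w, counts in tag_counts.items()}

def unigram_tag(words, most_common_tag, default="NN"):
    # Unknown words fall back to a default tag (an assumption, not from the slides).
    return [(w, most_common_tag.get(w, default)) for w in words]

# Tiny made-up training corpus of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("the", "DT"), ("back", "NN"), ("seat", "NN")],
    [("senators", "NNS"), ("back", "VBP"), ("the", "DT"), ("bill", "NN")],
]
tagger = train_unigram_tagger(corpus)
print(unigram_tag(["senators", "back", "the", "bill"], tagger))
# 'back' comes out as NN (its most common tag), even though VBP is correct here.
```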
COMP9414 Language Models 6
Probabilistic Formulation
• Events: occurrence of word w, occurrence of a word with tag t
• Given sequence of words w1, · · · ,wn, choose t1, · · · ,tn so that P(t1, · · · ,tn|w1, · · · ,wn) is maximized
• Apply Bayes’ Rule
  ◮ P(t1, · · · ,tn|w1, · · · ,wn) = P(w1, · · · ,wn|t1, · · · ,tn) . P(t1, · · · ,tn) / P(w1, · · · ,wn)
  ◮ Therefore maximize P(w1, · · · ,wn|t1, · · · ,tn) . P(t1, · · · ,tn)
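Spelled out, the step that licenses dropping the denominator is that P(w1, · · · ,wn) is the same for every candidate tag sequence:

```latex
\hat{t}_{1:n}
  = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
  = \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})}{P(w_{1:n})}
  = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})
```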
COMP9414 Language Models 9
Hidden Markov Models
• Bayesian network
  ◮ P(S0) specifies initial conditions
  ◮ P(Si+1|Si) specifies dynamics
  ◮ P(Oi|Si) specifies “observations”
• Independence assumptions
  ◮ P(Si+1|S0, · · · ,Si) = P(Si+1|Si) (Markov chain)
  ◮ P(Oi|S0, · · · ,Si−1,Si, O0, · · · ,Oi−1) = P(Oi|Si)
  ◮ Observations (words) depend only on the current state (tag)
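Together these assumptions give the usual factorisation of the joint distribution over states (tags) and observations (words):

```latex
P(S_0,\dots,S_n,\, O_0,\dots,O_n)
  = P(S_0)\, P(O_0 \mid S_0) \prod_{i=1}^{n} P(S_i \mid S_{i-1})\, P(O_i \mid S_i)
```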
COMP9414 Language Models 8
Markov Chain
• Bayesian network
  ◮ P(S0) specifies initial conditions
  ◮ P(Si+1|Si) specifies dynamics (stationary if the same for each i)
• Independence assumptions
  ◮ P(Si+1|S0, · · · ,Si) = P(Si+1|Si)
  ◮ Transition probabilities depend only on the current state Si, independent of the history S0, · · · ,Si−1 used to reach that state
  ◮ The future is independent of the past, given the present
COMP9414 Language Models 11
Markov Model for POS Tagging
Transition probabilities define stationary distribution
COMP9414 Language Models 10
Bigram Model
Maximize P(w1, · · · ,wn|t1, · · · ,tn) . P(t1, · · · ,tn)
• Apply independence assumptions (Markov assumptions)
  ◮ P(w1, · · · ,wn|t1, · · · ,tn) = ΠP(wi|ti)
  ◮ Observations (words) depend only on states (tags)
  ◮ P(t1, · · · ,tn) = P(tn|tn−1) . · · · . P(t1|φ), where φ = start
  ◮ Bigram model: state (tag) depends only on previous state (tag)
• Estimate probabilities
  ◮ P(ti|tj) = #(tj, ti occurs)/#(tj starts a bigram)
• Choose tag sequence that maximizes ΠP(wi|ti) . P(ti|ti−1)
  ◮ Parts of speech generated by a finite state machine
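A minimal Python sketch of these estimates, using unsmoothed counts from a tagged corpus (the dictionary layout and the <s> start symbol are implementation choices, not part of the lecture):

```python
from collections import Counter

START = "<s>"  # plays the role of the start symbol phi on the slide

def train_bigram_hmm(tagged_sentences):
    """Unsmoothed maximum-likelihood estimates for a bigram HMM tagger."""
    bigrams = Counter()      # #(t_j, t_i) tag-pair occurrences
    starts = Counter()       # #(t_j starts a bigram)
    emissions = Counter()    # #(word w occurs with tag t)
    tag_totals = Counter()   # #(words with tag t)
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            bigrams[(prev, tag)] += 1
            starts[prev] += 1
            emissions[(tag, word)] += 1
            tag_totals[tag] += 1
            prev = tag
    trans_p = {(tj, ti): c / starts[tj] for (tj, ti), c in bigrams.items()}  # P(t_i|t_j)
    emit_p = {(t, w): c / tag_totals[t] for (t, w), c in emissions.items()}  # P(w|t)
    return trans_p, emit_p

def score(words, tags, trans_p, emit_p):
    # Product of P(w_i|t_i) . P(t_i|t_{i-1}) along the sequence; 0 for unseen pairs.
    p, prev = 1.0, START
    for w, t in zip(words, tags):
        p *= emit_p.get((t, w), 0.0) * trans_p.get((prev, t), 0.0)
        prev = t
    return p
```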
COMP9414 Language Models 13
Computing Probabilities
COMP9414 Language Models 12
Hidden Markov Model for POS Tagging
COMP9414 Language Models 15
Viterbi Algorithm
1. Sweep forward (one word at a time) saving only the most likely
sequence (and its probability) for each tag ti of wi
2. Select highest probability final state
3. Follow chain backwards to extract tag sequence
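A minimal Python sketch of this procedure, assuming transition and emission probabilities stored as dictionaries in the shape of the earlier bigram sketch (unseen pairs default to probability 0; no smoothing):

```python
def viterbi(words, tags, trans_p, emit_p, start="<s>"):
    """Most likely tag sequence for words under a bigram HMM."""
    # best[i][t] = (probability, previous tag) of the best path ending with tag t at word i.
    best = [{} for _ in words]
    for t in tags:
        best[0][t] = (trans_p.get((start, t), 0.0) * emit_p.get((t, words[0]), 0.0), None)
    for i in range(1, len(words)):
        for t in tags:
            prob, prev = max(
                (best[i - 1][tp][0] * trans_p.get((tp, t), 0.0)
                 * emit_p.get((t, words[i]), 0.0), tp)
                for tp in tags
            )
            best[i][t] = (prob, prev)
    # Select the highest-probability final state, then follow the chain backwards.
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```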
COMP9414 Language Models 14
Example
w        P(w|ART)   P(w|N)   P(w|P)   P(w|V)
a          0.36     0        0        0
flies      0        0.025    0        0.076
flower     0        0.063    0        0.05
like       0        0.012    0        0.1

P(flies/N like/V a/ART flower/N)
  = P(N|start) . P(flies|N) . P(V|N) . P(like|V) . P(ART|V) . P(a|ART) . P(N|ART) . P(flower|N)
  = 0.29 × 0.025 × 0.43 × 0.1 × 0.65 × 0.36 × 1 × 0.063

Most likely sequence, even though P(flies|V) > P(flies|N)
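A quick check of the arithmetic, multiplying the slide's probabilities together (the transition probabilities come from the tagger's Markov model, the lexical probabilities from the table above):

```python
factors = {
    "P(N|start)": 0.29, "P(flies|N)": 0.025, "P(V|N)": 0.43, "P(like|V)": 0.1,
    "P(ART|V)": 0.65, "P(a|ART)": 0.36, "P(N|ART)": 1.0, "P(flower|N)": 0.063,
}
p = 1.0
for value in factors.values():
    p *= value
print(f"P(flies/N like/V a/ART flower/N) = {p:.2e}")  # about 4.60e-06
```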
COMP9414 Language Models 17
Windows
• Consider co-occurrences in a window about w
      w1 · · · w · · · wn
• Senses of a word should co-occur with classes of “related” words
• Choose sense s of w to maximize P(w as s|w1, · · · ,wn)
• Apply Bayes’ Rule
  ◮ Maximize P(w1, · · · ,wn|w as s) . P(w as s) / P(w1, · · · ,wn)
• Apply independence assumptions
  ◮ P(w1, · · · ,wn|w as s) = ΠP(wi|w as s)
• Estimate probabilities: P(wi|w as s)
  ◮ #(wi in n-word window around w as s)/#(windows on w as s)
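A minimal Python sketch of this window-based disambiguator (a naive Bayes classifier over context words); the training-data format and the add-alpha smoothing are assumptions added so that unseen words do not zero out a sense:

```python
from collections import Counter
from math import log

def train_windows(labelled_windows):
    """labelled_windows: list of (sense, [context words]) pairs for one target word w."""
    sense_counts = Counter()        # #(windows on w as s)
    word_given_sense = {}           # sense -> Counter of #(w_i in windows on w as s)
    for sense, window in labelled_windows:
        sense_counts[sense] += 1
        word_given_sense.setdefault(sense, Counter()).update(window)
    return sense_counts, word_given_sense

def best_sense(window, sense_counts, word_given_sense, alpha=0.1):
    """argmax_s P(w as s) . prod_i P(w_i | w as s), computed in log space."""
    total_windows = sum(sense_counts.values())
    vocab = {w for c in word_given_sense.values() for w in c}
    scores = {}
    for sense, n_windows in sense_counts.items():
        counts = word_given_sense[sense]
        score = log(n_windows / total_windows)  # log P(w as s)
        for w in window:
            # P(w_i | w as s) = #(w_i in windows on w as s) / #(windows on w as s),
            # smoothed with add-alpha (an assumption, not on the slide).
            score += log((counts[w] + alpha) / (n_windows + alpha * len(vocab)))
        scores[sense] = score
    return max(scores, key=scores.get)
```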
COMP9414 Language Models 16
Word Sense Disambiguation
Example
I should have changed that stupid lock and made you leave your
key, if I’d known for just one second you’d be back to bother me.
lock = · · ·
leave = · · ·
second = · · ·
back = · · ·
COMP9414 Language Models 19
Mutual Information
MI(x,y) = log2
P(x,y)
P(x).P(y)
MI(sense(w1),w2) = log2
N.#(sense(w1),w2)
#(sense(w1)).#(w2)
where N is the size of the corpus
� MI = 0: sense(w1) and w2 are conditionally independent
� MI < 0: sense(w1) and w2 occur together less than randomly � MI > 0: sense(w1) and w2 occur together more than randomly
� Adding mutual information is equivalent to assuming independence
� Choose sense s for w = argmaxs∈senses(w)Σwi∈window(w)MI(s,wi)
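A minimal Python sketch of choosing a sense by summed mutual information; the layout of the count tables is an assumption:

```python
from math import log2

def mutual_information(n_joint, n_sense, n_word, corpus_size):
    """MI(sense(w1), w2) = log2( N . #(sense(w1), w2) / (#(sense(w1)) . #(w2)) )."""
    return log2(corpus_size * n_joint / (n_sense * n_word))

def choose_sense(cooc_counts, sense_counts, word_counts, corpus_size):
    """Pick the sense whose summed MI with the window words is largest.

    cooc_counts: sense -> {word: #(sense, word) co-occurrences in the corpus}.
    """
    best, best_score = None, float("-inf")
    for sense, cooc in cooc_counts.items():
        score = sum(
            mutual_information(c, sense_counts[sense], word_counts[w], corpus_size)
            for w, c in cooc.items() if c > 0   # skip pairs never seen together
        )
        if score > best_score:
            best, best_score = sense, score
    return best
```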
COMP9414 Language Models 18
Simple (Made Up) Example
Word          bridge/structure   bridge/dental   any window
teeth                        1              10          300
suspension                 200               1         2000
the                       5500             180      500 000
dentist                      2              35          900
TOTAL                     5651             194      501 500

P(bridge/structure) = 5651/501 500 = 0.0113
P(bridge/dental) = 194/501 500 = 3.87 × 10^−4
P(teeth|bridge/structure) = 1/5651 = 1.77 × 10^−4
P(teeth|bridge/dental) = 10/194 = 0.052

bridge/dental preferred if the window contains ‘teeth’
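Reproducing the slide's calculation, with the counts taken directly from the table above:

```python
counts = {  # word -> (bridge/structure, bridge/dental, any window)
    "teeth": (1, 10, 300), "suspension": (200, 1, 2000),
    "the": (5500, 180, 500_000), "dentist": (2, 35, 900),
}
total = (5651, 194, 501_500)  # TOTAL row

p_structure = total[0] / total[2]                  # P(bridge/structure)       ~ 0.0113
p_dental = total[1] / total[2]                     # P(bridge/dental)          ~ 3.87e-4
p_teeth_structure = counts["teeth"][0] / total[0]  # P(teeth|bridge/structure) ~ 1.77e-4
p_teeth_dental = counts["teeth"][1] / total[1]     # P(teeth|bridge/dental)    ~ 0.052

# The dental sense wins on a window containing 'teeth':
print(p_dental * p_teeth_dental > p_structure * p_teeth_structure)  # True
```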
COMP9414 Language Models 21
Conclusion
• Statistical (and neural network) models perform well on many tasks
  ◮ Part-of-speech tagging
  ◮ Word sense disambiguation
  ◮ Control of traditional parser
  ◮ Probabilistic parsing
• Problems
  ◮ Unrealistic simplifying assumptions (that seem to work)
  ◮ Requirement for very large amount of (labelled) text
  ◮ Sparsity of word occurrences in (even large) text corpora
  ◮ Changes in word usage over time (e.g. Senator Obama)
COMP9414 Language Models 20
Class-Based Methods
• Use predefined “sense classes”, e.g. from WordNet or Wikipedia
  ◮ lock → Mechanical Devices ← tool, crank, cog, · · ·
  ◮ lock → Body Part ← hair, eyes, hands, · · ·
• Calculate counts for word senses by adding the counts for the words in each class (see the sketch below)
• Advantages
  ◮ Reduces space and time complexity
  ◮ Reduces data sparsity
  ◮ Allows unsupervised learning
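A minimal Python sketch of the counting step; the class memberships below are purely illustrative stand-ins for WordNet-style classes:

```python
from collections import Counter

# Illustrative class memberships (stand-ins for WordNet/Wikipedia sense classes).
classes = {
    "Mechanical Devices": {"lock", "tool", "crank", "cog"},
    "Body Part": {"lock", "hair", "eyes", "hands"},
}

def class_cooccurrence_counts(word_cooc_counts):
    """Sum per-word co-occurrence counts into per-class counts, reducing sparsity."""
    class_counts = {c: Counter() for c in classes}
    for (word, context_word), n in word_cooc_counts.items():
        for c, members in classes.items():
            if word in members:
                class_counts[c][context_word] += n
    return class_counts

# Counts can be gathered from raw text, with no sense labels needed:
raw = {("crank", "key"): 3, ("hair", "key"): 1, ("cog", "turn"): 2}
print(class_cooccurrence_counts(raw))
```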
UNSW ©W. Wobcke et al. 2019–2021