Applying HMMs to Part-of-Speech
(POS) Tagging
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A
CMPSC 442
Week 9, Meeting 27, Three Segments
Outline
● Decoding: the Viterbi Algorithm
● Part-of-Speech (POS) Problem
● Viterbi POS Tagger
Week 9, Meeting 27, Segment 1 of 3
Likelihood versus Decoding: Ice Cream Example
Given an HMM λ:
● Computing the likelihood of a sequence of observations P(o1,o2,o3 )
relies on the forward algorithm
○ The trellis entries carry forward all paths to each state that can emit oi
○ The likelihood involves summing over combinations of state sequences
that could produce a given observation sequence
● Computing the most likely sequence of states given the observations
(decoding) P(Q1,Q2,Q3|o1,o2,o3 ) relies on the Viterbi algorithm
○ The trellis entries carry forward only the maximum probability path to each state that can emit oi
○ Decoding finds the single maximum probability state sequence
Decoding in Brief
● Given as input an HMM λ = (A,B) and a sequence of observations O=o1,
o2, …, oT, find the most probable sequence of states Q=q1,q2,…,qT
● What is the single most likely sequence of states that generated o1=3,
o2=1, o3=3?
Forward Recursion versus Viterbi Recursion
● Forward recursion: αt(j) = ∑i αt-1(i) aij bj(ot) (sum over all paths into state j)
● Viterbi recursion: vt(j) = maxi vt-1(i) aij bj(ot) (maximum over paths into state j)
Viterbi Recursion
● Viterbi recursion computes the maximum probability of any path ending in
state j at time t, given the observations o1, . . . , ot
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
Back Tracing
● Viterbi recursion computes the maximum probability path to state j at
time T given the observations o1, . . . , oT
● Viterbi must also identify the single complete path that achieves this
maximum probability
○ Keep backpointers: at each time step, record which previous state gave the maximum
○ Find the best final state: qT* = argmaxj vT(j)
○ Trace backpointers from state qT* at time T to recover the state sequence
from T back to 1
Viterbi Algorithm
● Initialization (t=1): v1(j) = πj bj(o1); bt1(j) = 0
● Induction and backtracing (2 ≤ t ≤ T): vt(j) = maxi vt-1(i) aij bj(ot); btt(j) = argmaxi vt-1(i) aij
● Termination: P* = maxj vT(j); qT* = argmaxj vT(j)
● Backpointer path: qt* = btt+1(qt+1*), for t = T-1 down to 1
Viterbi Trellis for Ice Cream Example: Induction Step
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
Viterbi Trellis for Ice Cream Example: Backtracing
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
Viterbi Pseudo Code
Figure from Jurafsky & Martin, 3rd Ed., Appendix A
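The pseudo code described above can be sketched as a short Python function. The two-state weather HMM at the bottom uses made-up illustrative numbers in the spirit of the ice cream example, not necessarily the textbook's exact parameters:

```python
# Viterbi decoding for an HMM, following the initialization / induction /
# termination / backtrace steps on the preceding slides.

def viterbi(obs, states, pi, A, B):
    """Return (best_path, best_prob) for the observation sequence obs."""
    T = len(obs)
    # v[t][s]: max probability of any path ending in state s at time t
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]
    bp = [{}]  # bp[t][s]: predecessor of s on that best path
    for t in range(1, T):
        v.append({})
        bp.append({})
        for s in states:
            prev, prob = max(
                ((r, v[t - 1][r] * A[r][s]) for r in states),
                key=lambda x: x[1],
            )
            v[t][s] = prob * B[s][obs[t]]
            bp[t][s] = prev
    # Termination: pick the best final state, then trace backpointers to t=1
    last = max(states, key=lambda s: v[T - 1][s])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(bp[t][path[-1]])
    path.reverse()
    return path, v[T - 1][last]

# Hypothetical two-state weather HMM (H=hot, C=cold); observations are
# ice creams eaten per day; all numbers are assumptions for illustration.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

path, prob = viterbi([3, 1, 3], states, pi, A, B)
```

With these assumed parameters, decoding the observation sequence 3, 1, 3 yields the state path H, C, H with probability 0.0128.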
Week 9, Meeting 27, Segment 2 of 3
Parts of Speech
● Parts of speech: traditional grammatical categories like “noun,” “verb,”
“adjective,” “adverb” . . . (and many more)
● Origins in ancient Greek monographs on grammar and parts of
speech (c. 100 B.C., Dionysius Thrax, Tryphon)
● Functions:
○ Help distinguish different word meanings: N(oun) vs. V(erb)
■ E.g.: river bank (N) vs. she banks (V) at a credit union
■ E.g.: a bear (N) will go after honey vs. bear (V) with me
○ Preprocessing for many other natural language processing tasks
POS Tagging
Pervasive in Natural Language Applications
● Machine Translation
○ Translations of nouns and verbs have different forms (prefixes, suffixes)
● Speech synthesis
○ “Close” as an adjective versus verb
○ See the noun/verb stress table below
● Sense disambiguation of word meanings
○ “Close” (adjective) as in near something
○ “Close” (verb) as in shut something
Noun | Verb
IN’sult | inSULT’
OB’ject | obJECT’
OVER’flow | overFLOW’
DIS’count | disCOUNT’
CON’tent | conTENT’
Majority Class Baseline for POS Tagging
● Assign each word its most frequent tag
○ 83.3% correct here (5 of 6 words: “heat” is a verb in “heat oil in a
large pot” but its majority tag is noun)
○ Typically about 90% correct overall
Word | POS listing in Brown Corpus
heat | noun (86), verb (4)
oil | noun (87)
in | prep (18,975), noun (1), adv-particle (4)
a | det (21,872), noun (1)
large | adj (347), noun (2), adv (5)
pot | noun (28)
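The baseline can be sketched directly from the counts above. The gold tags for “heat oil in a large pot” are an assumption for illustration, reading “heat” as an imperative verb:

```python
# Majority-class POS baseline: tag every word with its most frequent tag.
# Counts are taken from the Brown Corpus table above; the gold tags for
# "heat oil in a large pot" are assumed for illustration.

counts = {
    "heat":  {"noun": 86, "verb": 4},
    "oil":   {"noun": 87},
    "in":    {"prep": 18975, "noun": 1, "adv-particle": 4},
    "a":     {"det": 21872, "noun": 1},
    "large": {"adj": 347, "noun": 2, "adv": 5},
    "pot":   {"noun": 28},
}

def majority_tag(word):
    # Pick the tag with the highest corpus count for this word.
    return max(counts[word], key=counts[word].get)

sentence = ["heat", "oil", "in", "a", "large", "pot"]
gold = ["verb", "noun", "prep", "det", "adj", "noun"]  # "heat" is a verb here

predicted = [majority_tag(w) for w in sentence]
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Only “heat” is mistagged (majority tag noun, gold tag verb), giving 5/6 ≈ 83.3% accuracy.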
POS Tagsets: Penn TreeBank
Penn Treebank, developed from 1989-1996, and very widely used ever
since (treebank ≡ a bank of syntactic trees, or other linguistic annotations)
● The first large annotated corpus used for training machine learned models for
natural language processing (NLP) tasks; 7M words total
● 7M words of part-of-speech tagged text
○ |POS| = 36 word tags plus 12 tags for punctuation and other symbols (48 total)
● 3M words of text with syntactic parses
● 2M words of text annotated with predicate-argument structure
● 1.6M words of transcripts of spoken language annotated for disfluency
Table Showing 11 of 36 Penn TreeBank POS Tags
Tag | Description | Example
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun, plural | tables
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
VBD | verb, past tense | took
VBZ | verb, 3rd person sing. present | takes
POS Tags: Hidden States, Inherently Sequential
● Words are observed; part-of-speech tags are hidden
○ Cf. observed umbrellas vs. hidden weather states in the umbrella world
○ Cf. observed ice creams vs. hidden weather states in the ice cream world
● Likely POS sequences
○ JJ NN (delicious food, large pot)
○ NNS VBD (people voted, planes landed)
● Unlikely POS sequences
○ NN JJ (food delicious, pot large)
○ NNS VBZ (people votes, planes lands)
HMM POS Tagger as a Bayesian Network
● Condition the hidden POS tag pi at position i on the previous POS tag:
P(pi|pi-1 )
● Condition the word wi at position i on its POS tag: P(wi|pi )
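As a sketch, the two conditional tables multiply into a joint score for a candidate tag sequence, P(p1…pn, w1…wn) = ∏i P(pi|pi-1) P(wi|pi). All probabilities below are hypothetical toy values:

```python
# Scoring a (tags, words) pair under the HMM factorization described above.
# All probability values here are made-up toy numbers for illustration.

transition = {("<s>", "JJ"): 0.1, ("JJ", "NN"): 0.5,
              ("<s>", "NN"): 0.3, ("NN", "JJ"): 0.01}
emission = {("JJ", "large"): 0.02, ("NN", "pot"): 0.001,
            ("NN", "large"): 0.0001, ("JJ", "pot"): 0.0001}

def joint_score(tags, words):
    prob = 1.0
    prev = "<s>"  # sentence-start pseudo-tag
    for tag, word in zip(tags, words):
        prob *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return prob

likely = joint_score(["JJ", "NN"], ["large", "pot"])    # adjective-noun order
unlikely = joint_score(["NN", "JJ"], ["large", "pot"])  # noun-adjective order
```

With these toy tables, the likely sequence JJ NN scores several orders of magnitude higher than NN JJ, mirroring the likely/unlikely POS sequences above.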
Week 9, Meeting 27, Segment 3 of 3
HMM POS Tagger
● By Bayes Rule: P(T|W) = P(W|T) P(T) / P(W)
● To tag a word sequence, find the tag sequence with the maximum
posterior probability: T* = argmaxT P(W|T) P(T)
HMM POS Tagger
● Assume the words wi are conditionally independent given the tags ti
● Assume each tag ti depends only on the previous tag ti-1
● For a given sequence of words W = w1w2w3…wn find the sequence of
tags T = t1t2t3…tn which maximizes ∏i P(wi|ti ) P(ti|ti-1 )
Training Corpus Size (BOTEC: Back-of-the-Envelope Calculation)
● What parameters can we
estimate with a million words of
hand tagged training data?
○ Assume a uniform distribution over a 5,000-word vocabulary and
40 part-of-speech tags
Event | Count | Estimate Quality
tag unigrams | 40 | Excellent
tag bigrams | 1,600 | Excellent
tag trigrams | 64,000 | Okay
tag 4-grams | 2.56 × 10^6 | Poor
word unigrams | 5,000 | Very Good
word bigrams | 25M | Poor
word, tag pairs | 200,000 | Okay
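The counts in the table follow from simple arithmetic over 40 tags and a 5,000-word vocabulary; a quick check:

```python
# Back-of-the-envelope parameter counts for the table above,
# assuming 40 POS tags and a 5,000-word vocabulary.
tags, vocab = 40, 5000

assert tags ** 1 == 40            # tag unigrams
assert tags ** 2 == 1_600         # tag bigrams
assert tags ** 3 == 64_000        # tag trigrams
assert tags ** 4 == 2_560_000     # tag 4-grams (~2.56 x 10^6)
assert vocab ** 2 == 25_000_000   # word bigrams (25M)
assert vocab * tags == 200_000    # (word, tag) pairs
```

With only a million words of training data, events whose parameter cells far outnumber the tokens (tag 4-grams, word bigrams) are necessarily poorly estimated.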
Streetlight Effect with a Positive Slant
● Rich models often require vast amounts of data
● Well-estimated models that are not quite “correct” often outperform
badly estimated richer models
HMM POS Tagger Parameters
● Q = {q1, q2 , . . . , qn}, the set of states (|Q| = n = 36 for Penn TreeBank)
● A = aij transition probabilities for all 1,296 ordered tag pairs (qi, qj ) s. t. ∑j aij = 1
● O = o1o2 . . . oT sequences of T observations from a vocabulary V
(words w, arranged in sentences length T): training corpus
● B = bi(ot ) for the 36×|V| observation likelihoods of observations ot
generated from states i
● π = π1, π2, . . . , πn where πi is the probability that the Markov chain will
start with state i s. t. ∑i πi= 1
MLE Estimates for the HMM POS Tagger
Given an annotated corpus (words with POS tags)
● Priors: probability that an initial state is tag qi: πi = C(sentences beginning with qi ) / C(sentences)
● Sensor model: conditional probability of word wi given tag qi: P(wi|qi ) = C(qi, wi ) / C(qi )
● Transition model: conditional probability of tag qj given tag qi: P(qj|qi ) = C(qi qj ) / C(qi )
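A minimal sketch of these relative-frequency (MLE) estimates, using a made-up two-sentence corpus:

```python
# MLE estimates for the HMM POS tagger from a (word, tag)-annotated corpus.
# The two-sentence toy corpus below is a made-up illustration.
from collections import Counter

corpus = [
    [("large", "JJ"), ("pots", "NNS")],
    [("people", "NNS"), ("voted", "VBD")],
]

tag_count = Counter()
emit_count = Counter()    # (tag, word) pairs
trans_count = Counter()   # (prev_tag, tag) pairs
init_count = Counter()    # tag at sentence start

for sent in corpus:
    init_count[sent[0][1]] += 1
    prev = None
    for word, tag in sent:
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        if prev is not None:
            trans_count[(prev, tag)] += 1
        prev = tag

def prior(tag):             # pi_i = C(tag starts sentence) / C(sentences)
    return init_count[tag] / len(corpus)

def emission(word, tag):    # P(word | tag) = C(tag, word) / C(tag)
    return emit_count[(tag, word)] / tag_count[tag]

def transition(tag, prev):  # P(tag | prev) = C(prev tag) / C(prev)
    return trans_count[(prev, tag)] / tag_count[prev]
```

For this toy corpus, for example, prior("JJ") = 1/2 and emission("pots", "NNS") = 1/2, since NNS occurs twice and emits "pots" once.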
Viterbi Decoding for POS Tagging
● Because many of these counts are small or do not occur, smoothing is
necessary for best results
● HMM taggers typically achieve about 95-96% accuracy with standard
POS tagsets of 36-42 tags
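One common smoothing choice is add-k (Laplace) smoothing of the emission counts; the function below is an illustrative sketch, and the counts in the example are hypothetical:

```python
# Add-k (Laplace) smoothing for emission probabilities, one common way to
# handle the zero and low counts mentioned above. k and the example counts
# are illustrative assumptions.

def smoothed_emission(count_tag_word, count_tag, vocab_size, k=1.0):
    # P(word | tag) = (C(tag, word) + k) / (C(tag) + k * |V|)
    return (count_tag_word + k) / (count_tag + k * vocab_size)

# An unseen (tag, word) pair gets a small nonzero probability:
p_unseen = smoothed_emission(0, 100, 5000)   # 1 / 5100
p_seen = smoothed_emission(40, 100, 5000)    # 41 / 5100
```

Unseen pairs receive a small nonzero probability instead of zero, at the cost of shaving some probability mass from observed pairs.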
Summary
● The Viterbi algorithm is used to compute the most likely sequence of
hidden states for a sequence of observations
○ Viterbi recursion is similar to forward recursion, but carries forward the
maximum probability instead of the sum of probabilities
○ Requires a backtrace mechanism to track the actual path through the
hidden states
● Part-of-Speech tagging, an important pre-processing step for many
natural language processing applications, can be done using an HMM
● MLE estimates for an HMM POS tagger can be derived from a corpus
of text where each word is labeled with its true POS tag