Computational Linguistics, Lecture 4, 2017
HMM and Part of Speech Tagging
Adam Meyers
New York University
Outline
• Parts of Speech Tagsets
• Rule-based POS Tagging
• HMM POS Tagging
• Transformation-based POS Tagging
Part of Speech Tag Standards
• There is no standard set of parts of speech that is used by all researchers
for all languages.
• The most commonly used English tagset is that of the Penn Treebank at
the University of Pennsylvania:
– http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports
– Page 5 of http://ucrel.lancs.ac.uk/acl/J/J93/J93-2004.pdf provides the list of tags
• To map several POS tagsets to each other, see Table 1 in:
– http://nlp.cs.nyu.edu/meyers/Annotation%20Compatibility%20Working%20Group%20Report%202006.html
• POS tagsets:
– Assume Particular Tokenizations, e.g., Mary’s → Mary + ‘s
– Distinguish inflections: e.g., eat/VB, eat/VBP, eats/VBZ, ate/VBD
– Different instances of the same string can have different tags
• She wants to eat/VB; They eat/VBP; He eats/VBZ; Those are good eats/NNS
• Annotators & POS taggers assign tags to each token in a sentence, no exceptions
The Penn Treebank II POS tagset
• Verbs: VB, VBP, VBZ, VBD, VBG, VBN
– base, present-non-3rd, present-3rd, past, -ing, -en
• Nouns: NNP, NNPS, NN, NNS
– proper/common, singular/plural (singular includes mass + generic)
• Adjectives: JJ, JJR, JJS (base, comparative, superlative)
• Adverbs: RB, RBR, RBS, RP (base, comparative, superlative, particle)
• Pronouns: PRP, PP$ (personal, possessive)
• Interrogatives: WP, WP$, WDT, WRB (compare to: PRP, PP$, DT, RB)
• Other Closed Class: CC, CD, DT, PDT, IN, MD
• Punctuation: # $ . , : ( ) “ ” ‘ `
• Weird Cases: FW (deja vu), SYM (@), LS (1, 2, a, b), TO (to), POS (‘s, ‘),
UH (no, OK, well), EX (it/there)
• Newer tags: HYPH, PU
Part of Speech Tagging
• POS taggers assign 1 POS tag to each input token
– The/DT silly/JJ man/NN is/VBZ a/DT professor/NN ./PU
• Different ways of breaking down POS tagging:
– Use a separate “tokenizer”, a program that divides a string into a list of
tokens; the POS tagger processes its output
– Incorporate tokenizer into POS tagger
• Different ways of breaking down parsing:
– Use separate POS tagger – output of tagger is input to parser
– Assign POS tags as part of parsing (assumed previously)
• Accurate POS tagging is “easier” than accurate parsing
– POS tags may be sufficient information for some tasks
Some Tokenization Rules for English
• 1) Divide at spaces and hyphens.
• 2) Divide before punctuation that is followed by: a space or the end of the line
– Define punctuation as any non-letter/non-number:
• `!@#$%^&*()-_+={[}]\|:;’”<,>.?/
– Punctuation followed by a space, other punctuation, or at the end of line should be
separated from words:
• …and he left.”) → and he left . ” )
• 3) Break off the following as separate tokens when followed by a space or end
of line:
– ‘s, n’t, ‘d, ‘ve, ‘m, ‘ll, ‘re, … (a short list)
• 4) Abbreviations are exceptions to rule 2:
– Periods after abbreviations should not be separated from the word
• Most cases covered by list of 100 items (or if sentence end is known)
– Final periods are not duplicated after abbreviations (consistency issues)
• These periods serve 2 functions simultaneously (argument for duplication)
• These periods occupy a single character position
– argument against duplication – difficulty with calculating character offsets
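
A minimal sketch of rules 1–4 in Python (regex-based; hyphen splitting is omitted, and the abbreviation list is a tiny hypothetical sample of the ~100-item list mentioned above):

import re

# Hypothetical sample; a real tokenizer would use the full ~100-item list
ABBREVIATIONS = {'Mr.', 'Mrs.', 'Dr.', 'etc.', 'U.S.'}

def tokenize(text):
    tokens = []
    for word in text.split():              # Rule 1: divide at spaces
        if word in ABBREVIATIONS:          # Rule 4: abbreviations keep their period
            tokens.append(word)
            continue
        # Rule 2: split off trailing and leading punctuation runs
        word = re.sub(r'([^\w\s])(?=[^\w\s]*$)', r' \1', word)
        word = re.sub(r'^([^\w\s]+)', r'\1 ', word)
        # Rule 3: split off clitics from the short list
        word = re.sub(r"(\w)('s|n't|'d|'ve|'m|'ll|'re)(?=\s|$)", r'\1 \2', word)
        tokens.extend(word.split())
    return tokens

print(tokenize("The book isn't on Dr. Smith's table."))
# ['The', 'book', 'is', "n't", 'on', 'Dr.', 'Smith', "'s", 'table', '.']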
Rule-based POS Tagger
• Method
– Assign lists of potential POS tags to each word based on dictionary
– Manual rules for Out of Vocabulary (OOV) words
• Ex: capitalized and not sentence-initial → NNP; ends in s → VBZ|NNS; default → NN|JJ; etc.
– Apply hand-written constraints until each word has only one possible POS
• Sample Constraints:
– 1) DT cannot immediately precede a verb
– 2) No verb can immediately precede a tensed verb: VBZ, VBP, VBD
• Untensed: VB (base form), VBN & VBG (past & present participles)
• Example:
– The/DT book/{NN|VB|VBP} is/VBZ on/IN the/DT table/{NN|VB|VBP}
– The/DT book/NN is/VBZ on/IN the/DT table/NN
• DT cannot precede VB or VBP
• VBZ cannot be preceded by VB or VBP
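
A minimal sketch of the method, assuming a hypothetical mini-dictionary and only the two sample constraints above (a real system applies many constraints, and only when a neighbor is unambiguous):

# Hypothetical mini-dictionary: word -> set of possible POS tags
DICTIONARY = {
    'the': {'DT'}, 'book': {'NN', 'VB', 'VBP'}, 'is': {'VBZ'},
    'on': {'IN'}, 'table': {'NN', 'VB', 'VBP'},
}
VERBS = {'VB', 'VBP', 'VBZ', 'VBD', 'VBG', 'VBN'}
TENSED = {'VBZ', 'VBP', 'VBD'}

def rule_based_tag(tokens):
    # Dictionary lookup; the OOV default here is NN|JJ
    choices = [set(DICTIONARY.get(t.lower(), {'NN', 'JJ'})) for t in tokens]
    for i in range(1, len(choices)):
        prev, cur = choices[i - 1], choices[i]
        if prev == {'DT'}:         # 1) DT cannot immediately precede a verb
            cur -= VERBS
        if cur and cur <= TENSED:  # 2) no verb can precede a tensed verb
            prev -= VERBS
    return [(t, '|'.join(sorted(c))) for t, c in zip(tokens, choices)]

print(rule_based_tag(['The', 'book', 'is', 'on', 'the', 'table']))
# [('The', 'DT'), ('book', 'NN'), ('is', 'VBZ'), ('on', 'IN'),
#  ('the', 'DT'), ('table', 'NN')]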
Probability
• An estimate of the probability of a future event based on past observations:
– $P(\text{event}) = \frac{\text{number of events}}{\text{number of trials}}$
• Conditional Probability: the probability of X given Y:
– $P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$
• Examples relating to POS tags (previous examples used word N-grams):
– Out of 200 DT tags, 150 of them tag the word the
• If a word is tagged DT, there is a 75% chance that the word is the
• This is an example of a likelihood probability
– The POS after a DT is NN 120 times and JJ 60 times:
• A word following a DT is
– 120/200 = 60% likely to be a singular noun (NN)
– 60/200 = 30% likely to be a base adjective (JJ)
• These are examples of transition probabilities
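
These probabilities are just relative frequencies; computing the slide's examples from counts:

# Counts from the slide: 200 DT tags, 150 of them on "the";
# a DT is followed by NN 120 times and by JJ 60 times
count_DT, count_DT_the = 200, 150
count_DT_NN, count_DT_JJ = 120, 60

print(count_DT_the / count_DT)  # P(the | DT) = 0.75  (likelihood)
print(count_DT_NN / count_DT)   # P(NN | DT)  = 0.6   (transition)
print(count_DT_JJ / count_DT)   # P(JJ | DT)  = 0.3   (transition)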
More Math Terminology
• N instances of a variable looked at individually: $X_1^n$ is the same as $X_1, X_2, X_3, \ldots, X_n$ in sequence
• The product of instances of X from 1 to n: $\prod_{i=1}^{n} P(X_i)$
• Max = the maximum number in a set
• Argmax = the argument value that maximizes a particular formula
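
In code, max returns the largest value while argmax returns the argument that produces it; a small Python illustration:

scores = {'NN': 0.6, 'JJ': 0.3, 'VB': 0.1}  # hypothetical P(tag | previous = DT)

print(max(scores.values()))         # max:    0.6
print(max(scores, key=scores.get))  # argmax: 'NN', the tag that maximizes the score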
Probabilistic Models of POS tagging
• For tokens $w_1, \ldots, w_n$, find the most probable corresponding sequence of possible tags $t_1, \ldots, t_n$
– We assume that probable means something like “most frequently observed in some manually tagged corpus of words”.
• Penn Treebank II (a common training corpus)
– 1 million words from the Wall Street Journal
– Tagged for POS (and other attributes)
• The specific sequence (sentence) is not in the training corpus
– Therefore the actual “probability” is 0
– Common practice: estimate probability given assumptions, e.g.,
• Assume that we can estimate probability of whole tag sequence by
multiplying simpler probabilities, e.g., sequences of 2 consecutive tags
Probabilistic Assumptions of HMM Tagging
• $\hat{t} = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$
– Choose the tag sequence of length n that is most probable given the input token sequence
• Bayes Rule: $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$
– A way to derive the probability of x given y when you know: the probability of y given x, the probability of x, and the probability of y
• Applying Bayes Rule to tag probability:
– $\hat{t} = \operatorname{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$
Simplifying Assumptions for HMMs
• Simplification: drop the denominator
– The denominator is the same for all tag sequences (the word sequence is given):
• $\hat{t} = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$
– For each tag sequence, calculate the product of:
• The probability of the word sequence given the tag sequence (likelihood)
• The probability of the tag sequence (prior probability)
– Still too hard
• 2 simplifying assumptions make it possible to estimate the probability of tag sequences given word sequences:
– 1) The probability of a word depends only on its own POS tag:
• $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
– 2) The probability of a POS tag depends only on the previous POS tag:
• $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
• The result of these assumptions:
– $\hat{t} \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$
• HMM taggers are fast and achieve precision/recall scores of about 93-95%
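
A sketch of scoring one candidate tag sequence under these two assumptions; the emission and transition tables here are hypothetical stand-ins for tables estimated from a corpus (next slide):

# Hypothetical fragments of the tables: emission[t][w] = P(w | t),
# transition[prev][t] = P(t | prev); 'B' is the begin state
emission = {'DT': {'the': 0.4}, 'NN': {'book': 0.001}}
transition = {'B': {'DT': 0.61}, 'DT': {'NN': 0.53}}

def score(words, tags):
    # Product of P(w_i | t_i) * P(t_i | t_{i-1}) over the sequence
    p, prev = 1.0, 'B'
    for w, t in zip(words, tags):
        p *= emission[t].get(w, 0.0) * transition[prev].get(t, 0.0)
        prev = t
    return p

print(score(['the', 'book'], ['DT', 'NN']))  # (.4*.61) * (.001*.53)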
Estimating Probability of $\hat{t}$
• We assume that: $\hat{t} \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$
• Acquire frequencies from a training corpus:
– Word frequency with a given POS
• Suppose book occurs 14 times in a corpus: 10 times (.001) as NN (there are 10,000 instances of NN in the corpus), 3 times (.003) as VBP (the corpus has 1,000 VBPs), and 1 time (.002) as VB (the corpus has 500 VBs)
– Given the previous tag, how often does each tag occur?
• Suppose DT is followed by NN 80,000 times (.53), JJ 30,000 times (.2), NNS 20,000 times (.13), VBN 3,000 times (.02), … out of a total of 150,000 occurrences of DT
• All possible tags for the sequence:
– The/DT book/{NN|VB|VBP} is/VBZ on/IN the/DT table/{NN|VB|VBP}
• Hypothetical probabilities for the highest scoring tag sequence:
– The/DT book/NN is/VBZ on/IN the/DT table/NN
– Likelihoods: the/DT = .4, book/NN = .001, is/VBZ = .02, on/IN = .1, the/DT = .4, table/NN = .0005
– Transitions: P(DT|B) = .61, P(NN|DT) = .53, P(VBZ|NN) = .44, P(IN|VBZ) = .12, P(DT|IN) = .05, P(NN|DT) = .53, P(E|NN) = .31
– $\prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1}) = (.4 \times .61)(.001 \times .53)(.02 \times .44)(.1 \times .12)(.4 \times .05)(.0005 \times .53)(1 \times .31) \approx 2.2 \times 10^{-14}$
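
Checking the product programmatically (the numbers are the slide's likelihood × transition pairs):

pairs = [(.4, .61),     # the/DT    * P(DT|B)
         (.001, .53),   # book/NN   * P(NN|DT)
         (.02, .44),    # is/VBZ    * P(VBZ|NN)
         (.1, .12),     # on/IN     * P(IN|VBZ)
         (.4, .05),     # the/DT    * P(DT|IN)
         (.0005, .53),  # table/NN  * P(NN|DT)
         (1, .31)]      # end       * P(E|NN)

p = 1.0
for likelihood, trans in pairs:
    p *= likelihood * trans
print(p)  # ≈ 2.2e-14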
Defining an HMM
• A Weighted Finite-state Automaton (WFSA)
– Each transition arc is associated with a probability
– The probabilities on all arcs outgoing from a single node sum to 1
• A Markov chain is a WFSA in which the input string uniquely determines the path
through the automaton
• Hidden Markov Model (HMM) is a slightly different case because some
information (previous POS tags) is unknown (or hidden)
• HMM consists of the following:
– Q = set of states: q0 (start state), …, qF (final state)
– A = transition probability matrix of n × n probabilities of transitioning
between any pair of n states (n = F + 1). Called the prior probability or transition
probability of a tag sequence
– O = sequence of T observations (words) from a vocabulary V
– B = sequence of observation likelihoods (probability of observation generated
at state) – Called likelihood (of word sequence given tag sequence), aka
emission probability
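
One way to represent the four components in Python (a sketch; the probabilities are fragments of the example HMM on the next slide):

# Q: the states, with Begin_Sent and End_Sent as q0 and qF
Q = ['Begin_Sent', 'DT', 'JJ', 'NN', 'VBZ', 'IN', 'End_Sent']

# A: transition probabilities, A[prev][next] = P(next | prev)
A = {'Begin_Sent': {'DT': 0.61},
     'DT': {'NN': 0.53, 'JJ': 0.47},
     'NN': {'VBZ': 0.44, 'End_Sent': 0.31}}

# B: observation likelihoods (emission probabilities), B[state][word] = P(word | state)
B = {'DT': {'the': 0.4, 'a': 0.3},
     'NN': {'book': 0.001, 'table': 0.0005}}

# O: a sequence of T observations (words from the vocabulary V)
O = ['the', 'book']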
Example HMM
[Figure: an example HMM drawn as a weighted FSA. States Q0–QF: START, DT, JJ, NN, VBZ, IN, END. Arcs carry transition probabilities (among them START→DT = .61, DT→JJ = .47, DT→NN = .53, NN→VBZ = .44, NN→END = .31, VBZ→IN = .12, IN→DT = .60), and each POS state lists emission probabilities:
– DT: the = .4, a = .3, these = .07, an = .05, …
– JJ: perfect = .003, orange = .0015, blue = .0011, angry = .0005, …
– NN: book = .001, table = .0005, fish = .0002, orange = .00001, …
– VBZ: is = .02, sells = .004, hates = .002, sees = .0012, …
– IN: of = .2, in = .11, on = .1, before = .001, …]
Go to Ralph’s Viterbi Demo
for Fish Sleep
Viterbi Algorithm for HMM
Observed_Words = w_1 … w_T
States = q_0, q_1 … q_N, q_F
A = N × N matrix such that a_{i,j} is the probability of the transition from q_i to q_j
B = lookup table such that b_i(w_t) is the probability that POS i is realized as word t
viterbi = (N+2) × T matrix      # rows are states, columns are word positions
backpointer = (N+2) × T matrix  # highest scoring previous cells for viterbi

for state q from 1 to N:
    viterbi[q,1] ← a_{0,q} * b_q(w_1)   # score the transition 0→q given w_1
    backpointer[q,1] ← 0                # the start state
for word position t from 2 to T:
    for state q from 1 to N:            # for (T-1) × N (t,q) pairs
        viterbi[q,t] ← max_{q'=1..N} viterbi[q',t-1] * a_{q',q} * b_q(w_t)
                                        # score = best previous * prior * likelihood
        backpointer[q,t] ← argmax_{q'=1..N} viterbi[q',t-1] * a_{q',q}
                                        # best previous state
viterbi[q_F,T] ← max_{q=1..N} viterbi[q,T] * a_{q,q_F}
backpointer[q_F,T] ← argmax_{q=1..N} viterbi[q,T] * a_{q,q_F}
return(best_path)   # derive by following backpointers from (q_F,T) back to q_0
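
A runnable Python sketch of the same algorithm, using dictionaries instead of matrices; A and B are the transition and emission tables (as in the training-stage slides below), states excludes the begin/end states, and OOV handling is omitted:

def viterbi(words, states, A, B):
    # A[p][q] = P(q | p), B[q][w] = P(w | q); returns the best tag sequence
    T = len(words)
    V = [{} for _ in range(T)]     # V[t][q]: best score of a path ending in q at t
    back = [{} for _ in range(T)]  # back[t][q]: previous state on that path

    for q in states:               # initialize column 0 from the start state
        V[0][q] = A['Begin_Sent'].get(q, 0.0) * B.get(q, {}).get(words[0], 0.0)
        back[0][q] = 'Begin_Sent'
    for t in range(1, T):          # fill in the remaining columns
        for q in states:
            b = B.get(q, {}).get(words[t], 0.0)
            V[t][q], back[t][q] = max(
                (V[t - 1][p] * A.get(p, {}).get(q, 0.0) * b, p) for p in states)

    # transition into the end state, then follow backpointers
    best, last = max((V[T - 1][q] * A.get(q, {}).get('End_Sent', 0.0), q)
                     for q in states)
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# e.g. viterbi(['the', 'book'], ['DT', 'JJ', 'NN', 'VBZ', 'IN'], A, B)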
Walk Through: The orange is on the table. (ignoring the period)

        0    1: The      2: orange       3: is        4: on       5: the     6: table       7: End
Start   1
DT           .4 × .61                                             .4 × .6
IN                                                    .1 × .12
JJ                       .0015 × .47
VBZ                                      .02 × .44
NN                       .00001 × .53                                        .0005 × .53
End                                                                                         .33 × 1

Each cell shows (emission probability × transition probability) for the transition into that state.

Best path: 1 × (.4 × .61) × (.00001 × .53) × (.02 × .44) × (.1 × .12) × (.4 × .6) × (.0005 × .53) × (.33 × 1) ≈ 2.9 × 10⁻¹⁵
Comments on Viterbi Trace
• Initialize scores for the first column: transitions from Q0 to each
possible state given the first token (the)
– The probability of reaching Q1 (DT) matching the first item on the tape
(the) will be .4 × .61 = .244 (this is also the only possibility)
• The adjective sense of orange is more likely locally, but
leads to a dead end
• The transitions from the begin state and the transition to the end state are
necessary parts of the process.
Unknown (OOV) Words
• Possibility 1
– Assume all POS tags have the same probability (e.g., 1/1000)
– In effect, only use transitions to predict the correct tag
• Possibility 2
– Use morphology (prefixes, suffixes), orthography (uppercase/lowercase),
hyphenation
• Possibility 3:
– Words occurring once in corpus = instances of UNKNOWN_WORD
– Distribution of UNKNOWN_WORD used for OOV words
• Possibility 4: Some combination
– Example: divide UNKNOWN_WORD into morphological classes like
UNKNOWN_WORD_ENDING_IN_S
Homework
• http://cs.nyu.edu/courses/spring17/CSCI-UA.0480-009/homework4.html
• Guidance on Program – Next few slides
• This assignment can be completed alone or with a partner
– A slightly more elaborate system will be expected if you choose to
work with a partner.
Implement a Simple Version of the Training Stage First
• Data: 2 tab-separated fields per line: word and POS
– Start of file = beginning of a sentence
– Blank line = end of one sentence and beginning of the next
– End of file = end of a sentence
• Make 2 hash tables (e.g., Python dictionaries)
1. POS → frequencies of the words that occur with that POS
● Example: likelihood[‘DT’] → {‘the’:1500,’a’:200,’an’:100, …}
● Hash table of POSs with each value a hash table from words to frequencies
2. STATE → frequencies of the following states
● Example: Transition[‘Begin_Sent’] → {‘DT’:1000,’NNP’:500,’VB’:200, …}
● Example: Transition[‘DT’] → {‘NN’:500, ‘NNP’:200, ‘VB’:30, …}
● Hash table of states with each value a hash table from states to frequencies
● States = Begin_Sent, End_Sent and all POSs
• Go through the data one line at a time
– Record frequencies for both 1 and 2
– Loop through the hash tables and convert frequencies into probabilities
• freq/total = probability
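
A sketch of this training stage (the file name is hypothetical; Begin_Sent/End_Sent are the extra states named above):

from collections import defaultdict

likelihood = defaultdict(lambda: defaultdict(int))  # POS -> word -> frequency
transition = defaultdict(lambda: defaultdict(int))  # state -> next state -> frequency

prev = 'Begin_Sent'
with open('training.pos') as infile:                # hypothetical file name
    for line in infile:
        line = line.strip()
        if not line:                                # blank line = sentence boundary
            if prev != 'Begin_Sent':
                transition[prev]['End_Sent'] += 1
            prev = 'Begin_Sent'
            continue
        word, pos = line.split('\t')
        likelihood[pos][word] += 1
        transition[prev][pos] += 1
        prev = pos
if prev != 'Begin_Sent':                            # end of file ends the last sentence
    transition[prev]['End_Sent'] += 1

for table in (likelihood, transition):              # freq/total = probability
    for counts in table.values():
        total = sum(counts.values())
        for key in counts:
            counts[key] /= total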
Simple Version of Transducer
• Make a 2 dimensional array (or equivalent)
– columns represent positions in the text
• 0 = start of sentence
• N = Nth token
• Length+1 = end of sentence
– Rows represent all possible states
• Start symbol
• End symbol
• All POSs (NN, JJ, ….) found in the corpus
• Traverse the chart as per the algorithm (fish sleep slides, etc.)
– Score all possible transitions from position = 0, state = Start
– For all states at position 1 with non-zero scores, score all possible transitions from those states
– Continue with subsequent transitions until arriving at the end of the string
– At each position in the chart record:
• the highest score
• the previous state that resulted in this score
Calculating Probabilities
• The probability of each transition to state N for token T is assumed to be the
product of 3 factors
– 1) The probability that state N occurs with token T
• There is a 100% chance that the start state will be at the beginning of the sentence
• There is a 100% chance that the end state will be at the end of the sentence
• If a token was observed in the training corpus, look up the probability in the table
• For Out of Vocabulary words, there are several strategies
– Simple strategy (for a first implementation): 100% divided by the number of states
– Other strategies are a separate discussion
– 2) The probability that state N occurs given the previous state
• Look up in the table; calculate for every possible previous state
– 3) The highest probability of the previous state (calculated for each previous state)
– For each new state, choose the highest score (this is the bigram model)
• Choose the POS tag sequence resulting in the highest score in the end state
OOV Strategies (from the Unknown Words slide above)
• Default (use until other parts of program are debugged)
– Assume all POS tags have the same probability (e.g., 1/1000)
– In effect, only use transitions to predict the correct tag
• Morphology
– Use prefixes, suffixes, uppercase/lowercase, hyphenation, to predict POS
classes of OOV words
– Assign “made up” values based on these features?
• Compute the probability of UNKNOWN_WORD
– Treat words occurring once in training collectively as UNKNOWN_WORD
• don’t record them separately
– UNKNOWN_WORD probability used for OOV words by transducer
• Combination:
– UNKNOWN_ending_in_s, UNKNOWN_ending_in_ed,
UNKNOWN_with_capital_letter, …
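
A sketch of the combination strategy: map each OOV token to an unknown-word class (the class names are the hypothetical ones above) whose distribution is estimated from words occurring once in training:

def unknown_class(word):
    # Hypothetical class names; each class gets its own probability
    # distribution, estimated from words that occurred once in training
    if word[0].isupper():
        return 'UNKNOWN_with_capital_letter'
    if word.endswith('ed'):
        return 'UNKNOWN_ending_in_ed'
    if word.endswith('s'):
        return 'UNKNOWN_ending_in_s'
    return 'UNKNOWN_WORD'

# At decoding time, look up an OOV token under its class, e.g.:
# b = B[state].get(word) or B[state].get(unknown_class(word), 0.0)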
How you Might Improve your Score
• Do error analysis on development corpus and base changes
on what you find.
• Implement a trigram algorithm
– See Jurafsky and Martin (p. 149)
– 4-gram is a waste of time for this size corpus
• A particularly clever OOV system
• Manual rule system using constraints (see the Rule-based POS Tagger slide above)
– For words with frequency>1, assume the disjunction of
observed labels is possible
– Rule out possibilities according to constraints
– Run this and compare results with HMM system
– Figure out way of combining results based on error analysis
• Voting, weighted combinations, etc.
Grading
• 1 Person can get a 9 or 10 with:
– a bigram system and an implementation of an
OOV system based on words occurring once
– an accuracy score above 94% on the test corpus
• 2 Person system can get a 9 or 10 with
– Same as one person system plus at least one
extra interesting attempt, even if unsuccessful
• Include a short write-up of what you did, so
it is easier to evaluate.
2 Person Collaboration
• Indicate on your submission documents that you are collaborating
and indicate who you are collaborating with.
• Collaborators should submit the same documents twice on
NYUClasses to make sure there is no confusion
• Indicate who did what, e.g.,
– Person 1: create the tables with the probabilities
– Person 2: create the initial version of Viterbi and a very simple
OOV strategy (assume all POS have equal probability)
– Person 1: OOV strategy based on words occurring once
– Person 2: error analysis on development corpus to determine next
improvements
– Person 1 and 2: Manual Rule based system
– Etc.