Applying HMMs to Part-of-Speech
(POS) Tagging
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 27, Three Segments

Outline

● Decoding: the Viterbi Algorithm
● Part-of-Speech (POS) Problem
● Viterbi POS Tagger


Applying HMMs to Part-of-Speech
(POS) Tagging
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 27, Segment 1 of 3

Likelihood versus Decoding: Ice Cream Example

Given an HMM λ:
● Computing the likelihood of a sequence of observations P(o1,o2,o3) relies on the forward algorithm
○ The trellis entries carry forward all paths to each state that can emit oi
○ The likelihood involves summing over all combinations of state sequences that could produce a given observation sequence

● Computing the most likely sequence of states given the observations (decoding) P(Q1,Q2,Q3|o1,o2,o3) relies on the Viterbi algorithm
○ The trellis entries carry forward only the maximum-probability path to each state that can emit oi
○ Decoding finds the single maximum probability state sequence


Likelihood versus Decoding: Ice Cream Example


Decoding in Brief

● Given as input an HMM λ = (A,B) and a sequence of observations O = o1,o2,…,oT, find the most probable sequence of states Q = q1,q2,…,qT

● What is the single most likely sequence of states that generated o1=3, o2=1, o3=3?


Forward Recursion versus Viterbi Recursion

● Forward recursion: αt(j) = ∑i αt-1(i) aij bj(ot)  (sum over predecessor states i)

● Viterbi recursion: vt(j) = maxi vt-1(i) aij bj(ot)  (max over predecessor states i)

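The two recursions differ only in replacing a sum over predecessor states with a max. As a quick illustration, a single trellis-column update in NumPy; the numbers are made up, not taken from the lecture figures.

```python
import numpy as np

# One trellis-column update for a toy 2-state HMM, showing the only difference
# between the forward and Viterbi recursions: summing vs. maximizing over the
# predecessor state i. All numbers here are illustrative.
A = np.array([[0.7, 0.3],        # A[i, j] = P(state j at t | state i at t-1)
              [0.4, 0.6]])
b_t = np.array([0.5, 0.1])       # b_j(o_t): emission prob of the current observation in state j
prev = np.array([0.32, 0.02])    # alpha_{t-1}(i) or v_{t-1}(i) from the previous column

alpha_t = (prev @ A) * b_t                    # forward: alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
v_t = (prev[:, None] * A).max(axis=0) * b_t   # Viterbi: v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t)

print("forward column:", alpha_t)
print("viterbi column:", v_t)
```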

Viterbi Recursion

● Viterbi recursion computes the probability of the most probable path ending in state j at time t, given the observations o1, . . . , ot:
vt(j) = max over q1…qt-1 of P(q1…qt-1, o1…ot, qt = j | λ) = maxi vt-1(i) aij bj(ot)


Figure from Jurafsky & Martin, Draft 3rd ed., Appendix A

Back Tracing

● Viterbi recursion computes the probability of the most probable path ending in state j at time T, given the observations o1, . . . , oT

● Viterbi must also identify the complete single path that gives this maximum probability
○ Keep backpointers: btt(j) = argmaxi vt-1(i) aij
○ Find the most probable final state: qT* = argmaxj vT(j)
○ Trace backpointers from that state at time T to recover the state sequence from T back to 1


Viterbi Algorithm

● Initialization (t=1): v1(j) = πj bj(o1); bt1(j) = 0
● Induction and backtracing (2 ≤ t ≤ T):
  vt(j) = maxi vt-1(i) aij bj(ot)
  btt(j) = argmaxi vt-1(i) aij

● Termination: P* = maxj vT(j); qT* = argmaxj vT(j)

● Backpointer path: qt* = btt+1(qt+1*) for t = T-1, …, 1


Viterbi Trellis for Ice Cream Example: Induction Step

Figure from Jurafsky & Martin, Draft 3rd ed., Appendix A

Viterbi Trellis for Ice Cream Example: Backtracing

Figure from Jurafsky & Martin, Draft 3rd ed., Appendix A

Viterbi Pseudo Code

Figure from Jurafsky & Martin, Draft 3rd ed., Appendix A
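Since the pseudo-code figure does not reproduce in this text-only version, the following is a minimal Python sketch of the algorithm above, followed by a run on the ice cream observations 3, 1, 3. The numeric parameter values are illustrative stand-ins, not necessarily the exact ones in the figures.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state sequence for an observation sequence.

    pi:  (N,) initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B:   (N, M) emission probabilities, B[j, o] = P(observation o | state j)
    obs: list of observation indices o_1 .. o_T
    Returns (best state sequence as a list of indices, its probability).
    """
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))              # v[t, j]: best path probability ending in state j at time t
    bp = np.zeros((T, N), dtype=int)  # backpointers

    v[0] = pi * B[:, obs[0]]          # initialization
    for t in range(1, T):             # induction
        scores = v[t - 1][:, None] * A * B[:, obs[t]]   # scores[i, j] = v[t-1, i] * a_ij * b_j(o_t)
        v[t] = scores.max(axis=0)
        bp[t] = scores.argmax(axis=0)

    best_last = int(v[-1].argmax())   # termination
    path = [best_last]
    for t in range(T - 1, 0, -1):     # backtrace
        path.append(int(bp[t, path[-1]]))
    return path[::-1], float(v[-1].max())

# Ice cream example: hidden states 0 = HOT, 1 = COLD; observations of 1-3 ice creams,
# encoded as indices 0-2. Probabilities below are illustrative stand-ins only.
pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4],
               [0.4, 0.6]])
B  = np.array([[0.2, 0.4, 0.4],      # P(1|HOT), P(2|HOT), P(3|HOT)
               [0.5, 0.4, 0.1]])     # P(1|COLD), P(2|COLD), P(3|COLD)
print(viterbi(pi, A, B, obs=[2, 0, 2]))  # most likely weather sequence for 3, 1, 3 ice creams
```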

Applying HMMs to Part-of-Speech
(POS) Tagging
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 27, Segment 2 of 3

Parts of Speech

● Parts of speech: traditional grammatical categories like “noun,” “verb,” “adjective,” “adverb” . . . (and many more)

● Origins in ancient Greek monographs on grammar and parts of speech (c. 100 B.C., Dionysius Thrax, Tryphon)

● Functions:
○ Help distinguish different word meanings: N(oun) vs. V(erb)

■ EG: river bank (N) vs. she banks (V) at a credit union
■ EG: a bear (N) will go after honey vs. bear (V) with me

○ Preprocessing for many other natural language processing tasks

POS Tagging

Pervasive in Natural Language Applications

● Machine Translation
○ Translations of nouns and verbs have different forms (prefixes, suffixes)

● Speech synthesis
○ “Close” as an adjective versus verb
○ see the noun/verb stress table below

● Sense disambiguation of word meanings
○ “Close” (adjective) as in near something
○ “Close” (verb) as in shut something


Noun          Verb
IN’sult       in SULT’
OB’ject       ob JECT’
OVER’flow     over FLOW’
DIS’count     dis COUNT’
CON’tent      con TENT’

Majority Class Baseline for POS Tagging

● Assign each word its most frequent tag
○ 83.3% correct here (5 of the 6 words in “heat oil in a large pot”; “heat,” used here as a verb, gets mistagged)
○ Typically about 90% correct


Word    POS listing in Brown Corpus
heat    noun (86), verb (4)
oil     noun (87)
in      prep (18,975), noun (1), adv-particle (4)
a       det (21,872), noun (1)
large   adj (347), noun (2), adv (5)
pot     noun (28)
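For concreteness, a small sketch of how a most-frequent-tag baseline can be built from word/tag counts like those above; the training list below is an abbreviated, illustrative stand-in for real corpus data.

```python
from collections import Counter, defaultdict

def majority_class_tagger(tagged_corpus):
    """Learn a most-frequent-tag baseline from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Illustrative training data echoing the Brown Corpus counts above (abbreviated):
corpus = [("heat", "noun")] * 86 + [("heat", "verb")] * 4 + [("pot", "noun")] * 28
tagger = majority_class_tagger(corpus)
print(tagger["heat"])   # 'noun' -- so verbal uses, as in "heat oil in a large pot", are mistagged
```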

POS Tagsets: Penn TreeBank

The Penn Treebank, developed from 1989 to 1996, has been very widely used ever since (treebank ≡ a bank of syntactic trees, or other linguistic annotations)
● The first large annotated corpus used for training machine-learned models for natural language processing (NLP) tasks; 7M words total
● 7M words of part-of-speech tagged text
○ |POS| ≅ 36 – 48 (12 for punctuation and other symbols)
● 3M words of text with syntactic parses
● 2M words of text annotated with predicate-argument structure
● 1.6M words of transcripts of spoken language annotated for disfluency


Table Showing 11 of 36 Penn TreeBank POS Tags


Tag    Description                              Example
IN     preposition/subordinating conjunction    in, of, like
JJ     adjective                                green
JJR    adjective, comparative                   greener
JJS    adjective, superlative                   greenest
MD     modal                                    could, will
NN     noun, singular or mass                   table
NNS    noun, plural                             tables
NNP    proper noun, singular                    John
NNPS   proper noun, plural                      Vikings
VBD    verb, past tense                         took
VBZ    verb, 3rd person singular present        takes


POS Tags: Hidden States, Inherently Sequential

● Words are observed; part-of-speech tags are hidden
○ Cf. observed umbrellas vs. hidden weather states in the umbrella world
○ Cf. observed ice creams vs. hidden weather states in the ice cream world

● Likely POS sequences
○ JJ NN (delicious food, large pot)
○ NNS VBD (people voted, planes landed)

● Unlikely POS sequences
○ NN JJ (food delicious, pot large)
○ NNS VBZ (people votes, planes lands)


HMM POS Tagger as a Bayesian Network

● Condition the hidden POS tag pi at position i on the POS tag at position i-1: P(pi|pi-1)

● Condition the word wi at position i on the POS tag at position i: P(wi|pi)


Applying HMMs to Part-of-Speech
(POS) Tagging
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 27, Segment 3 of 3

HMM POS Tagger

● By Bayes Rule: P(T|W) = P(W|T) P(T) / P(W)

● To tag a word sequence, find the tag sequence with the maximum posterior probability: argmaxT P(T|W) = argmaxT P(W|T) P(T), since P(W) is constant for a given word sequence


HMM POS Tagger

● Assume the words wi are conditionally independent given the tags ti: P(W|T) ≈ ∏i P(wi|ti)

● Assume each tag ti depends only on the previous tag ti-1: P(T) ≈ ∏i P(ti|ti-1)

● For a given sequence of words W = w1w2w3…wn, find the sequence of tags T = t1t2t3…tn which maximizes ∏i P(wi|ti) P(ti|ti-1)

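As a concrete toy illustration of the quantity being maximized, the sketch below scores one candidate tag sequence; the "<s>" start symbol and all the probabilities are made-up assumptions for illustration only.

```python
# Score a candidate tag sequence under the bigram-HMM objective:
# product over i of P(w_i | t_i) * P(t_i | t_{i-1}).
trans = {("<s>", "NNS"): 0.10, ("NNS", "VBD"): 0.25}      # P(t_i | t_{i-1}), illustrative
emit  = {("people", "NNS"): 0.02, ("voted", "VBD"): 0.01}  # P(w_i | t_i), illustrative

def score(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= emit[(w, t)] * trans[(prev, t)]
        prev = t
    return p

print(score(["people", "voted"], ["NNS", "VBD"]))  # 0.10*0.02 * 0.25*0.01 = 5e-06
```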

Training Corpus Size (BOTEC: Back-of-the-Envelope Calculation)

● What parameters can we estimate with a million words of hand-tagged training data?
○ Assume a uniform distribution over a 5,000-word vocabulary and 40 part-of-speech tags


Event             Count        Estimate Quality
tag unigrams      40           Excellent
tag bigrams       1,600        Excellent
tag trigrams      64,000       Okay
tag 4-grams       2.5 x 10⁶    Poor
word unigrams     5,000        Very Good
word bigrams      25M          Poor
word, tag pairs   200,000      Okay
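These counts follow directly from the assumptions above (40 tags, 5,000-word vocabulary); a quick sanity check:

```python
# Back-of-the-envelope parameter counts for a 40-tag, 5,000-word setup.
n_tags, vocab = 40, 5000
print("tag unigrams:   ", n_tags)            # 40
print("tag bigrams:    ", n_tags ** 2)       # 1,600
print("tag trigrams:   ", n_tags ** 3)       # 64,000
print("tag 4-grams:    ", n_tags ** 4)       # 2,560,000
print("word unigrams:  ", vocab)             # 5,000
print("word bigrams:   ", vocab ** 2)        # 25,000,000
print("word, tag pairs:", vocab * n_tags)    # 200,000
```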

Streetlight Effect with a Positive Slant

● Rich models often require vast amounts of data

● Well estimated models that are not quite “correct” often outperform badly estimated better models


HMM POS Tagger Parameters

● Q = {q1, q2, . . . , qn}: the set of states (POS tags); |Q| = n = 36 for the Penn TreeBank
● A = aij: transition probabilities for all 1,296 tag pairs (qi, qj), s.t. ∑j aij = 1 for each i
● O = o1o2 . . . oT: a sequence of T observations from a vocabulary V (words w, arranged in sentences of length T): the training corpus
● B = bi(ot): the 36×|V| observation likelihoods of observation ot being generated from state i
● π = π1, π2, . . . , πn, where πi is the probability that the Markov chain will start in state i, s.t. ∑i πi = 1

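As a rough data-structure sketch, these parameters can be held as arrays of the following shapes (uniform placeholder values; the sizes follow the slide's assumptions):

```python
import numpy as np

n_tags, vocab = 36, 5000                        # illustrative sizes from the slides
pi = np.full(n_tags, 1.0 / n_tags)              # initial tag distribution, sums to 1
A  = np.full((n_tags, n_tags), 1.0 / n_tags)    # A[i, j] = P(tag j | tag i); each row sums to 1
B  = np.full((n_tags, vocab), 1.0 / vocab)      # B[i, o] = P(word o | tag i); each row sums to 1
```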

MLE Estimates for the HMM POS Tagger

Given an annotated corpus (words with POS tags)

● Priors: probabilities that an initial state is tag qi
  πi = Count(sentences whose first tag is qi) / Count(sentences)

● Sensor model: conditional probabilities of word wi given tag qi
  P(wi|qi) = Count(qi, wi) / Count(qi)

● Transition model: conditional probabilities of tag qj given tag qi
  P(qj|qi) = Count(qi, qj) / Count(qi)

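A minimal sketch of these MLE counts in Python, assuming the tagged corpus is available as a list of sentences, each a list of (word, tag) pairs (a hypothetical in-memory format; real corpora need their own readers):

```python
from collections import Counter, defaultdict

def mle_estimates(tagged_sentences):
    """MLE priors, transitions, and emissions from hand-tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns pi[tag], trans[prev_tag][tag], emit[tag][word] as nested dicts.
    """
    start_counts, tag_counts = Counter(), Counter()
    bigram_counts = defaultdict(Counter)
    word_tag_counts = defaultdict(Counter)

    for sent in tagged_sentences:
        start_counts[sent[0][1]] += 1                 # tag of the first word
        for i, (word, tag) in enumerate(sent):
            tag_counts[tag] += 1
            word_tag_counts[tag][word] += 1
            if i > 0:
                bigram_counts[sent[i - 1][1]][tag] += 1

    n_sents = len(tagged_sentences)
    pi = {t: c / n_sents for t, c in start_counts.items()}
    # Normalize each transition row by its bigram total so it is a proper
    # distribution over following tags.
    trans = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()}
             for p, nxt in bigram_counts.items()}
    emit = {t: {w: c / tag_counts[t] for w, c in ws.items()}
            for t, ws in word_tag_counts.items()}
    return pi, trans, emit
```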

Viterbi Decoding for POS Tagging

● Because many of these counts are small or zero, smoothing is necessary for best results

● HMM taggers typically achieve about 95-96% accuracy with standard tag sets of 36-42 POS tags

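One simple remedy is add-k (Laplace) smoothing of the MLE counts; the sketch below is a generic version under that assumption, not the specific scheme of any particular tagger.

```python
def add_k_probs(counts, n_outcomes, k=1.0):
    """Add-k smoothed probabilities for one conditioning context.

    counts: dict mapping outcome -> raw count (e.g. words seen with one tag)
    n_outcomes: size of the outcome space (e.g. vocabulary size |V|)
    Returns a function prob(outcome) that is never zero.
    """
    total = sum(counts.values()) + k * n_outcomes
    return lambda outcome: (counts.get(outcome, 0) + k) / total

prob_word_given_nn = add_k_probs({"table": 30, "chair": 12}, n_outcomes=5000)
print(prob_word_given_nn("table"), prob_word_given_nn("sofa"))  # unseen word gets a small, nonzero probability
```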

Summary

● The Viterbi algorithm is used to compute the most likely sequence of hidden states for a sequence of observations
○ Viterbi recursion is similar to forward recursion, but carries forward the maximum probability instead of the sum of probabilities
○ Requires a backtrace mechanism to track the actual path through the hidden states

● Part-of-speech tagging, an important pre-processing step for many natural language processing applications, can be done using an HMM

● MLE estimates for an HMM POS tagger can be derived from a corpus of text where each word is labeled with its true POS tag
