CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 5: Part-of-Speech Tagging
POS tagging
Raw text:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Tagged text:
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
Tagset: NNP: proper noun, CD: numeral, JJ: adjective, …
A POS tagger maps the raw text to the tagged text.
Why POS tagging?
POS tagging is a prerequisite for further analysis:
–Speech synthesis:
How to pronounce “lead”?
INsult or inSULT, OBject or obJECT, OVERflow or overFLOW,
DIScount or disCOUNT, CONtent or conTENT
–Parsing:
What words are in the sentence?
–Information extraction:
Finding names, relations, etc.
–Machine Translation:
The noun “content” may have a different translation from the adjective.
POS Tagging
Words often have more than one POS:
-The back door (adjective)
-On my back (noun)
-Win the voters back (particle)
-Promised to back the bill (verb)
The POS tagging task is to determine the POS tag
for a particular instance of a word.
Since there is ambiguity, we cannot simply look up the
correct POS in a dictionary.
(These examples are from Dekang Lin.)
Defining a tagset
Defining a tag set
We have to define an inventory of labels for the
word classes (i.e. the tag set)
-Most taggers rely on models that have to be trained on
annotated (tagged) corpora. Evaluation also requires
annotated corpora.
-Since human annotation is expensive/time-consuming,
the tag sets used in a few existing labeled corpora become
the de facto standard.
-Tag sets need to capture semantically or syntactically
important distinctions that can easily be made by trained
human annotators.
Defining an annotation scheme
A lot of NLP tasks require systems to map
natural language text to another representation:
POS tagging: Text ⟶ POS tagged text
Syntactic Parsing: Text ⟶ parse trees
Semantic Parsing: Text ⟶ meaning representations
…: Text ⟶ …
Defining an annotation scheme
Training and evaluating models for these NLP tasks
requires large corpora annotated with the desired
representations.
Annotation at scale is expensive, so a few existing
corpora and their annotations and annotation
schemes (tag sets, etc.) often become the de facto
standard for the field.
It is difficult to know what the ‘right’ annotation scheme should be for any particular task:
How difficult is it to achieve high accuracy for that annotation?
How useful is this annotation scheme for downstream tasks in the pipeline?
➩ We often can’t know the answer until we’ve annotated a lot of data…
Word classes
Open classes:
Nouns, Verbs, Adjectives, Adverbs
Closed classes:
Auxiliaries and modal verbs
Prepositions, Conjunctions
Pronouns, Determiners
Particles, Numerals
(see Appendix for details)
Defining a tag set
Tag sets have different granularities:
Brown corpus (Francis and Kucera 1982): 87 tags
Penn Treebank (Marcus et al. 1993): 45 tags
Simplified version of Brown tag set
(de facto standard for English now)
NN: common noun (singular or mass): water, book
NNS: common noun (plural): books
Prague Dependency Treebank (Czech): 4452 tags
Complete morphological analysis:
AAFP3----3N----: nejnezajímavějším
(Adjective, Regular, Feminine, Plural, Dative, …, Superlative)
[Hajic 2006, VMC tutorial]
How much ambiguity is there?
Most word types are unambiguous:
[Table: number of tags per word type]
But a large fraction of word tokens are ambiguous
Original Brown corpus: 40% of tokens are ambiguous
NB: These numbers are based on word/tag combinations in the corpus.
Many combinations that don’t occur in the corpus are equally correct.
Evaluating POS taggers
Evaluation setup:
Split data into separate training, (development) and test sets.
Better setup: n-fold cross validation:
Split data into n sets of equal size
Run n experiments, using set i to test and remainder to train
This gives average, maximal and minimal accuracies
When comparing two taggers:
Use the same test and training data with the same tag set
[Figure: two possible data splits: DEV | TRAINING | TEST, or TRAINING | DEV | TEST]
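A minimal sketch of the n-fold cross-validation setup described above (the train and accuracy callables are hypothetical placeholders for whatever tagger is being evaluated):

import random

def cross_validate(tagged_sentences, train, accuracy, n=10, seed=0):
    # n-fold cross validation: returns (average, minimal, maximal) accuracy.
    # train(sentences) -> model and accuracy(model, sentences) -> float
    # are assumed to be supplied by the tagger being evaluated.
    data = list(tagged_sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]   # n roughly equal splits
    scores = []
    for i in range(n):
        test = folds[i]                      # use set i to test...
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(train(training), test))  # ...and the remainder to train
    return sum(scores) / n, min(scores), max(scores)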
Evaluation metric: test accuracy
How many words in the unseen test data
can you tag correctly?
State of the art on Penn Treebank: around 97%.
➩ How many sentences can you tag correctly?
Compare your model against a baseline
Standard: assign to each word its most likely tag
(use the training corpus to estimate P(t | w))
Baseline performance on Penn Treebank: around 93.7%
… and a (human) ceiling
How often do human annotators agree on the same tag? Penn Treebank: around 97%.
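A minimal sketch of the most-frequent-tag baseline mentioned above (the default tag for unseen words is an added assumption; the slide does not specify one):

from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # Estimate argmax_t P(t | w) for each word by relative frequency.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(model, words, default="NN"):
    # Unseen words fall back to a default tag (a common heuristic).
    return [model.get(word, default) for word in words]

model = train_baseline([
    [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
])
print(tag_baseline(model, ["the", "back", "door"]))   # ['DT', 'NN', 'NN']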
Is POS-tagging a solved task?
Penn Treebank POS-tagging accuracy
≈ human ceiling
Yes, but:
Other languages with more complex morphology
need much larger tag sets for tagging to be useful,
and will contain many more distinct word forms
in corpora of the same size
Taggers for such languages often achieve much lower accuracies.
Qualitative evaluation
Generate a confusion matrix (for development data):
How often was a word with tag i mistagged as tag j?
See what errors are causing problems:
-Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
-Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
[Figure: confusion matrix with correct tags on one axis and predicted tags on the other; a large percentage of errors is caused by mistagging VBN as JJ]
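A minimal sketch of computing such a confusion matrix from gold and predicted tag sequences (the function names are illustrative):

from collections import Counter

def confusion_matrix(gold_tags, predicted_tags):
    # matrix[(i, j)] = how often a word with (correct) tag i was tagged as j
    return Counter(zip(gold_tags, predicted_tags))

def most_common_errors(matrix, k=5):
    # Off-diagonal cells only, as fractions of all errors.
    errors = Counter({pair: n for pair, n in matrix.items() if pair[0] != pair[1]})
    total = sum(errors.values())
    return [(gold, pred, n / total) for (gold, pred), n in errors.most_common(k)]

gold = ["VBN", "VBN", "JJ", "NN", "VBD", "NN"]
pred = ["JJ",  "VBN", "JJ", "NNP", "VBD", "NN"]
print(most_common_errors(confusion_matrix(gold, pred)))
# [('VBN', 'JJ', 0.5), ('NN', 'NNP', 0.5)]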
Building a POS tagger
Statistical POS tagging
She promised to back the bill
w = w(1) w(2) w(3) w(4) w(5) w(6)
t = t(1) t(2) t(3) t(4) t(5) t(6)
    PRP  VBD  TO   VB   DT   NN
What is the most likely sequence of tags t = t(1)…t(N) for the given sequence of words w = w(1)…w(N)?
t* = argmax_t P(t | w)
POS tagging with generative models
P(t,w): the joint distribution of the labels we want to predict (t)
and the observed data (w).
We decompose P(t,w) into P(t) and P(w | t) since these
distributions are easier to estimate.
Models based on joint distributions of labels and observed data
are called generative models: think of P(t)P(w | t) as a stochastic
process that first generates the labels, and then generates the
data we see, based on these labels.
Estimate argmax_t P(t | w) directly (in a conditional model),
or use Bayes’ Rule (and a generative model):
argmax_t P(t | w) = argmax_t P(t, w) / P(w)
                  = argmax_t P(t, w)
                  = argmax_t P(t) P(w | t)
Hidden Markov Models (HMMs)
HMMs are the most commonly used generative models
for POS tagging (and other tasks, e.g. in speech recognition)
HMMs make specific independence assumptions
when defining P(t) and P(w | t):
P(t) is an n-gram model over tags:
Bigram HMM: P(t) = P(t(1))P(t(2) | t(1))P(t(3) | t(2))… P(t(N) | t(N-1))
Trigram HMM: P(t) = P(t(1))P(t(2) | t(1))P(t(3) | t(2),t(1))… P(t(N) | t(N-1),t(N-2))
P(ti | tj) or P(ti | tj,tk) are called transition probabilities
In P(w | t), each word is generated by its tag:
P(w | t) = P(w(1) | t(1))P(w(2) | t(2))… P(w(N) | t(N))
The factors P(w(i) | t(i)) are called emission probabilities
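To make these independence assumptions concrete, here is a minimal sketch (not from the lecture) that computes P(t)P(w | t) for a bigram HMM, with the parameters stored as nested dictionaries:

def joint_probability(words, tags, initial, transition, emission):
    # P(t, w) = P(t(1)) P(w(1)|t(1)) * product over i of P(t(i)|t(i-1)) P(w(i)|t(i))
    p = initial.get(tags[0], 0.0) * emission[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= transition[tags[i - 1]].get(tags[i], 0.0)   # transition probability
        p *= emission[tags[i]].get(words[i], 0.0)        # emission probability
    return p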
HMMs as probabilistic automata
[Figure: probabilistic automaton with states DT, JJ, NN and VBZ; the arcs between states carry transition probabilities (0.7, 0.3, 0.4, 0.6, 0.55, 0.45, 0.5), and each state emits words with emission probabilities, e.g. DT: the, a, every, some, no, …; JJ: able, …, zealous; NN: zone, abandonment, …; VBZ: yields, acts, …]
An HMM defines
Transition probabilities: P(ti | tj)
Emission probabilities: P(wi | ti)
What would the automaton for a trigram HMM
with transition probabilities P(ti | tj tk) look like?
What about unigrams or n-grams?
Encoding a trigram model as FSA
Bigram model: States = Tag Unigrams
[Figure: FSA with start state q0 and states DT, JJ, NN, VBZ]
Trigram model: States = Tag Bigrams
[Figure: FSA with start state DT_ and states JJ_DT, NN_DT, JJ_JJ, NN_JJ, NN_NN, VBZ_NN]
HMM definition
An HMM λ = (A, B, π) consists of
• a set of N states Q = {q1, …, qN},
  with Q0 ⊆ Q a set of initial states
  and QF ⊆ Q a set of final (accepting) states
• an output vocabulary of M items V = {v1, …, vM}
• an N×N state transition probability matrix A,
  with aij the probability of moving from qi to qj
  (Σj aij = 1 for all i; 0 ≤ aij ≤ 1 for all i, j)
• an N×M symbol emission probability matrix B,
  with bij the probability of emitting symbol vj in state qi
  (Σj bij = 1 for all i; 0 ≤ bij ≤ 1 for all i, j)
• an initial state distribution vector π = ⟨π1, …, πN⟩,
  with πi the probability of being in state qi at time t = 1
  (Σi πi = 1; 0 ≤ πi ≤ 1 for all i)
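A direct transcription of this definition into code, as one possible sketch (the list-of-lists matrix representation is an implementation choice, not something the slides prescribe):

from dataclasses import dataclass

@dataclass
class HMM:
    states: list[str]        # Q = {q1, ..., qN}
    vocabulary: list[str]    # V = {v1, ..., vM}
    A: list[list[float]]     # A[i][j] = P(moving from qi to qj); each row sums to 1
    B: list[list[float]]     # B[i][j] = P(emitting vj in state qi); each row sums to 1
    pi: list[float]          # pi[i] = P(being in qi at time t = 1); sums to 1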
An example HMM
Transition matrix A (nonzero entries only):
D → N 0.8, D → A 0.2
N → V 0.7, N → . 0.3
V → D 0.6, V → . 0.4
A → N 0.8, A → A 0.2

Emission matrix B (nonzero entries only):
D: the 1
N: man 0.7, ball 0.3
V: throws 0.6, sees 0.4
A: red 0.8, blue 0.2
.: . 1

Initial state vector π: π(D) = 1

[Figure: the corresponding state diagram over D, N, V, A, .]
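With the matrices above as nested dictionaries, a quick sanity check (a sketch; the pairing of row values with columns follows the reconstruction above) that the model assigns the expected probability to one tagged sentence:

initial = {"D": 1.0}
transition = {"D": {"N": 0.8, "A": 0.2}, "N": {"V": 0.7, ".": 0.3},
              "V": {"D": 0.6, ".": 0.4}, "A": {"N": 0.8, "A": 0.2}}
emission = {"D": {"the": 1.0}, "N": {"man": 0.7, "ball": 0.3},
            "V": {"throws": 0.6, "sees": 0.4},
            "A": {"red": 0.8, "blue": 0.2}, ".": {".": 1.0}}

words = "the man throws the red ball .".split()
tags = "D N V D A N .".split()
p = initial[tags[0]] * emission[tags[0]][words[0]]
for i in range(1, len(words)):
    p *= transition[tags[i - 1]][tags[i]] * emission[tags[i]][words[i]]
print(p)   # 0.8*0.7 * 0.7*0.6 * 0.6*1 * 0.2*0.8 * 0.8*0.3 * 0.3*1 ≈ 0.00163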
Building an HMM tagger
To build an HMM tagger, we have to:
-Train the model, i.e. estimate its parameters
(the transition and emission probabilities)
Easy case: we have a corpus labeled with POS tags
(supervised learning)
-Define and implement a tagging algorithm that
finds the best tag sequence t* for each input
sentence w:
t* = argmax_t P(t) P(w | t)
Learning an HMM from labeled data
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
We count how often we see ti tj and wj_ti etc. in the data, and use relative frequency estimates:
Learning the transition probabilities: P(tj | ti) = C(ti tj) / C(ti)
Learning the emission probabilities: P(wj | ti) = C(wj ti) / C(ti)
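A minimal sketch of these relative frequency estimates (the start symbol <s> used to collect initial-tag counts is an addition, not on the slide):

from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences, start="<s>"):
    # Count C(ti tj) and C(wj ti), then divide by C(ti).
    trans = defaultdict(Counter)   # trans[ti][tj] = C(ti tj)
    emit = defaultdict(Counter)    # emit[ti][wj]  = C(wj ti)
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag

    def normalize(row):            # relative frequencies: row counts / C(ti)
        total = sum(row.values())
        return {k: v / total for k, v in row.items()}

    transition = {ti: normalize(row) for ti, row in trans.items()}
    emission = {ti: normalize(row) for ti, row in emit.items()}
    return transition, emission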
Learning an HMM from unlabeled data
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Tagset: NNP: proper noun, CD: numeral, JJ: adjective, …
We can’t count anymore.
We have to guess how often we’d expect to see ti tj etc. in our data set. Call this expected count ⟨C(…)⟩.
-Our estimate for the transition probabilities: P̂(tj | ti) = ⟨C(ti tj)⟩ / ⟨C(ti)⟩
-Our estimate for the emission probabilities: P̂(wj | ti) = ⟨C(wj ti)⟩ / ⟨C(ti)⟩
-We will talk about how to obtain these expected counts on Friday.
Finding the best tag sequence
The number of possible tag sequences is
exponential in the length of the input sentence:
Each word can have up to T tags.
There are N words.
So there are up to T^N possible tag sequences.
We cannot enumerate all T^N possible tag sequences.
But we can exploit the independence assumptions
in the HMM to define an efficient algorithm that
returns the tag sequence with the highest probability.
Bookkeeping: the trellis
We use an N×T table (“trellis”) to keep track of the HMM.
The HMM can assign one of the T tags to each of the N words.
[Figure: trellis with one column per word w(1)…w(N) (the “time steps”) and one row per state q1…qT; the cell in row qj and column w(i) stands for “word w(i) has tag tj”]
Computing P(t,w) for one tag sequence
[Figure: one path through the trellis, with emission probabilities P(w(1)|q1), P(w(2)|qj), …, P(w(N)|qj) at the cells and initial/transition probabilities P(t(1)=q1), P(qj|q1), … along the arcs]
-One path through the trellis = one tag sequence
-We just multiply the probabilities as before
Using the trellis to find t*
Let trellis[i][j] (word w(i) and tag tj) store the probability
of the best tag sequence for w(1)…w(i) that ends in tj:
trellis[i][j] = max P(w(1)…w(i), t(1)…t(i-1), t(i) = tj)
We can recursively compute trellis[i][j]
from the entries in the previous column trellis[i-1][k]:
trellis[i][j] = P(w(i) | tj) · max_k ( trellis[i-1][k] · P(tj | tk) )
At the end of the sentence, we pick the highest
scoring entry in the last column of the trellis.
Retrieving t* = argmax_t P(t, w)
[Figure: trellis with a backpointer from each cell to a cell in the previous column]
By keeping one backpointer from each cell to the cell
in the previous column that yields the highest probability,
we can retrieve the most likely tag sequence when we’re done.
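A minimal sketch of this dynamic program (the Viterbi algorithm) with backpointers, using the same nested-dictionary parameters as the earlier sketches; a real implementation would work with log probabilities and handle unknown words:

def viterbi(words, tags, initial, transition, emission):
    N = len(words)
    trellis = [{} for _ in range(N)]   # trellis[i][t] = best prob of w(1)...w(i) ending in t
    backptr = [{} for _ in range(N)]
    for t in tags:
        trellis[0][t] = initial.get(t, 0.0) * emission.get(t, {}).get(words[0], 0.0)
    for i in range(1, N):
        for t in tags:
            # max over previous tags k of trellis[i-1][k] * P(t | k)
            best_k = max(tags, key=lambda k: trellis[i - 1][k] * transition.get(k, {}).get(t, 0.0))
            trellis[i][t] = (trellis[i - 1][best_k] * transition.get(best_k, {}).get(t, 0.0)
                             * emission.get(t, {}).get(words[i], 0.0))
            backptr[i][t] = best_k
    # pick the highest scoring entry in the last column, then follow the backpointers
    best = max(tags, key=lambda t: trellis[N - 1][t])
    sequence = [best]
    for i in range(N - 1, 0, -1):
        best = backptr[i][best]
        sequence.append(best)
    return list(reversed(sequence))

With the toy model from the example HMM slide, viterbi("the man throws the ball .".split(), ["D", "N", "V", "A", "."], initial, transition, emission) returns ['D', 'N', 'V', 'D', 'N', '.'].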
More about this on Friday…
Appendix: English parts of speech
Nouns
Nouns describe entities and concepts:
Common nouns: dog, bandwidth, fire, snow, information
– Count nouns have a plural (dogs) and need an article in the singular (the dog
barks)
– Mass nouns don’t have a plural (*snows) and don’t need an article in the
singular (snow is cold, metal is expensive). But some mass nouns can also
be used as count nouns: Gold and silver are metals.
Proper nouns (Names): Mary, Smith, Illinois, USA, France, IBM
Penn Treebank tags:
NN: singular or mass
NNS: plural
NNP: singular proper noun
NNPS: plural proper noun
(Full) verbs
Verbs describe activities, processes, events:
eat, write, sleep, ….
Verbs have different morphological forms:
infinitive (to eat), present tense (I eat), 3rd pers sg. present tense (he eats),
past tense (ate), present participle (eating), past participle (eaten)
Penn Treebank tags:
VB: infinitive (base) form
VBD: past tense
VBG: present participle
VBN: past participle
VBP: non-3rd person present tense
VBZ: 3rd person singular present tense
Adjectives
Adjectives describe properties of entities:
blue, hot, old, smelly,…
Adjectives have an…
… attributive use (modifying a noun):
the blue book
… and a predicative use (e.g. as argument of be):
The book is blue.
Many gradable adjectives also have a…
…comparative form: greater, hotter, better, worse
…superlative form: greatest, hottest, best, worst
Penn Treebank tags:
JJ: adjective JJR: comparative JJS: superlative
Adverbs
Adverbs describe properties of events/states.
– Manner adverbs: slowly (slower, slowest), fast, hesitantly, …
– Degree adverbs: extremely, very, highly….
– Directional and locative adverbs: here, downstairs, left
– Temporal adverbs: yesterday, Monday,…
Adverbs modify verbs, sentences, adjectives or other adverbs:
Apparently, the very ill man walks extremely slowly
NB: certain temporal and locative adverbs (yesterday, here)
can also be classified as nouns
Penn Treebank tags:
RB: adverb RBR: comparative adverb RBS: superlative adverb
Auxiliary and modal verbs
Copula: be with a predicate
She is a student. I am hungry. She was five years old.
Modal verbs: can, may, must, might, shall,…
She can swim. You must come.
Auxiliary verbs:
-Be, have, will when used to form complex tenses:
He was being followed. She has seen him. We will have been gone.
-Do in questions, negation:
Don’t go. Did you see him?
Penn Treebank tags:
MD: modal verbs
Prepositions
Prepositions occur before noun phrases
to form a prepositional phrase (PP):
on/in/under/near/towards the wall,
with(out) milk,
by the author,
despite your protest
PPs can modify nouns, verbs or sentences:
I drink [coffee [with milk]]
I [drink coffee [with my friends]]
Penn Treebank tags:
IN: preposition
TO: ‘to’ (infinitival ‘to eat’ and preposition ‘to you’)
Conjunctions
Coordinating conjunctions conjoin two elements:
X and/or/but X
[ [John]NP and [Mary]NP] NP,
[ [Snow is cold]S but [fire is hot]S ]S.
Subordinating conjunctions introduce a subordinate
(embedded) clause:
[ He thinks that [snow is cold]S ]S
[ She wonders whether [it is cold outside]S ]S
Penn Treebank tags:
CC: coordinating
IN: subordinating (same as preposition)
Particles
Particles resemble prepositions (but are not followed
by a noun phrase) and appear with verbs:
come on
he brushed himself off
turning the paper over
turning the paper down
Phrasal verb: a verb + particle combination that has a different
meaning from the verb itself
Penn Treebank tags:
RP: particle
Pronouns
Many pronouns function like noun phrases, and refer
to some other entity:
-Personal pronouns: I, you, he, she, it, we, they
-Possessive pronouns: mine, yours, hers, ours
-Demonstrative pronouns: this, that
-Reflexive pronouns: myself, himself, ourselves
-Wh-pronouns (question words):
what, who, whom, how, why, whoever, which
Relative pronouns introduce relative clauses
the book that [he wrote]
Penn Treebank tags:
PRP: personal pronoun PRP$: possessive pronoun WP: wh-pronoun
Determiners
Determiners precede noun phrases:
the/that/a/every book
-Articles: the, an, a
-Demonstratives: this, these, that
-Quantifiers: some, every, few,…
Penn Treebank tags:
DT: determiner