CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 7:
Sequence Labeling
Recap: Statistical POS tagging with HMMs
Recap: Statistical POS tagging

w:   She    promised  to    back   the   bill
     w(1)   w(2)      w(3)  w(4)   w(5)  w(6)
t:   PRP    VBD       TO    VB     DT    NN
     t(1)   t(2)      t(3)  t(4)   t(5)  t(6)

What is the most likely sequence of tags t = t(1)…t(N)
for the given sequence of words w = w(1)…w(N)?
t* = argmaxt P(t | w)
POS tagging with generative models
P(t,w): the joint distribution of the labels we want to predict (t)
and the observed data (w).
We decompose P(t,w) into P(t) and P(w | t) since these
distributions are easier to estimate.
Models based on joint distributions of labels and observed data
are called generative models: think of P(t)P(w | t) as a stochastic
process that first generates the labels, and then generates the
data we see, based on these labels.
Estimate argmaxt P(t | w) directly (in a conditional model),
or use Bayes’ Rule (and a generative model):

argmaxt P(t | w) = argmaxt P(t, w) / P(w)
                 = argmaxt P(t, w)
                 = argmaxt P(t) P(w | t)
Hidden Markov Models (HMMs)
HMMs are generative models for POS tagging
(and other tasks, e.g. in speech recognition)
Independence assumptions of HMMs
P(t) is an n-gram model over tags:
Bigram HMM: P(t) = P(t(1))P(t(2) | t(1))P(t(3) | t(2))… P(t(N) | t(N-1))
Trigram HMM: P(t) = P(t(1)) P(t(2) | t(1)) P(t(3) | t(2), t(1)) … P(t(N) | t(N-1), t(N-2))
P(ti | tj) or P(ti | tj,tk) are called transition probabilities
In P(w | t) each word is generated by its own tag:
P(w | t) = P(w(1) | t(1))P(w(2) | t(2))… P(w(N) | t(N))
The P(w(i) | t(i)) are called emission probabilities
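To make these independence assumptions concrete, here is a minimal Python sketch (my own illustration, with made-up toy probabilities) that scores a tagged sentence as P(t)P(w | t) under a bigram HMM; P(t(1)) is folded into a transition from a start symbol <s>:

```python
# Toy sketch (not from the lecture): scoring P(t, w) = P(t) * P(w | t) under a bigram HMM.
transition = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "VB"): 0.3}        # P(t(i) | t(i-1))
emission   = {("the", "DT"): 0.4, ("dog", "NN"): 0.01, ("barks", "VB"): 0.02}  # P(w(i) | t(i))

def joint_prob(words, tags):
    """P(t, w) = prod_i P(t(i) | t(i-1)) * P(w(i) | t(i)), with a start symbol t(0) = <s>."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= transition.get((prev, t), 0.0) * emission.get((w, t), 0.0)
        prev = t
    return p

print(joint_prob(["the", "dog", "barks"], ["DT", "NN", "VB"]))  # 0.5*0.4 * 0.6*0.01 * 0.3*0.02
```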
Viterbi algorithm
Task: Given an HMM, return most likely tag sequence t(1)…t(N) for a
given word sequence (sentence) w(1)…w(N)
Data structure (Trellis): N×T table for sentence w(1)…w(N) and tag
set {t1,…,tT}. Cell trellis[i][j] stores the score of the best tag sequence for
w(1)…w(i) that ends in tag tj, plus a backpointer to the cell trellis[i−1][k]
corresponding to the tag of the preceding word
Basic procedure:
Fill trellis from left to right
Initialize trellis[1][k] := P(tk) × P(w(1) | tk)
For trellis[i][j]:
-Find best preceding tag k* = argmaxk(trellis[i−1][k] × P(tj | tk)),
-Add backpointer from trellis[i][j] to trellis[i−1][k*];
-Set trellis[i][j] := trellis[i−1][k*] × P(tj | tk*) × P(w(i) | tj)
Return tag sequence that ends in the highest scoring cell
argmaxk(trellis[N][k]) in the last column
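The procedure above, as a compact Python sketch (an illustration with my own data layout: transition, emission and initial probabilities given as dictionaries; I use log probabilities to avoid underflow):

```python
import math

def viterbi(words, tags, trans, emit, init):
    """Most likely tag sequence under a bigram HMM.
    trans[(tj, ti)] = P(ti | tj), emit[(w, t)] = P(w | t), init[t] = P(t) for the first tag."""
    def lp(p):  # log probability, with log(0) = -inf
        return math.log(p) if p > 0 else float("-inf")

    N = len(words)
    trellis = [{t: lp(init.get(t, 0)) + lp(emit.get((words[0], t), 0)) for t in tags}]
    backptr = [{}]
    for i in range(1, N):
        col, bp = {}, {}
        for ti in tags:
            # best preceding tag k* = argmax_k trellis[i-1][k] * P(ti | tk)
            best_prev = max(tags, key=lambda tk: trellis[i - 1][tk] + lp(trans.get((tk, ti), 0)))
            col[ti] = (trellis[i - 1][best_prev] + lp(trans.get((best_prev, ti), 0))
                       + lp(emit.get((words[i], ti), 0)))
            bp[ti] = best_prev
        trellis.append(col)
        backptr.append(bp)
    # follow backpointers from the best cell in the last column
    best = max(tags, key=lambda t: trellis[N - 1][t])
    seq = [best]
    for i in range(N - 1, 0, -1):
        seq.append(backptr[i][seq[-1]])
    return list(reversed(seq))
```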
Viterbi: At any given cell
-For each cell in the preceding column: multiply its entry with
the transition probability to the current cell.
-Keep a single backpointer to the best (highest scoring) cell in
the preceding column
-Multiply this score with the emission probability of the current
word
(Trellis figure: column w(n−1) holds cells trellis[n−1][j] = P(w(1..n−1), t(n−1)=tj) for tags t1…tT;
arrows labeled with the transition probabilities P(ti | tj) lead to the cell for ti in column w(n).)

trellis[n][i] = P(w(n) | ti) ⋅ maxj( trellis[n−1][j] ⋅ P(ti | tj) )
Other HMM algorithms
The Forward algorithm:
Computes P(w) by replacing Viterbi’s max() with sum()
Learning HMMs from raw text with the EM algorithm:
-We have to replace the observed counts (from labeled data)
with expected counts (according to the current model)
-Renormalizing these expected counts will give a new model
-This will be “better” than the previous model, but we will have
to repeat this multiple times to get to a decent model
The Forward-Backward algorithm:
A dynamic programming algorithm for computing the expected
counts of tag bigrams and word-tag occurrences in a sentence
under a given HMM
Sequence labeling
POS tagging
Pierre Vinken , 61 years old , will join IBM ‘s board
as a nonexecutive director Nov. 29 .
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_,
will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT
nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
Task: assign POS tags to words
Noun phrase (NP) chunking
Pierre Vinken , 61 years old , will join IBM ‘s board
as a nonexecutive director Nov. 29 .
[NP Pierre Vinken] , [NP 61 years] old , will join
[NP IBM] ‘s [NP board] as [NP a nonexecutive director]
[NP Nov. 29] .
Task: identify all non-recursive NP chunks
The BIO encoding
We define three new tags:
– B-NP: beginning of a noun phrase chunk
– I-NP: inside of a noun phrase chunk
– O: outside of a noun phrase chunk
[NP Pierre Vinken] , [NP 61 years] old , will join
[NP IBM] ‘s [NP board] as [NP a nonexecutive director]
[NP Nov. 29] .
Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP
old_O ,_O will_O join_O IBM_B-NP ‘s_O board_B-NP as_O
a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP
29_I-NP ._O
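As an illustration (the helper names are my own), chunk spans can be converted to BIO tags mechanically:

```python
# Sketch: turn bracketed chunk spans into BIO tags.
def to_bio(tokens, chunks):
    """chunks: list of (start, end, label) spans over token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "IBM"]
chunks = [(0, 2, "NP"), (3, 5, "NP"), (9, 10, "NP")]
print(list(zip(tokens, to_bio(tokens, chunks))))
# [('Pierre', 'B-NP'), ('Vinken', 'I-NP'), (',', 'O'), ('61', 'B-NP'), ('years', 'I-NP'), ...]
```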
Shallow parsing
Pierre Vinken , 61 years old , will join IBM ‘s board
as a nonexecutive director Nov. 29 .
[NP Pierre Vinken] , [NP 61 years] old , [VP will join]
[NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive
director] [NP Nov. 29] .
Task: identify all non-recursive NP,
verb (“VP”) and preposition (“PP”) chunks
The BIO encoding for shallow parsing
We define several new tags:
– B-NP B-VP B-PP: beginning of an NP, “VP”, “PP” chunk
– I-NP I-VP I-PP: inside of an NP, “VP”, “PP” chunk
– O: outside of any chunk
Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP
old_O ,_O will_B-VP join_I-VP IBM_B-NP ‘s_O board_B-NP
as_B-PP a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
[NP Pierre Vinken] , [NP 61 years] old , [VP will join]
[NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive
director] [NP Nov. 29] .
Named Entity Recognition
Pierre Vinken , 61 years old , will join IBM ‘s board
as a nonexecutive director Nov. 29 .
[PERS Pierre Vinken] , 61 years old , will join
[ORG IBM] ‘s board as a nonexecutive director
[DATE Nov. 29] .
Task: identify all mentions of named entities
(people, organizations, locations, dates)
The BIO encoding for NER
We define many new tags:
– B-PERS, B-DATE, …: beginning of a mention of a person/date…
– I-PERS, I-DATE, …: inside of a mention of a person/date…
– O: outside of any mention of a named entity
Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O
will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O
nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O
[PERS Pierre Vinken] , 61 years old , will join
[ORG IBM] ‘s board as a nonexecutive director
[DATE Nov. 29] .
Many NLP tasks are sequence labeling tasks
Input: a sequence of tokens/words:
Pierre Vinken , 61 years old , will join IBM ‘s board
as a nonexecutive director Nov. 29 .
Output: a sequence of labeled tokens/words:
POS-tagging: Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS
old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN
as_IN a_DT nonexecutive_JJ director_NN Nov._NNP
29_CD ._.
Named Entity Recognition: Pierre_B-PERS Vinken_I-PERS ,_O
61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O
board_O as_O a_O nonexecutive_O director_O Nov._B-DATE
29_I-DATE ._O
Graphical models for sequence labeling
Directed graphical models
Graphical models are a notation for probability models.
In a directed graphical model, each node represents a
distribution over a random variable:
– P(X): a single node X
Arrows represent dependencies (they define what other
random variables the current node is conditioned on):
– P(Y) P(X | Y): an arrow from node Y to node X
– P(Y) P(Z) P(X | Y, Z): arrows from nodes Y and Z to node X
Shaded nodes represent observed variables.
White nodes represent hidden variables:
– P(Y) P(X | Y) with Y hidden and X observed: a white node Y with an arrow to a shaded node X
HMMs as graphical models
HMMs are generative models of the observed input
string w
They ‘generate’ w with P(w,t) = ∏iP(t(i)| t(i−1))P(w(i)| t(i) )
When we use an HMM to tag, we observe w, and
need to find t
(Graphical model: a chain of hidden tag nodes t(1) → t(2) → t(3) → t(4), each with an arrow to its observed word w(1)…w(4).)
Models for sequence labeling
Sequence labeling: Given an input sequence w = w(1)…w(n),
predict the best (most likely) label sequence t = t(1)…t(n)
Generative models use Bayes’ Rule:
argmaxt P(t | w) = argmaxt P(t, w) / P(w)
                 = argmaxt P(t, w)
                 = argmaxt P(t) P(w | t)
Discriminative (conditional) models model P(t | w) directly.
Advantages of discriminative models
We’re usually not really interested in P(w | t).
– w is given. We don’t need to predict it!
Why not model what we’re actually interested in: P(t | w)?
Modeling P(w | t) well is quite difficult:
– Prefixes (capital letters) or suffixes are good predictors for
certain classes of t (proper nouns, adverbs,…)
– So we don’t want to model words as atomic symbols, but in
terms of features
– These features may also help us deal with unknown words
– But features may not be independent
Modeling P(t | w) with features should be easier:
– Now we can incorporate arbitrary features of the word,
because we don’t need to predict w anymore
Discriminative probability models
A discriminative or conditional model of the labels t
given the observed input string w models
P(t | w) = ∏iP(t(i) |w(i), t(i−1)) directly.
(Graphical model: observed word nodes w(1)…w(4); arrows from w(i) and from t(i−1) into each tag node t(i).)
Discriminative models
There are two main types of discriminative
probability models:
–Maximum Entropy Markov Models (MEMMs)
–Conditional Random Fields (CRFs)
MEMMs and CRFs:
–are both based on logistic regression
–have the same graphical model
– require the Viterbi algorithm for tagging
–differ in that MEMMs consist of independently
learned distributions, while CRFs are trained to
maximize the probability of the entire sequence
Probabilistic classification
Classification:
Predict a class (label) c for an input x
There are only a (small) finite number of possible class labels
Probabilistic classification:
– Model the probability P( c | x)
P(c|x) is a probability if 0 ≤ P (ci | x) ≤ 1, and ∑iP( ci | x) = 1
–Return the class c* = argmaxi P (ci | x)
that has the highest probability
One standard way to model P( c | x) is logistic
regression (used by MEMMs and CRFs)
Using features
Think of feature functions as useful questions you can
ask about the input x:
– Binary feature functions:
ffirst-letter-capitalized(Urbana) = 1
ffirst-letter-capitalized(computer) = 0
– Integer (or real-valued) features:
fnumber-of-vowels(Urbana) = 3
Which specific feature functions are useful
will depend on your task (and your training data).
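For illustration, the two example feature functions above could look like this in Python (the names and choices are my own):

```python
# Illustrative feature functions (my own naming, not prescribed by the lecture):
def f_first_letter_capitalized(x):    # binary feature
    return 1 if x[:1].isupper() else 0

def f_number_of_vowels(x):            # integer-valued feature
    return sum(1 for ch in x.lower() if ch in "aeiou")

print(f_first_letter_capitalized("Urbana"), f_number_of_vowels("Urbana"))      # 1 3
print(f_first_letter_capitalized("computer"), f_number_of_vowels("computer"))  # 0 3
```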
From features to probabilities
We associate a real-valued weight wic with each
feature function fi(x) and output class c
Note that the feature function fi(x) does not have to depend
on c as long as the weight does (note the double index wic)
This gives us a real-valued score for predicting class c
for input x: score(x,c) = ∑iwic fi(x)
This score could be negative, so we exponentiate it:
score(x,c) = exp( ∑iwic fi(x))
To get a probability distribution over all classes c,
we renormalize these scores:
P(c | x) = score(x,c)∕∑j score(x,cj)
= exp( ∑iwic fi(x))∕∑j exp( ∑iwij fi(x))
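A minimal sketch of this scoring-and-renormalization step (toy weights and feature values, my own naming; a real implementation would work in log space):

```python
import math

def maxent_probs(feats, weights, classes):
    """P(c | x) = exp(sum_i wic * fi(x)) / sum_j exp(sum_i wicj * fi(x)).
    feats: the feature values fi(x); weights[c]: the weights wic for class c."""
    scores = {c: math.exp(sum(w * f for w, f in zip(weights[c], feats))) for c in classes}
    Z = sum(scores.values())   # the normalizing term (partition function)
    return {c: scores[c] / Z for c in classes}

feats = [1, 3]                                     # e.g. [first letter capitalized?, number of vowels]
weights = {"NNP": [2.0, 0.1], "NN": [-1.0, 0.2]}   # toy weights, one vector per class
print(maxent_probs(feats, weights, ["NNP", "NN"]))
```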
Learning = finding weights w
We use conditional maximum likelihood estimation
(and standard convex optimization algorithms)
to find/learn w
(for more details, attend CS446 and CS546)
The conditional MLE training objective:
Find the w that assigns highest probability to all observed
outputs ci given the inputs xi
ŵ = argmaxw ∏i P(ci | xi, w)
  = argmaxw ∑i log P(ci | xi, w)
  = argmaxw ∑i log [ exp(∑j wj fj(xi, ci)) / ∑c’ exp(∑j wj fj(xi, c’)) ]
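As a sketch of what this objective computes (not of how it is optimized), the following evaluates the conditional log-likelihood of a labeled data set under weights w, using the fj(x, c) parameterization above; all names are illustrative:

```python
import math

def conditional_log_likelihood(data, w, classes, feats):
    """sum_i log P(ci | xi, w) for a MaxEnt model with weights wj and features fj(x, c).
    data: list of (x, c) pairs; feats(x, c): list of feature values.
    This is the (concave) objective a standard convex optimizer would maximize."""
    total = 0.0
    for x, c in data:
        scores = {c2: math.exp(sum(wj * fj for wj, fj in zip(w, feats(x, c2)))) for c2 in classes}
        total += math.log(scores[c] / sum(scores.values()))
    return total
```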
Terminology
Models that are of the form
P(c | x) = score(x,c)∕∑j score(x,cj)
= exp( ∑iwic fi(x))∕∑j exp( ∑iwij fi(x))
are also called loglinear models, Maximum Entropy
(MaxEnt) models, or multinomial logistic regression
models.
CS446 and CS546 should give you more details about these.
The normalizing term ∑j exp( ∑iwij fi(x)) is also called
the partition function and is often abbreviated as Z
Maximum Entropy Markov Models

MEMMs use a MaxEnt classifier for each P(t(i) | w(i), t(i−1)):
Since we use w to refer to words, let’s use λjk as the weight
for the feature function fj(t(i−1), w(i)) when predicting tag tk:

P(t(i) = tk | t(i−1), w(i)) = exp(∑j λjk fj(t(i−1), w(i))) / ∑l exp(∑j λjl fj(t(i−1), w(i)))

(Graphical model: arrows from t(i−1) and from the observed word w(i) to t(i).)
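The same local distribution, transcribed into a Python sketch (the data structures for λ and the feature functions are my own choices):

```python
import math

def memm_local_prob(prev_tag, word, tags, lam, feature_fns):
    """P(t(i)=tk | t(i-1), w(i)) as a MaxEnt distribution over tags.
    lam[k][j] is the weight lambda_jk for feature fj when predicting tag k."""
    feats = [f(prev_tag, word) for f in feature_fns]
    scores = {k: math.exp(sum(lam[k][j] * feats[j] for j in range(len(feats)))) for k in tags}
    Z = sum(scores.values())
    return {k: scores[k] / Z for k in tags}
```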
Viterbi for MEMMs
trellis[n][i] stores the probability of the most likely (Viterbi)
tag sequence t(1)…(n) that ends in tag ti for the prefix w(1)…w(n)
Remember that we do not generate w in MEMMs. So:
trellis[n][i] = maxt(1)..(n−1)[ P(t(1)…(n−1), t(n)=ti | w(1)…(n)) ]
= maxj [ trellis[n−1][j] × P(ti | tj, w(n)) ]
= maxj [maxt(1)..(n−2)[P(t(1)..(n−2), t(n−1)=tj | w(1)..(n−1))] ×P(ti | tj,w(n))]
(Trellis figure: column w(n−1) holds cells trellis[n−1][j] = maxt(1)..(n−2) P(t(1)..(n−2), t(n−1)=tj | w(1)..(n−1)) for tags t1…tT;
arrows labeled P(ti | tj, w(n)) lead to the cell for ti in column w(n).)

trellis[n][i] = maxj [ trellis[n−1][j] × P(ti | tj, w(n)) ]
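Compared with the HMM Viterbi sketch earlier, only the local score changes: there is no separate emission factor, and the transition becomes P(ti | tj, w(n)). A sketch, assuming a local_prob(prev_tag, word) function (e.g. the one above) that returns a distribution over tags, and a start symbol <s> for the first word:

```python
import math

def viterbi_memm(words, tags, local_prob):
    """Most likely tag sequence under an MEMM; local_prob(prev, word)[t] = P(t | prev, word).
    The first word conditions on a start symbol "<s>" (an assumption of this sketch)."""
    N = len(words)
    trellis = [{t: math.log(local_prob("<s>", words[0])[t]) for t in tags}]
    backptr = [{}]
    for n in range(1, N):
        col, bp = {}, {}
        for ti in tags:
            best = max(tags, key=lambda tj: trellis[n - 1][tj] + math.log(local_prob(tj, words[n])[ti]))
            col[ti] = trellis[n - 1][best] + math.log(local_prob(best, words[n])[ti])
            bp[ti] = best
        trellis.append(col)
        backptr.append(bp)
    seq = [max(tags, key=lambda t: trellis[N - 1][t])]
    for n in range(N - 1, 0, -1):
        seq.append(backptr[n][seq[-1]])
    return list(reversed(seq))
```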
Today’s key concepts
Sequence labeling tasks:
POS tagging
NP chunking
Shallow Parsing
Named Entity Recognition
Discriminative models:
Maximum Entropy classifiers
MEMMs
Supplementary material: Other HMM algorithms (very briefly…)
The Forward algorithm
trellis[n][i] stores the probability mass of all tag sequences
t(1)…(n) that end in tag ti for the prefix w(1)…w(n)
trellis[n][i] = ∑t(1)..(n−1)[ P(w(1)…(n), t(1)…(n−1), t(n)=ti) ]
= ∑j [ trellis[n−1][j] × P(ti | tj) ] × P( w(n) | ti)
= ∑j [∑t(1)..(n−2)[P(w(1)..(n−1),t(1)..(n−2),t(n−1)=tj)] ×P(ti | tj)] × P(w(n) | ti)
Last step: computing P(w): P(w(1)…(N)) = ∑j trellis[N][j]
(Trellis figure: column w(n−1) holds cells trellis[n−1][j] = ∑t(1)..(n−2) P(w(1)..(n−1), t(1)..(n−2), t(n−1)=tj) for tags t1…tT;
arrows labeled P(ti | tj) lead to the cell for ti in column w(n).)

trellis[n][i] = P(w(n) | ti) × ∑j [ trellis[n−1][j] × P(ti | tj) ]
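The same recursion as a Python sketch, mirroring the Viterbi code earlier with max replaced by sum (kept in real space for readability, so it will underflow on long sentences):

```python
def forward(words, tags, trans, emit, init):
    """Returns (P(w(1)...(N)), trellis) for a bigram HMM, where trellis[n][t]
    corresponds to the slide's trellis cell for word n+1 and tag t (0-based Python indices)."""
    trellis = [{t: init.get(t, 0.0) * emit.get((words[0], t), 0.0) for t in tags}]
    for n in range(1, len(words)):
        trellis.append({
            ti: emit.get((words[n], ti), 0.0)
                * sum(trellis[n - 1][tj] * trans.get((tj, ti), 0.0) for tj in tags)
            for ti in tags
        })
    return sum(trellis[-1][t] for t in tags), trellis
```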
Learning an HMM from unlabeled text
We can’t count anymore. We have to guess how often we’d
expect to see ti tj etc. in our data set.
Call this expected count 〈C(…)〉
-Our estimate for the transition probabilities:
P̂(tj | ti) = 〈C(ti tj)〉 / 〈C(ti)〉
-Our estimate for the emission probabilities:
P̂(wj | ti) = 〈C(wj, ti)〉 / 〈C(ti)〉
-Our estimate for the initial state probabilities:
P̂(ti) = 〈C(tag of first word = ti)〉 / number of sentences

Example: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Tagset: NNP: proper noun, CD: numeral, JJ: adjective, …
Expected counts
Emission probabilities with observed counts C(w, t):
P(w | t) = C(w, t) / C(t) = C(w, t) / ∑w’ C(w’, t)
Emission probabilities with expected counts 〈C(w, t)〉:
P(w | t) = 〈C(w, t)〉 / 〈C(t)〉 = 〈C(w, t)〉 / ∑w’ 〈C(w’, t)〉
〈C(w, t)〉: How often do we expect to see word w
with tag t in our training data (under a given HMM)?
We know how often the word w appears in the data,
but we don’t know how often it appears with tag t
We need to sum up 〈C(w(i)=w, t)〉 for any occurrence of w
We can show that 〈C(w(i)=w, t)〉 = P(t(i)=t | w)
(NB: Transition counts 〈C(t(i)=t, t(i+1)=t’)〉 work in a similar fashion)
Forward-Backward: P(t(i)=t | w(1)..(N))
P( t(i)=t | w(1)…(N)) = P( t(i)=t, w(1)…(N)) ∕ P(w(1)…(N))
w(1)…(N) = w(1)…(i)w(i+1)…(N)
Due to HMM’s independence assumptions:
P( t(i)=t, w(1)…(N)) = P(t(i)=t, w(1)…(i)) × P(w(i+1)…(N) | t(i) =t)
The forward algorithm gives P(w(1)…(N)) = ∑t forward[N][t]
Forward trellis: forward[i][t] = P(t(i)=t, w(1)…(i))
Gives the total probability mass of the prefix w(1)…(i), summed
over all tag sequences t(1)…(i) that end in tag t(i)=t
Backward trellis: backward[i][t] = P(w(i+1)…(N) | t(i)=t)
Gives the total probability mass of the suffix w(i+1)…(N), summed
over all tag sequences t(i+1)…(N), if we assign tag t(i)=t to w(i)
The Backward algorithm

The backward trellis is filled from right to left.
backward[i][t] provides P(w(i+1)…(N) | t(i)=t)
NB: ∑t P(t(1)=t) P(w(1) | t) backward[1][t] = P(w(1)…(N)) = ∑t forward[N][t]
Initialization (last column):
backward[N][t] = 1
Recursion (any other column):
backward[i][t] = ∑t’ P(t’ | t) × P(w(i+1) | t’) × backward[i+1][t’]
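And the corresponding backward pass, as a sketch with the same dictionary conventions as the forward sketch above:

```python
def backward(words, tags, trans, emit):
    """trellis[i][t] corresponds to the slide's backward[i+1][t] (0-based Python indices):
    the last column is all 1s, and the table is filled right to left."""
    N = len(words)
    trellis = [dict() for _ in range(N)]
    trellis[N - 1] = {t: 1.0 for t in tags}
    for i in range(N - 2, -1, -1):
        trellis[i] = {
            t: sum(trans.get((t, t2), 0.0) * emit.get((words[i + 1], t2), 0.0) * trellis[i + 1][t2]
                   for t2 in tags)
            for t in tags
        }
    return trellis
```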
How do we compute 〈C(t, w(i)) | w〉?
〈C(t, w(i)) | w〉 = P(t(i)=t, w) / P(w)
with
P(t(i)=t, w) = forward[i][t] × backward[i][t]
P(w) = ∑t forward[N][t]
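Putting the forward and backward sketches together gives the per-position posteriors, i.e. the expected counts used by EM (a real-space sketch; practical implementations use log space or scaling):

```python
def tag_posteriors(words, tags, trans, emit, init):
    """gamma[i][t] = P(t(i)=t | w) = forward[i][t] * backward[i][t] / P(w);
    summing these over the positions where a word occurs gives <C(w, t)>."""
    total, fwd = forward(words, tags, trans, emit, init)   # forward() sketch above
    bwd = backward(words, tags, trans, emit)               # backward() sketch above
    return [{t: fwd[i][t] * bwd[i][t] / total for t in tags} for i in range(len(words))]
```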
The importance of tag dictionaries
Forward-Backward assumes that each tag can be
assigned to any word.
No guarantee that the learned HMM bears any resemblance to
the tags we want to get out of a POS tagger.
A tag dictionary lists the possible POS tags for words.
Even a partial dictionary that lists only the tags for the most
common words and contains at least a few words for each tag
provides enough constraints to get significantly closer to a
model that produces linguistically correct (and hence useful)
POS tags.
a        DT
an       DT
and      CC
America  NNP
back     JJ, NN, VB, VBP, RP
bank     NN, VB, VBP
…        …
zebra    NN