Sequence labeling problems
▶ Many problems in NLP can be formulated as sequence labeling problems
▶ POS tagging:
  The DT man NN who WP whistles VBZ tunes VBZ pianos NNS
▶ Named Entity Recognition (NER):
  The O company O is O backed O by O Microsoft B-ORG cofounder O Bill B-PER Gates I-PER and O venture O capitalist O Andreessen B-PER Horowitz I-PER
▶ Time expression detection:
  Bedford O police O said O they O received O a O call O about O 3:45 B-TIMEX p.m. I-TIMEX Monday B-TIMEX
▶ Spoken language understanding:
  Which O flights FLIGHT arrive ARRIVE in O Burbank CITY from O Denver CITY on ON Saturday DAY
▶ ...
Search and Learning
Recall that most natural language problems can be formulated mathematically as optimization:

ŷ = argmax_{y ∈ Y(x)} Ψ(x, y; θ)

There are two modules to this:
▶ Search, the module that is responsible for finding the argmax of the scoring function
▶ Learning, the module that is responsible for finding the optimal parameters θ

For simple text classification problems, the search module is fairly straightforward, and most of the work goes into learning. For sequence labeling and more complicated NLP problems, the search module becomes more involved.
Sequence labeling: first idea
▶ Classify the sequence one element at a time
Sequence labeling example: POS tagging
▶ Let's use POS tagging as an example
▶ The most commonly used data set for training POS taggers is the Penn Treebank

  DT   NN   WP   VBZ        VBZ    NNS
  The  man  who  whistles   tunes  pianos

▶ DT: determiner
▶ NN: singular or mass (uncountable) noun
▶ WP: wh-pronoun
▶ VBZ: 3rd person singular verb
▶ NNS: plural noun
How do we extract features from sequences in a linear model?
▶ We take as input a sequence of word tokens w and their corresponding POS tags y, as well as a position m, and return a set of features associated with that position.
▶ Typically we make the assumption that the context that matters for classifying the word at position m is its surrounding words. We define a window of size k centered on position m, and only extract contextual information from this window (see the sketch below).
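A minimal sketch of this window-based feature extraction in Python. The function name, feature names, and START padding are illustrative, not from the slides; the window convention (the current word plus the previous k-1 words) follows the examples on the next slides.

def window_features(words, m, tag, k=1):
    """Features for labeling position m (1-indexed, as in the slides) with `tag`.

    With k=1 only the current word is used; with k=2 the previous word is added,
    padding with START to the left of the sentence.
    """
    feats = []
    for offset in range(k):                        # 0 = current word, 1 = previous word, ...
        pos = m - 1 - offset                       # convert to a 0-indexed list position
        word = words[pos] if pos >= 0 else "START"
        name = "w0" if offset == 0 else f"w-{offset}"
        feats.append((f"{name}={word}", tag))
    return feats

words = "the man who whistles tunes pianos".split()
print(window_features(words, 2, "NN", k=2))        # [('w0=man', 'NN'), ('w-1=the', 'NN')]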
Extracting features from a window of size 1
▶ Assuming a window of size 1, the features we will extract from the example sentence are:

f((w = the man who whistles tunes pianos, m = 1), DT) = {(w0 = the, DT)}
f((w = the man who whistles tunes pianos, m = 2), NN) = {(w0 = man, NN)}
...
How many features will we extract if we use a window of size 1?
Weights
We can then train a classifier using these features and get a weight for each feature:
              DT      NN      WP      VBZ     NNS
w0=the       -0.05   -3.9    -4.6    -4.6    -4.6
w0=man       -4.6    -0.35   -4.6    -1.4    -3.5
w0=who       -4.6    -4.6    -0.05   -4.6    -4.6
w0=whistles  -4.6    -4.6    -4.6    -0.8    -0.63
w0=tunes     -4.6    -4.6    -4.6    -0.8    -0.6
w0=pianos    -4.6    -4.6    -4.6    -3.0    -0.08

For example, the weight θ_1 = -0.05 for the feature f_1 = (w0 = the, DT).
Using these weights we can classify each word in the sequence:

Ψ((w = the man who whistles tunes pianos, m = 1), DT)  = Σ_i f_i θ_i = -0.05
Ψ((w = the man who whistles tunes pianos, m = 1), NN)  = Σ_i f_i θ_i = -3.9
Ψ((w = the man who whistles tunes pianos, m = 1), WP)  = Σ_i f_i θ_i = -4.6
Ψ((w = the man who whistles tunes pianos, m = 1), VBZ) = Σ_i f_i θ_i = -4.6
Ψ((w = the man who whistles tunes pianos, m = 1), NNS) = Σ_i f_i θ_i = -4.6
Predicting the tag for each word in the sequence
After finding the argmax_y at all positions of the sentence, we get:

DT   NN   WP   NNS        NNS    NNS
The  man  who  whistles   tunes  pianos
(word-feature weight table repeated from the previous slide)
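The independent, position-by-position prediction above can be reproduced in a few lines of Python. This is only a sketch: the dictionary below copies a subset of the weight table, and every unlisted (word, tag) pair is assumed to take the table's background value of -4.6.

TAGS = ["DT", "NN", "WP", "VBZ", "NNS"]

# (w0=word, tag) weights copied from the table; unlisted pairs default to -4.6
WEIGHTS = {
    ("the", "DT"): -0.05, ("man", "NN"): -0.35, ("who", "WP"): -0.05,
    ("man", "VBZ"): -1.4, ("man", "NNS"): -3.5,
    ("whistles", "VBZ"): -0.8, ("whistles", "NNS"): -0.63,
    ("tunes", "VBZ"): -0.8, ("tunes", "NNS"): -0.6,
    ("pianos", "VBZ"): -3.0, ("pianos", "NNS"): -0.08,
}

def classify_independently(words):
    # for each word on its own, pick the tag whose w0 feature has the highest weight
    return [max(TAGS, key=lambda t: WEIGHTS.get((w, t), -4.6)) for w in words]

print(classify_independently("the man who whistles tunes pianos".split()))
# ['DT', 'NN', 'WP', 'NNS', 'NNS', 'NNS'] -- 'whistles' and 'tunes' come out as NNS instead of VBZ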
Extracting features from a larger window
▶ If we increase the window size to 2, we also include the previous word in the context:

f((w = the man who whistles tunes pianos, m = 1), DT) = {(w0 = the, DT), (w-1 = START, DT)}
f((w = the man who whistles tunes pianos, m = 2), NN) = {(w0 = man, NN), (w-1 = the, NN)}
...
f((w = the man who whistles tunes pianos, m = 4), VBZ) = {(w0 = whistles, VBZ), (w-1 = who, VBZ)}
Include weights for the new features
              DT      NN      WP      VBZ     NNS
w0=the       -0.05   -3.9    -4.6    -4.6    -4.6
w0=man       -4.6    -0.35   -4.6    -1.4    -3.5
w0=who       -4.6    -4.6    -0.05   -4.6    -4.6
w0=whistles  -4.6    -4.6    -4.6    -0.8    -0.63
w0=tunes     -4.6    -4.6    -4.6    -0.8    -0.6
w0=pianos    -4.6    -4.6    -4.6    -3.0    -0.08
w-1=START    -0.92   -3.9    -1.9    -3.5    -0.92
w-1=the      -4.6    -0.7    -4.6    -4.6    -0.75
w-1=man      -1.6    -2.3    -0.9    -1.6    -2.3
w-1=who      -1.8    -4.6    -4.6    -0.2    -4.6
w-1=whistles -2.3    -4.6    -4.6    -1.6    -0.4
w-1=tunes    -1.6    -4.6    -4.6    -4.6    -0.26
Classification with the new weights
Ψ((w = the man who whistles tunes pianos, m = 4), DT)  = Σ_i f_i θ_i = -4.6 - 1.8 = -6.4
Ψ((w = the man who whistles tunes pianos, m = 4), NN)  = Σ_i f_i θ_i = -4.6 - 4.6 = -9.2
Ψ((w = the man who whistles tunes pianos, m = 4), WP)  = Σ_i f_i θ_i = -4.6 - 4.6 = -9.2
Ψ((w = the man who whistles tunes pianos, m = 4), VBZ) = Σ_i f_i θ_i = -0.8 - 0.2 = -1.0
Ψ((w = the man who whistles tunes pianos, m = 4), NNS) = Σ_i f_i θ_i = -0.63 - 4.6 = -5.23

So VBZ receives the highest score when classifying position 4.
Updated classification results
DT   NN   WP   VBZ        NNS    NNS
The  man  who  whistles   tunes  pianos

(word and previous-word weights repeated from the previous slide)
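The updated prediction above can be reproduced the same way once the previous-word features are added. Again this is only a sketch: the relevant cells of the table are copied into the two dictionaries below, with -4.6 assumed as the default for unlisted pairs.

TAGS = ["DT", "NN", "WP", "VBZ", "NNS"]

W_CUR = {   # (w0=word, tag) weights
    ("the", "DT"): -0.05, ("man", "NN"): -0.35, ("who", "WP"): -0.05,
    ("whistles", "VBZ"): -0.8, ("whistles", "NNS"): -0.63,
    ("tunes", "VBZ"): -0.8, ("tunes", "NNS"): -0.6,
    ("pianos", "VBZ"): -3.0, ("pianos", "NNS"): -0.08,
}
W_PREV = {  # (w-1=word, tag) weights
    ("START", "DT"): -0.92, ("the", "NN"): -0.7, ("man", "WP"): -0.9,
    ("who", "VBZ"): -0.2, ("whistles", "VBZ"): -1.6, ("whistles", "NNS"): -0.4,
    ("tunes", "VBZ"): -4.6, ("tunes", "NNS"): -0.26,
}

def classify_window2(words):
    prev = ["START"] + list(words[:-1])
    score = lambda w, p, t: W_CUR.get((w, t), -4.6) + W_PREV.get((p, t), -4.6)
    return [max(TAGS, key=lambda t: score(w, p, t)) for w, p in zip(words, prev)]

print(classify_window2("the man who whistles tunes pianos".split()))
# ['DT', 'NN', 'WP', 'VBZ', 'NNS', 'NNS'] -- 'whistles' is now tagged VBZ, but 'tunes' still is not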
Sequence labeling as structured prediction
▶ Enlarging the window to include more context helps, to a degree
▶ To further improve the classifier, we also need to evaluate sequences of tags. For example, the tag sequence "NNS NNS" should receive a very low score because it rarely appears. Incorporating such information in the model would help improve tagging accuracy
▶ The tags are of course not observed at prediction time, and they need to be predicted together
Sequence labeling: Computing a global score for the entire sequence
▶ Consider all possible label sequences for the input sequence, and choose the one that has the highest score:

Ψ(w, (DT, NN, WP, VBZ, NNS, NNS)) = ?
Ψ(w, (DT, NN, WP, VBZ, VBZ, NNS)) = ?

▶ For a sequence of M elements with a tagset of size N, there are N^M possible sequences, a very large number!
▶ To find the sequence with the highest score, we need to search efficiently
▶ The common solution is the Viterbi algorithm
Sequence labeling as structured prediction
▶ The goal of the model is to find the tag sequence that yields the highest score for the input sequence:

ŷ = argmax_{y ∈ Y(w)} Ψ(w, y)

▶ To make the computation tractable, we factor the score for the entire sequence into a sum of local scores, one at each position m:

Ψ(w, y) = Σ_{m=1}^{M+1} ψ(w, y_m, y_{m-1}, m)

▶ The local score is a weighted sum of the local features at position m:

ψ(w_{1:M}, y_m, y_{m-1}, m) = θ · f(w, y_m, y_{m-1}, m)
Feature representation for sequences
f(w = the man who whistles tunes pianos, y = DT NN WP VBZ VBZ NNS)
= Σ_{m=1}^{M+1} f(w, y_m, y_{m-1}, m)
= f(w, DT, ♦, 1) + f(w, NN, DT, 2) + f(w, WP, NN, 3)
  + f(w, VBZ, WP, 4) + f(w, VBZ, VBZ, 5) + f(w, NNS, VBZ, 6)
  + f(w, ■, NNS, 7)
= f(w0 = the, y0 = DT) + f(y0 = DT, y-1 = ♦)
  + f(w0 = man, y0 = NN) + f(y0 = NN, y-1 = DT)
  + f(w0 = who, y0 = WP) + f(y0 = WP, y-1 = NN)
  + f(w0 = whistles, y0 = VBZ) + f(y0 = VBZ, y-1 = WP)
  + f(w0 = tunes, y0 = VBZ) + f(y0 = VBZ, y-1 = VBZ)
  + f(w0 = pianos, y0 = NNS) + f(y0 = NNS, y-1 = VBZ)
  + f(y0 = ■, y-1 = NNS)
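A sketch of this decomposition in Python (the string encoding of features and the use of "<S>"/"</S>" for the start and stop tags are assumptions): the global feature vector is just the count of emission features (w0=word, y0=tag) and transition features (y-1, y0) over the whole sequence, and Ψ(w, y) is then the dot product of these counts with θ.

from collections import Counter

def global_features(words, tags):
    """Feature counts f(w, y) summed over positions 1..M+1."""
    padded = ["<S>"] + list(tags) + ["</S>"]          # add the start and stop tags
    feats = Counter()
    for m, word in enumerate(words):
        feats[f"w0={word},y0={padded[m + 1]}"] += 1   # emission feature at position m+1
    for prev, curr in zip(padded, padded[1:]):
        feats[f"y-1={prev},y0={curr}"] += 1           # transition feature
    return feats

words = "the man who whistles tunes pianos".split()
print(global_features(words, ["DT", "NN", "WP", "VBZ", "VBZ", "NNS"]))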
Decoding for sequences: The Viterbi algorithm
▶ The goal is to find the sequence of tags with the highest score:

ŷ = argmax_{y ∈ Y(w)} Ψ(w, y)
  = argmax_{y_{1:M}} Σ_{m=1}^{M+1} ψ(w, y_m, y_{m-1}, m)
  = argmax_{y_{1:M}} Σ_{m=1}^{M+1} s_m(y_m, y_{m-1})

▶ Instead of finding the argmax for the entire sequence directly, we start by finding the max up to position m and keep a sequence of back pointers
Finding the max score for the sequence
max_{y_{1:M}} Ψ(w, y_{1:M})
= max_{y_{1:M}} Σ_{m=1}^{M+1} s_m(y_m, y_{m-1})
= max_{y_M} [ s_{M+1}(■, y_M) + max_{y_{1:M-1}} Σ_{m=1}^{M} s_m(y_m, y_{m-1}) ]
Viterbi variable
Caching Viterbi variables as intermediate results:

v_m(y_m) ≜ max_{y_{1:m-1}} Σ_{n=1}^{m} s_n(y_n, y_{n-1})
         = max_{y_{m-1}} [ s_m(y_m, y_{m-1}) + max_{y_{1:m-2}} Σ_{n=1}^{m-1} s_n(y_n, y_{n-1}) ]
         = max_{y_{m-1}} [ s_m(y_m, y_{m-1}) + v_{m-1}(y_{m-1}) ]

Note that v_1(y_1) ≜ s_1(y_1, ♦), and the maximum overall score for the sequence is the final Viterbi variable: max_{y_{1:M}} Ψ(w_{1:M}, y_{1:M}) = v_{M+1}(■).
The Viterbi Algorithm
Viterbi algorithm: each s_m(k, k') is the local score for y_m = k and y_{m-1} = k'.

 1: for k ∈ {0, ..., K} do
 2:     v_1(k) ← s_1(k, ♦)
 3: for m ∈ {2, ..., M} do
 4:     for k ∈ {0, ..., K} do
 5:         v_m(k) ← max_{k'} s_m(k, k') + v_{m-1}(k')
 6:         b_m(k) ← argmax_{k'} s_m(k, k') + v_{m-1}(k')
 7: y_M ← argmax_k s_{M+1}(■, k) + v_M(k)
 8: for m ∈ {M-1, ..., 1} do
 9:     y_m ← b_{m+1}(y_{m+1})
10: return y_{1:M}
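A sketch of the algorithm above in Python (numpy is used only for convenience; the local_score interface, with None standing in for the start tag ♦ at m = 1 and for the stop tag ■ at m = M+1, is an assumption, not from the slides):

import numpy as np

def viterbi(local_score, M, K):
    """Return the best tag sequence y_1..y_M (tags 0..K-1) and its total score."""
    v = np.full((M + 1, K), -np.inf)        # v[m, k]: best score of y_1..y_m with y_m = k
    b = np.zeros((M + 1, K), dtype=int)     # back pointers
    for k in range(K):
        v[1, k] = local_score(1, k, None)   # v_1(k) = s_1(k, start)
    for m in range(2, M + 1):
        for k in range(K):
            cand = [local_score(m, k, kp) + v[m - 1, kp] for kp in range(K)]
            b[m, k] = int(np.argmax(cand))
            v[m, k] = cand[b[m, k]]
    final = [local_score(M + 1, None, k) + v[M, k] for k in range(K)]  # transition to stop
    y = [0] * (M + 1)
    y[M] = int(np.argmax(final))
    for m in range(M - 1, 0, -1):           # follow the back pointers
        y[m] = int(b[m + 1, y[m + 1]])
    return y[1:], float(max(final))

Each position's scores are computed once and the best predecessor is cached, so decoding costs on the order of M·K² local-score evaluations instead of enumerating every possible tag sequence.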
Assuming these parameters
              DT      NN      WP      VBZ     NNS     ■
w0=the       -0.05   -3.9    -4.6    -4.6    -4.6    -∞
w0=man       -4.6    -0.35   -4.6    -1.4    -3.5    -∞
w0=who       -4.6    -4.6    -0.05   -4.6    -4.6    -∞
w0=whistles  -4.6    -4.6    -4.6    -0.8    -0.63   -∞
w0=tunes     -4.6    -4.6    -4.6    -0.8    -0.6    -∞
w0=pianos    -4.6    -4.6    -4.6    -3.0    -0.08   -∞
t-1=♦        -0.92   -3.9    -1.9    -3.5    -0.92   -∞
t-1=DT       -2.3    -0.69   -4.6    -4.6    -0.76   -4.6
t-1=NN       -4.6    -1.6    -0.3    -0.36   -1.0    -0.7
t-1=WP       -3.8    -4.6    -4.6    -0.2    -4.6    -4.6
t-1=VBZ      -0.2    -1.3    -1.6    -4.6    -0.92   -2.3
t-1=NNS      -4.6    -4.6    -0.1    -4.6    -3.9    -1.2
Example Viterbi computation
[Trellis figure: one column of tag nodes (DT, NN, WP, VBZ, NNS) per word of "the man who whistles tunes pianos", between a start and an end node. Each node is labeled with its Viterbi variable, e.g. v_1(DT) = -0.97, v_1(NN) = -7.8, v_1(WP) = -6.5, v_1(VBZ) = -8.1, v_1(NNS) = -5.52, and v_2(DT) = -7.86, v_2(NN) = -2.01, v_2(WP) = -10.2, v_2(VBZ) = -6.97, v_2(NNS) = -5.2.]
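As a sanity check, the first two trellis columns can be reproduced by hand from the parameter table above:

v_1(DT) = s_1(DT, ♦) = θ(w0=the, DT) + θ(t-1=♦, DT) = -0.05 - 0.92 = -0.97
v_2(NN) = max_{k'} [ s_2(NN, k') + v_1(k') ] = s_2(NN, DT) + v_1(DT) = (-0.35 - 0.69) - 0.97 = -2.01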
Additional features (and their weights) can be added
              DT      NN      WP      VBZ     NNS     ■
w0=the       -0.05   -3.9    -4.6    -4.6    -4.6    -∞
w0=man       -4.6    -0.35   -4.6    -1.4    -3.5    -∞
w0=who       -4.6    -4.6    -0.05   -4.6    -4.6    -∞
w0=whistles  -4.6    -4.6    -4.6    -0.8    -0.63   -∞
w0=tunes     -4.6    -4.6    -4.6    -0.8    -0.6    -∞
w0=pianos    -4.6    -4.6    -4.6    -3.0    -0.08   -∞
t-1=♦        -0.92   -3.9    -1.9    -3.5    -0.92   -∞
t-1=DT       -2.3    -0.69   -4.6    -4.6    -0.76   -4.6
t-1=NN       -4.6    -1.6    -0.3    -0.36   -1.0    -0.7
t-1=WP       -3.8    -4.6    -4.6    -0.2    -4.6    -4.6
t-1=VBZ      -0.2    -1.3    -1.6    -4.6    -0.92   -2.3
t-1=NNS      -4.6    -4.6    -0.1    -4.6    -3.9    -1.2
w-1=START    -0.92   -3.9    -1.9    -3.5    -0.92   -∞
w-1=the      -4.6    -0.7    -4.6    -4.6    -0.75   -10
w-1=man      -1.6    -2.3    -0.9    -1.6    -2.3    -1
w-1=who      -1.8    -4.6    -4.6    -0.2    -4.6    -9
w-1=whistles -2.3    -4.6    -4.6    -1.6    -0.4    -0.5
w-1=tunes    -1.6    -4.6    -4.6    -4.6    -0.26   -0.3
Feature templates used in SoA models
State-of-the-art models tend to use a richer set of features and higher-order transitions (a sketch follows the list):
▶ current word, w0
▶ previous words, w-1, w-2
▶ next words, w+1, w+2
▶ previous two tags, y-1, y-2
▶ for rare words:
  ▶ first k characters, up to k = 4
  ▶ last k characters, up to k = 4
  ▶ whether w_m contains a number, an uppercase character, or a hyphen
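A sketch of these templates as a Python feature function; the feature names, the boundary padding, and the frequent_words test for deciding which words count as rare are illustrative assumptions, not a fixed standard.

def template_features(words, prev_tags, m, frequent_words=frozenset()):
    """Features for position m (0-indexed); prev_tags holds the tags of positions < m."""
    def w(i):   # word at offset i from m, padded at the sentence boundaries
        j = m + i
        return words[j] if 0 <= j < len(words) else "<PAD>"
    def y(i):   # tag at offset i (i < 0), padded with the start symbol
        j = m + i
        return prev_tags[j] if 0 <= j < len(prev_tags) else "<S>"
    feats = [
        f"w0={w(0)}",
        f"w-1={w(-1)}", f"w-2={w(-2)}",              # previous words
        f"w+1={w(+1)}", f"w+2={w(+2)}",              # next words
        f"y-1={y(-1)}", f"y-2,y-1={y(-2)},{y(-1)}",  # previous two tags
    ]
    word = w(0)
    if word not in frequent_words:                   # extra shape features for rare words
        for k in range(1, min(4, len(word)) + 1):
            feats.append(f"prefix{k}={word[:k]}")
            feats.append(f"suffix{k}={word[-k:]}")
        feats.append(f"has_digit={any(c.isdigit() for c in word)}")
        feats.append(f"has_upper={any(c.isupper() for c in word)}")
        feats.append(f"has_hyphen={'-' in word}")
    return feats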
Parameter estimation for sequence labeling
We can extend the text classification models to sequence labeling:

Text classification              Sequence labeling
Naïve Bayes                      Hidden Markov Models (HMM)
Logistic Regression              Conditional Random Fields (CRF)
Perceptron                       Perceptron
Support Vector Machines (SVM)    Support Vector Machines (SVM)
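One concrete instance of this correspondence is the structured perceptron: decode with the Viterbi algorithm, then update the weights toward the gold feature counts and away from the predicted ones. A sketch, with decode and global_features passed in as assumed helpers:

def structured_perceptron_epoch(data, weights, decode, global_features):
    """One training pass; data is a list of (words, gold_tags) pairs.

    decode(words, weights) returns the highest-scoring tag sequence (e.g. via Viterbi);
    global_features(words, tags) returns a dict of feature counts.
    """
    for words, gold in data:
        pred = decode(words, weights)
        if list(pred) != list(gold):
            for feat, count in global_features(words, gold).items():
                weights[feat] = weights.get(feat, 0.0) + count   # promote gold features
            for feat, count in global_features(words, pred).items():
                weights[feat] = weights.get(feat, 0.0) - count   # demote predicted features
    return weights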