Generative approaches: Hidden Markov Models
Hidden Markov Models (HMM): the generative story
The generative story: first, the tag sequence is drawn from a prior distribution (a first-order Markov chain over tags); then each token is drawn from a distribution conditioned on its tag.
$y_0 \leftarrow \Diamond$, $m \leftarrow 1$
repeat
    $y_m \sim \text{Categorical}(\lambda_{y_{m-1}})$    ▷ sample the current tag
    $w_m \sim \text{Categorical}(\phi_{y_m})$    ▷ sample the current word
until $y_m = \blacksquare$    ▷ terminate when the stop symbol is generated
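To make the generative story concrete, here is a minimal Python sketch that samples a tagged sequence from a toy HMM. The tag set, the probability tables, and the "<s>"/"</s>" start/stop symbols are all made up for illustration; they are not part of the slides.

```python
import random

# Hypothetical transition and emission distributions over a toy tag set.
transitions = {                      # P(y_m | y_{m-1})
    "<s>": {"DT": 0.7, "NN": 0.3},
    "DT":  {"NN": 0.9, "</s>": 0.1},
    "NN":  {"NN": 0.2, "</s>": 0.8},
}
emissions = {                        # P(w_m | y_m)
    "DT": {"the": 0.8, "a": 0.2},
    "NN": {"dog": 0.5, "man": 0.5},
    "</s>": {"<stop>": 1.0},
}

def sample_categorical(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r <= total:
            return value
    return value  # guard against floating-point underflow

def sample_sequence():
    """Follow the generative story: sample a tag, then a word, until stop."""
    tags, words, prev = [], [], "<s>"
    while True:
        tag = sample_categorical(transitions[prev])
        tags.append(tag)
        words.append(sample_categorical(emissions[tag]))
        if tag == "</s>":
            return words, tags
        prev = tag
```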
The independence assumptions of the HMM
In addition to the independence assumptions already made for Naïve Bayes, the HMM requires two additional independence assumptions:
- The probability of each word token depends only on its tag, not on any other element in the sequence:
  $$P(w \mid y) = \prod_{m=1}^{M} P(w_m \mid y_m)$$
- Each tag $y_m$ depends only on its predecessor:
  $$P(y) = \prod_{m=1}^{M+1} P(y_m \mid y_{m-1}),$$
  where $y_0 = \Diamond$ (the start symbol) in all cases.
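The two independence assumptions can be read off a pair of scoring functions. Below is a small sketch, with made-up probability tables and "<s>"/"</s>" standing in for the start and stop symbols, that computes $\log P(y)$ and $\log P(w \mid y)$ as the sums the factorization implies.

```python
import math

# Hypothetical transition and emission probabilities for a toy tag set.
p_trans = {("<s>", "DT"): 0.7, ("DT", "NN"): 0.9, ("NN", "</s>"): 0.8}
p_emit = {("DT", "the"): 0.8, ("NN", "man"): 0.5}

def log_prior(tags):
    """log P(y) = sum_{m=1}^{M+1} log P(y_m | y_{m-1}), with y_0 = <s> and y_{M+1} = </s>."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return sum(math.log(p_trans[(padded[m - 1], padded[m])])
               for m in range(1, len(padded)))

def log_likelihood(words, tags):
    """log P(w | y) = sum_{m=1}^{M} log P(w_m | y_m)."""
    return sum(math.log(p_emit[(y, w)]) for w, y in zip(words, tags))

print(log_prior(["DT", "NN"]) + log_likelihood(["the", "man"], ["DT", "NN"]))
```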
Parameter estimation of HMMs
The hidden Markov model has two groups of parameters:
- Emission probabilities: $P_E(w_m \mid y_m; \phi)$ is the probability of emitting word $w_m$ given tag $y_m$.
- Transition probabilities: $P_T(y_m \mid y_{m-1}; \lambda)$ is the probability of tag $y_m$ following tag $y_{m-1}$.
Both groups of parameters are typically computed by relative frequency estimation on a labeled corpus. The unsmoothed estimates are:
$$\phi_{k,i} \triangleq \Pr(W_m = i \mid Y_m = k) = \frac{\text{count}(W_m = i, Y_m = k)}{\text{count}(Y_m = k)}$$
$$\lambda_{k,k'} \triangleq \Pr(Y_m = k' \mid Y_{m-1} = k) = \frac{\text{count}(Y_m = k', Y_{m-1} = k)}{\text{count}(Y_{m-1} = k)}$$
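A relative-frequency estimator is just counting and dividing. The sketch below assumes the training data is a list of (words, tags) pairs and uses "<s>"/"</s>" for the start and stop symbols; it produces the unsmoothed $\phi$ and $\lambda$ tables above.

```python
from collections import defaultdict

def estimate_hmm(tagged_corpus):
    """Unsmoothed relative-frequency estimates of the emission (phi) and
    transition (lambda) probabilities from a list of (words, tags) pairs."""
    emit = defaultdict(lambda: defaultdict(int))    # count(W_m = i, Y_m = k)
    trans = defaultdict(lambda: defaultdict(int))   # count(Y_m = k', Y_{m-1} = k)

    for words, tags in tagged_corpus:
        prev = "<s>"
        for w, y in zip(words, tags):
            emit[y][w] += 1
            trans[prev][y] += 1
            prev = y
        trans[prev]["</s>"] += 1                    # transition into the stop symbol

    phi = {k: {w: c / sum(ws.values()) for w, c in ws.items()}
           for k, ws in emit.items()}
    lam = {k: {k2: c / sum(ks.values()) for k2, c in ks.items()}
           for k, ks in trans.items()}
    return phi, lam

phi, lam = estimate_hmm([(["the", "man"], ["DT", "NN"])])
print(phi["NN"]["man"], lam["DT"]["NN"])   # both 1.0 on this toy corpus
```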
Inference for HMMs
- The goal of inference in the hidden Markov model is to find the highest-probability tag sequence:
  $$\hat{y} = \operatorname*{argmax}_{y} \log P(y \mid w)$$
- Since $P(y, w) = P(y \mid w) \times P(w) \propto P(y \mid w)$, the inference problem can be reformulated as maximizing the joint probability of $w$ and $y$:
  $$\hat{y} = \operatorname*{argmax}_{y} \log P(y, w)$$
Inference for HMMs
- Applying the independence assumptions, we get:
  \begin{align*}
  \log P(y, w) &= \log P(y) + \log P(w \mid y) \\
  &= \sum_{m=1}^{M+1} \log P_{Y}(y_m \mid y_{m-1}) + \log P_{W \mid Y}(w_m \mid y_m) \\
  &= \sum_{m=1}^{M+1} \log \lambda_{y_{m-1}, y_m} + \log \phi_{y_m, w_m} \\
  &= \sum_{m=1}^{M+1} s_m(y_m, y_{m-1})
  \end{align*}
- This allows us to apply a variant of the Viterbi algorithm (sketched below) in which the only parameters are the transition and emission probabilities, corresponding to just two features at each position in the sequence. This limitation helps explain the performance disadvantage of HMMs relative to feature-rich discriminative models.
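For reference, here is a compact Viterbi sketch for the HMM case, where the per-position score is just a transition log-probability plus an emission log-probability. The dictionary-based representation and the "<s>"/"</s>" symbols are assumptions made for the example; unseen transitions and emissions get score $-\infty$ rather than being smoothed.

```python
import math

def viterbi(words, tags, log_trans, log_emit):
    """Find argmax_y log P(y, w) for an HMM.
    log_trans[(k, k2)] = log P(k2 | k); log_emit[(k, w)] = log P(w | k);
    "<s>" and "</s>" stand for the start and stop symbols."""
    M = len(words)
    best = [{} for _ in range(M)]   # best[m][k]: best score of a prefix ending in tag k
    back = [{} for _ in range(M)]   # back[m][k]: best previous tag
    for k in tags:
        best[0][k] = (log_trans.get(("<s>", k), -math.inf)
                      + log_emit.get((k, words[0]), -math.inf))
    for m in range(1, M):
        for k in tags:
            scores = {kp: best[m - 1][kp] + log_trans.get((kp, k), -math.inf)
                      for kp in tags}
            back[m][k] = max(scores, key=scores.get)
            best[m][k] = scores[back[m][k]] + log_emit.get((k, words[m]), -math.inf)
    final = {k: best[M - 1][k] + log_trans.get((k, "</s>"), -math.inf) for k in tags}
    y = [max(final, key=final.get)]          # best final tag
    for m in range(M - 1, 0, -1):            # follow back-pointers
        y.append(back[m][y[-1]])
    return list(reversed(y))
```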
Discriminative alternatives to HMMs
As with Naïve Bayes, one disadvantage of HMMs is that additional features (e.g., suffixes) cannot be added without violating the independence assumptions. Discriminative models do not have this problem, and additional features can be included:
\begin{align*}
f(&w = \text{the man who whistles tunes pianos},\; y = \text{DT NN WP VBZ VBZ NNS}) \\
&= f(w_1 = \text{the}, y_1 = \text{DT}) + f(y_1 = \text{DT}, y_0 = \Diamond) \\
&\quad + f(w_2 = \text{man}, y_2 = \text{NN}) + f(y_2 = \text{NN}, y_1 = \text{DT}) \\
&\quad + f(w_3 = \text{who}, y_3 = \text{WP}) + f(y_3 = \text{WP}, y_2 = \text{NN}) \\
&\quad + f(w_4 = \text{whistles}, y_4 = \text{VBZ}) + f(y_4 = \text{VBZ}, y_3 = \text{WP}) + f(\text{suffix}_4 = \text{es}) \\
&\quad + f(w_5 = \text{tunes}, y_5 = \text{VBZ}) + f(y_5 = \text{VBZ}, y_4 = \text{VBZ}) + f(\text{suffix}_5 = \text{es}) \\
&\quad + f(w_6 = \text{pianos}, y_6 = \text{NNS}) + f(y_6 = \text{NNS}, y_5 = \text{VBZ}) \\
&\quad + f(y_7 = \blacksquare, y_6 = \text{NNS})
\end{align*}
Note that the same morphological feature does not need to be added for every word token; here the suffix feature fires only where it applies.
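As a sketch of how such features might be computed, the function below produces word/tag, tag/tag, and suffix features at one position. The feature names and the choice to conjoin the suffix with the tag (so it can actually influence the tag decision in a linear model) are illustrative assumptions, not the slides' exact definition.

```python
def local_features(words, y_m, y_prev, m):
    """Local features at position m (1-indexed); the stop position M+1 emits no word.
    Keys are illustrative: ("word", w, tag), ("trans", prev, tag), ("suffix", "es", tag)."""
    feats = {("trans", y_prev, y_m): 1}
    if m <= len(words):
        w = words[m - 1]
        feats[("word", w, y_m)] = 1
        if w.endswith("es"):                 # a simple morphological feature
            feats[("suffix", "es", y_m)] = 1
    return feats

print(local_features("the man who whistles tunes pianos".split(), "VBZ", "WP", 4))
```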
Structured Perceptron for sequence labeling
Structured Perceptron
- Each tagging sequence is assigned a score by a linear model:
  \begin{align*}
  \Psi(w, y) &= \sum_{m=1}^{M+1} \psi(w, y_m, y_{m-1}, m) \\
  &= \sum_{m=1}^{M+1} \theta \cdot f(w, y_m, y_{m-1}, m) \\
  &= \theta \cdot \sum_{m=1}^{M+1} f(w, y_m, y_{m-1}, m) \\
  &= \theta \cdot f^{(\text{global})}(w, y_{1:M})
  \end{align*}
  where $y_{M+1} = \blacksquare$ and $y_0 = \Diamond$ by construction.
- The best tagging sequence can be found efficiently with the Viterbi algorithm.
- As a discriminative model, the perceptron can handle an arbitrary number of features at each position.
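Putting the pieces together, the global feature vector is a sum of local feature vectors, and the score $\Psi(w, y)$ is its dot product with $\theta$. A minimal sketch, assuming simple word/tag and tag/tag local features and sparse dictionaries for both features and weights:

```python
from collections import Counter

def local_features(words, y_m, y_prev, m):
    """Word/tag and tag/tag indicator features at position m (1-indexed)."""
    feats = Counter({("trans", y_prev, y_m): 1})
    if m <= len(words):
        feats[("word", words[m - 1], y_m)] = 1
    return feats

def global_features(words, tags):
    """f^(global)(w, y): sum over positions 1..M+1, with y_0 = "<s>" and y_{M+1} = "</s>"."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    f = Counter()
    for m in range(1, len(padded)):
        f.update(local_features(words, padded[m], padded[m - 1], m))
    return f

def score(theta, words, tags):
    """Psi(w, y) = theta . f^(global)(w, y), with theta a sparse weight dict."""
    return sum(theta.get(k, 0.0) * v for k, v in global_features(words, tags).items())
```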
Parameter estimation for structured perceptron
- In the training phase, the sentences in the training set are decoded one at a time (with the Viterbi algorithm):
  $$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}(w)} \theta \cdot f(w, y)$$
- If the highest-scoring tag sequence is not the same as the correct tag sequence, the parameters are updated:
  $$\theta^{(t+1)} \leftarrow \theta^{(t)} + f(w, y) - f(w, \hat{y})$$
The averaged perceptron algorithm
procedure AveragedPerceptron($w^{(1:N)}$, $y^{(1:N)}$)
    $t \leftarrow 0$, $\theta^{(0)} \leftarrow 0$, $m \leftarrow 0$
    repeat
        $t \leftarrow t + 1$
        select an instance $i$    ▷ online training
        $\hat{y} \leftarrow \operatorname*{argmax}_{y} \theta^{(t-1)} \cdot f(w^{(i)}, y)$    ▷ decoding by Viterbi
        if $\hat{y} \neq y^{(i)}$ then
            $\theta^{(t)} \leftarrow \theta^{(t-1)} + f^{(\text{global})}(w^{(i)}, y^{(i)}) - f^{(\text{global})}(w^{(i)}, \hat{y})$    ▷ feature counts for the entire sequence
        else
            $\theta^{(t)} \leftarrow \theta^{(t-1)}$
        $m \leftarrow m + \theta^{(t)}$
    until tired
    $\bar{\theta} \leftarrow \frac{1}{t} m$
    return $\bar{\theta}$
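In Python, the same procedure might look like the sketch below. The `decode` and `global_features` arguments are placeholders for a Viterbi decoder and a global feature function (such as the ones sketched earlier), and the epoch loop stands in for "until tired".

```python
from collections import Counter

def averaged_perceptron(train, decode, global_features, epochs=5):
    """Averaged structured perceptron (sketch).
    train: list of (words, gold_tags) pairs.
    decode(theta, words): returns the highest-scoring tag sequence under theta.
    global_features(words, tags): returns a sparse Counter of feature counts."""
    theta, total, t = Counter(), Counter(), 0
    for _ in range(epochs):                       # "until tired"
        for words, gold in train:
            t += 1
            pred = decode(theta, words)           # decoding by Viterbi
            if list(pred) != list(gold):
                theta.update(global_features(words, gold))     # + f(w, y)
                theta.subtract(global_features(words, pred))   # - f(w, y_hat)
            total.update(theta)                   # running sum for averaging
    return Counter({k: v / t for k, v in total.items()})       # averaged weights
```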
Parameter Update example
- Correct tag sequence:
  The/DT man/NN who/WP whistles/VBZ tunes/VBZ pianos/NNS
- Highest-scoring sequence under the current model:
  The/DT man/NN who/WP whistles/VBZ tunes/NNS pianos/NNS
- Which features need to be updated?
  Increment the features of the correct sequence:
  $\theta_{(\text{tunes}, \text{VBZ})} \leftarrow \theta_{(\text{tunes}, \text{VBZ})} + 1$, $\theta_{(\text{VBZ}, \text{VBZ})} \leftarrow \theta_{(\text{VBZ}, \text{VBZ})} + 1$, $\theta_{(\text{VBZ}, \text{NNS})} \leftarrow \theta_{(\text{VBZ}, \text{NNS})} + 1$
  Decrement the features of the predicted sequence:
  $\theta_{(\text{tunes}, \text{NNS})} \leftarrow \theta_{(\text{tunes}, \text{NNS})} - 1$, $\theta_{(\text{VBZ}, \text{NNS})} \leftarrow \theta_{(\text{VBZ}, \text{NNS})} - 1$, $\theta_{(\text{NNS}, \text{NNS})} \leftarrow \theta_{(\text{NNS}, \text{NNS})} - 1$
  (The increment and decrement to $\theta_{(\text{VBZ}, \text{NNS})}$ cancel, so its net change is zero.)
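The same update can be computed mechanically as the difference between the feature counts of the correct and predicted sequences. A small sketch, with illustrative feature names:

```python
from collections import Counter

def features(words, tags):
    """Emission (word, tag) and transition (prev_tag, tag) counts, with "<s>"/"</s>"
    as start/stop symbols; only for illustrating the perceptron update."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    f = Counter()
    for m in range(1, len(padded)):
        if m <= len(words):
            f[("emit", words[m - 1], padded[m])] += 1
        f[("trans", padded[m - 1], padded[m])] += 1
    return f

words = "The man who whistles tunes pianos".split()
gold = ["DT", "NN", "WP", "VBZ", "VBZ", "NNS"]
pred = ["DT", "NN", "WP", "VBZ", "NNS", "NNS"]

delta = features(words, gold)
delta.subtract(features(words, pred))
print({k: v for k, v in delta.items() if v != 0})
# Net update: +1 for (tunes, VBZ) and (VBZ -> VBZ); -1 for (tunes, NNS) and (NNS -> NNS);
# (VBZ -> NNS) appears on both sides and cancels.
```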
Structured Support Vector Machines
Structured Support Vector Machines
- Classification with SVMs enforces a large-margin constraint: the score of the correct label must exceed the score of every incorrect label by a margin of at least 1. Formally,
  $$\forall y \neq y^{(i)}, \quad \theta \cdot f(x, y^{(i)}) - \theta \cdot f(x, y) \geq 1$$
- This can be minimally extended to sequence labeling as
  $$\forall y \neq y^{(i)}, \quad \theta \cdot f(w, y^{(i)}) - \theta \cdot f(w, y) \geq 1$$
Extending SVMs to sequences
- A "structured" support vector machine (SVM) outputs a structured object such as a sequence.
- When extending SVMs to sequence labeling, two issues need to be addressed:
  - Some errors are more serious than others. Two labelings might have the same "margin" from the correct labeling but different numbers of incorrectly labeled tokens; we would prefer to train "against" the labeling with more errors.
  - Enforcing a fixed margin of 1 would seem to require enumerating all possible incorrect sequences, which is infeasible because the number of sequences is exponential in the length of the sequence.
- The solution requires an adjustment to how the margin is computed.
Extending SVMs to sequences
- Instead of using a fixed margin, we use a cost function that reflects how serious the errors are:
  $$\forall y \neq y^{(i)}, \quad \theta \cdot f(w^{(i)}, y^{(i)}) - \theta \cdot f(w^{(i)}, y) \geq c(y, y^{(i)})$$
- Next, instead of a delta function $c(y, y^{(i)}) = \delta(y \neq y^{(i)})$ that assigns a single error to the entire sequence, we use the Hamming cost, which counts the number of incorrectly labeled tokens in $y$:
  $$c(y, y^{(i)}) = \sum_{m=1}^{M+1} \delta(y_m \neq y_m^{(i)})$$
- Instead of enforcing the margin constraint against all possible labelings $y$, we focus on the prediction that maximally violates it. We can identify this prediction by solving:
  \begin{align*}
  \hat{y} &= \operatorname*{argmax}_{y \neq y^{(i)}} \theta \cdot f(w^{(i)}, y) - \theta \cdot f(w^{(i)}, y^{(i)}) + c(y, y^{(i)}) \\
  &= \operatorname*{argmax}_{y \neq y^{(i)}} \theta \cdot f(w^{(i)}, y) + c(y, y^{(i)})
  \end{align*}
Extending SVMs to sequence labeling
- Reformulating the margin constraint, we get:
  $$\theta \cdot f(w^{(i)}, y^{(i)}) - \max_{y \in \mathcal{Y}(w)} \left( \theta \cdot f(w^{(i)}, y) + c(y, y^{(i)}) \right) \geq 0$$
- This means that the constraint is met (and training is complete) when the score of the correct sequence, $\theta \cdot f(w^{(i)}, y^{(i)})$, is equal to or greater than the cost-augmented scores of all alternatives.
- In the training process, we identify predictions that are strong (score highly according to the model) and wrong (incur a large cost relative to the truth), and reduce the scores of these predictions by adjusting the weights.
- Note that the Hamming cost can be decomposed into local parts by adding a feature $f(y_m) = \delta(y_m \neq y_m^{(i)})$ at each position, so it can be incorporated into the Viterbi algorithm for the purpose of identifying the prediction to train against, as sketched below.
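Because the Hamming cost decomposes over positions, cost-augmented decoding only requires adding a per-position bonus to the local score before running Viterbi. A sketch, where `local_score` is a placeholder for $\theta \cdot f(w, y_m, y_{m-1}, m)$:

```python
def hamming_cost(y, y_gold):
    """c(y, y_gold): number of positions where the predicted tag is wrong."""
    return sum(1 for a, b in zip(y, y_gold) if a != b)

def cost_augmented(local_score, gold):
    """Wrap a local scoring function so that every position whose tag differs from
    the gold tag receives a +1 bonus. The wrapped function can be plugged into the
    same Viterbi recurrence to find the maximally violating prediction."""
    def scored(words, y_m, y_prev, m):
        bonus = 1.0 if m <= len(gold) and y_m != gold[m - 1] else 0.0
        return local_score(words, y_m, y_prev, m) + bonus
    return scored
```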
A comparison between Structured Perceptron and Structured SVM
- In the training process, the perceptron algorithm trains against the prediction with the highest score according to the model, while the structured SVM trains against the prediction that combines a high model score with a heavy cost (the maximal margin violator). Both can be identified with the Viterbi algorithm.
- For both models, no cost needs to (or can) be computed at decoding time, since the true tag sequence is unavailable.
- In practice, with a large training set, the perceptron algorithm works quite well, often obviating the need for the more complicated structured SVM.