Conditional Random Fields
▶ The Conditional Random Field (CRF) is a conditional probabilistic model for sequence labeling, based on logistic regression.
▶ The name is derived from Markov Random Fields, a class of models in which the probability of a configuration of variables is proportional to a product of scores across pairs of variables in a factor graph.
The probability model
▶ The basic probability model is

    P(y \mid w) = \frac{\exp \Psi(w, y)}{\sum_{y' \in \mathcal{Y}(w)} \exp \Psi(w, y')}

▶ This is almost identical to logistic regression, except that the label space \mathcal{Y}(w) is the set of possible tag sequences for w
▶ This requires efficient algorithms for finding the best tag sequence in decoding, and for summing over all tag sequences in training
▶ The usual locality assumption is placed on the scoring function:

    \Psi(w, y) = \sum_{m=1}^{M+1} \psi(w, y_m, y_{m-1}, m)

  where the local score \psi(w, y_m, y_{m-1}, m) is abbreviated s_m(y_m, y_{m-1}) below, and y_{M+1} is a special end-of-sequence tag.
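To make the model concrete, here is a brute-force sketch in Python (the tag set, boundary symbols, and toy local_score function are my own illustrations, not from the slides): it scores a tag sequence under the locality assumption and normalizes by enumerating every sequence in Y(w), which is only feasible for very short inputs.

    from itertools import product
    from math import exp

    TAGS = ["D", "N", "V"]            # toy tag set
    START, END = "<s>", "</s>"        # boundary tags y_0 and y_{M+1}

    def local_score(w, y_m, y_prev, m):
        # Hypothetical local score psi(w, y_m, y_{m-1}, m); a real CRF would
        # compute theta . f(w, y_m, y_{m-1}, m) here.
        score = 0.0
        if m <= len(w) and w[m - 1][0].isupper() and y_m == "N":
            score += 0.5              # capitalized words tend to be nouns
        if y_prev == "D" and y_m == "N":
            score += 1.0              # determiners are often followed by nouns
        return score

    def global_score(w, y):
        # Psi(w, y) = sum_{m=1}^{M+1} psi(w, y_m, y_{m-1}, m)
        padded = [START] + list(y) + [END]
        return sum(local_score(w, padded[m], padded[m - 1], m)
                   for m in range(1, len(padded)))

    def prob(w, y):
        # P(y | w) = exp Psi(w, y) / sum_{y'} exp Psi(w, y')
        Z = sum(exp(global_score(w, y2)) for y2 in product(TAGS, repeat=len(w)))
        return exp(global_score(w, y)) / Z

    print(prob(["the", "Man"], ("D", "N")))   # some probability in (0, 1)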
Decoding in CRFs
▶ The Viterbi algorithm can be used to find the best tag sequence, just as it can be used for decoding HMM and perceptron models
▶ The decoding algorithm is identical to that of the structured perceptron, because the denominator is the same for all tag sequences and can thus be ignored:
    \hat{y} = \arg\max_{y} \log P(y \mid w)
            = \arg\max_{y} \Psi(y, w) - \log \sum_{y' \in \mathcal{Y}(w)} \exp \Psi(y', w)
            = \arg\max_{y} \Psi(y, w)
            = \arg\max_{y} \sum_{m=1}^{M+1} s_m(y_m, y_{m-1})
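A sketch of Viterbi decoding under this locality assumption (Python; the nested-dict layout for the precomputed local scores s_m(y_m, y_{m-1}) is my own choice, not something fixed by the slides):

    def viterbi(scores, tags, start="<s>", end="</s>"):
        """Decode argmax_y sum_m s_m(y_m, y_{m-1}).

        scores[m][(y_prev, y)] = s_m(y, y_prev) for m = 1..M+1, where
        position M+1 scores the transition into the end tag.
        """
        M = max(scores) - 1
        best = {start: 0.0}          # best score of a prefix ending in each tag
        back = []                    # backpointers, one dict per position 1..M
        for m in range(1, M + 1):
            new_best, new_back = {}, {}
            for y in tags:
                cand = {y_prev: best[y_prev] + scores[m][(y_prev, y)]
                        for y_prev in best}
                y_star = max(cand, key=cand.get)
                new_best[y], new_back[y] = cand[y_star], y_star
            best = new_best
            back.append(new_back)
        # transition into the end tag at position M+1
        final = {y: best[y] + scores[M + 1][(y, end)] for y in tags}
        y_last = max(final, key=final.get)
        path = [y_last]
        for pointers in reversed(back):
            path.append(pointers[path[-1]])
        path.reverse()
        return path[1:], final[y_last]   # drop the start tag

    tags = ["D", "N"]
    scores = {1: {("<s>", "D"): 1.0, ("<s>", "N"): 0.2},
              2: {("D", "D"): 0.1, ("D", "N"): 2.0, ("N", "D"): 0.5, ("N", "N"): 0.3},
              3: {("D", "</s>"): 0.0, ("N", "</s>"): 0.5}}
    print(viterbi(scores, tags))     # -> (['D', 'N'], 3.5)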
Learning in CRFs
▶ As with logistic regression, the weights can be learned by minimizing the regularized negative log-probability loss:

    L = \frac{\lambda}{2}\lVert\theta\rVert^2 - \sum_{i=1}^{N} \log P(y^{(i)} \mid w^{(i)}; \theta)
      = \frac{\lambda}{2}\lVert\theta\rVert^2 + \sum_{i=1}^{N} \Big( -\theta \cdot f(w^{(i)}, y^{(i)}) + \log \sum_{y' \in \mathcal{Y}(w^{(i)})} \exp\big(\theta \cdot f(w^{(i)}, y')\big) \Big)
▶ The second term requires computing the sum over all possible labelings. There are |\mathcal{Y}|^M possible labelings for an input of length M, so an efficient algorithm is required to compute this sum.
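On a toy example the loss can be evaluated by brute force; the sketch below (Python, with an invented feature function, weights, and data) enumerates Y(w^(i)) explicitly for the log term, which is exactly the part an efficient algorithm has to replace.

    from itertools import product
    from math import exp, log

    TAGS = ["D", "N", "V"]

    def features(w, y):
        # Global feature vector f(w, y): tag-bigram counts plus (word, tag)
        # counts, for illustration only.
        counts = {}
        padded = ["<s>"] + list(y) + ["</s>"]
        for m in range(1, len(padded)):
            keys = [("bigram", padded[m - 1], padded[m])]
            if m <= len(w):
                keys.append(("word", w[m - 1], padded[m]))
            for key in keys:
                counts[key] = counts.get(key, 0.0) + 1.0
        return counts

    def score(theta, w, y):
        return sum(theta.get(k, 0.0) * v for k, v in features(w, y).items())

    def loss(theta, data, lam=0.1):
        # L = lam/2 * ||theta||^2 - sum_i log P(y_i | w_i; theta)
        L = 0.5 * lam * sum(v * v for v in theta.values())
        for w, y in data:
            log_Z = log(sum(exp(score(theta, w, y2))
                            for y2 in product(TAGS, repeat=len(w))))
            L += log_Z - score(theta, w, y)
        return L

    data = [(["the", "man"], ("D", "N"))]
    theta = {("bigram", "D", "N"): 1.0, ("word", "the", "D"): 0.5}
    print(loss(theta, data))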
Computing the gradients of the loss function
▶ As in logistic regression, the gradient of the negative log-likelihood with respect to the parameters is the difference between observed and expected feature counts:
    \frac{\partial L}{\partial \theta_j} = \lambda\theta_j + \sum_{i=1}^{N} \Big( E[f_j(w^{(i)}, y)] - f_j(w^{(i)}, y^{(i)}) \Big)
▶ Computing this gradient involves computing the posterior probability that a sequence includes the transition Y_{m-1} → Y_m
▶ Recall that the feature function for bigram tag sequences is of the form f(Y_{m-1}, Y_m, w, m). To compute the expected count of a feature, we need P(Y_{m-1} = k', Y_m = k | w).
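A sketch of this gradient for a single example (Python; the feature templates are invented, and the expectation is taken by brute-force enumeration here, standing in for the forward-backward computation derived on the following slides):

    from itertools import product
    from math import exp

    TAGS = ["D", "N"]

    def global_features(w, y):
        # f(w, y): tag-bigram counts and (word, tag) counts, for illustration.
        padded = ["<s>"] + list(y) + ["</s>"]
        counts = {}
        for m in range(1, len(padded)):
            keys = [("bigram", padded[m - 1], padded[m])]
            if m <= len(w):
                keys.append(("word", w[m - 1], padded[m]))
            for key in keys:
                counts[key] = counts.get(key, 0.0) + 1.0
        return counts

    def gradient(theta, w, y_gold, lam=0.1):
        # dL/dtheta_j = lam * theta_j + E[f_j(w, y)] - f_j(w, y_gold)
        def score(y):
            return sum(theta.get(k, 0.0) * v
                       for k, v in global_features(w, y).items())
        labelings = list(product(TAGS, repeat=len(w)))
        weights = [exp(score(y)) for y in labelings]
        Z = sum(weights)                       # partition function, by enumeration
        grad = {k: lam * v for k, v in theta.items()}
        for y, wt in zip(labelings, weights):  # expected counts
            for k, v in global_features(w, y).items():
                grad[k] = grad.get(k, 0.0) + (wt / Z) * v
        for k, v in global_features(w, y_gold).items():   # observed counts
            grad[k] = grad.get(k, 0.0) - v
        return grad

    print(gradient({("bigram", "D", "N"): 1.0}, ["the", "man"], ("D", "N")))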
Marginal probabilities over tag bigrams
    \Pr(Y_{m-1} = k', Y_m = k \mid w) = \frac{\sum_{y: Y_m = k,\, Y_{m-1} = k'} \prod_{n=1}^{M+1} \exp s_n(y_n, y_{n-1})}{\sum_{y'} \prod_{n=1}^{M+1} \exp s_n(y'_n, y'_{n-1})}

How do we compute the numerator and the denominator?
▶ The naive way to compute this probability is to enumerate all possible labelings of the sequence. Since there are |\mathcal{Y}|^M possible labelings, this is prohibitively expensive for a typical tag set and sentence length.
▶ So we need to find a more efficient way of doing this.
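For reference, the naive computation looks like the sketch below (Python; the exp_scores[n][(y_prev, y)] = exp s_n(y, y_prev) layout is my own convention). It loops over all |Y|^M labelings, which is exactly what the forward-backward decomposition avoids.

    from itertools import product

    def bigram_marginal(exp_scores, tags, m, k_prev, k, start="<s>", end="</s>"):
        """Pr(Y_{m-1} = k_prev, Y_m = k | w) by exhaustive enumeration.

        exp_scores[n][(y_prev, y)] = exp s_n(y, y_prev), for n = 1..M+1.
        """
        M = max(exp_scores) - 1
        numerator = denominator = 0.0
        for y in product(tags, repeat=M):
            padded = (start,) + y + (end,)
            weight = 1.0
            for n in range(1, M + 2):
                weight *= exp_scores[n][(padded[n - 1], padded[n])]
            denominator += weight
            if padded[m - 1] == k_prev and padded[m] == k:
                numerator += weight
        return numerator / denominator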
Computing the numerator
The numerator sums over all tag sequences that include the transition (Y_{m-1} = k') → (Y_m = k). This sum can be decomposed into three parts: the prefix y_{1:m-1}, terminating in Y_{m-1} = k'; the transition (Y_{m-1} = k') → (Y_m = k); and the suffix y_{m:M}, beginning with the tag Y_m = k.
    \sum_{y: Y_m = k,\, Y_{m-1} = k'} \prod_{n=1}^{M+1} \exp s_n(y_n, y_{n-1})
      = \Big( \sum_{y_{1:m-1}: Y_{m-1} = k'} \prod_{n=1}^{m-1} \exp s_n(y_n, y_{n-1}) \Big)
        \times \exp s_m(k, k')
        \times \Big( \sum_{y_{m:M}: Y_m = k} \prod_{n=m+1}^{M+1} \exp s_n(y_n, y_{n-1}) \Big)
Trellis
We can illustrate the numerator graphically with a trellis:
[Trellis over the sentence "the man tunes pianos", with states D, N, V at each position plus start and end states; the transition score exp s_3(N, N) at "tunes" is highlighted together with the forward term α_2(N) and the backward term β_3(N).]
We compute the numerator by treating the first term as a forward variable α_{m-1}(k') and the third term as a backward variable β_m(k).
Defining a forward variable that can be cached
▶ A forward variable α_m(y_m) is equal to the sum of scores of all paths leading to the tag y_m at position m:

    \alpha_m(y_m) \triangleq \sum_{y_{1:m-1}} \prod_{n=1}^{m} \exp s_n(y_n, y_{n-1})

▶ The forward recurrence is an instance of the sum-product algorithm, and the forward variables can be computed recursively
The forward recurrence
▶ Computing the forward variable at position m:

    \alpha_m(y_m) = \sum_{y_{1:m-1}} \prod_{n=1}^{m} \exp s_n(y_n, y_{n-1})
                  = \sum_{y_{m-1}} \big( \exp s_m(y_m, y_{m-1}) \big) \sum_{y_{1:m-2}} \prod_{n=1}^{m-1} \exp s_n(y_n, y_{n-1})
                  = \sum_{y_{m-1}} \big( \exp s_m(y_m, y_{m-1}) \big) \times \alpha_{m-1}(y_{m-1})
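The recurrence translates directly into code. A sketch (same assumed exp_scores layout as the earlier enumeration snippet, i.e. exp_scores[m][(y_prev, y)] = exp s_m(y, y_prev); in practice this would be done in log space to avoid underflow):

    def forward(exp_scores, tags, start="<s>", end="</s>"):
        """alpha[m][y]: summed exp-scores of all prefixes ending in tag y at position m."""
        M = max(exp_scores) - 1
        alpha = {0: {start: 1.0}}
        for m in range(1, M + 1):
            alpha[m] = {y: sum(exp_scores[m][(y_prev, y)] * alpha[m - 1][y_prev]
                               for y_prev in alpha[m - 1])
                        for y in tags}
        # position M+1 carries only the end tag
        alpha[M + 1] = {end: sum(exp_scores[M + 1][(y_prev, end)] * alpha[M][y_prev]
                                 for y_prev in tags)}
        return alpha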
The backward recurrence
    \beta_m(k) \triangleq \sum_{y_{m:M}: Y_m = k} \prod_{n=m+1}^{M+1} \exp s_n(y_n, y_{n-1})
               = \sum_{k' \in \mathcal{Y}} \exp s_{m+1}(k', k) \sum_{y_{m+1:M}: Y_{m+1} = k'} \prod_{n=m+2}^{M+1} \exp s_n(y_n, y_{n-1})
               = \sum_{k' \in \mathcal{Y}} \exp s_{m+1}(k', k) \times \beta_{m+1}(k')

where k' is the label at position m + 1.
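And the matching backward sketch (same assumptions as the forward snippet above):

    def backward(exp_scores, tags, start="<s>", end="</s>"):
        """beta[m][y]: summed exp-scores of all suffixes following tag y at position m."""
        M = max(exp_scores) - 1
        beta = {M + 1: {end: 1.0},
                M: {y: exp_scores[M + 1][(y, end)] for y in tags}}
        for m in range(M - 1, -1, -1):
            labels = tags if m > 0 else [start]
            beta[m] = {y: sum(exp_scores[m + 1][(y, y_next)] * beta[m + 1][y_next]
                              for y_next in tags)
                       for y in labels}
        return beta

Together with the forward variables, the tag-bigram marginal is then alpha[m-1][k'] * exp_scores[m][(k', k)] * beta[m][k] / Z(w).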
Computing the denominator
The denominator, the score of all possible labelings of the entire sequence, can be computed via either the forward recurrence or the backward recurrence (or by combining the two at any position m):
▶ The score of all possible labelings of the entire sequence is the value of the forward variable at position M + 1 in the end state \blacklozenge:

    \sum_{y \in \mathcal{Y}(w)} \exp \Psi(w, y) = \alpha_{M+1}(\blacklozenge)
      = \sum_{y_{1:M}} \big( \exp s_{M+1}(\blacklozenge, y_M) \big) \prod_{m=1}^{M} \exp s_m(y_m, y_{m-1})
▶ It can also be computed via the backward recurrence, as the value of the backward variable at position 0 in the start state:

    \sum_{y \in \mathcal{Y}(w)} \exp \Psi(w, y) = \beta_0(\lozenge)
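In matrix form the identity α_{M+1}(end) = β_0(start) is easy to check numerically. A sketch with numpy, using random positive matrices in place of the exp s_m scores:

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 3, 4                               # K tags, M tokens
    M_first = rng.random(K)                   # exp s_1(k, start), a K-vector
    M_mid = [rng.random((K, K)) for _ in range(M - 1)]   # M_m[k_prev, k] = exp s_m(k, k_prev)
    M_last = rng.random(K)                    # exp s_{M+1}(end, k), a K-vector

    # forward pass: alpha_{M+1}(end)
    alpha = M_first.copy()
    for Mm in M_mid:
        alpha = alpha @ Mm
    Z_forward = alpha @ M_last

    # backward pass: beta_0(start)
    beta = M_last.copy()
    for Mm in reversed(M_mid):
        beta = Mm @ beta
    Z_backward = M_first @ beta

    assert np.isclose(Z_forward, Z_backward)
    print(Z_forward)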
Features and their weights
    f_k    y_{m-1}   y_m   w_{-2}   w_{-1}   w_0   w_1   w_2   θ_k
    f1     D         D     –        –        –     –     –     -0.5
    f2     N         D     –        –        –     –     –     -0.8
    f3     V         D     –        –        –     –     –      1.2
    f4     D         N     –        –        –     –     –      2.5
    f5     N         N     –        –        –     –     –      0.5
    f6     V         N     –        –        –     –     –      3
    f7     D         V     –        –        –     –     –      0.4
    f8     N         V     –        –        –     –     –      4
    f9     V         V     –        –        –     –     –      0.6
    f10    –         D     –        –        man   –     –     -0.5
    f11    –         N     –        –        man   –     –      2
    f12    –         V     –        –        man   –     –      1
    f13    –         D     –        the      –     –     –     -4
    f14    –         N     –        the      –     –     –      5
    f15    –         V     –        the      –     –     –     -2
Computing local transition matrices
Recall that the local score between two adjacent positions in a tag sequence is exp s_m(y_m, y_{m-1}). The transition scores from each tag y_{m-1} to each tag y_m can be arranged in a K × K matrix, where K is the number of tags. Suppose we are computing the scores at the position of "man", based on the feature weights on the previous slide:
    \exp(f_1\theta_1 + f_{10}\theta_{10} + f_{13}\theta_{13}) = \exp(-0.5 - 0.5 - 4) \approx 0.007

              D      N   V
    D   [ 0.007      ?   ? ]
    N   [     ?      ?   ? ]
    V   [     ?      ?   ? ]
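A sketch of this computation (numpy; only the weights needed for the "man" position are transcribed from the table above, and the function name is my own). The D → D cell reproduces exp(-0.5 - 0.5 - 4) ≈ 0.007.

    import numpy as np

    TAGS = ["D", "N", "V"]

    # Weights transcribed from the feature table: tag-bigram features, plus
    # current-word (w_0) and previous-word (w_{-1}) features.
    trans = {("D", "D"): -0.5, ("N", "D"): -0.8, ("V", "D"): 1.2,
             ("D", "N"): 2.5, ("N", "N"): 0.5, ("V", "N"): 3.0,
             ("D", "V"): 0.4, ("N", "V"): 4.0, ("V", "V"): 0.6}
    w_cur = {("man", "D"): -0.5, ("man", "N"): 2.0, ("man", "V"): 1.0}
    w_prev = {("the", "D"): -4.0, ("the", "N"): 5.0, ("the", "V"): -2.0}

    def transition_matrix(prev_word, word):
        """M[k_prev, k] = exp s_m(k, k_prev) at the position of `word`."""
        M = np.zeros((len(TAGS), len(TAGS)))
        for i, k_prev in enumerate(TAGS):
            for j, k in enumerate(TAGS):
                s = trans[(k_prev, k)] + w_cur[(word, k)] + w_prev[(prev_word, k)]
                M[i, j] = np.exp(s)
        return M

    print(transition_matrix("the", "man")[0, 0])   # ~0.0067, the D -> D cell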
Forward computation in CRF
[Figure: forward computation over the sentence "the man tunes pianos". Below each position is its local transition matrix M_m(k', k) over the tags D, N, V (M_1 through M_5, with M_5 scoring the transition into the end state). The forward variables are built up left to right: α_0(D) = α_0(N) = α_0(V) = 1; after "the", α_1(D) = 1.2, α_1(N) = 2, α_1(V) = 0.5; after "man", α_2(D) = 1.84; and so on through α_3 and α_4, terminating in the end state.]
Representing features as matrices
Some features span multiple state (tag) transitions:

    f_1(k', k, w, m) = 1 iff w_m = run and k = V
    f_2(k', k, w, m) = 1 iff w_m = run and k = N

If represented as matrices, these features have multiple cells with a value of 1:

    f_1(k', k, w, m) =            f_2(k', k, w, m) =
           D  N  V                       D  N  V
      D  [ 0  0  1 ]                D  [ 0  1  0 ]
      N  [ 0  0  1 ]                N  [ 0  1  0 ]
      V  [ 0  0  1 ]                V  [ 0  1  0 ]
Feature expectations
▶ Let f be any local feature, with f_m[k', k] = f(k', k, w, m). The count of this feature in a particular sequence is

    F(w, y) = \sum_{m=1}^{M+1} f(y_{m-1}, y_m, w, m)

▶ The expectation of a single feature over a sequence is:

    E[F(w, y)] = \sum_m \sum_{k', k} \Pr(Y_{m-1} = k', Y_m = k \mid w)\, f(k', k, w, m)
               = \sum_m \frac{\alpha_{m-1} (f_m \odot M_m) \beta_m^{\top}}{Z(w)}

  where Z(w) is the total score of all labelings, also known as the partition function; M_m is the matrix of local transition scores M_m[k', k] = exp s_m(k, k'); ⊙ denotes elementwise multiplication; and α_{m-1}, β_m are treated as row vectors.
▶ Given the feature expectations, we can now compute the gradient with respect to each feature weight.
Example computation of feature expectations
▶ Assume the transition matrix at position m:

    M_m[k', k] =      D     N     V
             D  [   2.2   0.1   0.4  ]
             N  [   1.2   3.4   0.3  ]
             V  [   6.1   7.2   0.01 ]

▶ Assume the following vectors:

    α_{m-1}(k') = [ 40  30  65 ]
    β_m(k)      = [ 45  65  30 ]
Computing Feature Expectations
    \alpha_{m-1}(f_2 \odot M_m)\beta_m^{\top}
      = \begin{bmatrix} 40 & 30 & 65 \end{bmatrix}
        \begin{bmatrix} 0 & 0.1 & 0 \\ 0 & 3.4 & 0 \\ 0 & 7.2 & 0 \end{bmatrix}
        \begin{bmatrix} 45 \\ 65 \\ 30 \end{bmatrix}
      = \begin{bmatrix} 0 & 574 & 0 \end{bmatrix}
        \begin{bmatrix} 45 \\ 65 \\ 30 \end{bmatrix}
      = 37310.0

    \alpha_{m-1}(f_1 \odot M_m)\beta_m^{\top}
      = \begin{bmatrix} 40 & 30 & 65 \end{bmatrix}
        \begin{bmatrix} 0 & 0 & 0.4 \\ 0 & 0 & 0.3 \\ 0 & 0 & 0.01 \end{bmatrix}
        \begin{bmatrix} 45 \\ 65 \\ 30 \end{bmatrix}
      = \, ?

(The first expectation uses f_2, which masks out everything except the column for k = N; the second uses f_1, which keeps only the column for k = V.)
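The first product can be checked directly with numpy (a quick verification of the worked example; the second is left as an exercise, as in the slides):

    import numpy as np

    M_m = np.array([[2.2, 0.1, 0.4],
                    [1.2, 3.4, 0.3],
                    [6.1, 7.2, 0.01]])          # M_m[k_prev, k] over tags (D, N, V)
    alpha_prev = np.array([40.0, 30.0, 65.0])   # alpha_{m-1}(k_prev)
    beta_m = np.array([45.0, 65.0, 30.0])       # beta_m(k)

    f2 = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]])                  # f_2 fires whenever k = N (and w_m = run)

    print(alpha_prev @ (f2 * M_m) @ beta_m)     # 37310.0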
CRFs
▶ Avoids the strong independence assumptions of HMMs
▶ Allows arbitrary features over observations (but not over hidden states)
▶ Sound probabilistic model: gives us a distribution over sequence labelings
▶ Uses the same efficient algorithm (the Viterbi algorithm) for decoding as the structured perceptron and HMMs
▶ Estimation/learning is harder (than, say, the perceptron) because we have to compute posteriors over tag sequences
▶ Empirical results generally show that CRFs outperform HMMs (and other classifiers)
▶ Feature extraction can be replaced with LSTM RNNs, resulting in what are called RNN-CRFs, of which LSTM-CRFs are the most widely used