Supervised learning: a summary
I A supervised learning paradigm assumes that correct labels are available: single labels, sequences of labels, or structures such as trees and graphs.
I Having correct labels allows us to compare the predictions of a model against them and compute the model's loss.
I During training, we initialize the parameters of a model and use these initial parameters to make predictions. We measure how wrong the parameters are by computing the loss, and then update them to reduce this loss.
I When the training is done, we make predictions by searching for the label with the highest score.
I The key to supervised learning is to have annotated data with correct labels. Is there anything we can do without annotated data?
Beyond Supervised Learning
There are other learning scenarios where labeled training sets are available to varying degrees, or not available at all.
I When there is no labeled data at all, we’ll have to do unsupervised learning (e.g., K-Means clustering, variants of the EM algorithm)
I When there is a small amount of labeled data, we might want to try semi-supervised learning
I When there is a lot of labeled data in one domain but only a small amount of labeled data in the target domain, we might try domain adaptation
K-means clustering algorithm
procedure K-MEANS(x_{1:N}, K)
    for i in 1 ... N do
        z^(i) ← RandomInt(1, K)                          ▷ initialize cluster memberships
    repeat
        for k in 1 ... K do                              ▷ recompute cluster centers
            ν_k ← Σ_{i=1}^{N} δ(z^(i) = k) x^(i) / Σ_{i=1}^{N} δ(z^(i) = k)
        for i in 1 ... N do                              ▷ reassign each instance to the nearest center
            z^(i) ← argmin_k ||x^(i) − ν_k||^2
    until converged
    return z^(1:N)
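A minimal NumPy sketch of the pseudocode above (the empty-cluster re-seeding and the convergence test on unchanged assignments are my own choices, not part of the pseudocode):

```python
import numpy as np

def k_means(x, K, max_iters=100, seed=0):
    """Hard K-means clustering, following the pseudocode above."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    z = rng.integers(0, K, size=N)               # random initial cluster memberships
    for _ in range(max_iters):
        # Recompute each cluster center as the mean of its assigned points.
        centers = np.stack([
            x[z == k].mean(axis=0) if np.any(z == k) else x[rng.integers(0, N)]
            for k in range(K)
        ])
        # Reassign every point to the nearest center (squared Euclidean distance).
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        new_z = dists.argmin(axis=1)
        if np.array_equal(new_z, z):             # converged: assignments stopped changing
            break
        z = new_z
    return z
```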
K-Means training
[Figure: scatter plots on a 0-10 by 0-10 grid showing K-means cluster assignments and centers over successive iterations]
I K-means clustering is non-parametric and has no parameters to update
I As a result, there is also no separate training and test phase
I The number of clusters needs to be pre-specified before the clustering process starts
Semi-supervised learning
I Initialize parameters with supervised learning and then apply unsupervised learning (such as the EM algorithm)
I Multi-view learning: Co-training
  I Divide features into multiple views, and train a classifier for each view
  I Each classifier predicts labels for a subset of the unlabeled instances, using only the features available in its view. These predictions are then used as ground truth to train the classifiers associated with the other views (see the sketch below)
  I Named entity example: named entity view and local context view
  I Word sense disambiguation: local context view and global context view
Multi-view Learning: co-training
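A schematic sketch of the co-training loop described above (the function name, the confidence threshold, and the use of scikit-learn-style predict_proba/classes_ are my own illustrative choices):

```python
import numpy as np

def co_train(clf_a, clf_b, Xa_lab, Xb_lab, y_lab, Xa_unl, Xb_unl,
             rounds=10, threshold=0.95):
    """Co-training sketch with two feature views.

    Each round, every classifier labels the unlabeled pool using only its own
    view; its confident predictions become extra training data for the other view.
    """
    # Per-view training sets, both initialized with the same labeled data.
    Xa_tr, ya_tr = Xa_lab, y_lab
    Xb_tr, yb_tr = Xb_lab, y_lab
    unl = np.arange(len(Xa_unl))                    # indices still unlabeled

    for _ in range(rounds):
        clf_a.fit(Xa_tr, ya_tr)
        clf_b.fit(Xb_tr, yb_tr)
        if len(unl) == 0:
            break
        proba_a = clf_a.predict_proba(Xa_unl[unl])  # view-A confidence
        proba_b = clf_b.predict_proba(Xb_unl[unl])  # view-B confidence
        pick_a = proba_a.max(axis=1) >= threshold   # A is confident: teach B
        pick_b = proba_b.max(axis=1) >= threshold   # B is confident: teach A
        if not (pick_a.any() or pick_b.any()):
            break
        Xb_tr = np.concatenate([Xb_tr, Xb_unl[unl][pick_a]])
        yb_tr = np.concatenate([yb_tr, clf_a.classes_[proba_a.argmax(axis=1)][pick_a]])
        Xa_tr = np.concatenate([Xa_tr, Xa_unl[unl][pick_b]])
        ya_tr = np.concatenate([ya_tr, clf_b.classes_[proba_b.argmax(axis=1)][pick_b]])
        unl = unl[~(pick_a | pick_b)]
    return clf_a, clf_b
```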
Domain adaptation
Supervised domain adaptation: “Frustratingly easy” domain adaptation (Daumé III, 2007)
I Creates copies of each feature: one for each domain and one for the cross-domain setting
f(x, y, d) = {(boring, NEG, Movie): 1, (boring, NEG, *): 1,
              (flavorless, NEG, Movie): 1, (flavorless, NEG, *): 1,
              (three-day-old, NEG, Movie): 1, (three-day-old, NEG, *): 1, ...}

where d is the domain.
I Let the learning algorithm allocate weights between domain-specific features and cross-domain features: for words that facilitate prediction in both domains, the learner will rely on the cross-domain features; for words that are only relevant to a particular domain, the domain-specific features will be used (see the sketch below).
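A minimal sketch of this feature-copying idea, following the dict-of-features example above (the function name and representation are mine, not from the paper):

```python
def augment_features(base_features, domain):
    """Feature augmentation in the style of Daumé III (2007).

    Each original feature fires twice: once paired with its domain and once
    with the cross-domain marker '*'. The learner can then put weight on the
    shared copy for features that behave the same way in every domain.
    """
    augmented = {}
    for (word, label), value in base_features.items():
        augmented[(word, label, domain)] = value   # domain-specific copy
        augmented[(word, label, "*")] = value      # cross-domain copy
    return augmented

# The example from the slide: negative features from a movie review.
f = augment_features({("boring", "NEG"): 1,
                      ("flavorless", "NEG"): 1,
                      ("three-day-old", "NEG"): 1},
                     domain="Movie")
```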
Other learning paradigms
I Active learning: a learning paradigm that is often used to reduce the number of instances that have to be annotated while still producing the same level of accuracy
I Distant supervision: there is no labeled data, but you can generate some (potentially noisy) training data with an external resource such as a dictionary. For example, you can generate named entity annotations with a list of names (see the sketch after this list).
I Multitask learning: learning induces a representation that can be used to solve multiple tasks (e.g., learning POS tagging together with syntactic parsing)
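As promised above, a tiny sketch of the distant-supervision idea for named entities (the gazetteer, its contents, and the PER/O tag set are illustrative assumptions of mine):

```python
def distant_ne_labels(tokens, name_list):
    """Distant supervision sketch: project a name list (gazetteer) onto raw
    text to create noisy named-entity labels that can be used for training."""
    return ["PER" if tok in name_list else "O" for tok in tokens]

print(distant_ne_labels("John likes apples".split(), {"John", "Mary"}))
# ['PER', 'O', 'O']
```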
Expectation Maximization
I An unsupervised iterative learning procedure that has two steps: the Expectation Step and the Maximization Step
I Has many applications in NLP: POS tagging, syntactic parsing, word alignment, clustering
I The most effective use is in word alignment, which is the first step in statistical machine translation
I Efficient incarnations with dynamic programming:
  I The Forward-Backward algorithm (for tagging)
  I The Inside-Outside algorithm (for parsing)
I The main workhorse for unsupervised learning in NLP
Expectation Maximization for Sequence Labeling
I The naïve EM algorithm for sequence labeling
I A more efficient alternative: the Forward-Backward algorithm
Hidden Markov assumptions
I Recall the generative model for HMM: P(x,y) = P(x|y)P(y)
I Apply the conditional independence assumption that a word only depends on its corresponding tag:

      P(x, y) = [ ∏_{m=1}^{M} P(x_m | y_m) ] P(y)

I Apply the assumption that a tag only depends on its previous tag:

      P(x, y) = ∏_{m=1}^{M} P(x_m | y_m) P(y_m | y_{m-1})
Parameter estimation
I In a supervised setting, the probabilities of the HMM parameters can be computed via a simple count.
I Given two labeled examples:
  I John/N likes/V apples/N
  I John/N likes/V oranges/N
I Counts for the parameters given our usual HMM assumptions:
                 counts
                 N    V    end
   w = John      2    0    0
   w = likes     0    2    0
   w = apples    1    0    0
   w = oranges   1    0    0
   t-1 = start   2    0    0
   t-1 = N       0    2    2
   t-1 = V       2    0    0
Parameter estimation
I Given these counts, we can compute the (unsmoothed) values of the parameters (a code sketch follows below):

                 counts                probabilities
                 N    V    end         N      V     end
   w = John      2    0    0           0.5    0     0
   w = likes     0    2    0           0      1     0
   w = apples    1    0    0           0.25   0     0
   w = oranges   1    0    0           0.25   0     0
   t-1 = start   2    0    0           1      0     0
   t-1 = N       0    2    2           0      0.5   0.5
   t-1 = V       2    0    0           1      0     0

I We implicitly assume the probabilities of all other labelings are zero:

   P(John V, likes N, apples V) = 0
   P(John V, likes N, apples N) = 0
   ...
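A short sketch reproducing the counting above on the two labeled examples ("start"/"end" stand for the boundary symbols in the table; the dictionary names are mine):

```python
from collections import Counter

# The two labeled examples.
corpus = [
    [("John", "N"), ("likes", "V"), ("apples", "N")],
    [("John", "N"), ("likes", "V"), ("oranges", "N")],
]

emission, transition, tag_count, prev_count = Counter(), Counter(), Counter(), Counter()
for sentence in corpus:
    tags = ["start"] + [t for _, t in sentence] + ["end"]
    for word, tag in sentence:
        emission[(word, tag)] += 1          # how often tag emits word
        tag_count[tag] += 1
    for prev, curr in zip(tags, tags[1:]):
        transition[(prev, curr)] += 1       # how often prev is followed by curr
        prev_count[prev] += 1

# Unsmoothed relative-frequency estimates, as in the table above.
p_emit = {(w, t): c / tag_count[t] for (w, t), c in emission.items()}
p_trans = {(p, c): n / prev_count[p] for (p, c), n in transition.items()}

print(p_emit[("John", "N")])    # 0.5
print(p_trans[("N", "V")])      # 0.5
print(p_trans[("start", "N")])  # 1.0
```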
What if we don’t have labeled data?
I Now assume we don’t have the labeling, only sequences of word tokens:
  I John likes apples
  I John likes oranges
I We can no longer count the frequencies of bigram tag sequences and how often a tag occurs with a word.
I Instead, let’s start with an educated guess of the values of the parameters:
                 Expected counts        probabilities
                 N    V    end          N     V     end
   w = John      -    -    -            0.5   0.1   0
   w = likes     -    -    -            0.1   0.5   0
   w = apples    -    -    -            0.2   0.2   0
   w = oranges   -    -    -            0.2   0.2   0
   t-1 = start   -    -    -            0.8   0.2   0
   t-1 = N       -    -    -            0.1   0.6   0.3
   t-1 = V       -    -    -            0.6   0.2   0.2
Joint probability of a token sequence and its tag sequence
With the initial assignments, we can compute the joint probability of the token sequence and the tag sequence.
   P(John N, likes N, apples N)
     = P(N|start) P(N|N) P(N|N) P(end|N) P(John|N) P(likes|N) P(apples|N)
     = 0.8 × 0.1 × 0.1 × 0.3 × 0.5 × 0.1 × 0.2 = 0.000024

   P(John N, likes N, apples V) = 0.000096
   P(John N, likes V, apples V) = 0.00096
   P(John N, likes V, apples N) = 0.00432
   P(John V, likes N, apples N) = 0.0000072
   P(John V, likes N, apples V) = 0.000029
   P(John V, likes V, apples N) = 0.000072
   P(John V, likes V, apples V) = 0.000016
   P(John N, likes N, oranges N) = 0.000024
   P(John N, likes N, oranges V) = 0.000096
   P(John N, likes V, oranges V) = 0.00096
   P(John N, likes V, oranges N) = 0.00432
   P(John V, likes N, oranges N) = 0.0000072
   P(John V, likes N, oranges V) = 0.000029
   P(John V, likes V, oranges N) = 0.000072
   P(John V, likes V, oranges V) = 0.000016
Conditional probability of a tag sequence given its token sequence
With the joint probabilities we can compute the conditional probability of a tag sequence given its token sequence (a code sketch follows below):

   P((N, N, N) | (John, likes, apples)) = 0.0043
   P((N, N, V) | (John, likes, apples)) = 0.0174
   P((N, V, V) | (John, likes, apples)) = 0.174
   P((N, V, N) | (John, likes, apples)) = 0.782
   P((V, N, N) | (John, likes, apples)) = 0.0013
   P((V, N, V) | (John, likes, apples)) = 0.0052
   P((V, V, N) | (John, likes, apples)) = 0.013
   P((V, V, V) | (John, likes, apples)) = 0.0029
   P((N, N, N) | (John, likes, oranges)) = 0.0043
   P((N, N, V) | (John, likes, oranges)) = 0.0174
   P((N, V, V) | (John, likes, oranges)) = 0.174
   P((N, V, N) | (John, likes, oranges)) = 0.782
   P((V, N, N) | (John, likes, oranges)) = 0.0013
   P((V, N, V) | (John, likes, oranges)) = 0.0052
   P((V, V, N) | (John, likes, oranges)) = 0.013
   P((V, V, V) | (John, likes, oranges)) = 0.0029
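For sentences this short, the joint and conditional probabilities above can be checked by brute-force enumeration over all 2^3 tag sequences (a sketch using the initial parameter guesses from the earlier table; the dictionary names are mine):

```python
from itertools import product

# Initial parameter guesses from the table above.
p_emit = {("John", "N"): 0.5, ("likes", "N"): 0.1, ("apples", "N"): 0.2, ("oranges", "N"): 0.2,
          ("John", "V"): 0.1, ("likes", "V"): 0.5, ("apples", "V"): 0.2, ("oranges", "V"): 0.2}
p_trans = {("start", "N"): 0.8, ("start", "V"): 0.2,
           ("N", "N"): 0.1, ("N", "V"): 0.6, ("N", "end"): 0.3,
           ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "end"): 0.2}

def joint(words, tags):
    """P(x, y) under the HMM factorization with start/end boundary transitions."""
    p, prev = 1.0, "start"
    for w, t in zip(words, tags):
        p *= p_trans[(prev, t)] * p_emit[(w, t)]
        prev = t
    return p * p_trans[(prev, "end")]

words = ("John", "likes", "apples")
joints = {tags: joint(words, tags) for tags in product("NV", repeat=len(words))}
Z = sum(joints.values())                              # marginal probability of the words
posteriors = {tags: p / Z for tags, p in joints.items()}

print(joints[("N", "N", "N")])                 # ≈ 2.4e-05
print(round(posteriors[("N", "V", "N")], 3))   # ≈ 0.782
```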
Expected counts of the parameters
With the conditional probabilities of each tag sequence given its token sequence, we can compute the expected count of each parameter – the count of each parameter weighted by the conditional probability of the tagged sequence it appears in:
                 Expected counts              probabilities
                 N       V       end          N    V    end
   w = John      1.955   0.0448  0            -    -    -
   w = likes     0.056   ?       0            -    -    -
   w = apples    0.80    ?       0            -    -    -
   w = oranges   0.80    ?       ?            -    -    -
   t-1 = start   1.955   0.0448  0            -    -    -
   t-1 = N       0.0546  1.957   1.601        -    -    -
   t-1 = V       ?       ?       ?            -    -    -
Maximization
With the expected counts of the parameters we can re-estimate the parameters (via maximization) and replace the initial parameters with the updated ones:
                 Expected counts              probabilities
                 N       V       end          N       V       end
   w = John      1.955   0.0448  0            0.5414  -       -
   w = likes     0.056   ?       0            -       -       -
   w = apples    0.80    ?       0            -       -       -
   w = oranges   0.80    ?       ?            -       -       -
   t-1 = start   1.955   0.0448  0            0.978   -       -
   t-1 = N       0.0546  1.957   1.601        -       0.5417  -
   t-1 = V       ?       ?       ?            -       -       -
With the updated parameters, we can iterate this process…
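Putting the steps together, the naïve EM procedure on this toy corpus can be sketched as follows; one iteration reproduces the re-estimated values above up to rounding (all names are mine):

```python
from collections import defaultdict
from itertools import product

corpus = [("John", "likes", "apples"), ("John", "likes", "oranges")]
vocab = sorted({w for s in corpus for w in s})
tags = ("N", "V")

# Initial guesses, same numbers as the table above.
p_emit = {("John", "N"): 0.5, ("likes", "N"): 0.1, ("apples", "N"): 0.2, ("oranges", "N"): 0.2,
          ("John", "V"): 0.1, ("likes", "V"): 0.5, ("apples", "V"): 0.2, ("oranges", "V"): 0.2}
p_trans = {("start", "N"): 0.8, ("start", "V"): 0.2,
           ("N", "N"): 0.1, ("N", "V"): 0.6, ("N", "end"): 0.3,
           ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "end"): 0.2}

def joint(words, y):
    p, prev = 1.0, "start"
    for w, t in zip(words, y):
        p *= p_trans[(prev, t)] * p_emit[(w, t)]
        prev = t
    return p * p_trans[(prev, "end")]

for _ in range(1):   # one EM iteration; increase the range to iterate further
    # E-step: expected emission and transition counts, weighted by the posteriors.
    e_emit, e_trans = defaultdict(float), defaultdict(float)
    for words in corpus:
        joints = {y: joint(words, y) for y in product(tags, repeat=len(words))}
        Z = sum(joints.values())
        for y, p in joints.items():
            post = p / Z
            prev = "start"
            for w, t in zip(words, y):
                e_emit[(w, t)] += post
                e_trans[(prev, t)] += post
                prev = t
            e_trans[(prev, "end")] += post
    # M-step: renormalize the expected counts.
    for t in tags:
        total = sum(e_emit[(w, t)] for w in vocab)
        for w in vocab:
            p_emit[(w, t)] = e_emit[(w, t)] / total
    for prev in ("start",) + tags:
        total = sum(e_trans[(prev, t)] for t in tags + ("end",))
        for t in tags + ("end",):
            p_trans[(prev, t)] = e_trans[(prev, t)] / total

print(round(p_emit[("John", "N")], 4))   # ≈ 0.5411 (0.5414 on the slide, from rounded counts)
```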
A summary of this process:
I We first initialize the parameters with some initial values, hopefully with some prior knowledge
I The E-Step:
  I Using these parameters, we compute the conditional probability of each tag sequence
  I With the conditional probabilities we compute the expected counts of the parameters
I The M-step: we then re-estimate the parameters from the expected counts (via maximization).
I We repeat this process until it converges to a local maximum. The underlying objective is not concave (the opposite of convex), so there is no guarantee that it will reach the global maximum.
The generic EM algorithm
Input: a sample of N points x^(1), x^(2), ..., x^(N), and a model P(x, y | θ) of the form

    P(x, y | θ) = ∏_{r=1}^{|θ|} θ_r^{Count(x, y, r)}

Output: θ^T

    Initialization: choose some initial value θ^0
    for t = 1 ... T do
        for r = 1 ... |θ| do
            E[Count(r)] = 0
        for i = 1 ... N do
            for all y, compute t_y = P(x^(i), y | θ^{t-1})
            for all y, set u_y = t_y / Σ_{y'} t_{y'}
            for all r = 1 ... |θ|, set E[Count(r)] = E[Count(r)] + Σ_y u_y Count(x^(i), y, r)
        for r = 1 ... |θ| do
            θ_r^t = E[Count(r)] / Z, where Z is a normalization constant
At this time you should have some questions…
I Does this work? Why does this work at all?
I What are the scenarios in which EM can be deployed?
I Can this be done more efficiently?
I Why do we weight the counts with the posterior of the sequence (and not something else)?
I Will the log likelihood monotonically increase after each iteration?
I We’ll try to answer some of them…
The Baum-Welch algorithm
I The naive Expectation Maximization algorithm we outlined above works for short sentences in small data sets, but does not scale.
I The Baum-Welch algorithm is an efficient alternative that combines EM with the Forward-Backward algorithm.
I In the M-step, the HMM parameters are estimated from the expected counts:

      P(W = i | Y = k) = φ_{k,i} = E[count(W = i, Y = k)] / E[count(Y = k)]

      P(Y_m = k | Y_{m-1} = k') = λ_{k',k} = E[count(Y_m = k, Y_{m-1} = k')] / E[count(Y_{m-1} = k')]
The E-Step: transition counts
I The local scores follow the usual definitions of HMM:

      s_m(k, k') = log P_E(w_m | Y_m = k; φ) + log P_T(Y_m = k | Y_{m-1} = k'; λ)

I The expected transition count for a single instance is:

      E[count(Y_m = k, Y_{m-1} = k') | w] = Σ_{m=1}^{M} Pr(Y_m = k, Y_{m-1} = k' | w)

      Pr(Y_m = k, Y_{m-1} = k' | w) = α_{m-1}(k') × exp s_m(k, k') × β_m(k) / α_{M+1}(end)
I The posterior is computed the same way as in the forward-backward computation in CRF, with the only difference being how the local score is computed.
The E-Step: emission counts
I The local scores follow the usual definitions of HMM:

      s_m(k, k') = log P_E(w_m | Y_m = k; φ) + log P_T(Y_m = k | Y_{m-1} = k'; λ)

I The expected emission count for a single instance is:

      E[count(Y = k, W = i) | w] = Σ_{m: w_m = i} Pr(Y_m = k | w)

      Pr(Y_m = k | w) = Σ_{k'} Pr(Y_m = k, Y_{m-1} = k' | w)
                      = Σ_{k'} α_{m-1}(k') × exp s_m(k, k') × β_m(k) / α_{M+1}(end)
                      = α_m(k) β_m(k) / α_{M+1}(end)

I The posterior is computed the same way as in the forward-backward computation in CRF, with the only difference being how the local score is computed.
Baum-Welch: Forward computation
[Trellis over "John likes apples": a start state, N and V states at each position, and an end state, annotated with the forward scores below. The parameters are the initial guesses from the earlier table.]

Forward scores:

   α1(N) = 0.4     α2(N) = 0.0052   α3(N) = 0.0147
   α1(V) = 0.02    α2(V) = 0.122    α3(V) = 0.0055
   α4(end) = 0.0055
Baum-Welch: Backward computation
[The same trellis and initial parameters, annotated with the backward scores below.]

Backward scores:

   β1(N) = 0.0135   β2(N) = 0.03    β3(N) = 0.3
   β1(V) = 0.0062   β2(V) = 0.044   β3(V) = 0.2
   β0(start) = 0.0055
Collecting expected counts
[The forward and backward trellises from the previous two slides, shown together.]

   E[count(Y2 = V, likes) | John likes apples] = α2(V) β2(V) / α4(end) = ?
Collecting expected counts
[The same forward and backward trellises.]

   E[count(Y2 = V, Y1 = N) | John likes apples] = α1(N) × exp s2(V, N) × β2(V) / α4(end) = ?
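A sketch of how these expected counts can be computed with the forward and backward recursions instead of explicit enumeration (it works in probability space rather than log space, which is fine for a three-word example; positions are 0-indexed, so m = 1 corresponds to Y2; all names are mine):

```python
# Forward-backward expected counts for the toy example, using the initial parameters.
p_emit = {("John", "N"): 0.5, ("likes", "N"): 0.1, ("apples", "N"): 0.2,
          ("John", "V"): 0.1, ("likes", "V"): 0.5, ("apples", "V"): 0.2}
p_trans = {("start", "N"): 0.8, ("start", "V"): 0.2,
           ("N", "N"): 0.1, ("N", "V"): 0.6, ("N", "end"): 0.3,
           ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "end"): 0.2}
tags = ("N", "V")
words = ("John", "likes", "apples")
M = len(words)

# Forward: alpha[m][k] = total probability of all tag paths ending in k at position m.
alpha = [{k: p_trans[("start", k)] * p_emit[(words[0], k)] for k in tags}]
for m in range(1, M):
    alpha.append({k: sum(alpha[m - 1][kp] * p_trans[(kp, k)] for kp in tags)
                     * p_emit[(words[m], k)]
                  for k in tags})
Z = sum(alpha[M - 1][k] * p_trans[(k, "end")] for k in tags)   # = alpha_{M+1}(end)

# Backward: beta[m][k] = total probability of completing the sequence from tag k at m.
beta = [dict() for _ in range(M)]
beta[M - 1] = {k: p_trans[(k, "end")] for k in tags}
for m in range(M - 2, -1, -1):
    beta[m] = {k: sum(p_trans[(k, kn)] * p_emit[(words[m + 1], kn)] * beta[m + 1][kn]
                      for kn in tags)
               for k in tags}

def tag_posterior(m, k):
    """Pr(Y_{m+1} = k | w); for a word that occurs once, this is its expected emission count."""
    return alpha[m][k] * beta[m][k] / Z

def transition_posterior(m, k, kp):
    """Pr(Y_{m+1} = k, Y_m = k' | w) for m >= 1, i.e. alpha * local score * beta / Z."""
    return alpha[m - 1][kp] * p_trans[(kp, k)] * p_emit[(words[m], k)] * beta[m][k] / Z

print(round(tag_posterior(1, "V"), 3))           # E[count(Y2 = V, likes) | w]
print(round(transition_posterior(1, "V", "N"), 3))  # E[count(Y2 = V, Y1 = N) | w]
```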
Sequence labeling summary
I Decoding: the Viterbi algorithm
I Parameter estimation
I Supervised algorithms:
  I HMM: there is a closed-form solution, assuming strong independence
  I Perceptron: the perceptron learning algorithm rewards features that make correct predictions and penalizes features that make incorrect predictions
  I CRF: updates the feature weights proportionally; this requires expected feature counts, which in turn require a posterior that can be computed efficiently with the forward-backward algorithm
  I LSTM-CRF: learns feature representations and transition scores via RNNs
I Unsupervised algorithms:
I The Baum-Welch algorithm, which combines expectation maximization and the forward-backward algorithm.