Sequence Labelling 2: Hidden Markov Models
This time:
  The PoS Tagging Problem
  Modelling the problem
  Hidden Markov Models
  Hidden Markov Model tagging
  Computing sequence probabilities
  Finding the most likely path
  Efficient computation: The Viterbi algorithm
  Training HMMs
The PoS Tagging Problem
Input: a sequence of words $w_1^n = (w_1, w_2, \ldots, w_n)$
Output: a sequence of PoS tags $t_1^n = (t_1, t_2, \ldots, t_n)$
In particular, find the most probable PoS tag sequence $\hat{t}_1^n$:
    $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$
Modelling the Tagging Problem
We are interested in the most probable sequence given the words:
    $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$
One way of thinking about this is to assume that sequences are generated by a random process.
The random process actually generates a pair of sequences:
  1. an observable sequence $w_1^n$ (the word sequence): $w_1^n = (w_1, w_2, \ldots, w_n)$
  2. an unobservable (hidden) sequence $t_1^n$ (the tag sequence): $t_1^n = (t_1, t_2, \ldots, t_n)$
Given an observed sequence of words, the tagging problem is to infer the unobserved tags.
Modelling the Tagging Problem
Apply Bayes' Law:
    $\arg\max_{t_1^n} P(t_1^n \mid w_1^n)
       = \arg\max_{t_1^n} \frac{P(t_1^n) \cdot P(w_1^n \mid t_1^n)}{P(w_1^n)}
       = \arg\max_{t_1^n} P(t_1^n) \cdot P(w_1^n \mid t_1^n)$
  $P(t_1^n)$: models PoS tag sequences
  $P(w_1^n \mid t_1^n)$: models the relation between word and tag sequences

Hidden Markov Models

A suitable generative model is the Hidden Markov Model (HMM):
  a set Q of possible states, including a designated start state $\phi$
  a set O of possible observations
  the process moves randomly from state to state, generating a sequence of states
  each state randomly generates one observation

Hidden Markov Model Tagger

For an HMM-based tagger:
  the set of states Q is just the set of possible PoS tags T,
    so the model generates the hidden tag sequence $t_1^n = (t_1, t_2, \ldots, t_n)$
  the set of observations O is just the set of possible words W,
    so the model generates the observed word sequence $w_1^n = (w_1, w_2, \ldots, w_n)$

Hidden Markov Model Tagger: Parameters

To define an HMM tagger we need to specify:
  Transition probabilities $P(t_j \mid t_i)$, for each pair of states $t_i$ and $t_j$:
    a bigram model, giving the probability of moving to state $t_j$ from state $t_i$
    Markov assumption (chain property): the probability of state $t_j$ depends only on the previous state $t_i$
  Observation probabilities $P(w \mid t)$, for each word w and state t:
    the probability of generating word w when in state t
    Markov assumption: the probability of generating word w depends only on the current state (i.e. tag) t
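To make the parameterisation concrete, here is a minimal sketch (not from the lecture) that stores the two probability tables as nested Python dictionaries, filled only with the values quoted later in these slides; the layout, the names, and the fact that the tables are left incomplete are all assumptions for illustration.

```python
START = "<phi>"                       # the designated start state

TAGS = {"N", "V", "P", "ART"}         # Q: the states are the PoS tags
WORDS = {"flies", "like", "flowers"}  # O: the observations are the words

# Transition probabilities P(t_j | t_i), indexed as trans[t_i][t_j].
# Only the edges quoted in the lecture example are filled in; a real model
# has a full table in which each row sums to 1.
trans = {
    START: {"N": 0.29},
    "N":   {"V": 0.43},
    "V":   {"N": 0.35},
}

# Observation (lexical generation) probabilities P(w | t), indexed as emit[t][w].
emit = {
    "N": {"flies": 0.025, "like": 0.012, "flowers": 0.050},
    "V": {"flies": 0.015, "like": 0.034, "flowers": 0.005},
}
```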
HMM Tagger: Transition Probabilities
[Figure: state-transition diagram over the start state $\phi$ and the tags N, V, P and ART, with a transition probability on each edge, e.g. $P(N \mid \phi) = 0.29$, $P(V \mid N) = 0.43$, $P(N \mid V) = 0.35$.]

Note: for each PoS tag t, the associated outgoing transition probabilities sum to 1.

Computing PoS Sequence Probabilities

Question: How do we compute $P(t_1^n)$?
Answer: Take the product of the transition probabilities on the edges, starting from the start state $\phi$:
    $P(t_1^n) = P(t_1 \mid \phi) \cdot \prod_{i=2}^{n} P(t_i \mid t_{i-1})$
Example: The probability of the tag sequence N V N would be
    $P(\text{N V N}) = P(N \mid \phi) \cdot P(V \mid N) \cdot P(N \mid V) = 0.29 \cdot 0.43 \cdot 0.35 = 0.043645$

Computing Word Sequence Probabilities

Question: How do we compute $P(w_1^n \mid t_1^n)$?
Answer: Take the product of the lexical generation probabilities:
    $P(w_1^n \mid t_1^n) = \prod_{i=1}^{n} P(w_i \mid t_i)$
Suppose we have:

    w        P(w | N)   P(w | V)
    flies    0.025      0.015
    like     0.012      0.034
    flowers  0.050      0.005
    ...      ...        ...

Example: Probability of Words Given A Tag Sequence

    P(flies like flowers | N V N) = P(flies | N) · P(like | V) · P(flowers | N)
                                  = 0.025 × 0.034 × 0.050
                                  = 0.0000425

NB: This is just the probability of the words given one particular tag sequence.
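As a quick check of the two product formulas, the following snippet recomputes the example numbers above; the probabilities themselves are taken from these slides, everything else is just illustrative Python.

```python
from math import prod

# P(N V N) = P(N | phi) * P(V | N) * P(N | V)
p_tags = prod([0.29, 0.43, 0.35])
print(p_tags)                 # ~0.043645

# P(flies like flowers | N V N) = P(flies | N) * P(like | V) * P(flowers | N)
p_words_given_tags = prod([0.025, 0.034, 0.050])
print(p_words_given_tags)     # ~4.25e-05
```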
Finding the Most Likely Path
In general, there are many paths that generate a given word sequence.
The tag sequence is hidden: it is not possible to know which tag sequence generated the output (hence the term Hidden Markov Model).
We are interested in recovering the most likely explanation, i.e. the most likely sequence of tags given the sequence of words.

Computational Considerations

Question: How do we find the tag sequence most likely to have generated the words?
(Naïve) Answer: Enumerate all possible tag sequences and select the most probable (see the sketch below):
  computationally very expensive;
  with k tags there are $O(k^n)$ different tag sequences that could generate an output of length n;
  e.g. with 4 possible tags N, V, P and ART, the word sequence fruit flies like flowers already has of the order of $4^4 = 256$ possible tag sequences.

The Viterbi Algorithm

The algorithm finds the most probable sequence in $O(n \cdot k^2)$ steps.
For fruit flies like flowers (and four possible tags): $4 \cdot 4^2 = 64$ steps.
A considerable saving on the brute-force approach.
How does it achieve this?

The Viterbi Algorithm

Viterbi exploits the Markov assumptions:
  the probability of the next state depends only on the current state;
  the probability of the current output depends only on the current tag.
These assumptions mean you do not need to enumerate all paths.
Viterbi is a classic example of dynamic programming:
  recursively decompose the problem into smaller problems;
  keep track of (tabulate) solutions to sub-problems.
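For contrast with Viterbi, here is a minimal sketch of the naïve enumeration described under Computational Considerations: it scores every one of the $k^n$ tag sequences using the two product formulas from earlier, so it is only feasible for toy inputs. The dictionary layout for the probability tables is an assumption for illustration.

```python
from itertools import product

def brute_force_tag(words, tags, trans, emit, start="<phi>"):
    """Score all k^n tag sequences and return the most probable one.

    trans[t_prev][t] = P(t | t_prev) and emit[t][w] = P(w | t);
    missing entries are treated as probability 0.
    """
    p = lambda table, a, b: table.get(a, {}).get(b, 0.0)

    def score(seq):
        # P(t_1^n) * P(w_1^n | t_1^n), including the transition from the start state
        prob = p(trans, start, seq[0]) * p(emit, seq[0], words[0])
        for i in range(1, len(words)):
            prob *= p(trans, seq[i - 1], seq[i]) * p(emit, seq[i], words[i])
        return prob

    return max(product(tags, repeat=len(words)), key=score)
```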
Viterbi Algorithm: The Problem Space
A sub-problem corresponds to a pair (i, t) where:
  i ∈ {1, ..., n} is a position in the sequence
    → tells us we are focused on $w_1, \ldots, w_i$
    → remember the whole sequence is $w_1, \ldots, w_i, \ldots, w_n$
  t ∈ {N, V, ...} is the current PoS tag
    → tells us how $w_i$ is tagged
    → we are not bothered about the tags for tokens earlier in the sequence

Viterbi Algorithm: Tabulating Solutions to Subproblems

Maintain a table V that keeps track of solutions to sub-problems:
  one sub-problem for each (i, t)
    → one place in table V for each (i, t)
  V(i, t) stores the optimal solution for sub-problem (i, t):
    V(i, t) is the probability of the most probable PoS tag sequence for $w_1, \ldots, w_i$ such that $w_i$ is tagged with t

Viterbi Algorithm: Solving SubProblems

Initialisation: for each tag t ∈ T:
    $V(1, t) = P(t \mid \phi) \cdot P(w_1 \mid t)$
i.e. the probability of the first word being tagged as t.
Recursion: for each position i ∈ {2, ..., n} and each tag t ∈ T:
    $V(i, t) = \max_{t' \in T} V(i-1, t') \cdot P(t \mid t') \cdot P(w_i \mid t)$
i.e. the probability of the i-th word being tagged as t.

Viterbi Algorithm: Control Strategy

for each i from 1 to n
    for each t in T
        complete entry V(i, t)
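The initialisation, recursion and control strategy above translate almost directly into code. Below is a minimal sketch under the same assumed dictionary layout as before (trans[t_prev][t] and emit[t][w], missing entries treated as 0); it also keeps back-pointers so the best tag sequence can be read off at the end, a detail the slides leave implicit.

```python
def viterbi(words, tags, trans, emit, start="<phi>"):
    """Return the most probable tag sequence for `words`.

    trans[t_prev][t] = P(t | t_prev) and emit[t][w] = P(w | t); missing
    entries are treated as probability 0. `start` is the start state phi.
    """
    p = lambda table, a, b: table.get(a, {}).get(b, 0.0)

    # Initialisation: V(1, t) = P(t | phi) * P(w_1 | t)
    V = [{t: p(trans, start, t) * p(emit, t, words[0]) for t in tags}]
    back = [{}]

    # Recursion: V(i, t) = max over t' of V(i-1, t') * P(t | t') * P(w_i | t)
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda tp: V[i - 1][tp] * p(trans, tp, t))
            V[i][t] = V[i - 1][best_prev] * p(trans, best_prev, t) * p(emit, t, words[i])
            back[i][t] = best_prev

    # Termination: pick the best final tag, then follow the back-pointers
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

The outer loop over positions with an inner loop over tags mirrors the control strategy, and the max over previous tags inside it is what gives the $O(n \cdot k^2)$ cost noted earlier.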
Example: Tagging flies like flowers (Initialisation Step)

For every tag t, V(1, t) is filled with the product of two probabilities:
  the probability of the transition from the start state φ to t
  the probability of the word observed at position 1 given PoS tag t

Tag Probabilities at Position i = 1

[Figure: trellis column for flies showing V(1, t) for each tag: V(1, N) = 0.00725, V(1, V) = 7.6 × 10^-6, V(1, P) = 0, V(1, ART) = 0.]

Consider entry V(1, N):
  the transition probability from φ is 0.29
  the probability of output flies for PoS tag N is 0.025
  so the value for entry V(1, N) is 0.29 · 0.025 = 0.00725

Tag Probabilities at Positions i > 1

Entry V(i, t) is calculated as the highest of the following k products (one for each tag t′):
    V(i−1, t′) × the transition (bigram) probability from t′ to t × the probability of the i-th word given PoS tag t

Tag Probabilities at i = 2

[Figure: trellis columns for flies like showing V(2, t) for each tag; the values shown include V(2, V) = 0.00031, V(2, P) = 0.00022, V(2, N) = 1.3 × 10^-5 and V(2, ART) = 0.]

Tag Probabilities at i = 3

[Figure: trellis columns for flies like flowers showing V(3, t) for each tag.]

Most Probable Sequence of Tags

[Figure: completed trellis for flies like flowers, with final entries V(3, V) = 1.6533 × 10^-9 and V(3, N) = 5.684 × 10^-6; the highlighted most probable path tags the sequence as N V N.]
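The worked example can be reproduced with a short self-contained script. Note that only the probabilities actually quoted on these slides are filled in and everything missing defaults to 0, so the intermediate trellis values differ from the full figures above; the initial entry V(1, N) and the winning tag sequence nevertheless come out the same. This is an illustrative sketch, not the lecture's own code.

```python
START = "<phi>"
tags = ["N", "V", "P", "ART"]
words = ["flies", "like", "flowers"]

# Only the probabilities quoted on the slides; everything else defaults to 0.
trans = {START: {"N": 0.29}, "N": {"V": 0.43}, "V": {"N": 0.35}}
emit = {"N": {"flies": 0.025, "like": 0.012, "flowers": 0.050},
        "V": {"flies": 0.015, "like": 0.034, "flowers": 0.005}}

p = lambda table, a, b: table.get(a, {}).get(b, 0.0)

# Initialisation: V(1, t) = P(t | phi) * P(flies | t)
V = [{t: p(trans, START, t) * p(emit, t, words[0]) for t in tags}]
back = [{}]

# Recursion over positions 2 and 3
for i in range(1, len(words)):
    V.append({})
    back.append({})
    for t in tags:
        prev = max(tags, key=lambda tp: V[i - 1][tp] * p(trans, tp, t))
        V[i][t] = V[i - 1][prev] * p(trans, prev, t) * p(emit, t, words[i])
        back[i][t] = prev

print(V[0]["N"])                # ~0.00725, matching V(1, N) above

# Follow the back-pointers from the best final tag
best = max(tags, key=lambda t: V[-1][t])
path = [best]
for i in range(len(words) - 1, 0, -1):
    path.append(back[i][path[-1]])
print(list(reversed(path)))     # ['N', 'V', 'N']
```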
Training HMMs
Lexical generation probabilities:
    $P(w \mid t) = \frac{\text{count}(w, t)}{\sum_{w'} \text{count}(w', t)}$
where count(w, t) is the number of times word w is tagged with t.

Training HMMs (cont.)

Bigram probabilities:
    $P(t \mid t') = \frac{\text{count}(t', t)}{\sum_{t''} \text{count}(t', t'')}$
where count(t′, t) is the number of times tag t′ is followed by tag t.
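A minimal sketch of these relative-frequency (maximum-likelihood) estimates, assuming the training data is a list of sentences, each a list of (word, tag) pairs; the function name and data layout are illustrative assumptions, and no smoothing is applied.

```python
from collections import Counter

START = "<phi>"

def train_hmm(tagged_sentences):
    """Estimate P(w | t) and P(t | t') by relative frequency from sentences
    given as lists of (word, tag) pairs."""
    emit_counts = Counter()    # count(w, t)
    tag_counts = Counter()     # sum over w' of count(w', t)
    bigram_counts = Counter()  # count(t', t): tag t' followed by tag t
    prev_counts = Counter()    # sum over t'' of count(t', t'')

    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            emit_counts[(word, tag)] += 1
            tag_counts[tag] += 1
            bigram_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
            prev = tag

    emit = {(w, t): c / tag_counts[t] for (w, t), c in emit_counts.items()}
    trans = {(tp, t): c / prev_counts[tp] for (tp, t), c in bigram_counts.items()}
    return trans, emit

# e.g. trans[("<phi>", "N")] estimates the probability of a sentence starting
# with tag N, and emit[("flies", "N")] estimates P(flies | N).
```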
Next Topic: Parsing with Context-Free Grammar
Grammar and language
Phrase structure
Context Free Grammar (CFG)
Basic parsing algorithms
Parsing and search
Chart parsing