Sequence Labelling 2: Hidden Markov Models
This time:
The PoS Tagging Problem
Modelling the problem
Hidden Markov Models
Hidden Markov Model tagging
Computing sequence probabilities
Finding the most likely path
Efficient computation: The Viterbi algorithm
Training HMMs
The PoS Tagging Problem
Input: a sequence of words $w_1^n$:
$$w_1^n = (w_1, w_2, \ldots, w_n)$$
Output: a sequence of PoS tags $t_1^n$:
$$t_1^n = (t_1, t_2, \ldots, t_n)$$
In particular, find the most probable PoS tag sequence $\hat{t}_1^n$:
$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$
The PoS Tagging Problem
One way of thinking about this is to assume that sequences are generated by a random process
The random process actually generates a pair of sequences:
1. An observable sequence $w_1^n$ (the word sequence): $w_1^n = (w_1, w_2, \ldots, w_n)$
2. An unobservable (hidden) sequence $t_1^n$ (the tag sequence): $t_1^n = (t_1, t_2, \ldots, t_n)$
Given an observed sequence of words, the tagging problem is to
infer the unobserved tags.
Modelling the Tagging Problem
We are interested in the most probable sequence given the words:
$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$
Apply Bayes' Law:
$$\operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} \frac{P(t_1^n) \cdot P(w_1^n \mid t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(t_1^n) \cdot P(w_1^n \mid t_1^n)$$
since $P(w_1^n)$ does not depend on the tag sequence $t_1^n$.
Modelling the Tagging Problem
$$\operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname*{argmax}_{t_1^n} P(t_1^n) \cdot P(w_1^n \mid t_1^n)$$
$P(t_1^n)$ – models PoS tag sequences
$P(w_1^n \mid t_1^n)$ – models the relation between word and tag sequences
Hidden Markov Models
A suitable generative model is the Hidden Markov Model (HMM):
A set Q of possible states, including a designated start state φ
A set O of possible observations
The process moves randomly from state to state, generating a sequence of states
Each state randomly generates one observation
Hidden Markov Model Tagger
For an HMM-based tagger:
The set of states Q is just the set of possible PoS tags T, so the model generates the hidden tag sequence $t_1^n = (t_1, t_2, \ldots, t_n)$
The set of observations O is just the set of possible words W, so the model generates the observed word sequence $w_1^n = (w_1, w_2, \ldots, w_n)$
Hidden Markov Model Tagger: Parameters
To define an HMM tagger we need to specify two sets of parameters (a sketch of how they might be stored follows below):
Transition probabilities: $P(t_j \mid t_i)$ for each pair of states $t_i$ and $t_j$
a bigram model: the probability of moving to state $t_j$ from state $t_i$
Markov assumption (chain property): the probability of state $t_j$ depends only on the previous state $t_i$
Observation probabilities: $P(w \mid t)$ for each word $w$ and state $t$
the probability of generating word $w$ when in state $t$
Markov assumption: the probability of generating word $w$ depends only on the current state (i.e. tag) $t$
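As a concrete illustration, here is a minimal sketch (in Python; not from the slides) of how these two parameter tables might be stored. Only the values noted in the comments come from the lecture's running example; the helper structures themselves are hypothetical.

```python
# Transition probabilities P(t_j | t_i), indexed as transitions[t_i][t_j].
# P(N | phi) = 0.29, P(V | N) = 0.43 and P(N | V) = 0.35 appear in the
# lecture's transition diagram; other entries would be filled in similarly.
transitions = {
    "phi": {"N": 0.29},
    "N":   {"V": 0.43},
    "V":   {"N": 0.35},
}

# Observation (emission) probabilities P(w | t), indexed as emissions[t][w],
# with values as in the lexical generation table shown later in the lecture.
emissions = {
    "N": {"flies": 0.025, "like": 0.012, "flowers": 0.050},
    "V": {"flies": 0.015, "like": 0.034, "flowers": 0.005},
}
```

These tables are reused in the probability sketches that follow.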
HMM Tagger: Transition Probabilities
[Figure: state-transition diagram over the start state φ and the tags ART, N, V and P, with transition probabilities labelling the edges; the ones used in the following example are P(N | φ) = 0.29, P(V | N) = 0.43 and P(N | V) = 0.35.]
Computing PoS Sequence Probabilities
Question: How do we compute $P(t_1^n)$?
Answer: Product of the transition probabilities on the edges:
$$P(t_1^n) = P(t_1 \mid \phi) \cdot \prod_{i=2}^{n} P(t_i \mid t_{i-1})$$
Example: The probability of the tag sequence N V N would be
$$P(\text{N V N}) = P(N \mid \phi) \cdot P(V \mid N) \cdot P(N \mid V) = 0.29 \cdot 0.43 \cdot 0.35 = 0.043645$$
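A quick sketch that reproduces this product using the hypothetical `transitions` table from earlier (the helper name is illustrative, not from the slides):

```python
# P(t_1 .. t_n) = P(t_1 | phi) * P(t_2 | t_1) * ... * P(t_n | t_{n-1})
def tag_sequence_prob(tags, transitions, start="phi"):
    prob, prev = 1.0, start
    for tag in tags:
        prob *= transitions.get(prev, {}).get(tag, 0.0)  # P(tag | prev)
        prev = tag
    return prob

print(tag_sequence_prob(["N", "V", "N"], transitions))  # 0.043645 (up to rounding)
```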
Computing Word Sequence Probabilities
Question: How do we compute $P(w_1^n \mid t_1^n)$?
Answer: Product of the lexical generation probabilities:
$$P(w_1^n \mid t_1^n) = \prod_{i=1}^{n} P(w_i \mid t_i)$$
NB: This is just the probability of the words given one particular tag sequence.
Example: Probability of Words Given A Tag Sequence
Suppose we have:

w        P(w|N)   P(w|V)
flies    0.025    0.015
like     0.012    0.034
flowers  0.050    0.005
...      ...      ...

Note: for each PoS tag t, the associated probabilities sum to 1.

$$P(\text{flies like flowers} \mid \text{N V N}) = P(\text{flies} \mid N) \cdot P(\text{like} \mid V) \cdot P(\text{flowers} \mid N)$$
$$= 0.025 \times 0.034 \times 0.050 = 0.0000425$$
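A matching sketch for this product, again using the hypothetical `emissions` table sketched earlier:

```python
# P(w_1 .. w_n | t_1 .. t_n) = product of P(w_i | t_i)
def word_sequence_prob(words, tags, emissions):
    prob = 1.0
    for word, tag in zip(words, tags):
        prob *= emissions.get(tag, {}).get(word, 0.0)  # P(word | tag)
    return prob

print(word_sequence_prob(["flies", "like", "flowers"], ["N", "V", "N"], emissions))
# 0.0000425 (up to floating-point rounding)
```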
Finding the Most Likely Path
In general, there are many paths generating a given word sequence
The tag sequence is hidden
It is not possible to know which tag sequence generated the output
Hence the term Hidden Markov Model
We are interested in recovering the most likely explanation:
i.e. the most likely sequence of tags given the sequence of words
Computational Considerations
Question: How do we find the tag sequence most likely to have generated the words?
(Naïve) Answer: Enumerate all possible tag sequences and select the most probable:
computationally very expensive
With k tags there are $O(k^n)$ different tag sequences that could generate an output of length n
e.g. if we have 4 possible tags N, V, P, ART, then the word sequence fruit flies like flowers already has of the order of $4^4 = 256$ possible tag sequences
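To make the blow-up concrete, here is a brute-force sketch (hypothetical helper name, not from the slides) that scores every one of the $k^n$ candidate sequences; the Viterbi algorithm below avoids exactly this enumeration:

```python
from itertools import product

# Naive decoding: score all k^n tag sequences and keep the best one.
# `transitions` and `emissions` are dictionaries as in the earlier sketches.
def brute_force_tag(words, tagset, transitions, emissions, start="phi"):
    best_tags, best_score = None, 0.0
    for tags in product(tagset, repeat=len(words)):   # k^n candidates
        score, prev = 1.0, start
        for word, tag in zip(words, tags):
            score *= transitions.get(prev, {}).get(tag, 0.0)  # P(tag | prev)
            score *= emissions.get(tag, {}).get(word, 0.0)    # P(word | tag)
            prev = tag
        if score > best_score:
            best_tags, best_score = list(tags), score
    return best_tags, best_score
```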
The Viterbi Algorithm
The algorithm finds the most probable sequence in $O(n \cdot k^2)$ steps
For fruit flies like flowers (and four possible tags): $4 \cdot 4^2 = 64$ steps
Considerable saving on the brute-force approach
How does it achieve this?
The Viterbi Algorithm
Viterbi exploits Markov assumptions:
Probability of next state depends only on current state
Probability of current output depends only on current tag
Assumptions mean you don't need to enumerate all paths
Viterbi is a classic example of dynamic programming
Recursively decompose problem into smaller problems
Keep track of (tabulate) solutions to sub-problems
Viterbi Algorithm: The Problem Space
A sub-problem corresponds to a pair $(i, t)$ where:
$i \in \{1, \ldots, n\}$ is the position in the sequence
→ tells us we are focussed on $w_1, \ldots, w_i$
→ remember the whole sequence is $w_1, \ldots, w_i, \ldots, w_n$
$t \in \{N, V, \ldots\}$ is the current PoS tag
→ tells us how $w_i$ is tagged
→ not bothered about tags for tokens earlier in the sequence
Viterbi Algorithm: Tabulating Solutions to Subproblems
Maintain a table V that keeps track of solutions to sub-problems:
One sub-problem for each $(i, t)$
→ one place in table V for each $(i, t)$
$V(i, t)$ stores the optimal solution for sub-problem $(i, t)$:
$V(i, t)$ is the probability of the most probable PoS tag sequence for $w_1, \ldots, w_i$ such that $w_i$ is tagged with $t$
Viterbi Algorithm: Solving SubProblems
Initialisation: for each tag $t \in T$:
$$V(1, t) = P(t \mid \phi) \cdot P(w_1 \mid t)$$
i.e. the probability of the first word being tagged as $t$
Recursion: for each position $i \in \{2, \ldots, n\}$ and each tag $t \in T$:
$$V(i, t) = \max_{t' \in T} V(i-1, t') \cdot P(t \mid t') \cdot P(w_i \mid t)$$
i.e. the probability of the $i$th word being tagged as $t$
Viterbi Algorithm: Control Strategy
for each i from 1 to n
    for each t in T
        complete entry V(i, t)
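Putting the initialisation, recursion and this control strategy together, here is a minimal Viterbi sketch in Python (hypothetical function and variable names; `transitions` and `emissions` are the dictionaries assumed in the earlier sketches). Backpointers are kept so that the best tag sequence itself can be recovered, not just its probability:

```python
def viterbi(words, tagset, transitions, emissions, start="phi"):
    """Return the most probable tag sequence for `words` and its probability."""
    n = len(words)
    V = [{} for _ in range(n)]     # V[i][t]: prob. of best path for w_1..w_{i+1} ending in t
    back = [{} for _ in range(n)]  # back[i][t]: previous tag on that best path

    # Initialisation: V(1, t) = P(t | phi) * P(w_1 | t)
    for t in tagset:
        V[0][t] = (transitions.get(start, {}).get(t, 0.0)
                   * emissions.get(t, {}).get(words[0], 0.0))
        back[0][t] = None

    # Recursion: V(i, t) = max_{t'} V(i-1, t') * P(t | t') * P(w_i | t)
    for i in range(1, n):
        for t in tagset:
            best_prev, best_score = None, 0.0
            for t_prev in tagset:
                score = (V[i - 1][t_prev]
                         * transitions.get(t_prev, {}).get(t, 0.0)
                         * emissions.get(t, {}).get(words[i], 0.0))
                if score > best_score:
                    best_prev, best_score = t_prev, score
            V[i][t] = best_score
            back[i][t] = best_prev

    # Termination: pick the best final tag, then follow the backpointers.
    last = max(tagset, key=lambda t: V[n - 1][t])
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    tags.reverse()
    return tags, V[n - 1][last]
```

The nested loops over positions and tags, with an inner maximisation over the k previous tags, are what give the $O(n \cdot k^2)$ running time mentioned above.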
Example: Tagging flies like flowers
Initialisation Step
For every tag t, V(1, t) is filled with the product of two probabilities:
the probability of the transition from the start state φ to t
the probability of the word observed at position 1 given PoS tag t
Consider entry V(1, N):
the transition probability from φ is 0.29
the probability of output flies for PoS tag N is 0.025
so the value for entry V(1, N) is 0.00725
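A one-line check of this entry, using the values from the slide:

```python
# V(1, N) = P(N | phi) * P(flies | N)
print(0.29 * 0.025)  # 0.00725 (up to floating-point rounding)
```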
Tag Probabilities at Position i = 1 (flies)
[Figure: trellis column for position 1, showing V(1, N) = 0.00725, V(1, V) = 7.6 · 10⁻⁶, V(1, P) = 0 and V(1, ART) = 0.]
Tag Probabilities at Positions i > 1
Entry V(i, t) is calculated as follows:
The highest of the following k products (one for each tag t′):
V(i − 1, t′)
× the transition (bigram) probability from t′ to t
× the probability of the ith word given PoS tag t
Tag Probabilities at i = 2
[Figure: trellis columns for flies and like, showing V(2, V) = 0.00031, V(2, N) = 1.3 · 10⁻⁵, V(2, P) = 0.00022 and V(2, ART) = 0.]
Tag Probabilities at i = 3
[Figure: trellis columns for flies, like and flowers, showing V(3, V) = 1.6533 · 10⁻⁹ and V(3, N) = 5.684 · 10⁻⁶.]
Most Probable Sequence of Tags
[Figure: the same trellis with the most probable path highlighted: flies/N, like/V, flowers/N, i.e. the tag sequence N V N, with probability V(3, N) = 5.684 · 10⁻⁶.]
Training HMMs
Lexical generation probabilities
$$P(w \mid t) = \frac{\text{count}(w, t)}{\sum_{w'} \text{count}(w', t)}$$
count(w, t): the number of times word w is tagged with t
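A minimal sketch of this maximum-likelihood estimate from a tagged corpus (hypothetical names; the corpus is assumed to be a list of (word, tag) pairs):

```python
from collections import Counter, defaultdict

# Estimate P(w | t) = count(w, t) / sum over w' of count(w', t).
def estimate_emissions(tagged_corpus):
    pair_counts = Counter(tagged_corpus)                    # count(w, t)
    tag_counts = Counter(tag for _, tag in tagged_corpus)   # sum_w' count(w', t)
    emissions = defaultdict(dict)
    for (word, tag), c in pair_counts.items():
        emissions[tag][word] = c / tag_counts[tag]
    return emissions
```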
Training HMMs (cont.)
Bigram probabilities
$$P(t \mid t') = \frac{\text{count}(t', t)}{\sum_{t''} \text{count}(t', t'')}$$
count(t, t′): the number of times tag t is followed by tag t′
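And a matching sketch for the transition probabilities, estimated from per-sentence tag sequences (again hypothetical names; the start state φ is prepended so that P(t | φ) is estimated as well):

```python
from collections import Counter, defaultdict

# Estimate P(t | t') = count(t', t) / sum over t'' of count(t', t'').
def estimate_transitions(tag_sequences, start="phi"):
    bigram_counts = Counter()   # count(t', t): times t' is followed by t
    prev_counts = Counter()     # sum_t'' count(t', t'')
    for tags in tag_sequences:
        prev = start
        for tag in tags:
            bigram_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
            prev = tag
    transitions = defaultdict(dict)
    for (t_prev, t), c in bigram_counts.items():
        transitions[t_prev][t] = c / prev_counts[t_prev]
    return transitions
```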
Next Topic: Parsing with Context-Free Grammar
Grammar and language
Phrase structure
Context-Free Grammar (CFG)
Basic parsing algorithms
Parsing and search
Chart parsing