
COMP90042
Natural Language Processing

Lecture 6
Sequence Tagging: Hidden Markov Models

Semester 1 2021, Week 3
Jey Han Lau

COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE


POS Tagging Recap

• Janet will back the bill

• Janet/NNP will/MD back/VB the/DT bill/NN

• Local classifier: prone to error propagation

• What about treating the full sequence as a “class”?
‣ Output: “NNP_MD_VB_DT_NN”

• Problems:
‣ Exponentially many combinations: |Tags|^M, for a sequence of length M
‣ How to tag sequences of different lengths?


A Better Approach

• Tagging is a sentence-level task, but as humans we decompose it into
small word-level tasks.
‣ Janet/NNP will/MD back/VB the/DT bill/NN

• Solution:
‣ Define a model that decomposes the process into individual
word-level steps
‣ But that takes into account the whole sequence when
learning and predicting (no error propagation)

• This is the idea of sequence labelling, and more
generally, structured prediction.


A Probabilistic Model
• Goal: obtain the best tag sequence t for a sentence w

\hat{t} = \arg\max_{t} P(t \mid w)
        = \arg\max_{t} \frac{P(w \mid t)\, P(t)}{P(w)}    [Bayes]
        = \arg\max_{t} P(w \mid t)\, P(t)    [P(w) does not depend on t, so it can be dropped]

• Let’s decompose:

P(w \mid t) = \prod_{i=1}^{n} P(w_i \mid t_i)    [Prob. of a word depends only on its tag]

P(t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})    [Prob. of a tag depends only on the previous tag]

• This is a Hidden Markov Model (HMM)


Two Assumptions of HMM

• Output independence
‣ An observed event (word) depends only on the hidden state (tag)
‣ P(w \mid t) = \prod_{i=1}^{n} P(w_i \mid t_i)

• Markov assumption
‣ The current state (tag) depends only on the previous state
‣ P(t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})

\hat{t} = \arg\max_{t} P(w \mid t)\, P(t)


HMMs – Training
• Parameters are the individual probabilities
‣ P(w_i \mid t_i) = emission (O) probabilities
‣ P(t_i \mid t_{i-1}) = transition (A) probabilities

• Training uses Maximum Likelihood Estimation (MLE)
‣ This is done by simply counting word frequencies according to
their tags (just like N-gram LMs!); a counting sketch follows below

P(like | VB) = count(VB, like) / count(VB)

P(NN | DT) = count(DT, NN) / count(DT)
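The counting above maps directly to code. The following is a minimal sketch, not the subject’s reference implementation; the toy corpus, the variable names and the “<s>” start-symbol handling are illustrative assumptions.

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs (illustrative only).
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("they", "PRP"), ("like", "VB"), ("the", "DT"), ("bill", "NN")],
]

emission_counts = defaultdict(lambda: defaultdict(int))    # count(tag, word)
transition_counts = defaultdict(lambda: defaultdict(int))  # count(prev_tag, tag)
tag_counts = defaultdict(int)                              # count(tag), including "<s>"

for sentence in corpus:
    prev_tag = "<s>"                       # start-of-sentence symbol
    tag_counts[prev_tag] += 1
    for word, tag in sentence:
        emission_counts[tag][word] += 1
        transition_counts[prev_tag][tag] += 1
        tag_counts[tag] += 1
        prev_tag = tag

def emission_prob(word, tag):
    """MLE estimate of P(word | tag) = count(tag, word) / count(tag)."""
    return emission_counts[tag][word] / tag_counts[tag]

def transition_prob(tag, prev_tag):
    """MLE estimate of P(tag | prev_tag) = count(prev_tag, tag) / count(prev_tag)."""
    return transition_counts[prev_tag][tag] / tag_counts[prev_tag]

print(emission_prob("like", "VB"))    # count(VB, like) / count(VB)
print(transition_prob("NN", "DT"))    # count(DT, NN) / count(DT)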


HMMs – Training
• What about the first tag?
‣ Assume we have a symbol “<s>” that represents the start of the sentence

P(NN | <s>) = count(<s>, NN) / count(<s>)

• What about unseen (word, tag) and (tag, previous_tag)
combinations?
‣ Smoothing techniques (a small sketch follows below)
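As one illustration of such smoothing (add-k/Laplace is assumed here purely for illustration; the lecture does not commit to a particular technique), reusing the count dictionaries from the sketch above:

def smoothed_transition_prob(tag, prev_tag, transition_counts, tag_counts, tagset, k=1.0):
    """Add-k smoothed estimate of P(tag | prev_tag).

    Unseen (prev_tag, tag) pairs get probability k / (count(prev_tag) + k*|tagset|)
    instead of zero. transition_counts and tag_counts are the defaultdicts built
    in the counting sketch above.
    """
    numerator = transition_counts[prev_tag][tag] + k
    denominator = tag_counts[prev_tag] + k * len(tagset)
    return numerator / denominator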


Transition Matrix


Emission (Observation) Matrix


HMMs – Prediction (Decoding)

• Simple idea: for each word, take the tag that maximises
P(w_i | t_i) P(t_i | t_{i-1}). Do it left-to-right, in greedy fashion
(a sketch of this greedy decoder follows below).

\hat{t}_i = \arg\max_{t_i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

• This is wrong! We are looking for the argmax over the whole tag
sequence t, not the individual terms.
‣ This is a local classifier: error propagation

• Correct way: consider all possible tag combinations, evaluate
them, take the max

\hat{t} = \arg\max_{t} P(w \mid t)\, P(t)
        = \arg\max_{t} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
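A minimal sketch of that greedy, local decoder, reusing the emission_prob/transition_prob helpers assumed in the training sketch above (names are illustrative):

def greedy_decode(words, tagset, emission_prob, transition_prob):
    """Tag left-to-right, picking the locally best tag at each step.

    Early mistakes feed into later decisions (error propagation), which is
    exactly why the Viterbi algorithm is used instead.
    """
    tags = []
    prev_tag = "<s>"
    for word in words:
        best_tag = max(tagset,
                       key=lambda t: emission_prob(word, t) * transition_prob(t, prev_tag))
        tags.append(best_tag)
        prev_tag = best_tag
    return tags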


What is the complexity of such a system that
considers all possible tag combinations?
If sentence length = N and tag size = T

• NT
• N^T
• T^N
• N

PollEv.com/jeyhanlau569


The Viterbi Algorithm
• Dynamic Programming to the rescue!
‣ We can still proceed sequentially, as long as we are careful.

• POS tag: “can play”

• Best tag for “can” is easy:
argmax_t P(can | t) P(t | <s>)

• Suppose the best tag for “can” is NN. To get the tag for “play”, we
could take argmax_t P(play | t) P(t | NN), but this is wrong.

• Instead, we keep track of the scores for each tag of “can” and
check them against the different tags of “play”.
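Stated as a recurrence (this is implicit in the table-filling slides that follow, written out here for reference):

\alpha_1(t) = P(w_1 \mid t)\, P(t \mid \text{<s>})
\alpha_i(t) = P(w_i \mid t)\, \max_{t'} \big[\, \alpha_{i-1}(t')\, P(t \mid t') \,\big], \quad i = 2, \dots, n

Here \alpha_i(t) is the score of the best tag sequence for w_1 … w_i that ends in tag t; a backpointer records the maximising t' so the best full sequence can be read off at the end.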


The Viterbi Algorithm
        Janet    will    back    the    bill
NNP
MD
VB
JJ
NN
RB
DT


The Viterbi Algorithm
        Janet                        will    back    the    bill
NNP     P(Janet|NNP) * P(NNP|<s>)
MD      …
VB      …
JJ      …
NN      …
RB      …
DT      …


P(Janet | NNP) = ?; P(NNP | <s>) = ?


The Viterbi Algorithm
        Janet                               will    back    the    bill
NNP     0.000032 * 0.2767 = 8.8544e-06
MD      P(Janet|MD) * P(MD|<s>)
VB      …
JJ      …
NN      …
RB      …
DT      …


P(Janet | MD) = ?; P(MD | <s>) = ?


The Viterbi Algorithm
        Janet        will    back    the    bill
NNP     8.8544e-06
MD      0
VB      0
JJ      0
NN      0
RB      0
DT      0


P(Janet | VB/JJ/etc) = ?


The Viterbi Algorithm
        Janet        will                                              back    the    bill
NNP     8.8544e-06   P(will|NNP) * P(NNP|t_Janet) * s(t_Janet|Janet)
MD      0            …
VB      0            …
JJ      0            …
NN      0            …
RB      0            …
DT      0            …

Consider all possible tags, and take the maximum:

max( P(will | NNP) * P(NNP | NNP) * s(NNP | Janet),
     P(will | NNP) * P(NNP | MD) * s(MD | Janet),
     …
     P(will | NNP) * P(NNP | DT) * s(DT | Janet) )

= P(will | NNP) * P(NNP | NNP) * 8.8544e-06


P(will | NNP) = ?; P(NNP | NNP) = ?


The Viterbi Algorithm
        Janet        will    back    the    bill
NNP     8.8544e-06   0
MD      0            …
VB      0            …
JJ      0            …
NN      0            …
RB      0            …
DT      0            …


The Viterbi Algorithm
        Janet        will                                            back    the    bill
NNP     8.8544e-06   0
MD      0            P(will|MD) * P(MD|t_Janet) * s(t_Janet|Janet)
VB      0            …
JJ      0            …
NN      0            …
RB      0            …
DT      0            …

max( P(will | MD) * P(MD | NNP) * s(NNP | Janet),
     P(will | MD) * P(MD | MD) * s(MD | Janet),
     …
     P(will | MD) * P(MD | DT) * s(DT | Janet) )

= P(will | MD) * P(MD | NNP) * 8.8544e-06


The Viterbi Algorithm
        Janet        will        back    the    bill
NNP     8.8544e-06   0
MD      0            3.004e-08
VB      0            …
JJ      0            …
NN      0            …
RB      0            …
DT      0            …

max( P(will | MD) * P(MD | NNP) * s(NNP | Janet),
     P(will | MD) * P(MD | MD) * s(MD | Janet),
     …
     P(will | MD) * P(MD | DT) * s(DT | Janet) )

= P(will | MD) * P(MD | NNP) * 8.8544e-06


The Viterbi Algorithm
        Janet        will        back    the    bill
NNP     8.8544e-06   0
MD      0            3.004e-08
VB      0            2.231e-13
JJ      0            0
NN      0            1.034e-10
RB      0            0
DT      0            0


Transition and Emission Matrix


The Viterbi Algorithm
        Janet        will        back                                        the    bill
NNP     8.8544e-06   0           0
MD      0            3.004e-08   0
VB      0            2.231e-13   P(back|VB) * P(VB|t_will) * s(t_will|will)
JJ      0            0
NN      0            1.034e-10
RB      0            0
DT      0            0


The Viterbi Algorithm
        Janet        will        back                                      the    bill
NNP     8.8544e-06   0           0
MD      0            3.004e-08   0
VB      0            2.231e-13   MD: 1.6e-11   VB: 7.5e-19   NN: 9.7e-17
JJ      0            0
NN      0            1.034e-10
RB      0            0
DT      0            0


The Viterbi Algorithm
        Janet        will        back       the    bill
NNP     8.8544e-06   0           0
MD      0            3.004e-08   0
VB      0            2.231e-13   1.6e-11
JJ      0            0
NN      0            1.034e-10
RB      0            0
DT      0            0


The Viterbi Algorithm
        Janet        will        back       the    bill
NNP     8.8544e-06   0           0
MD      0            3.004e-08   0
VB      0            2.231e-13   1.6e-11
JJ      0            0           5.1e-15
NN      0            1.034e-10   5.4e-15
RB      0            0           5.3e-11
DT      0            0           0


The Viterbi Algorithm
        Janet        will        back       the        bill
NNP     8.8544e-06   0           0          2.5e-17
MD      0            3.004e-08   0          0
VB      0            2.231e-13   1.6e-11    0
JJ      0            0           5.1e-15    0
NN      0            1.034e-10   5.4e-15    0
RB      0            0           5.3e-11    0
DT      0            0           0          1.8e-12


Done!
        Janet        will        back       the        bill
NNP     8.8544e-06   0           0          2.5e-17    0
MD      0            3.004e-08   0          0          0
VB      0            2.231e-13   1.6e-11    0          1.0e-20
JJ      0            0           5.1e-15    0          0
NN      0            1.034e-10   5.4e-15    0          2.0e-15
RB      0            0           5.3e-11    0          0
DT      0            0           0          1.8e-12    0


Going Backwards To Get The Best Tag Sequence

• Take the best cell in the final column (NN for “bill”), then follow the
backpointers right-to-left to read off the best tag for each word.



Final Tagging Results

Janet/NNP will/MD back/VB the/DT bill/NN


PollEv.com/jeyhanlau569

Word Emission Probabilities
        they    fish
N       0.7     0.3
V       0.4     0.6

Transition Probabilities
        N       V
<s>     0.6     0.4
N       0.8     0.2
V       0.7     0.3

What is the optimal POS tag sequence for:
they fish



THEY FISH

• N for “they”: P(they | N) x P(N | <s>) = 0.7 x 0.6 = 0.42
• V for “they”: P(they | V) x P(V | <s>) = 0.4 x 0.4 = 0.16

• N for “fish”: max( P(fish | N) x P(N | N) x 0.42 = 0.3 x 0.8 x 0.42 = 0.1,
                     P(fish | N) x P(N | V) x 0.16 = 0.3 x 0.7 x 0.16 = 0.03 ) = 0.1
• V for “fish”: max( P(fish | V) x P(V | N) x 0.42 = 0.6 x 0.2 x 0.42 = 0.05,
                     P(fish | V) x P(V | V) x 0.16 = 0.6 x 0.3 x 0.16 = 0.03 ) = 0.05

• The best final score is 0.1 (N for “fish”), reached from N for “they”.

Answer: N N


The Viterbi Algorithm

• Complexity: O(T^2 N), where T is the size of the
tagset and N is the length of the sequence.
‣ T x N matrix; each cell performs T operations.

• Why does it work?
‣ Because of the independence assumptions that
decompose the problem.
‣ Without these, we cannot apply DP.


Viterbi Pseudocode

• Good practice: work with log probabilities to prevent
underflow (multiplications become sums)

• Vectorisation (use matrix-vector operations)

alpha = np.zeros((M, T))            # alpha[i, t]: best score ending in tag t at word i
back = np.zeros((M, T), dtype=int)  # back[i, t]: best previous tag (backpointer)

for t in range(T):
    alpha[0, t] = pi[t] * O[w[0], t]    # pi[t] = P(t | <s>)

for i in range(1, M):
    for t_i in range(T):
        for t_last in range(T):  # t_last means t_{i-1}
            s = alpha[i-1, t_last] * A[t_last, t_i] * O[w[i], t_i]
            if s > alpha[i, t_i]:
                alpha[i, t_i] = s
                back[i, t_i] = t_last

best = np.argmax(alpha[M-1, :])     # index of the best final tag, needed for backtracing
return backtrace(best, back)
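For concreteness, here is a self-contained, runnable version of the same procedure on the they/fish example from the poll; the array layout and variable names are assumptions of this sketch, and in practice the log-probability variant noted above is preferred.

import numpy as np

tags = ["N", "V"]
words = ["they", "fish"]
w = [0, 1]                                 # word indices into the rows of O

pi = np.array([0.6, 0.4])                  # P(tag | <s>) for N, V
A = np.array([[0.8, 0.2],                  # P(next tag | N)
              [0.7, 0.3]])                 # P(next tag | V)
O = np.array([[0.7, 0.4],                  # P(they | N), P(they | V)
              [0.3, 0.6]])                 # P(fish | N), P(fish | V)

M, T = len(words), len(tags)
alpha = np.zeros((M, T))                   # best path score per (word, tag)
back = np.zeros((M, T), dtype=int)         # backpointers

for t in range(T):
    alpha[0, t] = pi[t] * O[w[0], t]

for i in range(1, M):
    for t_i in range(T):
        for t_last in range(T):
            s = alpha[i-1, t_last] * A[t_last, t_i] * O[w[i], t_i]
            if s > alpha[i, t_i]:
                alpha[i, t_i] = s
                back[i, t_i] = t_last

# Backtrace: start from the best final tag and follow the backpointers.
best = int(np.argmax(alpha[M-1, :]))
sequence = [best]
for i in range(M-1, 0, -1):
    best = int(back[i, best])
    sequence.append(best)
sequence.reverse()

print([tags[t] for t in sequence])         # ['N', 'N']
print(alpha)                               # [[0.42, 0.16], [0.1008, 0.0504]]

The alpha values match the worked poll example above (0.42 and 0.16 for “they”; roughly 0.1 and 0.05 for “fish”).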


HMMs In Practice
• We saw HMM taggers based on bigrams (first-order HMM)
‣ i.e. the current tag depends only on the immediately previous tag

• State-of-the-art HMM taggers use tag trigrams (second-order HMM)

P(t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})

‣ Viterbi is now O(T^3 N)

• Need to deal with sparsity: some tag trigram sequences might
not be present in the training data
‣ Interpolation (a sketch follows below):
P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_2 \hat{P}(t_i \mid t_{i-1}) + \lambda_1 \hat{P}(t_i)
\lambda_1 + \lambda_2 + \lambda_3 = 1

• With additional features, reaches 96.5% accuracy on the Penn Treebank
(Brants, 2000)
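A minimal sketch of that interpolation; the probability-lookup helpers and the example weights are illustrative assumptions, not from the slides.

def interpolated_trigram_prob(t_i, t_prev1, t_prev2,
                              p_trigram, p_bigram, p_unigram,
                              lambdas=(0.1, 0.3, 0.6)):
    """Interpolate trigram, bigram and unigram MLE tag probabilities.

    lambdas = (lambda_1, lambda_2, lambda_3) must sum to 1; p_* are functions
    returning the MLE estimates P(t_i), P(t_i | t_{i-1}), P(t_i | t_{i-1}, t_{i-2}).
    """
    l1, l2, l3 = lambdas
    return (l3 * p_trigram(t_i, t_prev1, t_prev2)
            + l2 * p_bigram(t_i, t_prev1)
            + l1 * p_unigram(t_i))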


Generative vs. Discriminative Taggers

• HMM is generative

\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} P(W \mid T)\, P(T)
        = \arg\max_{T} \prod_{i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

• A trained HMM can generate data (sentences)! (a sampling sketch follows below)

• This allows for unsupervised HMMs: learn the model without any
tagged data!
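To make the generative claim concrete, a small sketch of ancestral sampling from an HMM; the matrix layout, the fixed sentence length, and the variable names are assumptions of this sketch.

import numpy as np

def sample_sentence(pi, A, O, tags, vocab, length=5, seed=0):
    """Sample a (tags, words) pair from an HMM: first a tag from the start
    distribution, then alternately a word given the tag and the next tag."""
    rng = np.random.default_rng(seed)
    t = rng.choice(len(tags), p=pi)            # first tag ~ P(t | <s>)
    tag_seq, word_seq = [], []
    for _ in range(length):
        tag_seq.append(tags[t])
        w = rng.choice(len(vocab), p=O[:, t])  # word ~ P(w | t); O rows = words
        word_seq.append(vocab[w])
        t = rng.choice(len(tags), p=A[t])      # next tag ~ P(t' | t)
    return tag_seq, word_seq

With the they/fish matrices above, sample_sentence(pi, A, O, ["N", "V"], ["they", "fish"]) draws tag/word sequences distributed according to the model.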


Generative vs. Discriminative Taggers

• Discriminative models describe P(T \mid W) directly

\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \prod_{i} P(t_i \mid w_i, t_{i-1})

‣ Supports a richer feature set, e.g. conditioning on extra features as in
P(t_i \mid w_i, t_{i-1}, x_i, y_i); generally better accuracy when trained
over large supervised datasets

‣ E.g. Maximum Entropy Markov Model (MEMM), Conditional
Random Field (CRF)

‣ Most deep learning models of sequences are discriminative


HMMs in NLP

• HMMs are highly effective for part-of-speech tagging
‣ trigram HMM gets 96.5% accuracy
‣ related models are state of the art
‣ MEMMs 97%
‣ CRFs 97.6%
‣ Deep CRF 97.9%
‣ English Penn Treebank tagging accuracy: https://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)

• Apply out-of-the-box to other sequence labelling tasks
‣ named entity recognition (lecture 18), shallow parsing, …
‣ In other fields: DNA, protein sequences, …



A Final Word

• HMMs are a simple, yet effective way to perform
sequence labelling.

• Can still be competitive, and fast. Natural baseline
for other sequence labelling models.

• Main drawback: not very flexible in terms of feature
representation, compared to MEMMs and CRFs.


Readings

• JM3 Appendix A A.1-A.2, A.4

• See also E18 Chapter 7.3

• References:
‣ Rabiner’s HMM tutorial http://tinyurl.com/2hqaf8
‣ Lafferty et al., Conditional random fields: Probabilistic models for
segmenting and labeling sequence data (2001)
