
Sequence Tagging:
Hidden Markov Models
COMP90042
Natural Language Processing
Lecture 6
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE

The Markov Chain
The Markov chain described above is also called the observable Markov model, because the output of the process is the set of states at each time instant, where each state corresponds to an observable event X_i. In other words, there is a one-to-one correspondence between the observable event sequence X and the Markov chain state sequence S = s_1, s_2, …, s_n. Consider a simple three-state Markov chain for the Dow Jones Industrial average as shown in Figure 8.1. At the end of each day, the Dow Jones Industrial average may correspond to one of the following states:
state 1 – up (in comparison to the index of the previous day)
state 2 – down (in comparison to the index of the previous day)
state 3 – unchanged (in comparison to the index of the previous day)
[Figure 8.1: A Markov chain for the Dow Jones Industrial average. The three states represent up, down, and unchanged respectively.]
The parameters for this Dow Jones Markov chain include a state-transition probability matrix A = {a_ij} and initial state probabilities π_j, with
∑_{j=1}^{N} π_j = 1
∑_{j=1}^{N} a_ij = 1,  1 ≤ i ≤ N
1

COMP90042
L6
POS Tagging Recap
• Janet will back the bill
• Janet/NNP will/MD back/VB the/DT bill/NN
• Local classifier: prone to error propagation
• What about treating the full sequence as a “class”?
‣ Output: “NNP_MD_VB_DT_NN”
• Problems:
‣ Exponentially many combinations: |Tags|^M, for sequence length M
‣ How to tag sequences of different lengths?
2

COMP90042
L6
A Better Approach
• Tagging is a sentence-level task, but as humans we decompose it into small word-level tasks:
‣ Janet/NNP will/MD back/VB the/DT bill/NN
• Solution:
‣ Define a model that decomposes the process into individual word-level steps
‣ But that takes into account the whole sequence when learning and predicting (no error propagation)
• This is the idea of sequence labelling, and more generally, structured prediction.
3

COMP90042
L6
A Probabilistic Model
• Goal: obtain the best tag sequence t from the sentence w
‣ t̂ = argmax_t P(t|w)
‣ t̂ = argmax_t P(w|t)P(t)/P(w) = argmax_t P(w|t)P(t)   [Bayes]
• Let’s decompose:
‣ P(w|t) = ∏_{i=1}^{n} P(w_i|t_i)   [prob. of a word depends only on its tag]
‣ P(t) = ∏_{i=1}^{n} P(t_i|t_{i-1})   [prob. of a tag depends only on the previous tag]
• These are independence assumptions (as in bigram language models)
• This is a Hidden Markov Model (HMM)
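To make the decomposition concrete, here is a minimal sketch (not from the slides) that scores one candidate tag sequence under these assumptions; the probability tables are invented for illustration, and "<s>" is the sentence-start symbol introduced a few slides later.

# Toy emission P(word|tag) and transition P(tag|previous tag) tables (made-up values)
emission   = {("Janet", "NNP"): 3.2e-5, ("will", "MD"): 0.31, ("back", "VB"): 6.7e-4}
transition = {("<s>", "NNP"): 0.28, ("NNP", "MD"): 0.011, ("MD", "VB"): 0.80}

def score(words, tags):
    """P(w|t) * P(t) under the HMM independence assumptions."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return p

score(["Janet", "will", "back"], ["NNP", "MD", "VB"])   # probability of this particular tagging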
4

COMP90042
L6
Hidden Markov Model
t̂ = argmax_t P(w|t) P(t)
P(w|t) = ∏_{i=1}^{n} P(w_i|t_i)
P(t) = ∏_{i=1}^{n} P(t_i|t_{i-1})
• Why “Markov”?
‣ Because it assumes the sequence follows a Markov chain: the probability of an event (tag) depends only on the previous event (the last tag)
• Why “Hidden”?
‣ Because the events (tags) are not seen: the goal is to find the best sequence
5

COMP90042
L6
HMMs – Training
• Parameters are the individual probabilities P(w_i|t_i) and P(t_i|t_{i-1})
‣ Respectively, emission (O) and transition (A) probabilities
• Training uses Maximum Likelihood Estimation (MLE)
‣ In Naïve Bayes & n-gram LMs, this is done by simply counting
word frequencies according to the class.
• We do exactly the same in HMMs!
‣ P(like|VB) = count(VB, like) / count(VB)
‣ P(NN|DT) = count(DT, NN) / count(DT)
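A minimal sketch (not from the slides) of this counting step, assuming training data given as sentences of (word, tag) pairs; the "<s>" and "</s>" symbols are the ones introduced on the next slide.

from collections import Counter

tagged_corpus = [[("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")]]

emission_counts, transition_counts, tag_counts = Counter(), Counter(), Counter()
for sentence in tagged_corpus:
    prev = "<s>"                               # start-of-sentence symbol
    for word, tag in sentence:
        emission_counts[(tag, word)] += 1      # for P(word | tag)
        transition_counts[(prev, tag)] += 1    # for P(tag | previous tag)
        tag_counts[tag] += 1
        prev = tag
    transition_counts[(prev, "</s>")] += 1     # end-of-sentence symbol
    tag_counts["<s>"] += 1                     # one start symbol per sentence

# MLE estimates, e.g. P(bill | NN) and P(NN | DT)
p_bill_given_NN = emission_counts[("NN", "bill")] / tag_counts["NN"]
p_NN_given_DT   = transition_counts[("DT", "NN")] / tag_counts["DT"]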
6

COMP90042
L6
HMMs – Training
• What about the first tag?
• Assume we have a symbol “<s>” that represents the start of your sentence.
‣ P(NN|<s>) = count(<s>, NN) / count(<s>)
• What about the last tag?
• Assume we have a symbol “</s>” that represents the end of the sentence.
• What about unseen (word, tag) and (tag, previous tag) combinations?
‣ Smoothing techniques, as in NB / n-gram LMs
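Continuing the counting sketch above (again an illustration, not the lecture's prescribed recipe), one simple way to give unseen (word, tag) pairs a non-zero probability is add-k smoothing over a fixed vocabulary:

def p_word_given_tag(word, tag, emission_counts, tag_counts, vocab_size, k=0.1):
    """Add-k smoothed emission probability P(word | tag) from count dictionaries."""
    return (emission_counts.get((tag, word), 0) + k) / (tag_counts.get(tag, 0) + k * vocab_size)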
7

COMP90042
L6
Transition Matrix
8

COMP90042
L6
Emission (Observation) Matrix
9

COMP90042
L6
HMMs – Prediction (Decoding)
t̂ = argmax_t P(w|t) P(t)
  = argmax_t ∏_{i=1}^{n} P(w_i|t_i) P(t_i|t_{i-1})
• Simple idea: for each word, take the tag that maximises P(w_i|t_i) P(t_i|t_{i-1}). Do it left-to-right, in greedy fashion.
• This is wrong! We are looking for argmax_t over the whole sequence, not individual argmax_{t_i} terms.
‣ This is a local classifier: error propagation
• Correct way: take all possible tag combinations, evaluate them, take the max (like Naïve Bayes)
‣ Problem: exponential number of sequences
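For intuition, a small sketch (not from the slides) contrasting the greedy left-to-right choice with exhaustive search over all |Tags|^M sequences; the emission/transition tables and score() are assumed to be the toy ones from the earlier sketch.

from itertools import product

def greedy_decode(words, tagset, emission, transition):
    """Pick the locally best tag for each word, left to right (prone to error propagation)."""
    out, prev = [], "<s>"
    for w in words:
        best = max(tagset, key=lambda t: emission.get((w, t), 0.0) * transition.get((prev, t), 0.0))
        out.append(best)
        prev = best
    return out

def exhaustive_decode(words, tagset, score):
    """Score every one of the |Tags|^M sequences: correct, but exponential in M."""
    return max(product(tagset, repeat=len(words)), key=lambda seq: score(words, list(seq)))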
10

COMP90042
L6
The Viterbi Algorithm
• Dynamic Programming to the rescue!
‣ We can still proceed sequentially, as long as we are careful.
• “can play” -> can/MD play/VB
‣ Best tag for “can” is easy: argmax_t P(can|t) P(t|<s>)
‣ We can do that because the first “tag” is always “<s>”
• Suppose the best tag for “can” is NN. To get the tag for “play”, we could take argmax_t P(play|t) P(t|NN), but this is wrong.
• Instead, we keep track of the scores for each tag for “can” and check what would happen if “can” had a different tag.
11

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back       the        bill
NNP
MD
VB
JJ
NN
RB
DT
12

COMP90042
L6
The Viterbi Algorithm
       Janet                        will   back   the   bill
NNP    P(Janet|NNP) * P(NNP|<s>)
MD     P(Janet|MD) * P(MD|<s>)
VB
JJ
NN
RB
DT
13

COMP90042
L6
Transition and Emission Matrix
14

COMP90042
L6
The Viterbi Algorithm
       Janet               will   back   the   bill
NNP    0.000032 * 0.2767
MD     0 * 0.0006
VB
JJ
NN
RB
DT
15

COMP90042
L6
The Viterbi Algorithm
       Janet        will   back   the   bill
NNP    8.8544e-06
MD     0
VB     0
JJ     0
NN     0
RB     0
DT     0
16

COMP90042
L6
The Viterbi Algorithm
       Janet        will                                              back   the   bill
NNP    8.8544e-06   P(will|NNP) * P(NNP|t_Janet) * s(t_Janet|Janet)
MD     0
VB     0
JJ     0
NN     0
RB     0
DT     0
17

COMP90042
L6
The Viterbi Algorithm
       Janet        will                                              back   the   bill
NNP    8.8544e-06   P(will|NNP) * P(NNP|t_Janet) * s(t_Janet|Janet)
MD     0
VB     0
JJ     0
NN     0
RB     0
DT     0

Calculate this for all tags, take the max:
max( P(will | NNP) * P(NNP | NNP) * s(NNP | Janet),
     P(will | NNP) * P(NNP | MD) * s(MD | Janet),
     …
     P(will | NNP) * P(NNP | DT) * s(DT | Janet) )
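In code, filling one cell of this trellis is just a max over the previous column; a one-line sketch with assumed names (alpha_Janet holds the stored scores for "Janet", P_trans and P_emit are the transition and emission tables):

alpha_will_NNP = max(alpha_Janet[t_prev] * P_trans[(t_prev, "NNP")]
                     for t_prev in tagset) * P_emit[("will", "NNP")]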
18

COMP90042
L6
Transition and Emission Matrix
19

COMP90042
L6
The Viterbi Algorithm
       Janet        will                                     back   the   bill
NNP    8.8544e-06   0 * P(NNP|t_Janet) * s(t_Janet|Janet)
MD     0
VB     0
JJ     0
NN     0
RB     0
DT     0
20

COMP90042
L6
The Viterbi Algorithm
       Janet        will                                             back   the   bill
NNP    8.8544e-06   0
MD     0            P(will|MD) * P(MD|t_Janet) * s(t_Janet|Janet)
VB     0
JJ     0
NN     0
RB     0
DT     0

max( P(will | MD) * P(MD | NNP) * s(NNP | Janet),
     …
     P(will | MD) * P(MD | MD) * s(MD | Janet),
     …
     P(will | MD) * P(MD | DT) * s(DT | Janet) )
21

COMP90042
L6
The Viterbi Algorithm
       Janet        will       back   the   bill
NNP    8.8544e-06   0
MD     0            3.004e-8
VB     0
JJ     0
NN     0
RB     0
DT     0
22

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back   the   bill
NNP    8.8544e-06   0
MD     0            3.004e-8
VB     0            2.231e-13
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
23

COMP90042
L6
Transition and Emission Matrix
24

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back   the   bill
NNP    8.8544e-06   0
MD     0            3.004e-8
VB     0            2.231e-13
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
25

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back                                         the   bill
NNP    8.8544e-06   0           0
MD     0            3.004e-8    0
VB     0            2.231e-13   P(back|VB) * P(VB|t_will) * s(t_will|will)
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
26

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back                                   the   bill
NNP    8.8544e-06   0           0
MD     0            3.004e-8    0
VB     0            2.231e-13   MD: 1.6e-11  VB: 7.5e-19  NN: 9.7e-17
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
27

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back                                   the   bill
NNP    8.8544e-06   0           0
MD     0            3.004e-8    0
VB     0            2.231e-13   MD: 1.6e-11  VB: 7.5e-19  NN: 9.7e-17
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
28

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back      the   bill
NNP    8.8544e-06   0           0
MD     0            3.004e-8    0
VB     0            2.231e-13   1.6e-11
JJ     0            0
NN     0            1.034e-10
RB     0            0
DT     0            0
29

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back      the   bill
NNP    8.8544e-06   0           0
MD     0            3.004e-8    0
VB     0            2.231e-13   1.6e-11
JJ     0            0           5.1e-15
NN     0            1.034e-10   5.4e-15
RB     0            0           5.3e-11
DT     0            0           0
30

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back      the       bill
NNP    8.8544e-06   0           0         2.5e-17
MD     0            3.004e-8    0         0
VB     0            2.231e-13   1.6e-11   0
JJ     0            0           5.1e-15   5.2e-16
NN     0            1.034e-10   5.4e-15   5.9e-18
RB     0            0           5.3e-11   0
DT     0            0           0         1.8e-12
31

COMP90042
L6
The Viterbi Algorithm
       Janet        will        back      the       bill
NNP    8.8544e-06   0           0         2.5e-17   0
MD     0            3.004e-8    0         0         0
VB     0            2.231e-13   1.6e-11   0         1.0e-20
JJ     0            0           5.1e-15   5.2e-16   0
NN     0            1.034e-10   5.4e-15   5.9e-18   2.0e-15
RB     0            0           5.3e-11   0         0
DT     0            0           0         1.8e-12   0
32

COMP90042
L6
The Viterbi Algorithm
       Janet/NNP    will/MD     back/VB   the/DT    bill/NN
NNP    8.8544e-06   0           0         2.5e-17   0
MD     0            3.004e-08   0         0         0
VB     0            2.231e-13   1.6e-11   0         1.0e-20
JJ     0            0           5.1e-15   5.2e-16   0
NN     0            1.034e-10   5.4e-15   5.9e-18   2.0e-15
RB     0            0           5.3e-11   0         0
DT     0            0           0         1.8e-12   0
39

COMP90042
L6
The Viterbi Algorithm
• Complexity: O(T²N), where T is the size of the tagset and N is the length of the sequence.
‣ A T × N matrix, where each cell performs T operations.
• Why does it work?
‣ Because of the independence assumptions that decompose the problem (specifically, the Markov property). Without these, we cannot apply DP.
40

COMP90042
L6
Viterbi Pseudocode

alpha = np.zeros((M, T))                 # alpha[i, t]: best score ending in tag t at word i
back = np.zeros((M, T), dtype=int)       # backpointers
for t in range(T):
    alpha[0, t] = pi[t] * O[w[0], t]
for i in range(1, M):
    for t_i in range(T):
        for t_last in range(T):          # t_last means t_{i-1}
            s = alpha[i-1, t_last] * A[t_last, t_i] * O[w[i], t_i]
            if s > alpha[i, t_i]:
                alpha[i, t_i] = s
                back[i, t_i] = t_last
best = np.argmax(alpha[M-1, :])          # index of the best final tag
return backtrace(best, back)

• Good practice: work with log probabilities to prevent underflow (multiplications become sums)
• Vectorisation (use matrix-vector operations)
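Following those two bullets, here is a self-contained sketch (not from the slides) of a log-space, partly vectorised Viterbi decoder in NumPy; the argument shapes and the toy numbers are assumptions made for the example.

import numpy as np

def viterbi_log(pi, A, O, obs):
    """pi: (T,) start probs; A: (T, T) with A[s, t] = P(t|s);
    O: (V, T) with O[w, t] = P(w|t); obs: list of word ids."""
    M, T = len(obs), len(pi)
    with np.errstate(divide="ignore"):                 # log(0) -> -inf is what we want
        log_pi, log_A, log_O = np.log(pi), np.log(A), np.log(O)
    alpha = np.full((M, T), -np.inf)
    back = np.zeros((M, T), dtype=int)
    alpha[0] = log_pi + log_O[obs[0]]
    for i in range(1, M):
        # scores[s, t] = alpha[i-1, s] + log P(t|s) + log P(w_i|t), for all s, t at once
        scores = alpha[i - 1][:, None] + log_A + log_O[obs[i]][None, :]
        back[i] = scores.argmax(axis=0)
        alpha[i] = scores.max(axis=0)
    tags = [int(alpha[-1].argmax())]                   # best final tag, then follow backpointers
    for i in range(M - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

# Toy run with made-up numbers: 2 tags, 3 word types
pi = np.array([0.7, 0.3])
A  = np.array([[0.6, 0.4], [0.5, 0.5]])
O  = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])    # rows: words, columns: tags
print(viterbi_log(pi, A, O, obs=[0, 2, 1]))            # best tag id for each word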
41

COMP90042
L6
HMMs In Practice
• We saw HMM taggers based on bigrams; state-of-the-art HMM taggers use tag trigrams:
‣ P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})   (Viterbi is now O(T³N))
• Need to deal with sparsity: some tag trigram sequences might not be present in the training data
‣ Interpolation: P(t_i | t_{i-1}, t_{i-2}) = λ3 P̂(t_i | t_{i-1}, t_{i-2}) + λ2 P̂(t_i | t_{i-1}) + λ1 P̂(t_i)
‣ with λ1 + λ2 + λ3 = 1
• With additional features, reaches 96.5% accuracy on the Penn Treebank (Brants, 2000)
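A minimal sketch (not from the lecture) of how the interpolated trigram transition above could be computed from tag-tuple count dictionaries; the count structures and the default λ values are assumptions (TnT sets the λs by deleted interpolation).

def interp_trigram(t, prev1, prev2, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P(t | prev1, prev2) = l3*P^(t|prev1,prev2) + l2*P^(t|prev1) + l1*P^(t).
    prev1 = t_{i-1}, prev2 = t_{i-2}; uni/bi/tri are dicts of tag-tuple counts."""
    l1, l2, l3 = lambdas                                    # l1 + l2 + l3 = 1
    p1 = uni.get((t,), 0) / max(sum(uni.values()), 1)
    p2 = bi.get((prev1, t), 0) / max(uni.get((prev1,), 0), 1)
    p3 = tri.get((prev2, prev1, t), 0) / max(bi.get((prev2, prev1), 0), 1)
    return l3 * p3 + l2 * p2 + l1 * p1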
42

COMP90042
L6
Other Variant Taggers
• HMM is generative
‣ allows for unsupervised HMMs: learn model without any tagged data!
43

COMP90042
L6
Other Variant Taggers
• Discriminative models describe P(t | w) directly
‣ support richer feature sets, generally better accuracy when trained over large supervised datasets
‣ E.g., Maximum Entropy Markov Model (MEMM), Conditional random field (CRF), Connectionist Temporal Classification (CTC)
‣ Most deep learning models of sequences are discriminative (e.g., encoder-decoders for translation), similar to an MEMM
44

COMP90042
L6
HMMs in NLP
• HMMs are highly effective for part-of-speech tagging
‣ trigram HMM gets 96.5% accuracy (TnT)
‣ related models are state of the art:
‣ MEMMs      97%
‣ CRFs       97.6%
‣ Deep CRF   97.9%
‣ English Penn Treebank tagging accuracy: https://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
• Apply out-of-the-box to other sequence labelling tasks
‣ named entity recognition, shallow parsing, alignment …
‣ In other fields: DNA, protein sequences, image lattices …
45

COMP90042
L6
A Final Word
• HMMs are a simple, yet effective way to perform sequence labelling.
• Can still be competitive, and fast. A natural baseline for other sequence labelling tasks.
• Main drawback: not very flexible in terms of feature representation, compared to MEMMs and CRFs.
46

COMP90042
L6
Readings
• JM3, Appendix A, sections A.1-A.2 and A.4
• See also E18, Chapter 7.3
• References:
‣ Rabiner’s HMM tutorial http://tinyurl.com/2hqaf8
‣ Lafferty et al, Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001)
47