
Formalizing Hidden Markov Models
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 26, Three Segments

Outline

● Markov Chains for Language Modeling
● Formalizing a Hidden Markov Model
● Computing Likelihood of a sequence: Forward Algorithm


Formalizing Hidden Markov Models
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 26, Segment 1 of 3: Markov Chains

Illustrating Markov Chains

● Predicting sequences of random variables
○ Daily temperature: Xi = {cold, warm, hot}
○ Language modeling: Xi = vocabulary of a given language


Example from Martin & Jurafsky, 3rd Ed., Appendix A

Formal Specification of a Markov Chain Model

● States: e.g., daily temperatures; words
● Transition model: all conditional probabilities for any transition of daily temperatures; of words
● Prior probabilities of daily temperatures; of words (a concrete sketch of this specification follows the figure note below)


Figure from Martin & Jurafsky, 3rd Ed., Appendix A
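To make the specification concrete, here is a minimal Python sketch for the temperature example. The particular transition and prior probabilities are illustrative assumptions, not values taken from the figure.

states = ["cold", "warm", "hot"]

# Transition model: P(next state | current state); each row sums to 1 (assumed values).
transitions = {
    "cold": {"cold": 0.6, "warm": 0.3, "hot": 0.1},
    "warm": {"cold": 0.2, "warm": 0.5, "hot": 0.3},
    "hot":  {"cold": 0.1, "warm": 0.3, "hot": 0.6},
}

# Prior (initial) probability of each state (assumed values).
prior = {"cold": 0.4, "warm": 0.4, "hot": 0.2}

def sequence_probability(seq):
    # P(x1, ..., xn) = P(x1) * product over t of P(x_t | x_{t-1})
    p = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transitions[prev][cur]
    return p

print(sequence_probability(["cold", "warm", "hot"]))  # 0.4 * 0.3 * 0.3 = 0.036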

Probability of a Sentence S: Unigram Version

● A crucial step in speech recognition, language modeling, etc.
● First guess: the product of the unigram probabilities:
P(S) ≈ P(w1) × P(w2) × … × P(wn)

● Given a lattice of word hypotheses for a 3-word sequence, the above formulation is not quite right:


form (183) subsidy (15) for (18,185)

farm (74) subsidies (55) far (570)

Unigram counts from 1.7M words of AP news text

Probability of a Sentence S: Bigram Version

● Next guess: the product of the bigram probabilities:
P(S) ≈ P(w1) × P(w2|w1) × … × P(wn|wn−1)

● Given a lattice of word hypotheses for a 3-word sequence:

● Bigram probabilities are more predictive, but they are very low and the data is sparse!


form subsidy for

farm subsidies far

form subsidy (0) farm subsidy (0) subsidy for (2) subsidies for (6)

form subsidies (0) farm subsidies (4) subsidy far (0) subsidies far (0)

BOTEC: Estimates of What We Can Estimate

● What parameters can be estimated from 100M words of training data?
● Assume a uniform distribution over a vocabulary of 5K words

● Sparse data problem


Event       Distinct events    Quality (expected observations per event)

unigrams    5K                 Excellent: 20,000

bigrams     25M                Tolerable: 4

trigrams    125B               Terrible
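The back-of-the-envelope numbers above can be checked directly; a minimal sketch, assuming 100M training tokens and a uniform 5K-word vocabulary as stated on this slide:

tokens = 100_000_000
vocab = 5_000

for n in (1, 2, 3):
    distinct_events = vocab ** n                   # 5K unigrams, 25M bigrams, 125B trigrams
    expected_per_event = tokens / distinct_events  # roughly 20,000, 4, and 0.0008
    print(n, distinct_events, expected_per_event)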

Markov Assumption for Language Modeling

A bigram language model is a Markov model:

● S, a set of states, one for each word wi
● A, a transition matrix where a(i,j) is the probability of going from state wi to state wj
○ where the probability a(i,j) can be estimated by MLE as count(wi wj) / count(wi)
● π, a vector of initial state probabilities, where π(i) is the probability of the first word being wi


MLE for Bigram Language Model Parameters

● Estimate the probability of each wi given the previous context by the bigram (Markov) approximation:
P(wi | w1 … wi−1) ≈ P(wi | wi−1)

● Using MLE estimates:
P(wi | wi−1) = count(wi−1 wi) / count(wi−1)

● A correct, practical estimation method for P(S) given the Markov assumption (a training-and-scoring sketch follows below):
P(S) ≈ P(w1) × P(w2 | w1) × … × P(wn | wn−1)
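A minimal training-and-scoring sketch of these MLE estimates, assuming whitespace-tokenized sentences and <s>/</s> boundary symbols (both assumptions, not part of the slide):

from collections import Counter

def train_bigram_lm(sentences):
    # MLE: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # denominators: counts of each word as a previous word
        bigrams.update(zip(tokens, tokens[1:]))
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

def sentence_probability(model, sentence):
    # P(S) under the Markov assumption: product of bigram probabilities
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= model.get((prev, w), 0.0)             # unseen bigrams get probability 0 (no smoothing)
    return p

model = train_bigram_lm(["farm subsidies for farmers", "farm subsidies far exceed demand"])
print(sentence_probability(model, "farm subsidies for farmers"))  # 0.5 for this toy corpus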


Visualizing an Ngram Language Model

Shannon/Miller/Selfridge method:
● To generate a sequence of n words given a bigram language model (a sampling sketch follows this list):

○ Fix an ordering of the vocabulary v1 v2 … vk and specify a sentence length n
○ Choose a random value r between 0 and 1
○ Select the first vj such that the cumulative probability P(v1) + … + P(vj) ≥ r
○ For each remaining position in the sentence:

■ Choose a random value r between 0 and 1
■ Select the first vk such that the cumulative probability P(v1|vj) + … + P(vk|vj) ≥ r, where vj is the word just generated
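A minimal sampling sketch of the procedure above, assuming the bigram model is stored as nested dictionaries whose rows sum to 1 (the toy vocabulary and probabilities are made up for illustration):

import random

bigram = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 1.0},
    "dog": {"ran": 1.0},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def sample_next(previous):
    # Walk the fixed vocabulary ordering until the cumulative probability reaches r.
    r = random.random()
    cumulative = 0.0
    for word, p in bigram[previous].items():
        cumulative += p
        if cumulative >= r:
            return word
    return word  # guard against floating-point rounding

def generate(max_len=10):
    words, current = [], "<s>"
    for _ in range(max_len):
        current = sample_next(current)
        if current == "</s>":
            break
        words.append(current)
    return " ".join(words)

print(generate())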


Shannon-Miller-Selfridge for Shakespeare

Example from Dan Jurafsky

Shakespeare as Corpus

● N = 884,647 tokens, V = 29,066
● Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams
● So 99.96% of the possible bigrams were never seen (have zero entries in the table)
● Quadrigrams: What’s coming out looks like Shakespeare because it is Shakespeare


Shannon-Miller-Selfridge for Wall Street Journal


Example from Dan Jurafsky

Perplexity as a Language Model Metric

● Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N)
● Lower perplexity on the test set means higher likelihood of the observed sentences (the model is less perplexed)
● The Nth root normalizes the inverse probability by the number of words to give a per-word perplexity
● Perplexity is equivalent to the weighted average branching factor
○ The number of possible words that can follow any word
○ Example: a language consisting of the ten words for the digits zero to nine, each with probability 1/10, has perplexity ((1/10)^N)^(−1/N) = 10 (see the sketch below)
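A minimal per-word perplexity sketch, assuming the caller supplies the model’s probability for each test word in context (computed in log space to avoid underflow):

import math

def perplexity(word_probs):
    # PP(W) = P(w1 ... wN) ** (-1/N), computed via logs for numerical stability
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Uniform digit language: every word ("zero" ... "nine") has probability 0.1,
# so the per-word perplexity equals the branching factor, 10.
print(perplexity([0.1] * 20))  # -> 10.0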


Language Modeling: Current Methods

● Recurrent neural networks (RNNs)
○ Avoids the exponential increase in model size and computation that statistical LMs face as the n-gram order grows
○ Weight parameters are shared across the network
○ Therefore, computation time grows only linearly

● Still requires smoothing
● Neural networks will be briefly presented in week 12


Formalizing Hidden Markov Models
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 26, Segment 2 of 3: Formalizing HMMs

Markov Chain versus HMM

● An HMM is a non-deterministic Markov chain: the state sequence cannot be uniquely identified from the observations (introduced in Mtg 25)

● States are hidden and only partially observed through the evidence (sensor model)


Formal Specification of an HMM

● Q = q1 q2 … qN: a set of N hidden states
● A = a11 … aNN: a transition probability matrix, where aij is the probability of moving from state qi to state qj
● O = o1 o2 … oT: a sequence of T observations
● B = bi(ot): observation (emission) likelihoods, the probability of observation ot being generated from state qi
● π = π1 π2 … πN: an initial probability distribution over states

Figure from Martin & Jurafsky, 3rd Ed., Appendix A

Ice Cream Example

● The only evidence for the weather (hot vs. cold) on a given day is how many ice creams Jason ate that day (Jason Eisner; example from Dan Jurafsky)
○ Q: hidden sequence of weather states (H or C)
○ O: sequence of observations of the number of ice creams eaten


One Possible Sequence of Hidden States (of Eight)

● Given an HMM λ = (A,B) and an observation sequence O, determine P(O|λ)

● E.g., the probability of the ice cream observation sequence 3, 1, 3

● P(3, 1, 3, hot, hot, cold) = P(3|hot) × P(1|hot) × P(3|cold) × P(hot|start) × P(hot|hot) × P(cold|hot)


Likelihood: Summing the Possible State Sequences

Likelihood of O = 3 1 3 sums over eight 3-event sequences:

● P(3 1 3, hot cold hot) = P(3|hot) × P(1|cold) × P(3|hot) × P(hot|start) × P(cold|hot) × P(hot|cold)

● P(3 1 3, hot hot cold) = P(3|hot) × P(1|hot) × P(3|cold) × P(hot|start) × P(hot|hot) × P(cold|hot)

. . .
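A minimal brute-force sketch of this summation, enumerating all hidden state sequences. The transition and emission probabilities below are illustrative assumptions; the slide’s actual CPTs are in a figure not reproduced here.

from itertools import product

states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}                                             # P(state | start), assumed
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}  # transitions, assumed
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}    # emissions, assumed

def joint_probability(obs, seq):
    # P(obs, seq) = P(seq[0]|start) P(obs[0]|seq[0]) * product of P(seq[t]|seq[t-1]) P(obs[t]|seq[t])
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
    return p

obs = [3, 1, 3]
# The likelihood P(O | lambda) sums the joint over all 2**3 = 8 hidden state sequences.
print(sum(joint_probability(obs, seq) for seq in product(states, repeat=len(obs))))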


Motivation for Dynamic Programming

● Calculation of P(O|λ):
○ Sum the probabilities of all possible state sequences in the HMM
○ The probability of each state sequence is the product of its state transition and emission probabilities

● Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
○ For T=10 and N=10, 10 billion different paths!

● Solution: dynamic programming, with runtime linear in the sequence length T (a quick size comparison follows below)
○ DP: uses a table (trellis) to store intermediate computations
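A quick check of the counts above (nothing model-specific is assumed beyond N and T):

N, T = 10, 10
naive_paths = N ** T     # 10**10 = 10,000,000,000 distinct state sequences
forward_ops = N * N * T  # about 1,000 multiply-add steps for the forward trellis
print(naive_paths, forward_ops)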


Trellis Data Structure


Size of the Trellis

● N nodes per column, where N is the number of states
● S columns, where S is the length of the sequence
● E edges, one for each transition
● Total trellis size is approximately S(N+E)

○ For N=10, S=10:
■ E = N × N {edges from each column of states to the next} = 10^2 = 100
■ S(N+E) = 10(10+100) = 1,100 << 10^10

Formalizing Hidden Markov Models
AIMA 14.3; Jurafsky & Martin, Draft 3rd ed., Appendix A

CMPSC 442
Week 9, Meeting 26, Segment 3 of 3: Forward Algorithm

Structure of the Forward Algorithm

● Initialization: all the ways to start
● Induction: all the ways to go from any given state at time t to any subsequent state at time t+1
○ Given the probability for state qi at time t, induction carries the probability forward to each next state qj at time t+1
● Termination: all the ways to end

Forward Probabilities

● For a given HMM λ, what is the probability of having generated the partial observation sequence o1 … ot and being in state qi at time t, i.e., αt(i) = P(o1 … ot, qt = i | λ)?
● The forward algorithm computes αt(i) for 1 ≤ i ≤ N, 1 ≤ t ≤ T in time O(N^2 T) using the trellis, where T is the number of observations and N is the number of hidden states

Computing Values of Each Trellis Entry

Each cell of the forward algorithm trellis αt(j) represents the probability of being in state j after seeing the first t observations, given the HMM λ. The value of each cell is computed by summing over the probabilities of every path that leads to this cell:

αt(j) = Σi αt−1(i) aij bj(ot), where
● αt−1(i) is the forward path probability from the previous time step at state qi
● aij is the transition probability from previous state qi to current state qj
● bj(ot) is the observation likelihood of ot given current state qj

Forward Algorithm

● Initialization: α1(j) = πj bj(o1), for 1 ≤ j ≤ N
● Induction: αt(j) = Σi αt−1(i) aij bj(ot), for 1 ≤ j ≤ N and 1 < t ≤ T
● Termination: P(O|λ) = Σi αT(i)

Ice Cream HMM with CPTs

Ice Cream Example: 3 1 3

● Initialization:

Induction Step

Visualizing Computation of a Single Trellis Entry

Trellis for an Ice Cream Sequence: 3, 1, 3

Forward Algorithm: Pseudo Code

Summary - One

● The Markov assumption can be used to construct a Bayesian network for a statistical language model, or to model other sequences over time, based on a transition model of conditional probabilities
● A Hidden Markov Model relies on two Markov assumptions: one for each hidden state conditioned on the previous hidden state (transition model), and one for the observations conditioned on the current hidden state (sensor model)
● Dynamic programming can reduce the computational complexity of HMM inference because many computations are reused

Summary - Two

● The trellis is the data structure used for the forward algorithm
● The forward algorithm computes the likelihood of a sequence of observations by summing, over all hidden state sequences that could produce it, the joint probability of the observations and the state sequence
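To close the segment, here is a minimal Python sketch of the forward algorithm over the trellis. The ice cream HMM parameters below are illustrative assumptions, not necessarily the values in the slide’s CPT figure; under the same parameters the result matches a brute-force enumeration of all hidden state sequences.

states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}                                             # initial distribution, assumed
A = {"hot": {"hot": 0.6, "cold": 0.4}, "cold": {"hot": 0.5, "cold": 0.5}}  # transitions, assumed
B = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}    # emissions, assumed

def forward(observations):
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][observations[0]] for j in states}]
    # Induction: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in observations[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o] for j in states})
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[-1].values())

print(forward([3, 1, 3]))  # likelihood of the ice cream observation sequence 3, 1, 3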