
COMP9418: Advanced Topics in Statistical Machine Learning
Markov Chains and Hidden Markov Models
Instructor: University of Wales


Introduction
§ This lecture discusses two classes of graphical models
§ Markov chains
§ Hidden Markov Models (HMM)
§ Both models are instances of Dynamic Bayesian Networks (DBN)
§ They have a repeating structure that grows with time or space
§ Such structure is simple and uses the Markov property
§ The Markov property states that future states are independent of past ones given the current state
§ In Markov chains, all states are observable
§ HMMs extend Markov chains by allowing hidden states
§ We will discuss specialised inference algorithms for both classes
§ We will also see applications of these graphical models in domains such as robot localisation

Time and Space
§ Several problems require reasoning about sequences
§ Such sequences may represent the problem dynamics in time or space
§ Examples of applications are speech recognition and robot localization
§ Dynamic Bayesian Networks (DBNs) allow us to incorporate time or space in our models
§ The two simplest instances of DBNs are Markov chains and Hidden Markov models

[Chain diagram: X_1 → X_2 → X_3 → X_4]

Markov Chains
§ A Markov chain is a state machine
§ X is a discrete variable and each of its values is called a state
§ Transitions between states are nondeterministic
§ Parameters
§ Prior probabilities P(X_1)
§ Transition probabilities or dynamics P(X_t | X_{t-1})
§ Stationary assumption
§ Transition probabilities are the same for all values of t
§ Also known as a time-homogeneous chain

[Chain diagram: X_1 → X_2 → X_3 → X_4]

Markov Chains: Weather
§ X = {sun, rain}
§ Initial distribution: P(X_1 = sun) = 1
§ Transition probabilities

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

Matrix of transition probabilities P(X_t | X_{t-1}):

  X_{t-1}  X_t    P
  sun      sun    .9
  sun      rain   .1
  rain     sun    .3
  rain     rain   .7

[Transition (state) diagram: sun and rain nodes with self-loop probabilities .9 and .7 and cross arcs .1 and .3]
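To make the example concrete, here is a minimal Python sketch of this chain. The dictionary representation and the simulate helper are illustrative choices of mine, not notation from the lecture.

import random

# Weather chain from the slide above.
prior = {'sun': 1.0, 'rain': 0.0}        # P(X_1)
transition = {                           # transition[prev][cur] = P(X_t = cur | X_{t-1} = prev)
    'sun':  {'sun': 0.9, 'rain': 0.1},
    'rain': {'sun': 0.3, 'rain': 0.7},
}

def simulate(prior, transition, n):
    """Sample a length-n state sequence from the chain."""
    states = list(prior)
    x = random.choices(states, weights=[prior[s] for s in states])[0]
    seq = [x]
    for _ in range(n - 1):
        x = random.choices(states, weights=[transition[x][s] for s in states])[0]
        seq.append(x)
    return seq

print(simulate(prior, transition, 5))    # e.g. ['sun', 'sun', 'rain', 'rain', 'sun']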

Markov Chains: Independencies
§ A relevant question is which independencies are implied by this chain
§ We can use d-separation to visually infer the independencies
§ This independence assumption is known as (first order) Markov property
§ Independencies are also apparent when we look at the chain rule
§ Chain rule of Bayesian networks for this example
  P(X_1, X_2, X_3, X_4, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_2) P(X_4|X_3) P(X_5|X_4)
§ Chain rule in general
  P(X_1, X_2, X_3, X_4, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_2, X_1) P(X_4|X_3, X_2, X_1) P(X_5|X_4, X_3, X_2, X_1)

The two agree because of the independencies
  X_3 ⊥ X_1 | X_2    X_4 ⊥ X_1, X_2 | X_3    X_5 ⊥ X_1, X_2, X_3 | X_4

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

More generally, X_{t+1} ⊥ X_{t-1} | X_t

Markov Chains: Independencies
§ In general
  P(X_1, …, X_n) = P(X_1) ∏_{t=2}^{n} P(X_t | X_{t-1})

Full joint distribution: |X|^n parameters. Chain factorisation: |X| + (n − 1)|X|² parameters.
We also assume that P(X_t | X_{t-1}) is the same for all t (stationarity): |X| + |X|² parameters.

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

Initial distribution P(X_1): sun 1, rain 0
Transition table P(X_t | X_{t-1}): sun→sun .9, sun→rain .1, rain→sun .3, rain→rain .7

Probability of a State Sequence
§ The probability of a sequence is the product of the transition probabilities
§ This comes directly from the chain rule:
  P(X_1, X_2, X_3, X_4, X_5) = P(X_1) P(X_2|X_1) P(X_3|X_2) P(X_4|X_3) P(X_5|X_4)

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

For example, what is the probability of the sequence sun, rain, rain, sun, sun?
  P(sun, rain, rain, sun, sun) = 1 × .1 × .7 × .3 × .9 = .0189
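A small Python sketch of the same computation on the weather chain above; the dictionary layout and names are my own.

prior = {'sun': 1.0, 'rain': 0.0}
transition = {'sun': {'sun': 0.9, 'rain': 0.1},
              'rain': {'sun': 0.3, 'rain': 0.7}}

def sequence_probability(seq, prior, transition):
    """P(x_1, ..., x_n) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

# P(sun, rain, rain, sun, sun) = 1 * .1 * .7 * .3 * .9 = 0.0189
print(sequence_probability(['sun', 'rain', 'rain', 'sun', 'sun'], prior, transition))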

Probability of Staying in a Certain State
§ The probability of staying in a certain state for 𝑑 steps
§ It is the probability of staying in the state for d − 1 further steps and then moving to a different state

  S_s^d = { X_i = s : 1 ≤ i ≤ d }
  P(S_s^d) = P(s|s)^(d−1) (1 − P(s|s))

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

For example, what is the probability of three rainy days?
  P(S_rain^3) = P(rain|rain)^(3−1) (1 − P(rain|rain)) = (.7)² (1 − .7) = .147

Expected Time in a State
§ The average duration of a sequence in a certain state
§ It is the expected number of time steps spent in that state

  E[S_s] = Σ_{i=1}^{∞} i P(S_s^i)
         = Σ_{i=1}^{∞} i P(s|s)^(i−1) (1 − P(s|s))
         = 1 / (1 − P(s|s))

[Chain diagram: X_1 → X_2 → X_3 → X_4 → X_5]

For example, what is the expected number of consecutive rainy days?
  E[S_rain] = 1 / (1 − .7) = 3.33
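Both duration formulas are easy to check numerically; a small sketch using the weather transition probabilities (dictionary layout is mine).

transition = {'sun': {'sun': 0.9, 'rain': 0.1},
              'rain': {'sun': 0.3, 'rain': 0.7}}

def prob_duration(state, d, transition):
    """P(S_s^d): stay in `state` for d steps, then leave."""
    stay = transition[state][state]
    return stay ** (d - 1) * (1 - stay)

def expected_duration(state, transition):
    """E[S_s] = 1 / (1 - P(s|s))."""
    return 1 / (1 - transition[state][state])

print(prob_duration('rain', 3, transition))    # 0.7**2 * 0.3 = 0.147
print(expected_duration('rain', transition))   # 1 / 0.3 ≈ 3.33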

Mini-Forward Algorithm
§ What is P(X_t) on some day t?
§ We can obtain an answer by simulating the chain forward in time

[Chain diagram: X_1 → X_2 → X_3 → X_4 → …]

P(x_1) is known

  P(x_t) = Σ_{x_{t-1}} P(x_t, x_{t-1})
         = Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})

Starting from a sunny day: P(sun) = 1, .9, .84, .804, …   P(rain) = 0, .1, .16, .196, …

Mini-Forward Algorithm
Input: time n, transition probabilities P(X_t | X_{t-1}), prior probability of states P(X_1)
Output: P(X_n)

for each state x do
    p[x, 1] ← P(X_1 = x)
for t ← 2 to n do
    for each state x_t do
        p[x_t, t] ← 0
        for each state x_{t-1} do
            p[x_t, t] ← p[x_t, t] + p[x_{t-1}, t − 1] P(x_t | x_{t-1})
return p[x, n] for all states x
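A direct Python transcription of this pseudocode, using dictionaries keyed by state (a sketch; the function and variable names are mine).

def mini_forward(n, prior, transition):
    """Return P(X_n) by propagating the state distribution forward in time."""
    p = dict(prior)                                   # P(X_1)
    for _ in range(2, n + 1):
        p = {x: sum(p[xp] * transition[xp][x] for xp in p) for x in p}
    return p

# Starting from a sunny day: P(X_4) ≈ {'sun': 0.804, 'rain': 0.196}
print(mini_forward(4, {'sun': 1.0, 'rain': 0.0},
                   {'sun': {'sun': 0.9, 'rain': 0.1},
                    'rain': {'sun': 0.3, 'rain': 0.7}}))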

Grasshopper Example
§ A grasshopper sits on a number line at positions …, −2, −1, 0, +1, +2, …
§ At each step it stays put with probability .5 and hops one position left or right with probability .25 each
§ Starting at position 0, after one step the distribution over {−1, 0, +1} is (.25, .5, .25)
§ After two steps, the distribution over {−2, −1, 0, +1, +2} is
    .25² = .0625
    2(.5)(.25) = .25
    .5² + 2(.25²) = .375
    2(.5)(.25) = .25
    .25² = .0625
§ In matrix form, P(X_t) = P(X_{t-1}) T, where each row of the transition matrix T places probability (.25, .5, .25) on hopping left, staying, and hopping right
    After one step: P(X_2) = (0, .25, .5, .25, 0)
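The same propagation can be written as the vector-matrix product P(X_t) = P(X_{t-1}) T. A NumPy sketch follows; the chain is truncated to five positions here, and the boundary rows are an arbitrary choice of mine that does not affect the first two steps when starting from the centre.

import numpy as np

# Positions -2, -1, 0, +1, +2; interior rows place .25 / .5 / .25 on left / stay / right.
T = np.array([
    [0.75, 0.25, 0.00, 0.00, 0.00],
    [0.25, 0.50, 0.25, 0.00, 0.00],
    [0.00, 0.25, 0.50, 0.25, 0.00],
    [0.00, 0.00, 0.25, 0.50, 0.25],
    [0.00, 0.00, 0.00, 0.25, 0.75],
])

p = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # start at position 0
for _ in range(2):
    p = p @ T                              # P(X_t) = P(X_{t-1}) T
print(p)                                   # [0.0625 0.25 0.375 0.25 0.0625]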

Stationary Distributions
§ Starting with a sunny day
  P(sun): 1, .9, .84, .804, … → .75    P(rain): 0, .1, .16, .196, … → .25

[Chain diagram: X_1 → X_2 → X_3 → X_4 → …]

§ Starting with a rainy day
  P(sun): 0, .3, .48, .588, … → .75    P(rain): 1, .7, .52, .412, … → .25

§ Starting with an unknown day
  P(sun): p, … → .75    P(rain): 1 − p, … → .25

Stationary Distributions
§ For most chains
§ The influence of the initial distribution gets less and less over time
§ The distribution we end up in is independent of the initial distribution

  P_∞(X) = P_{∞+1}(X) = Σ_x P(X|x) P_∞(x)

§ Stationary distribution
§ The stationary distribution π of the chain is the distribution we obtain if the chain converges
§ The stationary distribution satisfies
  π(X) = Σ_x P(X|x) π(x)
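A sketch of two standard ways to obtain π for the weather chain: repeated application of the update above (power iteration), and solving πT = π together with Σ_x π(x) = 1. NumPy is assumed; the variable names are mine.

import numpy as np

T = np.array([[0.9, 0.1],     # rows: from sun, from rain
              [0.3, 0.7]])    # columns: to sun, to rain

# Power iteration: keep applying P_{t+1}(X) = sum_x P(X|x) P_t(x).
p = np.array([0.5, 0.5])
for _ in range(100):
    p = p @ T
print(p)                       # ≈ [0.75 0.25]

# Direct solution of pi (T - I) = 0 with sum(pi) = 1, as a least-squares system.
A = np.vstack([(T - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                      # [0.75 0.25]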

Stationary Distributions
§ Question: What is P(X_∞)?

  π(sun) = P(sun|sun) π(sun) + P(sun|rain) π(rain)
  π(rain) = P(rain|sun) π(sun) + P(rain|rain) π(rain)

  π(sun) = 0.9 π(sun) + 0.3 π(rain)
  π(rain) = 0.1 π(sun) + 0.7 π(rain)

  π(sun) = 3 π(rain)
  π(rain) = 1/3 π(sun)

  With π(sun) + π(rain) = 1:
  π(sun) = 3/4    π(rain) = 1/4

[Chain diagram: X_1 → X_2 → X_3 → X_4 → …]

Stationary Distributions: Grasshopper

[Grasshopper chain over positions …, −2, −1, 0, +1, +2, … with stay probability .5 and hop probabilities .25 to each side]

§ What is the stationary distribution?

  (Each transition matrix row: .25  .5  .25)

Irreducible Markov Chains
§ A Markov chain is irreducible if every state x′ is reachable from every other state x
§ That is, for every pair of states x and x′, there is some time t such that P(X_t = x′ | X_1 = x) > 0
§ Also known as regular or ergodic chain
§ In this case, the states of the Markov chain are said to be recurrent
§ Each state is guaranteed to be visited an infinite number of times when we simulate the chain

[State diagram: a reducible Markov chain with four states]

Stationary Distribution
§ Every (finite state) Markov chain has at least one stationary distribution
§ Yet an irreducible Markov chain is guaranteed to have a unique stationary distribution
§ An irreducible chain may or may not converge to its stationary distribution
§ To guarantee convergence, we need an additional property: Aperiodicity

Aperiodic Markov Chains
§ A Markov chain is aperiodic if it is possible to return to any state at any time
§ There exists a t such that for all states x and all t′ ≥ t, P(X_{t′} = x | X_1 = x) > 0
§ An irreducible and aperiodic Markov chain converges to a unique stationary distribution
§ Irreducible: we can go from any state to any state
§ Aperiodic: avoids chains that alternate forever between states without ever settling into a stationary distribution
An irreducible but periodic Markov chain

Markov Chains Convergence
§ Although an irreducible and aperiodic Markov chain converges to a single stationary distribution, the convergence can be slow
§ Example: a two-state chain that stays in its current state with probability 1 − ε and switches with probability ε; its stationary distribution is close to (0.5, 0.5)
§ For a small ε it will take a very long time to reach the stationary distribution
§ We stay in the same state with high probability and rarely transition to another state
§ The average of these states will converge to (0.5, 0.5), but the convergence will be very slow

Applications of Markov Chains
§ Markov chains have several well-known applications
§ Markov chain Monte Carlo (MCMC) is a powerful approximate inference algorithm used in statistical software such as Stan
§ MCs are part of the Lempel–Ziv–Markov chain (LZMA) compression algorithm used in 7zip
§ The PageRank algorithm used by Google 1.0 is a direct application of MCs
§ PageRank (a small sketch follows below)
§ Model the web as a state graph: pages are states and hyperlinks are transitions
§ Each transition from state i has probability α/k_i, where α is a constant parameter and k_i is the outgoing degree of node i
§ Compute a stationary distribution. But it is not unique. Why?
§ Augment the graph with phantom transitions of weight (1 − α)/N, where N is the number of nodes in the graph
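A small power-iteration sketch of the PageRank chain described above. The toy graph and the value of α are made up for illustration; they are not from the lecture.

import numpy as np

# Toy web graph: node -> outgoing links (hypothetical example).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N, alpha = len(links), 0.85

# Follow a link with probability alpha / k_i, plus phantom transitions
# of weight (1 - alpha) / N to every node.
T = np.full((N, N), (1 - alpha) / N)
for i, outgoing in links.items():
    for j in outgoing:
        T[i, j] += alpha / len(outgoing)

# The stationary distribution, found by power iteration, gives the PageRank scores.
rank = np.full(N, 1.0 / N)
for _ in range(100):
    rank = rank @ T
print(rank)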

Hidden Markov Models (HMM)
§ Hidden Markov Models (HMM) are Markov chains where the states are not directly observable
§ In the weather example, the weather may not be directly observable
§ Instead, we use sensors, such as temperature, air pressure, humidity, wind speed, etc.
§ HMM has two components
§ Underlying Markov chain over states 𝑋
§ Observable outputs (effects of the states) at each time step
§ These outputs are often called emissions

[HMM diagram: X_1 → X_2 → X_3 → X_4 → … with emissions E_1, E_2, E_3, E_4]

HMM Weather Example
§ HMM parameters
§ Initial distribution P(X_1)
§ Transition probabilities P(X_t | X_{t-1})
§ Emission probabilities P(E_t | X_t)

[HMM diagram: X_1 → X_2 → X_3 → X_4 → … with emissions E_1, E_2, E_3, E_4]

  X_1    P(X_1)
  sun    .5
  rain   .5

  X_{t-1}  X_t    P(X_t | X_{t-1})
  sun      sun    .7
  sun      rain   .3
  rain     sun    .3
  rain     rain   .7

  X_t    E_t    P(E_t | X_t)
  sun    umb    .2
  sun    ¬umb   .8
  rain   umb    .9
  rain   ¬umb   .1

HMM: Independencies
§ The chain rule of Bayesian networks for HMMs

  P(X_1, E_1, …, X_n, E_n) = P(X_1) P(E_1 | X_1) ∏_{t=2}^{n} P(X_t | X_{t-1}) P(E_t | X_t)

§ Independencies are also apparent when we look at the chain rule

[HMM diagram: X_1 → X_2 → X_3 with emissions E_1, E_2, E_3]

§ Chain rule of Bayesian networks for this example
  P(X_1, E_1, X_2, E_2, X_3, E_3) = P(X_1) P(E_1|X_1) P(X_2|X_1) P(E_2|X_2) P(X_3|X_2) P(E_3|X_3)

§ Chain rule in general
  P(X_1, E_1, X_2, E_2, X_3, E_3) = P(X_1) P(E_1|X_1) P(X_2|X_1, E_1) P(E_2|X_2, X_1, E_1) P(X_3|X_2, X_1, E_2, E_1) P(E_3|X_3, X_2, X_1, E_2, E_1)

The two factorisations agree because of the independencies
  X_2 ⊥ E_1 | X_1          X_3 ⊥ X_1, E_1, E_2 | X_2
  E_2 ⊥ X_1, E_1 | X_2     E_3 ⊥ X_1, X_2, E_1, E_2 | X_3

HMM: Independencies
§ In general, HMM have the following independency assumptions
§ A state is independent of all past states and all past evidence given the previous state (Markov property)
  X_t ⊥ X_1, …, X_{t-2}, E_1, …, E_{t-1} | X_{t-1}
§ Evidence is independent of all past evidence and all past states given the current state (independence of observations)
  E_t ⊥ X_1, …, X_{t-1}, E_1, …, E_{t-1} | X_t
§ Transition and emission probabilities are the same for all values of 𝑡 (stationary process)
[HMM diagram: X_1 → X_2 → X_3 with emissions E_1, E_2, E_3]

HMM: Inference
§ We start with the first task: tracking the distribution P(X_t) over time
§ This task is known as filtering or monitoring
§ We use B(X_t) = P(X_t | e_1, …, e_t) to denote the belief over the current state
§ We start with B(X_1), usually using a uniform distribution
§ Update B(X_t) as time passes and we get new observations
§ The inference has two main steps
§ Passage of time
§ Observation

[HMM diagram: X_1 → X_2 → X_3 with emissions E_1, E_2, E_3]

Passage of Time
§ Suppose we know the current belief state B(X_t) = P(X_t | e_{1:t})
§ We need to update it as one unit of time passes. Our aim is to compute P(X_{t+1} | e_{1:t})

  P(X_{t+1} | e_{1:t}) = Σ_{x_t} P(X_{t+1}, x_t | e_{1:t})
                       = Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
                       = Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
                       = Σ_{x_t} P(X_{t+1} | x_t) B(x_t)

Observation
§ Given we have updated the belief with the passage of time
§ We know P(X_{t+1} | e_{1:t}) and we need to update it to B(X_{t+1}) = P(X_{t+1} | e_{1:t+1})

  P(X_{t+1} | e_{1:t+1})
    = P(X_{t+1} | e_{t+1}, e_{1:t})
    = P(X_{t+1}, e_{t+1} | e_{1:t}) / P(e_{t+1} | e_{1:t})
    ∝ P(X_{t+1}, e_{t+1} | e_{1:t})
    = P(e_{t+1}, X_{t+1} | e_{1:t})
    = P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
    = P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})

  B(X_{t+1}) ∝ P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})

We must renormalise the result by dividing by Σ_{x_{t+1}} B(x_{t+1})
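The two update steps are short to implement. Below is a minimal Python sketch using the weather HMM parameters from this lecture; it should reproduce the belief values shown on the next slide (the dictionary layout and names are my own).

def elapse_time(belief, transition):
    """Passage of time: P(X_{t+1} | e_{1:t}) = sum_x P(X_{t+1} | x) B(x)."""
    return {x1: sum(transition[x][x1] * belief[x] for x in belief)
            for x1 in belief}

def observe(belief, evidence, emission):
    """Observation: B(X_{t+1}) ∝ P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t}), then renormalise."""
    new = {x: emission[x][evidence] * belief[x] for x in belief}
    z = sum(new.values())
    return {x: p / z for x, p in new.items()}

transition = {'sun': {'sun': 0.7, 'rain': 0.3},
              'rain': {'sun': 0.3, 'rain': 0.7}}
emission = {'sun': {'umb': 0.2, 'no_umb': 0.8},
            'rain': {'umb': 0.9, 'no_umb': 0.1}}

b = {'sun': 0.5, 'rain': 0.5}                 # prior
b = observe(b, 'umb', emission)               # ≈ {'sun': 0.182, 'rain': 0.818}
b = elapse_time(b, transition)                # ≈ {'sun': 0.373, 'rain': 0.627}
b = observe(b, 'umb', emission)               # ≈ {'sun': 0.117, 'rain': 0.883}
print(b)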

HMM Weather Example
  Prior:                            B(sun) = .5     B(rain) = .5
  After observing umbrella (e_1):   B(sun) = .182   B(rain) = .818
  After the passage of time:        B(sun) = .373   B(rain) = .627
  After observing umbrella (e_2):   B(sun) = .117   B(rain) = .883

  P(X_1): sun .5, rain .5
  P(X_t | X_{t-1}): sun→sun .7, sun→rain .3, rain→sun .3, rain→rain .7
  P(E_t = umb | X_t): sun .2, rain .9

Forward Algorithm
§ Suppose we have a sequence of evidence observations and we want to know the state belief at the end of the sequence
  B(X_t) = P(X_t | e_{1:t})

  P(X_t | e_{1:t}) ∝ P(X_t, e_{1:t})
    = Σ_{x_{t-1}} P(X_t, x_{t-1}, e_{1:t})
    = Σ_{x_{t-1}} P(X_t, x_{t-1}, e_t, e_{1:t-1})
    = Σ_{x_{t-1}} P(x_{t-1}) P(X_t | x_{t-1}) P(e_t | x_{t-1}, X_t) P(e_{1:t-1} | e_t, x_{t-1}, X_t)
    = Σ_{x_{t-1}} P(x_{t-1}) P(X_t | x_{t-1}) P(e_t | X_t) P(e_{1:t-1} | x_{t-1})
    = Σ_{x_{t-1}} P(X_t | x_{t-1}) P(e_t | X_t) P(x_{t-1}, e_{1:t-1})
    = P(e_t | X_t) Σ_{x_{t-1}} P(X_t | x_{t-1}) P(x_{t-1}, e_{1:t-1})

You can renormalise at every step, but this algorithm often renormalises only the final result

Forward Algorithm
Input: time n, transition probabilities T, emission probabilities E, prior probability of states P(X_1), sequence of observations {e_2, …, e_n}
Output: B(X_n)

for each state x do
    p[x, 1] ← P(X_1 = x)
for t ← 2 to n do
    for each state x_t do
        p[x_t, t] ← 0
        for each state x_{t-1} do
            p[x_t, t] ← p[x_t, t] + p[x_{t-1}, t − 1] T(x_t | x_{t-1})
        p[x_t, t] ← p[x_t, t] E(e_t | x_t)
return normalised p[x, n] for all states x
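A Python transcription of the pseudocode above (a sketch; as in the pseudocode, the observations start at time step 2 and only the final distribution is normalised).

def forward(n, prior, T, E, observations):
    """Return B(X_n) given observations e_2, …, e_n (one per step after the first)."""
    p = dict(prior)                                   # p[x] = P(X_1 = x)
    for t in range(2, n + 1):
        e = observations[t - 2]                       # e_t
        p = {x: E[x][e] * sum(p[xp] * T[xp][x] for xp in p) for x in p}
    z = sum(p.values())
    return {x: v / z for x, v in p.items()}           # normalise only at the end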

Example: Robot Localization
Example from
Sensor model: can read in which directions there is a wall, never more than 1 mistake
Motion model: may not execute action with small prob.
Slide from Berkeley AI course 34

Example: Robot Localization
Lighter grey: it was possible to get the reading, but it is less likely because it requires one sensor mistake
Slide from Berkeley AI course 35

Example: Robot Localization
Slide from Berkeley AI course 36

Example: Robot Localization
Slide from Berkeley AI course 37

Example: Robot Localization
Slide from Berkeley AI course 38

Example: Robot Localization
Slide from Berkeley AI course 39

Most Probable Explanation (MPE)
§ The forward algorithm tracks the probability of the states
§ These probabilities are updated as time passes and we observe evidence
§ A different task is to provide the most likely explanation
§ Considering all possible state combinations, which one has the highest probability given the evidence?
§ Therefore, we want to compute argmax_{x_{1:t}} P(x_{1:t} | e_{1:t})

[HMM diagram: X_1 → X_2 → X_3 with emissions E_1, E_2, E_3]

State Trellis
§ A state trellis is a graph that illustrates the state transition over time
§ Each arc represents a time passage / evidence observation with weight
  P(x_t | x_{t-1}) P(e_t | x_t)
§ A path is a sequence of states
§ The product of the weights along a path is the probability of that state sequence given the evidence
§ The forward algorithm computes sums of path probabilities that end in the same state, such as X_n = sun
§ We will now see the Viterbi algorithm, which computes the path with the highest probability

[State trellis: columns X_1, X_2, …, X_n, each containing a sun node and a rain node, with arcs between consecutive columns]

Forward and Viterbi Algorithms
[State trellis]

§ The Viterbi algorithm computes the maximum of the path probabilities that lead to the same final state

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

§ The forward algorithm computes the sum of the path probabilities that lead to the same final state

  s[x_t] = P(x_t | e_{1:t})
         = P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) s[x_{t-1}]

Viterbi Algorithm
§ Consider two unfair coins, c_1 and c_2
§ Someone flips the coins sequentially, but we do not know which one. We only observe the outcomes: heads (h) or tails (t)
§ But we know that c_1 has a higher probability of heads and c_2 a higher probability of tails
§ Also, the person has a preference to keep the same coin, and starts with a coin chosen randomly with equal probabilities

[HMM diagram: X_1 → X_2 → X_3 → X_4 with observed outcomes h, t, t, t]

  X_1   P(X_1)
  c_1   .5
  c_2   .5

  X_{t-1}  X_t   P(X_t | X_{t-1})
  c_1      c_1   .7
  c_1      c_2   .3
  c_2      c_1   .3
  c_2      c_2   .7

  X_t   E_t   P(E_t | X_t)
  c_1   h     .8
  c_1   t     .2
  c_2   h     .2
  c_2   t     .8

Viterbi Algorithm
[State trellis over X_1, …, X_5 with states c_1 and c_2, transition probabilities .7 and .3, and the P(X_1) and P(E_t | X_t) tables from the previous slide]

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

Viterbi Algorithm
[State trellis over X_1, …, X_5 with the Viterbi values]

  m[c_1, t]: .5    .28    .0392    .00549    .00226
  m[c_2, t]: .5    .07    .0672    .03763    .02107

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

Viterbi Algorithm
Input: time n, transition probabilities T, emission probabilities E, prior probability of states P(X_1), sequence of observations {e_2, …, e_n}
Output: max_{x_{1:n-1}} P(x_{1:n-1}, x_n | e_{1:n}) for every final state x_n

for each state x do
    m[x, 1] ← P(X_1 = x)
for t ← 2 to n do
    for each state x_t do
        m[x_t, t] ← 0
        for each state x_{t-1} do
            if m[x_{t-1}, t − 1] T(x_t | x_{t-1}) > m[x_t, t] then
                m[x_t, t] ← m[x_{t-1}, t − 1] T(x_t | x_{t-1})
        m[x_t, t] ← m[x_t, t] E(e_t | x_t)
return m[x, n] for all states x
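A Python transcription of the pseudocode, extended with backpointers so the most likely sequence itself can also be recovered; the backpointer part anticipates the next slides, and all names are mine.

def viterbi(n, prior, T, E, observations):
    """Max-product analogue of the forward algorithm.

    observations are e_2, …, e_n, matching the pseudocode above. Returns the
    table of m values and the most likely state sequence x_1, …, x_n.
    """
    states = list(prior)
    m = [dict(prior)]                                  # m[0][x] = P(X_1 = x)
    back = []
    for t in range(2, n + 1):
        e = observations[t - 2]
        col, ptr = {}, {}
        for x in states:
            best_prev = max(states, key=lambda xp: m[-1][xp] * T[xp][x])
            col[x] = m[-1][best_prev] * T[best_prev][x] * E[x][e]
            ptr[x] = best_prev
        m.append(col)
        back.append(ptr)
    # Backtrack from the most likely final state.
    x = max(states, key=lambda s: m[-1][s])
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return m, path[::-1]

# Coin example: observations h, t, t, t give m[c1] = .5, .28, .0392, .00549, .00226
# and m[c2] = .5, .07, .0672, .03763, .02107; the most likely sequence is c1, c1, c2, c2, c2.
T = {'c1': {'c1': 0.7, 'c2': 0.3}, 'c2': {'c1': 0.3, 'c2': 0.7}}
E = {'c1': {'h': 0.8, 't': 0.2}, 'c2': {'h': 0.2, 't': 0.8}}
m, path = viterbi(5, {'c1': 0.5, 'c2': 0.5}, T, E, ['h', 't', 't', 't'])
print(path)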

Viterbi Algorithm
§ The Viterbi algorithm of the previous slide provides the probability of the most likely sequence
§ However, we are often more interested in the sequence itself than in its probability
§ There are two common solutions
§ Keep an additional structure pointing to the parent of each node
§ Backtrack the computation from the last node

[State trellis over X_1, …, X_5 with the Viterbi values]

  m[c_1, t]: .5    .28    .0392    .00549    .00226
  m[c_2, t]: .5    .07    .0672    .03763    .02107

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

Viterbi Algorithm: Backtracking Computation
  m[c_1, t]: .5    .28    .0392    .00549    .00226
  m[c_2, t]: .5    .07    .0672    .03763    .02107

  (P(X_1) and P(E_t | X_t) as in the coin example)

1. Divide m[x_t] by the probability of the evidence, P(e_t | x_t)
2. For each state x_{t-1}, divide by P(x_t | x_{t-1})
3. See which value of x_{t-1} matches the stored result m[x_{t-1}]

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

Viterbi Algorithm: Backtracking Computation
§ Repeat the backtracking steps, starting from the most likely final state

  m[c_1, t]: .5    .28    .0392    .00549    .00226
  m[c_2, t]: .5    .07    .0672    .03763    .02107

The best final state is x_5 = c_2 with m[c_2, 5] = .02107

1. Divide by the probability of the evidence: .02107 / .8 = 0.0263375
2. For each state x_4, divide by P(x_5 = c_2 | x_4):
     x_4 = c_1: 0.0263375 / .3 ≈ 0.08779
     x_4 = c_2: 0.0263375 / .7 ≈ 0.03763
3. See which value of x_4 matches the stored result: m[c_2, 4] = 0.03763, so x_4 = c_2

  m[x_t] = max_{x_{1:t-1}} P(x_{1:t-1}, x_t | e_{1:t})
         = P(e_t | x_t) max_{x_{t-1}} P(x_t | x_{t-1}) m[x_{t-1}]

Viterbi Algorithm: Backtracking Computation
Müller, M. (2015). Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer.

Viterbi Algorithm: Vanishing Probabilities
§ Notice the probabilities decrease as we observe more evidence
§ It is intuitive since the number of paths grows exponentially with the sequence size
