
18.3 The backward algorithm and calculating P(x)
The backward algorithm is similar: if we run it to the end, we again calculate P(x). It starts from the end of the sequence and works back to the beginning. Define
$$b_k(i) = P(x_{i+1:L} \mid \pi_i = k),$$
the probability of observing the last part of the sequence, $x_{i+1}, x_{i+2}, \ldots, x_L$, conditional on being in state $k$ at time $i$.
Once again, this is calculated using tabular computation:
Initialisation ($i = L$): $b_k(L) = a_{k0}$ for all $k$.
Recursion ($i = L-1, \ldots, 1$): $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$.
Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$.
If we don’t model the end of the sequence, $a_{k0} = 1$.
Here’s a diagram showing how to get the $i$th column from the $(i+1)$th column in the backward algorithm. The example shown has three states, 1, 2 and 3.

[Diagram: column $i+1$ holds $b_1(i+1)$, $b_2(i+1)$, $b_3(i+1)$; the arrows back into $b_1(i)$ in column $i$ are labelled $a_{11}$, $a_{12}$, $a_{13}$.]

$$b_1(i) = a_{11} e_1(x_{i+1}) b_1(i+1) + a_{12} e_2(x_{i+1}) b_2(i+1) + a_{13} e_3(x_{i+1}) b_3(i+1)$$

The log version of the algorithm is (writing $B_k(i) = \log b_k(i)$):

Initialisation ($i = L$): $B_k(L) = A_{k0}$ for all $k$ (or $B_k(L) = 0$ if the end is not modelled).
Recursion ($i = L-1, \ldots, 1$): $B_k(i) = \log\left[\sum_l \exp(A_{kl} + E_l(x_{i+1}) + B_l(i+1))\right]$.
Termination: $\log P(x) = \log\left[\sum_l \exp(A_{0l} + E_l(x_1) + B_l(1))\right]$.
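To make the recursion concrete, here is a minimal NumPy sketch. It is not taken from the notes: the function names, the toy parameter values and the choice not to model the end of the sequence (so $b_k(L) = 1$) are all our own.

```python
import numpy as np

def backward(x, a, e):
    """Backward table: b[i, k] = P(x_{i+1:L} | pi_i = k) for i = 1..L."""
    L, K = len(x), a.shape[0]
    b = np.zeros((L + 1, K))                  # row 0 unused so b[i] lines up with b_k(i)
    b[L] = 1.0                                # initialisation: a_k0 = 1 (end not modelled)
    for i in range(L - 1, 0, -1):             # recursion: i = L-1, ..., 1
        b[i] = a @ (e[:, x[i]] * b[i + 1])    # sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    return b

def backward_prob(x, a, e, a0):
    """Termination: P(x) = sum_l a_0l e_l(x_1) b_l(1)."""
    b = backward(x, a, e)
    return float(np.sum(a0 * e[:, x[0]] * b[1])), b

# Toy two-state example (values made up purely for illustration)
a  = np.array([[0.9, 0.1], [0.2, 0.8]])   # a_kl
e  = np.array([[0.5, 0.5], [0.1, 0.9]])   # e_k(b), two symbols 0 and 1
a0 = np.array([0.5, 0.5])                 # a_0k, the initial distribution
x  = [0, 1, 1, 0]                         # observed sequence as symbol indices
px, b = backward_prob(x, a, e, a0)
```

Running the forward algorithm on the same sequence should give the same P(x), which is a handy check that both tables are correct.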
18.4 The posterior probability of being in state $k$ at time $i$, $P(\pi_i = k \mid x)$
The final product of this backward algorithm is not usually what we are interested in (we use the forward algorithm to calculate that), but the combined forward and backward algorithms allow us to calculate the joint probability of all the observations and the event $\pi_i = k$:
$$\begin{aligned}
P(x, \pi_i = k) &= P(x_{1:i}, \pi_i = k)\, P(x_{i+1:L} \mid x_{1:i}, \pi_i = k) && \text{since } P(A,B) = P(A)P(B \mid A) \\
&= P(x_{1:i}, \pi_i = k)\, P(x_{i+1:L} \mid \pi_i = k) && \text{by the Markov property} \\
&= f_k(i)\, b_k(i)
\end{aligned}$$
Of more interest is the posterior probability $P(\pi_i = k \mid x)$, which we obtain directly by
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)},$$
where the denominator is calculated from either the forward or the backward algorithm.
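In code the posterior is just an elementwise product of the two tables divided by P(x). A minimal sketch, assuming `f` and `b` are NumPy tables laid out as in the backward sketch above, with `f[i, k]` holding $f_k(i)$ from the forward algorithm (the function name is ours):

```python
import numpy as np

def posterior_probs(f, b, px):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x).

    f, b : (L+1) x K forward and backward tables, row i = position i
    px   : P(x), from the termination step of either algorithm
    """
    return f[1:] * b[1:] / px   # drop the unused row 0; row i-1 holds position i
```

Each row of the result should sum to 1, which is another easy sanity check.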
18.5 What can we do with the posterior estimates?
We saw that the Viterbi path, $\pi^*$, is the most likely single path. But usually the most likely path is not very likely at all: there may be many other paths that are nearly as likely. We can use the posterior $P(\pi_i = k \mid x)$ to get some other likely paths.
The first is $\hat\pi$, the maximum posterior path, where
$$\hat\pi_i = \arg\max_k P(\pi_i = k \mid x).$$
Note that $\hat\pi$ is often not a legal path through the state space as it may include transitions that are not allowed.
The second is when we are interested in some function of the states, $g(k)$. In these cases, we calculate the posterior expectation of $g$ at a particular position,
$$G(i \mid x) = E_k[g(\pi_i) \mid x] = \sum_k P(\pi_i = k \mid x)\, g(k).$$
In particular, if $g$ is an indicator function, that is, $g$ takes the value 1 for some subset of states and 0 for all others, then $E_k[g(\pi_i) \mid x]$ is just the posterior probability that $\pi_i$ is in the specified subset.
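Both ideas read straight off the posterior matrix. A brief sketch, assuming `post` is the $L \times K$ matrix produced by the posterior sketch above (the function names and the example subset are ours):

```python
import numpy as np

def max_posterior_path(post):
    """pi_hat_i = argmax_k P(pi_i = k | x); may not be a legal path."""
    return post.argmax(axis=1)

def posterior_expectation(post, g):
    """G(i|x) = sum_k P(pi_i = k | x) g(k), evaluated at every position i."""
    gvals = np.array([g(k) for k in range(post.shape[1])])
    return post @ gvals

# Example: g as the indicator of a subset of states, so G(i|x) is the posterior
# probability that pi_i lies in that subset (the state labels are illustrative).
# in_subset = posterior_expectation(post, lambda k: 1.0 if k in {0, 2} else 0.0)
```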
18.6 Estimating the parameters of an HMM
So far we have assumed we know the structure of the HMM and the associated parameter values (the transition probabilities $a_{kl}$ and emission probabilities $e_k(b)$). In general, we don’t know either of these. What we usually do is decide on a model (based on our knowledge of the system) and then estimate the parameters of the model. Let $\theta = \{a_{kl}, e_k(b)\}$ be the set of all parameters of the model.
Then we are interested in finding the set of parameters that maximizes the (log) likelihood
$$\ell(x^1, \ldots, x^n \mid \theta) = \log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^n \log P(x^j \mid \theta).$$

The likelihood of the $j$th sequence, $P(x^j \mid \theta)$, is just what we have been referring to as $P(x^j)$ up to this point, as we had always assumed the parameter values, $\theta$, were known. Writing it as $P(x^j \mid \theta)$ simply emphasises the fact that we think of it now as a function of the unknown $\theta$.
If we knew the state paths for a long sequence (or many short sequences), we could estimate the parameters simply by using the empirical proportions of transitions and emissions as our probabilities:
$$\hat a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}} \qquad \text{and} \qquad \hat e_k(b) = \frac{E_k(b)}{\sum_j E_k(j)},$$
where $A$ and $E$ are empirical counts. Note that, as some transitions or emissions probably wouldn’t occur in smaller datasets, it is advisable to add a small number of ‘pseudo-counts’ to the empirical counts so that none are zero. But assuming we know the state paths is unrealistic, so we proceed assuming we have only observed sequences $x^1, x^2, \ldots, x^n$.
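With known paths the estimation really is just counting. A minimal sketch (the function name, argument layout and default pseudo-count value are our own choices):

```python
import numpy as np

def estimate_from_paths(seqs, paths, K, M, pseudo=1.0):
    """Estimate a_kl and e_k(b) by counting, given known state paths.

    seqs  : list of sequences of symbol indices in 0..M-1
    paths : list of state paths of the same lengths, states in 0..K-1
    pseudo: pseudo-count added to every cell so no estimate is exactly zero
    """
    A = np.full((K, K), pseudo)
    E = np.full((K, M), pseudo)
    for x, pi in zip(seqs, paths):
        for i in range(len(x) - 1):
            A[pi[i], pi[i + 1]] += 1          # observed transition k -> l
        for xi, k in zip(x, pi):
            E[k, xi] += 1                     # observed emission of symbol xi in state k
    a_hat = A / A.sum(axis=1, keepdims=True)  # A_kl / sum_i A_ki
    e_hat = E / E.sum(axis=1, keepdims=True)  # E_k(b) / sum_j E_k(j)
    return a_hat, e_hat
```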
18.7 Baum-Welch algorithm for estimating parameters of HMM
The Baum-Welch algorithm is an iterative algorithm that attempts to maximize the (log) likelihood of an HMM. Unlike earlier algorithms we have seen, it is not exact, so the estimate it finds is not guaranteed to be the best. It may also get stuck in local maxima, so different starting points are necessary.
The idea of the algorithm is to pick a starting value for $\theta = (a, e)$. Probable paths for this value of $\theta$ are found. From these probable paths, a new value for $\theta$ is found by calculating $A$ and $E$. This process repeats until the likelihood of $\theta$ converges on some value (that is, no change or a very small change is seen in $\ell(x^1, \ldots, x^n \mid \theta)$ from one step to the next).
In one version of the algorithm, the probable paths used are the Viterbi paths for each sequence and the values A and E are calculated from these paths. This seems reasonable and can produce satisfactory results but it does not converge to the maximum likelihood estimate.
It turns out that we can avoid actually imputing a probable path by directly calculating the probability that the transition from k to l occurs at position i in x:
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}.$$
Thus, to get the expected value of $A_{kl}$, we simply sum over all possible values of $i$. A similar argument can be made for $E$. The expected values for $A_{kl}$ and $E_k(b)$ are then:
$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1) \qquad (2)$$
$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{i \,:\, x^j_i = b} f^j_k(i)\, b^j_k(i) \qquad (3)$$
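Per sequence, both expectations come directly from the forward and backward tables. A sketch using the same (L+1) x K table layout as before (`px` is $P(x^j)$ from either algorithm; the names are ours):

```python
import numpy as np

def expected_counts(x, f, b, px, a, e):
    """One sequence's contribution to the expected counts A_kl and E_k(b)."""
    L = len(x)
    A = np.zeros_like(a)
    E = np.zeros_like(e)
    for i in range(1, L):          # Equation (2): f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
        A += np.outer(f[i], e[:, x[i]] * b[i + 1]) * a / px
    for i in range(1, L + 1):      # Equation (3): f_k(i) b_k(i) / P(x) wherever x_i = b
        E[:, x[i - 1]] += f[i] * b[i] / px
    return A, E
```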

The Baum-Welch algorithm proceeds as follows:
Initialise: Set starting values for the parameters. Set the log-likelihood to $-\infty$.
Iterate:
1. Set $A$ and $E$ to their pseudo-count values. For each training sequence $x^j$:
(a) Calculate $f_k(i)$ for $x^j$ from the forward algorithm.
(b) Calculate $b_k(i)$ for $x^j$ from the backward algorithm.
(c) Calculate $A^j$ and $E^j$ and add to $A$ and $E$ using Equations (2) and (3) above.
2. Set new values for $a$ and $e$, based on $A$ and $E$.
3. Calculate the log-likelihood of the model.
4. If the change in log-likelihood is small, stop; else, continue.
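Putting the pieces together, here is a compact, self-contained sketch of the whole loop in NumPy. It is not the notes’ own code: the forward pass mirrors the forward algorithm from the previous section, sequence ends are not modelled, and the convergence tolerance and pseudo-count defaults are arbitrary choices of ours.

```python
import numpy as np

def forward(x, a, e, a0):
    """Forward table f[i, k] = P(x_{1:i}, pi_i = k); also returns P(x)."""
    L, K = len(x), a.shape[0]
    f = np.zeros((L + 1, K))
    f[1] = a0 * e[:, x[0]]
    for i in range(2, L + 1):
        f[i] = (f[i - 1] @ a) * e[:, x[i - 1]]
    return f, float(f[L].sum())                 # end of the sequence not modelled

def backward(x, a, e):
    """Backward table b[i, k] = P(x_{i+1:L} | pi_i = k), with b_k(L) = 1."""
    L, K = len(x), a.shape[0]
    b = np.zeros((L + 1, K))
    b[L] = 1.0
    for i in range(L - 1, 0, -1):
        b[i] = a @ (e[:, x[i]] * b[i + 1])
    return b

def baum_welch(seqs, a, e, a0, max_iter=200, tol=1e-6, pseudo=1e-2):
    """Iterate expected counts and re-estimation until the log-likelihood settles."""
    loglik_old = -np.inf
    for _ in range(max_iter):
        A = np.full_like(a, pseudo)             # 1. reset counts to pseudo-count values
        E = np.full_like(e, pseudo)
        loglik = 0.0
        for x in seqs:
            f, px = forward(x, a, e, a0)        # (a) forward table
            b = backward(x, a, e)               # (b) backward table
            loglik += np.log(px)
            for i in range(1, len(x)):          # (c) expected counts, Equations (2) and (3)
                A += np.outer(f[i], e[:, x[i]] * b[i + 1]) * a / px
            for i in range(1, len(x) + 1):
                E[:, x[i - 1]] += f[i] * b[i] / px
        a = A / A.sum(axis=1, keepdims=True)    # 2. new values for a and e
        e = E / E.sum(axis=1, keepdims=True)
        if abs(loglik - loglik_old) < tol:      # 3-4. stop once the change is small
            break
        loglik_old = loglik
    return a, e, loglik
```

Because this is the plain (not log-space) version, very long sequences would underflow; in practice the recursions are scaled or run in log space, as in the log version given earlier.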