Example: In the human genome, the dinucleotide CG (that is, a C followed by a G along the sequence, written CpG to distinguish it from the C-G base pair across strands) is subject to methylation. The methylated C has a high chance of mutating into a T. That means we see the CpG dinucleotide less frequently than we would expect from the individual frequencies of C and G. In some functional regions of the genome, such as promoters and the start regions of genes, methylation is suppressed and CpG occurs at a higher frequency than elsewhere. These regions are called CpG islands.
To detect these regions we model each region of a sequence as having either high or low CG content, and label the regions H and L. In high CG regions, nucleotides C and G each occur with probability 0.35 and nucleotides T and A each with probability 0.15. In low CG regions, C and G each have probability 0.2 and T and A each have probability 0.3. The initial state is H with probability 0.5 and L with probability 0.5. Transitions are given by $a_{HH} = a_{HL} = 0.5$, $a_{LL} = 0.6$, $a_{LH} = 0.4$. Use the Viterbi algorithm to find the most likely sequence of states given the sequence of symbols x = GGCACTGAA.
Solution: We work in log space (base 2) so that the numbers don't get too small. We'll need the log values of each of the above probabilities, which are best represented as matrices: set $A = \log_2(a)$, where $a$ is the transition matrix,
              H        L
  A =   H    -1       -1
        L    -1.322   -0.737

(the (k, l) entry is $\log_2(a_{kl})$; for example the (H, H) entry is $\log_2(a_{HH}) = \log_2(0.5) = -1$),
and if $E = \log_2(e)$, where e is the matrix of emission probabilities, then
              A        C        G        T
  E =   H    -2.737   -1.515   -1.515   -2.737
        L    -1.737   -2.322   -2.322   -1.737

(the (k, s) entry is $\log_2(e_k(s))$; for example the (H, A) entry is $\log_2(e_H(A)) = \log_2(0.15) = -2.737$).
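For the calculations that follow it can help to have these numbers in code. A minimal sketch in Python with numpy (an assumption; any language with matrices would do), using the state order H, L and the symbol order A, C, G, T from the tables above:

import numpy as np

# Transition probabilities a[k, l] = P(next state l | current state k); order H, L
a = np.array([[0.5, 0.5],     # from H
              [0.4, 0.6]])    # from L

# Emission probabilities e[k, s] = P(symbol s | state k); columns A, C, G, T
e = np.array([[0.15, 0.35, 0.35, 0.15],    # H
              [0.30, 0.20, 0.20, 0.30]])   # L

# Log2 versions, matching the matrices A and E above
A = np.log2(a)
E = np.log2(e)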
The matrix V , with row indices k and column indices i (so that the (k, i)th element is Vk(i)) along with the pointers to allow traceback is
          0      G      G      C      A      C      T      G      A      A
  0       0
  H      -∞   -2.51  -5.03  -7.54 -11.28 -13.12 -16.85 -18.65 -22.39 -25.41
  L      -∞   -3.32  -5.84  -8.35 -10.28 -13.34 -15.81 -18.87 -21.35 -23.82
The first column of this matrix is simple: every sequence is in state 0 at step 0, so $V_0(0) = \log(1) = 0$, while the other states have $V_H(0) = V_L(0) = \log(0) = -\infty$.
The second column is derived from the first as follows:
$V_H(1) = \log(e_H(G)) + \max\{V_0(0)+\log(a_{0H}),\, V_H(0)+\log(a_{HH}),\, V_L(0)+\log(a_{LH})\} = -1.515 + (V_0(0) + \log(a_{0H})) = -1.515 - 1 = -2.515$, and similarly for $V_L(1)$.
Traceback begins in the final column, where we see that the state maximising the joint probability is L. Following the pointers from this position and recording the state at each step gives the state path with the highest probability, $\pi^* = $ HHHLLLLLL. Note that $\pi^*$ is built from right to left by the traceback procedure.
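A short dynamic-programming loop reproduces the matrix and the traceback above. This is only a sketch: it assumes the numpy arrays A and E defined earlier, takes the initial transitions from the begin state to be $\log_2(0.5) = -1$, and stores back-pointers for the traceback.

def viterbi_log2(x, A, E, symbols="ACGT"):
    # Most probable state path for the observed sequence x, working in log2 units.
    n_states, L = A.shape[0], len(x)
    V = np.full((n_states, L), -np.inf)
    ptr = np.zeros((n_states, L), dtype=int)

    # Column 1: reached directly from the begin state 0 with probability 0.5 each
    V[:, 0] = E[:, symbols.index(x[0])] - 1.0

    # Remaining columns: V_l(i) = E_l(x_i) + max_k (V_k(i-1) + A_kl)
    for i in range(1, L):
        for l in range(n_states):
            scores = V[:, i - 1] + A[:, l]
            ptr[l, i] = np.argmax(scores)
            V[l, i] = E[l, symbols.index(x[i])] + scores[ptr[l, i]]

    # Traceback from the best state in the final column
    path = [int(np.argmax(V[:, -1]))]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    path.reverse()
    return V, path

V, path = viterbi_log2("GGCACTGAA", A, E)
print("".join("HL"[k] for k in path))   # HHHLLLLLL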
The schematic below shows how the (i + 1)th column is derived from the ith column in a Viterbi matrix. Here, there are 3 possible states, 1, 2 and 3. The 0 state is omitted in this diagram.
[Diagram: the values $v_1(i)$, $v_2(i)$, $v_3(i)$ in column i feed into column i + 1 via the transitions $a_{11}$, $a_{21}$, $a_{31}$, giving
$v_1(i+1) = e_1(x_{i+1}) \max(v_1(i)a_{11},\, v_2(i)a_{21},\, v_3(i)a_{31})$.]

The same diagram shown using log units:

[Diagram: $V_1(i+1) = E_1(x_{i+1}) + \max(V_1(i) + A_{11},\, V_2(i) + A_{21},\, V_3(i) + A_{31})$.]
18.2 The forward algorithm and calculating P(x)
We have seen that it is easy to calculate $P(x, \pi)$. However, we usually only observe x, so we can't directly calculate $P(x, \pi)$. We could tackle this by finding a suitable state path, such as the Viterbi path $\pi^*$, and calculating $P(x, \pi^*)$. But calculating $P(x, \pi^*)$ does not adequately tell us the likelihood of observing x, which may have arisen from a large number of possible state paths.
What we really want to calculate is $P(x)$, the probability of observing x without taking any particular state path into account. This involves marginalising over all possible paths: that is,

$$P(x) = \sum_{\pi} P(x, \pi).$$
The number of possible state paths grows exponentially so we cannot enumerate them all and naively calculate this sum. Instead, we use another dynamic programming algorithm called the forward algorithm and calculate P (x) iteratively.
The forward algorithm iteratively calculates the quantity $f_k(i) = P(x_{1:i}, \pi_i = k)$,
the joint probability of the first i observations and the event that $\pi_i = k$. The recursion used is

$$f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}. \qquad (1)$$
Initialisation (i = 0): $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$.
Recursion ($i = 1, \ldots, L$): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$.
Termination: $P(x) = \sum_k f_k(L)\, a_{k0}$.
If the end state 0 is not modelled, simply set $a_{k0} = 1$ to get $P(x) = \sum_k f_k(L)$.
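Here is a sketch of these three steps in Python, assuming the probability matrices a and e from the worked example, an initial distribution of 0.5 for each of H and L, and no modelled end state (so $a_{k0} = 1$). For a sequence this short there is no underflow problem, so it works directly in probability space:

def forward(x, a, e, symbols="ACGT", init=(0.5, 0.5)):
    # P(x) by the forward algorithm, in probability space.
    n_states, L = a.shape[0], len(x)
    f = np.zeros((n_states, L))

    # Initialisation folded into the first column: f_k(1) = e_k(x_1) * a_0k
    f[:, 0] = e[:, symbols.index(x[0])] * np.array(init)

    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
    for i in range(1, L):
        f[:, i] = e[:, symbols.index(x[i])] * (f[:, i - 1] @ a)

    # Termination with a_k0 = 1
    return f[:, -1].sum()

p = forward("GGCACTGAA", a, e)   # the worked example below computes log2(p)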
Here’s a diagram showing how to get the (i + 1)th column from the ith column in the forward algorithm. The example shown has three states 1, 2 and 3.
[Diagram: the values $f_1(i)$, $f_2(i)$, $f_3(i)$ in column i feed into column i + 1 via the transitions $a_{11}$, $a_{21}$, $a_{31}$, giving
$f_1(i+1) = e_1(x_{i+1})\,(f_1(i)a_{11} + f_2(i)a_{21} + f_3(i)a_{31})$.]
Once again, we'll need to work with log quantities, as the quantities of interest get very small very fast. However, if we take the log of both sides of Equation (1), the log of the sum on the right-hand side does not simplify immediately.
Let $F_k(i) = \log(f_k(i))$ and $A_{kl} = \log(a_{kl})$. Then Equation (1) becomes

$$F_l(i) = \log\Big[e_l(x_i) \sum_k f_k(i-1)\, a_{kl}\Big] = \log[e_l(x_i)] + \log\Big[\sum_k \exp(F_k(i-1) + A_{kl})\Big].$$
Directly calculating a sum of the form $\log(c) = \log(e^a + e^b)$ requires calculating $e^a$ and $e^b$, which we were trying to avoid all along. Instead, note that $\log(e^a + e^b) = \log(e^a(1 + e^{b-a})) = \log(e^a) + \log(1 + e^{b-a}) = a + \log(1 + e^{b-a})$. If the difference $b - a$ is not too large, this method never needs to store an extremely large or small number, so it is numerically stable. This extends to finding the log of a sum of multiple logged numbers:
function logsum(x)
    return x[0] + log(sum(exp(x - x[0])))

or, if you are using log2,

function log2sum(x)
    return x[0] + log2(sum(2^(x - x[0])))
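A direct numpy translation of this pseudocode might look as follows (a sketch; choosing the maximum of x as the reference value instead of x[0] makes overflow even harder):

import numpy as np

def logsum(x):
    # log of a sum of numbers supplied as natural logs: log(sum_i exp(x_i))
    x = np.asarray(x, dtype=float)
    return x[0] + np.log(np.sum(np.exp(x - x[0])))

def log2sum(x):
    # the same in base 2: log2(sum_i 2**x_i)
    x = np.asarray(x, dtype=float)
    return x[0] + np.log2(np.sum(2.0 ** (x - x[0])))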
Example cont: For the CG island example above, use the forward algorithm to calculate the probability of the sequence x = GGCACTGAA.
Solution: The matrix produced by the forward algorithm is given below, in log units (base 2). The first column is based on the start state, 0. The first entry of the second column is $\log(f_H(1)) = \log(e_H(G)) + \log(1/2) = -1.515 - 1 = -2.515$.
          0      G      G      C      A      C      T      G      A      A
  0       0
  H      -∞   -2.51  -4.49  -6.39  -9.51 -10.49 -13.55 -14.53 -17.58 -19.78
  L      -∞   -3.32  -5.08  -6.97  -8.27 -10.89 -12.30 -14.92 -16.33 -18.37
The log probability is thus $\log(P(x)) = \log_2(2^{-19.78} + 2^{-18.37}) = -17.91$.
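As a quick check, this termination step can be evaluated with the log2sum helper sketched above:

print(log2sum([-19.78, -18.37]))   # about -17.91, as above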