程序代写代做 Hidden Markov Mode html algorithm C Bioinformatics DNA ✏


X
⌧ B 12⌧ M
⌧ E 1 2 ⌧





Y
Figure 8: A pair HMM model for global alignment. Emission probabilities for states M, X, Y are pxiyj , qxi and qyj , respectively. Compare it to the simpler FSA in Figure 2.
10 Applications of HMMs in bioinformatics 10.1 Pairwise alignment with HMMs
We saw that we could tackle the pairwise alignment problem with finite state automata. We now tackle the problem using an HMM, which is sometimes called a stochastic FSA.
We define a pair HMM as emitting a pair of sequences (x, y) as opposed to the standard HMMs we have considered so far that emit a single sequence.
The basic HMM which produces a global alignment has three states, X,Y and M. M emits a match, X emits a residue from sequence x and a gap in sequence y (an insertion in x relative to y), while Y emits a residue from y and a gap in x. Emission probabilities for states M, X, Y are pxiyj , qxi and qyj , respectively. We also include begin and end states, B and E. Non-zero transition probabilities are aMX = aMY = , aXX = aY Y = ✏, aBX=aBY =P,akE=⌧foranyk,andakM6=0fork6=Eandcanbecalculatedusing the fact that k aik = 1.
All the algorithms we saw for standard HMMs will work for these pair HMMs but we need a little bit of accounting for the 2 sequences — instead of vk(i) or fk(i), we need to work with vk(i,j) or fk(i,j) etc where k is one of the 3 states M, X or Y. In each case, we keep 3 score matrices instead of 1 to keep track of which state we are in at every point in the alignment.
Durbin et al show how the parameters in these pair HMMs work within the Viterbi algorithm to produce exact analogues of the dynamic programming algorithms we saw earlier.
59
1✏⌧
1✏⌧

For example, to the standard quantities we use in the Needleman-Wunsch algorithm from the Viterbi HMM formulation of global alignment, set
s(a,b)=log pab +log12⌧(1⌘)2 qa qb
d = log (1 ✏ ⌧ )
(1 ⌘)(1 2 ⌧)
e=log ✏ . 1⌘
These pair HMMs give us more than just another way of viewing the basic alignment algorithms. Since they are couched in the language of probability, we can answer ques- tions about alignments with more depth and nuance than we can with the standard deterministic tools.
60

10.1.1 Probability that two sequences are related
For example, given a pair sequences, x and y, we can ask what probability that two sequences are related without relying on any particular alignment. This is given by the
quantity
X
P(x,y) = P(x,y,⇡) ⇡
where the sum is over all possible alignments, ⇡. As in the standard HMM case, we calculate this quantity via the forward algorithm or the backward algorithm.
10.1.2 Sampling alignments
Further, instead of relying on a single alignment, we could sample possible alignments in proportion to their probability using the techniques described in Section 9.8.
That is, we could calculate the forward fM (i, j), fX (i, j) and fY (i, j) from the forward algorithm and then traceback to find a state path which is an alignment. The traceback from the current state chooses the next state (M, X, or Y) at each step according to it’s contribution to the probability of the current state. This is exactly the technique described in the Section 9.8.
10.1.3 Probability that xi and yj are aligned
Consider two residues, xi and yj, in our given sequences x and y. What is the probability that they are actually aligned to each other? Write hxi,yji to mean xi and yj are aligned, so the probability of interest is Pr(hxi,yji|x,y). We could estimate this probability by sampling many alignments using the technique described above and counting the proportion of times that hxi,yji.
But it turns out that we can calculate this number exactly using the general technique of finding the posterior probability of being in a state at some time described in Section 9.4. xi is aligned to yj exactly when the HMM is in state M at (i,j). Thus
Pr(hxi,yji|x,y) = P(⇡(i,j) = M|x,y) = P(⇡(i,j) = M,x,y) = fM(i,j)bM(i,j). P(x,y) P(x,y)
The last equality is exactly the same as the one derived in Section 9.4 and, as usual, the quantity P(x,y) can be calculated using either the forward or backward algorithm.
10.2 Profile HMMs
The canonical problem in genetics is to find a group of sequences that are homologous. In particular, we are interested in finding homologous genes that share a similar function. We say such sequences or genes belong to the same family in the sense that they share a common ancestor and have maintained the same (or similar) functionality. They may be the same sequence in di↵erent species or in the same species but in di↵erent parts of the
61

genome (having arrived there through duplication). Sequences in the same family will often have features in common, particularly where they share the same function and, therefore, the same basic secondary structure.
If we can characterise these families accurately, by finding features that almost certainly share in common and identifying regions where more variation is seen, we will be able to better align sequences known to belong to the family and more easily identify other members of the family. We achieve this characterisation by modelling the family using an HMM, known as a profile HMM.
We’ll start by assuming that we are given a family of homologous sequences that are already aligned into a multiple sequence alignment (MSA). See a very small example in Figure 9.
VGA–HAGEY
V—-NVDEV
VEA–DVAGH
VKG——D
VYS–TYETS
FNA–NIPKH
IAGADNGAGV
Figure 9: Ten columns from a given multiple alignment of seven globin sequences.
From a given alignment, we wish to characterise a typical sequence in the alignment at each position. Once we have made this characterisation, we could use it to search for other sequences (or parts of sequences) that fit the profile and so are candidates to be members of this family.
We model the alignment as an HMM, where each position in the alignment is a state with a distinct probability of emitting the various residues.
We’ll start by supposing the alignment that is largely free of gaps (by ignoring, say, columns in the alignment that are more than 50% gaps). With each position in the alignment, associate a state in the HMM. Call this state a match state and label the ith match state Mi. The (ungapped) alignment is then modelled as a HMM with only the trivial transitions, Mi to Mi+1 (with additional begin and end states). Let emission probabilities from the i match state be eMi . A model for a MSA of length 3 looks like:
Now let’s allow gaps in the alignment. How do we handle them? To handle an insertions (with respect to the alignment — parts of a sequence x that are not matched by anything in the model) we introduce a new set of states Ii which matches the residues in x after i to a gap. These states have emission probabilities eIi (a) which we set to the background rate: eIi (a) = qa.
There is a transition from Mi to Ii, a loop at Ii, and a transition from Ii to Mi+1.
62
B
M1
M2
M3
E

Ij
A gap of length k therefore has log-odds score
logaMiIi +logaIiMi+1 +(k1)logaIiIi,
the same as an ane gap penalty.
A deletion relative to the model corresponds to skipping ahead in the model. To allow transitions from every match state to every other state ahead of it in the model would introduce too many transitions (how many would we need?) so we introduce instead a silent delete states Di next to every match state. Silent states emit no residues. We allow transitions aMi1Di,aDiMi+1 and aDiDi+1.
Dj
A full model incorporates both insert and delete states and is drawn below. Notice that we have not drawn transitions between I and D states though it simplifies computation to allow them (we’d allow Dj ! Ij and Ij ! Dj+1).
Dj
Ij
This profile HMM model is equivalent to a pair HMM model for pairwise alignment where the given MSA is summarised into a single sequence y . So when we seek to fit a candidate sequence x to our profile HMM, it is as though we are aligning x and y with a pair HMM (with appropriately chosen emission probabilities).
63
B
Mj
E
B
Mj
E
B
Mj
E

10.2.1 Estimating the parameters of a profile HMM
We can choose the parameters of the profile HMM using the empirical counts (Akl and Ek(b)) with pseudo-counts added. Remember that the number of possible transitions is limited: in our full drawn model we only allow non-zero transitions for aMi Mi+1 , aMi Di+1 , aMiIi , aIiIi , aIiMi+1 , aDiDi+1 and aDiMi+1 . We have emissions for all possible residues at each site so we add pseudo-counts to ensure eMi (a) > 0 for all a. Then we use the rule we saw earlier to estimate transition and emission probabilities:
Akl Ek (b) akl=PA andek(b)=PE(j).
j kj j k
A simple way of assigning pseudo-counts is to add 1 to all scores (including zero scores). There is a lengthy discussion of which pseudo-count values to choose in the Durbin et al book and many more sensible schemes are proposed – none are particularly complicated but we don’t have time to cover them all here.
We assume emissions from insert states are at the background rate (that is, that rate the residue occurs at at any position).
In the example in Figure 9, the first column has all seven sequences in the match state with 5 Vs, and 1 each of F and I. The 17 other possible residues are not observed, so have frequency 0. Adding a pseudo-count of 1 to each observed frequency gives us emission probabilities eM1(V) = 6/27, eM1(F) = eM1(I) = 2/27 and the other 17 residues have eM1 (b) = 1/27. 6 of the 7 transitions to the next column are to the match state again while one is to a delete state. Again, adding pseudo-countsof one gives aM1M2 = 7/10, aM1D2 = 2/10 and aM1I1 = 1/10. Transitions from the insert and delete states are just based on the pseudo-counts here.
10.2.2 Finding matches
Once the model has been set and the parameters estimated, we can set about seeing whether other sequences match the family. Call M the model and x a sequence we wish to test against the model. To do so, we align proposed sequences to the family using the now familiar tools of the Viterbi algorithm, giving the Viterbi path ⇡⇤ (and the associated score, P(x|⇡⇤,M)) or, better, the full probability of a sequence summed over all alignments, P(x|M).
These scores can be used to compare the hypotheses that x belongs to family M or x is no moreQlike M than we would expect at random. Call R the random model and let P (x|R) = Li=1 qxi be the likelihood of x under the random model where qa is simply the background rate of occurrence of residue a. If the log ratio log(P (x|M )/P (x|R)) > 0, x is more likely associated with the modelled family than not. To find more certain matches, we may chose a threshold that this ratio must exceed before we call it a member of the family.
Note that if we wish to fit new found sequence x to the family, we can simply use the Viterbi alignment to align it to the family and update parameters of the model.
64

10.2.3 Alignment with a known profile HMM
The simplest case is when we have a known aligned family to which we have already fitted a profile HMM and we wish to add a number of sequences to the profile. In this case, we use the Viterbi algorithm to find the most probable alignment for the new sequences.
The Viterbi path will consist of matches, insert and delete states. At the delete states, we add a gap character, –, to the sequence we are aligning. At an insert state, the unaligned residue of the sequence we are aligning is emitted, forcing a gap like character in the already aligned profile. In the profile, we placeholder character, ., at these positions. Note that if there are multiple insertions in di↵erent sequences at a position, there are many ways to align them against each other. Since we believe insertions are not shared between all members of the family, we can set an arbitrary rule for aligning them, such as using simple left-justification.
10.2.4 Alignment from unaligned sequences with HMMs
If we do not have a pre-existing aligned family and/or profile HMM, we need to first specify an HMM and then use the Baum-Welch algorithm to estimate the parameters. After we have done that, we are in the same position as above and can construct the MSA from the fitted profile HMM.
A rule of thumb for specifying the HMM in the absence of prior knowledge is to allow M match states where M is the average length of the training sequences. In general, it is dicult to fit an HMM of this size. A number of heuristics have been developed to avoid local maxima.
Clustal⌦ implements this method and is quick enough to align thousands of sequences in reasonable time. Currently, Clustal⌦ can only be used for protein sequences. See Fabian Sievers et al, 2011, Molecular Systems Biology 7, http://www.nature.com/msb/ journal/v7/n1/full/msb201175.html for a full description.
10.3 Gene finding
A DNA sequence can roughly be divided into two types of region: genes and non-genes or inter-genic regions. Genes are regions that code for proteins which actually perform biological functions in an organism so these regions are of primary interest. The role played by intergenic regions is not yet clear and it is often referred to as ‘junk DNA’.
We are interested, then, to find a method of finding regions of a given sequence that are genes. To do so, we need to look at how a gene is structured along a sequence. Recalling that a sequence has a direction with one end being 5’ end, the other being the 3’ end. Moving forward in the sequence is going from the 5’ end towards the 3’ end.
Our model of a gene has the following elements occurring in the order listed here: an inter-genic region, a promoter region, a 5’ un-transcribed region, a series of exons and
65

Figure 10: Gene structure and transcription. The DNA of the coding region is composed of exons (coding DNA) interspersed with introns (non-coding DNA) and is flanked by untranslated regions (UTRs). Upstream of the coding regions within the gene are DNA sequences that control (promoter) and regulate (enhancers) gene expression. During transcription, the initial nuclear transcript includes RNA sequence complementary to the entire coding region and the UTRs. In a subsequent step, the introns are spliced out to form mRNA which translocates to the cytoplasm where it is translated into protein. Source: http://dx.doi.org/10.1093/bja/aep130 by UoA subscription
introns, a 3’ un-transcribed region, a poly-A region and then back to another inter-genic region.
These regions are described in more detail below.
Inter-genic region: A non-coding region between genes.
Promoter: a region of DNA that facilitates the transcription of a particular gene. Pro- moters are located near the genes they regulate, on the same strand and typically up- stream (towards the 5’ region of the sense strand). For the transcription to take place, the enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences and response elements that pro- vide a secure initial binding site for RNA polymerase and for proteins called transcription factors that recruit RNA polymerase. These transcription factors have specific activator or repressor sequences of corresponding nucleotides that attach to specific promoters and regulate gene expressions.
As promoters are typically immediately adjacent to the gene in question, positions in the promoter are designated relative to the transcriptional start site, where transcription of RNA begins for a particular gene (i.e., positions upstream are negative numbers counting back from -1, for example -100 is a position 100 base pairs upstream).
In eukaryotes, the process is complex and promoters may occur hundreds of base-pairs upstream.
66

In prokaryotes, the promoter consists of two short sequences at -10 (that is 10 bases up- stream from the UTR and is called the Pribnow box which typically looks like TATAAT) and at -35 (the -35 element, usually TTGACAT). The promoter regions are not tran- scribed to RNA.
Untranslated regions (UTR 5’ or 3’) : These are regions immediately flanking the trans- lated region. They are transcribed into RNA but not translated into proteins.
Exons and introns: An exon is transcribed into RNA and is further translated into a protein. An intron is transcribed into a form of RNA and then spliced out of the RNA sequence that finally gets translated in a protein. In any gene, there could be one or many exons and zero or many introns. Exons and introns alternate along the sequence.
Figure 11: A figure showing how the transcribed precursor to messenger RNA includes the UTRs, exons and introns. The introns are spliced out to form the messenger RNA. The Exons in the mRNA are translated into proteins. Souce: http://en.wikipedia. org/wiki/File:Pre-mRNA_to_mRNA.svg
Poly(A) signal: After the 3’ UTR on the RNA, a number of adenines (As) is added — this is called the poly(A) tail. The poly(A) signal, or polyadenylation signal, is thus a stretch of DNA that signals to the RNA transcription mechanism to begin the addition of the poly(A) tail.
A method first described in Burge and Karlin 1997 (see http://www.ncbi.nlm.nih. gov/pubmed/9149143) describes a generalized HMM incorporating all these regions. See Figure 12 for a sketch of the structure of the HMM. The states of the HMM are N (cor- responding to an inter-genic region), P (promotor), F (5’ UTR), E (exons), I (introns), T (3’ UTR) and A (poly(A) signal). The multiple states for introns, I0, I1 and I2 and exons E0, E1 and E2 indicate the relation of the reading frame of the exon to the reading frame of the initial exon (Einit). Recall that three bases of DNA code for a single amino acid, with each group of 3 bases call a codon. If an exon or intron does not have length that is a multiple of 3, then the start of the next exon or intron may be out of phase with it. A subscript of 0, 1 or 2 represents in phase or lagging 1 or 2 steps out of phase, respectively.
The GHMM produces a set of states q = q1 . . . qn with an associated set of lengths (durations) d = d1 . . . dn and for each state qi, it produces a sequence of length di according to a probability model associated with the state qi. Algorithms to analyse sequences according to this model are implemented in Genscan and GlimmerHMM.
67

86
Gene Structure Pr
tal functional units of a eukaryotic gene, e. intron, intergenic region, etc. (see Figure le details), which may occur in any biologica sistent order. Note that introns and intern in our model are divided according to “ which is closely related to the reading fram an intron which falls between codons is co phase 0; after the Ærst base of a codon, after the second base of a codon, phase 2, I0, I1, I2, respectively. Internal exons are s divided according to the phase of the prev tron (which determines the codon positio Ærst base-pair of the exon, hence the frame). For convenience, donor and accept sites, translation initiation and termination are considered as part of the associated exo
Reverse strand states and forward stran are dealt with simultaneously in this mode what similar to the treatment of both stran GENMARK program (Borodovsky & M 1993); see the legend to Figure 3. Though so similar to the model described by Kulp et al
Figure 12:
et al 1997. See text for details. The full HMM includes a mirror image corresponding
The structure of the (generalised) HMM used for gene finding from Burge
to the reverse strand which has been deleted here.
cludes: (1) single as well as multi-exon ge promoters, polyadenylation signals and in sequences; and (3) genes occuring on either DNA strands. In addition, as mentioned pre partial as well as complete genes are permit the occurrence of multiple genes in the s quence. Thus, the essential structure of m tebrate genomic sequences likely to be enco in genome sequencing projects can be descr this model structure. The most notable lim are that overlapping transcription units (p rare) cannot be handled and that alternati cing is not explicitly addressed.
The model, essentially of semi-Markov conveniently formulated as an explicit st ation Hidden Markov Model (HMM) of described by Rabiner (1989). BrieØy, the though of as generating a “parse” f, co of an ordered set of states, q~ à fq1 ; q2 with an associated set of lengths (du d~ à fd1;d2;…;dng which, using prob models of each of the state types, generates sequence S of length Là⌃ni à 1 di. The ge of a parse corresponding to a (pre-deÆ quence length L is as follows:
(1) An initial state q is chosen accor
Figure 3. Each circle or diamond represents a functional unit (state) of a gene or genomic region: N, intergenic region; P, promoter; F, 50 untranslated region (extending from the start of transcription up to the translation in- itiation signal); Esngl, single-exon (intronless) gene (trans- lation start ! stop codon); Einit, initial exon (translation start ! donor splice site); Ek (0 4 k 4 2), phase k in- ternal exon (acceptor splice site ! donor splice site); Eterm, terminal exon (acceptor splice site ! stop codon); T, 30 untranslated region (extending from just after the stop codon to the polyadenylation signal); A, polyade-
68
our model is substantially more general in t
nylation signal; and Ik (0 4 k 4 2), phase k intron (see 1
d
g g
a
e n
p d
i n o
n d l
d c
m . h
t
v t a o
u
a t m
n r
n n