EE6435 Programming homework on hidden Markov model
. Python. Debug may take you time.
Full mark: 70 pts
Mark your calendar. No late work will be graded.
Implementation and application of the Viterbi algorithm
The base composition of most genome sequences is not homogeneous. In particular, GC composition can vary regionally. In the human genome, for instance, there are “CpG islands” which show a very strong statistical signal (even stronger at the dinucleotide composition level than the mononucleotide composition level) and which tend to mark the 5′ end of many genes. Therefore the problem of objectively segmenting a genome sequence into regions of different compositions arises naturally. One way to segment a genome is with HMM algorithms.
Let’s assume that a genome can be modeled as a two-state HMM, with two states A and B. Assume that the parameters of this HMM are as follows. State A has an AT-biased emission distribution {0.35, 0.15, 0.15, 0.35} for the probabilities of {A,C,G,T}. State B has a GC-biased emission distribution {0.15, 0.35, 0.35, 0.15}. State A switches to state B with probability 0.001. State B switches back to state A with probability 0.01. Assume that the initial distribution for the HMM is uniform; 0.5, 0.5 for the two states.
The format of the hidden Markov model can be found in example.hmm and example.hmm.readme.
The file example.fa contains a simulated 1 Mb genome sequence, generated by this HMM. Lower case residues were generated by state A; upper case residues by state B. Note that *.fa means a format named ¡°fasta¡±. In fasta files, the first line is a header starting with ¡°>¡±. This is the standard format for describing sequences. The line starts with ¡°>¡± contains the information of the following sequence and the Viterbi parsing is applied to the sequence starting with the next line (not the line with >).
Viterbi parsing
Implement a Viterbi parser for the HMM above using Python, including the initialization, matrix fill, and traceback stages. Your program should output the alignment as a series of segments using the following
format.:
1 153 state A (or state 1)
154 252 state B (or state 2)
253 1651 state A
(… etc…)
Each line is a continuous segment with the same state. The format of each line is: start-position end-position state X
Note that the start position is 1-indexed (not 0-indexed)
Run your program on the 1 Mb simulated sequence in example.fa. How many segments of the genome are in state B? Output this information as well. (The true state path is in the file example.positions.)
Name your program as
Your program should run like this: your-student-ID.py
Don’t hard code any input parameters because we will change them during testing. For example, my testing file may be named as “test3.txt” or “inputdata.fa”. But they always have the fasta format.
When we test your program, we will run additional input files besides the one provided on Canvas. Below please find the detailed instructions about what to submit.
1. The source codes
2. A readme.txt file: include the complete command line to run your program, tell us
the environment and version of Python you used.
3. In your source codes, clearly comment the part about reading files, table generation,
table initialization, table filling, and the traceback part.
More details about grading.
1. Can generate correct results on 3 input files (each 15 pts, total 45 pts). The three
input files have the same format as the example.fa. It is a fasta format file and contain multiple lines or just two lines.
e.g.
>test1
AACCGGA Or
>test2 AACGG AGCT AAACGTA
2. The format (input, output, not hardcode any inputs) follows the instructions (15 pts, no partial credit)
3. The output of your program on example.fa. How many segments of the genome are in state B? Output this information as well. (10 pts)