Assignment 2. Due: through course web (Moodle) on 9/22/2016, before 11:55 PM. Note: See the course web page for the late turn-in policy.
Note: Collaboration policy: see “Basic Information” link on the course web page for more details.
Problem 1. Sequence Alignment. 15 points.
Calculate and show the Dynamic Programming matrix and an optimal alignment for the DNA sequences GCTTAGC and GCATTGC, scoring +3 for a match, -2 for a mismatch, and with a gap penalty of 3 (i.e., each gap column contributes -3). (If there are more than one optimal alignments, report any of those.)
Problem 2. Sequence Alignment. 15 points.
Find the optimal pairwise global alignment of the sequences ATGACCATGAC and ACTGATCACTGAT with the condition that the C nucleotides shown in bold font
must be aligned with each other. What is the optimal alignment score? The scoring parameters are defined as +2 for match, -1 for mismatch and -2 for gap. Show the dynamic programming matrix you used. (And please do not create a 11×13 matrix!)
Problem 3. Pattern Matching. 15 points.
Build a keyword tree, with fail edges (as in the Aho-Corasick algorithm discussed in class) for the following dictionary. Mark with dashed arrows the “fail edges” that do not go back to the root. That is, you do not need to show fail edges going back to the root. You may, but are not required to mark the nodes that correspond to pattern matches.
Count and note down the number of fail edges your keyword tree has. (Again, only the fail edges you marked, i.e., fail edges that do not go back to the root.)
FACED CEDER DERIVE RIVAL VALENCE
Problem 4. Sequence alignment. 50 points.
Note: you may use any programming language you wish. You are not allowed to use
any existing libraries or programs that are implemented to align sequences.
Your first task is to implement a dynamic programming algorithm that finds the optimal global alignment of two input DNA sequences. The program should take as input:
a. Two “Fasta” format files, each with one DNA sequence. See http://en.wikipedia.org/wiki/FASTA_format for help on this format. See http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/assn2.file1.fa for a sample fasta format file. The program should be able to handle sequences of length 1000 bp each.
b. A text file containing a 4 x 4 matrix of substitution scores, to be used in the alignment (see sample below). You may use the following file as a sample to develop/debug your code: http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/subs.txt
Example of a nucleotide substitution matrix:
ACGT A 91 -114 -31 -123 C -114 100 -125 -31 G -31 -125 100 -114 T -123 -31 -114 91
c. A negative real number representing the “gap penalty”.
The program will be run from the Linux command-line as:
<programfilename> <fastafilename1> <fastafilename2> <substitutionmatrixfile> <gap-penalty>
The output of the program should be a textual display of the optimal global alignment (any one, if more than one optimal solutions exist). In particular:
a. First, there should be a statement of the form: “The optimal alignment between given sequences has score X.”
b. Next, the alignment should be output in the format given by the following example:
ACCC--AC--AT-A...AT AC-CGGACATATTA...AT
The entire alignment should be in two lines, one line for each input sequence.
Note that since you will be implementing a dynamic programming alignment algorithm, we expect your program to be of quadratic time complexity (i.e., two sequences of length n can be aligned in O(n2) time).
Your second task is to write a program (called “create_data_and_align” below) that creates a pair of “related” sequences and aligns them using the alignment program from the first task.
The “create_data_and_align” program should take as input a single integer L. It should then create two “related” DNA sequences as follows:
- Create a random DNA sequence of length L, with uniform nucleotide probabilities. (I will discuss in class how to create a random sequence.) For now, we will call this sequence ‘S1’. Please use the current time to seed your pseudo-random number generator.
- Make int(L/10) random changes to sequence S1 to create sequence S2. To make a random change, randomly choose a position in the sequence, then mutate it to something else with probability 1⁄2 and delete that position (the single nucleotide at that position) with probability 1⁄2.
- Make int(L/10) random changes to sequence S1 to create sequence S3.
- Write sequences S2 and S3 from the previous steps into two separate Fasta
format files.
The “create_data_and_align” program should then call your alignment program from the first task above, and thus produce a new text file that has the optimal alignment of sequences S2 and S3. Call your alignment program using the subs.txt scoring matrix mentioned above and gap penalty = -500.
Your third task is to run the “create_data_and_align” program ten times, and store the resulting alignment files in files named aln1.txt, aln2.txt, … aln10.txt. Each run should use L = 100. (and a gap penalty of -500)
Your fourth task is to run the alignment program you created in the first task on the two fasta format files included in the following downloadable tarball: http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/assn2-alnproblem.tar.gz
Write your alignment in a file called “mel-vir-aln.txt”.
Use the substitution matrix file “subs.txt” linked to above, and a gap penalty of -200.
What to turn in for problem 4:
Create a tarball or zip archive of a directory containing the following files, and upload along with your solution to the assignment.
a. Files that have your code. (The alignment program, the “create_data_and_align” program.)
b. The files aln*.txt created from your third task above.
c. The alignment file “mel-vir-aln.txt” from your fourth task above.
Note: the TA is not going to run your programs, and you are going to submit the code only as part of our records that you did indeed implement them.