Homework #3 for Bioimaging and Bioinformatics
BME2210 – Spring 2017
Bioinformatics portion
Due (as 3 .cpp files to the ICON dropbox) by 11am on Wednesday, February 8
Sequence Alignment Basics
Part 1: Create a file hw3.1.cpp. Write a program that prompts the user for two DNA sequence
strings and scores an alignment.
- Each A or T match contributes a score of +2 (so A matching A, T matching T, A matching T, and so on…)
- Each C or G match contributes a score of +3
- Each non-match, including mismatches and gaps, contributes a score of -2
You may assume that the input is valid (only DNA bases), although you should allow the user to input the sequence in either upper- or lower-case letters. For example, two sequences and their corresponding score could be:
Sequence 1: ATGCTGACTGCA Sequence 2: CTTGAGACG
A/T score = 3*2= 6 C/G score = 3*3= 9 Non-match = 6*-2= -12 Total score = 3
Part 2: Create a file hw3.2.cpp. You can start with the program you wrote in part 1 and modify it if you like. In this one, you are to assess whether a given RNA sequence could represent a coding strand (CDS) for a protein (assume no introns) – i.e., if it can represent a “valid” messenger RNA (mRNA) with a start codon and in-frame stop codon. To simplify things, input only a single sequence instead of 2 sequences. Scan that sequence in all 6 reading frames (that is, in both the forward and reverse directions, and in all 3 reading frames for each of those directions) determine if there is at least one start codon (AUG) with an in-frame stop codon (UAA, UAG or UGA). The output for the given sequence should be either “No”, if the sequence does not appear to be a valid mRNA, or “Yes:<xx>”, if the sequence appears to be a valid mRNA, and where <xx> is the number of amino acids in the resulting translated protein.
You can assume for this assignment that the given sequence encodes at most one valid CDS
region, and you can also go ahead and stop at the first stop codon that you encounter (although in a REAL bioinformatics program, you would have to find and account for / deal with all such possibilities).
For example, the sequence GGGAUGAAAUAACCC would result in an output of “Yes:2”.
Part 3: Create a file hw3.3.cpp. Copy your answer to part 1 above (not 2) and modify accordingly. This time you may again assume that the input is valid and do not need to verify as you did in part 2.
This script will try all overlapping combinations to find the maximum alignment score. You can add gaps on the front/back of the sequences to facilitate the analysis. You do not need to insert internal gaps. Use the same scoring function that you did in part 1 above (i.e., A or T matches = +2, C or G = +3, and mismatch or gap = -2). To simplify the assignment, any character paired with a dash (‘–‘) should be called a gap. However, a gap character paired with another gap character (at the sequence ends) is considered to not be properly part of the alignment – i.e., either skip those scenarios entirely in your scoring process, or else give them a score of 0 (which has the same effect, and might be easier to implement, depending on how you go about it).
For example:
Enter the first DNA sequence: ATGCTGACTGCA Enter the second DNA sequence: CTTGAGACG
seq1 ---------ATGCTGACTGCA--------- seq2 CTTGAGACG--------------------- 0 A/T matches 0 C/G matches
21 mismatches + gaps score: -42
seq1 ---------ATGCTGACTGCA--------- seq2 -CTTGAGACG-------------------- 0 A/T matches 0 C/G matches
20 mismatches + gaps score: -40
seq1 ---------ATGCTGACTGCA--------- seq2 --CTTGAGACG------------------- 0 A/T matches 0 C/G matches
19 mismatches + gaps score: -38
seq1 ---------ATGCTGACTGCA--------- seq2 ---CTTGAGACG------------------ 1 A/T matches 1 C/G matches
16 mismatches + gaps score: -27
...(others not shown)...
seq1 ---------ATGCTGACTGCA--------- seq2 ---------------------CTTGAGACG 0 A/T matches 0 C/G matches
21 mismatches + gaps score: -42
Maximum score = ???
Hints: Although there are many ways to do part 3, it is surprisingly difficult. You have to be mindful of the end of the sequences (i.e., do not try to access a character position that does not exist), and careful with loop indices. For my solution, I made sure I always knew which string was the longest. I also made a function call “rotateSeqRight” that simply took a sequence as input, and rotated all the characters right by one position (although if you use character arrays instead of the string class, which I do not recommend b/c it would be more difficult, then make sure to exempt the null terminating character from this procedure). If you can’t get your program to do all the comparisons, at least try to get the strings set up with dashes and do the first comparison and scoring for some partial credit on part 3. Also, in case it helps, the following pages have some code in Matlab to help get you started on the algorithm (although remember to actually do the assignment in C++).
GRADING Rubric (100 points total):
30 points for working code, part 1. 30 points for working code, part 2. 40 points for working code, part 3.
NOTE: If your program does not compile you will receive a zero on that part of your homework. In addition, late homework will not be accepted. DO NOT WORK TOGETHER! Students caught working together on this or any assignment will drop a whole letter grade for the course!
% Code to start Part 3 from.
function HW2_3
% Ask the user for DNA sequence 1.
prompt = ‘Enter the first DNA sequence of length 1 to 99: ‘; seq1 = input(prompt, ‘s’);
len1 = length(seq1);
sprintf(‘ Seq 1 %s has %d bases’, seq1, len1)
% Ask the user for DNA sequence 2.
prompt = ‘Enter the second DNA sequence of length 1 to 99: ‘; seq2 = input(prompt, ‘s’);
len2 = length(seq2);
sprintf(‘ Seq 2 %s has %d bases’, seq2, len2)
% Guess sequence 1 is longer than sequence 2.
longSeq = seq1;
longLen = len1;
shortSeq = seq2;
shortLen = len2;
% Correct an incorrect guess.
if (len2 > len1)
longSeq = seq2; longLen = len2; shortSeq = seq1; shortLen = len1;
end
% Create “buffered” strings to facilitate the scan.
bufLength = longLen + 2 * shortLen; scan1 = blanks(bufLength);
scan2 = blanks(bufLength);
% Fill the “scan” strings with dashes for i = 1:bufLength
scan1(i) = ‘-‘;
scan2(i) = ‘-‘; end
% Initialize the first scan string with the long sequence in the middle.
for i = shortLen + 1:shortLen + longLen scan1(i) = longSeq(i-shortLen);
end
% Initialize the second scan string with the short sequence at the % beginning.
for i = 1:shortLen
scan2(i) = shortSeq(i);
end
% Score & print the initial alignment.
score = scoreSequences(seq1, seq2);
printSeq(scan1, scan2, score);
% Rotate the short sequence to the right, score & print. scan2 = rotateSeqRight(scan2);
score = scoreSequences(seq1, seq2);
printSeq(scan1, scan2, score);
% Repeat in a loop!!
end
% Use your scoring logic from part 1.
function score = scoreSequences(seq1, seq2) % To Do!
score = 0;
end
% Print out the sequences and score.
function printSeq(seq1, seq2, score)
fprintf(1, ‘%s\n’, seq1); fprintf(1, ‘%s\n’, seq2); fprintf(1, ‘Score: %d\n’, score);
end
% Rotate the sequence to the right.
function seq = rotateSeqRight(seq) seqLen = length(seq);
last = seq(seqLen);
for i=seqLen:-1:2
seq(i) = seq(i-1);
end
seq(1) = last;
end