HW3
Homework #3 for Bioimaging and Bioinformatics
BME2210 – Spring 2017
Bioinformatics portion
Due (as 3 .cpp files to the ICON dropbox) by 11am on Wednesday, February 8
Sequence Alignment Basics
Part 1: Create a file hw3.1.cpp. Write a program that prompts the user for two DNA sequence
strings and scores an alignment.
• Each A or T match contributes a score of +2 (so A matching A, T matching T, A matching
T, and so on…)
• Each C or G match contributes a score of +3
• Each non-match, including mismatches and gaps, contributes a score of -2
You may assume that the input is valid (only DNA bases), although you should allow the user to
input the sequence in either upper- or lower-case letters. For example, two sequences and
their corresponding score could be:
Sequence 1: ATGCTGACTGCA
Sequence 2: CTTGAGACG
A/T score = 3*2= 6
C/G score = 3*3= 9
Non-match = 6*-2= -12
Total score = 3
Part 2: Create a file hw3.2.cpp. You can start with the program you wrote in part 1 and modify
it if you like. In this one, you are to assess whether a given RNA sequence could represent a
coding strand (CDS) for a protein (assume no introns) – i.e., if it can represent a “valid”
messenger RNA (mRNA) with a start codon and in-frame stop codon. To simplify things, input
only a single sequence instead of 2 sequences. Scan that sequence in all 6 reading frames (that
is, in both the forward and reverse directions, and in all 3 reading frames for each of those
directions) determine if there is at least one start codon (AUG) with an in-frame stop codon
(UAA, UAG or UGA). The output for the given sequence should be either “No”, if the sequence
does not appear to be a valid mRNA, or “Yes:
and where
You can assume for this assignment that the given sequence encodes at most one valid CDS
region, and you can also go ahead and stop at the first stop codon that you encounter (although
in a REAL bioinformatics program, you would have to find and account for / deal with all such
possibilities).
For example, the sequence GGGAUGAAAUAACCC would result in an output of “Yes:2”.
Part 3: Create a file hw3.3.cpp. Copy your answer to part 1 above (not 2) and modify
accordingly. This time you may again assume that the input is valid and do not need to verify as
you did in part 2.
This script will try all overlapping combinations to find the maximum alignment score. You can
add gaps on the front/back of the sequences to facilitate the analysis. You do not need to insert
internal gaps. Use the same scoring function that you did in part 1 above (i.e., A or T matches =
+2, C or G = +3, and mismatch or gap = -2). To simplify the assignment, any character paired
with a dash (‘–‘) should be called a gap. However, a gap character paired with another gap
character (at the sequence ends) is considered to not be properly part of the alignment – i.e.,
either skip those scenarios entirely in your scoring process, or else give them a score of 0 (which
has the same effect, and might be easier to implement, depending on how you go about it).
For example:
Enter the first DNA sequence:
ATGCTGACTGCA
Enter the second DNA sequence:
CTTGAGACG
seq1 ———ATGCTGACTGCA———
seq2 CTTGAGACG———————
0 A/T matches
0 C/G matches
21 mismatches + gaps
score: -42
seq1 ———ATGCTGACTGCA———
seq2 -CTTGAGACG——————–
0 A/T matches
0 C/G matches
20 mismatches + gaps
score: -40
seq1 ———ATGCTGACTGCA———
seq2 –CTTGAGACG——————-
0 A/T matches
0 C/G matches
19 mismatches + gaps
score: -38
seq1 ———ATGCTGACTGCA———
seq2 —CTTGAGACG——————
1 A/T matches
1 C/G matches
16 mismatches + gaps
score: -27
…(others not shown)…
seq1 ———ATGCTGACTGCA———
seq2 ———————CTTGAGACG
0 A/T matches
0 C/G matches
21 mismatches + gaps
score: -42
Maximum score = ???
Hints: Although there are many ways to do part 3, it is surprisingly difficult. You have to be
mindful of the end of the sequences (i.e., do not try to access a character position that does not
exist), and careful with loop indices. For my solution, I made sure I always knew which string
was the longest. I also made a function call “rotateSeqRight” that simply took a sequence as
input, and rotated all the characters right by one position (although if you use character arrays
instead of the string class, which I do not recommend b/c it would be more difficult, then make
sure to exempt the null terminating character from this procedure). If you can’t get your
program to do all the comparisons, at least try to get the strings set up with dashes and do the
first comparison and scoring for some partial credit on part 3. Also, in case it helps, the
following pages have some code in Matlab to help get you started on the algorithm (although
remember to actually do the assignment in C++).
GRADING Rubric (100 points total):
30 points for working code, part 1.
30 points for working code, part 2.
40 points for working code, part 3.
NOTE: If your program does not compile you will receive a zero on that part of your homework.
In addition, late homework will not be accepted. �DO NOT WORK TOGETHER! Students caught
working together on this or any assignment will drop a whole letter grade for the course! �
% Code to start Part 3 from.
function HW2_3
% Ask the user for DNA sequence 1.
prompt = ‘Enter the first DNA sequence of length 1 to 99: ‘;
seq1 = input(prompt, ‘s’);
len1 = length(seq1);
sprintf(‘ Seq 1 %s has %d bases’, seq1, len1)
% Ask the user for DNA sequence 2.
prompt = ‘Enter the second DNA sequence of length 1 to 99: ‘;
seq2 = input(prompt, ‘s’);
len2 = length(seq2);
sprintf(‘ Seq 2 %s has %d bases’, seq2, len2)
% Guess sequence 1 is longer than sequence 2.
longSeq = seq1;
longLen = len1;
shortSeq = seq2;
shortLen = len2;
% Correct an incorrect guess.
if (len2 > len1)
longSeq = seq2;
longLen = len2;
shortSeq = seq1;
shortLen = len1;
end
% Create “buffered” strings to facilitate the scan.
bufLength = longLen + 2 * shortLen;
scan1 = blanks(bufLength);
scan2 = blanks(bufLength);
% Fill the “scan” strings with dashes
for i = 1:bufLength
scan1(i) = ‘-‘;
scan2(i) = ‘-‘;
end
% Initialize the first scan string with the long sequence in the middle.
for i = shortLen + 1:shortLen + longLen
scan1(i) = longSeq(i-shortLen);
end
% Initialize the second scan string with the short sequence at the
% beginning.
for i = 1:shortLen
scan2(i) = shortSeq(i);
end
% Score & print the initial alignment.
score = scoreSequences(seq1, seq2);
printSeq(scan1, scan2, score);
% Rotate the short sequence to the right, score & print.
scan2 = rotateSeqRight(scan2);
score = scoreSequences(seq1, seq2);
printSeq(scan1, scan2, score);
% Repeat in a loop!!
end
% Use your scoring logic from part 1.
function score = scoreSequences(seq1, seq2)
% To Do!
score = 0;
end
% Print out the sequences and score.
function printSeq(seq1, seq2, score)
fprintf(1, ‘%s\n’, seq1);
fprintf(1, ‘%s\n’, seq2);
fprintf(1, ‘Score: %d\n’, score);
end
% Rotate the sequence to the right.
function seq = rotateSeqRight(seq)
seqLen = length(seq);
last = seq(seqLen);
for i=seqLen:-1:2
seq(i) = seq(i-1);
end
seq(1) = last;
end