HW4
Homework #4 for Bioimaging and Bioinformatics
BME2210 – Spring 2017
Bioinformatics portion
Due (as 5 files – 4 .cpp files and 1 .txt file – to the ICON dropbox) by 11am on Monday, February
20
Implement Needleman-Wunsch algorithm
Implement the global alignment Needleman-Wunsch algorithm for protein sequences in an
iterative fashion (i.e., do not use recursion, which simplifies the implementation for those that
do not have experience with recursion.) For an iterative solution, we will use 3 matrices:
• The “A” matrix will store the final alignment scores.
• The “B” matrix is simply an initialized “A” matrix with Blosum62 scores for each pair of
residues.
• The “D” matrix will store our directions.
If you wish, you can look at the sample code (included below), which provides you with
functions to:
• initialize A and/or B matrices
• print matrices, and
• use the BLOSUM62 substitution matrix for scoring
However, note that the sample code is written in Matlab, to help you understand the overall
algorithm, but you need to turn in your code for parts 1-4 in C++ (.cpp) files. Thus, while the
Matlab code limits the number of amino acids to 25, your code should use strings in order to
handle any size of input; and while the Matlab code merely initializes the matrix to be filled
with zeros (instead of undefined values), you actually need to fill the matrix with real values. Its’
main use is to show you how a matrix can be fairly easily printed.
Hint 1: C++ can easily handle padding a string with spaces using the setw and setfill commands,
for which you would need to include iomanip as well as iostream – i.e., as described at
http://www.cplusplus.com/reference/iomanip/setw/. (actually you won’t even need setfill(),
since spaces are the default padding material – I just mention that for future reference) If you
try to use the older C style of using a statement such as ‘printf (“% 3s\n”, x);’, for which you
would need to include stdio.h, you will have a much harder time printing off the matrices in this
assignment, since you will have to figure out a way to convert an object of string class (and
being of any arbitrarily-given size) to the older C-style ‘character array’ that printf demands.
Thus, I highly recommend the C++ style, as shown in the code below where I print an arbitrary
string (here, a value of -2) with an arbitrary amount of space-padding (here, 3 spaces, although
note that the maximum width of any value in the BLOSUM62 matrix is only 2 character spaces,
since its’ maximum value is ‘11’ and its’ minimum value is ‘-4’):
#include
#include
#include
using namespace std;
int main()
{
string x=”-2″;
cout << setw(3) << x << endl;
//somewhat equivalent to: printf ("% 3s\n", x);
return 0;
}
Hint 2: the Needleman-Wunsch algorithm is described in the Reading Material (especially L8-
PairwiseAlignment1-Fall2015.pdf), and in even more helpful detail in the book (chapter 3,
especially pages 76-80, and it may be helpful to keep going and read portions of the rest of the
chapter, like Smith-Waterman on 82, heuristics on page 84, dot plots on 85, etc.). Pay attention
to the details – e.g., each matrix is of dimensionality (n+1) * (m+1), where the length of the
input sequence 1 is n and the length of the input sequence 2 is m. In other words, these
matrices are one dimension longer (in both # of rows and columns) than the lengths of the two
sequences. (do you see why this is so? the 1 added to the n is the left column, and the 1 added
to the m is the upper row, shown in Figure 3.21(a) in the online book; note that these are filled
in first before anything else is added to the matrix, as shown in parts (b)-(g) of that figure).
Hint 3: You already know that C++ array indices start with 0, so that the first position of a string
is stored in array index 0, the second in array index 1, etc. However, Needleman-Wunsch
introduces a twist, in that b/c of the additional dimension added to both the row and the
column (added at the 0th position), the first position of the sequence 1 corresponds to the
SECOND column of the matrix, which ends up making its’ array index equal to 1 after all. The
same applies to sequence 2 and the rows of the matrix. Rather than attempt to describe this
further in 1,000+ words, here is a picture instead:
0 1 2 3 4 5 Sequence Indices
A R N C D E Sequence Characters
0 1 2 3 4 5 6 Matrix Indices
0 0 -2 -4 -6 -8 -10 -12
0 A 1 -2 4 -1 -2 0 -2 -1
1 R 2 -4 -1 5 0 -3 -2 0
2 N 3 -6 -2 0 6 -3 1 0
3 D 4 -8 -2 -2 1 -3 6 2
4 C 5 -10 0 -3 -3 9 -3 -4
5 Q 6 -12 -1 1 0 -3 0 2
6 E 7 -14 -1 0 0 -4 2 5
Hint 4: To forestall questions from people who are not careful readers, I will point out that
while some parts of the online book talks about using an identity matrix to score the alignment,
other parts talk about using a BLOSUM62 one instead. e.g., Figure 3.20 demonstrates an
example with BLOSUM62. The lesson here is: do not simply read a sentence out of context! Yes,
you should actually read the WHOLE thing, if you want to know how it works.
Part 1: Create a file hw4.1.cpp. In this part, simply initialize the “B” matrix and print it out.
Scoring scheme:
• Gap penalty of -2
• Match/mismatch scores from BLOSUM62
You can get the latter from: ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62. Note that you are
not required to write code to read this matrix from a file – you may hard-code it into your
program as a matrix if you prefer.
Example output:
Sequence 1: ARNDCQE
Sequence 2: ARNCDE
B matrix
A R N C D E
0 -2 -4 -6 -8 -10 -12
A -2 4 -1 -2 0 -2 -1
R -4 -1 5 0 -3 -2 0
N -6 -2 0 6 -3 1 0
D -8 -2 -2 1 -3 6 2
C -10 0 -3 -3 9 -3 -4
Q -12 -1 1 0 -3 0 2
E -14 -1 0 0 -4 2 5
Part 2: Create a file hw4.2.cpp. In this part, now create the “A” matrix and print it out.
(Remember that BEFORE you do the calculations to obtain the final values that this matrix will
contain, you will need to initialize this matrix the same way that you initialized the “B” matrix.
e.g., you could simply copy the “B” matrix into “A”, if that is easier for you. Although I found it
easier to just put the initialization code into a function, and then I called it twice, first with B
and then with A.)
Hint: except for the first column and first row, each matrix score is the maximum of 3
possibilities: an alignment of two residues, a gap in sequence 1, or a gap in sequence 2.
Continuing the example output from above:
A matrix
A R N C D E
0 -2 -4 -6 -8 -10 -12
A -2 4 2 0 -2 -4 -6
R -4 2 9 7 5 3 1
N -6 0 7 15 13 11 9
D -8 -2 5 13 12 19 17
C -10 -4 3 11 22 20 18
Q -12 -6 1 9 20 22 22
E -14 -8 -1 7 18 22 27
score = 27
(do you see where “score =” comes from? hint: bottom-right corner of the matrix…)
Part 3: Create a file hw4.3.cpp. In this part, now create the “D” matrix and print out the
contents.
As you were calculating the A matrix in part 2 above, you had to select a direction from which
to calculate the new sub-totals (Vertical (V), Horizontal (H), or Diagonal (D) – these are the
arrows shown in the lecture slides / reading material / online book figures). Keep track of these
directions in this “D” matrix. Print out this D matrix as part of your answer. If more than one
direction is possible that includes Diagonal (D), then simply select D. If more than one direction
is possible but D is not included (i.e., H and V only), then arbitrarily select V. Note that you can
initialize the first row/column to all H/V.
[Optional]: Instead of picking arbitrarily, you could encode the combinations of different
directions with additional characters. I’d suggest the following scheme, for compatibility with
other people’s code: for V & H => +, for D & V => y, for D & H => 2, and for V, D & H => *. Note
that one reason to do this is that if you do not deal with such scenarios, then the code becomes
non-deterministic (meaning that if I reverse the order of the two input sequences – i.e., swap
sequence #1 and #2 – I can get a different answer, mainly b/c of arbitrarily picking V, and
therefore swapping H & V will affect the outcome).
Continuing the sample output from above:
D matrix
A R N C D E
0 H H H H H H
A V D H H H H H
R V V D H H H H
N V V V D H H H
D V V V V D D H
C V V V V D H H
Q V V V V V D D
E V V V V V D D
Part 4: Create a file hw4.4.cpp. In this part, it is now straight-forward to use the “D” matrix to
generate the alignment. Do that, and then print out that alignment.
Continuing the sample output from above, the alignment would be:
ARNDCQE
||| | |
ARN-CDE
Note how the vertical bar is only shown for identical letters, but not for gaps or mismatches.
[Optional Challenge]. Purely for your own interest, you could implement this Needleman-
Wunsch algorithm using recursion. There are a number of reasons that you might want to do
this – but if you end up doing that, which do you find that you prefer?
Part 5: Create a plain-text file hw4.5.txt. In this part, write a few sentences to briefly discuss
(i.e., just a few sentences each) the following issues:
a) What were the major obstacles/hurdles that you had to overcome during this
assignment?
a. Personally, was it your mastery of the tool that you used (C++), or basic biology
(knowing what an ‘amino acid’ is), or bioinformatics (knowing what an
‘alignment’ is), or just finding time to do the homework? Those are all honest
and valid responses to this question.
b. What I am particularly interested in, though, is: what were the major engineering
challenges that had to be overcome to solve problems of this type?
b) Now that you have code that implements the Needleman-Wunsch algorithm, what
would be the next step(s) for someone to take? (don’t worry, it doesn’t have to be you,
at least not for this class:-)
a. If you were going to make improvements to your code, what are some ideas that
you could start off with, and why might you want to do those improvements?
(I gave you some ideas above, like adding symbols that go beyond just H/V/D,
which would allow more options to be explored instead of being completely
arbitrary; and another idea would be to read in the BLOSUM62 matrix from a file
instead of hard-coding it in, b/c that would allow you to use other, alternative
matrices in the future, such as BLOSUM45, or PAM250, etc. But since I gave you
these ideas, please pick other ones – note that recursion is fine to include, as
long as you state why it would be useful, but in that case please also include at
least one more idea as well)
b. Going beyond just this code, what are some more general improvements that
you could make, and why would you want to do them?
Hint: the main answer that I am looking for here is the name of an algorithm that
is better than Needleman-Wunsch…can you think of any?
c) Pretend that I am a world-class expert at the subject of bioinformatics (e.g., I know all
about alignments, and algorithms, and scoring, etc.), but yet somehow I knew nothing
about the Needleman-Wunsch algorithm (as if that were possible! Maybe I’m from an
alternate universe or something:-). Describe for me what Needleman-Wunsch is in
general, what it does in particular, and what it can do for me. Like, when should I use it?
When would I be better off not to use it, and should use something else instead?
Notably, what are the key concepts that distinguish it from all other bioinformatics
algorithms?
GRADING Rubric (100 points total):
15 points for working code, part 1.
15 points for working code, part 2.
30 points for working code, part 3.
40 points for working code, part 4.
5 points for answers, part 5.
NOTE: If your program does not compile you will receive a zero on that part of your homework.
In addition, late homework will not be accepted. �DO NOT WORK TOGETHER! Students caught
working together on this or any assignment will drop a whole letter grade for the course! �
% Code to start part HW4 from.
function HW4_0
% A trial sequence with all amino acids:
% ARNDCQEGHILKMFPSTWYV
% The trial sequences in the HW description:
% ARNDCQE
% ARNCDE
% Ask the user for protein sequence 1.
prompt = ‘Enter the first amino acid sequence of length 1 to 25: ‘;
seq1 = input(prompt, ‘s’);
len1 = length(seq1);
fprintf(1, ‘Seq 1 %s has %d residues\n’, seq1, len1);
% Ask the user for protein sequence 2.
prompt = ‘Enter the second amino acid sequence of length 1 to 25: ‘;
seq2 = input(prompt, ‘s’);
len2 = length(seq2);
fprintf(1, ‘Seq 2 %s has %d residues\n’, seq2, len2);
% Initialize a 25×25 matrix of zeros.
A = zeros(25,25);
% The supplied initMatrix must be modified!
A = initMatrix(A, seq1, seq2);
% Print the matrix A.
printMatrix(A, seq1, seq2);
end
% Print out a matrix
function mat = initMatrix(mat, seq1, seq2)
numRows = length(seq1) + 1;
numCols = length(seq2) + 1;
mat(1,1) = 0; % Init the value at (1,1) to 0.
% Initialize the first column.
for i=2:numRows
% Do something here!
end
% Initialize the first row.
for c=2:numCols
% Do something here!
end
% Init the rest of the matrix.
for r=2:numRows
for c=2:numCols
% Do something here!
end
end
end
% Print out a matrix
function printMatrix(mat, seq1, seq2)
len1 = length(seq1);
len2 = length(seq2);
% Print the first row of sequence letters
fprintf(1,’\t\t’);
for r=1:len2
fprintf(1,’%2c\t’, seq2(r));
end
fprintf(1,’\n’);
% Print the first row of scores.
fprintf(1,’\t’);
for c=1:len2+1
fprintf(‘%2d\t’, mat(1,c));
end
fprintf(1,’\n’);
% Print the rest of the scores.
for r=2:len1+1
% Print the first letter of vertical sequence
fprintf(1, ‘%c\t’, seq1(r-1));
for c=1:len2+1
fprintf(1,’%2d\t’,mat(r,c));
end
fprintf(1,’\n’);
end
end
function value = scorePair(char1, char2)
% Order: ARNDCQEGHILKMFPSTWYV
i1 = aaToInt(char1);
i2 = aaToInt(char2);
A = [
4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0;
-1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3;
-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3;
-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3;
0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1;
-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2;
-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2;
0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3;
-2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3;
-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3;
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1;
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2;
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1;
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1;
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2;
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2;
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0;
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3;
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1;
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4];
value = A(i1, i2);
end
function value = aaToInt(aa)
switch aa
case ‘A’,
value = 1;
case ‘R’,
value = 2;
case ‘N’,
value = 3;
case ‘D’,
value = 4;
case ‘C’,
value = 5;
case ‘Q’,
value = 6;
case ‘E’,
value = 7;
case ‘G’,
value = 8;
case ‘H’,
value = 9;
case ‘I’,
value = 10;
case ‘L’,
value = 11;
case ‘K’,
value = 12;
case ‘M’,
value = 13;
case ‘F’,
value = 14;
case ‘P’,
value = 15;
case ‘S’,
value = 16;
case ‘T’,
value = 17;
case ‘W’,
value = 18;
case ‘Y’,
value = 19;
case ‘V’,
value = 20;
otherwise,
value = 1;
end
end