Homework #4 for Bioimaging and Bioinformatics
BME2210 – Spring 2017
Bioinformatics portion
Due (as 5 files – 4 .cpp files and 1 .txt file – to the ICON dropbox) by 11am on Monday, February 20
Implement Needleman-Wunsch algorithm
Implement the global alignment Needleman-Wunsch algorithm for protein sequences in an iterative fashion (i.e., do not use recursion, which simplifies the implementation for those that do not have experience with recursion.) For an iterative solution, we will use 3 matrices:
- The “A” matrix will store the final alignment scores.
- The “B” matrix is simply an initialized “A” matrix with Blosum62 scores for each pair of
residues.
- The “D” matrix will store our directions.
If you wish, you can look at the sample code (included below), which provides you with functions to:
- initialize A and/or B matrices
- print matrices, and
- use the BLOSUM62 substitution matrix for scoring
However, note that the sample code is written in Matlab, to help you understand the overall algorithm, but you need to turn in your code for parts 1-4 in C++ (.cpp) files. Thus, while the Matlab code limits the number of amino acids to 25, your code should use strings in order to handle any size of input; and while the Matlab code merely initializes the matrix to be filled with zeros (instead of undefined values), you actually need to fill the matrix with real values. Its’ main use is to show you how a matrix can be fairly easily printed.
Hint 1: C++ can easily handle padding a string with spaces using the setw and setfill commands, for which you would need to include iomanip as well as iostream – i.e., as described at http://www.cplusplus.com/reference/iomanip/setw/. (actually you won’t even need setfill(), since spaces are the default padding material – I just mention that for future reference) If you try to use the older C style of using a statement such as ‘printf (“% 3s\n”, x);’, for which you would need to include stdio.h, you will have a much harder time printing off the matrices in this assignment, since you will have to figure out a way to convert an object of string class (and being of any arbitrarily-given size) to the older C-style ‘character array’ that printf demands. Thus, I highly recommend the C++ style, as shown in the code below where I print an arbitrary string (here, a value of -2) with an arbitrary amount of space-padding (here, 3 spaces, although note that the maximum width of any value in the BLOSUM62 matrix is only 2 character spaces, since its’ maximum value is ‘11’ and its’ minimum value is ‘-4’):
#include <string> #include <iostream> #include <iomanip> using namespace std;
int main()
{
string x=”-2″;
cout << setw(3) << x << endl;
//somewhat equivalent to: printf (“% 3s\n”, x); return 0;
}
Hint 2: the Needleman-Wunsch algorithm is described in the Reading Material (especially L8- PairwiseAlignment1-Fall2015.pdf), and in even more helpful detail in the book (chapter 3, especially pages 76-80, and it may be helpful to keep going and read portions of the rest of the chapter, like Smith-Waterman on 82, heuristics on page 84, dot plots on 85, etc.). Pay attention to the details – e.g., each matrix is of dimensionality (n+1) * (m+1), where the length of the input sequence 1 is n and the length of the input sequence 2 is m. In other words, these matrices are one dimension longer (in both # of rows and columns) than the lengths of the two sequences. (do you see why this is so? the 1 added to the n is the left column, and the 1 added to the m is the upper row, shown in Figure 3.21(a) in the online book; note that these are filled in first before anything else is added to the matrix, as shown in parts (b)-(g) of that figure).
Hint 3: You already know that C++ array indices start with 0, so that the first position of a string is stored in array index 0, the second in array index 1, etc. However, Needleman-Wunsch introduces a twist, in that b/c of the additional dimension added to both the row and the column (added at the 0th position), the first position of the sequence 1 corresponds to the SECOND column of the matrix, which ends up making its’ array index equal to 1 after all. The same applies to sequence 2 and the rows of the matrix. Rather than attempt to describe this further in 1,000+ words, here is a picture instead:
0123 4
ARNC D
5 Sequence Indices
E Sequence Characters
0 1 2 3 4 5
00 -2 -4 -6 -8-10-12 0A1-2 4-1-2 0-2-1 1R2-4-1 5 0-3-2 0 2N3-6-2 0 6-3 1 0 3D4-8-2-2 1-3 6 2 4C5-10 0-3-3 9-3-4 5Q6-12-1 1 0-3 0 2 6E7-14-1 0 0-4 2 5
Hint 4: To forestall questions from people who are not careful readers, I will point out that
6MatrixIndices
while some parts of the online book talks about using an identity matrix to score the alignment, other parts talk about using a BLOSUM62 one instead. e.g., Figure 3.20 demonstrates an example with BLOSUM62. The lesson here is: do not simply read a sentence out of context! Yes, you should actually read the WHOLE thing, if you want to know how it works.
Part 1: Create a file hw4.1.cpp. In this part, simply initialize the “B” matrix and print it out.
Scoring scheme:
- Gap penalty of -2
- Match/mismatch scores from BLOSUM62
You can get the latter from: ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62. Note that you are not required to write code to read this matrix from a file – you may hard-code it into your program as a matrix if you prefer.
Example output:
Sequence 1: ARNDCQE Sequence 2: ARNCDE
B matrix
ARNCDE 0 -2 -4 -6 -8-10-12 A-2 4-1-2 0-2-1 R-4-1 5 0-3-2 0 N -6 -2 0 6 -3 1 0 D-8-2-2 1-3 6 2 C-10 0-3-3 9-3-4 Q -12 -1 1 0 -3 0 2 E -14 -1 0 0 -4 2 5
Part 2: Create a file hw4.2.cpp. In this part, now create the “A” matrix and print it out. (Remember that BEFORE you do the calculations to obtain the final values that this matrix will contain, you will need to initialize this matrix the same way that you initialized the “B” matrix. e.g., you could simply copy the “B” matrix into “A”, if that is easier for you. Although I found it easier to just put the initialization code into a function, and then I called it twice, first with B and then with A.)
Hint: except for the first column and first row, each matrix score is the maximum of 3 possibilities: an alignment of two residues, a gap in sequence 1, or a gap in sequence 2.
Continuing the example output from above:
A matrix
ARNCDE 0 -2 -4 -6 -8-10-12 A-2 4 2 0-2-4-6 R -4 2 9 7 5 3 1 N-6 0 7151311 9 D-8-2 513121917 C-10-4 311222018 Q-12-6 1 9202222 E-14-8-1 7182227
score = 27
(do you see where “score =” comes from? hint: bottom-right corner of the matrix…)
Part 3: Create a file hw4.3.cpp. In this part, now create the “D” matrix and print out the contents.
As you were calculating the A matrix in part 2 above, you had to select a direction from which to calculate the new sub-totals (Vertical (V), Horizontal (H), or Diagonal (D) – these are the arrows shown in the lecture slides / reading material / online book figures). Keep track of these directions in this “D” matrix. Print out this D matrix as part of your answer. If more than one direction is possible that includes Diagonal (D), then simply select D. If more than one direction is possible but D is not included (i.e., H and V only), then arbitrarily select V. Note that you can initialize the first row/column to all H/V.
[Optional]: Instead of picking arbitrarily, you could encode the combinations of different directions with additional characters. I’d suggest the following scheme, for compatibility with other people’s code: for V & H => +, for D & V => y, for D & H => 2, and for V, D & H => *. Note that one reason to do this is that if you do not deal with such scenarios, then the code becomes non-deterministic (meaning that if I reverse the order of the two input sequences – i.e., swap sequence #1 and #2 – I can get a different answer, mainly b/c of arbitrarily picking V, and therefore swapping H & V will affect the outcome).
Continuing the sample output from above:
D matrix
ARNCDE 0HHHHHH AVDHHHHH
RVVDHHHH NVVVDHHH DVVVVDDH CVVVVDHH QVVVVVDD EVVVVVDD
Part 4: Create a file hw4.4.cpp. In this part, it is now straight-forward to use the “D” matrix to generate the alignment. Do that, and then print out that alignment.
Continuing the sample output from above, the alignment would be:
ARNDCQE ||| | | ARN-CDE
Note how the vertical bar is only shown for identical letters, but not for gaps or mismatches.
[Optional Challenge]. Purely for your own interest, you could implement this Needleman- Wunsch algorithm using recursion. There are a number of reasons that you might want to do this – but if you end up doing that, which do you find that you prefer?
Part 5: Create a plain-text file hw4.5.txt. In this part, write a few sentences to briefly discuss (i.e., just a few sentences each) the following issues:
- a) What were the major obstacles/hurdles that you had to overcome during this assignment?
- Personally, was it your mastery of the tool that you used (C++), or basic biology (knowing what an ‘amino acid’ is), or bioinformatics (knowing what an ‘alignment’ is), or just finding time to do the homework? Those are all honest and valid responses to this question.
- What I am particularly interested in, though, is: what were the major engineering challenges that had to be overcome to solve problems of this type?
- b) Now that you have code that implements the Needleman-Wunsch algorithm, what would be the next step(s) for someone to take? (don’t worry, it doesn’t have to be you,
at least not for this class:-)
- If you were going to make improvements to your code, what are some ideas that you could start off with, and why might you want to do those improvements?
(I gave you some ideas above, like adding symbols that go beyond just H/V/D, which would allow more options to be explored instead of being completely arbitrary; and another idea would be to read in the BLOSUM62 matrix from a file instead of hard-coding it in, b/c that would allow you to use other, alternative matrices in the future, such as BLOSUM45, or PAM250, etc. But since I gave you these ideas, please pick other ones – note that recursion is fine to include, as long as you state why it would be useful, but in that case please also include at least one more idea as well)
- Going beyond just this code, what are some more general improvements that you could make, and why would you want to do them?
Hint: the main answer that I am looking for here is the name of an algorithm that is better than Needleman-Wunsch…can you think of any?
c) Pretend that I am a world-class expert at the subject of bioinformatics (e.g., I know all about alignments, and algorithms, and scoring, etc.), but yet somehow I knew nothing about the Needleman-Wunsch algorithm (as if that were possible! Maybe I’m from an alternate universe or something:-). Describe for me what Needleman-Wunsch is in general, what it does in particular, and what it can do for me. Like, when should I use it? When would I be better off not to use it, and should use something else instead? Notably, what are the key concepts that distinguish it from all other bioinformatics algorithms?
GRADING Rubric (100 points total):
15 points for working code, part 1. 15 points for working code, part 2. 30 points for working code, part 3. 40 points for working code, part 4. 5 points for answers, part 5.
NOTE: If your program does not compile you will receive a zero on that part of your homework. In addition, late homework will not be accepted. DO NOT WORK TOGETHER! Students caught working together on this or any assignment will drop a whole letter grade for the course!
% Code to start part HW4 from. function HW4_0
% A trial sequence with all amino acids: % ARNDCQEGHILKMFPSTWYV
% The trial sequences in the HW description: % ARNDCQE % ARNCDE
% Ask the user for protein sequence 1. prompt = 'Enter the first amino acid sequence of length 1 to 25: '; seq1 = input(prompt, 's'); len1 = length(seq1); fprintf(1, 'Seq 1 %s has %d residues\n', seq1, len1);
% Ask the user for protein sequence 2. prompt = 'Enter the second amino acid sequence of length 1 to 25: '; seq2 = input(prompt, 's'); len2 = length(seq2); fprintf(1, 'Seq 2 %s has %d residues\n', seq2, len2);
% Initialize a 25x25 matrix of zeros. A = zeros(25,25);
% The supplied initMatrix must be modified! A = initMatrix(A, seq1, seq2);
% Print the matrix A. printMatrix(A, seq1, seq2);
end
% Print out a matrix function mat = initMatrix(mat, seq1, seq2)
numRows = length(seq1) + 1; numCols = length(seq2) + 1;
mat(1,1) = 0; % Init the value at (1,1) to 0.
% Initialize the first column. for i=2:numRows
% Do something here! end
% Initialize the first row. for c=2:numCols
% Do something here! end
% Init the rest of the matrix. for r=2:numRows
for c=2:numCols % Do something here!
end end
end
% Print out a matrix function printMatrix(mat, seq1, seq2)
len1 = length(seq1); len2 = length(seq2);
% Print the first row of sequence letters fprintf(1,'\t\t'); for r=1:len2
fprintf(1,'%2c\t', seq2(r)); end
fprintf(1,'\n');
% Print the first row of scores. fprintf(1,'\t'); for c=1:len2+1
fprintf('%2d\t', mat(1,c)); end
fprintf(1,'\n');
% Print the rest of the scores. for r=2:len1+1
% Print the first letter of vertical sequence fprintf(1, '%c\t', seq1(r-1)); for c=1:len2+1
fprintf(1,'%2d\t',mat(r,c)); end
fprintf(1,'\n'); end
end
function value = scorePair(char1, char2) % Order: ARNDCQEGHILKMFPSTWYV i1 = aaToInt(char1); i2 = aaToInt(char2);
A=[
4-1-2-2 0-1-1 0-2-1-1-1-1-2-1 1 0-3-2 0;
-1 5 0-2-3 1 0-2 0-3-2 2-1-3-2-1-1-3-2-3; -2 0 6 1-3 0 0 0 1-3-3 0-2-3-2 1 0-4-2-3; -2-2 1 6-3 0 2-1-1-3-4-1-3-3-1 0-1-4-3-3;
0-3-3-3 9-3-4-3-3-1-1-3-1-2-3-1-1-2-2-1; -1 1 0 0-3 5 2-2 0-3-2 1 0-3-1 0-1-2-1-2; -1 0 0 2-4 2 5-2 0-3-3 1-2-3-1 0-1-3-2-2;
0-2 0-1-3-2-2 6-2-4-4-2-3-3-2 0-2-2-3-3; -2 0 1-1-3 0 0-2 8-3-3-1-2-1-2-1-2-2 2-3; -1-3-3-3-1-3-3-4-3 4 2-3 1 0-3-2-1-3-1 3; -1-2-3-4-1-2-3-4-3 2 4-2 2 0-3-2-1-2-1 1; -1 2 0-1-3 1 1-2-1-3-2 5-1-3-1 0-1-3-2-2; -1-1-2-3-1 0-2-3-2 1 2-1 5 0-2-1-1-1-1 1; -2-3-3-3-2-3-3-3-1 0 0-3 0 6-4-2-2 1 3-1; -1-2-2-1-3-1-1-2-2-3-3-1-2-4 7-1-1-4-3-2;
1-1 1 0-1 0 0 0-1-2-2 0-1-2-1 4 1-3-2-2;
0-1 0-1-1-1-1-2-2-1-1-1-1-2-1 1 5-2-2 0; -3-3-4-4-2-2-3-2-2-3-2-3-1 1-4-3-211 2-3; -2-2-2-3-2-1-2-3 2-1-1-2-1 3-3-2-2 2 7-1;
0-3-3-3-1-2-2-3-3 3 1-2 1-1-2-2 0-3-1 4]; value = A(i1, i2);
function value = aaToInt(aa) switch aa
case 'A', value = 1;
case 'R', value = 2;
case 'N', value = 3;
case 'D', value = 4;
case 'C', value = 5;
case 'Q', value = 6;
case 'E', value = 7;
case 'G', value = 8;
case 'H', value = 9;
case 'I', value = 10;
case 'L', value = 11;
case 'K', value = 12;
case 'M', value = 13;
case 'F', value = 14;
case 'P', value = 15;
case 'S', value = 16;
case 'T', value = 17;
case 'W', value = 18;
case 'Y', value = 19;
case 'V', value = 20;
otherwise, value = 1;
end end
end