Lecture 2 Sequence Alignment
Burr Settles
IBS Summer Research Program 2008 bsettles@cs.wisc.edu www.cs.wisc.edu/~bsettles/ibs08/
Sequence Alignment: Task Definition
• given:
– a pair of sequences (DNA or protein)
– a method for scoring a candidate alignment
• do:
– determine the correspondences between substrings in the sequences such that the similarity score is maximized
Why Do Alignment?
• homology: similarity due to descent from a common ancestor
• often we can infer homology from similarity
• thus we can sometimes infer structure/function from sequence similarity
Homology Example: Evolution of the Globins
Homology
• homologous sequences can be divided into two groups
– orthologous sequences: sequences that differ because they are found in different species (e.g. human α
-globin and mouse α-globin)
– paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both )
Issues in Sequence Alignment
• the sequences we’re comparing probably differ in length
• there may be only a relatively small region in the
sequences that match
• we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others)
• variable length regions may have been inserted/deleted from the common ancestral sequence
Sequence Variations
• sequences may have diverged from a common ancestor through various types of mutations:
– substitutions (ACGA – insertions (ACGA
– deletions (ACGGAGA
AGGA) ACCGGAGA)
AGA)
• the latter two will result in gaps in alignments
Insertions, Deletions and Protein Structure
• Why is it that two “similar” sequences may have large insertions/deletions?
– some insertions and deletions may not significantly affect the structure of a protein
loop structures: insertions/deletions here not so significant
Example Alignment: Globins
• figure at right shows prototypical structure of globins
• figure below shows part of alignment for 8 globins (-’s indicate gaps)
Three Key Questions
• Q1: what do we want to align?
• Q2: how do we “score” an alignment?
• Q3: how do we find the “best” alignment?
Q1: What Do We Want to Align?
• global alignment: find best match of both sequences in their entirety
• local alignment: find best subsequence match
• semi-global alignment: find best match without penalizing gaps on the ends of the alignment
The Space of Global Alignments
• some possible global alignments for ELV and VIS
ELV VIS
E-LV VIS-
-ELV –ELV ELV- VIS- VIS– -VIS
ELV– EL-V
–VIS -VIS
Q2: How Do We Score Alignments?
• gap penalty function
– w(k) indicates cost of a gap of length k
• substitution matrix
– s(a,b) indicates score of aligning character a with character b
Linear Gap Penalty Function
• different gap penalty functions require somewhat different dynamic programming algorithms
• the simplest case is when a linear gap function is used w(k)= g×k
where g is a constant
• we’ll start by considering this case
€
Scoring an Alignment
• the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps
• example: given the following alignment
VAHV—D–DMPNALSALSDLHAHKL
AIQLQVTGVVVTDATLKNLGSVHVSKG
• we would score it by
s(V,A) + s(A,I) + s(H,Q) + s(V,L) + 3g + s(D,G) + 2g …
• •
Q3: How Do We Find the Best Alignment?
simple approach: compute & score all possible alignments but there are
2n (2n)! 22n
n=(n!)2 ≈
•
πn
possible global alignments for 2 sequences of length n
e.g. two sequences of length 100 have ≈ 1077 possible alignments
Pairwise Alignment Via Dynamic Programming
• dynamic programming: solve an instance of a problem by taking advantage of solutions for subparts of the problem
– reduce problem of best alignment of two sequences to best alignment of all prefixes of the sequences
– avoid recalculating the scores already considered
• example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34…
• first used in alignment by Needleman & Wunsch, Journal of Molecular Biology, 1970
Dynamic Programming Idea
• consider last step in computing alignment of AAAC with AGC
• three possible options; in each we’ll choose a different pairing for end of alignment, and add this to best alignment of previous characters
AAA AG
C- CC
AAAC AG
AAA AGC
C
– these prefixes
score of aligning this pair
consider best + alignment of
Dynamic Programming Idea
• given an n-character sequence x, and an m-character sequence y
• construct an (n+1) × (m+1) matrix F
• F ( i, j ) = score of the best alignment of
x[1…i ] with y[1…j ] AGC
A A A C
score of best alignment of AAA to AG
Needleman-Wunch Algorithm
• one way to specify the DP is in terms of its recurrence relation:
match xi with yj
F(i, j)=maxF(i−1, j)+g
F(i−1, j−1)+s(xi,yj)
F ( i , j − 1 ) + g
insertion in x insertion in y
DP Algorithm Sketch: Global Alignment
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left to right
• for each F ( i, j ), save pointer(s) to cell(s) that resulted in best score
• F (m, n) holds the optimal alignment score; trace pointers back from F (m, n) to F (0, 0) to recover alignment
Initializing Matrix
AGC
0
g
2g
3g
g
2g
3g
4g
A A A C
Global Alignment Example
• suppose we choose the following scoring scheme: s(xi , yi ) =
+1 when xi = yi
when xi ≠ yi
g (penalty for aligning with a gap) = -2
-1
Global Alignment Example
AGC
A A A C
s(xi , yi ) =
+1 when xi = yi
-1when xi ≠yi g = -2
Global Alignment Example
AGC
0
-2
-4
-6
-2
1
-1
-3
-4
-1
0
-2
-6
-3
-2
-1
-8
-5
-4
-1
A A A C
one optimal alignment
x: A A A C y: A G – C
Equally Optimal Alignments
• many optimal alignments may exist for a given pair of sequences
• can use preference ordering over paths when doing traceback
highroad 1 lowroad 3 22
31
• highroad and lowroad alignments show the two most different optimal alignments
Highroad & Lowroad Alignments
AGC
0
-2
-4
-6
-2
1
-1
-3
-4
-1
0
-2
-6
-3
-2
-1
-8
-5
-4
-1
A A A C
highroad alignment
x: A A A C y: A G – C
lowroad alignment
x: A A A C y: – A G C
DP Comments
• works for either DNA or protein sequences, although the substitution matrices used differ
• finds an optimal alignment
• the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this)
Local Alignment
• so far we have discussed global alignment, where we are looking for best match between sequences from one end to the other
• more commonly, we will want a local alignment, the best match between subsequences of x and y
Local Alignment Motivation
• useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere
• useful for comparing DNA sequences that share a similar motif but differ elsewhere
• useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence)
• more sensitive when comparing highly diverged sequences
• •
Local Alignment DP Algorithm
original formulation: Smith & Waterman, Journal of Molecular Biology, 1981
interpretation of array values is somewhat different
– F ( i, j ) = score of the best alignment of a suffix of x[1…i ] and a suffix of y[1…j ]
•
Local Alignment DP Algorithm
the recurrence relation is slightly different than for global algorithm
F(i−1, j−1)+s(xi,yj)
F ( i − 1, j ) + g F(i, j)=maxF(i, j−1)+g
0
Local Alignment DP Algorithm
• initialization: first row and first column initialized with 0’s • traceback:
– find maximum value of F(i, j); can be anywhere in matrix
– stop when we get to a cell with value 0
s(xi , yi ) =
+1when xi =yi
-1when xi ≠yi g=-2
T T A A G
Local Alignment Example
AAGA
Local Alignment Example
AAGA
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
1
0
1
2
0
1
0
0
0
3
1
x: A A G y: A A G
T T A A G
•
More On Gap Penalty Functions
a gap of length k is more probable than k gaps of length 1 – a gap may be due to a single mutational event that
inserted/deleted a stretch of characters
– separated gaps are probably due to distinct mutational events
a linear gap penalty function treats these cases the same
it is more common to use an affine gap penalty function, which involves two terms:
– a penalty h associated with opening a gap – a smaller penalty g for extending the gap
• •
Gap Penalty Functions
• linear
w(k) = gk
• affine
w(k)=h+gk, k≥1
0, k=0
Dynamic Programming for the Affine Gap Penalty Case
• to do in O(n2 ) time, need 3 matrices instead of 1
best score given that x[i] is aligned to y[j]
best score given that x[i] is aligned to a gap
best score given that y[j] is aligned to a gap
M(i, j) Ix (i, j)
Iy (i, j)
Global Alignment DP for the Affine Gap Penalty Case
M (i −1, j −1) +s(xi, yj) M (i, j) = maxIx (i −1, j −1) + s(xi, yj) I y (i −1, j −1) + s(xi, yj)
Ix(i, j)=maxM(i−1, j)+h+g I x ( i − 1, j ) + g
I (i,j)=maxM(i,j−1)+h+g y I y ( i , j − 1 ) + g
match xi with yj insertion in x insertion in y
open gap in x extend gap in x
open gap in y extend gap in y
Global Alignment DP for the Affine Gap Penalty Case
• initialization M (0,0) = 0
I x (i,0) = h + g × i Iy(0, j)=h+g× j
other cells in top row and leftmost column = −∞
– stop at any of
– note that pointers may traverse all three matrices
• traceback
– start at largest of
M (m, n), I x (m, n), I y (m, n) M (0,0), I x (0,0), I y (0,0)
h = -3, g = -1
(Affine Gap Penalty)
ACACT
Global Alignment Example
M: 0
Ix : -3
Iy : -3
-∞ -∞
-4
-∞
-∞
-5
-∞
-∞
-6
-∞
-∞
-7
-∞
-∞
-8
-∞ -4
-∞
1
-∞
-∞
-5
-∞
-3
-4
-∞
-4
-7
-∞
-5
-8
-∞
-6
-∞
-5
-∞
-3
-3
-∞
0
-9
-7
-2
-8
-4
-5
-11
-5
-6
-12
-6
-∞
-6
-∞
-6
-4
-∞
-4
-4
-10
-1
-6
-8
-3
-9
-5
-4
-10
-6
A
A
T
Global Alignment Example (Continued)
ACACT
M: 0
Ix : -3
Iy : -3
-∞ -∞
-4
-∞
-∞
-5
-∞
-∞
-6
-∞
-∞
-7
-∞
-∞
-8
-∞ -4
-∞
1
-∞
-∞
-5
-∞
-3
-4
-∞
-4
-7
-∞
-5
-8
-∞
-6
-∞
-5
-∞
-3
-3
-∞
0
-9
-7
-2
-8
-4
-5
-11
-5
-6
-12
-6
-∞
-6
-∞
-6
-4
-∞
-4
-4
-10
-1
-6
-8
-3
-9
-5
-4
-10
-6
A
A
T
three optimal alignments: ACACT ACACT ACACT AA–T A–AT –AAT
Local Alignment DP for the Affine Gap Penalty Case
M(i−1, j−1)+s(xi,yj)
Ix(i−1, j−1)+s(xi,yj) M(i,j)=maxI (i−1,j−1)+s(xi,yj)
Ix(i, j)=maxM(i−1, j)+h+g
I x ( i − 1, j ) + g
I (i,j)=maxM(i,j−1)+h+g y I y ( i , j − 1 ) + g
y 0
Local Alignment DP for the Affine Gap Penalty Case
• initialization M (0,0) = 0
M (i,0) = 0
M (0, j) = 0 cellsintoprowandleftmostcolumnof Ix,Iy =−∞
• traceback
– start at largest M (i, j) – stop at M (i, j) = 0
Gap Penalty Functions
• linear: w(k)=gk
• affine:
w(k)=h+gk, k≥1 0, k=0
• concave: a function for which the following holds for all k, l, m ≥ 0
w(k + m + l) − w(k + m) ≤ w(k + l) − w(k) w(k ) = h + g × log(k )
e.g.
Concave Gap Penalty Functions
8
7 w(k + m + l) − w(k + m)
l
6 5 4 3 2 1
01 2 3 4 5 6 7 8 9 10
w(k + l) − w(k)
w(k + m + l) − w(k + m) ≤ w(k + l) − w(k)
More On Scoring Matches
• so far, we’ve discussed multiple gap penalty functions, but only one match-scoring scheme:
s(xi , yi ) =
+1
-1 whenxi≠yi
when xi = yi
• for protein sequence alignment, some amino acids have
similar structures and can be substituted in nature:
aspartic acid (D) glutamic acid (E)
Substitution Matrices
• two popular sets of matrices for protein sequences
– PAM matrices [Dayhoff et al., 1978]
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• both try to capture the the relative substitutability of amino acid pairs in the context of evolution
BLOSUM62 Matrix
Heuristic Methods
• the algorithms we learned today take O(nm) time to align sequences, which is too slow for searching large databases
– imagine an internet search engine, but where queries and results are protein sequences
• heuristic methods do fast approximation to dynamic programming
– example: BLAST [Altschul et al., 1990; Altschul et al., 1997] – break sequence into small (e.g. 3 base pair) “words”
– scan database for word matches
– extend all matches to seek high-scoring alignments
– tradeoff: sensitivity for speed
Multiple Sequence Alignment
• we’ve only discussed aligning 2 sequences, but we may want to do more
• discover common motifs in a set of sequences
(e.g. DNA sequences that bind the same protein)
• characterize a set of sequences
(e.g. a protein family)
• much more complex
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
Next Time…
• basic molecular biology
• sequence alignment
• probabilistic sequence models • gene expression analysis
• protein structure prediction – by Ameet Soni