Analysis of Algorithms, I
CSOR W4231
Computer Science Department
Copyright By PowCoder代写 加微信 powcoder
Columbia University
More dynamic programming: sequence alignment
1 Sequence alignment
String similarity
This problem arises when comparing strings. Example: consider an online dictionary.
Input: a word, e.g., “ocurrance”
Output: did you mean “occurrence”?
Similarity: intuitively, two words are similar if we can “almost” line them up by using gaps and mismatches.
Aligning strings using gaps and mismatches
We can align “ocurrance” and “occurrence” using one gap and one mismatch
or, three gaps
Strings in biology
Similarity of english words is rather intuitive.
Determining similarity of biological strings is a central computational problem for molecular biologists.
Chromosomes again: an organism’s genome (set of genetic material) consists of chromosomes (giant linear DNA molecules)
We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T }.
The string encodes instructions for building protein molecules.
Why similarity?
Why are we interested in similarity of biological strings?
Roughly speaking, the sequence of symbols in an organism’s genome determines the properties of the organism.
So similarity can guide decisions about biological experiments.
How do we define similarity between two strings?
Similarity based on the notion of “lining up” two strings
Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.
Example: X = GCAT, Y = CATG
The set of pairs {(2, 1), (3, 2), (4, 3)} is an alignment of X, Y : these are the pairs of positions in X, Y that are matched.
Definition of alignment of two strings
AnalignmentLofX=x1…xm,Y =y1…yn isasetof ordered pairs of indices (i, j) with i ∈ [1, m], j ∈ [1, n] such that the following two properties hold:
P1. every i ∈ [1,m], j ∈ [1,n] appears at most once in L; P2. pairs do not cross: if (i,j),(i′,j′) ∈ L and i < i′, then
Example: X = GCAT, Y = CATG
1. {(2,1),(3,2),(4,3)}isanalignment;but
2. {(2,1),(3,2),(4,3),(1,4)}isnotanalignment(violatesP2).
Cost of an alignment
LetLbeanalignmentofX=x1...xm,Y =y1...yn.
1. Gap penalty δ: there is a cost δ for every position of X and
every position of Y that is not matched.
2. Mismatch cost: there is a cost αpq for every pair of alphabet symbols p, q that are matched in L.
So every pair (i,j) ∈ L incurs a cost of αxiyj .
Assumption: αpp = 0 for every symbol p (matching a
symbol with itself incurs no cost).
The cost of alignment L is the sum of all the gap and the mismatch costs.
Cost of alignment in symbols
In symbols, given alignment L, let
XiL = 1 iff position i of X is not matched (gap), YjL = 1 iff position j of Y is not matched (gap).
Then the cost of alignment L is given by
cost(L)= XiLδ + YjLδ + αxiyj 1≤i≤m 1≤j ≤n (i,j )∈L
Example 1.
Let L1 be the alignment shown below.
x9 e e y10
Example 1.
Let L1 be the alignment shown below.
x9 e e y10
L1 ={(1,1),(2,2),(3,4),(4,5),(5,6),(6,7),(7,8),(8,9),(9,10)}
cost(L1)=δ+αae (ThisisYL1 +αx6y7.) 3
Example 2.
Let L2 be the alignment shown below.
x9 e e y10
Example 2.
Let L2 be the alignment shown below.
x9 e e y10
L1 ={(1,1),(2,3),(3,4),(4,5),(5,6),(7,8),(8,9),(9,10)}
cost(L2)=3δ (ThisisXL2 +YL2 +YL2.) 627
Example 3.
Let L3, L4 be the alignments shown below.
Example 3.
Let L3, L4 be the alignments shown below.
L3 = {(1,1),(2,2),(3,3),(4,4)} L4 = {(2,1),(3,2),(4,3)} cost(L3) = αGC + αCA + αAT + αTG cost(L4) = 2δ
The sequence alignment problem
two strings X, Y consisting of m, n symbols respectively;
each symbol is from some alphabet Σ
the gap penalty δ
the mismatch costs {αpq} for every pair (p,q) ∈ Σ2
Output: the minimum cost to align X and Y , and an optimal alignment.
Towards a recursive solution
Let L be the optimal alignment. Then either
1. the last two symbols xm, yn of X, Y are matched in L,
hence the pair (m, n) ∈ L; or
2. xm, yn are not matched in L, hence (m, n) ̸∈ L.
In this case, at least one of xm, yn is not matched in L, hence at least one of m, n does not appear in L.
Proof of Claim 1
By contradiction.
Suppose (m, n) ̸∈ L but xm and yn are both matched in L. That is,
1. xm is matched with yj for some j < n, hence (m,j) ∈ L; 2. yn is matched with xi for some i < m, hence (i,n) ∈ L.
Since pairs (i, n) and (m, j) cross, L is not an alignment.
Rewriting Claim 1
The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence.
In an optimal alignment L, at least one of the following is true 1. (m, n) ∈ L; or
2. xm is not matched; or
3. yn is not matched.
The subproblems for sequence alignment
OPT(i,j) = minimum cost of an alignment between x1 ...xi, y1 ...yj
We want OPT(m,n). From Fact 4,
1. If(m,n)∈L,wepayαxmyn +OPT(m−1,n−1). 2. If xm is not matched, we pay δ+OPT(m−1,n). 3. If yn is not matched, we pay δ+OPT(m,n−1).
How do we decide which of the three to use for OPT(m,n)?
The recurrence for the sequence alignment problem
OP T (i, j) =
δ + OP T (i − 1, j) δ + OP T (i, j − 1)
αxy +OPT(i−1,j−1)
, if i = 0
, if i, j ≥ 1 , if j = 0
Boundary cases: OPT(0,j) = jδ and OPT(i,0) = iδ.
Pair (i,j) appears in the optimal alignment for subproblem x1 ...xi,y1 ...yj if and only if the minimum is achieved by the first of the three values inside the min computation.
Computing the cost of the optimal alignment
Misan(m+1)×(n+1)dynamicprogrammingtable.
Fill in M so that all subproblems needed for entry M[i,j]
have already been computed when we compute M[i,j] (e.g., column-by-column).
01 j-1jn 0
Pseudocode
SequenceAlignment(X, Y )
Initialize M[i,0] to iδ Initialize M[0,j] to jδ for j = 1 to n do
for i = 1 to m do
M[i,j]=minαxiyj +M[i−1,j−1],
end for end for
return M[m,n] Running time?
δ + M[i − 1, j], δ + M[i, j − 1]
Reconstructing the optimal alignment
Given M, we can reconstruct the optimal alignment as follows.
TraceAlignment(i, j)
if i==0orj==0thenreturn else
if M[i,j]==αxiyj +M[i−1,j−1]then TraceAlignment(i − 1, j − 1)
Output (i, j),
if M[i,j]==δ+M[i−1,j]thenTraceAlignment(i−1,j) else TraceAlignment(i, j − 1)
end if end if
Initial call: TraceAlignment(m, n) Running time?
Resources used by dynamic programming algorithm
Time: O(mn) Space: O(mn)
English words: m, n ≤ 10
Computational biology: m = n = 100000
Time: 10 billion ops
Space: 10GB table!
Can we avoid using quadratic space while maintaining quadratic running time?
Using only O(m + n) space
1. First, suppose we are only interested in the cost of the optimal alignment.
Easy: keep a table M with 2 columns, hence 2(m + 1) entries.
2. What if we want the optimal alignment too? No longer possible in O(n + m) time.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com