代写代考 Analysis of Algorithms, I

Analysis of Algorithms, I
CSOR W4231

Computer Science Department

Copyright By PowCoder代写 加微信 powcoder

Columbia University
More dynamic programming: sequence alignment

1 Sequence alignment

String similarity
This problem arises when comparing strings. Example: consider an online dictionary.
􏰉 Input: a word, e.g., “ocurrance”
􏰉 Output: did you mean “occurrence”?
Similarity: intuitively, two words are similar if we can “almost” line them up by using gaps and mismatches.

Aligning strings using gaps and mismatches
We can align “ocurrance” and “occurrence” using 􏰉 one gap and one mismatch
􏰉 or, three gaps

Strings in biology
􏰉 Similarity of english words is rather intuitive.
􏰉 Determining similarity of biological strings is a central computational problem for molecular biologists.
􏰉 Chromosomes again: an organism’s genome (set of genetic material) consists of chromosomes (giant linear DNA molecules)
􏰉 We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T }.
􏰉 The string encodes instructions for building protein molecules.

Why similarity?
Why are we interested in similarity of biological strings?
􏰉 Roughly speaking, the sequence of symbols in an organism’s genome determines the properties of the organism.
􏰉 So similarity can guide decisions about biological experiments.
How do we define similarity between two strings?

Similarity based on the notion of “lining up” two strings
Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.
Example: X = GCAT, Y = CATG
The set of pairs {(2, 1), (3, 2), (4, 3)} is an alignment of X, Y : these are the pairs of positions in X, Y that are matched.

Definition of alignment of two strings
AnalignmentLofX=x1…xm,Y =y1…yn isasetof ordered pairs of indices (i, j) with i ∈ [1, m], j ∈ [1, n] such that the following two properties hold:
P1. every i ∈ [1,m], j ∈ [1,n] appears at most once in L; P2. pairs do not cross: if (i,j),(i′,j′) ∈ L and i < i′, then Example: X = GCAT, Y = CATG 1. {(2,1),(3,2),(4,3)}isanalignment;but 2. {(2,1),(3,2),(4,3),(1,4)}isnotanalignment(violatesP2). Cost of an alignment LetLbeanalignmentofX=x1...xm,Y =y1...yn. 1. Gap penalty δ: there is a cost δ for every position of X and every position of Y that is not matched. 2. Mismatch cost: there is a cost αpq for every pair of alphabet symbols p, q that are matched in L. 􏰉 So every pair (i,j) ∈ L incurs a cost of αxiyj . 􏰉 Assumption: αpp = 0 for every symbol p (matching a symbol with itself incurs no cost). The cost of alignment L is the sum of all the gap and the mismatch costs. Cost of alignment in symbols In symbols, given alignment L, let 􏰉 XiL = 1 iff position i of X is not matched (gap), 􏰉 YjL = 1 iff position j of Y is not matched (gap). Then the cost of alignment L is given by cost(L)= 􏰆 XiLδ + 􏰆 YjLδ + 􏰆 αxiyj 1≤i≤m 1≤j ≤n (i,j )∈L Example 1. Let L1 be the alignment shown below. x9 e e y10 Example 1. Let L1 be the alignment shown below. x9 e e y10 L1 ={(1,1),(2,2),(3,4),(4,5),(5,6),(6,7),(7,8),(8,9),(9,10)} cost(L1)=δ+αae (ThisisYL1 +αx6y7.) 3 Example 2. Let L2 be the alignment shown below. x9 e e y10 Example 2. Let L2 be the alignment shown below. x9 e e y10 L1 ={(1,1),(2,3),(3,4),(4,5),(5,6),(7,8),(8,9),(9,10)} cost(L2)=3δ (ThisisXL2 +YL2 +YL2.) 627 Example 3. Let L3, L4 be the alignments shown below. Example 3. Let L3, L4 be the alignments shown below. L3 = {(1,1),(2,2),(3,3),(4,4)} L4 = {(2,1),(3,2),(4,3)} cost(L3) = αGC + αCA + αAT + αTG cost(L4) = 2δ The sequence alignment problem 􏰉 two strings X, Y consisting of m, n symbols respectively; each symbol is from some alphabet Σ 􏰉 the gap penalty δ 􏰉 the mismatch costs {αpq} for every pair (p,q) ∈ Σ2 Output: the minimum cost to align X and Y , and an optimal alignment. Towards a recursive solution Let L be the optimal alignment. Then either 1. the last two symbols xm, yn of X, Y are matched in L, hence the pair (m, n) ∈ L; or 2. xm, yn are not matched in L, hence (m, n) ̸∈ L. In this case, at least one of xm, yn is not matched in L, hence at least one of m, n does not appear in L. Proof of Claim 1 By contradiction. Suppose (m, n) ̸∈ L but xm and yn are both matched in L. That is, 1. xm is matched with yj for some j < n, hence (m,j) ∈ L; 2. yn is matched with xi for some i < m, hence (i,n) ∈ L. Since pairs (i, n) and (m, j) cross, L is not an alignment. Rewriting Claim 1 The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence. In an optimal alignment L, at least one of the following is true 1. (m, n) ∈ L; or 2. xm is not matched; or 3. yn is not matched. The subproblems for sequence alignment OPT(i,j) = minimum cost of an alignment between x1 ...xi, y1 ...yj We want OPT(m,n). From Fact 4, 1. If(m,n)∈L,wepayαxmyn +OPT(m−1,n−1). 2. If xm is not matched, we pay δ+OPT(m−1,n). 3. If yn is not matched, we pay δ+OPT(m,n−1). How do we decide which of the three to use for OPT(m,n)? The recurrence for the sequence alignment problem OP T (i, j) = δ + OP T (i − 1, j) δ + OP T (i, j − 1) αxy +OPT(i−1,j−1) , if i = 0 , if i, j ≥ 1 , if j = 0 􏰉 Boundary cases: OPT(0,j) = jδ and OPT(i,0) = iδ. 􏰉 Pair (i,j) appears in the optimal alignment for subproblem x1 ...xi,y1 ...yj if and only if the minimum is achieved by the first of the three values inside the min computation. Computing the cost of the optimal alignment 􏰉 Misan(m+1)×(n+1)dynamicprogrammingtable. 􏰉 Fill in M so that all subproblems needed for entry M[i,j] have already been computed when we compute M[i,j] (e.g., column-by-column). 01 j-1jn 0 Pseudocode SequenceAlignment(X, Y ) Initialize M[i,0] to iδ Initialize M[0,j] to jδ for j = 1 to n do for i = 1 to m do M[i,j]=min􏰇αxiyj +M[i−1,j−1], end for end for return M[m,n] Running time? δ + M[i − 1, j], δ + M[i, j − 1]􏰈 Reconstructing the optimal alignment Given M, we can reconstruct the optimal alignment as follows. TraceAlignment(i, j) if i==0orj==0thenreturn else if M[i,j]==αxiyj +M[i−1,j−1]then TraceAlignment(i − 1, j − 1) Output (i, j), if M[i,j]==δ+M[i−1,j]thenTraceAlignment(i−1,j) else TraceAlignment(i, j − 1) end if end if Initial call: TraceAlignment(m, n) Running time? Resources used by dynamic programming algorithm 􏰉 Time: O(mn) 􏰉 Space: O(mn) 􏰉 English words: m, n ≤ 10 􏰉 Computational biology: m = n = 100000 􏰉 Time: 10 billion ops 􏰉 Space: 10GB table! 􏰉 Can we avoid using quadratic space while maintaining quadratic running time? Using only O(m + n) space 1. First, suppose we are only interested in the cost of the optimal alignment. Easy: keep a table M with 2 columns, hence 2(m + 1) entries. 2. What if we want the optimal alignment too? 􏰉 No longer possible in O(n + m) time. 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com