Lecture 7:
Dynamic Programming II (Adv)
The University of Sydney
Page 1
Fast Fourier Transform
How does Google know what I want?
The University of Sydney Page 4
Edit distance
– How similar are two strings? – S = “ocurrance”
– T = “occurrence”
o
c
–
u
r
r
a
n
c
e
o
c
c
u
r
r
e
n
c
e
1 mismatch, 1 gap
Modify S to get T by adding c, swapping a-e The University of Sydney
Page 5
Which edit is better?
o
c
–
u
r
r
a
n
c
e
o
c
c
u
r
r
e
n
c
e
1 mismatch, 1 gap
o
c
–
u
r
r
–
a
n
c
e
o
c
c
u
r
r
e
–
n
c
e
0 mismatches, 3 gaps
o
c
u
r
r
a
n
c
e
–
o
c
c
u
r
r
e
n
c
e
The University of Sydney
5 mismatches, 1 gap
Page 8
Edit Distance
– Applications.
– Basis for Unix diff.
– Speech recognition.
– Computational biology.
– Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] – Gap penalty d and mismatch penalty apq.
– Cost = sum of gap and mismatch penalties.
C
T
G
A
C
C
T
A
C
C
T
–
C
T
G
A
C
C
T
A
C
C
T
C
C
T
G
A
C
T
A
C
A
T
C
C
T
G
A
C
–
T
A
C
A
T
aTC + aGT + aAG+ 2aCA
The University of Sydney Page 9
2d + aCA
Sequence Alignment
– Goal: Given two strings X = x1 x2 . . . xm and Y = y1 y2 . . . yn find alignment of minimum cost.
– Definition: An alignment M is a set of ordered pairs xi-yj such that each item occurs in at most one pair and no crossings.
– Definition: The pair xi-yj and xa-yb cross if i < a, but j > b.
Example: CTACCG vs. TACATG.
x1 x2 x3 x4 x5 x6
C
T
A
C
C
–
G
–
T
A
C
A
T
G
The University of Sydney M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6. Page 10
y1 y2 y3 y4 y5 y6
Sequence Alignment
cost(M) = Saxi,yj + Sd + Sd (xi,yj)ÎM i: xi unmatched j: yj unmatched
mismatch Example: CTACCG vs. TACATG.
x1 x2 x3 x4 x5 x6
C
T
A
C
C
–
G
–
T
A
C
A
T
G
The University of Sydney
Page 11
y1 y2 y3 y4 y5 y6
M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6.
Key steps: Dynamic programming
1. Define subproblems 2. Find recurrences
3. Solve the base cases
4. Transform recurrence into an efficient algorithm
The University of Sydney Page 12
Sequence Alignment: Problem Structure Step 1: Define subproblems
OPT(i, j) =
min cost of aligning strings x1 x2 . . . xi and y1 y2 …yj.
x1 x2 x3 x4 x5 xi
··· ···
C
T
A
C
C
–
G
–
T
A
C
A
T
G
The University of Sydney
Page 14
y1 y2 y3 y4 y5 yj
Sequence Alignment: Problem Structure
– Definition: OPT(i, j) = min cost of aligning strings x1 x2 . . . xi and y1 y2 …yj.
– Case 1: OPT matches xi-yj.
• pay mismatch for xi-yj + min cost of aligning
twostringsx1 x2 …xi-1 andy1 y2 …yj-1 – Case 2a: OPT leaves xi unmatched.
• paygapforxi andmincostofaligningx1 x2 …xi-1 andy1 y2 …yj – Case 2b: OPT leaves yj unmatched.
• paygapforyj andmincostofaligningx1 x2 …xi andy1 y2 …yj-1
OPT(i, j) = min
axiyj + OPT(i-1,j-1) d + OPT(i-1, j)
d + OPT(i, j-1)
The University of Sydney Page 19
Sequence Alignment: Problem Structure Step 3: Solve the base cases
OPT(0,j) = j·d and OPT(i,0) = i·d
OPT(i,j) = j·d if i=0 OPI(i, j) = OPT(i,j) = i·d if j=0
min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)} otherwise
The University of Sydney
Page 20
Sequence Alignment: Algorithm
Sequence-Alignment(m, n, x1x2…xm, y1y2…yn, d, a) { for i = 0 to m
M[0, i] = id for j = 0 to n
M[j, 0] = jd
for i = 1 to m
for j = 1 to n
M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],
return M[m, n]
}
d + M[i, j-1]}
– Analysis. Q(mn) time and space.
– English words or sentences: m, n £ 100.
– Computational biology: m = n = 100,000. 10 billions ops OK, but 10GB array?
The University of Sydney
Page 21
Sequence Alignment: Algorithm
Sequence-Alignment(m, n, x1x2…xm, y1y2…yn, d, a) { for i = 0 to m
M[0, i] = id for j = 0 to n
M[j, 0] = jd
for i = 1 to m
for j = 1 to n
M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],
return M[m, n]
}
d + M[i, j-1]}
– To get the alignment itself we can trace back through the array M.
The University of Sydney Page 22
Sequence Alignment: Linear Space
Question: Can we avoid using quadratic space?
Easy. Optimal value in O(m + n) space and O(mn) time.
– Compute OPT(i, j) from OPT(i-1, j), OPT(i-1, j-1) and OPT(i, j-1). – BUT! No longer a simple way to recover alignment itself.
Theorem: [Hirschberg 1975] Optimal alignment in O(m + n) space and O(mn) time.
– Clever combination of divide-and-conquer and dynamic programming. – Inspired by idea of Savitch from complexity theory.
The University of Sydney Page 25
Sequence Alignment: Linear Space
– Edit distance graph.
– m ́n grid graph GXY (as shown in the figure)
– Horizontal/vertical edges have cost d (gap)
– Diagonal edges from (i-1, j-1) to (i,j) have cost axiyj. (match)
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
e
x1 x2
x3
m-n
The University of Sydney
Page 26
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be cheapest path from (0,0) to (i, j). – Observation: f(i, j) = OPT(i, j).
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
e
x1 x2
x3
m-n
The University of Sydney
Page 27
Sequence Alignment: Linear Space
min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)}
e
x1 x2
x3
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
m-n
The University of Sydney
Page 28
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be cheapest path from (0,0) to (i, j).
– Can compute f(m,n) in O(mn) time and O(mn) space.
j
y4
i-j
e y1 y2 y3 0-0
x1 x2
y5 y6
e
x3
m-n
The University of Sydney
Page 29
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be cheapest path from (0,0) to (i, j).
– If only interested in the value of the optimal alignment we do it in O(mn)
time and O(m + n) space.
e y1 y2 y3 0-0
x1 x2
j
y4
i-j
y5 y6
e
x3
m-n
The University of Sydney
Page 30
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be cheapest path from (0,0) to (i, j).
– This corresponds to optimal alignment of X[1..i] and Y[1..j]
– For any column j, can compute f(•, j) in O(mn)j time and O(m + n) space
y4
i-j
e y1 y2 y3 0-0
x1 x2
y5 y6
e
x3
m-n
The University of Sydney
Page 31
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be cheapest path from (0,0) to (i, j).
– If only interested in the value of the optimal alignment we do it in O(mn) time and O(m + n) space.
Space-Efficient-Alignment(X,Y) {
array B[0..m,0..1]
for i = 0 to m
B[i,0] = id for j = 1 to n B[0,1] = jd
#Collapse A into an m ́2 array] # B[i,0] = A[i,j-1]
# B[i,1] = A[i,j]
#corresponds to A[0,j]
for i = 1 to m
B[i,1] = min(a[xi, yj] + B[i-1,0],
d + B[i-1,1],d + B[i,0])
endFor
Move column i of B to column 0 #(B[i,0]=B[i,1]) endFor
}
The University of Sydney Page 32
Sequence Alignment: Linear Space
– Edit distance graph.
– Let f(i, j) be length of cheapest path from (0,0) to (i, j).
– This corresponds to cost of optimal alignment of X[1..i] and Y[1..j] – Observation: f(i, j) = OPT(i, j).
e y1 y2 y3 y4 y5 y6 0-0
e
x1 x2
d d
i-j
x3
m-n
The University of Sydney
Page 33
Sequence Alignment: Linear Space
– Edit distance graph.
– Let g(i, j) be cheapest path from (i, j) to (m, n).
– This corresponds to optimal alignment of X[m..i+1] and Y[m..j+1]
– Can compute by reversing the edge orientations and inverting the roles of (0, 0) and (m, n)
0-0
x1
x2 x3
ePage 34
i-j d
d
The University of Sydney m-n y1 y2 y3 y4 y5 y6 e
Sequence Alignment: Linear Space
– Edit distance graph.
– Let g(i, j) be cheapest path from (i, j) to (m, n).
– For any column j, can compute g(•, j) for all j in O(mn) time and O(m +
n) space.
e y1 y2 0-0
x1 x2
j
y3
i-j
y4 y5 y6
e
x3
m-n
The University of Sydney
Page 35
Sequence Alignment: Linear Space
Observation 1: The cost of the cheapest path that uses (i, j) is f(i, j) + g(i, j).
e y1 y2 y3 y4 y5 y6 0-0
x1 i-j x2
e
x3
m-n
The University of Sydney
Page 36
Sequence Alignment: Linear Space
Observation 2: Let q be an index that minimizes f(q, n/2) + g(q, n/2). Then, the cheapest path from (0, 0)
e y1 y2 0-0
x1 x2
y4 y5 y6
to (m, n) uses (q, n/2).
n/2
y3
i-j
e
q
x3
m-n
The University of Sydney
Page 37
Sequence Alignment: Linear Space
Divide: Find index q that minimizes f(q, n/2) + g(q, n/2) using DP. Can be done in O(nm) time and O(n + m) space
– Align xq and yn/2.
Conquer: recursively compute optimal alignment in each piece.
e y1 y2 0-0
x1
e
y3
q,n/2
n/2
m-n
y4 y5 y6
q
x2
x3
The University of Sydney
Page 38
Pseudocode
Divide-and-Conquer alignment(X,Y) { If |X|≤2 or |Y|≤2 then
OptimalAlignment(X,Y) #Alg using quadratic space f(·,n/2)=Space-Efficient-Alignment(X,Y[1..n/2]) g(·,n/2)=Backward-S-E-Alignment(X,Y[n/2..n])
Let q be the index minimizing f(q,n/2)+g(q,n/2)
A = Divide-and-Conquer alignment(X[1..q],Y[1..n/2])
B = Divide-and-Conquer alignment(X[m..q+1],Y[n..n/2+1]) return concatenation of A and B
}
n/2
q
The University of Sydney
Page 39
Sequence Alignment: Running Time Analysis Warmup
Theorem: Let T(m, n) = max running time of algorithm on strings of length at most m and n. T(m, n) = O(mn log n).
T(m, n) £ 2T(m, n/2) + O(mn) Þ T(m, n) = O(mn logn)
Remark: Analysis is not tight because two subproblems are of size (q, n/2) and (m – q, n/2).
The University of Sydney
Page 40
Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.
The University of Sydney
Page 41
Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.
– For some constant c we have:
T(m,2) £ cm
T(2, n) £ cn
T(m, n) £ cmn+T(q, n/2)+T(m-q, n/2)
The University of Sydney
Page 42
Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.
– For some constant c we have:
– Basecases:m≤2orn≤2.
– Inductive hypothesis: T(m’, n’) £ 2cm’n’ for m’ and n’ with m’n’ < mn.
T(m,2) £ cm
T(2, n) £ cn
T(m, n) £ cmn+T(q, n/2)+T(m-q, n/2)
T(m,n) £ T(q,n/2)+T(m-q,n/2)+cmn £ 2cqn/2+2c(m-q)n/2+cmn
= cqn+cmn-cqn+cmn = 2cmn
Page 43
The University of Sydney
Sequence Alignment: Running Time Analysis
Theorem: An optimal alignment can be computed in O(mn) time using O(m+n) space.
The University of Sydney
Page 44
Sequence Alignment: History [m=n]
– Needleman and Wunsch 1970 O(n3)
– Sankoff 1972 O(n2)
[see also Vintsyuk’68 for speech processing
Wagner and Fisher’74 for string matching]
– Still an active research area (experimental research) Chakraborty and Angana’13 (claimed 54-90% speedup)
The University of Sydney
Page 45
Generalising the algorithm
Problem:
Nature often inserts or removes entire substrings of nucleotides (creating long gaps), rather than editing just one position at a time.
The penalty for a gap of length 10 should not be 10 times the penalty for a gap of length 1, but something significantly smaller. Can we modify the scoring function in which the penalty for a gap of length k is:
d0+d1·k ?
The University of Sydney
Page 46
Dynamic Programming Summary
– 1Ddynamicprogramming
– Weightedintervalscheduling
– SegmentedLeastSquares
– Maximum-sumcontiguoussubarray – Longestincreasingsubsequence
– 2Ddynamicprogramming – Knapsack
– Sequencealignment
– Dynamicprogrammingoverintervals
– RNASecondaryStructure
– Dynamicprogrammingoversubsets – TSP
– k-path – Playlist
The University of Sydney
Page 47