COMP1730/COMP6730 Programming for Scientists
Control, part 3: Dynamic programming
Outline
* Dynamic programming.
* (DNA) sequence alignment.
Dynamic programming
Recursion or iteration?
* Examples of problems that could be solved both with recursion and with iteration:
– Counting boxes in a stack.
– Solving an equation (the interval-halving
algorithm).
* Examples pf problems that we have only seen recursive solutions for:
– Counting selections (“n choose k”). – The subset sum problem.
Example: Counting selections
* Compute the number of ways to choose k elements from a set of n, C(n,k).
2 from {, △, ⃝}
out in
2 from {△,⃝}: 1 1 from {△,⃝}
△ out △ in
1 from {⃝}: 1 0 from {⃝}: 1
* Simple recursive formulation:
C(n, k) = C(n − 1, k) + C(n − 1, k − 1) C(n,0) = 1
C(n,n) = 1
* Simple recursive implementation:
def choices(n, k):
if k == n or k == 0:
return 1
else:
return choices(n – 1, k) + \ choices(n – 1, k – 1)
* How to implement with iteration?
* Recursive calls by choices(5, 3): (5,3)
(4,2)
(4,3)
(3,3) (3,2)
(3,1) (2,1) (2,0)
(3,2) (2,2) (2,1) (2,2) (2,1)
(1,1) (1,0) (1,1) (1,0) (1,1) (1,0)
* Note repeated work.
* The idea of dynamic programming is to store answers to (recursively defined) subproblems, to avoid computing them repeatedly.
– Trade memory for computation time.
* By computing subproblem solutions “from the bottom up”, we can also transform a recursive algorithm into an iterative one:
– solve the base cases first;
– then, repeatedly, solve problems whose
subproblems are already solved; – until the whole problem is solved.
* Need a way to index subproblems.
* Array of subproblems:
(0,0)
(1,0)
(2,0)
(3,0)
(4,0)
(5,0)
(1,1)
(2,1)
(3,1)
(4,1)
(2,2)
(3,2)
(5,1)
(4,2)
(5,2)
(3,3)
(4,3)
(5,3)
k
n
* With base cases solved:
(0,0) =1
(1,0) =1
(2,0) =1
(3,0) =1
(4,0) =1
(5,0) =1
(1,1) =1
(2,1)
(3,1)
(4,1)
(2,2) =1
(3,2)
(5,1)
(4,2)
(5,2)
(3,3) =1
(4,3)
(5,3)
k
n
* Complete:
(0,0) =1
(1,0) =1
(2,0) =1
(3,0) =1
(4,0) =1
(5,0) =1
(1,1) =1
(2,1) =2
(3,1) =3
(4,1) =4
(5,1) =5
(2,2) =1
(3,2) =3
(4,2) =6
(5,2) = 10
(3,3) =1
(4,3) =4
(5,3) = 10
k
n
(DNA) sequence alignment
BRCA 1 gene (BReast CAncer)
CTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGGAAAAGCGCGGGAATTACAGATAAATTAAAACTGCGACTGCGCGGCGTGAGCTCGC TGAGACTTCCTGGACGGGGGACAGGCTGTGGGGTTTCTCAGATAACTGGGCCCCTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGGTTC ATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTC CCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAACTTCTCAACCAG AAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGA AGAGCTATTGAAAATCATTTGTGCTTTTCAGCTTGACACAGGTTTGGAGTATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACT CTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAACCGTGCCAAAAGACTTCTACAGAGTGAACCCGAA AATCCTTCCTTGGAAACCAGTCTCAGTGTCCAACTCTCTAACCTTGGAACTGTGAGAACTCTGAGGACAAAGCAGCGGATACAACCTCA AAAGACGTCTGTCTACATTGAATTGGGATCTGATTCTTCTGAAGATACCGTTAATAAGGCAACTTATTGCAGTGTGGGAGATCAAGAAT TGTTACAAATCACCCCTCAAGGAACCAGGGATGAAATCAGTTTGGATTCTGCAAAAAAGGCTGCTTGTGAATTTTCTGAGACGGATGTA ACAAATACTGAACATCATCAACCCAGTAATAATGATTTGAACACCACTGAGAAGCGTGCAGCTGAGAGGCATCCAGAAAAGTATCAGGG TGAAGCAGCATCTGGGTGTGAGAGTGAAACAAGCGTCTCTGAAGACTGCTCAGGGCTATCCTCTCAGAGTGACATTTTAACCACTCAGC AGAGGGATACCATGCAACATAACCTGATAAAGCTCCAGCAGGAAATGGCTGAACTAGAAGCTGTGTTAGAACAGCATGGGAGCCAGCCT TCTAACAGCTACCCTTCCATCATAAGTGACTCTTCTGCCCTTGAGGACCTGCGAAATCCAGAACAAAGCACATCAGAAAAAGCAGTATT AACTTCACAGAAAAGTAGTGAATACCCTATAAGCCAGAATCCAGAAGGCCTTTCTGCTGACAAGTTTGAGGTGTCTGCAGATAGTTCTA CCAGTAAAAATAAAGAACCAGGAGTGGAAAGGTCATCCCCTTCTAAATGCCCATCATTAGATGATAGGTGGTACATGCACAGTTGCTCT GGGAGTCTTCAGAATAGAAACTACCCATCTCAAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCC ACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGAGGGAACCCCTTACCTGGAATCTGGAATCAGCCTCTTCTCTGATG ACCCTGAATCTGATCCTTCTGAAGACAGAGCCCCAGAGTCAGCTCGTGTTGGCAACATACCATCTTCAACCTCTGCATTGAAAGTTCCC CAATTGAAAGTTGCAGAATCTGCCCAGAGTCCAGCTGCTGCTCATACTACTGATACTGCTGGGTATAATGCAATGGAAGAAAGTGTGAG CAGGGAGAAGCCAGAATTGACAGCTTCAACAGAAAGGGTCAACAAAAGAATGTCCATGGTGGTGTCTGGCCTGACCCCAGAAGAATTTA TGCTCGTGTACAAGTTTGCCAGAAAACACCACATCACTTTAACTAATCTAATTACTGAAGAGACTACTCATGTTGTTATGAAAACAGAT GCTGAGTTTGTGTGTGAACGGACACTGAAATATTTTCTAGGAATTGCGGGAGGAAAATGGGTAGTTAGCTATTTCTGGGTGACCCAGTC TATTAAAGAAAGAAAAATGCTGAATGAG
Biological sequence data
* DNA and RNA.
* Protein amino acid
sequence.
* Arrangement of genes in chromosome / genome.
* Human DNA is ∼3 billion bp.
* BRCA 1 & 2 genes are ∼80kb (incl. exons).
* Harmful mutations change as few as 2 bases.
* DNA sequencer reads are 100–2k bases.
* Alignment * Mapping
* Assembly
Edit distance
* Minimum (weighted) number of “edit operations” needed to transform one sequence into the other.
* Levenshtein (string edit) distance:
– insert a character (gap in other string);
– delete a character (gap in this string); – substitute a character.
* Minimum edit equals best sequence alignment.
* distance(GAATTCA, GGATCGA) = 3. * Edits:
GAATTCA (subst. 1 G) ⇒ G G A T T C A
(del4) ⇒GGATCA (ins5G) ⇒GGATCGA
* Alignment:
GAATTC A GGAT CGA
+1 +1 +1
Recursive formulation
dist(s, ’’) = len(s) ∗ wgap dist(’’, t ) = len(t ) ∗ wgap
dist(s + x , t + y ) =
0 dist(s,t)+
min
dist(s,t +y)+wgap
dist(s+x,t)+wgap * In example, wsub = wgap = 1.
ifx = y wsub otherwise
def align(s, t):
if len(s) == 0:
return len(t) * w gap elif len(t) == 0:
return len(s) * w gap else:
if s[-1] == t[-1]:
d1 = align(s[:-1], t[:-1])
else:
d1 = align(s[:-1], t[:-1]) + w sub
d2 = align(s, t[:-1]) + w gap d3 = align(s[:-1], t) + w gap return min(d1, d2, d3)
Iterative formulation
* How to index subproblems?
– Each call aligns two sequence prefixes. – (i,j): align(s[:i], t[:j]).
* Base cases?
– Onesequenceisempty(i =0orj =0).
* Update: min of (i − 1, j − 1) (plus subst. weight if s[i] != t[j]),(i−1,j)plusgapweight,and (i , j − 1) plus gap weight.
GCATA
0
1
2
3
4
5
1
1
2
···
2
1
2
···
3
2
1
2
···
4
···
2
···
5
···
2
T G C T A
Summary
* Recursion, iteration and dynamic programming are all useful algorithm design ideas.
* There is no single “best” idea.
* It is not always easy to know which is the right one to apply to a given problem.