
Lecture 7: Dynamic Programming II

Changes to the unit
More resources
– Recorded tutorials
– Online office hours (Tue 4-6, Fri 2-4). Contact me beforehand on Slack. May be available at other times as well
Assignments
– Scale back in both difficulty and workload
Final Exam
– Online, 2 hours long on ProctorU
– Some knowledge-based questions, some algorithm design problems, easier than assignments

How do I get good at problem solving?
– Work on lots of problems, especially the tutorial ones
– Present your solution during tutorials or on Ed to get feedback
– Feel free to ask about solutions to other problems, such as those from Jeff Erickson’s textbook
– Approaches:
– Make simplifying assumptions, solve special cases
– Reuse computation (esp. useful for D&C)
– Reformulate the problem

General techniques in this course
– Greedy algorithms [Lecture 2]
– Divide & Conquer algorithms [Lecture 3]
– Dynamic programming algorithms [Lectures 4 and today]
– Network flow algorithms [Lectures 6 and 7]

Algorithmic Paradigms
– Greedy. Build up a solution incrementally, myopically optimizing some local criterion.
– Divide-and-conquer. Break up a problem into two sub-problems, solve each sub-problem independently, and combine solutions to the sub-problems to form a solution to the original problem.
– Dynamic programming. Break up a problem into a series of overlapping sub-problems, and build up solutions to larger and larger sub-problems.

Main Idea: Dynamic Programming
Recurrence equation relating the optimal solution to optimal solutions of smaller subproblems.
Given optimal solutions to smaller subproblems, how do we construct a solution to the original problem?

Key steps: Dynamic programming
1. Define subproblems
2. Find recurrences
3. Solve the base cases
4. Transform recurrence into an efficient algorithm

Recap: Weighted Interval Scheduling
– Job j starts at sj, finishes at fj, and has weight vj.
– Two jobs are compatible if they don’t overlap.
– Goal: find maximum weight subset of mutually compatible jobs.
[Figure: jobs a–h drawn as weighted intervals on a timeline from 0 to 11]

Recap: Weighted Interval Scheduling
Notation. Label jobs by finishing time: f1 ≤ f2 ≤ ... ≤ fn.
Def. p(j) = largest index i < j such that job i is compatible with j.
M[j] = value of an optimal solution over the first j jobs.

Find-Solution(j) {
   if (j = 0)
      output nothing
   else if (vj + M[p(j)] > M[j-1])   // picked job j
      print j
      Find-Solution(p(j))
   else
      Find-Solution(j-1)
}

# of recursive calls ≤ n ⇒ O(n).
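To make the recap concrete, here is a minimal Python sketch of the whole pipeline: sort by finishing time, compute p(j) with binary search, fill M bottom-up, then trace back as in Find-Solution. The function and variable names are illustrative, not the lecture's reference code.

from bisect import bisect_right

def weighted_interval_scheduling(jobs):
    # jobs: list of (start, finish, weight) tuples
    jobs = sorted(jobs, key=lambda job: job[1])   # label jobs by finishing time
    n = len(jobs)
    finish = [f for _, f, _ in jobs]
    # p[j] = largest (1-based) index i with f_i <= s_{j+1}, i.e. the last
    # job compatible with job j+1
    p = [bisect_right(finish, jobs[j][0]) for j in range(n)]
    M = [0] * (n + 1)                             # M[j] = OPT over the first j jobs
    for j in range(1, n + 1):
        v = jobs[j - 1][2]
        M[j] = max(M[j - 1], v + M[p[j - 1]])
    chosen, j = [], n                             # Find-Solution, iteratively
    while j > 0:
        v = jobs[j - 1][2]
        if v + M[p[j - 1]] > M[j - 1]:            # picked job j
            chosen.append(jobs[j - 1])
            j = p[j - 1]
        else:
            j -= 1
    return M[n], chosen[::-1]

print(weighted_interval_scheduling([(0, 3, 2), (2, 5, 4), (4, 7, 4), (6, 9, 7)]))
# (11, [(2, 5, 4), (6, 9, 7)])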

Recap: Knapsack Problem
– Knapsack problem.
– Given n objects and a "knapsack."
– Item i weighs wi > 0 kilograms and has value vi > 0.
– Knapsack has capacity of W kilograms.
– Goal: fill knapsack so as to maximize total value.
W = 11

Item   Value   Weight
  1       1       1
  2       6       2
  3      18       5
  4      22       6
  5      28       7

Recap: Dynamic Programming – Step 1
Step 1: Define subproblems
OPT(i, w) = max profit with subset of items 1, …, i with weight limit w.

Recap: Dynamic Programming – Step 2
Step 2: Find recurrences
– Case 1: OPT does not select item i.
• OPT selects best of { 1, 2, …, i-1 } using weight limit w
– Case 2: OPT selects item i.
• new weight limit = w – wi
• OPT selects best of { 1, 2, …, i–1 } using this new weight limit
OPT(i,w) = max { vi + OPT(i-1, w-wi), OPT(i-1, w) }

Recap: Dynamic Programming – Step 2
Step 2: Find recurrences
– Case 1: OPT does not select item i.
• OPT selects best of { 1, 2, …, i-1 } using weight limit w
– Case 2: OPT selects item i.
• new weight limit = w – wi
• OPT selects best of { 1, 2, …, i–1 } using this new weight limit
If wi > w:
OPT(i,w) = OPT(i-1, w)
otherwise:
OPT(i,w) = max { vi + OPT(i-1, w-wi), OPT(i-1, w) }

Recap: Dynamic Programming – Step 3
Step 3: Solve the base cases
OPT(0,w) = 0
OPT(i,w) =
   0                                          if i = 0
   OPT(i-1, w)                                if i > 0 and wi > w
   max{ OPT(i-1, w), vi + OPT(i-1, w-wi) }    otherwise
Done…more or less

Knapsack Problem: Bottom-Up
– Knapsack. Fill up an (n+1)-by-(W+1) array. Space complexity: O(nW).
Input: n, W, w1,…,wn, v1,…,vn

for w = 0 to W
   M[0, w] = 0
for i = 1 to n
   for w = 1 to W
      if (wi > w)
         M[i, w] = M[i-1, w]
      else
         M[i, w] = max {M[i-1, w], vi + M[i-1, w-wi]}
return M[n, W]

Knapsack Algorithm

The filled table M[i, w], (n+1) rows by (W+1) columns, for the instance below:

             w: 0  1  2  3  4  5  6  7  8  9 10 11
∅               0  0  0  0  0  0  0  0  0  0  0  0
{1}             0  1  1  1  1  1  1  1  1  1  1  1
{1,2}           0  1  6  7  7  7  7  7  7  7  7  7
{1,2,3}         0  1  6  7  7 18 19 24 25 25 25 25
{1,2,3,4}       0  1  6  7  7 18 22 24 28 29 29 40
{1,2,3,4,5}     0  1  6  7  7 18 22 28 29 34 34 40

OPT: {4, 3}
value = 22 + 18 = 40

W = 11

Item   Value   Weight
  1       1       1
  2       6       2
  3      18       5
  4      22       6
  5      28       7
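For reference, a direct Python transcription of the bottom-up algorithm and the table above, plus a trace-back that recovers the chosen items. The helper names are mine; the recurrence is exactly the one from Step 2.

def knapsack(values, weights, W):
    n = len(values)
    # M[i][w] = max value achievable with items 1..i and weight limit w
    M = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(1, W + 1):
            if weights[i - 1] > w:                       # item i doesn't fit
                M[i][w] = M[i - 1][w]
            else:
                M[i][w] = max(M[i - 1][w],
                              values[i - 1] + M[i - 1][w - weights[i - 1]])
    items, w = [], W                                     # trace back the choices
    for i in range(n, 0, -1):
        if M[i][w] != M[i - 1][w]:                       # item i was selected
            items.append(i)
            w -= weights[i - 1]
    return M[n][W], items

# The slide's instance:
print(knapsack([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11))  # (40, [4, 3])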

Longest Common Subsequence
Given two sequences X[1..n] and Y[1..m], find the longest subsequences X’ of X and Y’ of Y such that X’ = Y’.
Example 1
X = BANANAS Y = KATANA
LCS(X,Y) = AANA

Example 2
X = DYNAMIC
Y = PROGRAMMING
LCS(X,Y) = AMI

Longest Common Subsequence: Matching Definition
Given two sequences X[1..n] and Y[1..m], find the longest non-crossing matching between elements of X and elements of Y.
Example 1
X = BANANAS Y = KATANA
LCS(X,Y) = AANA. Matching: X[2] – Y[2], X[4] – Y[4], X[5] – Y[5], X[6] – Y[6].

Example 2
X = DYNAMIC
Y = PROGRAMMING
LCS(X,Y) = AMI. Matching: X[4] – Y[6], X[5] – Y[7], X[6] – Y[8].
Note: if crossings were allowed, we could also add X[3] – Y[10].

Longest Common Subsequence: Applications
– Measures similarity of two strings
– Bioinformatics
– Merging in version control

LCS Dynamic Programming: Step 1
Step 1: Define subproblems
OPT(i,j) = length of LCS(X[1..i], Y[1..j]).
Subproblems: finding LCS of prefixes of X and Y

Example
X = BANANAS Y = KATANA
OPT(6, 4) = length of LCS(BANANA, KATA) = 2

LCS Dynamic Programming: Step 2
Notations:
– OPT(i,j) = length of LCS(X[1..i], Y[1..j]).
Step 2: Finding recurrences
– Case 1: X[i] ≠ Y[j].
– Leave X[i] unmatched and take LCS(X[1..i-1], Y[1..j]). OR
– Leave Y[j] unmatched and take LCS(X[1..i], Y[1..j-1]).
– OPT(i,j) = max{ OPT(i-1,j), OPT(i,j-1) }

Example
X[1..5] = BANAN Y[1..6] = KATANA
X[5] unmatched: LCS = LCS(X[1..4], Y[1..6]) = ANA
Y[6] unmatched: LCS = LCS(X[1..5], Y[1..5]) = AAN


LCS Dynamic Programming: Step 2
Notations:
– OPT(i,j) = length of LCS(X[1..i], Y[1..j]).
Step 2: Finding recurrences
– Case 1: X[i] ≠ Y[j].
– Leave X[i] unmatched and take LCS(X[1..i-1], Y[1..j]). OR
– Leave Y[j] unmatched and take LCS(X[1..i], Y[1..j-1]).
– OPT(i,j) = max{ OPT(i-1,j), OPT(i,j-1) }

Example
X[1..3] = BAN Y[1..6] = KATANA
X[3] unmatched: LCS(X[1..2], Y[1..6]) = A
Y[6] unmatched: LCS(X[1..3], Y[1..5]) = AN


LCS Dynamic Programming: Step 2
Notations:
– OPT(i,j) = length of LCS(X[1..i], Y[1..j]).
Step 2: Finding recurrences
– Case 2: X[i] = Y[j].
– Match X[i] to Y[j] and take LCS(X[1..i-1], Y[1..j-1]) + X[i], OR
– Leave X[i] unmatched or Y[j] unmatched.
– OPT(i,j) = max{ OPT(i-1,j-1) + 1, OPT(i-1,j), OPT(i,j-1) }

Example
X[1..6] = BANANA Y[1..6] = KATANA
Match X[6] – Y[6]: LCS = LCS(X[1..5], Y[1..5]) + A = AANA
X[6] unmatched: LCS = LCS(X[1..5], Y[1..6]) = AAN
Y[6] unmatched: LCS = LCS(X[1..6], Y[1..5]) = AAN

LCS Dynamic Programming: Step 3
Step 3: Solving the base cases
OPT(i,0) = 0 for all i, OPT(0,j) = 0 for all j

OPT(i,j) =
   0                                                  if i = 0 or j = 0
   max{ OPT(i-1,j-1) + 1, OPT(i,j-1), OPT(i-1,j) }    if X[i] = Y[j]
   max{ OPT(i,j-1), OPT(i-1,j) }                      if X[i] ≠ Y[j]

LCS Dynamic Programming: Algorithm

INPUT: n, m, X[1..n], Y[1..m]

for j = 0 to m
   M[0,j] = 0
for i = 0 to n
   M[i,0] = 0
for i = 1 to n
   for j = 1 to m
      if (X[i] = Y[j])
         M[i,j] = max(M[i-1,j-1] + 1, M[i-1,j], M[i,j-1])
      else
         M[i,j] = max(M[i-1,j], M[i,j-1])
return M[n,m]

LCS Dynamic Programming: Analysis

INPUT: n, m, X[1..n], Y[1..m]

for j = 0 to m                 // O(n + m): initialize row 0 and column 0
   M[0,j] = 0
for i = 0 to n
   M[i,0] = 0
for i = 1 to n                 // O(nm): fill in the rest of the table
   for j = 1 to m
      if (X[i] = Y[j])
         M[i,j] = max(M[i-1,j-1] + 1, M[i-1,j], M[i,j-1])
      else
         M[i,j] = max(M[i-1,j], M[i,j-1])
return M[n,m]

Running time: O(nm)
Space: O(nm)

See 3927 Lecture 6 for Sequence Alignment
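For reference, a minimal Python version of this algorithm, extended with a trace-back that recovers one actual common subsequence rather than just its length. The names are illustrative, not from the slides.

def lcs(X, Y):
    n, m = len(X), len(Y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                M[i][j] = max(M[i - 1][j - 1] + 1, M[i - 1][j], M[i][j - 1])
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    out, i, j = [], n, m                  # trace back through the table
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:          # matched pair: part of the LCS
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif M[i - 1][j] >= M[i][j - 1]:  # drop the X character
            i -= 1
        else:                             # drop the Y character
            j -= 1
    return ''.join(reversed(out))

print(lcs("BANANAS", "KATANA"))  # an LCS of length 4: "AANA"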

6.5 RNA Secondary Structure
Dynamic programming over intervals

RNA (Ribonucleic acid) Secondary Structure
– RNA. String B = b1b2…bn over alphabet { A, C, G, U }.
– Secondary structure. RNA is single-stranded, so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.

Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA
[Figure: the molecule folded back on itself; complementary base pairs: A-U, C-G]

RNA Secondary Structure
– Secondary structure. A set of pairs S = { (bi, bj) } that satisfy:
– [Watson-Crick.] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C.
– [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases. If (bi, bj) ∈ S, then i < j - 4.
– [Non-crossing.] If (bi, bj) and (bk, bl) are two pairs in S, then we cannot have i < k < j < l.
– Free energy. Usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy (approximated by the number of base pairs).
– Goal. Given an RNA molecule B = b1b2...bn, find a secondary structure S that maximizes the number of base pairs.

RNA Secondary Structure: Examples
[Figure: structures drawn over AUGUGGCCAU ("ok"), AUGGGGCAU ("sharp turn": ≤ 4 intervening bases), and AGUUGGCCAU ("crossing")]

RNA Secondary Structure: Subproblems
– First attempt (Step 1). OPT(j) = maximum number of base pairs in a secondary structure of the substring b1b2...bj.
– Difficulty (in Step 2). Matching bt and bj results in two sub-problems:
– Finding secondary structure in: b1b2...bt-1, which is OPT(t-1).
– Finding secondary structure in: bt+1bt+2...bj-1, which is not a prefix of B, so we need more sub-problems.

Dynamic Programming Over Intervals – Step 1
Step 1: Define subproblems
OPT(i, j) = maximum number of base pairs in a secondary structure of the substring bibi+1...bj.

Dynamic Programming Over Intervals – Step 2
Notation. OPT(i, j) = maximum number of base pairs in a secondary structure of the substring bibi+1...bj.
Step 2: Find recurrences
Case 1. Base bj is not involved in a pair.
OPT(i, j) = OPT(i, j-1)
Case 2. Base bj pairs with bt for some i ≤ t < j - 4. The non-crossing condition decouples the two resulting sub-problems:
OPT(i, j) = max_t { 1 + OPT(i, t-1) + OPT(t+1, j-1) }
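To make the interval DP concrete, here is a bottom-up Python sketch (essentially the Nussinov algorithm restricted to Watson-Crick pairs, with the no-sharp-turn rule). It returns only the optimal pair count; the names are mine, not the lecture's. There are O(n²) subproblems and each takes O(n) to evaluate, so it runs in O(n³) time.

def rna_secondary_structure(B):
    n = len(B)
    complement = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    # OPT[i][j] = max base pairs in B[i..j] (0-indexed, inclusive)
    OPT = [[0] * n for _ in range(n)]
    for k in range(5, n):                 # interval length j - i; shorter: OPT = 0
        for i in range(n - k):
            j = i + k
            best = OPT[i][j - 1]          # Case 1: b_j is not in a pair
            for t in range(i, j - 4):     # Case 2: b_j pairs with b_t, i <= t < j-4
                if (B[t], B[j]) in complement:
                    left = OPT[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + OPT[t + 1][j - 1])
            OPT[i][j] = best
    return OPT[0][n - 1] if n else 0

print(rna_secondary_structure("ACCGGUAGU"))  # 2: an A-U pair with a C-G pair nested inside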
Shortest Paths: Implementation (Bellman-Ford)

Shortest-Path(G, t) {
   foreach node v ∈ V {
      M[v] ← ∞
      successor[v] ← empty
   }
   M[t] ← 0
   for i = 1 to n-1 {
      foreach edge (v, w) ∈ E {
         if (M[v] > M[w] + cvw) {
            M[v] ← M[w] + cvw
            successor[v] ← w
         }
      }
      If no M[w] value changed in iteration i, stop.
   }
}
Analysis: Θ(mn) time, Θ(n) working space.

Shortest Paths: Practical Improvements
– Practical improvements
– Maintain only one array M[v] = length of the shortest v-t path that we have found so far.
– No need to check edges of the form (v, w) unless M[w] changed in the previous iteration.
– Theorem: Throughout the algorithm, M[v] is the length of some v-t path, and after i rounds of updates, the value M[v] is no larger than the length of the shortest v-t path using ≤ i edges.
– Overall impact
– Working space: O(n).
– Total space (including input): O(m+n).
– Running time: O(mn) worst case, but substantially faster in practice.
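As a concrete sketch, here is a compact Python version of the space-efficient variant: a single array M of distances to t, successor pointers, and early stopping when no value changes. (The refinement of only rechecking edges into nodes whose M value changed is omitted for brevity.) The toy graph and names are illustrative.

import math

def bellman_ford_to_t(nodes, edges, t):
    # edges: list of (v, w, c) meaning an edge v -> w with cost c_vw
    M = {v: math.inf for v in nodes}
    successor = {v: None for v in nodes}
    M[t] = 0
    for _ in range(len(nodes) - 1):      # at most n-1 rounds of updates
        changed = False
        for v, w, c in edges:
            if M[v] > M[w] + c:          # relaxing (v, w) improves M[v]
                M[v] = M[w] + c
                successor[v] = w
                changed = True
        if not changed:                  # no M value changed: stop early
            break
    return M, successor

nodes = ["s", "a", "b", "t"]
edges = [("s", "a", 1), ("a", "b", 2), ("b", "t", 3), ("s", "b", 6)]
M, succ = bellman_ford_to_t(nodes, edges, "t")
print(M["s"])  # 6, via s -> a -> b -> t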

6.3 Segmented Least Squares

Segmented Least Squares
– Least squares.
– Foundational problem in statistics and numerical analysis.
– Given n points in the plane: (x1, y1), (x2, y2), ..., (xn, yn).
– Find a line y = ax + b that minimizes the sum of the squared errors:

SSE = Σ_{i=1..n} (yi - a·xi - b)²

– Solution. Calculus ⇒ the minimum error is achieved when

a = [ n Σi xi·yi - (Σi xi)(Σi yi) ] / [ n Σi xi² - (Σi xi)² ],   b = [ Σi yi - a Σi xi ] / n

and computing a and b takes O(n) time.
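As a quick sanity check of the closed form, a tiny Python helper (hypothetical name) that computes a, b, and the SSE directly:

def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / denom if denom else 0.0   # denom == 0 only if all x equal
    b = (sy - a * sx) / n
    sse = sum((y - a * x - b) ** 2 for x, y in points)
    return a, b, sse

print(fit_line([(0, 1), (1, 3), (2, 5)]))  # exact fit: a = 2.0, b = 1.0, SSE = 0.0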

Segmented Least Squares
– Segmented least squares.
– Points lie roughly on a sequence of several line segments.
– Given n points in the plane (x1, y1), (x2, y2), ..., (xn, yn) with x1 < x2 < ... < xn, find a sequence of lines that minimizes f(x).
– Question. What's a reasonable choice for f(x) to balance accuracy (goodness of fit) and complexity (number of lines)?

Segmented Least Squares
– Segmented least squares.
– Points lie roughly on a sequence of several line segments.
– Given n points in the plane (x1, y1), (x2, y2), ..., (xn, yn) with x1 < x2 < ... < xn, find a sequence of lines that minimizes:
• the sum of the sums of the squared errors E in each segment
• the number of lines L
– Tradeoff function: E + c·L, for some constant c > 0.

Segmented Least Squares
– Segmented least squares.
– Points lie roughly on a sequence of several line segments.
– Given n points in the plane (x1, y1), (x2, y2), ..., (xn, yn) with x1 < x2 < ... < xn, find a partition into segments that minimizes:
• the sum of the sums of the squared errors E in each segment
• the number of segments L
– Tradeoff function: E + c·L, for some constant c > 0.

Dynamic Programming: Multiway Choice – Step 1
Step 1: Define subproblems
OPT(j) = minimum cost for points p1, p2 , . . . , pj.

Dynamic Programming: Multiway Choice – Step 2
Notations:
– OPT(j) = minimum cost for points p1, p2, ..., pj.
– e(i, j) = minimum sum of squares for points pi, pi+1, ..., pj.
Step 2: Finding recurrences
– Last segment uses points pi, pi+1, ..., pj for some i.
– Cost = e(i, j) + c + OPT(i-1).

OPT(j) = min_{1 ≤ i ≤ j} { e(i, j) + c + OPT(i-1) }

Dynamic Programming: Multiway Choice – Step 3
Step 3: Solving the base cases
OPT(0) = 0

OPT(j) =
   0                                            if j = 0
   min_{1 ≤ i ≤ j} { e(i, j) + c + OPT(i-1) }   if j > 0

Segmented Least Squares: Algorithm

INPUT: n, (p1,...,pn), c

Segmented-Least-Squares() {
   M[0] = 0
   for j = 1 to n                  // O(n²) pairs (i, j) in total
      for i = 1 to j
         compute the least square error e(i,j) for the segment pi,...,pj   // O(n) each
   for j = 1 to n                  // O(n) iterations
      M[j] = min 1 ≤ i ≤ j (e(i,j) + c + M[i-1])   // O(n) per iteration
   return M[n]
}

Running time: O(n³), dominated by computing e(i, j) for O(n²) pairs at O(n) each.
Space: O(n²)

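Putting the pieces together, a short Python sketch of the whole algorithm, with illustrative names; e(i, j) is recomputed from scratch for each pair, giving exactly the O(n³) running time noted above.

def segmented_least_squares(points, c):
    # points must be sorted by x-coordinate
    n = len(points)

    def e(i, j):
        # SSE of the best single line through points[i..j] (0-indexed, inclusive),
        # using the closed-form a, b from the least-squares slide
        pts = points[i:j + 1]
        m = len(pts)
        sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
        sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
        denom = m * sxx - sx * sx
        a = (m * sxy - sx * sy) / denom if denom else 0.0
        b = (sy - a * sx) / m
        return sum((y - a * x - b) ** 2 for x, y in pts)

    M = [0.0] * (n + 1)              # M[j] = OPT(j), min cost of the first j points
    for j in range(1, n + 1):
        M[j] = min(e(i - 1, j - 1) + c + M[i - 1] for i in range(1, j + 1))
    return M[n]

# Two clean segments: total cost is 2 lines * c plus zero fitting error.
pts = [(0, 0), (1, 1), (2, 2), (3, 10), (4, 11), (5, 12)]
print(segmented_least_squares(pts, c=1.0))   # 2.0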

Dynamic Programming Summary I
Key steps:
1. Define subproblems
2. Find recurrences
3. Solve the base cases
4. Transform recurrence into an efficient algorithm [usually bottom-up]

Dynamic Programming Summary II
– 1D dynamic programming
– Weighted interval scheduling
– Segmented Least Squares
– Maximum-sum contiguous subarray
– Longest increasing subsequence
– 2D dynamic programming
– Knapsack
– Shortest path
– Longest common subsequence
– Dynamic programming over intervals
– RNA Secondary Structure