
CS124 Lecture 9 Spring 2010

9.1 The String reconstruction problem

The greedy approach doesn’t always work, as we have seen. It lacks flexibility; if at some point it makes a wrong

choice, it becomes stuck.

For example, consider the problem of string reconstruction. Suppose that all the blank spaces and punctuation

marks inadvertently have been removed from a text file. You would like to reconstruct the file, using a dictionary.

(We will assume that all words in the file are standard English.)

For example, the string might begin “thesearethereasons”. A greedy algorithm would spot that the first two

words were “the” and “sea”, but then it would run into trouble. We could backtrack; we have found that sea is

a mistake, so looking more closely, we might find the first three words “the”,“sear”, and “ether”. Again there is

trouble. In general, we might end up spending exponential time traveling down false trails. (In practice, since

English text strings are so well behaved, we might be able to make this work– but probably not in other contexts,

such as reconstructing DNA sequences!)

This problem has a nice structure, however, that we can take advantage of. The problem can be broken down

into entirely similar subproblems. For example, we can ask whether the strings “theseare” and “thereasons” both

can be reconstructed with a dictionary. If they can, then we can glue the reconstructions together. Notice, however,

that this is not a good problem for divide and conquer. The reason is that we do not know where the right dividing

point is. In the worst case, we could have to try every possible break! The recurrence would be

T(n) = ∑_{i=1}^{n−1} [T(i) + T(n−i)].

You can check that the solution to this recurrence grows exponentially.
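To see the growth concretely, here is a small Python sketch that evaluates the recurrence directly (the base case T(1) = 1 is an assumption for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n):
    # T(n) = sum over i = 1..n-1 of T(i) + T(n-i), with T(1) = 1 assumed
    if n <= 1:
        return 1
    return sum(T(i) + T(n - i) for i in range(1, n))

print([T(n) for n in range(1, 8)])  # → [1, 2, 6, 18, 54, 162, 486]
```

Each value is triple the previous one from n = 3 on, so T(n) grows exponentially in n.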

Although divide and conquer directly fails, we still want to make use of the subproblems. The attack we now

develop is called dynamic programming. The way to understand dynamic programming is to see that divide and

conquer fails because we might recalculate the same thing over and over again. (Much like we saw very early on

with the Fibonacci numbers!) If we try divide and conquer, we will repeatedly solve the same subproblems (the case

of small substrings) over and over again. The key will be to avoid the recalculations. To avoid recalculations, we use

a lookup table.
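The Fibonacci analogy can be made concrete with a minimal sketch of the lookup-table idea (the names here are illustrative):

```python
fib_table = {0: 0, 1: 1}  # lookup table of answers already computed

def fib(n):
    # Each fib(n) is computed at most once and then reused from the table,
    # so the exponential recomputation of the naive recursion disappears.
    if n not in fib_table:
        fib_table[n] = fib(n - 1) + fib(n - 2)
    return fib_table[n]

print(fib(30))  # → 832040
```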



In order for this approach to be effective, we have to think of subproblems as being ordered by size. We solve

the subproblems bottom-up, from the smallest to the largest, until we reach the original problem.

For this dictionary problem, think of the string as being an array s[1 . . .n]. Then there is a natural subprob-

lem for each substring s[i . . . j]. Consider a two dimensional array D(i, j) that will denote whether s[i . . . j] is the

concatenation of words from the dictionary. The size of a subproblem is naturally d = j− i.

So now we write a simple loop which solves the subproblems in order of increasing size:

for i := 1 to n do D(i, i) := indict(s[i . . . i]);
for d := 1 to n−1 do
    for i := 1 to n−d do
        j := i + d;
        if indict(s[i . . . j]) then D(i, j) := true else
            for k := i to j−1 do
                if D(i, k) and D(k + 1, j) then D(i, j) := true;

This algorithm runs in time O(n³); the three loops each run over at most n values. Pictorially, we can think of the

algorithm as filling in the upper diagonal triangle of a two-dimensional array, starting along the main diagonal and

moving up, diagonal by diagonal.

We need to add a bit to actually find the words. Let F(i, j) be the position of the end of the first word in s[i . . . j]

when this string is a proper concatenation of dictionary words. Initially all F(i, j) should be set to nil. The value for

F(i, j) can be set whenever D(i, j) is set to true. Given the F(i, j), we can reconstruct the words simply by finding

the words that make up the string in order. Note also that we can use this to improve the running time; as soon as we

find a match for the entire string, we can exit the loop and return success! Further optimizations are possible.
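The table-filling and the word recovery via F(i, j) can be sketched in Python as follows; the toy dictionary is illustrative, and substrings are 1-indexed to match the notation above.

```python
def reconstruct(s, dictionary):
    """Return the words of s as a list, or None if s cannot be split.

    D[(i, j)] records whether s[i..j] (1-indexed, inclusive) is a
    concatenation of dictionary words; F[(i, j)] records where the
    first word of such a split ends.
    """
    n = len(s)
    indict = lambda i, j: s[i - 1:j] in dictionary
    D, F = {}, {}
    for d in range(0, n):                  # subproblem size d = j - i
        for i in range(1, n - d + 1):
            j = i + d
            if indict(i, j):
                D[(i, j)], F[(i, j)] = True, j
            else:
                D[(i, j)] = False
                for k in range(i, j):      # split as s[i..k] + s[k+1..j]
                    if D[(i, k)] and D[(k + 1, j)]:
                        D[(i, j)], F[(i, j)] = True, F[(i, k)]
                        break
    if not D.get((1, n)):
        return None
    words, i = [], 1
    while i <= n:                          # follow F to peel off words
        j = F[(i, n)]
        words.append(s[i - 1:j])
        i = j + 1
    return words

print(reconstruct("thesearethereasons",
                  {"the", "these", "sea", "sear", "are", "there", "reasons"}))
# → ['these', 'are', 'the', 'reasons']
```

Note that the dead ends the greedy approach wandered into (“the sea”, “the sear”) never trouble the table: a false D(i, j) entry is computed once and simply never used.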

Let us highlight the aspects of the dynamic programming approach we used. First, we used a recursive descrip-

tion based on subproblems: D(i, j) is true if D(i,k) and D(k, j) for some k. Second, we built up a table containing

the answers of the problems, in some natural bottom-up order. Third, we used this table to find a way to determine

the actual solution. Dynamic programming generally involves these three steps.

9.2 Edit distance

A problem that arises in biology is to measure the distance between two strings (of DNA). We will examine the

problem in English; the ideas are the same. There are many possible meanings for the distance between two strings;

here we focus on one natural measure, the edit distance. The edit distance measures the number of editing operations

it would be necessary to perform to transform the first string into the second. The possible operations are as follows:


• Insert: Insert a character into the first string.

• Delete: Delete a character from the first string.

• Replace: Replace a character from the first string with another character.

Another possibility is to not edit a character, when there is a Match. For example, a transformation from activate

to caveat can be represented by

D M R D M I M M D
a c t i v   a t e
  c a   v e a t

The top line represents the operation performed in each column. So the first a in activate is deleted, and the first t is

replaced by an a. The e in caveat is explicitly inserted.

The edit distance is the minimal number of edit operations – that is, the number of Inserts, Deletes, or Replaces

– necessary to transform one string to the other. Note that Matches do not count. Also, it is possible to have a

weighted edit distance, if the different edit operations have different costs. We currently assume all operations have

weight 1.

We will show how to compute the edit distance using dynamic programming. Our first step is to define appropriate

subproblems. Let us represent our strings by A[1 . . .n] and B[1 . . .m]. Suppose we want to consider what we do with

the last character of A. To determine that, we need to know how we might have transformed the first n−1 characters

of A. These n−1 characters might have transformed into any number of symbols of B, up to m. Similarly, to compute

how we might have transformed the first n− 1 characters of A into some part of B, it makes sense to consider how

we transformed the first n−2 characters, and so on.

This suggests the following subproblems: we will let D(i, j) represent the edit distance between A[1 . . . i] and

B[1 . . . j]. We now need a recursive description of the subproblems in order to use dynamic programming. Here the

recurrence is:

D(i, j) = min[D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + I(A[i] ≠ B[j])].

In the above, I(A[i] ≠ B[j]) represents the value 1 if A[i] ≠ B[j] and 0 if A[i] = B[j]. We obtain the above expression by considering the

possible edit operations available. Suppose our last operation is a Delete, so that we deleted the ith character of A to

transform A[1 . . . i] to B[1 . . . j]. Then we must have transformed A[1 . . . i−1] to B[1 . . . j], and hence the edit distance


would be D(i−1, j)+ 1, or the cost of the transformation from A[1 . . . i−1] to B[1 . . . j] plus one for the cost of the

final Delete. Similarly, if the last operation is an Insert, the cost would be D(i, j−1)+ 1.

The other possibility is that the last operation is a Replace of the ith character of A with the jth character of B,

or a Match between these two characters. If there is a Match, then the two characters must be the same, and the cost

is D(i−1, j−1). If there is a Replace, then the two characters should be different, and the cost is D(i−1, j−1)+1.

We combine these two cases in our formula, using D(i−1, j−1) + I(A[i] ≠ B[j]).

Our recurrence takes the minimum of all these possibilities, expressing the fact that we want the best possible

choice for the final operation!

It is worth noticing that our recursive description does not work when i or j is 0. However, these cases are

trivial. We have

D(i,0) = i,

since the only way to transform the first i characters of A into nothing is to delete them all. Similarly,

D(0, j) = j.

Again, it is helpful to think of the computation of the D(i, j) as filling up a two-dimensional array. Here, we

begin with the first column and first row filled. We can then fill up the rest of the array in various ways: row by row,

column by column, or diagonal by diagonal!

Besides computing the distance, we may want to compute the actual transformation. To do this, when we fill

the array, we may also picture filling the array with pointers. For example, if the minimal distance for D(i, j) was

obtained by a final Delete operation, then the cell (i, j) in the table should have a pointer to (i− 1, j). Note that a

cell can have multiple pointers, if the minimum distance could have been achieved in multiple ways. Now any path

back from (n,m) to (0,0) corresponds to a sequence of operations that yields the minimum distance D(n,m), so the

transformation can be found by following pointers.

The total computation time and space required for this algorithm is O(nm).
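As a concrete sketch, the recurrence translates into the following Python, filling the table row by row (the function name is illustrative):

```python
def edit_distance(a, b):
    n, m = len(a), len(b)
    # D[i][j] = edit distance between a[1..i] and b[1..j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(m + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            differ = 0 if a[i - 1] == b[j - 1] else 1  # I(A[i] != B[j])
            D[i][j] = min(D[i - 1][j] + 1,             # Delete
                          D[i][j - 1] + 1,             # Insert
                          D[i - 1][j - 1] + differ)    # Match / Replace
    return D[n][m]

print(edit_distance("activate", "caveat"))  # → 5
```

The answer matches the five edits (three Deletes, one Replace, one Insert) in the transformation shown earlier.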

9.3 All pairs shortest paths

Let G be a graph with positive edge weights. We want to calculate the shortest paths between every pair of nodes.

One way to do this is to run Dijkstra’s algorithm several times, once for each node. Here we develop a different

dynamic programming solution.


Our subproblems will be shortest paths using only nodes 1 . . .k as intermediate nodes. Of course when k equals

the number of nodes in the graph, n, we will have solved the original problem.

We let the matrix D_k[i, j] represent the length of the shortest path between i and j using intermediate nodes 1 . . . k.

Initially, we set a matrix D_0 with the direct distances between nodes, given by d_{ij}. Then D_k is easily computed from

the subproblems D_{k−1} as follows:

D_k[i, j] = min(D_{k−1}[i, j], D_{k−1}[i, k] + D_{k−1}[k, j]).

The idea is the shortest path using intermediate nodes 1 . . .k either completely avoids node k, in which case it

has the same length as Dk−1[i, j]; or it goes through k, in which case we can glue together the shortest paths found

from i to k and k to j using only intermediate nodes 1 . . .k−1 to find it.

It might seem that we need at least two matrices to code this, but in fact it can all be done in a single matrix,

updated in place. (Exercise: think about why this is safe!)

D = (d_{ij}), distance array, with weights from all i to all j

for k = 1 to n do
    for i = 1 to n do
        for j = 1 to n do
            D[i, j] = min(D[i, j], D[i, k] + D[k, j])

Note that again we can keep an auxiliary array to recall the actual paths. We simply keep track of the last

intermediate node found on the path from i to j. We reconstruct the path by successively reconstructing intermediate

nodes, until we reach the ends.
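A Python sketch of the in-place version together with path recovery; the `via` array and the small example graph are illustrative:

```python
INF = float("inf")

def all_pairs_shortest_paths(dist):
    """Floyd-Warshall on an adjacency matrix; dist[i][j] = direct edge
    weight (INF if absent, 0 on the diagonal). One matrix suffices."""
    n = len(dist)
    D = [row[:] for row in dist]            # copy of the direct distances
    via = [[None] * n for _ in range(n)]    # last intermediate node on i->j
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
                    via[i][j] = k
    return D, via

def path(via, i, j):
    """Reconstruct the i->j path by recursively expanding intermediates."""
    k = via[i][j]
    if k is None:
        return [i, j]
    return path(via, i, k)[:-1] + path(via, k, j)

d = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
D, via = all_pairs_shortest_paths(d)
print(D[0][3], path(via, 0, 3))  # → 6 [0, 1, 2, 3]
```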

9.4 Traveling salesman problem

Suppose that you are given n cities and the distances di j between them. The traveling salesman problem (TSP) is to

find the shortest tour that takes you from your home city to all the other cities and back again. As there are (n−1)!

possible paths, this can clearly be done in O(n!) time by trying all possible paths. Of course this is not very efficient.

Since the TSP is NP-complete, we cannot really hope to find a polynomial time algorithm. But dynamic

programming gives us a much better algorithm than trying all the paths.


The key is to define the appropriate subproblem. Suppose that we label our home city by the symbol 1, and

other cities are labeled 2, . . . ,n. In this case, we use the following: for a subset S of vertices including 1 and at least

one other city, let C(S, j) be the length of the shortest path that starts at 1, visits all the other nodes in S, and ends at j. Note that our

subproblems here look slightly different: instead of finding tours, we are simply finding paths. The important point

is that the shortest path from 1 to j through all the vertices in S consists of a shortest path from 1 to some vertex x,

where x ∈ S−{j}, plus the additional edge from x to j.

for all j ≠ 1 do C({1, j}, j) := d_{1j}

for s = 3 to n do % s is the size of the subset

for all subsets S of {1, . . . ,n} of size s containing 1 do

for all j ∈ S, j ≠ 1 do

C(S, j) := min_{i ∈ S, i ≠ j, i ≠ 1} [C(S−{j}, i) + d_{ij}]

opt := min_{j ≠ 1} [C({1, . . . ,n}, j) + d_{j1}]

The idea is to build up paths one node at a time, not worrying (at least temporarily) where they will end up.

Once we have paths that go through all the vertices, it is easy to check the tours, since they consist of a shortest path

through all the vertices plus an additional edge. The algorithm takes time O(n²·2ⁿ), as there are O(n·2ⁿ) entries in the

table (one for each pair of set and city), and each takes O(n) time to fill. Of course we can add in structures so that

we can actually find the tour as well. Exercise: Consider how memory-efficient you can make this algorithm.
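A Python sketch of this algorithm (commonly known as the Held–Karp algorithm); cities are 0-indexed here, with city 0 as home, and the distance matrix is illustrative:

```python
from itertools import combinations

def tsp(d):
    """Held-Karp: C[(S, j)] holds the length of the shortest path that
    starts at city 0, visits every city in frozenset S, and ends at j."""
    n = len(d)
    C = {}
    for j in range(1, n):                   # base case: paths 0 -> j
        C[(frozenset([0, j]), j)] = d[0][j]
    for s in range(3, n + 1):               # s is the size of the subset
        for subset in combinations(range(1, n), s - 1):
            S = frozenset(subset) | {0}
            for j in subset:
                C[(S, j)] = min(C[(S - {j}, i)] + d[i][j]
                                for i in subset if i != j)
    full = frozenset(range(n))              # close the tour back at 0
    return min(C[(full, j)] + d[j][0] for j in range(1, n))

d = [[0, 2, 9, 10],
     [1, 0, 6, 4],
     [15, 7, 0, 8],
     [6, 3, 12, 0]]
print(tsp(d))  # → 21
```

The table is indexed by (subset, endpoint) pairs, giving exactly the O(n·2ⁿ) entries counted above, each filled by an O(n) minimization.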