CS计算机代考程序代写 DNA algorithm midterm1-sol.pdf

midterm1-sol.pdf

Name:
I followed the University’s Honor Code.

Student ID:

CMSC 423 Fall 2021 : Midterm 1

Solve each of the following problems. There are 4 problems and 100 total points. You may use NO notes,
calculators, books, phones etc.. If you need additional space, use the back of the exam pages (and note
that you’ve written there). If you need still more space, use the blank paper provided. If you think a
problem cannot be solved as stated, explain why. For the algorithm design questions, you should
describe your algorithm either in clear and unambiguous prose, or using clearly-written pseudocode. Please,
write as neatly as you can. Partial credit is granted, but we cannot grade what we cannot interpret. Good
luck!

Problem 1. (30 points)

i. The reverse complement of the DNA string AGGCTATAGCCT is .

ii. True or False (circle the correct one) : The main benefit of the su�x array over the su�x tree is that

it is asymptotically more space-e�cient.

iii. The Hamiltonian path problem asks to find a path through a graphG that visits each

exactly one time.

iv. To find a pattern of length |P | in a text of length |T |, the Z algorithm requires O( )

space and O( ) time.

v. In the Rabin-Karp algorithm, given the hash of the substring of the text of length |P | starting at index

i� 1, it takes O( ) time to compute the hash of the substring of the text of length |P |

starting at index i.

vi. The process by which RNA is produced from DNA is known as .

vii. The process by which proteins are produced from RNA is known as .

viii. If the length of the coding region of an mRNA transcript is 99 nucleotides long, then the resulting

protein will consist of an amino acid sequence of length .

ix. For a string S of length n (over an alphabet ⌃ of a small constant size), there are at mostO( )

nodes in the su�x trie of S.

1

[2 pts each ; no partial credit]

AGGCTATAGCCT

0

vertex /node

1171-11-1

1171-1-11

1

transcription

translation

33

m2

x. True / False (circle the correct answer). If one removed the “$” (or chosen sentinel character) from

the string before constructing the su�x tree, it would no longer be possible to uniquely identify each

su�x with a root-to-leaf path.

xi. Consider a text T with su�x array SA (T ), and an interval [4, 12] of consecutive su�x array entries. If

the longest common prefix (LCP) of the su�xes of T starting at T [SA[4]] and T [SA[8]] is x and the

LCP of the su�xes of T starting at T [SA[8]] and T [SA[12]] is y, then the LCP of the su�xes of T

starting at T [SA[4]] and T [SA[12]] will be at least .

xii. Assume I have a string S of length |S| over a large alphabet in which every character is distinct, as

well as the su�x array SA(S) of the string. Searching for a pattern P (in which every character is also

distinct), of length |P |, in SA(S) can be accomplished in O ( ) time.

xiii. True / False (circle the correct answer). In the Z-algorithm, it is not allowed for Z-boxes to be nested

or to overlap; they must all be disjoint.

xiv. True / False (circle the correct answer). Given a collection S = {s1, s2, . . . , sn} of strings where

L =
Pn

i=1 |si|, the Shortest Common Superstring of S must have length 
L
2
.

xv. The generalized su�x tree on a pair of strings S and T allows to find the longest common substring

(i.e. the longest string that is a substring of both S and T ) in O( ) time.

2

0

min Cx
, g)

also accept
¥Yn(G ITI , IPD)
OCIPD

=
ISH IT I

Problem 2. (20 points) Note: This question has 2 parts.

(a) Compute the Z-values (the Z array) for the following string:
A C C A G A C C A T A G A C C A G A T

(b) Assume the indexing of the Z-array starts at 0. Using the e�cient Z-box algorithm we discussed in
class, will you need to make any character comparisons to compute the value Z[15] (the Z value for the 16-th
character in the string)? Explain why (you may draw a diagram to aid in your description if you wish).

3

fats] 2-
10 points

90010–4 iii. iii. or , o , o no partial
accept

-1,0, 19

10 points .

full if correct
answer

,

if wrong , 5 points
for

drawing proper 2-
boxes of

describing -2 – algo

B


.

.
.

– JIYA-1
No
,
because

you
end up

in a

rule for the 2- algorithm

( case 2 Ca)) where no character
comparisons are required . The
2- box of 2112] covers Z[ 15 ]

Problem 3. (30 points) Note: This question has 2 parts

(a) You are given some text T . Let us define the score of a substring S as score(S) = |S| · freq(S), where
|S| is the length of S, · means multiplication, and freq(S) is the number of times the string S occurs in
T . Give an O(T ) algorithm for finding the highest scoring substring of T . That is, find the S such that
S = argmaxS0 score(S

0). You may assume in your solution that you can call a subroutine that will construct
any of the full text indices we have discussed in class so far (e.g. su�x trie, su�x tree, su�x array + LCP1
array) in optimal time.

4

15 points

b- points
for linear STISA

construction

✓ 5 points for correct /
linear

freq Fone
5 points for final traversal

☒ Note : Accept either specific / precise and unambiguous
text

,
or pseudocode .

(1) Construct the suffix tree of T ST -1 (04-11) time
space .

÷

)

(2) Augment STT , in each internal node
,

with the number

of leaves rooted at this node . This can be done

in 011-11) time and space with a DFS
.
call this

freqln) –
(3) Now

,
perform a traversal of ST -1 ( say ,

in – order
,

but it doesn’t matter) . As you visit
each node

,

keep track of its striny-depth_ ( call this sdln
))

.

At each node
,
evaluate Sdln) . fregln) . Keep

track of the maximum value of this quantity
over

your
traversal

.
At the end of

your traversal , this is the answer .

(b) For this sub-problem, you must use the su�x array, you cannot provide a solution using

the su�x trie/tree. You are given some text T along with its su�x array SA = SA(T ), the LCP1 array
of this su�x array LCP = LCP1(SA(T )) and some parameter k. Provide an O(|T | k) algorithm that will
find longest k-repeated substring — that is a substring that occurs at least k times. For example, if k = 5,
then the substring must appear at least 5 times in T (though it could occur more frequently as well). Note:
It is possible to do this in O(|T |) time, but I am not requiring your solution to achieve this tighter bound.

5

15 points

✓ ↳ to points
for getting

pairwise min LCP over

interval of length K

even if overall algo is not

right .

def find LKRCT , SA , LCP, K){

lKr= 0
,
lkri= 0

for i c- TO
,
ten ISA) – K){

pl = –

for j c-
[ 0

,
K) {

pl = min ( LCP[i+g] , pl)

}

if pl > lkr {

lkr = pl
lkri = i

3

}

lkrs =
” ” 11 starts empty

if lkr 7k {

}
lkrs = 1- [ lkri : lkritlkr

]

return lkrs

}

Problem 4. (20 points) Note: This question has 2 parts

(a) Below is the su�x tree for a string T. Construct the su�x array of T, and explain how you did this:

6

10 points

5 for correct SA

✓ 5 fordescription
↳Note : they may
have just extracted

the full string from
the ST and sorted the

suffixes ; this is OK
too

, we didn’t require
efficiency here

The suffix
array is

876 4 1 30 52

this can be obtained by doing a

lexicographic

order DFS over the suffix tree
. That is

,
at each

internal node
,
visit te

outgoing edges in lexicographic
order and append the index at each leaf
to the suffix

array being constructed .
In this example , this happens to be left-to-right
order ; this need not always be te case .

(b) Write a function (you must use psedocode for this problem) to produce the su�x array SAT of a string
T from the su�x tree STT of the text T in time and space bounded above by O(|T |). You should name this
function tree_to_array(root) where root is the root of STT. Additionally you may assume that T is a
DNA string over the alphabet ⌃ = A,C,G, T, $ and that you have access to the following two functions :

i. child(n,�) which fetches the child of node n corresponding to character � in O(1) time, or returns ;
if no such child exists

ii. idx(`) that returns the index in the su�x tree associated with node ` (which must be a leaf node).
This is the o↵set where the su�x ending at this leaf node begins within the text.

7

10 points
→ 5 points

for correct recursion (visit children in order)
✓ → 5 points for correct base case ( append idxln) in leaf)

def tree – to –
array

(root){

SA=[]

return t2 ahelper ( root , SA)
}

def tZahelperkn , SA) {
it is


leaf Lr) { SA . append ( idx ( n)) ; return}

for o e [ $
,
A

,
C
,
G
,
T] { 11 ← most visit in this order

C = child ( n , d)

if c. =/ {

Edahelperlc , SA)
}

}
3

def is

leaf ( n) {

if child ( n
,
$) =/ 0 { return false }

if child In
,
A) =/ 0 { return false }

if child (n
, c) =/ 0 { return false }

if child (n
,G) =/ 0 { return false }

if child (n
) -1) -1-0 { return false }

return true

}