midterm1-sol.pdf
Name:
I followed the University’s Honor Code.
Student ID:
CMSC 423 Fall 2021 : Midterm 1
Solve each of the following problems. There are 4 problems and 100 total points. You may use NO notes,
calculators, books, phones etc.. If you need additional space, use the back of the exam pages (and note
that you’ve written there). If you need still more space, use the blank paper provided. If you think a
problem cannot be solved as stated, explain why. For the algorithm design questions, you should
describe your algorithm either in clear and unambiguous prose, or using clearly-written pseudocode. Please,
write as neatly as you can. Partial credit is granted, but we cannot grade what we cannot interpret. Good
luck!
Problem 1. (30 points)
i. The reverse complement of the DNA string AGGCTATAGCCT is .
ii. True or False (circle the correct one) : The main benefit of the su�x array over the su�x tree is that
it is asymptotically more space-e�cient.
iii. The Hamiltonian path problem asks to find a path through a graphG that visits each
exactly one time.
iv. To find a pattern of length |P | in a text of length |T |, the Z algorithm requires O( )
space and O( ) time.
v. In the Rabin-Karp algorithm, given the hash of the substring of the text of length |P | starting at index
i� 1, it takes O( ) time to compute the hash of the substring of the text of length |P |
starting at index i.
vi. The process by which RNA is produced from DNA is known as .
vii. The process by which proteins are produced from RNA is known as .
viii. If the length of the coding region of an mRNA transcript is 99 nucleotides long, then the resulting
protein will consist of an amino acid sequence of length .
ix. For a string S of length n (over an alphabet ⌃ of a small constant size), there are at mostO( )
nodes in the su�x trie of S.
1
[2 pts each ; no partial credit]
AGGCTATAGCCT
0
vertex /node
1171-11-1
1171-1-11
1
transcription
translation
33
m2
x. True / False (circle the correct answer). If one removed the “$” (or chosen sentinel character) from
the string before constructing the su�x tree, it would no longer be possible to uniquely identify each
su�x with a root-to-leaf path.
xi. Consider a text T with su�x array SA (T ), and an interval [4, 12] of consecutive su�x array entries. If
the longest common prefix (LCP) of the su�xes of T starting at T [SA[4]] and T [SA[8]] is x and the
LCP of the su�xes of T starting at T [SA[8]] and T [SA[12]] is y, then the LCP of the su�xes of T
starting at T [SA[4]] and T [SA[12]] will be at least .
xii. Assume I have a string S of length |S| over a large alphabet in which every character is distinct, as
well as the su�x array SA(S) of the string. Searching for a pattern P (in which every character is also
distinct), of length |P |, in SA(S) can be accomplished in O ( ) time.
xiii. True / False (circle the correct answer). In the Z-algorithm, it is not allowed for Z-boxes to be nested
or to overlap; they must all be disjoint.
xiv. True / False (circle the correct answer). Given a collection S = {s1, s2, . . . , sn} of strings where
L =
Pn
i=1 |si|, the Shortest Common Superstring of S must have length
L
2
.
xv. The generalized su�x tree on a pair of strings S and T allows to find the longest common substring
(i.e. the longest string that is a substring of both S and T ) in O( ) time.
2
0
min Cx
, g)
also accept
¥Yn(G ITI , IPD)
OCIPD
=
ISH IT I
Problem 2. (20 points) Note: This question has 2 parts.
(a) Compute the Z-values (the Z array) for the following string:
A C C A G A C C A T A G A C C A G A T
(b) Assume the indexing of the Z-array starts at 0. Using the e�cient Z-box algorithm we discussed in
class, will you need to make any character comparisons to compute the value Z[15] (the Z value for the 16-th
character in the string)? Explain why (you may draw a diagram to aid in your description if you wish).
3
fats] 2-
10 points
90010–4 iii. iii. or , o , o no partial
accept
-1,0, 19
10 points .
full if correct
answer
,
if wrong , 5 points
for
drawing proper 2-
boxes of
describing -2 – algo
B
•
–
.
.
.
– JIYA-1
No
,
because
you
end up
in a
rule for the 2- algorithm
( case 2 Ca)) where no character
comparisons are required . The
2- box of 2112] covers Z[ 15 ]
Problem 3. (30 points) Note: This question has 2 parts
(a) You are given some text T . Let us define the score of a substring S as score(S) = |S| · freq(S), where
|S| is the length of S, · means multiplication, and freq(S) is the number of times the string S occurs in
T . Give an O(T ) algorithm for finding the highest scoring substring of T . That is, find the S such that
S = argmaxS0 score(S
0). You may assume in your solution that you can call a subroutine that will construct
any of the full text indices we have discussed in class so far (e.g. su�x trie, su�x tree, su�x array + LCP1
array) in optimal time.
4
15 points
b- points
for linear STISA
construction
✓ 5 points for correct /
linear
freq Fone
5 points for final traversal
☒ Note : Accept either specific / precise and unambiguous
text
,
or pseudocode .
(1) Construct the suffix tree of T ST -1 (04-11) time
space .
÷
)
(2) Augment STT , in each internal node
,
with the number
of leaves rooted at this node . This can be done
in 011-11) time and space with a DFS
.
call this
freqln) –
(3) Now
,
perform a traversal of ST -1 ( say ,
in – order
,
but it doesn’t matter) . As you visit
each node
,
keep track of its striny-depth_ ( call this sdln
))
.
At each node
,
evaluate Sdln) . fregln) . Keep
track of the maximum value of this quantity
over
your
traversal
.
At the end of
your traversal , this is the answer .
(b) For this sub-problem, you must use the su�x array, you cannot provide a solution using
the su�x trie/tree. You are given some text T along with its su�x array SA = SA(T ), the LCP1 array
of this su�x array LCP = LCP1(SA(T )) and some parameter k. Provide an O(|T | k) algorithm that will
find longest k-repeated substring — that is a substring that occurs at least k times. For example, if k = 5,
then the substring must appear at least 5 times in T (though it could occur more frequently as well). Note:
It is possible to do this in O(|T |) time, but I am not requiring your solution to achieve this tighter bound.
5
15 points
✓ ↳ to points
for getting
”
pairwise min LCP over
interval of length K
”
even if overall algo is not
right .
def find LKRCT , SA , LCP, K){
lKr= 0
,
lkri= 0
for i c- TO
,
ten ISA) – K){
pl = –
for j c-
[ 0
,
K) {
pl = min ( LCP[i+g] , pl)
}
if pl > lkr {
lkr = pl
lkri = i
3
}
lkrs =
” ” 11 starts empty
if lkr 7k {
}
lkrs = 1- [ lkri : lkritlkr
]
return lkrs
}
Problem 4. (20 points) Note: This question has 2 parts
(a) Below is the su�x tree for a string T. Construct the su�x array of T, and explain how you did this:
6
10 points
5 for correct SA
✓ 5 fordescription
↳Note : they may
have just extracted
the full string from
the ST and sorted the
suffixes ; this is OK
too
, we didn’t require
efficiency here
The suffix
array is
876 4 1 30 52
this can be obtained by doing a
”
lexicographic
”
order DFS over the suffix tree
. That is
,
at each
internal node
,
visit te
outgoing edges in lexicographic
order and append the index at each leaf
to the suffix
array being constructed .
In this example , this happens to be left-to-right
order ; this need not always be te case .
(b) Write a function (you must use psedocode for this problem) to produce the su�x array SAT of a string
T from the su�x tree STT of the text T in time and space bounded above by O(|T |). You should name this
function tree_to_array(root) where root is the root of STT. Additionally you may assume that T is a
DNA string over the alphabet ⌃ = A,C,G, T, $ and that you have access to the following two functions :
i. child(n,�) which fetches the child of node n corresponding to character � in O(1) time, or returns ;
if no such child exists
ii. idx(`) that returns the index in the su�x tree associated with node ` (which must be a leaf node).
This is the o↵set where the su�x ending at this leaf node begins within the text.
7
10 points
→ 5 points
for correct recursion (visit children in order)
✓ → 5 points for correct base case ( append idxln) in leaf)
def tree – to –
array
(root){
SA=[]
return t2 ahelper ( root , SA)
}
def tZahelperkn , SA) {
it is
–
leaf Lr) { SA . append ( idx ( n)) ; return}
for o e [ $
,
A
,
C
,
G
,
T] { 11 ← most visit in this order
C = child ( n , d)
if c. =/ {
Edahelperlc , SA)
}
}
3
def is
–
leaf ( n) {
if child ( n
,
$) =/ 0 { return false }
if child In
,
A) =/ 0 { return false }
if child (n
, c) =/ 0 { return false }
if child (n
,G) =/ 0 { return false }
if child (n
) -1) -1-0 { return false }
return true
}