代写代考 FIT2004 Week 8 Studio Sheet (Solutions)

Week 8 Studio Sheet (Solutions)
Useful advice: The following solutions pertain to the theoretical problems given in the tutorial classes. You are strongly advised to attempt the problems thoroughly before looking at these solutions. Simply reading the solu- tions without thinking about the problems will rob you of the practice required to be able to solve complicated problems on your own. You will perform poorly on the exam if you simply attempt to memorise solutions to the tutorial problems. Thinking about a problem, even if you do not solve it will greatly increase your understanding of the underlying concepts. Solutions are typically not provided for Python implementation questions. In some cases, psuedocode may be provided where it illustrates a particular useful concept.
Assessed Preparation
Problem1. WritethesuffixarrayforthestringACACIA$andcomputeitsBurrows-Wheelertransform.
Source: https://visualgo.net/bn/suffixarray
The sorted rotations of ACACIA$ are shown below. The BWT is obtained by reading the characters down
the right hand side, shown in red.
The BWT of the string is therefore AI$CAAC.
Problem2. ShowthestepsintheinverseBurrows-WheelertransformforthestringN$RSOOCIMPSE.
To invert BWT we use the L-F mapping.
Which yields the string $NOISSERPMOC. Reversing this gives us the original string, COMPRESSION$
Studio Problems
Problem3. WritethesuffixarrayforthestringGATTACA$andcomputeitsBurrows-Wheelertransform. Solution
Source: https://visualgo.net/bn/suffixarray
The sorted rotations of ACACIA$ are shown below. The BWT is obtained by reading the characters down the right hand side, shown in red.
The BWT of the string is therefore ACTGA$TA.
Problem4. ShowthestepsintheinverseBurrows-WheelertransformforthestringTWOIPPR$ENO. Solution
To invert BWT we use the L-F mapping.
Which yields the string $TNIOPREWOP. Reversing this gives us the original string, POWERPOINT$ Problem5. Considerthestringwoolloomooloo$.TheBWTofthisstringisooolooooolwml$.Showthesteps
taken by the pattern matching algorithm for the following patterns: (a) olo
(b) oll (c) oo (d) wol
The BWT of woolloomooloo$ is ooolooooolwml$. If the location of a substring is also to be deter- mined, during the pre-processing step, in addition to the first and last column, we also maintain the ID of cyclic rotation (i.e., suffix ID) with each row – this is the same as a suffix array (see the figures below where the IDs are shown in red). During the query processing, these IDs are used to determine the lo- cation as explained later. When searching for patterns we start at the back of the pattern, and search in reverse, using the LF mapping to guide us to occurrences of the previous character each time. Diagrams for each search follow.
(a) Searchingforolo
Starting at “o”, we see there are 2 “l”s within the range in the last column, so we contract the range
In the last column there is only one “o” in our range, so there is one instance of the substring olo
Using the L-F mapping, our range now con- tains the cyclic shifts which have “lo” as a pre- fix.
We use L-F mapping to get the location of “o” in the first column and determine its location using suffix ID, e.g., the location of olo in the string is 10.
(b) Searchingforoll
Starting at “l”, we see there is one “l” within the range in the last column, so we contract the range
After applying the L-F mapping, we see one “o” in our range, we know there is one “oll” sub- string.
Again using the L-F mapping, we can now look up the location of our substring which is 3.
(c) Searchingforoo
Using the L-F mapping, we can now look up Starting with “o”, we can see 4 “o”s in the the locations of our substrings in the suffix ar-
range, so there are 4 “oo” substrings. ray (12, 2, 9 and 6)
(d) Searchingforwol
Starting at “l”, we see there are 2 “o”s within
the range in the last column, so we contract the Apply the L-F mapping. We are tracking ol
substrings.
We observe that the range contracts to 0 (there are no “l”s in the range) so there are no wol substrings.
Problem6. ConsidertheprefixdoublingalgorithmappliedtocomputingthesuffixarrayofJARARAKA$.Write the partially-sorted suffix array after the length two prefixes have been sorted. Write the corresponding rank array and perform the next iteration of prefix doubling, showing the partially-sorted suffix array for the length four prefixes.
After sorting the length 2 prefixes, the suffixes are in the following order, so we have the following suffix array and rank array.
Index Suffix Rank
9$1 8 A$ 2 6 AKA$ 3 2 ARARAKA$ 4 4 ARAKA$ 4 1 JARARAKA$ 5 7 KA$ 6 3 RARAKA$ 7 5 RAKA$ 7
The order of the suffixes
Index 123456789 SuffixArray 9 8 6 2 4 1 7 3 5 Rank 547473621
The corresponding suffix array and rank array
After performing one more iteration and sorting the length 4 prefixes, the suffixes will be in the following order, yielding the following suffix array and rank array.
Index Suffix Rank
9$1 8 A$ 2 6 AKA$ 3 4 ARAKA$ 4 2 ARARAKA$ 5 1 JARARAKA$ 6 7 KA$ 7 5 RAKA$ 8 3 RARAKA$ 9
The order of the suffixes
Index 123456789 SuffixArray 9 8 6 4 2 1 7 5 3 Rank 659483721
The corresponding suffix array and rank array
Since all of the suffixes have unique ranks, we can tell that the suffixes are fully sorted and therefore the suffix array is the final suffix array.
Problem 7. The minimum lexicographical rotation problem is the problem of finding, for a given string, its cyclic rotation that is lexicographically least. For example, given the string banana, its cyclic rotations are banana, ananab, nanaba, anaban, nabana, abanan. The lexicographically (alphabetically) least one is abanan. Describe how to solve the minimum lexicographical rotation problem using a suffix array.
A cyclic rotation of a string is a suffix of that string followed by a prefix of that string. Since a suffix array stores the suffixes in sorted order, the lexicographically least suffix will be the first element of the arrray (ignoring the “$” which will always be first.) However, we can run into problems such as the string XAA$:
Source: https://visualgo.net/bn/suffixarray
Here, the suffix “A$” would be followed by XA in its cyclic rotation, giving AXA, wheras the suffix AA$ would be followed by X, giving AAX, which is lexicographically earlier than AXA. So it is not enough to just construct the suffix array, we need to account for the elements of the corresponding prefix.
We do this by first appending a copy of the string in question to itself, and then constructing the suffix array. This gaurantees that the corresponding prefix for each suffix is taken into account in the order of the suffix array. Note that we should only consider suffixes which are strictly longer than the original string, since only these suffixes contain characters from the appended copy.
Source: https://visualgo.net/bn/suffixarray
Now AAXAA$X is the first such string in the suffix array, and hence AAX is the lexicographically least rota-
Problem 8. Consider the following rank array, obtained during the prefix doubling algorithm applied to the strings ABBAABABBA$. At this point in the algorithm, we have just finished updating the ranks for some value of k.
String ABBAABABBA $ SuffixID 1 2 3 4 5 6 7 8 9 10 11 Rank 46534546521
(a) Determine what the current value of k is. In other words, what lengths of strings are the ranks currently comparing?
(b) Describeindetailhowtheprefixdoublingalgorithmwouldcomparethefollowingpairsofsuffixesduring the next iteration (for strings of length 2k ):
• 1and2 • 1and5 • 2and8 • 3and9 • 1and7
(a) kcannotbe1,sincesuffix1and4aredifferent.kcouldbe2,sinceallidentical2letterstringshave the same rank. k cannot be 4, since ID 2 and ID 8 have the same rank, but are different 4 character strings. So k is 2.
Rank[1] = 4. Rank[2] = 6. Since Rank[1] < Rank[2], we can determine that suffix 1 should come before suffix 2 during sorting. • Rank[1] = 4. Rank[5]=4. Since they are equal, we need to compare the following 2 characters. Rank[1+2] = 5. Rank[5+2] = 4. Since Rank[5+2] < Rank[1+2], we can determine that suffix 5 is less than suffix 1. • Rank[2]=6.Rank[8]=6.Sincetheyareequal,weneedtocomparethefollowing2characters. Rank[2+2] = 3. Rank[8+2] = 2. Since Rank[8+2] < Rank[2+2], we can determine that suffix 8 is less than 2. • Rank[3]=5.Rank[9]=5.Sincetheyareequal,weneedtocomparethefollowing2characters. Rank[3+2] = 4. Rank[9+2] = 1. Since Rank[9+2] < Rank[3+2], we can determine that suffix 9 is less than suffix 3. • Rank[1]=4.Rank[7]=4.Sincetheyareequal,weneedtocomparethefollowing2characters. Rank[1+2] = 5. Rank[7+2] = 5. Since they are also equal, we cannot differentiate suffixes 1 and 7 on their first 4 characters. As the sort needs to be stable, we will not change their relative order. Supplementary Problems Problem 9. A string S of length n is called k-periodic if S consists of n/k repeats of the string S[1..k]. For example, the string abababab is two-periodic since it is made up of four copies of ab. Of course, all strings are n -periodic (they are made of one copy of themselves!) The period of a string is the minimum value of k such that it is k -periodic. Describe an efficient algorithm for determining the period of a string using a suffix array. To solve this problem, let’s relate the period of the string to the concept of cyclic rotations. If a string is k-periodic, then this means that it is equal to its kth cyclic rotation. For example, if we take the string abababab, then its cyclic rotations are babababa, then abababab, which is the original string again. We can therefore use an idea similar to Problem 7. Let’s append a copy of the string S to itself to obtain the string S S . Now we observe that if a string is k -periodic, then the string S will appear in position k + 1 in SS. For example, take abababab, and append it to itself to obtain abababababababab. Then we notice that the substring at position 3 is precisely the original string. Thus, after suffix array on the string SS is constructed, we can do a substring search for S in the suffix array to find all suffixes that contain the substring S . Among these suffixes, the suffix with the smallest ID (ignoring the first suffix) can be used to determine the value of k . For example, we search abababab in the suffix array of abababababababab which matches with suffixes 1, 3, 5, 7, 9. The smallest suffix (except first suffix) that matches is 3. There- fore, the value of k is 2. Building the suffix array takes O (n log2 n ) time, and substring search takes O (n log(n )). Problem10. Writepsuedocodeforanalgorithmthatconvertsthesuffixtreeofagivenstringintothesuffixarray inO(n)time. Solution A suffix array is just the suffixes in sorted order. Since the suffix tree contains all of the suffixes as leaves, we just need to traverse it in lexicographic order and note down the order in which we find the leaves, and this will give us the suffix array. Whenever we reach a leaf, we can figure out which suffix it belongs to by looking at its length. A suffix of length k must be the suffix from position n − k + 1 1: 2: 3: 4: 5: 6: 1: 2: 3: 4: 5: 6: 7: 8: 9: functionTO_SUFFIX_ARRAY() SA[1..n] = null counter = 1 // Tracks which suffix we are up to CONVERT(root, 0) return SA[1..n] endfunction functionCONVERT(node,length) if node is a leaf then SA[counter] = n - length + 1 counter = counter + 1 else for each child of node, in lexicographic order do CONVERT(child, length + child.num_chars) end for end if endfunction Problem 11. (Advanced) Recall the pattern-matching algorithm that uses the Burrows-Wheeler transform of the text string. One downside of this algorithm is the large memory requirement if we decide to store the occurrences Occ(c , i ) explicitly for every position i and every character c . In this problem we will explore some space-time tradeoffs using milestones. Suppose that instead of storing Occ(c , i ) for all values of i , we decide to store it for every k th position only, i.e. we store Occ(c , 0), Occ(c , k ), Occ(c , 2k ), Occ(c , 3k ), ... for all characters c . (a) Whatisthespacecomplexityofstoringthepreprocessedstatisticsinthiscase? (b) Whatisthetimecomplexityofperformingthepreprocessing? To compute Occ(c , i ), we will take the value of Occ(c , j ) where j is the nearest multiple of k up to i and then manually count the rest of the occurrences in S [ j + 1..i ] (c) Whatisthetimecomplexityofperformingaqueryforapatternoflengthm? (d) Describe how bitwise operations can be exploited to reduce the space complexity of the preprocessed statistics to O 􏰆 |Σ|n 􏰇 where w is the number of bits in a machine word, while retaining the ability to per- form pattern searches in O (m ). (a) SincewewillstorethevalueofOcc(c,k)forallmultiplesofkupton,theamountofspacerequired O 􏰊|Σ|n 􏰋. k (b) Weperformalinearsweepoverthetextstring,maintainingthecountsforallcharacters.Whenwe hit a multiple of k , we save the current state of the counts. In total, this takes O 􏰊|Σ|+n + |Σ|n 􏰋 k (c) For a given position, we can compute the nearest multiple of k in constant time, hence the only work required is to count the number of occurrences between the milestone and the query position. In the worst case, we may have to count k − 1 elements, hence this takes O (k ) time per character. Therefore to search a pattern of length m takes O (m k ) time in the worst case. (d) InordertoachieveO(m)timepatternsearches,weneedtobeabletocomputeOcc(c,i)inconstant time. To do so, we make milestones of size k = w , giving us the value of Occ at every w th position. This gives us the space complexity desired. Our goal is to speed up the computation of counting the remaining occurrences between the milestones. For each milestone position j , for each character c ∈ A, let’s store a w -bit integer whose b th bit is 1 if the character at position j + b is c , or a 0 otherwise. To compute the number of occurences of c in the interval [ j + 1..i ] then corresponds to counting the number of 1 bits in the first i − j positions of this integer. This can be achieved by masking off the bottom bits by taking its bitwise AND with 2i − j − 1 and then using a popcount operation, which counts the number of 1 bits in an integer. Assuming that our machine model supports popcount in O (1) time, we can now compute Occ(c , i ) in O (1) time. Since we stored just one integer at each milestone for each character, the amount of extra space used is O 􏰆 |Σ|n 􏰇, in addition to the O 􏰆 |Σ|n 􏰇 space used for the milestones, making the ww total space complexity as required. Problem 12. (Advanced) A given suffix array could have come from many strings. For example, the suffix array 7, 6, 4, 2, 1, 5, 3 could have come from banana$, or from dbxbxa$, or many other possible strings. (a) Devise an algorithm that given a suffix array, determines any string that would produce that suffix array. You can assume that you have as large of an alphabet as you need. (b) Deviseanalgorithmthatgivenasuffixarray,determinesthelexicographicallyleaststringthatwouldpro- duce that suffix array. You can assume that you have as large of an alphabet as required. Your algorithm should run in O (n ) time (a) Toconstructanystringthathasagivensuffixarrayisquitesimple.Thefirstsuffixwillbetheempty suffix so we always have $ as the final character. From there, the next suffix should begin with a, the next with b, the next with c, and so on. This means that we place a at position SA[2], b at position SA[3], c at position SA[4] and so on. For example, given the suffix array above, this algorithm would produce the string dcfbea$. (b) Toproducethelexicographicallyearlieststringisabittrickier.Whatwewouldliketodoisreusethe same character as many times as possible, starting several suffixes with a, then several with b, and so on if we can. But we need to be careful to ensure that we keep the order of the suffixes correct. Suppose we place some letter c at the position SA[i ]. How do we determine whether it is safe to also place c at position SA[i + 1]? Well, the suffix at position SA[i + 1] must be greater since it comes later in the suffix array, so we need to ensure that c +S[(SA[i]+1)..n]