代写代考 FIT2004 Week 7 Studio Sheet (Solutions)

Week 7 Studio Sheet (Solutions)
Useful advice: The following solutions pertain to the theoretical problems given in the tutorial classes. You are strongly advised to attempt the problems thoroughly before looking at these solutions. Simply reading the solu- tions without thinking about the problems will rob you of the practice required to be able to solve complicated problems on your own. You will perform poorly on the exam if you simply attempt to memorise solutions to the tutorial problems. Thinking about a problem, even if you do not solve it will greatly increase your understanding of the underlying concepts. Solutions are typically not provided for Python implementation questions. In some cases, psuedocode may be provided where it illustrates a particular useful concept.
Implementation checklist
Assessed Preparation
Problem1. Drawaprefixtriecontainingthestringscat,cathode,canopy,dog,danger,domain.Terminate
each string with a $ character to indicate the ends of words.
It will be most beneficial for your learning if you have completed this checklist before the studio.
By the end of week 7, write Python code for a Trie structure (recommended to use a class). It should have an insert method and a search method.
Source: https://www.cs.usfca.edu/~galles/visualization/Trie.html
Problem2. DrawasuffixtreeforthestringABAABA$. Solution
Studio Problems
Problem 3. Describe an algorithm that given a sequence of strings over a constant-size alphabet, counts how many different ones there are (i.e. how many there are ignoring duplicates). Your algorithm should run in O (T ) time where T is the total length of all of the strings.
Initialise count = 0. Append “$” to each string. Insert each string into a trie. After all strings have been inserted, traverse the tree and count the number of “$” nodes. If two strings are the same, they will termi- nate in the same node, so the number of “$”s is the number of distinct strings. It will take O (T ) to build the trie, and O (T ) to traverse the trie, so the whole algorithm runs in O (T ).
Problem4. DrawasuffixtreeforthestringGATTACA$. Solution
Source: https://visualgo.net/en/suffixtree
The numbers at the leaf nodes are the suffix IDs (assuming first suffix has ID 0). This visualisation merges the nodes that form a chain but do not show tuple representation of the branch. Write the tuple repre- sentation yourself. In exams, you must write the nodes in tuple (x,y) representation.
Problem5. DescribeanalgorithmthatgivenastringSoflengthn,countsthenumberofdistinctsubstringsof S in O(n) time.
First let’s consider how to solve this with a suffix trie, and then convert our solution to work in a suffix tree. A substring of S is a prefix of a suffix of S. This means that each substring of S can be found by following some path from the root of a suffix trie of S to some node (possibly leaf, possibly internal). If two substrings of S are the same, they will end at the same node. So to count distinct substrings, we can just count the nodes of the suffix trie, which takes O(n2) time.
Now we want to use a suffix tree instead. If we think about converting a trie to a tree, each node in the tree is made from one or more nodes in the trie, and the number of characters stored at each node in the tree is exactly the number of nodes which were compressed to create that tree node. So to do the equivalent of counting nodes in the trie, we need to count total characters in the tree. We can do this with a single traversal taking O (n ) time, because a suffix tree has O (n ) nodes, and at each node, we look up the indices of its parent edge label to determine its length, so counting the characters in a node takes O (1).
Problem6. Earlier,inStudio4,welearnedaboutthelongestcommonsubsequenceproblem.Nowweconsider a similar, related problem, the longest common substring problem. Given two strings s1 and s2 , your goal is to find a string that is a substring of both s1 and s2 and is as long as possible. Describe how a suffix tree can be used to solve this problem. Your algorithm should run in O (n + m ) time, where n , m are the lengths of s1 , s2 respectively. You may assume that you can construct a suffix tree in linear time.
Hint: It is often easier to think about how to solve problems in a trie context, and then convert your solution to work in a tree.
Finding the longest common substring of s1 and s2 is very similar to finding the longest repeated substring of s1s2, which we know can be found by creating a suffix tree and then finding the deepest node with more than one child. The issue with naively applying this to s1 s2 is that, in the concatenated string, the longest repeated substring might lie totally in s1 (or s2), or partially in s1 and partially in s2. We can solve the second problem easily by adding a seperator character (which is not present in either string) in our concatenation, so that we have s1#s2$.
An important observation about the placement of “#” and “$” in the tree is that no internal node can contain either of them. That is, all instances of these terminator characters must appear in leaves. This is clearly true for “$”, since it terminates the whole string. For “#”, since it only appears once in the string, and since we know that all internal nodes in a tree have at least two children, i.e. they correspond to repeating substrings, we know that “#” cannot be in any internal node, as it does not repeat.
Bearing this observation in mind, we need to address the problem of ensuring that one of our repetitions is in s1, and the other is in s2. Consider an interal node of the tree. If it has a leaf descendant node with a “#”, then it must appear in s1. If it has a leaf descendant without a “#”, then it must appear in s2. So we need to find the deepest such node.
We traverse the tree, and at each node we store two flags, one for being a substring of s1 and the other for being a substring of s2. If a node is a substring of s1, all its ancestors are substrings of s1, and similarly for s2. Once we have finished we return the depth of the node with the longest substring that was a substring of both s1 and s2, i.e. had both flags set.
It is worth having a think about how to set the flags for every node in the tree in linear time, and also how to check whether a leaf node contains a “#” in constant time.
Problem7. Describehowtomodifyaprefixtriesothatitcanperformsqueriesoftheform“countthenumber of strings in the trie with prefix p ”. Your modification should not affect the time or space complexity of building the trie, and should support queries in O (m ) time, where m is the length of the prefix.
All words with a prefix p are leaves in the subtree rooted at the node containing the last character in p . So we have to count the leaves in the subtree of that node. We could traverse the trie to get to the last node of p in O (m ) and then do a complete traversal of the subtree, but this would take much longer than O (m ), possibly even taking O (T ) where T is the number of characters in the tree.
Instead, we modify the insertion algorithm. We will store a counter at each node of our trie, initialised to 0. When inserting a word, whenever we move to or create a node, we increment the counter at that node by 1. This counter tracks the number of words which have the prefix corresponding to that node. Then, to count the number of strings with prefix p , we just traverse the tree to the last node of p which takes O (m ) time, and return the count of that node.
Problem8. Givenatrieandaquerystring,wewanttofindthelexicographicallygreateststringinthetriewhich is lexicogrphically less than or equal to the query string.
Your data structure should perform these queries in O (n + m ) time, where m is the length of the query string, and n is the length of the answer string (the predecessor). It is possible that some queries have no string which satisfies the criterion above (if it is less than all of the strings in the trie), in which case you should return null.
Example: Suppose the trie contains “aaa”, “aba”, “baa”. If the query is “ba”, the result is “aba”. If the query is “baa”, the result is “baa”.
Let’s assume that the contents of our trie are terminated with a $ ∈/ Σ. We traverse the query string, keep- ing track of the deepest node with a child lexicographically less than the child we are about to move to (initially null). This node is the longest prefix that the query string shares with the answer. We either end up failing at some point (the current node has no child with the character we want), or we reach the end of the query string. If we reach the end of the query string (including the “$”) we return the query. Oth- erwise we look at the node we were keeping track of, and if it is null, we return null since the word must have no predecessor, as none of its prefixes could be extended with a lexicographically lesser character. If such a node does exist, go back to that node, go along the edge with the greatest character such that it is lexicographically less than the character we followed in our initial traversal. Then follow the lexico- graphically greatest path from that node to a leaf. The string corresponding to that path is the solution. This takes O (m ) to traverse the query and O (n ) to traverse the answer string.
In the below example, the query is candelabra. After the traversal, n is the deepest node with a child lexicographically less than the node we traverse to (d does not satisfy this condition since its children are both greater than e). So after traversing cand we go back to n and then traverse until we find a leaf, always choosing the lexicographically greatest option, which yields canal.
Source: https://www.cs.usfca.edu/~galles/visualization/Trie.html 1: functionPREDECESSOR(S[1..n])
2: node = root
3: ancestor = null // Lowest ancestor with an extension lexicographically less than S
4: ancestor_child = null
5: answer = “”
6: for each character c in S [1..n ] do
7: if node has an edge with label < c then 8: ancestor = node 9: ancestor_child = the greatest child of ancestor lexicographically less than c 10: end if 11: if node has an edge for character c then node = node.get_child(c) 12: else break 13: ifc=$thenreturnanswer //Sisinthetrie 14: else answer += c 15: end for 16: if ancestor = null then return null // Nothing is lexicographically less than S 17: while node ̸= ancestor do 18: node = node.parent 19: answer.pop_back() // Remove last character from answer 20: end while 21: answer += ancestor_child.character 22: node = ancestor_child 23: while node is not a leaf do 24: c = lexicographically greatest edge of node 25: answer += c 26: node = node.get_child(c) 27: end while 28: return answer.pop_back() // Remove the $ from the answer string 29: endfunction Problem9. Youaregivenasequenceofnstringss1,s2,...,sn eachofwhichhasanassociatedweightw1,w2,...,wn. You wish to find the “most powerful prefix” of these words. The most powerful prefix is a prefix that maximises the following function score(p)= 􏰁 wi ×|p|, si suchthatp is a prefix of si where |p | denotes the length of the prefix p . Describe an algorithm that solves this problem in O (T ) time, where T is the total length of the strings s1 , ..., sn . Assume that we have three strings baby, bank, bit with weights 10, 20 and 40, respectively. The score of prefix ba is 2×10+2×20 = 60, the score of prefix b is 1×10+1×20+1×40 = 70 andthescoreofprefixbitis3×40=120. For each prefix, we need to keep track of its score. We modify our prefix trie to store a score at each node, and each time we insert a word, we update the score by adding to it the weight of the current word multiplied by the length of the prefix represented by this node. Once all insertions are complete, we traverse the tree to find the node with the maximum score. The prefix corresponding to this node is the answer. Supplementary Problems Problem 10. Given a string S and a string T , we wish to form T by concatenating substrings of S . Describe an algorithm for determining the fewest concatenations of substrings of S to form T . Using the same substring multiple times counts as multiple concatenations. For example, given the strings S = ABCCBA and T = CBAABCA, ittakesatleastthreesubstrings,e.g.CBA + ABC + A.YouralgorithmshouldruninO(n+m)time,wherenand m are the lengths of S and T . You can assume that you have an algorithm to construct suffix tree in linear time. We first make the observation that we can be greedy. If we are up to a certain point T [ j ...], then it is optimal to take the longest possible substring from S that matches T [ j ..k ] for the largest possible k . First build a suffix tree from S , taking O (n ) time, then search for the string T in the tree. This search is going to find the longest substring of S which corresponds to the start of T . Once we reach a point where the search stops, this means that the current substring cannot be extended, so we go back to the root and search again from the next character of T . We repeat this process until we reach the last character in T . Below is an illustration of the example case given in this problem. Source: https://www.cs.usfca.edu/~galles/visualization/Trie.html Problem 11. Describe an algorithm for finding a shortest unique substring of a given string S . That is, finding a shortest substring of S that only occurs once in S . For example, given the string cababac, the shortest unique substrings are ca and ac of length two. Your algorithm should run in O (n ) time, where n is the length of S . You can assume that you have an algorithm to construct suffix tree in linear time. Consider a suffix trie of S . A unique substring corresponds to a node that has only one leaf below it. We want to find the highest such node, since this will be the shortest such substring. Define a candidate node as a node which has a leaf as a child, whose edge label contains more than just $. In a suffix tree, all internal nodes have two or more children. So we want to find the shallowest candidate node, i.e. the candidate node which has the fewest characters on the path from the root. Let’s traverse the tree, tracking the shallowest candidate. When we finish the traversal, the correct length is the length stored in the shallowest candidate + 1. This is because we need to include the first character in one of the leaf edges, since internal nodes are non-unique substrings. The substring stored on the path from the root to the shallowest candidate, including the first character in any of its descendants is a shortest unique substring. Source: https://visualgo.net/en/suffixtree Red numbers indicate the length we are tracking as we traverse the trie. When we finish, there are two candidate nodes with depth 1, so the length of the shortest non-repeating substring is 2, and the solutions is either AC or CA Problem 12. Describe an algorithm for finding a shortest string that is not a substring of a given string S over some fixed alphabet. For example, over the alphabet {A,B} the shortest string that is not a substring of AAABABBBA is BAA of length three. Your algorithm should run in O (n ) time, where n is the length of S . If we traverse the tree, the shallowest node we find that does not have every character in the alphabet as a child will give us the shortest missing substring. This is because, if a node has every alphabet character as a child, then we can extend the substring at that node to every possible substring with 1 more character. If one or more alphabet characters are missing, we cannot do this, and so we have found a missing sub- string. The missing substring is given by the string at that node + the missing alphabet character. Note that the number of nodes in the suffix tree is O (n ). So, the worst-case cost is O (n ). Source: https://visualgo.net/en/suffixtree In the example above, the shortest missing string is BAA Problem 13. (Advanced) Prefix tries also have applications in processing integers. We can store integers in their binary representation in a prefix trie over the alphabet {0, 1}. Use this idea to solve the following problem. We are given a list of w-bit integers a1,a2,...,an. We want to answer queries of the form “given a w-bit integer x, select oneoftheintegersai tomaximisex⊕ai,where⊕denotestheXORoperation.YourqueriesshouldruninO(w) time. To maximise a binary number, we want to set 1 bits in the highest slots possible. This means we should select the ai which has the opposite bit to x in the highest possible positions as we read from left to right. If we have a prefix trie containing a1,a2,..an, using x we can traverse the trie as follows. Each time we cometoanode,seeifthenodehasachildwhichisNOT(x[i])(i.e.0ifx[i]=1,or1ifx[i]=0). Ifso, choose that child. If not, the node has only one child, so go to that node. One way to view this is like the ordinary trie traversal but opposite. We always chose to go to a child with a different label than the current character, instead of the one with the same character if possible. When we reach a leaf, the choice of ai which maximises ai XOR x is that leaf. Since we traversed a single path from root to leaf, our query has taken O (w ) time. Problem14. (Advanced) Givenalistofw-bitintegersa1,a2,...,an,findthemaximumpossiblecumulativeXOR of a subarray ai , ..., a j , i.e. maximise ai ⊕ai+1 ⊕...⊕aj−1 ⊕aj , where ⊕ denotes the XOR operation. Your algorithm should run in O (n w ) time. Hint: Use the fact that x ⊕ x = 0, and the ideas from Problem 13. First, some properties of XOR. It is associative, meaning that ((a ⊕ b ) ⊕ c ) = (a ⊕ (b ⊕ c )), and, as given in the question, x ⊕ x = 0. This means that if we have a list of numbers all being XOR’ed together, and some of these number are identical, we can reorder the list using associativity until the pairs of identical numbers are next to each other, and then cancel them using the second property. Define a function F(L,R)=aL ⊕aL+1⊕...⊕aR TheproblemwearetryingtosolveisnowtofindanL andR whichmaximiseF,where1≤L ≤R ≤n. Note that because of the facts above, we have F (L , R ) = F (1, L − 1) ⊕ F (1, R ) So if we had all the values F (1, 1), F (1, 2)...F (1, j − 1) in a trie, we could compute the maximum value for F (i, j),i ≤ j in O(w) with a query to the trie described in Problem 13. Then we simply iterate over 1≤i ≤n,keepingtrackofthemaximum,whichtakesO(nw). 1: 2: 3: 4: 5: 6: 7: 8: 9: functionMAX_SUBARRAY_XOR(A[1..n]) prefix_xor = 0 trie = Trie() // The trie from Problem 13 maximum_xor = 0 fori=1tondo prefix_xor = prefix_xor ⊕A[i ] // Extend the current prefix maximum_xor = max(maximum_xor, trie.query(prefix_xor)) trie.insert(prefix_xor) return maximum_xor endfunction Problem 15. (Advanced) In o 5, we wrote a dynamic programming solution to the following problem. Given a target string S of length n and a list of m strings of length at most n , find the minimum number of strings from L to concatenate to form S . A word from L may be used multiple times. We devised an O (n 2 m ) algorithm for this problem. Describe how you can use a prefix trie to speed up this solution to O (n 2 + n m ) time. In Studio 5, we wrote the following dynamic programming solution: DP[i ] = {The minimum number of words needed to form the suffix S [i ..n ]}, for 0 ≤ i ≤ n . w prefix of S[i..n] if i = n , ifnowordw ∈L isaprefixofS[i..n] otherwise. DP[i + |w |] A part of this should immediately jump out at us. Note that we check over all strings w ∈ L that are prefixes of S[i..n]. We have a very good data structure for handling this sort of thing, a prefix trie! So, instead of simply looping over all of the words in L and checking whether they are prefixes, we will first insert all of the contents of L into a trie, terminated by $ ∈/ Σ. Since there ar