CS考试辅导 FIT2004. Here we choose to focus on suffix trees as they are tremendously

Advanced algorithms and data
structures
Week 3 lecture notes by
Based on lecture slides by

Copyright By PowCoder代写 加微信 powcoder

Faculty of Information Technology September 30, 2021

1 Suffixtrees ………………………………………….. 2
2 Implicitsuffixtrees………………………………………. 7
3 Ukkonen’salgorithmforlineartimeconstructionofsuffixtrees . . . . . . . . . . . . . . . . . . . . . 8
3.1 Overview ……………………………………….. 8
3.2 Naive implicit suffix tree construction
3.3 Speeding up traversal with suffix links
3.4 Skip-counting…………..
3.5 Spaceefficientrepresentationofedgelabels ………………………. 24
3.6 Extensionterminationcriterion(theshowstopperrule). . . . . . . . . . . . . . . . . . . . . . 25
3.7 Rapidleafextension ………………………………….. 26
3.8 Rule3srequirealittlemorethannothing……………………….. 28
3.9 Summary……………………………………….. 29
3.10 Extendedexample …………………………………… 32
…………………………. 9 …………………………. 16 …………………………. 22

1. SUFFIX TREES
1 Suffix trees
To this point we have only examined linear time solutions to the exact pattern matching problem that preprocess the pattern. However, there are also linear time solutions that preprocess the text. Although we will find the worst case complexity of these methods to be the same (or similar depending on the level of granularity of our analysis), whether it is best to preprocess the pattern or the text is often dictated by the problem at hand. In fact, often we are not given a choice as in many problems only one of the pattern or the text are made available for preprocessing.
In the substring matching problem we are given access to the text and need to preprocess it in order to support rapid queries. In contrast, the exact pattern matching problem gave us a pattern and asked us to find it in an initially unknown text. The algorithms that we have considered previously enable us to efficiently search for a single pattern within multiple large texts. However, now we are tasked with efficiently searching for multiple patterns within a single text. This reversal of roles of the pattern and the text reduces the effectiveness of the methods that preprocess the pattern and to remedy this we need to construct methods that preprocess the text.
There are several data structures that can be constructed from the text that support pattern matching in O(m)- time. For instance, we examined suffix trees and the burrows wheeler transform (BWT) in FIT2004. Here we choose to focus on suffix trees as they are tremendously versatile data structures that enable efficient solutions to a multitude of important problems (see tutorial 3 and assignment 2 for examples). However, naive construction of suffix trees requires O(n2)-time and yields an O(n2+km)-time solution to the substring matching problem; assuming we are given k patterns of the same length. On the surface they do not appear to be an attractive alternative to exact pattern matching algorithms that promise an O(m)-time preprocessing phase and an O(n)-time search phase for each pattern, giving O(km + kn) overall. Fortunately, there exists an ingenious algorithm for constructing a suffix tree in linear time known as Ukkonen’s algorithm, and with it suffix trees allow us to solve the substring matching problem in O(n + km)-time. But before we examine the inner workings of Ukkonen’s, we will quickly remind ourselves of what a suffix tree is and examine their connection to the implicit suffix trees, which are central to the Ukkonen’s algorithm.
Definition 1: The substring matching problem
Given a reference text txt[1…n], preprocess txt such that any given pat[1…m] can be searched in time proportional to the length of the pattern, O(m).
Definition 2: Suffix trees
A suffix tree for a string str[1 . . . n] is a tree with the following properties.
1. It has exactly n leaves numbered 1 through n.
2. Except for the root, every internal node has at least two children.
3. Each edge is labelled with a non-empty substring of str.
4. Every node has at most one edge beginning with any given character of the alphabet.
5. The string-label of the path starting at the root and ending at the leaf numbered i is the suffix str[i . . . n],
for i = 1…n.
Recall that we can also define suffix trees as path compressed suffix tries.

1. SUFFIX TREES
Example 1: Suffix tree for abcabz
Let’slookatasimpleexample,str[1…6]=abcabz whosesuffixtreeisshownbelow,
Figure 1: The suffix tree for str[1 . . . 6] = abcabz
We can see that this tree has all 5 properties we require of a suffix tree. Specifically, it has: • 6 leaves numbered 1 through 6,
• every internal node has at least two children except the root which has four,
• each edge is labelled with a substring of str,
• at every node no two edges start with the same character,
• any path from the root to a leaf node will yield a suffix of str.
Example 2: Implicit suffix tree for abcab
On the surface our definition of a suffix tree is seems sound, but what happens if we remove the last character
fromourstring?Wewouldobtainstr[1…5]=abcab andthecorrespondingtreewouldbe,
Figure 2: The implicit suffix tree for str[1 . . . 5] = abcab
But this isn’t a suffix tree! In particular, it doesn’t satisfy rules 1 and 5. The issue is that the suffixes str[4…5] =‘ab’ and str[5] =‘b’ are prefixes of the suffixes str[1…5] =‘abcab’ and str[2…5] =‘bcab’ respectively. In contrast, the string str[1 . . . 6] = ‘abcabz’ has no such suffixes as the last character (‘z’) appears no where else in the string.
Example 2 demonstrates that not all strings have an associated suffix tree, but we can circumvent this issue by adding a special terminal character to the end of our string. In this unit (unless stated otherwise) we will take this character to be ‘$’.

1. SUFFIX TREES
Definition 3: Terminal character
The terminal character ‘$’ is defined to be a character in the alphabet such that,
1. It occurs only once in the string and is always the last character in the string. 2. It is lexicographically smaller than any other character in the alphabet.
Example 3: Suffix tree for abcab
To construct a valid suffix tree for str[1 . . . 5] = abcab we can append ‘$’ to give str[1 . . . 6] = abcab$, which then yields the tree,
Figure 3: The suffix tree for str[1 . . . 6] = abcab$ You should check that this tree satisfies all the required properties.
Strictly speaking the suffix tree given in example 3 is the suffix tree for ‘abcab$’ and not ‘abcab’, but as the latter doesn’t have a corresponding suffix tree, and the former has all the practical and conceptual properties that one might want in a suffix tree for ‘abcab’, we will treat the two as synonymous from this point forward. As such, anytime we refer to the suffix tree for a string str[1 . . . n] we are actually referring to the suffix tree of str[1 . . . n]$.
Naive suffix tree construction
For the sake of completeness we will briefly examine the naive construction of suffix trees. This is assumed knowledge and you should revise your notes from previous units for full discussion of the naive algorithm as we will only provide a refresher through way of example here. The suffix tree in example 3 can be constructed naively as follows:
• Start with the (empty) root node of the suffix tree. • Insert suffix 1 (str[1…]) into an empty tree.
• Call the resultant tree T1.

1. SUFFIX TREES
• Insert suffix 2 into T1.
• Note str[2….] is not a prefix of str[1….]
• So create a new leaf node for str[2..] suffix, branching off at the root. • Call resultant tree T2
• Insert suffix 3 into T2.
• Note str[3….] is not a prefix of str[1….] or str[2….].
• So create a new leaf node for str[3..] suffix, again branching off at the root. • Call resultant tree T3
• Inserting suffix 4 into T3.
• Note str[4..5] = ab is the longest common prefix shared with the suffix str[1….]. • So, a new node (u) is inserted …
• … along the edge between the root and the leaf node 1…
• … with another edge branching off u to the new leaf node 4
• Call resultant tree T4

1. SUFFIX TREES
• Insert suffix 5 into T4.
• Note str[5..5] = b is the longest common prefix shared with suffix str[2..]. • So, a new node (v) is inserted…
• … along the edge between the root and the leaf node 2…
• … with another edge branching out from v to the new leaf node 5
• Call resultant tree T5
• Insert suffix 6 into T5.
• Note the suffix str[6..6] denotes the special terminal character $. • This creates a new isolated edge branching off the root.
Question 1
Assuming as input a string, str[1 . . . n], show that the time complexity of the naive method presented above is O(n2).

2. IMPLICIT SUFFIX TREES
Space efficient representation of suffix trees
Explicitly storing the substring labels of each edge in our suffix tree is convenient conceptually, but requires O(n2) space for a string of length n. Fortunately, we don’t need to store the edge labels explicitly. Instead we can store them implicitly using a tuple (j, i) to denote the substring str[j . . . i], where 1 ≤ (j ≤ i) ≤ n. This reduces the space required to store the suffix tree to O(n) (assuming a fixed size alphabet).
Figure 4: By implicitly storing the edge labels using the start and end indices of the corresponding substrings we can store a suffix tree in O(n) space.
Question 2
Prove that the space efficient representation of suffix trees requires O(n) space. You may assume the size of the alphabet under consideration is fixed.
2 Implicit suffix trees
We have now established a robust definition for suffix trees, but this does not make the trees in example 1 and 2 insignificant. These are in fact the implicit suffix trees for the strings ‘abcabz’ and ‘abcab’ respectively.
Definition 4: Implicit suffix trees
The implicit suffix tree for a string str[1 . . . n] is the tree obtained from the suffix tree for str[1 . . . n] by, 1. Removing all the terminal characters (‘$’) from the tree.
2. Deleting all the edges labelled with empty substrings in the resultant tree.
3. Removing all internal nodes that don’t have at least two children.
Example 4: Constructing an implicit suffix tree from a suffix tree.
Let’s examine the process of constructing an implicit suffix tree when given a regular suffix tree for,
str[1…n]=abcab .

3. UKKONEN’S ALGORITHM FOR LINEAR TIME CONSTRUCTION OF SUFFIX TREES
Notice that the tree in example 4 only contains the suffixes str[5 . . . 6] = ‘ab’ and str[6] = ‘b’ implicitly hence the name implicit suffix tree. Furthermore, we can see that the final tree is identical to the tree we found in example 2. This suggests an alternate definition for implicit suffix trees.
These definitions imply that we can obtain the regular suffix tree that corresponds to an implicit suffix tree by simply appending ‘$’ to all the suffixes contained within the tree. Although we are yet to describe how to do this algorithmically this idea is at the heart of Ukkonen’s algorithm.
3 Ukkonen’s algorithm for linear time construction of suffix trees
Recall that our goal is to develop a method for building the suffix tree that corresponds to a string str[1…n] in O(n)-time. We have already seen that naive construction requires O(n2)-time, but we have just discovered an alternate method for constructing suffix trees.
First construct the implicit suffix tree for str[1 . . . n], and then extend every suffix in this tree by $ to obtain the corresponding regular suffix tree. This method always yields the correct regular suffix tree for str[1 . . . n] because the terminal character only appears at the end of the string by definition, and when the last character in a string is found nowhere else in the string the implicit suffix tree for the string is the same as the regular suffix tree (see example 1). That is, the implicit suffix tree for str[1 . . . n]$ is the regular suffix tree for str[1 . . . n]$. However, this viewpoint can be restrictive as it tempts us to construct the required implicit suffix tree using the naive method for regular suffix tree construction, which obviously will not yield any improvement on the worst case time complexity. To achieve the desired linear bound we need to invert our thinking and grow the suffixes in our tree incrementally instead of successively inserting the complete suffixes into our tree.
3.1 Overview
At a high level Ukkonen’s algorithm is relatively simple, and although the ‘devil is in the detail,’ we will abstract ourselves from these complications initially to establish the algorithm’s overall structure. As alluded to in the previous section, the goal of Ukkonen’s algorithm is to construct the regular suffix tree for str[1…n] by first constructing the corresponding implicit suffix tree. The algorithm proceeds over n phases and in each phase i + 1 (where 1 ≤ i + 1 ≤ n) the implicit suffix tree Ii+1 for the prefix str[1 . . . i + 1] is constructed. Specifically in each phase i + 1, Ii+1 is computed incrementally using Ii, the implicit suffix tree constructed in the previous phase. At the conclusion of the n phases we have constructed In – the implicit suffix tree for str[1 . . . n] – which can be extended to the regular suffix tree for str[1…n] by appending the terminal character $ and running one last iteration of the algorithm1.
1When implementing Ukkonen’s algorithm it is easiest to append the terminal character to str[1…n] at the beginning of the algorithm and then perform n + 1 phases to obtain the regular suffix tree for str[1 . . . n].
Definition 5: Implicit suffix trees (alternate definition)
An implicit suffix tree for the string str[1 . . . n] is the tree obtained when attempting to build the suffix tree for str without first appending the terminal character (‘$’) to str.

3. UKKONEN’S ALGORITHM FOR LINEAR TIME CONSTRUCTION OF SUFFIX TREES
By definition Ii contains (implicitly) all the suffixes of str[1 . . . i] and in phase i + 1 these suffixes all undergo suffix extensions (often just called extensions) to accommodate the new character str[i + 1]. Once every suffix of str[1 . . . i] has been extended by str[i + 1], and the new suffix str[i + 1] has been added to the tree (equivalent to extending the empty suffix by str[i + 1]), the tree will contain all the suffixes of str[1 . . . i + 1] and we obtain Ii+1. Within phase i + 1 extensions proceed as follows: for any suffix of str[1 . . . i] starting at position j, where 1 ≤ j ≤ i + 1,
• find the end of the path from the root node in the current state of the implicit suffix tree that corresponds to the suffix str[j . . . i].
• Extend this path by appending str[i + 1] to the suffix so that the tree now contains a path from the root that corresponds to the suffix str[j . . . i + 1].
3.2 Naive implicit suffix tree construction
There are three types of extensions that occur once the path str[j . . . i] has been found and by detailing these we can transform the previous overview into a naive method for constructing implicit suffix trees. We can then derive Ukkonen’s algorithm by making several intuitive improvements that will eventually yield an O(n)-time method. To proceed we’ll consider building the implicit suffix tree for str[1 . . . 4] = abac and generalise the extensions that occur as we come to them.
To begin we must construct I1 from an empty tree. Recall that I1 contains all suffixes of the prefix str[1], which is just a single character str[1] = a. Thus, we only have one extension to make as 1 ≤ j ≤ i+1 =⇒ j = 1.
Shortly we will see that this extension can be thought of as what we will call a rule 2 extension, but for now we will treat it as the base case of the algorithm.
Now that all extensions in phase i + 1 = 1 have been made we proceed to phase i + 1 = 2, where we need to construct I2 from I1 . In particular, we need to extend the path str[1] = a in I1 to str[1 . . . 2] = ab and insert str[2] = b, the remaining suffix of the prefix str[1 . . . 2]. This gives us two extensions overall in phase 2 and 1 ≤ j ≤ 2.
Phase: i+1=1,extensionj=1
To insert str[1] = a we create a new root node r and a new leaf numbered 1. We then assign the character a as the edge label between the new node r and leaf 1.

3. UKKONEN’S ALGORITHM FOR LINEAR TIME CONSTRUCTION OF SUFFIX TREES
Phase: i+1=2,extensionj=1
To insert str[1 . . . 2]= ab we first find the path str[1] = a. As this path ends at the leaf 1 we simply adjust the label of the edge to account for the new character str[2] = b, so that the edge is now labeled by str[1 . . . 2] = ab.
The previous extension is the first of three types that can occur when constructing implicit suffix trees in this manner and we define an extension rule that describes how to perform this kind of extension.
Definition 6: Rule 1 extensions
When performing extension j in phase i + 1 first find the path str[j . . . i] in Ii. If this path ends at a leaf, adjust the label of the edge to that leaf to account for the added character str[i + 1].
Before suffix extension −→ After suffix extension
Having completed the first extension in the phase we proceed immediately to the next.
The j = 2 extension is clearly not a rule 1 extension as our path does not end at a leaf, and is in fact an example of a rule 2 extension defined as follows.
Phase: i+1=2,extensionj=2
To insert str[2] = b we first note that the path (str[j…i] = str[2…1]) that needs to be extended is the empty string, and so our extension is made directly from r . There is no path from the root that starts with str[2] = b we must create a new leaf numbered 2 and assign the character b as the edge label between the r and leaf 2.

3. UKKONEN’S ALGORITHM FOR LINEAR TIME CONSTRUCTION OF SUFFIX TREES
Definition 7: Rule 2 extensions (case 1)
When performing extension j in phase i + 1 first find the path str[j . . . i] in Ii. If this path does not end at a leaf, and the next character in the existing path is some x ̸= str[i + 1], and str[i] and x are separated by an existing node u , then create a new leaf numbered j; assign character str[i + 1] as the edge label between
the u and the leaf j.
Before suffix extension −→ After suffix extension
In extension j = 2 in phase i + 1 = 2 the path str[2…1] was the empty string and so ended at the root. The next character in the only existing path was str[1] = a ̸= str[2] = b and so we created a new leaf numbered 2 and labeled the edge between leaf 2 and the root with str[2] = b. The base case is also essentially a rule 2 extension of this kind as we begin with a singleton node in the root and our initial path is str[1…0], which is the empty string. As there are no non-empty paths from the root in phase i + 1 = 1 we cannot find any path continuing with the character str[1] and so we must create a new leaf numbered 1 and label the edge between it and the root with str[1].
With both extensions in phase i + 1 = 2 complete we now move onto phase i + 1 = 3 where we must make three extensions, one each for 1 ≤ j ≤ 3.
Phase: i+1=3,extensionj=1
To insert str[1…3] = aba we first follow the path str[j…i] = str[1…2] = ab from r . This path ends at leaf 1 so we make the extension using rule 1 and extend the edge label between r and leaf 1 by str[3] = a.

3. UKKONEN’S ALGORITHM FOR LINEAR TIME CONSTRUCTION OF SUFFIX TREES
Phase: i+1=3,extensionj=2
Toinsertstr[2…3]=bawefirstfollowthepathstr[j…i]=str[2…2]=bfrom r . Thispathendsatleaf 2 so we make the extension using rule 1 and extend the edge label between r and leaf 2 by str[3] = a.
Phase: i+1=3,extensionj=3
To insert str[3] = a we first follow the path str[j . . . i] = str[3 . . . 2], which is the empty string. Starting from r we look for a path continued by str[3] = a and we find one

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com