COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
This material has been reproduced and communicated to you by or on behalf of Monash University pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 1 / 39
Copyright By PowCoder代写 加微信 powcoder
Prepared by: [ ] Extended by:
FIT3155: Advanced Algorithms and Data Structures
Week 3: Linear-time suffix tree construction
Faculty of Information Technology, Monash University
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 2 / 39
What is covered in this lecture series?
Linear-time suffix tree construction
Ukkonen algorithm (suffix tree construction)
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 3 / 39
Source material and recommended reading
, Algorithms on Strings, Trees and Sequences, Cambridge University Press. (Chapter 5)
Francisco Gomez Martin, Exact String Pattern Recognition
Ukkonen, E. (1995). ”On-line construction of suffix trees”. Algorithmica 14 (3): 249–260.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 4 / 39
Introduction
The substring (matching) problem
Given a reference text txt[1…n], preprocess txt such that any given pattern pat[1…m] can be searched in time proportional to the length of the pattern, O(m).*
*Compare this with what we learnt from Z-algorithm and Boyer-Moore algorithm. Suffix trees (and suffix arrays) permit solving the above (and many
other related) problems. They are very versatile.
Suffix trees unravel the composition of strings, and permit efficient access to them.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 5 / 39
String definitions
Given str[1…n]
A prefix of str[1..n] is a substring str[1..i], ∀1 ≤ i ≤ n. A suffix of str[1..n] is a substring str[j..n], ∀1 ≤ j ≤ n. A substring of str[1..n] is any str[j..i], ∀1 ≤ (j ≤ i) ≤ n.
Therefore,
In other words, a substring is a prefix of a suffix
(or equivalently) a substring is a suffix of a prefix
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 6 / 39
Suffixes of a string – Example
Consider the suffixes of str[1..n]$ = abaaba$
SUFFIX 1 = abaaba$ SUFFIX 2 = baaba$ SUFFIX 3 = aaba$ SUFFIX 4 = aba$ SUFFIX 5 = ba$ SUFFIX 6 = a$ SUFFIX 7 = $
The ‘$’ symbol
Note that $ is a special character to denote the end of the string. It is often chosen to be a character that is lexicographically smaller than all other characters in the text.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 7 / 39
Recall from FIT2004:
path compressed suffix tries = suffix trees
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 8 / 39
Recall from FIT2004:
Efficient representation of suffix trees requires O(n) space
Note, instead of storing the edge labels as substrings explicitly, we can store them implicitly using (j,i) denoting the substring str[j..i], where 1≤(j ≤i)≤n.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 9 / 39
Building a suffix tree na ̈ıvely
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
What is the time complexity of this na ̈ıve construction?
Start with the (empty) root node of the suffix tree: r
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
What is the time complexity of this na ̈ıve construction?
Insert suffix 1 (str[1…]) into an empty tree.
Call the resultant tree T1.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
Call resultant tree T2 What is the time complexity of this na ̈ıve construction?
Insert suffix 2 into T1.
Note str[2….] is not a prefix of str[1….].
So create a new leaf node for str[2..] suffix, branching off at r .
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
What is the time complexity of this na ̈ıve construction?
Insert suffix 3 into T2.
Note str[3….] is not a prefix of str[1….]
or str[2….].
So create a new leaf node for str[3..]
suffix, again branching off at r . Call resultant tree T3
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
Call resultant tree T4 What is the time complexity of this na ̈ıve construction?
Inserting suffix 4 into T3.
Note str[4..5] is the longest common
prefix shared with the suffix str[1….]. So, a new node u is inserted …
… along the edge between r and the leaf node 1…
… with another edge branching off u to
the new leaf node 4
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
Call resultant tree T5 What is the time complexity of this na ̈ıve construction?
Insert suffix 5 into T4.
Note str[5..5] is the longest common
prefix shared with suffix str[2..]. So, a new node v is inserted…
… along the edge between r and the leaf node 2…
… with another edge branching out from v to the new leaf node 5
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Building a suffix tree na ̈ıvely 123456
Considersuffixesof str=abcab$
SUFFIX 1: str[1..6] = a b c a b $
SUFFIX 2: str[2..6] = b c a b $
SUFFIX 3: str[3..6] = c a b $
SUFFIX 4: str[4..6] = a b $
SUFFIX 5: str[5..6] = b $
SUFFIX 6: str[6..6] = $
branching off the root r . What is the time complexity of this na ̈ıve construction?
Insert suffix 6 into T5.
Note the suffix str[6..6] denotes the
special terminal character $. This creates a new isolated edge
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 10 / 39
Efficient suffix tree construction using Ukkonen’s algorithm
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 11 / 39
Linear Time Suffix Tree construction
in 1995 gave a very clever algorithm to construct suffix trees in linear run time.
Ukkonen, E. (1995). ”On-line construction of suffix trees”. Algorithmica 14 (3): 249–260.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 12 / 39
Ukkonen’s linear-time algorithm – introduction
Ukkonen’s algorithm exploits the properties inherent in any suffix tree, and benefits from several ‘tricks’ to achieve a linear run time.
Ukkonen’s algorithms uses the following main ideas:
1 construct and use an ‘implicit suffix tree’ data structure.
the actual suffix tree is computed after iteratively constructing (over many phases) an implicit suffix tree.
2 enhance this implicit suffix tree using ‘suffix links’. this helps make the traversals on the implicit tree much faster.
3 gain from a set of implementational ‘tricks’:
these tricks avoid unnecessary computations, thus speeding up the
algorithm drastically.
Only when all these ideas/elements are used together, Ukkonen’s algorithm achieves a linear-time construction of suffix trees.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 13 / 39
Implicit suffix tree compared with (regular) suffix tree The relationship between an implicit suffix tree and its regular suffix tree can be understood by the following operations on the regular suffix tree:
Start with a regular suffix tree (of str=a b c a b $ ).
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 14 / 39
Implicit suffix tree compared with (regular) suffix tree The relationship between an implicit suffix tree and its regular suffix tree can be understood by the following operations on the regular suffix tree:
Start with a regular suffix tree (of str=a b c a b $ ). Remove all terminal ($) characters in the regular suffix tree.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 14 / 39
Implicit suffix tree compared with (regular) suffix tree The relationship between an implicit suffix tree and its regular suffix tree can be understood by the following operations on the regular suffix tree:
Start with a regular suffix tree (of str=a b c a b $ ). Remove all terminal ($) characters in the regular suffix tree. Then, remove all edges without edge labels (i.e. substrings).
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 14 / 39
Implicit suffix tree compared with (regular) suffix tree The relationship between an implicit suffix tree and its regular suffix tree can be understood by the following operations on the regular suffix tree:
Start with a regular suffix tree (of str=a b c a b $ ).
Remove all terminal ($) characters in the regular suffix tree.
Then, remove all edges without edge labels (i.e. substrings).
Then, path compress the tree by removing all nodes that do not have at least two children.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 14 / 39
Ukkonen’s algorithm builds implicit suffix trees incrementally in phases
Given a string str[1..n], Ukkonen’s algorithm proceeds over n “phases” Each phase i + 1 (where 1 ≤ i + 1 ≤ n) builds the implicit suffix
tree (denoted by implicitSTi+1) for the prefix str[1..i + 1].
Importantly, each implicitSTi+1 is incrementally computed using the implicitSTi from the previous phase.
The construction of implicitSTi+1 from implicitSTi, in turn, involves several suffix extensions, for each suffix str[j..i + 1], ∀j = 1…i + 1 in that order (discussed below).
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 15 / 39
Each phase involves suffix extensions Suffix extension steps in each phase
In any phase i + 1, the suffixes in implicitSTi (from previous phase i) undergo suffix extensions to accommodate the new character, str[i + 1].
Thus, extending any suffix starting at position j, where 1 ≤ j ≤ i + 1, in the current phase i + 1 involves:
finding the end of the path from root node r corresponding to the suffix str[j . . . i] in the current state of the implicit suffix tree, and
extending the end of str[j . . . i] path by appending str[i + 1] to that (growing) suffix.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 16 / 39
Algorithm at an very high level
Construct implicitST1 For i from 1 to n − 1
Begin PHASE i + 1 Forjfrom1toi+1
Begin SUFFIX EXTENSION j
⋆ Find end of path from root denoting str[j..i] in the current state of
the suffix tree.
⋆ Apply one of the three suffix extension rules (see slides 18-20 discussed
End of extension step j (i.e., str[j..i + 1] extension computed). End of phase i + 1 (implicitSTi+1 computed)
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 17 / 39
Suffix extension rules – Rule 1 of 3 RULE 1 of 3
If the path str[j..i] in implicitSTi ends at a leaf, adjust the label of the edge to that leaf to account for the added character str[i + 1].
Before suffix extension
After suffix extension
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 18 / 39
Suffix extension rules – Rule 2 of 3 RULE 2 of 3
If the path str[j..i] in implicitSTi does NOT end at a leaf, and the next character in the existing path is some x ̸= str[i + 1], then split the edge after str[..i] and create a new node u , followed by a new leaf numbered j; assign character str[i + 1] as the edge label between the new node u and leaf j.
Before suffix extension After suffix extension
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 19 / 39
Suffix extension rules – Rule 2 of 3 (an alternative scenario) RULE 2 of 3 – an alternative scenario that can arise
If the path str[j..i] in implicitSTi does NOT end at a leaf, and the next character in the existing path is some x ̸= str[i + 1], and str[i] and x are separated by an existing node u , then create a new leaf numbered j; assign character str[i + 1] as the edge label between the u and the leaf j.
Before suffix extension After suffix extension
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 19 / 39
Suffix extension rules – Rule 3 of 3 RULE 3 of 3
If the path str[j..i] in implicitSTi does NOT end at a leaf, but is within some edge label, and the next character in that path is str[i + 1], then str[i + 1] is already in the tree. No further action needed.
No further action needed.
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 20 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
rule-2 (alt)
branch out of r into a new leaf 1 with label str[1]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
extend edge label between nodes r and leaf 1 with str[2]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
rule-2 (alt)
branch of r into a new leaf 2 with label str[2]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
extend edge label between nodes r and leaf 1 with str[3]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
extend edge label between nodes r and leaf 2 with str[3]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
no further action necessary
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
extend edge label between nodes r and leaf 1 with str[4]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
extend edge label between nodes r and leaf 2 with str[4]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
insert node u along ( r ,2); attach leaf 3; edge-label str[4]
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
phase extn
no further action necessary
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Suffix extension rules – Example
For the string str=a b b a , the implicit suffix trees for each phase, along with their corresponding suffix extensions are shown below:
Full Table seen in previous slides
phase extn
branch out of r into a new leaf 1 with label str[1]
str[1..2] str[2..2]
rule-1 rule-2
extend edge label between nodes r and leaf 1 with str[2] branch of r into a new leaf 2 with label str[2]
str[1..3] str[2..3] str[3..3]
rule-1 rule-1 rule-3
extend edge label between nodes r and leaf 1 with str[3] extend edge label between nodes r and leaf 2 with str[3] no further action necessary
str[1..4] str[2..4] str[3..4] str[4..4]
41 42 43 44
rule-1 rule-1 rule-2 rule-3
extend edge label between nodes r and leaf 1 with str[4] extend edge label between nodes r and leaf 2 with str[4] insert node u along ( r ,2); attach leaf 3; edge-label str[4] no further action necessary
(FIT3155 S1 2022, Monash University) L3:Linear-time suffix tree construction 21 / 39
Speeding up tree traversal using suffix links
Suffix links are simply pointers between internal nodes of an (imp
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com