PowerPoint Presentation
Faculty of Information Technology,
Monash University
FIT2004: Algorithms and Data Structures
Week 6: Retrieval Data Structures for Strings
These slides are prepared by M. A. Cheema and are based on the material developed by Arun Konagurthu and Lloyd Allison.
Recommended readings
Unit notes (Chapters 9&10)
Cormen et al. “Introduction to Algorithms” (Chapter 18)
For a more advanced treatment of Trie and suffix trees: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press. (Chapter 5) – Book available in the library!
(FIT2004
FIT2004: Lec-6: Retrieval Data Structures for Strings
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Introduction
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suppose you have a large text containing N strings. You want to pre-process it such that searching on this text is efficient.
Sorting based approach:
Pre-processing: Sort the strings
Searching: Binary search to find
Let M be the average length of strings (M can be quite large, e.g., for DNA sequences). Comparison between two strings takes O(M).
Time complexity:
Pre-processing O(MN log N) using merge sort or O(MN) using radix sort
Searching O(M log N)
Can we do better?
Yes! ReTrieval data structures allow answering different string queries efficiently
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
ReTRIEval tree = Trie
Often pronounced as ‘Try’.
Trie is an N-way (or multi-way) tree, where N is the size of the alphabet
E.g., N=2 for binary
N = 26 for English letters
N = 4 for DNA
In a standard Trie, all words with the shared prefix fall within the same subtree/subtrie
In fact, it is the shortest possible tree that can be constructed such that all prefixes fall within the same subtree.
Trie Example: Insertion
FIT2004: Lec-6: Retrieval Data Structures for Strings
Let’s look at an example : a Trie that stores baby, bad, bank, box, dog, dogs, banks.
We will use $ to denote the end of a string.
Inserting a string in a Trie:
Start from the root node
For each character c in the string
If a node containing c exists
Move to the node
Else
Create the node
Move to it
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Alternative Illustration
FIT2004: Lec-6: Retrieval Data Structures for Strings
Traditionally, characters are shown on edges instead of nodes. However, these are just two different ways to illustrate.
We show characters on nodes because I find them clearer in lecture slides especially for dense examples later in the lecture (e.g., in Suffix Trie).
In exams, you can use any of the two illustrations.
b
a
y
b
$
d
$
n
k
$
s
o
x
$
o
d
g
$
$
$
s
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Trie Example: Search
FIT2004: Lec-6: Retrieval Data Structures for Strings
Searching a string:
Start from the root node
For each character c in the string (including $)
If a node containing c exists
Move to the node
If c == $
Return “found”
Else
Return “not found”
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Search box
Trie Example: Search
FIT2004: Lec-6: Retrieval Data Structures for Strings
Searching a string:
Start from the root node
For each character c in the string (including $)
If a node containing c exists
Move to the node
If c == $
Return “found”
Else
Return “not found”
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Search boxing
Trie Example: Search
FIT2004: Lec-6: Retrieval Data Structures for Strings
Searching a string:
Start from the root node
For each character c in the string (including $)
If a node containing c exists
Move to the node
If c == $
Return “found”
Else
Return “not found”
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Output for searching ban ?
Time Complexity?:
For loop runs O(M) times.
Time to check if a node containing c exists?
Depends on implementation, and on whether alphabet size is constant
Trie Example: Prefix Matching
FIT2004: Lec-6: Retrieval Data Structures for Strings
Prefix matching returns every string in text that has the given string as its prefix.
E.g., Autocompletion. Return all strings that start
with “ban”
Prefix matching:
Start from the root node
For each character c in the prefix
If a node containing c exists
Move to the node
Else
Return “not found”
Return all strings in the
subtree rooted at the last node
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Prefix matching for ban
Trie Example: Prefix Matching
FIT2004: Lec-6: Retrieval Data Structures for Strings
Prefix matching returns every string in text that has the given string as its prefix.
E.g., Autocompletion. Return all strings that start
with “b”
Prefix matching:
Start from the root node
For each character c in the prefix
If a node containing c exists
Move to the node
Else
Return “not found”
Return all strings in the
subtree rooted at the last node
b
a
b
y
$
d
$
n
k
$
o
x
$
d
o
g
$
s
$
s
$
Prefix matching for b
Implementing a Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Implementation using an array:
At each node, create an array of alphabets size (e.g., 26 for English letters, 4 for DNA strings)
If i-th node exists, add pointer to it at array[i]
Otherwise, array[i] = Nil.
The above implementation allows checking whether a node exists or not in O(1).
Other implementations are possible (e.g., using linked lists or hash tables).
Example: Implementing a Trie using arrays (assuming only three letters A,B,C)
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 2 3 4
A B C $
Insert BA$
1 2 3 4
A B C $
1 2 3 4
A B C $
B
A
$
Insert BC$
1 2 3 4
A B C $
C
$
Advantages and Disadvantages of Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Advantages
A better search structure than a binary search tree with string keys.
A more versatile search structure than hash table
Allows lookup on prefix matching in O(M)-time where M is the length of prefix.
Allows sorting collection of strings in O(MN) time where MN is the total number of characters in all strings
Disadvantages
On average Tries can be slower (in some cases) than hash tables for looking up patterns/queries.
Wastes space, since even when a node has few children, you need to create an array of size alphabet
Some properties of Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
The maximum depth is the length of longest string in the collection.
Insertion, Deletion, Lookup operations take time proportional to the length of the string/pattern being inserted, deleted, or searched.
But we waste a lot of space if
each node has 1 pointer per symbol in the alphabet.
deeper nodes typically have mostly null pointers.
Can reduce total space usage by turning each node into a linked list or binary search tree etc, trading off time for space.
Radix/PATRICIA Tree (NOT EXAMINABLE BUT WORTH MENTIONING)
FIT2004: Lec-6: Retrieval Data Structures for Strings
Radix/PATRICIA tree is a space-optimized/compact Trie data structure
Unlike regular tries, edges can be labeled with substrings of characters.
The nodes along a path having exactly one child are merged
This makes them much more efficient for sets of strings that share long prefixes or substrings.
Radix/PATRICIA Tree (NOT EXAMINABLE BUT WORTH MENTIONING)
b
a
b
y
$
n
k
$
o
x
$
e
s
r
$
$
s
e
r
$
b
a
by
$
nk
oxe
$
s
r
$
$
s
er
$
FIT2004: Lec-6: Retrieval Data Structures for Strings
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
(Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words
A string s
1 2 3 . . . m-1 m
Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
(Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words
A prefix of a word s[1..m] is some string s[1..i] where 1<=i<=m
FIT2004: Lec-6: Retrieval Data Structures for Strings
A string s
1 2 3 i . . . m-1 m
Prefix of s
Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
(Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words
A prefix of a word s[1..m] is s[1..i] where 1<=i<=m
A suffix of a word s[1..m] is s[i..m] where 1<=i<=m
A string s
1 2 3 . . . i m-1 m
Suffix of s
Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
Any substring of a word is a prefix of some suffix
In other words, a substring of s[1..m] is s[i..j]
A string s
1 2 3 i . . . j m-1 m
Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
Any substring of a word is a prefix of some suffix
In other words, a substring of s[1..m] is s[i..j]
s[i..j] is a prefix of s[i..m] (which is a suffix of s[1..m])
To be able to efficiently search substrings…
Just make a prefix trie of suffixes
A string s
1 2 3 i . . . j m-1 m
Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Consider some text, e.g., “refer”.
A Trie constructed using all suffixes of the text is called a Suffix Trie
Pick any substring, eg “efe”
r
e
f
e
$
r
e
f
e
$
r
f
e
$
r
$
r
$
Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Consider some text, e.g., “refer”.
A Trie constructed using all suffixes of the text is called a Suffix Trie
Pick any substring, eg “efe”
Notice that it traces a path from root to some node
r
e
f
e
$
r
e
f
e
$
r
f
e
$
r
$
r
$
Constructing Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suffix trie of referrer$
Insert all suffixes of referrer in the trie
referrer$
eferrer$
ferrer$
errer$
rrer$
rer$
er$
r$
$
Many arrows are removed for
better visualization
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
Found!
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
Not found 🙁
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Substring search on Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Substring search for str
Similar to prefix matching
search “err”
search “fers”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Time Complexity:
O(M) where M is the length of substring
$
Counting # of occurences of a substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Quiz time!
https://flux.qa - RFIBMB
$
Counting # of occurences of a substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
Follow the path similar to prefix matching
Count # of leaf nodes ($) in the subtree rooted at the last node
E.g Count occurences of
“er”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Time Complexity:
Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie
$
Counting # of occurences of a substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
Follow the path similar to prefix matching
Count # of leaf nodes ($) in the subtree rooted at the last node
E.g Count occurences of
“er”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Time Complexity:
Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie
$
Counting # of occurences of a substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
Follow the path similar to prefix matching
Count # of leaf nodes ($) in the subtree rooted at the last node
E.g Count occurences of
“er”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Time Complexity:
Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie
$
Counting # of occurences of a substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
Follow the path similar to prefix matching
Count # of leaf nodes ($) in the subtree rooted at the last node
E.g Count occurences of
“er”
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Time Complexity:
Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie
$
Finding longest repeated substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
Quiz time!
https://flux.qa - RFIBMB
$
Finding longest repeated substring
FIT2004: Lec-6: Retrieval Data Structures for Strings
Find the deepest node in the tree with at least two children
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Space complexity of suffix trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Let N be the size of the string for which suffix trie is constructed
Space complexity?:
# number of suffixes: O(N)
Cost for each suffix is
linear to its size
Total space cost: O(N2)
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suffix Tree is a compact Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Compress branches by merging the nodes that have only one child
r
e
f
$
e
r
r
e
r
f
$
e
r
r
e
r
e
f
$
e
r
r
e
r
$
r
r
e
r
$
r
e
r
$
r
$
$
$
Suffix Tree
FIT2004: Lec-6: Retrieval Data Structures for Strings
Compress branches by merging the nodes that have only one child
But the total complexity is still the same as the same number of letters are stored
r
e
ferrer$
ferrer$
e
ferrer$
r
rer$
rer$
r$
$
$
Quiz time!
https://flux.qa - RFIBMB
$
Space complexity of suffix tree
FIT2004: Lec-6: Retrieval Data Structures for Strings
Compress branches by merging the nodes that have only one child
But the total complexity is still the same as the same number of letters are stored
Replace every substring with numbers (x,y) where x is the starting index of the substring and y is its length
e.g., ferrer is represented as (3,6)
rer is represented as (6,3)
r
e
ferrer$
ferrer$
e
ferrer$
r
rer$
rer$
r$
$
$
r e f e r r e r $
1 2 3 4 5 6 7 8 9
$
Space complexity of suffix tree
FIT2004: Lec-6: Retrieval Data Structures for Strings
Compress branches by merging the nodes that have only one child
But the total complexity is still the same as the same number of letters are stored
Replace every substring with numbers (x,y) where x is the starting index of the substring and y is its length
e.g., ferrer$ is represented as (3,7)
rer$ is represented as (6,4)
(1,1)
(2,1)
(3,7)
(2,1)
(5,1)
(6,4)
(6,4)
(8,2)
(9,1)
(9,1)
r e f e r r e r $
1 2 3 4 5 6 7 8 9
(3,7)
(3,7)
(9,1)
Space complexity of suffix tree
Every (internal) node has at least 2 children
There are exactly n leaves
There are at most n/2 nodes in the layer above the leaves
There are at most n/4 notes in the layer above that
…
The total number of nodes is bounded by
N + n/2 + n/4 + … + 1 < 2n
(1,1)
(2,1)
(3,7)
(2,1)
(5,1)
(6,4)
(6,4)
(8,2)
(9,1)
(9,1)
(3,7)
(3,7)
r e f e r r e r $
1 2 3 4 5 6 7 8 9
(9,1)
Time complexity of constructing suffix tree
FIT2004: Lec-6: Retrieval Data Structures for Strings
The algorithm described earlier inserts O(N) suffixes
Insertion cost of each suffix is linear in the size of suffix
Average suffix size is O(N)
Thus, total time complexity is O(N2)
It is possible to construct suffix tree in O(N)
Esko Ukkonen in 1995 gave a beautiful (but involved) algorithm to construct a Suffix Tree in linear time. If you every get interested in doing this in linear time, consider reading the source:
Ukkonen, E. (1995). "On-line construction of sux trees". Algorithmica 14 (3): 249260.
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Sorted Suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
String
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
1
2
3
4
5
6
7
8
9
10
11
12
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Sort
Querying on Sorted Suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Substring search:
Is “IPP” in the String?
Binary search on sorted suffices
Let M be the number of characters in substring and N be the size of string.
Worst-case cost of substring search is?
O (M log N)
Querying on Sorted Suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Longest repeated substring:
For each consecutive pair in sorted suffices
Compute the size of longest common prefix (LCP) among the pair
Maintain the one with the maximum size
Scan the LCP for the maximum
Complexity:
Cost of building suffix array +
cost of building LCP array +
O(N)
0
1
1
4
0
3
Sorted Suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Space complexity of Sorted Suffixes:
O(N2)
Can we do better?
Yes! Suffix Array reduces it to O(N) without losing effectiveness
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
index
String
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
1
2
3
4
5
6
7
8
9
10
11
12
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Sort
Suffix Array:
Only stores IDs of suffixes. The sorted suffices are shown just for illustration
Suffix ID
Practice
FIT2004: Lec-6: Retrieval Data Structures for Strings
What will be the suffix array of ABAB$?
Quiz time!
https://flux.qa - RFIBMB
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Substring Search:
Do a binary search
Example:
Search “IPP” in the string.
Initially, the search space is whole Suffix Array
1
2
3
4
5
6
7
8
9
10
11
12
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Substring Search:
Do a binary search
Example:
Search “IPP” in the string.
Initially, the search space is whole Suffix Array
Look at the middle element in range, i.e.,
At index 6 in Suffix array
This is Suffix #1 (“MISSISSIPPI$”)
1
2
3
4
5
6
7
8
9
10
11
12
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Substring Search:
Do a binary search
Example:
Search “IPP” in the string.
Initially, the search space is whole Suffix Array
Look at the middle element in range, i.e.,
At index 6 in Suffix array
This is Suffix #1 (“MISSISSIPPI$”)
Since “IPP” < “MISSISSIPPI”, substring if present must be above this element
Look at the middle element in range, i.e.,
At index 3 in Suffix array
This is suffix with ID 8 (“IPPI$”)
1
2
3
4
5
6
7
8
9
10
11
12
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Substring Search:
Do a binary search
Example:
Search “IPP” in the string.
Initially, the search space is whole Suffix Array
Look at the middle element in range, i.e.,
At index 6 in Suffix array
This is Suffix #1 (“MISSISSIPPI$”)
Since “IPP” < “MISSISSIPPI”, substring if present must be above this element
Look at the middle element in range, i.e.,
At index 3 in Suffix array
This is suffix with ID 8 (“IPPI$”)
Found!!!
Time Complexity:
O(M log N)
Can be improved to O(M) using LCP array (beyond the scope of this unit)
1
2
3
4
5
6
7
8
9
10
11
12
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Longest Repeated Substring:
Algorithm is same as on “Sorted Suffixes”
Time complexity is also the same
Time Complexity:
O(N2)
Can be improved to O(N) using LCP array (beyond the scope of this unit)
1
2
3
4
5
6
7
8
9
10
11
12
Construction Cost of Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
We need to generate N suffixes and then sort all N suffixes.
Time Complexity (with Merge Sort):
O(N log N) comparisons
Each comparison takes O(N)
Total cost: O(N2 log N)
Time Complexity (with Radix Sort):
O(N) passes
Each pass takes O(N)
Total cost: O(N2)
Space required during construction:
O(N2) – we need all suffixes during sorting
Can we do better?
Yes, using prefix doubling approach
O(N log2 N) time complexity
O(N) space required during construction
1
2
3
4
5
6
7
8
9
10
11
12
Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost
FIT2004: Lec-6: Retrieval Data Structures for Strings
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
Basic Idea:
Generate suffixes
Sort suffixes on their 1st characters
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
ID
Rank
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 2
2 5
2 8
2 11
6 1
7 9
7 10
9 3
9 4
9 6
9 7
Basic Idea:
Generate suffixes
Sort suffixes on first 1 characters
$
I S S I S S I P P I $
I S S I P P I $
I P P I $
I $
M I S S I S S I P P I $
P P I $
P I $
S S I S S I P P I $
S I S S I P P I $
S S I P P I $
S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
11 3
11 6
Basic Idea:
Generate suffixes
Sort suffixes on first 1 characters
Sort suffixes on first 2 characters
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Basic Idea:
Generate suffixes
Sort suffixes on first 1 characters
Sort suffixes on first 2 characters
Sort suffixes on first 4 characters
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Basic Idea:
Generate suffixes
Sort suffixes on their 1st characters
Sort suffixes on first 2 characters
Sort suffixes on first 4 characters
…
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Basic Idea:
Generate suffixes
Sort suffixes on their 1st characters
Sort suffixes on first 2 characters
Sort suffixes on first 4 characters
…
Sort suffixes on all N characters
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Basic Idea:
Generate suffixes
Sort suffixes on their 1st characters
Sort suffixes on first 2 characters
Sort suffixes on first 4 characters
…
Sort suffixes on all N characters
The last sort alone takes NlogN with a comparison cost of N
This is N2logN
We need to speed up the comparisons
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Basic Idea:
Generate suffixes
Sort suffixes on their 1st characters
Sort suffixes on first 2 characters
Sort suffixes on first 4 characters
…
Sort suffixes on all N characters
Suppose we could compare in O(1)
logN sorts
O(NlogN) for each
O(NlogNlogN) = O(Nlog2N)
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Constructing Suffix Array: Prefix Doubling
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
10 3
10 6
Comparing suffixes in O(1):
Suppose already sorted on first k characters (2 in this example)
Now sorting on 2k characters (4 in this example)
Observation 1:
If current ranks are different, suffix with smaller rank is smaller (because its first k characters are smaller)
E.g., PPI$ < SSIP
Note comparison cost is O(1)
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
10 3
10 6
Observation 2:
If current ranks are the same
First k characters must be the same
The tie is to be broken on the next k characters, e.g.,
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
10 3
10 6
Observation 2:
If current ranks are the same
First k characters must be the same
The tie is to be broken on the next k characters, e.g.,
We need to compare “SSIPPI$” and “PPI$” on the first 2 characters
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 4
8 7
9 3
9 6
Observation 2:
If current ranks are the same
First k characters must be the same
The tie is to be broken on the next k characters, e.g.,
We need to compare “SSIPPI$” and “PPI$” on the first 2 characters
SSIPPI$ and PPI$ are suffixes and are already ranked on first 2 characters
E.g., PPI$ < SSIPPI$ because its rank is smaller
Therefore, suffix #7< suffix #4
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 7
9 4
10 6
11 3
BUT WAIT!
How did we do that quickly? Surely looking up the “second half” suffixes is O(N)?
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Practice
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Suppose we are comparing suffix with ID 2 and 5:
We need to compare SSIPPI$ and PPI$
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Practice
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Suppose we are comparing suffix with ID 2 and 5:
We need to compare SSIPPI$ and PPI$
How do we find their ranks quickly?
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Practice
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Suppose we are comparing suffix with ID 2 and 5:
We need to compare SSIPPI$ and PPI$
How do we find their ranks quickly?
We want the ranks of suffixes:
2+k and 5+k
I.e. suffixes 6 and 9
This means we can calculate the IDs of the suffixes we want in O(1)
Now we need to get from IDs to ranks in O(1)
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Practice
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 7
10 4
11 6
12 3
Suppose we are comparing suffix with ID 2 and 5:
We need to compare SSIPPI$ and PPI$
How do we find their ranks quickly?
We want the ranks of suffixes:
2+k and 5+k
I.e. suffixes 6 and 9
To have O(1) access to their ranks, we need an array indexed by ID which contains the ranks!
In other words, the way the ranks are arranged on this slide is useless
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
Rank
Practice
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
6 4 12 10 4 11 9 3 8 7 2 1
1 2 3 4 5 6 7 8 9 10 11 12
Rank
Index/ID
String
M I S S I S S I P P I $
Note: The greyed out oldRank array has been left for reference, but does not exist in implementation
If we want the rank of ID i, look at Rank[i]
Going back to our example…
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 7
9 4
10 6
11 3
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
oldRank
90
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 2 3 4 5 6 7 8 9 10 11 12
Rank
Index/ID
String
M I S S I S S I P P I $
Note: The greyed out oldRank array has been left for reference, but does not exist in implementation
If we want the rank of ID i, look at Rank[i]
Going back to our example…
We wanted to find the second parts of suffixes 2 and 5
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 7
9 4
10 6
11 3
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
oldRank
6 4 12 10 4 11 9 3 8 7 2 1
91
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 2 3 4 5 6 7 8 9 10 11 12
Rank
Index/ID
String
M I S S I S S I P P I $
Note: The greyed out oldRank array has been left for reference, but does not exist in implementation
If we want the rank of ID i, look at Rank[i]
Going back to our example…
We wanted to find the second parts of suffixes 2 and 5
I.e. ID 6 and 9
Rank[9] < rank[6]
So ID 5 should come before ID 2 in the suffix array
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 7
9 4
10 6
11 3
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
ID
oldRank
6 4 12 10 4 11 9 3 8 7 2 1
92
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M -
I -
S -
S -
I -
S -
S -
I -
P -
P -
I -
$ -
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
SA
Rank
REMEMBER:
This rank array does not exist
It contains the same values as the other rank array, its just to make the algorithm easier to follow
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M -
I -
S -
S -
I -
S -
S -
I -
P -
P -
I -
$ -
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
SA
Rank
Rank the first characters of each suffix
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
3 1
2 2
5 3
5 4
2 5
5 6
5 7
2 8
4 9
4 10
2 11
1 12
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
SA
Rank
Rank the first characters of each suffix
In practice we would do this using ord(), but since ranks are only comparative, the actual values don’t matter, just their order. So we use numbers starting at 1
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
3 1
2 2
5 3
5 4
2 5
5 6
5 7
2 8
4 9
4 10
2 11
1 12
M I S S I S S I P P I $
I S S I S S I P P I $
S S I S S I P P I $
S I S S I P P I $
I S S I P P I $
S S I P P I $
S I P P I $
I P P I $
P P I $
P I $
I $
$
SA
Rank
Sort SA by ranks
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
1 12
2 2
2 5
2 8
2 11
3 1
4 9
4 10
5 3
5 4
5 6
5 7
$
I S S I S S I P P I $
I S S I P P I $
I P P I $
I $
M I S S I S S I P P I $
P P I $
P I $
S S I S S I P P I $
S I S S I P P I $
S S I P P I $
S I P P I $
SA
Rank
Sort SA by ranks
Note that this does not change the rank array, since IDs have kept the same ranks
We just rearranged the SA
Lets do an example (by hand in lecture)
FIT2004: Lec-6: Retrieval Data Structures for Strings
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
1 12
2 2
2 5
2 8
2 11
3 1
4 9
4 10
5 3
5 4
5 6
5 7
$
I S S I S S I P P I $
I S S I P P I $
I P P I $
I $
M I S S I S S I P P I $
P P I $
P I $
S S I S S I P P I $
S I S S I P P I $
S S I P P I $
S I P P I $
SA
Rank
Now sort on first 2 characters
For each comparison, we use the trick outlined in the previous section
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Now we update the ranks
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Make an array, “Temp” to hold the new ranks
1
1
1
1
1
1
1
1
1
1
1
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
For each pair of adjacent suffixes, compare them
1
1
1
1
1
1
1
1
1
1
1
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
If they have different ranks already, then the second suffix is certainly larger
1
1
1
1
1
1
1
1
1
1
1
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Set Temp[11] to Rank[12]+1
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
If they have the same rank
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
We need to use the O(1) trick
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
First characters have the same rank, so compare the suffixes which start with next characters
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
$ vs PPI$ (ID 11+1 and 8+1)
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
12 has lower rank than 9, so 11 should have lower rank than 8
1
1
1
1
1
1
1
1
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Set Rank[8] = Rank[11] + 1
1
1
1
1
1
1
1
3
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
1
4
1
1
1
1
1
3
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
1
4
1
1
4
1
1
3
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
1
1
4
1
1
3
1
1
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
1
1
4
1
1
3
1
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
1
1
4
1
1
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
1
8
4
1
1
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
1
8
4
1
8
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
9
8
4
1
8
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Continue in this way
5
4
9
8
4
9
8
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
2 8
2 2
2 5
3 1
4 10
4 9
5 4
5 7
5 3
5 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 3
I 2
S 5
S 5
I 2
S 5
S 5
I 2
P 4
P 4
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
This is our new Rank array, so overwrite the old one
5
4
9
8
4
9
8
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 4
8 7
9 3
9 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 5
I 4
S 9
S 8
I 4
S 9
S 8
I 3
P 7
P 6
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
This is our new Rank array, so overwrite the old one
5
4
9
8
4
9
8
3
7
6
2
1
Temp
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 4
8 7
9 3
9 6
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
Lets do an example (by hand in lecture)
SA
Rank
M 5
I 4
S 9
S 8
I 4
S 9
S 8
I 3
P 7
P 6
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
Now we have the suffixes sorted by the first 2. Go for 4!
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
3 8
4 2
4 5
5 1
6 10
7 9
8 7
9 4
10 6
11 3
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Lets do an example (by hand in lecture)
ID
Rank
M 5
I 4
S 11
S 9
I 4
S 10
S 8
I 3
P 7
P 6
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Lets do an example (by hand in lecture)
ID
Rank
M 6
I 5
S 12
S 10
I 4
S 11
S 9
I 3
P 8
P 7
I 2
$ 1
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
FIT2004: Lec-6: Retrieval Data Structures for Strings
1 12
2 11
3 8
4 5
5 2
6 1
7 10
8 9
9 7
10 4
11 6
12 3
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
Lets do an example (by hand in lecture)
ID
Rank
1
2
3
4
5
6
7
8
9
10
11
12
ID
String
Rank
M 6
I 5
S 12
S 10
I 4
S 11
S 9
I 3
P 8
P 7
I 2
$ 1
Summary
FIT2004: Lec-6: Retrieval Data Structures for Strings
Take home message
Tries, Suffix trees and Suffix array provide efficient text search and pattern matching (typically linear in number of characters in string)
Things to do (this list is not exhaustive)
Implement Trie, Suffix trees and Suffix array and run various pattern matching queries
Coming Up Next
Burrows-Wheeler Transform - A beautiful space-time efficient pattern matching algorithm on text
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suffix Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
We saw that a Trie allows prefix matching in O(M) where M is the length of the prefix, e.g., checking whether “ref” is a prefix of “refer”.
What about substring matching? Can we use the Trie to check if “ef” is a substring of “refer” in O(M).
No! because “ef” is not a prefix of “refer”. Using the Trie of “refer”, we can only check whether “r”, “re”, “ref”, “refe”, and “refer” are the substrings.
Idea:
What about if we add “efer” in the Trie? It will allow us efficiently checking whether “e”, “ef”, “efe”, and “efer” are also in the string.
What about if we also add “fer”. This will allow us checking whether “f”, “fe”, and “fer” are in the string or not.
In short, if we add all suffixes of the string refer (i.e., refer, efer, fer, er, r) in the Trie, we can efficiently search every substring of refer in the Trie.
r
e
f
e
$
r
e
f
e
$
r
f
e
$
r
$
r
$
Prefix for a string s[1…M] is a string s[1…X] where X≤M. (e.g., refe is a prefix of refer.)
Suffix for a string s[1…M] is a string s[X…M]. E.g., fer is a suffix of refer.
Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Suffix Array:
Only stores IDs of suffixes. The sorted suffices are shown just for illustration
But if suffixes are not stored, how do we retrieve the suffix while comparing?
Easy to get it using suffix ID and original string, i.e., Suffix = String[ID:]
Suffix Array Space Complexity:
O(N)
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
11 3
11 6
Another example:
Suppose we are comparing Suffix with ID 3 and Suffix with ID 6.
First k characters must be the same
The tie is to be broken on the next k characters, e.g.,
We need to compare “ISSIPPI$” and “IPPI$” on the first 2 characters
ISSIPPI$ and IPPI$ are suffixes and are already ranked on first 2 characters
E.g., IPPI$ < ISSIPPI$ because its rank is smaller
Therefore, SSIPPI$ < SSISSIPPI$
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
M I S S I S S I P P I $
1 2 3 4 5 6 7 8 9 10 11 12
String
1 12
2 11
3 8
4 2
4 5
6 1
7 10
8 9
9 4
9 7
11 3
11 6
Yet another example:
Suppose we are comparing Suffix with ID 2 and Suffix with ID 5.
First k characters must be the same
The tie is to be broken on the next k characters, e.g.,
We need to compare “SISSIPPI$” and “SIPPI$” on the first 2 characters
SISSIPPI$ and SIPPI$ are suffixes and are already ranked on first 2 characters
E.g., SIPPI$ = SISSIPPI$ on first 2 characters
Therefore, ISSIPPI$ = ISSISSIPPI$ on first 4 characters
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
130
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
6 4 11 9 4 11 9 3 8 7 2 1
1 2 3 4 5 6 7 8 9 10 11 12
Rank
12
11
8
2
5
1
10
9
4
7
3
6
Create a new array called Rank where
Index corresponds to ID of each suffix
At each index, the current rank of the suffix is recorded
This array can be used to get current rank of any suffix.
Consider Suffix with ID 2 “ISSISSIPPI$”:
Its current rank can be found at Rank[2]
How do we know the ID/rank of suffix “SISSIPPI$”?
Since it is 2 characters “off” from “ISSISSPPI$” (ID 2), its ID is 2+2=4 and rank is Rank[4] = 9
In general, a suffix k characters “off” from a suffix with ID x will have an ID x + k.
E.g., ID of suffix “IPPI$”?
It is 6 characters “off” from suffix with ID 2 so its ID is 2 + 6 = 8
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Rank
Index/ID
String
M I S S I S S I P P I $
1
2
3
4
4
6
7
8
9
9
11
11
z
z
131
FIT2004: Lec-6: Retrieval Data Structures for Strings
Given SSISSIPPI$ with ID 3, what is the ID of PPI$?
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
6 4 11 9 4 11 9 3 8 7 2 1
1 2 3 4 5 6 7 8 9 10 11 12
Rank
12
11
8
2
5
1
10
9
4
7
3
6
Note: We don’t need to store the suffixes (shown grey) – we only need Suffix IDs.
Comparing Suffix with IDs 4 and 7:
Their ranks are equal.
Rank[4] = Rank [7]
E.g., they are the same on first 2 characters
We need to compare them on the next 2 characters
Compare ranks of suffixes 4+2=6 and 7+2=9
Rank[6] > Rank[9]
So, Suffix #7 is smaller than #4
Note: Comparison takes O(1) and we do not need to store all suffixes – space used is O(N)
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $
ID
Index/ID
String
M I S S I S S I P P I $
z
z
133
$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
6 4 12 10 4 11 9 3 8 7 2 1
1 2 3 4 5 6 7 8 9 10 11 12
Rank
12
11
8
2
5
1
10
9
7
4
6
3
Note: We don’t need to store the suffixes (shown grey) – we only need Suffix IDs.
Suppose array has been sorted on first 4 characters.
Comparing Suffix with IDs 2 and 5:
Their ranks are equal.
Rank[2] = Rank [5]
E.g., they are the same on first 4 characters
We need to compare them on the next 4 characters
Should we instead compare them on all remaining characters???
No, array is sorted on first 4 only
Compare ranks of suffixes 2+4=6 and 5+4=9
Rank[6] > Rank[9]
So, Suffix #5 is smaller than #2
Note: Each comparison takes O(1) and we do not need to store all suffixes – space used is O(N)
ID
Index/ID
String
M I S S I S S I P P I $
z
z
134
Construction Cost of Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $
12
11
8
5
2
1
10
9
7
4
6
3
Time Complexity (prefix doubling):
We need to sort O(log N) times
Sort on 1 characters
Sort on 2 characters
…
Sort on N/2 characters
Sort on N characters
Each sorting requires O(N log N) comparisons
Each comparison takes O(1)
Total cost: O(N log2 N)
Space Complexity:
O(N)
6 5 12 10 4 11 9 3 8 7 2 1
1 2 3 4 5 6 7 8 9 10 11 12
Rank
Index/ID
M I S S I S S I P P I $
String