程序代写代做代考 algorithm DNA ER data structure PowerPoint Presentation

PowerPoint Presentation

Faculty of Information Technology,
Monash University

FIT2004: Algorithms and Data Structures
Week 6: Retrieval Data Structures for Strings

These slides are prepared by M. A. Cheema and are based on the material developed by Arun Konagurthu and Lloyd Allison.

Recommended readings
Unit notes (Chapters 9&10)
Cormen et al. “Introduction to Algorithms” (Chapter 18)
For a more advanced treatment of Trie and suffix trees: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press. (Chapter 5) – Book available in the library!

(FIT2004
FIT2004: Lec-6: Retrieval Data Structures for Strings

Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost

FIT2004: Lec-6: Retrieval Data Structures for Strings

Introduction
FIT2004: Lec-6: Retrieval Data Structures for Strings
Suppose you have a large text containing N strings. You want to pre-process it such that searching on this text is efficient.
Sorting based approach:
Pre-processing: Sort the strings
Searching: Binary search to find
Let M be the average length of strings (M can be quite large, e.g., for DNA sequences). Comparison between two strings takes O(M).

Time complexity:
Pre-processing  O(MN log N) using merge sort or O(MN) using radix sort
Searching  O(M log N)
Can we do better?
Yes! ReTrieval data structures allow answering different string queries efficiently

Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost

FIT2004: Lec-6: Retrieval Data Structures for Strings

Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
ReTRIEval tree = Trie
Often pronounced as ‘Try’.
Trie is an N-way (or multi-way) tree, where N is the size of the alphabet
E.g., N=2 for binary
N = 26 for English letters
N = 4 for DNA
In a standard Trie, all words with the shared prefix fall within the same subtree/subtrie
In fact, it is the shortest possible tree that can be constructed such that all prefixes fall within the same subtree.

Trie Example: Insertion
FIT2004: Lec-6: Retrieval Data Structures for Strings
Let’s look at an example : a Trie that stores baby, bad, bank, box, dog, dogs, banks.
We will use $ to denote the end of a string.
Inserting a string in a Trie:
Start from the root node
For each character c in the string
If a node containing c exists
Move to the node
Else
Create the node
Move to it

Alternative Illustration
FIT2004: Lec-6: Retrieval Data Structures for Strings
Traditionally, characters are shown on edges instead of nodes. However, these are just two different ways to illustrate.
We show characters on nodes because I find them clearer in lecture slides especially for dense examples later in the lecture (e.g., in Suffix Trie).
In exams, you can use any of the two illustrations.

a
y
b
$
d
$
n
k
$
s
o
x
$
o
d
g
$
$
$
s

Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost

FIT2004: Lec-6: Retrieval Data Structures for Strings

Trie Example: Search
FIT2004: Lec-6: Retrieval Data Structures for Strings
Searching a string:
Start from the root node
For each character c in the string (including $)
If a node containing c exists
Move to the node
If c == $
Return “found”
Else
Return “not found”

$
Search box

$
Search boxing

$
Output for searching ban ?
Time Complexity?:
For loop runs O(M) times.
Time to check if a node containing c exists?
Depends on implementation, and on whether alphabet size is constant

Trie Example: Prefix Matching
FIT2004: Lec-6: Retrieval Data Structures for Strings
Prefix matching returns every string in text that has the given string as its prefix.
E.g., Autocompletion. Return all strings that start
with “ban”
Prefix matching:
Start from the root node
For each character c in the prefix
If a node containing c exists
Move to the node
Else
Return “not found”
Return all strings in the
subtree rooted at the last node

$
Prefix matching for ban

Trie Example: Prefix Matching
FIT2004: Lec-6: Retrieval Data Structures for Strings
Prefix matching returns every string in text that has the given string as its prefix.
E.g., Autocompletion. Return all strings that start
with “b”
Prefix matching:
Start from the root node
For each character c in the prefix
If a node containing c exists
Move to the node
Else
Return “not found”
Return all strings in the
subtree rooted at the last node

$
Prefix matching for b

Implementing a Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Implementation using an array:
At each node, create an array of alphabets size (e.g., 26 for English letters, 4 for DNA strings)
If i-th node exists, add pointer to it at array[i]
Otherwise, array[i] = Nil.
The above implementation allows checking whether a node exists or not in O(1).
Other implementations are possible (e.g., using linked lists or hash tables).

Example: Implementing a Trie using arrays (assuming only three letters A,B,C)
FIT2004: Lec-6: Retrieval Data Structures for Strings

1 2 3 4

A B C $

Insert BA$

1 2 3 4

A B C $

1 2 3 4

A B C $

$
Insert BC$

1 2 3 4

A B C $

Advantages and Disadvantages of Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
Advantages
A better search structure than a binary search tree with string keys.
A more versatile search structure than hash table
Allows lookup on prefix matching in O(M)-time where M is the length of prefix.
Allows sorting collection of strings in O(MN) time where MN is the total number of characters in all strings
Disadvantages
On average Tries can be slower (in some cases) than hash tables for looking up patterns/queries.
Wastes space, since even when a node has few children, you need to create an array of size alphabet

Some properties of Trie
FIT2004: Lec-6: Retrieval Data Structures for Strings
The maximum depth is the length of longest string in the collection.
Insertion, Deletion, Lookup operations take time proportional to the length of the string/pattern being inserted, deleted, or searched.
But we waste a lot of space if
each node has 1 pointer per symbol in the alphabet.
deeper nodes typically have mostly null pointers.
Can reduce total space usage by turning each node into a linked list or binary search tree etc, trading off time for space.

Radix/PATRICIA Tree (NOT EXAMINABLE BUT WORTH MENTIONING)
FIT2004: Lec-6: Retrieval Data Structures for Strings
Radix/PATRICIA tree is a space-optimized/compact Trie data structure
Unlike regular tries, edges can be labeled with substrings of characters.
The nodes along a path having exactly one child are merged
This makes them much more efficient for sets of strings that share long prefixes or substrings.

Radix/PATRICIA Tree (NOT EXAMINABLE BUT WORTH MENTIONING)

oxe

$
FIT2004: Lec-6: Retrieval Data Structures for Strings

Outline
Introduction
Trie
Construction
Query Processing
Suffix Trie
Construction
Query Processing
Suffix Tree
Suffix Array
Introduction
Query Processing
Reducing Construction Cost

FIT2004: Lec-6: Retrieval Data Structures for Strings

Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
(Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words

A string s
1 2 3 . . . m-1 m

Prefixes and suffixes
FIT2004: Lec-6: Retrieval Data Structures for Strings
(Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words
A prefix of a word s[1..m] is some string s[1..i] where 1<=i<=m FIT2004: Lec-6: Retrieval Data Structures for Strings A string s 1 2 3 i . . . m-1 m Prefix of s Prefixes and suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings (Prefix) Tries are very useful for quickly looking up whole words, but more generally, prefixes of words A prefix of a word s[1..m] is s[1..i] where 1<=i<=m A suffix of a word s[1..m] is s[i..m] where 1<=i<=m A string s 1 2 3 . . . i m-1 m Suffix of s Prefixes and suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings Any substring of a word is a prefix of some suffix In other words, a substring of s[1..m] is s[i..j] A string s 1 2 3 i . . . j m-1 m Prefixes and suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings Any substring of a word is a prefix of some suffix In other words, a substring of s[1..m] is s[i..j] s[i..j] is a prefix of s[i..m] (which is a suffix of s[1..m]) To be able to efficiently search substrings… Just make a prefix trie of suffixes A string s 1 2 3 i . . . j m-1 m Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Consider some text, e.g., “refer”. A Trie constructed using all suffixes of the text is called a Suffix Trie Pick any substring, eg “efe” r e f e $ r e f e $ r f e $ r $ r $ Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Consider some text, e.g., “refer”. A Trie constructed using all suffixes of the text is called a Suffix Trie Pick any substring, eg “efe” Notice that it traces a path from root to some node r e f e $ r e f e $ r f e $ r $ r $ Constructing Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Suffix trie of referrer$ Insert all suffixes of referrer in the trie referrer$ eferrer$ ferrer$ errer$ rrer$ rer$ er$ r$ $ Many arrows are removed for better visualization r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Outline Introduction Trie Construction Query Processing Suffix Trie Construction Query Processing Suffix Tree Suffix Array Introduction Query Processing Reducing Construction Cost FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” Found! r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” Not found :( r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Substring search on Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Substring search for str Similar to prefix matching search “err” search “fers” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Time Complexity: O(M) where M is the length of substring $ Counting # of occurences of a substring FIT2004: Lec-6: Retrieval Data Structures for Strings r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Quiz time! https://flux.qa - RFIBMB $ Counting # of occurences of a substring FIT2004: Lec-6: Retrieval Data Structures for Strings Follow the path similar to prefix matching Count # of leaf nodes ($) in the subtree rooted at the last node E.g Count occurences of “er” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Time Complexity: Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie $ Counting # of occurences of a substring FIT2004: Lec-6: Retrieval Data Structures for Strings Follow the path similar to prefix matching Count # of leaf nodes ($) in the subtree rooted at the last node E.g Count occurences of “er” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Time Complexity: Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie $ Counting # of occurences of a substring FIT2004: Lec-6: Retrieval Data Structures for Strings Follow the path similar to prefix matching Count # of leaf nodes ($) in the subtree rooted at the last node E.g Count occurences of “er” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Time Complexity: Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie $ Counting # of occurences of a substring FIT2004: Lec-6: Retrieval Data Structures for Strings Follow the path similar to prefix matching Count # of leaf nodes ($) in the subtree rooted at the last node E.g Count occurences of “er” r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Time Complexity: Can be done in O(M) if number of leaf nodes is maintained during construction of suffix trie $ Finding longest repeated substring FIT2004: Lec-6: Retrieval Data Structures for Strings r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ Quiz time! https://flux.qa - RFIBMB $ Finding longest repeated substring FIT2004: Lec-6: Retrieval Data Structures for Strings Find the deepest node in the tree with at least two children r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Space complexity of suffix trie FIT2004: Lec-6: Retrieval Data Structures for Strings Let N be the size of the string for which suffix trie is constructed Space complexity?: # number of suffixes: O(N) Cost for each suffix is linear to its size Total space cost: O(N2) r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Outline Introduction Trie Construction Query Processing Suffix Trie Construction Query Processing Suffix Tree Suffix Array Introduction Query Processing Reducing Construction Cost FIT2004: Lec-6: Retrieval Data Structures for Strings Suffix Tree is a compact Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings Compress branches by merging the nodes that have only one child r e f $ e r r e r f $ e r r e r e f $ e r r e r $ r r e r $ r e r $ r $ $ $ Suffix Tree FIT2004: Lec-6: Retrieval Data Structures for Strings Compress branches by merging the nodes that have only one child But the total complexity is still the same as the same number of letters are stored r e ferrer$ ferrer$ e ferrer$ r rer$ rer$ r$ $ $ Quiz time! https://flux.qa - RFIBMB $ Space complexity of suffix tree FIT2004: Lec-6: Retrieval Data Structures for Strings Compress branches by merging the nodes that have only one child But the total complexity is still the same as the same number of letters are stored Replace every substring with numbers (x,y) where x is the starting index of the substring and y is its length e.g., ferrer is represented as (3,6) rer is represented as (6,3) r e ferrer$ ferrer$ e ferrer$ r rer$ rer$ r$ $ $ r e f e r r e r $ 1 2 3 4 5 6 7 8 9 $ Space complexity of suffix tree FIT2004: Lec-6: Retrieval Data Structures for Strings Compress branches by merging the nodes that have only one child But the total complexity is still the same as the same number of letters are stored Replace every substring with numbers (x,y) where x is the starting index of the substring and y is its length e.g., ferrer$ is represented as (3,7) rer$ is represented as (6,4) (1,1) (2,1) (3,7) (2,1) (5,1) (6,4) (6,4) (8,2) (9,1) (9,1) r e f e r r e r $ 1 2 3 4 5 6 7 8 9 (3,7) (3,7) (9,1) Space complexity of suffix tree Every (internal) node has at least 2 children There are exactly n leaves There are at most n/2 nodes in the layer above the leaves There are at most n/4 notes in the layer above that … The total number of nodes is bounded by N + n/2 + n/4 + … + 1 < 2n (1,1) (2,1) (3,7) (2,1) (5,1) (6,4) (6,4) (8,2) (9,1) (9,1) (3,7) (3,7) r e f e r r e r $ 1 2 3 4 5 6 7 8 9 (9,1) Time complexity of constructing suffix tree FIT2004: Lec-6: Retrieval Data Structures for Strings The algorithm described earlier inserts O(N) suffixes Insertion cost of each suffix is linear in the size of suffix Average suffix size is O(N) Thus, total time complexity is O(N2) It is possible to construct suffix tree in O(N) Esko Ukkonen in 1995 gave a beautiful (but involved) algorithm to construct a Suffix Tree in linear time. If you every get interested in doing this in linear time, consider reading the source: Ukkonen, E. (1995). "On-line construction of sux trees". Algorithmica 14 (3): 249260. Outline Introduction Trie Construction Query Processing Suffix Trie Construction Query Processing Suffix Tree Suffix Array Introduction Query Processing Reducing Construction Cost FIT2004: Lec-6: Retrieval Data Structures for Strings Sorted Suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ String M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ 1 2 3 4 5 6 7 8 9 10 11 12 $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Sort Querying on Sorted Suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Substring search: Is “IPP” in the String? Binary search on sorted suffices Let M be the number of characters in substring and N be the size of string. Worst-case cost of substring search is? O (M log N) Querying on Sorted Suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Longest repeated substring: For each consecutive pair in sorted suffices Compute the size of longest common prefix (LCP) among the pair Maintain the one with the maximum size Scan the LCP for the maximum Complexity: Cost of building suffix array + cost of building LCP array + O(N) 0 1 1 4 0 3 Sorted Suffixes FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Space complexity of Sorted Suffixes: O(N2) Can we do better? Yes! Suffix Array reduces it to O(N) without losing effectiveness Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 index String M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ 1 2 3 4 5 6 7 8 9 10 11 12 $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Sort Suffix Array: Only stores IDs of suffixes. The sorted suffices are shown just for illustration Suffix ID Practice FIT2004: Lec-6: Retrieval Data Structures for Strings What will be the suffix array of ABAB$? Quiz time! https://flux.qa - RFIBMB Outline Introduction Trie Construction Query Processing Suffix Trie Construction Query Processing Suffix Tree Suffix Array Introduction Query Processing Reducing Construction Cost FIT2004: Lec-6: Retrieval Data Structures for Strings Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Substring Search: Do a binary search Example: Search “IPP” in the string. Initially, the search space is whole Suffix Array 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Substring Search: Do a binary search Example: Search “IPP” in the string. Initially, the search space is whole Suffix Array Look at the middle element in range, i.e., At index 6 in Suffix array This is Suffix #1 (“MISSISSIPPI$”) 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Substring Search: Do a binary search Example: Search “IPP” in the string. Initially, the search space is whole Suffix Array Look at the middle element in range, i.e., At index 6 in Suffix array This is Suffix #1 (“MISSISSIPPI$”) Since “IPP” < “MISSISSIPPI”, substring if present must be above this element Look at the middle element in range, i.e., At index 3 in Suffix array This is suffix with ID 8 (“IPPI$”) 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Substring Search: Do a binary search Example: Search “IPP” in the string. Initially, the search space is whole Suffix Array Look at the middle element in range, i.e., At index 6 in Suffix array This is Suffix #1 (“MISSISSIPPI$”) Since “IPP” < “MISSISSIPPI”, substring if present must be above this element Look at the middle element in range, i.e., At index 3 in Suffix array This is suffix with ID 8 (“IPPI$”) Found!!! Time Complexity: O(M log N) Can be improved to O(M) using LCP array (beyond the scope of this unit) 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Longest Repeated Substring: Algorithm is same as on “Sorted Suffixes” Time complexity is also the same Time Complexity: O(N2) Can be improved to O(N) using LCP array (beyond the scope of this unit) 1 2 3 4 5 6 7 8 9 10 11 12 Construction Cost of Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 We need to generate N suffixes and then sort all N suffixes. Time Complexity (with Merge Sort): O(N log N) comparisons Each comparison takes O(N) Total cost: O(N2 log N) Time Complexity (with Radix Sort): O(N) passes Each pass takes O(N) Total cost: O(N2) Space required during construction: O(N2) – we need all suffixes during sorting Can we do better? Yes, using prefix doubling approach O(N log2 N) time complexity O(N) space required during construction 1 2 3 4 5 6 7 8 9 10 11 12 Outline Introduction Trie Construction Query Processing Suffix Trie Construction Query Processing Suffix Tree Suffix Array Introduction Query Processing Reducing Construction Cost FIT2004: Lec-6: Retrieval Data Structures for Strings Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 - 12 Basic Idea: Generate suffixes Sort suffixes on their 1st characters M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ ID Rank FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 2 2 5 2 8 2 11 6 1 7 9 7 10 9 3 9 4 9 6 9 7 Basic Idea: Generate suffixes Sort suffixes on first 1 characters $ I S S I S S I P P I $ I S S I P P I $ I P P I $ I $ M I S S I S S I P P I $ P P I $ P I $ S S I S S I P P I $ S I S S I P P I $ S S I P P I $ S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 11 3 11 6 Basic Idea: Generate suffixes Sort suffixes on first 1 characters Sort suffixes on first 2 characters $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Basic Idea: Generate suffixes Sort suffixes on first 1 characters Sort suffixes on first 2 characters Sort suffixes on first 4 characters $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Basic Idea: Generate suffixes Sort suffixes on their 1st characters Sort suffixes on first 2 characters Sort suffixes on first 4 characters … $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Basic Idea: Generate suffixes Sort suffixes on their 1st characters Sort suffixes on first 2 characters Sort suffixes on first 4 characters … Sort suffixes on all N characters $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Basic Idea: Generate suffixes Sort suffixes on their 1st characters Sort suffixes on first 2 characters Sort suffixes on first 4 characters … Sort suffixes on all N characters The last sort alone takes NlogN with a comparison cost of N This is N2logN We need to speed up the comparisons $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Basic Idea: Generate suffixes Sort suffixes on their 1st characters Sort suffixes on first 2 characters Sort suffixes on first 4 characters … Sort suffixes on all N characters Suppose we could compare in O(1) logN sorts O(NlogN) for each O(NlogNlogN) = O(Nlog2N) $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Constructing Suffix Array: Prefix Doubling O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 10 3 10 6 Comparing suffixes in O(1): Suppose already sorted on first k characters (2 in this example) Now sorting on 2k characters (4 in this example) Observation 1: If current ranks are different, suffix with smaller rank is smaller (because its first k characters are smaller) E.g., PPI$ < SSIP Note comparison cost is O(1) $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 10 3 10 6 Observation 2: If current ranks are the same First k characters must be the same The tie is to be broken on the next k characters, e.g., $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 10 3 10 6 Observation 2: If current ranks are the same First k characters must be the same The tie is to be broken on the next k characters, e.g., We need to compare “SSIPPI$” and “PPI$” on the first 2 characters $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 4 8 7 9 3 9 6 Observation 2: If current ranks are the same First k characters must be the same The tie is to be broken on the next k characters, e.g., We need to compare “SSIPPI$” and “PPI$” on the first 2 characters SSIPPI$ and PPI$ are suffixes and are already ranked on first 2 characters E.g., PPI$ < SSIPPI$ because its rank is smaller Therefore, suffix #7< suffix #4 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 7 9 4 10 6 11 3 BUT WAIT! How did we do that quickly? Surely looking up the “second half” suffixes is O(N)? $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Practice FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Suppose we are comparing suffix with ID 2 and 5: We need to compare SSIPPI$ and PPI$ $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Practice FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Suppose we are comparing suffix with ID 2 and 5: We need to compare SSIPPI$ and PPI$ How do we find their ranks quickly? $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Practice FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Suppose we are comparing suffix with ID 2 and 5: We need to compare SSIPPI$ and PPI$ How do we find their ranks quickly? We want the ranks of suffixes: 2+k and 5+k I.e. suffixes 6 and 9 This means we can calculate the IDs of the suffixes we want in O(1) Now we need to get from IDs to ranks in O(1) $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Practice FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 7 10 4 11 6 12 3 Suppose we are comparing suffix with ID 2 and 5: We need to compare SSIPPI$ and PPI$ How do we find their ranks quickly? We want the ranks of suffixes: 2+k and 5+k I.e. suffixes 6 and 9 To have O(1) access to their ranks, we need an array indexed by ID which contains the ranks! In other words, the way the ranks are arranged on this slide is useless $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID Rank Practice O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings 6 4 12 10 4 11 9 3 8 7 2 1 1 2 3 4 5 6 7 8 9 10 11 12 Rank Index/ID String M I S S I S S I P P I $ Note: The greyed out oldRank array has been left for reference, but does not exist in implementation If we want the rank of ID i, look at Rank[i] Going back to our example… 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 7 9 4 10 6 11 3 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID oldRank 90 O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings 1 2 3 4 5 6 7 8 9 10 11 12 Rank Index/ID String M I S S I S S I P P I $ Note: The greyed out oldRank array has been left for reference, but does not exist in implementation If we want the rank of ID i, look at Rank[i] Going back to our example… We wanted to find the second parts of suffixes 2 and 5 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 7 9 4 10 6 11 3 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID oldRank 6 4 12 10 4 11 9 3 8 7 2 1 91 O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings 1 2 3 4 5 6 7 8 9 10 11 12 Rank Index/ID String M I S S I S S I P P I $ Note: The greyed out oldRank array has been left for reference, but does not exist in implementation If we want the rank of ID i, look at Rank[i] Going back to our example… We wanted to find the second parts of suffixes 2 and 5 I.e. ID 6 and 9 Rank[9] < rank[6] So ID 5 should come before ID 2 in the suffix array 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 7 9 4 10 6 11 3 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ ID oldRank 6 4 12 10 4 11 9 3 8 7 2 1 92 Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M - I - S - S - I - S - S - I - P - P - I - $ - 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 - 12 M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ SA Rank REMEMBER: This rank array does not exist It contains the same values as the other rank array, its just to make the algorithm easier to follow Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M - I - S - S - I - S - S - I - P - P - I - $ - 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank - 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10 - 11 - 12 M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ SA Rank Rank the first characters of each suffix Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank 3 1 2 2 5 3 5 4 2 5 5 6 5 7 2 8 4 9 4 10 2 11 1 12 M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ SA Rank Rank the first characters of each suffix In practice we would do this using ord(), but since ranks are only comparative, the actual values don’t matter, just their order. So we use numbers starting at 1 Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank 3 1 2 2 5 3 5 4 2 5 5 6 5 7 2 8 4 9 4 10 2 11 1 12 M I S S I S S I P P I $ I S S I S S I P P I $ S S I S S I P P I $ S I S S I P P I $ I S S I P P I $ S S I P P I $ S I P P I $ I P P I $ P P I $ P I $ I $ $ SA Rank Sort SA by ranks Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank 1 12 2 2 2 5 2 8 2 11 3 1 4 9 4 10 5 3 5 4 5 6 5 7 $ I S S I S S I P P I $ I S S I P P I $ I P P I $ I $ M I S S I S S I P P I $ P P I $ P I $ S S I S S I P P I $ S I S S I P P I $ S S I P P I $ S I P P I $ SA Rank Sort SA by ranks Note that this does not change the rank array, since IDs have kept the same ranks We just rearranged the SA Lets do an example (by hand in lecture) FIT2004: Lec-6: Retrieval Data Structures for Strings M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank 1 12 2 2 2 5 2 8 2 11 3 1 4 9 4 10 5 3 5 4 5 6 5 7 $ I S S I S S I P P I $ I S S I P P I $ I P P I $ I $ M I S S I S S I P P I $ P P I $ P I $ S S I S S I P P I $ S I S S I P P I $ S S I P P I $ S I P P I $ SA Rank Now sort on first 2 characters For each comparison, we use the trick outlined in the previous section FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Now we update the ranks FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Make an array, “Temp” to hold the new ranks 1 1 1 1 1 1 1 1 1 1 1 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank For each pair of adjacent suffixes, compare them 1 1 1 1 1 1 1 1 1 1 1 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank If they have different ranks already, then the second suffix is certainly larger 1 1 1 1 1 1 1 1 1 1 1 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Set Temp[11] to Rank[12]+1 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank If they have the same rank 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank We need to use the O(1) trick 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank First characters have the same rank, so compare the suffixes which start with next characters 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank $ vs PPI$ (ID 11+1 and 8+1) 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank 12 has lower rank than 9, so 11 should have lower rank than 8 1 1 1 1 1 1 1 1 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Set Rank[8] = Rank[11] + 1 1 1 1 1 1 1 1 3 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 1 4 1 1 1 1 1 3 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 1 4 1 1 4 1 1 3 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 1 1 4 1 1 3 1 1 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 1 1 4 1 1 3 1 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 1 1 4 1 1 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 1 8 4 1 1 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 1 8 4 1 8 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 9 8 4 1 8 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Continue in this way 5 4 9 8 4 9 8 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 2 8 2 2 2 5 3 1 4 10 4 9 5 4 5 7 5 3 5 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 3 I 2 S 5 S 5 I 2 S 5 S 5 I 2 P 4 P 4 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank This is our new Rank array, so overwrite the old one 5 4 9 8 4 9 8 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 4 8 7 9 3 9 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 5 I 4 S 9 S 8 I 4 S 9 S 8 I 3 P 7 P 6 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank This is our new Rank array, so overwrite the old one 5 4 9 8 4 9 8 3 7 6 2 1 Temp FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 4 8 7 9 3 9 6 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ Lets do an example (by hand in lecture) SA Rank M 5 I 4 S 9 S 8 I 4 S 9 S 8 I 3 P 7 P 6 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank Now we have the suffixes sorted by the first 2. Go for 4! FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 3 8 4 2 4 5 5 1 6 10 7 9 8 7 9 4 10 6 11 3 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Lets do an example (by hand in lecture) ID Rank M 5 I 4 S 11 S 9 I 4 S 10 S 8 I 3 P 7 P 6 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Lets do an example (by hand in lecture) ID Rank M 6 I 5 S 12 S 10 I 4 S 11 S 9 I 3 P 8 P 7 I 2 $ 1 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank FIT2004: Lec-6: Retrieval Data Structures for Strings 1 12 2 11 3 8 4 5 5 2 6 1 7 10 8 9 9 7 10 4 11 6 12 3 $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ Lets do an example (by hand in lecture) ID Rank 1 2 3 4 5 6 7 8 9 10 11 12 ID String Rank M 6 I 5 S 12 S 10 I 4 S 11 S 9 I 3 P 8 P 7 I 2 $ 1 Summary FIT2004: Lec-6: Retrieval Data Structures for Strings Take home message Tries, Suffix trees and Suffix array provide efficient text search and pattern matching (typically linear in number of characters in string) Things to do (this list is not exhaustive) Implement Trie, Suffix trees and Suffix array and run various pattern matching queries Coming Up Next Burrows-Wheeler Transform - A beautiful space-time efficient pattern matching algorithm on text FIT2004: Lec-6: Retrieval Data Structures for Strings Suffix Trie FIT2004: Lec-6: Retrieval Data Structures for Strings We saw that a Trie allows prefix matching in O(M) where M is the length of the prefix, e.g., checking whether “ref” is a prefix of “refer”. What about substring matching? Can we use the Trie to check if “ef” is a substring of “refer” in O(M). No! because “ef” is not a prefix of “refer”. Using the Trie of “refer”, we can only check whether “r”, “re”, “ref”, “refe”, and “refer” are the substrings. Idea: What about if we add “efer” in the Trie? It will allow us efficiently checking whether “e”, “ef”, “efe”, and “efer” are also in the string. What about if we also add “fer”. This will allow us checking whether “f”, “fe”, and “fer” are in the string or not. In short, if we add all suffixes of the string refer (i.e., refer, efer, fer, er, r) in the Trie, we can efficiently search every substring of refer in the Trie. r e f e $ r e f e $ r f e $ r $ r $ Prefix for a string s[1…M] is a string s[1…X] where X≤M. (e.g., refe is a prefix of refer.) Suffix for a string s[1…M] is a string s[X…M]. E.g., fer is a suffix of refer. Suffix Array FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String $ I $ I P P I $ I S S I P P I $ I S S I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I P P I $ S I S S I P P I $ S S I P P I $ S S I S S I P P I $ 12 11 8 5 2 1 10 9 7 4 6 3 Suffix Array: Only stores IDs of suffixes. The sorted suffices are shown just for illustration But if suffixes are not stored, how do we retrieve the suffix while comparing? Easy to get it using suffix ID and original string, i.e., Suffix = String[ID:] Suffix Array Space Complexity: O(N) O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 11 3 11 6 Another example: Suppose we are comparing Suffix with ID 3 and Suffix with ID 6. First k characters must be the same The tie is to be broken on the next k characters, e.g., We need to compare “ISSIPPI$” and “IPPI$” on the first 2 characters ISSIPPI$ and IPPI$ are suffixes and are already ranked on first 2 characters E.g., IPPI$ < ISSIPPI$ because its rank is smaller Therefore, SSIPPI$ < SSISSIPPI$ $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings M I S S I S S I P P I $ 1 2 3 4 5 6 7 8 9 10 11 12 String 1 12 2 11 3 8 4 2 4 5 6 1 7 10 8 9 9 4 9 7 11 3 11 6 Yet another example: Suppose we are comparing Suffix with ID 2 and Suffix with ID 5. First k characters must be the same The tie is to be broken on the next k characters, e.g., We need to compare “SISSIPPI$” and “SIPPI$” on the first 2 characters SISSIPPI$ and SIPPI$ are suffixes and are already ranked on first 2 characters E.g., SIPPI$ = SISSIPPI$ on first 2 characters Therefore, ISSIPPI$ = ISSISSIPPI$ on first 4 characters $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank 130 O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings 6 4 11 9 4 11 9 3 8 7 2 1 1 2 3 4 5 6 7 8 9 10 11 12 Rank 12 11 8 2 5 1 10 9 4 7 3 6 Create a new array called Rank where Index corresponds to ID of each suffix At each index, the current rank of the suffix is recorded This array can be used to get current rank of any suffix. Consider Suffix with ID 2 “ISSISSIPPI$”: Its current rank can be found at Rank[2] How do we know the ID/rank of suffix “SISSIPPI$”? Since it is 2 characters “off” from “ISSISSPPI$” (ID 2), its ID is 2+2=4 and rank is Rank[4] = 9 In general, a suffix k characters “off” from a suffix with ID x will have an ID x + k. E.g., ID of suffix “IPPI$”? It is 6 characters “off” from suffix with ID 2 so its ID is 2 + 6 = 8 $ I $ I P P I $ I S S I S S I P P I $ I S S I P P I $ M I S S I S S I P P I $ P I $ P P I $ S I S S I P P I $ S I P P I $ S S I S S I P P I $ S S I P P I $ ID Rank Index/ID String M I S S I S S I P P I $ 1 2 3 4 4 6 7 8 9 9 11 11 z z 131 FIT2004: Lec-6: Retrieval Data Structures for Strings Given SSISSIPPI$ with ID 3, what is the ID of PPI$? O(1) Comparison FIT2004: Lec-6: Retrieval Data Structures for Strings 6 4 11 9 4 11 9 3 8 7 2 1 1 2 3 4 5 6 7 8 9 10 11 12 Rank 12 11 8 2 5 1 10 9 4 7 3 6 Note: We don’t need to store the suffixes (shown grey) – we only need Suffix IDs. Comparing Suffix with IDs 4 and 7: Their ranks are equal. Rank[4] = Rank [7] E.g., they are the same on first 2 characters We need to compare them on the next 2 characters Compare ranks of suffixes 4+2=6 and 7+2=9 Rank[6] > Rank[9]
So, Suffix #7 is smaller than #4

Note: Comparison takes O(1) and we do not need to store all suffixes – space used is O(N)

$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I S S I P P I $
S I P P I $
S S I S S I P P I $
S S I P P I $

Index/ID
String
M I S S I S S I P P I $

133

$
I $
I P P I $
I S S I S S I P P I $
I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $

O(1) Comparison
FIT2004: Lec-6: Retrieval Data Structures for Strings
6 4 12 10 4 11 9 3 8 7 2 1

1 2 3 4 5 6 7 8 9 10 11 12

Rank
12
11
8
2
5
1
10
9
7
4
6
3

Note: We don’t need to store the suffixes (shown grey) – we only need Suffix IDs.
Suppose array has been sorted on first 4 characters.
Comparing Suffix with IDs 2 and 5:
Their ranks are equal.
Rank[2] = Rank [5]
E.g., they are the same on first 4 characters
We need to compare them on the next 4 characters
Should we instead compare them on all remaining characters???
No, array is sorted on first 4 only
Compare ranks of suffixes 2+4=6 and 5+4=9
Rank[6] > Rank[9]
So, Suffix #5 is smaller than #2
Note: Each comparison takes O(1) and we do not need to store all suffixes – space used is O(N)

Index/ID
String
M I S S I S S I P P I $

134

Construction Cost of Suffix Array
FIT2004: Lec-6: Retrieval Data Structures for Strings
$
I $
I P P I $
I S S I P P I $
I S S I S S I P P I $
M I S S I S S I P P I $
P I $
P P I $
S I P P I $
S I S S I P P I $
S S I P P I $
S S I S S I P P I $

12
11
8
5
2
1
10
9
7
4
6
3

Time Complexity (prefix doubling):
We need to sort O(log N) times
Sort on 1 characters
Sort on 2 characters
…
Sort on N/2 characters
Sort on N characters
Each sorting requires O(N log N) comparisons
Each comparison takes O(1)
Total cost: O(N log2 N)
Space Complexity:
O(N)
6 5 12 10 4 11 9 3 8 7 2 1

1 2 3 4 5 6 7 8 9 10 11 12

Rank
Index/ID
M I S S I S S I P P I $

String

Related Posts