PowerPoint Presentation
Faculty of Information Technology,
Monash University
FIT2004: Algorithms and Data Structures
Week 7: Burrows-Wheeler Transform
These slides are prepared by M. A. Cheema and are based on the material developed by Arun Konagurthu and Lloyd Allison.
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Compression problem
Suppose you have a large sequence of characters (e.g., English text or DNA sequence). How can you compress the data?
Idea:
Original Text: this is mississippi’s history. is this mississippi’s history?
Modified: (rearrange such that we get many “runs” of the same characters)
hhhhiiiiooiiiiiiiiiittttmmsssssssssssrrppppyysssss (text length: 50)
Compressed: 4h4i2o10i4t2m11s2r4p2y5s (compressed length: 24)
FIT2004: Lec-7: Burrows-Wheeler Transform
Compression problem
Sorting the text provides “runs” of maximal lengths.
hhhhiiiiiiiiiiiiiimmoopppprrssssssssssssssssttttyy (text length: 50)
4h14i2m2o4p2r16s4t2y (Compressed length: 20)
However, sorting is not an acceptable solution! We must be able to recover the original text from the compressed data, i.e., decompression.
So, the question is how to modify the original text such that there are many “runs” of the characters (to effectively compress the data) and the original text can be recovered from the decompressed data.
Burrows-Wheeler Transform! Used in bzip2.
FIT2004: Lec-7: Burrows-Wheeler Transform
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Burrows-Wheeler Transform
FIT2004: Lec-7: Burrows-Wheeler Transform
M I S S I S S I P P I $
M
I
S
S
I
S
S
I
P
P
I
$
$
M
I
S
S
I
S
S
I
P
P
I
I
$
M
I
S
S
I
S
S
I
P
P
P
I
$
M
I
S
S
I
S
S
I
P
P
P
I
$
M
I
S
S
I
S
S
I
I
P
P
I
$
M
I
S
S
I
S
S
S
I
P
P
I
$
M
I
S
S
I
S
S
S
I
P
P
I
$
M
I
S
S
I
I
S
S
I
P
P
I
$
M
I
S
S
S
I
S
S
I
P
P
I
$
M
I
S
S
S
I
S
S
I
P
P
I
$
M
I
I
S
S
I
S
S
I
P
P
I
$
M
All cyclic rotations of the text
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
All cyclic rotations of the text
Sort the strings in alphabetical order assuming $ is the smallest
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
All cyclic rotations of the text
The last column of the sorted matrix is Burrows-Wheeler Transform
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
All cyclic rotations of the text
The last column of the sorted matrix is Burrows-Wheeler Transform
Note similarity with suffix array which corresponds to IDs of these suffixes/cyclic rotations
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
All cyclic rotations of the text
The last column of the sorted matrix is Burrows-Wheeler Transform
Once you get BWT, you can use run-length encoding to compress it (if the goal is compression).
Exercise
What is the Burrows-Wheeler Transform of BIRD?
$BIRD
BI$RD
D$RBI
IRBD$
RDI$B
None of the above
FIT2004: Lec-7: Burrows-Wheeler Transform
Quiz time!
https://flux.qa – RFIBMB
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Why is BWT effective for compression?
Last-First Property:
The last character of a row comes before the first character of the row in the input string.
because each string in the matrix is a cyclic rotation of the text
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
M I S S I S S I P P I $
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
All cyclic rotations of the text
For each rotation, the char in the BWT comes before the first char
Why is BWT effective for compression?
Consider a large English text. IS is a very common word. Thus, I appears before S in the text much more frequently compared to some other letters, e.g., IS is more frequent than CABS, BOSS etc.
When the cyclic rotation matrix is sorted, all the occurrences of S in the first column appear together. The last column which is BWT will contain a lot of occurrences of Is because I appears before S much more frequently than the other letters.
E.g., this-is-a-historical-story (space replaced with – for clarity)
………………………………
s-a-historical-story$this-i
s-is-a-historical-story$thi
storical-story$this-is-a-hi
story$this-is-a-historical-
………………………………
Effective for compression when text is large and has such biases in it (i.e., some letters appear before some others much more frequently).
FIT2004: Lec-7: Burrows-Wheeler Transform
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Decompressing (Inverting) BWT
We saw that BWT produces “runs” of characters which is effective in compression.
But how do we invert BWT, i.e., how do we decompress the data to recover original text.
If we could rebuild all the sorted permutations, the one that ends in $ is the original text
FIT2004: Lec-7: Burrows-Wheeler Transform
k-mers
k-mers of a string refers to its all possible substrings of size k (considering cyclic rotation).
2-mers of $MISSISSIPPI are $M, MI, IS, SS, SI, IS, SS, SI, IP, PP, PI, I$.
3-mers of $MISSISSIPPI are $MI, MIS, ISS, SSI, SIS, ISS, SSI, SIP, IPP, PPI, PI$, I$M.
Which of the following represents 2-mers of BIRD$.
D$, RI, BI, RD, $B
IR, D$, BI, $B, RD
$B, DR, BI, IR, D$
$D, DR, RI, IB, B$
None of the above
FIT2004: Lec-7: Burrows-Wheeler Transform
Quiz time!
https://flux.qa – RFIBMB
Inverting BWT
The first column is the last column, sorted
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$
I
I
I
I
M
P
P
S
S
S
S
sort
Inverting BWT
The first column is the last column, sorted
The letters in the first column come after the letter in the last column
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
Inverting BWT
The first column is the last column, sorted
The letters in the first column come after the letter in the last column
Concatenating the first column with the last column gives us all 2-mers
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
Inverting BWT
The first column is the last column, sorted
The letters in the first column come after the letter in the last column
Concatenating last column + first column gives us all 2-mers
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$
I
I
I
I
M
P
P
S
S
S
S
I
P
S
S
M
$
P
I
S
S
I
I
Inverting BWT
The first column is the last column, sorted
The letters in the first column come after the letter in the last column
Concatenating the first column with the last column gives us all 2-mers
Sorting the 2-mers gives us the first 2 columns
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$
I
I
I
I
M
P
P
S
S
S
S
I
P
S
S
M
$
P
I
S
S
I
I
Inverting BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$
I
I
I
I
M
P
P
S
S
S
S
I
P
S
S
M
$
P
I
S
S
I
I
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
sort
Inverting BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
We have all sorted 2-mers (first 2 columns)
We want all the 3-mers, so we can sort them and get the first 3 columns
Last column comes before first column
Concatenate again!
Inverting BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
I
P
S
S
M
$
P
I
S
S
I
I
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
sort
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
I
P
S
S
M
$
P
I
S
S
I
I
Inverting BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
Now we have sorted 3-mers (first 3 columns)
Keep
prepending the last column to the sorted k-mers
sorting the result
until you have reconstructed all the sorted cyclic rotations
Return first row without $
Inverting BWT
Inverting BWT
Create an empty table M
Make a column C containing BWT
Repeat len(BWT) times
Concatenate C with M
Sort M alphabetically
Return the first row (ignore $).
Let N be the total number of characters in the original string. What is the complexity?
Time complexity:
Requires N calls to sorting
Cost of sorting N rows where each row has N characters: O (N2) using radix sort
Total cost for sorting: O(N3) if radix sort is being used
Space complexity:
Size of matrix: O(N2)
Can we improve?
Yes! It is possible to invert in O(N) time complexity and O(N) space complexity
FIT2004: Lec-7: Burrows-Wheeler Transform
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Faster Inversion of BWT
We will use different colors for different occurrences of S in $MISSISSIPPI.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M I S S I S S I P P I
1
2
3
4
5
6
7
8
9
10
11
12
Quiz time!
https://flux.qa – RFIBMB
Faster Inversion of BWT
We have used different colors for different occurrences of S in $MISSISSIPPI.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M I S S I S S I P P I
1
2
3
4
5
6
7
8
9
10
11
12
Faster Inversion of BWT
We have used different colors for different occurrences of S in $MISSISSIPPI.
Observation
The relative orders of the same characters in the first column and the last column is the same.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M I S S I S S I P P I
1
2
3
4
5
6
7
8
9
10
11
12
Faster Inversion of BWT
We have used different colors for different occurrences of S in $MISSISSIPPI.
Observation
The relative orders of the same characters in the first column and the last column is the same.
E.g., the i-th S in the first column is the i-th S in the last column
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$ M I S S I S S I P P I
1
2
3
4
5
6
7
8
9
10
11
12
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
i-th occurrence of a letter in first column and i-th occurrence of the letter in the last column point to the same letter.
Faster Inversion of BWT
Why does this observation hold?
Rotate each row that ends at S by one character
First characters of all these are the same (i.e., S)
This means the sorting is based on the remaining characters, i.e., the sorting order is determined by stripping off S.
Hence, the row that appeared earlier before rotation must appear earlier after rotation.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
M I S S I S S I P P I $
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
So, we know which character in the last column corresponds to which character in the first column. The inversion can then be done as follows.
Start from $ in the first column (F)
The previous letter in this row I is the letter before $ in the original string (Last-First property). Recover this letter.
Now, find this I in the first column
The previous letter in this row P is the letter before this I in the original string (Last-First property). Recover this letter
Now, find this P in the first column.
The previous letter in this row P is the letter before this P in the original string (Last-First property). Recover it.
and so on …
M
I
S
S
I
S
S
I
P
P
I
$
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Each step should be O(1)
What do we need to know?
Quiz time!
https://flux.qa – RFIBMB
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Each step should be O(1)
What do we need to know?
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
How can we collect this information quickly?
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 0 0 0 0
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 1 0 0 0
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 1 0 1 0
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 1 0 1 1
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 1 0 1 2
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
0 1 1 1 2
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
Occ
1 1 1 1 2
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 0
9 0
10 0
11 0
12 0
Occ
1 1 1 2 2
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 0
Occ
1 2 1 2 2
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 0
11 0
12 0
Occ
1 2 1 2 3
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 0
12 0
Occ
1 2 1 2 4
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 0
Occ
1 3 1 2 4
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
0 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 0 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 0 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 0 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 0
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Position of each block of letters in left column
Which letter we are looking at in right column (i.e. which S are we looking at?)
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Faster Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 0
2 0
3 0
4 1
5 0
6 0
7 1
8 1
9 2
10 3
11 2
12 3
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Practice
FIT2004: Lec-7: Burrows-Wheeler Transform
What is Burrows-Wheeler Transform of REFERRER?
RRRFEE$RE
$REFERRER
RRRFE$ERE
RRREFEE$R
None of the above
Quiz time!
https://flux.qa – RFIBMB
Practice: Burrows-Wheeler Transform
FIT2004: Lec-7: Burrows-Wheeler Transform
R E F E R R E R $
All cyclic rotations of the text
R E F E R R E R $
$ R E F E R R E R
R $ R E F E R R E
E R $ R E F E R R
R E R $ R E F E R
R R E R $ R E F E
E R R E R $ R E F
F E R R E R $ R E
E F E R R E R $ R
Practice: Burrows-Wheeler Transform
FIT2004: Lec-7: Burrows-Wheeler Transform
All cyclic rotations of the text
R E F E R R E R $
$ R E F E R R E R
R $ R E F E R R E
E R $ R E F E R R
R E R $ R E F E R
R R E R $ R E F E
E R R E R $ R E F
F E R R E R $ R E
E F E R R E R $ R
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
Sort all rows alphabetically
The last colum is BWT.
Practice: Efficient Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
$ E F R
Rank
1
2
3
4
5
6
7
8
9
What are the values in the Rank array?
1, 2, 4, 5
1, 4, 5, 9
1, 2, 5, 6
None of the above
Quiz time!
https://flux.qa – RFIBMB
Practice: Efficient Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
1 2 5 6
$ E F R
Rank
1
2
3
4
5
6
7
8
9
What are the values in the Rank array?
1, 2, 4, 5
1, 4, 5, 9
1, 2, 5, 6
None of the above
Practice: Efficient Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
1
2
3
4
5
6
7
8
9
What are the values in the Occ array?
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
Occ
Practice: Efficient Inversion of BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
1
2
3
4
5
6
7
8
9
What are the values in the Occ array?
1 0
2 1
3 2
4 0
5 0
6 1
7 0
8 3
9 2
Occ
Outline
Compression
Burrows-Wheeler Transform (BWT)
Why BWT is effective for compression
Decompressing BWT
Naïve Approach
Efficient Approach
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Suppose we want to search SIS in the string.
Initially the range contains all rows of BWT
Start from the last character S of SIS.
Find first S in the range and the last S in the range in the Last column
Find the corresponding Ss in the first column and update the range
Now, find the first I in the range and the last I in the range in the Last column
Find the corresponding Is in the first column and update the range.
Now, find the first S in the range and the last S in the range
Find the corresponding Ss in first column and update the range
At any stage, if the character is not found in the range then the substring is not present and false can be returned.
S
S
I
range
1
2
3
4
5
6
7
8
9
10
11
12
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
How to efficiently compute first and last occurrence of a character c in the range.
For each character, create a sorted array of their positions in the last column – this can be done in linear time
To search a character c in range(i,j), use binary search.
to search the first S in the range (5,11), binary search for the smallest position equal to or larger than 5 in the array of S
to search the last S in the range (5,11), binary search for the largest position smaller than or equal to 11
1
2
3
4
5
6
7
8
9
10
11
12
I 1, 8, 11, 12
M 5
P 2, 7
S 3, 4, 9, 10
Time Complexity:
O(M log N) where M is length of substring.
Could be improved to O(M) by maintaining an occ array of size alphabet * M (example to follow)
Substring search in linear O(M) time
FIT2004: Lec-7: Burrows-Wheeler Transform
In the following slides, we use a larger version of “occ”
Occ[row,char] is the number of times we have seen character “char” before this row
We have two pointers to keep track of our range, first and last
Substring search in linear O(M) time
FIT2004: Lec-7: Burrows-Wheeler Transform
Occ[row,char] is the number of times we have seen character “char” before this row
Update rules for first and last
Next_char refers to the next character in the pattern that we are searching
First = rank[next_char] + occ[first, next_char]
Last = rank[next_char] + occ[last + 1, next_char] – 1
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Occ
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Occ
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
First S: rank[S] + occ[1,S] =
9 + 0 = 9
Last S: rank[S] + occ[12+1,S] – 1
= 9 + 4 – 1 = 12
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
Occ
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
First S: rank[S] + occ[1,S] =
9 + 0 = 9
Last S: rank[S] + occ[12+1,S] – 1
= 9 + 4 – 1 = 12
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Count
1 4 1 2 4
$ I M P S
1 2 6 7 9
$ I M P S
Occ
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Count
1 4 1 2 4
$ I M P S
1 2 6 7 9
$ I M P S
Occ
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
First I: rank[I] + occ[9,I] =
2 + 2 = 4
Last I: rank[I] + occ[12+1, I] – 1 = 2 + 4 – 1 = 5
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
First S: rank[S] + occ[4, S] =
9 + 1 = 10
Last S: rank[S] + occ[5, S] =
9 + 2 – 1 = 10
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
S
S
I
1
2
3
4
5
6
7
8
9
10
11
12
Rank
Occ
1 4 1 2 4
$ I M P S
Count
1 2 6 7 9
$ I M P S
$ I M P S
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 1 0
4 0 1 0 1 1
5 0 1 0 1 2
6 0 1 1 1 2
7 1 1 1 1 2
8 1 1 1 2 2
9 1 2 1 2 2
10 1 2 1 2 3
11 1 2 1 2 4
12 1 3 1 2 4
13 1 4 1 2 4
Substring search using BWT
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I1
I1 $ M I S S I S S I P P1
I2 P P I $ M I S S I S S1
I3 S S I P P I $ M I S S2
I4 S S I S S I P P I $ M1
M1 I S S I S S I P P I $
P1 I $ M I S S I S S I P2
P2 P I $ M I S S I S S I2
S1 I P P I $ M I S S I S3
S2 I S S I P P I $ M I S4
S3 S I P P I $ M I S S I3
S4 S I S S I P P I $ M I4
Another example:
Suppose we want to search ISS in the string.
Initially the range contains all rows of BWT
Start from the last character S of SIS.
Find first S in the range and the last S in the range in the Last column
Find the corresponding Ss in the first column and update the range
Now, find the first S in the range and the last S in the range in the Last column
Find the corresponding Ss in the first column and update the range.
Now, find the first I in the range and the last I in the range
Find the corresponding Is in first column and update the range
S
I
S
range
1
2
3
4
5
6
7
8
9
10
11
12
Practice: Substring matching
FIT2004: Lec-7: Burrows-Wheeler Transform
$ R E F E R R E R
E F E R R E R $ R
E R $ R E F E R R
E R R E R $ R E F
F E R R E R $ R E
R $ R E F E R R E
R E F E R R E R $
R E R $ R E F E R
R R E R $ R E F E
Search ER
Search RE
Search FEF
1
2
3
4
5
6
7
8
9
Summary
FIT2004: Lec-7: Burrows-Wheeler Transform
Take home message
Burrows-Wheeler Transform is an elegant algorithm that allows efficient and effective compression and substring matching
Things to do (this list is not exhaustive)
Read more about Burrows-Wheeler Transform and understand how and why it works
Implement it in Python
Coming Up Next
Introduction to Graphs and Path problems on Graphs
Matrix Properties
Is it true that each column in the Matrix is a permutation of the string $MISSISSIPPI?
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
M I S S I S S I P P I $
$ M I S S I S S I P P I
I $ M I S S I S S I P P
P I $ M I S S I S S I P
P P I $ M I S S I S S I
I P P I $ M I S S I S S
S I P P I $ M I S S I S
S S I P P I $ M I S S I
I S S I P P I $ M I S S
S I S S I P P I $ M I S
S S I S S I P P I $ M I
I S S I S S I P P I $ M
All rotations before sorting
Inverting BWT
Is it true that sorting the 2-mers gives us the first two columns of the Matrix?
Yes! Note that we have obtained the first two columns of the matrix using BWT.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
$
I
I
I
I
M
P
P
S
S
S
S
$
I
I
I
I
M
P
P
S
S
S
S
Concatenate Last
and First columns
I
P
S
S
M
$
P
I
S
S
I
I
Sort
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
Inverting BWT
Concatenating the last and first two columns gives us the 3-mers of $MISSISSIPPI.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
Concatenate Last and First two columns
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
$
I
I
I
I
M
P
P
S
S
S
S
I
P
S
S
M
$
P
I
S
S
I
I
M
$
P
S
S
I
I
P
I
I
S
S
Inverting BWT
Sorting the 3-mers gives us the first three columns of the matrix.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
Concatenate Last and First two columns
I
P
S
S
M
$
P
I
S
S
I
I
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
Sort
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
Inverting BWT
Concatenating the last column with the first three columns gives us 4-mers.
Sorting the 4-mers gives us the first four columns.
FIT2004: Lec-7: Burrows-Wheeler Transform
$ M I S S I S S I P P I
I $ M I S S I S S I P P
I P P I $ M I S S I S S
I S S I P P I $ M I S S
I S S I S S I P P I $ M
M I S S I S S I P P I $
P I $ M I S S I S S I P
P P I $ M I S S I S S I
S I P P I $ M I S S I S
S I S S I P P I $ M I S
S S I P P I $ M I S S I
S S I S S I P P I $ M I
Concatenate Last and First three columns
I
P
S
S
M
$
P
I
S
S
I
I
$ M
I $
I P
I S
I S
M I
P I
P P
S I
S I
S S
S S
Sort
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
$ M I
I $ M
I P P
I S S
I S S
M I S
P I $
P P I
S I P
S I S
S S I
S S I
$ M I S
I $ M I
I P P I
I S S I
I S S I
M I S S
P I $ M
P P I $
S I P P
S I S S
S S I P
S S I S
$ M I S
I $ M I
I P P I
I S S I
I S S I
M I S S
P I $ M
P P I $
S I P P
S I S S
S S I P
S S I S
/docProps/thumbnail.jpeg