On Burrows and Kalyanaraman
October 30, 2013
Introduction
Copyright By PowCoder代写 加微信 powcoder
Motivation
Notation and Definitions
BWT properties
Algorithms
References
Burrows : Introduction
Burrows (BWT) is a transformation originally invented for data compression [BW94].
It was later adopted in the bioinformatics domain.
One of the most popular application of BWTs in bioinformatics is in the problem of read mapping
[LD09, LD10, LTPS09, TS09]. This has a direct application in genome re-sequencing and targeted re-sequencing projects.
In this lecture, we will define the Burrows and review its application in pattern matching.
Definition of of Transform
Notation and definitions:
input string of length n
character at index i of s (indexing starts at 1) substring of s starting at index i and ending at index j e.g., s[1…n] = s
string alphabet
end of string symbol (i.e., s[n] = $) s.t. $∈/ Σ operator to denote lexicographically smaller
string concatenation operator
rot(i ) SA[1…n] R[1…n]
the cyclic permutation of s which starts at index i (i.e., s[i . . . n] · s[1 . . . i − 1])
suffix array of s
(i.e., lexicographically sorted array of suffixes of s) build an array of strings R s.t. R[i] = rot(SA[i])
Definition
BWT(s): Given an input string s of length n, BWT(s) is an array of size n where BWT[i] = R[i][n].
BWT: Example
1234567 s=banana$
All (left) rotations:
Suff id. 1banana$ 2anana$b 3nana$ba 4ana$ban 5na$bana 6a$banan 7$banana
R[1…n]: Suffix array of s with rotations
SA 7$banana 6a$banan 4ana$ban 2anana$b 1banana$ 5na$bana 3nana$ba
BWT(s) = annb$aa (which is same as the last column in the R table).
BWT properties
Given an input string s and its BWT transform BWT(s), let:
ind(l(xi)) f (xi )
ind (f (xi ))
denote the ith occurrence of x in the last column of R (Note: this is same as the ith occ. of x in BWT(s)) E.g., l(a1) = annb$aa; l(a2) = annb$aa
denote the index in s corresponding to l(xi) E.g., ind(l(a1)) = 6; ind(l(a2)) = 4
denote the i th occurrence of x in the first column of R E.g., f (a1) = $aaabnn; f (a2) = $aaabnn
denote the index in s corresponding to f (xi )
Last to first property of BWTs
Last column to first column property: The ith occurrence of character x in the last column of R is same as the ithoccurrence of x inthefirstcolumnofR —i.e.,ind(l(xi))=ind(f(xi)).
For any i < n, since l(xi) occurs before l(xi+1) in the last column of R (same as the BWT(s)),
⇒ s[ind(l(xi))+1...n]≺s[ind(l(xi+1)+1...n]
Last to first property of BWTs
Last column to first column property: The ith occurrence of character x in the last column of R is same as the ithoccurrence of x inthefirstcolumnofR —i.e.,ind(l(xi))=ind(f(xi)).
For any i < n, since l(xi) occurs before l(xi+1) in the last column of R (same as the BWT(s)),
⇒ s[ind(l(xi))+1...n]≺s[ind(l(xi+1)+1...n]
⇒ x·s[ind(l(xi))+1...n]≺x·s[ind(l(xi+1))+1...n]
Last to first property of BWTs
Last column to first column property: The ith occurrence of character x in the last column of R is same as the ithoccurrence of x inthefirstcolumnofR —i.e.,ind(l(xi))=ind(f(xi)).
For any i < n, since l(xi) occurs before l(xi+1) in the last column of R (same as the BWT(s)),
⇒ s[ind(l(xi))+1...n]≺s[ind(l(xi+1)+1...n]
⇒ x·s[ind(l(xi))+1...n]≺x·s[ind(l(xi+1))+1...n] ⇒ s[ind(l(xi )) . . . n] ≺ s[ind(l(xi+1)) . . . n]
(∵ s[ind(l(xi))] = s[ind(l(xi+1))] = x)
Last to first property of BWTs
Last column to first column property: The ith occurrence of character x in the last column of R is same as the ithoccurrence of x inthefirstcolumnofR —i.e.,ind(l(xi))=ind(f(xi)).
For any i < n, since l(xi) occurs before l(xi+1) in the last column of R (same as the BWT(s)),
⇒ s[ind(l(xi))+1...n]≺s[ind(l(xi+1)+1...n]
⇒ x·s[ind(l(xi))+1...n]≺x·s[ind(l(xi+1))+1...n] ⇒ s[ind(l(xi )) . . . n] ≺ s[ind(l(xi+1)) . . . n]
(∵ s[ind(l(xi))] = s[ind(l(xi+1))] = x)
⇒ ind(f(xi))=ind(l(xi)) (∵ theaboveinequalityholds∀i