COMP9319 WEB DATA COMPRESSION AND SEARCH
Search on Suffix Array, FM Index, Backward Search, Compressed BWT
SUFFIX ARRAY
Copyright By PowCoder代写 加微信 powcoder
¢ We loose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
SUFFIX ARRAY
¢ We loose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
Note: If 0-based index: some papers assume 1-based, some are 0-based.
Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
R ssissippi
Let S = mississippi L
Let P=issi
i.e., To find every suffix that begins with issi. How???
ippi issippi ississippi
mississippi pi
sisippi ssippi ssissippi
Let S = mississippi L
Let P=issi
Two binary searches !! So total O(m log n)
ippi issippi ississippi
mississippi pi
sisippi ssippi ssissippi
Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
M = (L+R) / 2 if P > S[M]:
L = M+1 else:
R ssissippi
Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
M = (L+R) / 2 if P < S[M]:
R ssissippi
BACKWARD SEARCH FM-INDEX
(FULL-TEXT INDEX IN MINUTE SPACE)
Ferragina & Manzini
Modified from slides by
¢ Combine: Text Compression + Indexing (discard original text).
¢ Count and locate P with length m by looking at only a small portion of the compressed text.
¢ Do it efficiently: Time: O(m)
¢ Exploit the relationship between the Burrows- and the Suffix Array data structure.
¢ Compressed suffix array that encapsulates both the compressed text and the full-text indexing information.
¢ Supports two basic operations:
Count – return number of occurrences of P in T. Locate – find all positions of P in T.
• Every column is a permutation of T.
mississippi#
p i s s i i
• Given row i, char L[i] precedes F[i] in
# mississipp i #mississip
i ppi#missis i ssippi#mis i ssissippi# m ississippi p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m
ississippi#m
• Consiescustivpepcih#arm’siisnsL are adjacent ssippi#missi
original T.
ssissippi#mi sissippi#mis
Sort the rows
to similar strings in T.
sippi#missis
ippi#mississ
• Therefore – L usually contains long
ppi#mississi
runs of identical char’s.
pi#mississip i#mississipp #mississippi
1. 2. 3. 4.
Reminder: Recovering T from L
Find F by sorting L
Last char of T? #
L[i] precedes F[i] in T. Therefore we get
How do we choose the correct i in F? Thei’sareinthesameorderinLandF As are the rest of the char’s
p precedes i: pi# And so on....
¢ Backward-search algorithm
¢ Uses only L (output of BWT)
¢ Relies on 2 structures:
C[1,...,|Σ|] : C[c] contains the total number of text chars in T which are
alphabetically smaller then c (including repetitions of chars)
Occ(c,q): number of occurrences of char c in prefix L[1,q] Example
1 2 3 4 5 6 7 8 9 10 11 12
C[ ] for T = mississippi#
occ(s, 5) = 2 occ(s,12) = 4
Occ o Rank
¢ Works in p iterations, from p down to 1
P = msi i = 3 msi i = 2 msi i = 1 msi
c = ‘i’ c = ‘s’ c = ‘m’
¢ Remember that the BWT matrix rows = sorted suffixes of T All suffixes prefixed by pattern P, occupy a continuous set of rows
This set of rows has starting position First
and ending position Last
So, (Last – First +1) gives total pattern occurrences
¢ At the end of the i-th phase, First points to the first row prefixed by P[i,p], and Last points to the last row prefiex by P[i,p].
SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
First step
rows prefixed by char “i”
#0 i1 m5 p6 S8
#mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m
occ=2 fr [lr-fr+1] lr
Inductive step: Given fr,lr for P[j+1,p] Take c=P[j] ❴
Find the first c in L[fr, lr] Find the last c in L[fr, lr] L-to-F mapping of these chars
Occ() oracle is enough16
Available info
¢P = pssi i=
(Last–First+1)=
First Last
5 6 7 8 9 10 11 12
CC[‘s[‘’i]’]++O1c=c(2‘s’,1) +1 = 8+0+1 = 9
CC[‘[i‘’s+’]1+]O=cCc[(‘ms’,’]5=) 5=8+2=10
(Last–First+1)=
First Last
5 6 7 8 9 10 11 12
C[‘s’] + Occ(‘s’,18) +1 = 8+02+1 = 911
C[‘[s‘s’]’]+Occc((‘s‘s’,’1,50)) =88+24=1102
(Last–First+1)=
First Last
5 6 7 8 9 10 11 12
C[‘sp’’] + Occ(‘‘sp’’,,81)0+) 1+1= =8+62++21+1= =119
C[[‘s‘p’]’]+Occc((‘s‘p’,’1,102)) ==86++42==182
COMPRESSED SUFFIX ARRAY / BWT
Slides modified from the original Makinen & Navarro’s
SIMPLE FM-INDEX
¢ Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
¢ From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
¢ Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T).
BURROWS-WHEELER TRANSFORMATION
¢ Construct a matrix M that contains as rows all rotations of T.
¢ Sort the rows in the lexicographic order.
¢ Let L be the last column and F be the first column.
¢ bwt(T)=L associated with the row number of T in the sorted M.
pos 123456789 T = kalevala#
sa F M L 1:9 #kalevala 2:8 a#kaleval 3:6 ala#kalev 4:2 alevala#k 5:4 evala#kal 6:1 kalevala# 7:7 la#kaleva 8:3 levala#ka 9:5 vala#kale
L = alvkl#aae, row 6 ==>
Exercise: Given L and the row number, we know how to compute T. What about sa(T)?
T-1= # a l a v e l a k
1a 2l 3v 4k 5l 6# 7a 8a 9e
1: 9 2: 8 3: 6 4: 2 5: 4 6: 1 7: 7 8: 3 9: 5
i 123456789 LF[i] 27968 1345
#kaleva l a al av ak e…l kaleva l a# la la ve
IMPLICIT LF[I]
¢ Ferragina and Manzini (2000) noticed the following
connection:
¢ LF[i]=CT[L[i]]+rankL[i](L,i) ¢ Here
CT[c] : amount of letters 0,1,…,c-1 in L=bwt(T)
rankc(L,i) : amount of letters c in the prefix L[1,i]
RANK/SELECT
select1(L,j)
L rank1(L,i)
3 6 9 10 12
001001001101 001112223445
i 123456789…
T-1= # a l a v e l a k
1a 2l 3v 4k 5l 6# 7a 8a 9e
1: 9 2: 8 3: 6 4: 2 5: 4 6: 1 7: 7 8: 3 9: 5
i 123456789 LF[i] 27968 1345
LF[7]=CT[a]+ranka(L,7) =1+2=3
#a al av ak e…l k# la la ve
RECALL: BACKWARD SEARCH ON
¢ Observation: If [i,j] is the range of rows of M that
start with string X, then the range [i’,j’] containing cX can be computed as
i’ := CT[c]+rankc(L,i-1)+1, j’ := CT[c]+rankc(L,j).
BACKWARD SEARCH ON BWT(T)…
¢ Array CT[1,s] takes O(s log |T|) bits.
¢ L=Bwt(T) takes O(|T| log s) bits.
¢ Assuming rankc(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.
RUN-LENGTH FM-INDEX
¢ We make the following changes to the previous FM- index variant:
– L=Bwt(T) is replaced by a sequence S[1,n’] and two bit-vectors B[1,|T|] and B’[1,|T|],
– Cumulative array CT[1,c] is replaced by CS[1,c],
– some formulas are changed.
RUN-LENGTH FM-INDEX…
L B S L F B’
c1cca1 c0aca0 c0gca1 a1aac1 a0tac0 g1gc0 g0gg1 a1ag0 t1tt1 t0tt0
CHANGES TO FORMULAS
¢ Recall that we need to compute CT[c]+rankc(L,i) in the backward search.
¢ Theorem: C[c]+rankc(L,i) is equivalent to select1(B’,CS[c]+1+rankc(S,rank1(B,i)))-1,
when L[i]1 c (e.g., when backward search) , and otherwise (e.g., when reverse, sometimes backward search too) to select1(B’,CS[c]+rankc(S,rank1(B,i)))+ i-select1(B,rank1(B,i)).
EXAMPLE, REVERSE L
L F BSB’ ca1c1
LF[8]= select1(B’,CS[a]+ranka(S,rank1(B,8)))+ 8-select1(B,rank1(B,8))
= select1(B’,0+ranka(S,4))+8-select1(B,4) c a 0a0 =select1(B’,0+2)+8-8
ca0g1=3 ac1a1 ac0t0 gc10 gg01 ag10 tt11 tt00
¢ For more details, read the paper
¢ ipsm$pisi
¢ 111011111010
WHAT IS B’
i p s m $ p i s i
USUALLY B’ IS GIVEN TO SAVE COMPUTATIONS
is 1 i1 0 1 0
REVERSE BWT FROM ROW 6
is 1 i1 0 1 0
REVERSE BWT
is 1 i1 0 1 0
S[rank1(B, 6)]= $
REVERSE BWT
78 1 910 10
S[rank1(B, 6)]= $
= select1(B’, CS[$] + rank$(S, rank1(B, 6))) + 6 – select1(B, rank1(B, 6)))
= select1(B’, 0 + rank$(S, 5)) + 6 – select1(B 5) =1+6–6=1
REVERSE BWT
78 1 910 10
S[rank1(B, 1)]= i
= select1(B’, CS[i] + ranki(S, rank1(B, 1))) + 1 – select1(B, rank1(B, 1)))
= select1(B’, 1 + ranki(S, 1)) + 1 – select1(B, 1) =2+1–1=2
REVERSE BWT
78 1 910 10
S[rank1(B, 1)]= i
= select1(B’, CS[i] + ranki(S, rank1(B, 1))) + 1 – select1(B, rank1(B, 1)))
= select1(B’, 1 + ranki(S, 1)) + 1 – select1(B 1) =2+1–1=2
You can also construct the SA in this way:
12, 11, ….
12,11,8,5,2,1,10,9,7,4,6,3
BACKWARD SEARCH
78 1 910 10
Suppose search for si:
c = i, First = 2, Last = 5
First = C[c] + Occ(c, First – 1) + 1 Last = C[c] + Occ(c, Last)
BACKWARD SEARCH
c = i, First = 2, Last = 5 1 1 i 1 c=s
23 1 ps 1 First = select1(B’, CS[s]+1+ranks(S, rank1(B,2-
40m1 1)))-1+1
5 1 $ 0 =select (B’,7+1+rank (S,1)) 61p11s
111 1 1200
=select1(B’, 8) = 9
Last = select1(B’, CS[s]+1+ranks(S, rank1(B,5))) -1
=select1(B’,7+1+ranks(S,4)) – 1 =select1(B’, 9) -1 = 11 – 1 = 10
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com