程序代写 COMP9319 WEB DATA COMPRESSION AND SEARCH

COMP9319 WEB DATA COMPRESSION AND SEARCH
Search on Suffix Array, FM Index, Backward Search, Compressed BWT

SUFFIX ARRAY

Copyright By PowCoder代写 加微信 powcoder

¢ We loose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order

SUFFIX ARRAY
¢ We loose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
Note: If 0-based index: some papers assume 1-based, some are 0-based.

Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
R ssissippi

Let S = mississippi L
Let P=issi
i.e., To find every suffix that begins with issi. How???
ippi issippi ississippi
mississippi pi
sisippi ssippi ssissippi

Let S = mississippi L
Let P=issi
Two binary searches !! So total O(m log n)
ippi issippi ississippi
mississippi pi
sisippi ssippi ssissippi

Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
M = (L+R) / 2 if P > S[M]:
L = M+1 else:
R ssissippi

Let S = mississippi L
Let P=issi
ippi issippi ississippi
mississippi pi
sisippi ssippi
M = (L+R) / 2 if P < S[M]: R ssissippi BACKWARD SEARCH FM-INDEX (FULL-TEXT INDEX IN MINUTE SPACE) Ferragina & Manzini Modified from slides by ¢ Combine: Text Compression + Indexing (discard original text). ¢ Count and locate P with length m by looking at only a small portion of the compressed text. ¢ Do it efficiently: — Time: O(m) ¢ Exploit the relationship between the Burrows- and the Suffix Array data structure. ¢ Compressed suffix array that encapsulates both the compressed text and the full-text indexing information. ¢ Supports two basic operations: — Count – return number of occurrences of P in T. — Locate – find all positions of P in T. • Every column is a permutation of T. mississippi# p i s s i i • Given row i, char L[i] precedes F[i] in # mississipp i #mississip i ppi#missis i ssippi#mis i ssissippi# m ississippi p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m ississippi#m • Consiescustivpepcih#arm’siisnsL are adjacent ssippi#missi original T. ssissippi#mi sissippi#mis Sort the rows to similar strings in T. sippi#missis ippi#mississ • Therefore – L usually contains long ppi#mississi runs of identical char’s. pi#mississip i#mississipp #mississippi 1. 2. 3. 4. Reminder: Recovering T from L Find F by sorting L Last char of T? # L[i] precedes F[i] in T. Therefore we get How do we choose the correct i in F? — Thei’sareinthesameorderinLandF — As are the rest of the char’s p precedes i: pi# And so on.... ¢ Backward-search algorithm ¢ Uses only L (output of BWT) ¢ Relies on 2 structures: — C[1,...,|Σ|] : C[c] contains the total number of text chars in T which are alphabetically smaller then c (including repetitions of chars) — Occ(c,q): number of occurrences of char c in prefix L[1,q] Example 1 2 3 4 5 6 7 8 9 10 11 12 C[ ] for T = mississippi# occ(s, 5) = 2 occ(s,12) = 4 Occ o Rank ¢ Works in p iterations, from p down to 1 P = msi i = 3 msi i = 2 msi i = 1 msi c = ‘i’ c = ‘s’ c = ‘m’ ¢ Remember that the BWT matrix rows = sorted suffixes of T — All suffixes prefixed by pattern P, occupy a continuous set of rows — This set of rows has starting position First — and ending position Last — So, (Last – First +1) gives total pattern occurrences ¢ At the end of the i-th phase, First points to the first row prefixed by P[i,p], and Last points to the last row prefiex by P[i,p]. SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES) First step rows prefixed by char “i” #0 i1 m5 p6 S8 #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m occ=2 fr [lr-fr+1] lr Inductive step: Given fr,lr for P[j+1,p] Take c=P[j] ❴ Find the first c in L[fr, lr] Find the last c in L[fr, lr] L-to-F mapping of these chars Occ() oracle is enough16 Available info ¢P = pssi —i= — (Last–First+1)= First Last 5 6 7 8 9 10 11 12 CC[‘s[‘’i]’]++O1c=c(2‘s’,1) +1 = 8+0+1 = 9 CC[‘[i‘’s+’]1+]O=cCc[(‘ms’,’]5=) 5=8+2=10 — (Last–First+1)= First Last 5 6 7 8 9 10 11 12 C[‘s’] + Occ(‘s’,18) +1 = 8+02+1 = 911 C[‘[s‘s’]’]+Occc((‘s‘s’,’1,50)) =88+24=1102 — (Last–First+1)= First Last 5 6 7 8 9 10 11 12 C[‘sp’’] + Occ(‘‘sp’’,,81)0+) 1+1= =8+62++21+1= =119 C[[‘s‘p’]’]+Occc((‘s‘p’,’1,102)) ==86++42==182 COMPRESSED SUFFIX ARRAY / BWT Slides modified from the original Makinen & Navarro’s SIMPLE FM-INDEX ¢ Construct the Burrows-Wheeler-transformed text bwt(T) [BW94]. ¢ From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time. ¢ Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T). BURROWS-WHEELER TRANSFORMATION ¢ Construct a matrix M that contains as rows all rotations of T. ¢ Sort the rows in the lexicographic order. ¢ Let L be the last column and F be the first column. ¢ bwt(T)=L associated with the row number of T in the sorted M. pos 123456789 T = kalevala# sa F M L 1:9 #kalevala 2:8 a#kaleval 3:6 ala#kalev 4:2 alevala#k 5:4 evala#kal 6:1 kalevala# 7:7 la#kaleva 8:3 levala#ka 9:5 vala#kale L = alvkl#aae, row 6 ==>
Exercise: Given L and the row number, we know how to compute T. What about sa(T)?

T-1= # a l a v e l a k
1a 2l 3v 4k 5l 6# 7a 8a 9e
1: 9 2: 8 3: 6 4: 2 5: 4 6: 1 7: 7 8: 3 9: 5
i 123456789 LF[i] 27968 1345
#kaleva l a al av ak e…l kaleva l a# la la ve

IMPLICIT LF[I]
¢ Ferragina and Manzini (2000) noticed the following
connection:
¢ LF[i]=CT[L[i]]+rankL[i](L,i) ¢ Here
CT[c] : amount of letters 0,1,…,c-1 in L=bwt(T)
rankc(L,i) : amount of letters c in the prefix L[1,i]

RANK/SELECT
select1(L,j)
L rank1(L,i)
3 6 9 10 12
001001001101 001112223445
i 123456789…

T-1= # a l a v e l a k
1a 2l 3v 4k 5l 6# 7a 8a 9e
1: 9 2: 8 3: 6 4: 2 5: 4 6: 1 7: 7 8: 3 9: 5
i 123456789 LF[i] 27968 1345
LF[7]=CT[a]+ranka(L,7) =1+2=3
#a al av ak e…l k# la la ve

RECALL: BACKWARD SEARCH ON
¢ Observation: If [i,j] is the range of rows of M that
start with string X, then the range [i’,j’] containing cX can be computed as
i’ := CT[c]+rankc(L,i-1)+1, j’ := CT[c]+rankc(L,j).

BACKWARD SEARCH ON BWT(T)…
¢ Array CT[1,s] takes O(s log |T|) bits.
¢ L=Bwt(T) takes O(|T| log s) bits.
¢ Assuming rankc(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.

RUN-LENGTH FM-INDEX
¢ We make the following changes to the previous FM- index variant:
– L=Bwt(T) is replaced by a sequence S[1,n’] and two bit-vectors B[1,|T|] and B’[1,|T|],
– Cumulative array CT[1,c] is replaced by CS[1,c],
– some formulas are changed.

RUN-LENGTH FM-INDEX…
L B S L F B’
c1cca1 c0aca0 c0gca1 a1aac1 a0tac0 g1gc0 g0gg1 a1ag0 t1tt1 t0tt0

CHANGES TO FORMULAS
¢ Recall that we need to compute CT[c]+rankc(L,i) in the backward search.
¢ Theorem: C[c]+rankc(L,i) is equivalent to select1(B’,CS[c]+1+rankc(S,rank1(B,i)))-1,
when L[i]1 c (e.g., when backward search) , and otherwise (e.g., when reverse, sometimes backward search too) to select1(B’,CS[c]+rankc(S,rank1(B,i)))+ i-select1(B,rank1(B,i)).

EXAMPLE, REVERSE L
L F BSB’ ca1c1
LF[8]= select1(B’,CS[a]+ranka(S,rank1(B,8)))+ 8-select1(B,rank1(B,8))
= select1(B’,0+ranka(S,4))+8-select1(B,4) c a 0a0 =select1(B’,0+2)+8-8
ca0g1=3 ac1a1 ac0t0 gc10 gg01 ag10 tt11 tt00

¢ For more details, read the paper

¢ ipsm$pisi
¢ 111011111010

WHAT IS B’
i p s m $ p i s i

USUALLY B’ IS GIVEN TO SAVE COMPUTATIONS
is 1 i1 0 1 0

REVERSE BWT FROM ROW 6
is 1 i1 0 1 0

REVERSE BWT
is 1 i1 0 1 0
S[rank1(B, 6)]= $

REVERSE BWT
78 1 910 10
S[rank1(B, 6)]= $
= select1(B’, CS[$] + rank$(S, rank1(B, 6))) + 6 – select1(B, rank1(B, 6)))
= select1(B’, 0 + rank$(S, 5)) + 6 – select1(B 5) =1+6–6=1

REVERSE BWT
78 1 910 10
S[rank1(B, 1)]= i
= select1(B’, CS[i] + ranki(S, rank1(B, 1))) + 1 – select1(B, rank1(B, 1)))
= select1(B’, 1 + ranki(S, 1)) + 1 – select1(B, 1) =2+1–1=2

REVERSE BWT
78 1 910 10
S[rank1(B, 1)]= i
= select1(B’, CS[i] + ranki(S, rank1(B, 1))) + 1 – select1(B, rank1(B, 1)))
= select1(B’, 1 + ranki(S, 1)) + 1 – select1(B 1) =2+1–1=2
You can also construct the SA in this way:
12, 11, ….
12,11,8,5,2,1,10,9,7,4,6,3

BACKWARD SEARCH
78 1 910 10
Suppose search for si:
c = i, First = 2, Last = 5
First = C[c] + Occ(c, First – 1) + 1 Last = C[c] + Occ(c, Last)

BACKWARD SEARCH
c = i, First = 2, Last = 5 1 1 i 1 c=s
23 1 ps 1 First = select1(B’, CS[s]+1+ranks(S, rank1(B,2-
40m1 1)))-1+1
5 1 $ 0 =select (B’,7+1+rank (S,1)) 61p11s
111 1 1200
=select1(B’, 8) = 9
Last = select1(B’, CS[s]+1+ranks(S, rank1(B,5))) -1
=select1(B’,7+1+ranks(S,4)) – 1 =select1(B’, 9) -1 = 11 – 1 = 10

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com