COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
This material has been reproduced and communicated to you by or on behalf of Monash University pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 1 / 36
Copyright By PowCoder代写 加微信 powcoder
Prepared by: [ ] Extended by:
FIT3155: Advanced Algorithms and Data Structures
Week 1: Gusfield’s Z-Algorithm and Exact Pattern Matching
Faculty of Information Technology, Monash University
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 2 / 36
What is covered here?
Linear-time approaches to the exact pattern matching problem on strings This week
Gusfield’s Z-algorithm Next week
Boyer-Moore’s algorithm
Knuth-Morris-Pratt’s algorithm
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 3 / 36
Source material and recommended reading
, Algorithms on Strings, Trees and Sequences, Cambridge University Press. (Chapters 1-2).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 4 / 36
Exact pattern matching: Introduction The exact pattern matching problem
Given a reference text txt[1 . . . n] and a pattern pat[1 . . . m], find ALL occurrences, if any, of pat in txt.
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3
If,txt[1…12]=bbabaxababay andpat[1…3]=aba then pat occurs in txt at positions 3,7 and 9.
1 2 3 4 5 6 7 8 9 10 11 12
txt: b b a b a x a b a b a y
pat: aba
pat: aba
pat: aba
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching
Exact pattern matching: Introduction
Problem arises in innumerable applications
Word processing – grep command in Unix
Search Engines – Google
Library catalogs
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 6 / 36
Na ̈ıve algorithm
1forifrom1ton-m+1do //foreachpositioniintxt
2 3 4 5 6 7 8 9
for j from 1 to m do // check if txt[i…]==pat[j…] if txt[i+j-1] != pat[j] then
endif endfor;
//if pat matched fully at i, print position in txt
if (j == m+1) print i; endfor
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 7 / 36
Na ̈ıve algorithm
1 2 3 4 5 6 7 8 9 10
txt: aaaaaaaaaa
pat: aaa iteration-1
pat: aaa iteration-2 pat: aaa iteration-3
iteration-8
How many comparison of symbols does this approach perform in
worst case?
Total number of comparisons, in the worst case, is (n − m + 1) ∗ m = 24. In general the number of comparisons grows as O(mn).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 8 / 36
Na ̈ıve approach makes too many unnecessary comparisons
1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz iteration-1 #
pat: abxyabxz iteration-2 # pat: abxyabxz iteration-3 #
pat: abxyabxz iteration-4 #
pat: abxyabxz iteration-5 #
pat: abxyabxz iteration-6
Overall 20 comparisons in the na ̈ıve approach, on this example.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 9 / 36
Early ideas for speeding up the na ̈ıve method
Make an attempt to…
…shift pat by > 1 character (if possible) w.r.t. txt when mismatch occurs, while…
…never shifting so far as to miss any exact occurrence of pat in txt across the region shifted;
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 10 / 36
A smarter approach reduces unnecessary comparisons
Scenario 1:
A smarter algorithm can gather that after the mismatch in iteration-2, there is enough information to shift by > 1 character in the next iteration.
1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz
iteration-1 #
pat: abxyabxz
iteration-2 #
pat: abxyabxz iteration-3
How does this algorithm achieve this?
After the ninth comparison, the algorithm knows that pat[1 . . . 7] = txt[2 . . . 8]. It can gather that pat[1] (‘a’) does not occur until position 6 in txt. This is enough information to conclude that there are no possible matches in txt of pat to the left of position 6.
Overall, this ‘smarter’ algorithm saves 3 comparisons on this example.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 11 / 36
An even smarter approach . . .
Scenario 2:
An ‘even smarter’ algorithm can gather more information, beyond scenario 1, and render more comparisons redundant.
1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz
iteration-1 #
pat: abxyabxz
iteration-2 #
pat: abxyabxz iteration-3
Overall, this ‘even smarter’ algorithm saves 6 comparisons over ‘na ̈ıve’.
How does this algorithm achieve this?
This algorithm can preprocess pat, and from this preprocessing learn that pat[1…3] (i.e ‘abx’) appears again at pat[5…7]. But after the ninth comparison it knows that pat[5…7] = txt[6…8] and so after the shift when pat[1…3] is aligned with txt[6…8] it knows these regions already match.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 12 / 36
Take home message from these illustrated examples
The examples shown in the previous slides illustrate the kind of ideas that allow unnecessary comparisons to be skipped/jumped over.
Note, we haven’t yet rationalized how an algorithm can efficiently implement such ideas shown in previous scenarios.
Over weeks 1 and 2, we will study 3 remarkable algorithms that can be implemented to run pattern matching in linear time.
Gusfield’s Z -Algorithm
Boyer-Moore’s Algorithm
Knuth-Morris-Pratt’s Algorithm
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 13 / 36
1. Gusfield’s Z-Algorithm
This is actually a linear-time algorithm to preprocess a given string.
After preprocessing, the output from this algorithm (Zi-values) can be used to address a versatile set of problems that arise in strings, including linear-time pattern matching.
In this unit you will witness its various uses.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 14 / 36
Gusfield’s Z-algorithm – Defining Zi Definition of Zi:
For a string str[1 . . . n], define Zi (for each position i > 1 in str) as the length of the longest substring starting at position i of str that matches its prefix(i.e.,str[i…i+Zi-1]=str[1…Zi]).
1 2 3 4 5 6 7 8 9 10 11
str= a a b c a a b x a a y
Z2=1 Z7=0 Z3=0 Z8=0 Z4=0 Z9=2 Z5=3 Z10=1 Z6=1 Z11=0
(FIT3155 S1 2022, Monash University)
W1: Linear-time pattern matching
Gusfield’s Z-algorithm – Defining Zi-box Definition of Z-box:
For a string str[1..n], and for any i > 1 such that Zi> 0, a Zi-box is defined as the interval [i . . . i+Zi-1] of str.
Z2=1 Z2-box = [2..2] Z3=0 undefined Z4=0 undefined Z5=3 Z5-box = [5..7] Z6=1 Z6-box = [6..6]
Z7 = 0 Z8 = 0 Z9 = 2 Z10 = 1 Z11 = 0
undefined undefined
Z9-box = [9..10] Z10-box = [10..10] undefined
(FIT3155 S1 2022, Monash University)
W1: Linear-time pattern matching
Gusfield’s Z-algorithm – Defining ri Definition of ri:
For a string str[1..n], and for all i > 1, ri is the right-most endpoint of all Z-boxes that begin at or before position i.
Alternately, ri is the largest value of j+Zj-1 over all 1 < j ≤ i, such that Zj> 0.
1 2 3 4 5 6 7 8 9 10 11
str= a a b c a a b x a a y
Z2=1 Z3=0 Z4=0 Z5=3 Z6=1
Z7 = 0 Z8 = 0 Z9 = 2 Z10 = 1 Z11 = 0
Z2-box = [2..2] Z5-box = [5..7] Z6-box = [6..6] Z9-box = [9..10]
r2=2 r7 = 7 r3=2 r8 = 7 r4=2 r9 = 10 r5=7 r10 = 10
Z10-box = [10..10] r6=7 W1: Linear-time pattern matching
(FIT3155 S1 2022, Monash University)
Gusfield’s Z-algorithm – Defining li Definition of li:
For a string str[1..n], and for all i > 1, li is the left end of the Z-box that ends at ri.
In case there is more than one Z-box ending at ri, then li can be chosen to be the left end of any of those Z-boxes.
1 2 3 4 5 6 7 8 9 10 11
str= a a b c a a b x a a y
Z2 =1 Z3 =0 Z4 =0 Z5 =3 Z6 = 1
Z7 =0 Z8 =0 Z9 =2 Z10 =1 Z11 = 0
Z2-box=[2..2] Z5-box=[5..7] Z6-box=[6..6] Z9-box=[9..10]
l2 =2 l3 =2 l4 =2 l5 =5
l7 = 5 l8 = 5 l9 = 9 l10 = 9 l11 = 9
(FIT3155 S1 2022, Monash University)
Z10-box = [10..10] l6 =5 W1: Linear-time pattern matching
Gusfield’s Z-algorithm – Example
Z2 = 1 Z5 = 3 Z6 = 1 Z9 = 2
Z2-box = [2..2] Z5-box = [5..7] Z6-box = [6..6] Z9-box = [9..10]
l2=2 l7 = 5 l3=2 l8 = 5 l9 = 9
l10 = 9 l11 = 9
r2=2 r7 = 7
r3=2 r8 = 7
r4=2 r9 = 10 l4=2 r5=7 r10 = 10 l5=5
Z10 = 1 Z10-box = [10..10] r6=7 (FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching
r11 = 10 l6=5
Main point of Gusfield’s Z-algorithm!
In the previous slides, for any given string, we have defined: {Zi, Zi-box, li, ri}.
Na ̈ıve computation of these values, given some string str[1..n] would require O(n2)-time.
Gusfield’s Z-algorithm allows computation of these values, in O(n)-time.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 20 / 36
Overview of Gusfiled’s Z-algorithm
We compute {Zi, Zi-box, li, ri} values for each successive position i, starting from i = 2.
All successively computed Zi values are remembered.
Note: Each Zi-box interval can be computed from its corresponding Zi value in O(1) time. No need to store the interval information.
At each iteration, to compute (li,ri), this preprocessing only needs values of (lj , rj ) for j = i − 1.
Note: no earlier (lj , rj ) values are needed . . .
. . . so, temporary variables (l, r) can be used to keep track of the most
recently computed (li−1,ri−1) values to update (li,ri). Let’s see how this all works out in practice.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 21 / 36
Z-algorithm – base case
To begin, compute Z2 by explicit left-to-right comparison of characters str[2 . . . ] with str[1 . . . ] until a mismatch is found. If Z2 > 0
set r (i.e., r2) to Z2 +1
set l (i.e., l2) to 2
else (i.e., if Z2 == 0)
set r (i.e., r2) to 0
set l (i.e., l2) to 0
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 22 / 36
Z-algorithm – base case
Z2 =1sosetl=2andr=Z2+1=2.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 23 / 36
Z-algorithm – inductive assumption and general cases
Assume inductively . . .
. . . we have correctly computed the values Z2 through to Zk−1. . . . that r currently holds rk−1,
. . . that l currently holds lk−1.
For computing Zk at position k, these two scenarios arise: Case1: Ifk>r
Case2: Elsek≤r
Let’s see how to have to handle these two cases.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 24 / 36
preprocessing in practice – Case 1: k > r
CASE 1, if k > r:
Compute Zk by explicitly comparing characters str[k . . . q − 1] with
str[1 . . . q − k] until mismatch is found at some q ≥ k. IfZk>0:
⋆ set r (i.e., rk) to q−1.
⋆ set l (i.e., lk) to k.
Otherwise (i.e. Zk = 0) they retain as before
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching
Z-algorithm – Case 1: k > r
Z3 =0solandrremainunchanged(l=2andr=2).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 26 / 36
Z-algorithm – Case 1: k > r
Z4 =0solandrremainunchanged(l=2andr=2).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 26 / 36
Z-algorithm – Case 1: k > r
Z5 =3solandrareupdatedto: l=5andr=7.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 26 / 36
Z-algorithm – Case 2: k ≤ r
CASE 2, if k ≤ r:
The character str[k] lies anywhere within the substring str[l . . . r]
(i.e., anywhere within the Zl-box).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 27 / 36
Z-algorithm – Case 2: k ≤ r
CASE 2, if k ≤ r:
The character str[k] lies anywhere within the substring str[l . . . r]
(i.e., anywhere within the Zl-box).
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 27 / 36
Z-algorithm – Case 2: k ≤ r
CASE 2, if k ≤ r:
The character str[k] lies anywhere within the substring str[l . . . r]
(i.e., anywhere within the Zl-box).
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
This implies the character str[k] is identical to str[k − l + 1]
By extending this logic, it also implies that the substring str[k . . . r] is
identical to str[k−l+1…Zl].
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 27 / 36
Z-algorithm – Case 2: k ≤ r
Question: Can the previously computed value of Zk−l+1 (see red box) inform the computation of Zk (see white box)?
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 27 / 36
Z-algorithm – overview of handling case 2
In the previous slide, we asked “can the value of Zk−l+1 inform the computation of Zk?”. The answer is yes, and this can be handled by two sub-cases: CASES 2a and 2b.
CASE 2a, if Zk−l+1 < r−k+1: CASE 2b, else (if) Zk−l+1 ≥ r−k+1:
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 28 / 36
Z-algorithm – case 2a mechanics CASE 2a, if Zk−l+1 < r−k+1:
Therefore, when Zk−l+1 < r − k + 1
set Zk to Zk−l+1, and
r and l values remain unchanged.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 29 / 36
Z-algorithm – case 2a mechanics CASE 2a, if Zk−l+1 < r−k+1:
Therefore, when Zk−l+1 < r − k + 1
set Zk to Zk−l+1, and
r and l values remain unchanged.
CASE 2a is now complete. Let’s now focus on the remaining case 2b. (FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 29 / 36
Z-algorithm – Case 2b
CASE 2b, if Zk−l+1 ≥ r−k+1:
Zk must also be ≥ r−k+1
So, start explicitly comparing str[r + 1] with str[r − k + 2] and so on
until mismatch occurs.
Say the mismatch occurred at position q ≥ r + 1, then:
⋆ setZk toq−k, ⋆ setrtoq−1. ⋆ setltok.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 30 / 36
Z-algorithm – Case 2b mechanics
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
This implies the character str[k] is identical to str[k − l + 1]
By extension the substring str[k . . . r] is identical to str[k − l + 1 . . . Zl].
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 31 / 36
Z-algorithm – Case 2b mechanics
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
This implies the character str[k] is identical to str[k − l + 1]
By extension the substring str[k . . . r] is identical to str[k − l + 1 . . . Zl]. ButnowZk−l+1 =r−k+1.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 31 / 36
Z-algorithm – Case 2b mechanics
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
This implies the character str[k] is identical to str[k − l + 1]
By extension the substring str[k . . . r] is identical to str[k − l + 1 . . . Zl].
ButnowZk−l+1 =r−k+1.
What info have we gained so far? Ans: x ̸= y, z ̸= x. But we remain uncertain about the relationship between y and z.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 31 / 36
Z-algorithm – Case 2b mechanics
By the definition of Zl-box, str[l . . . r]=str[1 . . . Zl].
This implies the character str[k] is identical to str[k − l + 1]
By extension the substring str[k . . . r] is identical to str[k − l + 1 . . . Zl].
ButnowZk−l+1 =r−k+1.
What info have we gained so far? Ans: x ̸= y, z ̸= x. But we remain uncertain about the relationship between y and z.
Thus, explicit comprisons needed only in the regions of uncertainty.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 31 / 36
Z-algorithm – Case 2b Example
Perform explicit comparisons until mismatch between characters str[16...]andstr[4...]. Initially,l=9andr=15,sor+1=16and r − k + 2 = 4. After computation of Z13, the l and r values are updated to l = 13 and r = 17.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 32 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case2a: Z7 =0solandrremainunchanged(l=5andr=7).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case1: Z8 =0solandrremainunchanged(l=5andr=7).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case1: Z9 =7solandrareupdatedto: l=9andr=15.
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case2a: Z10 =1solandrremainunchanged(l=9andr=15).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case2a: Z11 =0solandrremainunchanged(l=9andr=15).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case2a: Z12 =0solandrremainunchanged(l=9andr=15).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Z-algorithm: continuing iterations beyond those already covered – self-study
On the slide series #26 and #29, examples iterations k = 2 . . . 5 and k = 6 were covered, respectively. Below we have continue the running example for k ≥ 7.
Case 2b (see slide #32): Z13 = 5 (but needs only 3 extra comparisons, half of those required for the naive approach).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 33 / 36
Preprocessing (of Zi values) for str[1..n] takes O(n) time Total time is directly proportional to the SUM of:
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 34 / 36
Preprocessing (of Zi values) for str[1..n] takes O(n) time Total time is directly proportional to the SUM of:
1 the number of iterations
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 34 / 36
Preprocessing (of Zi values) for str[1..n] takes O(n) time
Total time is directly proportional to the SUM of:
1 the number of iterations
2 the number of explicit character comparisons (matches or
mismatches).
(FIT3155 S1 2022, Monash University) W1: Linear-time pattern matching 34 / 36
Preprocessing
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com