Advanced algorithms and data
structures
Week 1 lecture notes by
Based on lecture slides by
Copyright By PowCoder代写 加微信 powcoder
Faculty of Information Technology February 27, 2022
1 Linear-timeexactpatternmatching………………………………. 2
2 Naivepatternmatching ……………………………………. 2
3 Improvingonnaivepatternmatching……………………………… 5
4 Gusfield’sZ-Algorithm…………………………………….. 6
4.1 TheAlgorithm …………………………………….. 9 4.2 Summary……………………………………….. 14 4.3 Applicationtotheexactpatternmatchingproblem…………………… 15
1. LINEAR-TIME EXACT PATTERN MATCHING
1 Linear-time exact pattern matching
Sources and recommended reading: , Algorithms on Strings, Trees and Sequences, Cambridge Press. (Chapters 1-2).
We begin with an examination of a problem that is ubiquitous in modern computing; the exact pattern matching problem.
Definition 1: The exact pattern matching problem
Given a string txt [1…n] called the reference text and a second string pat [1…m] called the pattern, find all occurrences, if any, of pat in txt.
Example 1: Exact pattern matching
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3
Ifwehave,txt[1…12]=bbabaxababay andpat[1…3]=aba thenpatoccursintxtatpositions 3,7 and 9, as shown by the alignments below,
1 2 3 4 5 6 7 8 9 10 11 12
txt: b b a b a x a b a b a y
pat: aba
pat: aba
pat: aba
The practical importance of such a problem should be obvious, and arises in a variety of domains including, but by no means limited to:
• Internet search engines such as Google.
• Word processors; think about the grep command in Unix or ‘find in file’ (crtl+f) command in Microsoft Word
(and other applications). • Library catalogues.
2 Naive pattern matching
It is often instructive, and somewhat traditional to begin by considering the naive solution to a problem before proceeding to more ingenious solutions. A naive approach to exact pattern matching would begin by aligning the left end of pat with the left end of txt and performing character comparisons left to right. In the first alignment it will compare pat[1] with txt[1], pat[2] with txt[2], and so on, until either a mismatch is found,
or a complete match occurs,
pat[i] ̸= txt[i], for some 1 ≤ i ≤ m, pat[i]=txt[i], forall1≤i≤m,
In either case, we proceed by shifting pat one place to the right so that pat[1] is now aligned with txt[2], and character by character comparisons are restarted from the left end of pat. This continues until we exhaust all the valid alignments between pat and txt (the right end of pat moves past the right end of txt). If at any point we find a complete match we report it by simply printing the index in txt that marks the start of the current alignment. This naive algorithm is detailed more succinctly by the pseudocode below.
4 5 6 7 8 9
forjfrom1tomdo
if txt[i+j-1] != pat[j] then
break // mismatch endif
if (j == m+1) print i; endfor
2. NAIVE PATTERN MATCHING
1 n = |txt|
2 m = |pat| 3forifrom1ton-m+1 do
Example 2: Naive pattern matching
1 2 3 4 5 6 7 8 9 10
To investigate the performance of our naive algorithm let’s consider, txt[1…10] = a a a a a a a a a a and 123
pat[1…3] = a a a . The algorithm will explore the following alignments, where the iteration is given on the far left,
1 2 3 4 5 6 7 8 9 10
txt: aaaaaaaaaa
pat: aaa iteration-1
pat: aaa iteration-2 pat: aaa iteration-3
iteration-8
In each alignment we find an occurrence of pat and so must perform 3 comparisons every iteration giving 8 × 3 = 24 comparisons in total.
2. NAIVE PATTERN MATCHING
Time complexity: Naive pattern matching
The previous example actually illustrates the worst case for our naive pattern matching algorithm. As each alignment is a complete match our comparisons never terminate prematurely and we perform a comparison for every character in pat in each iteration.
Let n and m denote the length of txt and pat respectively. In general, the naive method checks every valid alignment between pat and txt and in the worst case we perform m comparisons for each alignment. Therefore the total number of comparisons in the worst case is given by,
(n − m + 1) × m = O(mn).
Using more sophisticated methods it is possible to reduce the worst case complexity to O(m+n) (a significant improvement; try m = 1000 and n = 10, 000, 000), and our naive method is generally too slow for large texts and patterns.
3. IMPROVING ON NAIVE PATTERN MATCHING
3 Improving on naive pattern matching
To improve on the naive method we need to reduce the number of explicit comparisons we perform. One way of doing this is to shift pat right by more than a single character when a mismatch occurs. This reduces the number of alignments for which we perform explicit comparisons, consequentially reducing the number of explicit comparisons performed. However, we must ensure that we never shift so far as to miss an occurrence of pat in txt.
Example 3: Improving on the naive pattern matching
To better visualise this principle let’s consider the example below where we wish to find all occurrences of
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 11 12 13
pat[1…8]=abxyabxz intxt[1…13]=xabxyabxyabxz . Let’sstartwiththenaivemethod, 1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz iteration-1 #
pat: abxyabxz iteration-2 # pat: abxyabxz iteration-3 #
pat: abxyabxz iteration-4 #
pat: abxyabxz iteration-5 #
pat: abxyabxz iteration-6
Overall the naive method performs 20 comparisons in this example and examines every valid alignment between the pattern and the text. However, we can see that it is not necessary to examine some of these alignments.
In iteration 2 we find that,
pat[1…7] = txt[2…8],
before pat[8] (‘z’) mismatches with txt[9] (‘y’). The next three iterations (3 − 5) then terminate after a single comparison as pat[1] (‘a’) mismatches with txt[3] (‘b’), txt[4] (‘x’) and txt[5] (‘y’) respectively. This occurs because the next occurrence of pat[1] (‘a’) in txt is at position 6, and so no matches are possible to the left of this position. A smarter algorithm could use this information to perform a larger shift and immediately align pat[1] with txt[6] after iteration 2. Such a shift would avoid these unnecessary comparisons and is guaranteed to never miss an occurrence of pat. As shown below, with our smarter shift we save 3 comparisons, performing 17 in total.
1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz
iteration-1 #
pat: abxyabxz
iteration-2 #
pat: abxyabxz iteration-3
In fact we can do even better than just skipping unnecessary alignments. By looking more closely at which comparisons need to be performed within an alignment we can further reduce the number of comparisons we must make.
Imagine that an even smarter algorithm processes the pattern before attempting to perform any matching, that is it preprocesses pat. During this preprocessing it learns that the substring pat[1 . . . 3] (i.e ‘abx’) appears
4. GUSFIELD’S Z-ALGORITHM
again at pat[5 . . . 7]. Following the ninth comparison (the eighth in iteration 2), the algorithm also discovers that,
pat[5…7] = txt[6…8]. But since the algorithm already knows that,
pat[1…3] = pat[5…7],
after the shift, when pat[1 . . .] is being compared with txt[6 . . .], it already knows that,
pat[1…3] = txt[6…7],
and it can skip comparing the characters in these substrings. This saves us a further 3 comparisons, giving 14
txt: xabxyabxyabxz
pat: abxyabxz
iteration-1 #
pat: abxyabxz
iteration-2 #
pat: abxyabxz iteration-3
1 2 3 4 5 6 7 8 9 10 11 12 13
In the example above we saw how we could use information about the structure of the pattern and the text to avoid unnecessary comparisons, making our search more efficient. However, we have assumed that this structural information can be gathered efficiently. After all, it is pointless to avoid comparisons if the task of determining which ones we can avoid requires more effort than performing those comparisons in the first place! Fortunately, there are several algorithms that permit us to implement these ideas efficiently. In particular, we will directly examine three algorithms that achieve this by preprocessing the pattern. In other words, they avoid unnecessary comparisons by first exerting a modest amount of effort to learn about the internal structure of the pattern. Other algorithms, like those based on suffix trees, preprocess the text to achieve similar efficiency gains in the exact pattern matching problem, but we will defer the examination of this approach until a later section.
4 Gusfield’s Z-Algorithm
There exist several linear time methods for the exact pattern matching problem that preprocess the pattern. However, the respective preprocessing phases, as originally presented, vary significantly in conceptual difficulty. Fortunately, much of the preprocessing required by many of these algorithms (we examine ’s and the Knuth-Morris-Pratt (KMP) algorithm in FIT3155) is remarkably similar. Thus, we will begin our examination of these methods by first describing Gusfield’s Z-Algorithm, which we find invaluable when attempting to efficiently implement the preprocessing of both and KMP. In addition, we will see that the Z-Algorithm itself will lead directly to a linear time approach to the exact pattern matching problem.
Even though Gusfield’s Z-Algorithm is typically used in the preprocessing phase, it is useful to abstract it from this specific application and describe it with respect to a general string. Put simply, the purpose of the Z-Algorithm is to determine the Z-values for a given string in linear time. To make this more concrete let’s define exactly what we mean by Z-values,
Definition 2: Z-values
For a string str[1 . . . n], define Zi (for each position i > 1 in str) as the length of the longest substring starting at position i of str that matches its prefix (i.e. str[i …i+Zi-1] = str[1 …Zi]
The set of values {Zi, for 2 ≤ i ≤ n} for the string str[1…n] are what we refer to as its Z-values.
4. GUSFIELD’S Z-ALGORITHM
Example 4: Z-values
1 2 3 4 5 6 7 8 9 10 11
If we consider, str[1…11] = a a b c a a b x a a y the corresponding Z-values are, Z2=1 Z3=0 Z4=0 Z5=3 Z6=1
Z7=0 Z8=0 Z9=2 Z10=1 Z11=0
For instance, if we examine Z5 we see that the substring str[5 . . . 7] (‘aab’) matches the prefix str[1 . . . 3] and that str[8] ̸= str[4] and so the longest substring starting at position 5 in str that matches its prefix is str[5 . . . 7]. The length of this substring is 3 and so we obtain Z5 = 3.
Question 1: Does Z1 matter?
We define Z-values for 2 ≤ i ≤ n, but what about Z1? What would it be for the example above? Is this always the case? How much does this tell us about the structure of the string?
The Z-values of a string define intervals (substrings) that match their prefix. Conceptually we can view each Z-value as defining a box that encapsulates the corresponding substring called a Z-box.
Definition 3: Z-boxes
For a string str[1…n], and for any i > 1 such that Zi> 0, the Zi-box is defined as the interval [i …i+Zi-1] of str.
Example 5: Z-boxes
1 2 3 4 5 6 7 8 9 10 11
Returning to our previous example – str[1…11] = a a b c a a b x a a y – the Z-boxes would be,
Z2=1 Z2-box = [2..2] Z3=0 undefined Z4=0 undefined Z5=3 Z5-box = [5..7] Z6=1 Z6-box = [6..6]
Z7 = 0 Z8 = 0 Z9 = 2 Z10 = 1 Z11 = 0
undefined undefined
Z9-box = [9..10] Z10-box = [10..10] undefined
a box (hence the name) around each of the
To better visualise the Z-boxes we often explicitly draw substrings that match their prefix as shown below.
Figure 1: The Z-boxes for the string str[1 . . . 11] = aabcaabxaay.
We will soon see that to efficiently compute the Z-values for a string we will need to keep track of particular Z- boxes. In preparation for this we introduce a compact method for storing the intervals that define them. Specifically, rather than storing the entire interval [i . . . i+Zi-1] we can simply keep track of the endpoints (i and i+Zi-1) of the Z-box.
4. GUSFIELD’S Z-ALGORITHM
Definition 4: ri
For a string str[1..n], and for all i > 1, ri is the right-most endpoint of all Z-boxes that begin at or before position i.
Alternately, ri is the largest value of j+Zj -1 over all 1 < j ≤ i, such that Zj > 0.
Example 6: ri values
1 2 3 4 5 6 7 8 9 10 11
Ifweinspectfigure1wecanseethattheri valuesforstr=aabcaabxaay are:
Z2=1 Z3=0 Z4=0 Z5=3 Z6=1
Z7 = 0 Z8 = 0 Z9 = 2 Z10 = 1 Z11 = 0
Z2-box = [2..2] Z5-box = [5..7] Z6-box = [6..6] Z9-box = [9..10] Z10-box = [10..10]
r2=2 r7 = 7 r3=2 r8 = 7 r4=2 r9 = 10 r5=7 r10 = 10 r6=7 r11 = 10
Definition 5: li
For a string str[1..n], and for all i > 1, li is the left end of the Z-box that ends at ri.
In the case that there is more than one Z-box ending at ri, then li can be chosen to be the left end of any of those Z-boxes.
Example 7: li values
Again by examining figure 1 and the corresponding ri values we can see that the li values for 1 2 3 4 5 6 7 8 9 10 11
str=aabcaabxaay are:
Z2=1 Z3=0 Z4=0 Z5=3 Z6=1
Z7 = 0 Z8 = 0 Z9 = 2 Z10 = 1 Z11 = 0
Z2-box = [2..2] Z5-box = [5..7] Z6-box = [6..6] Z9-box = [9..10] Z10-box = [10..10]
l2=2 l7 = 5 l3=2 l8 = 5 l4=2 l9 = 9 l5=5 l10 = 9 l6=5 l11 = 9
Note that the Z9-box and the Z10-box both end at i = 10 and that Z11 = 0. Due to this we have r9−11 = 10 and we have defined l9−11 = 9. However, it would also be valid to store the left end of the Z10-box for i ≥ 10 giving l10−11 = 10.
Example 8: Self study: calculating Zi, Zi-box and (li, ri).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
str= a a b a a b c a x a a b a a b c y
Z2=1 Z10=7 Z3=0 Z11=1 Z4=3 Z12=0 Z5=1 Z13=3 Z6=0 Z14=1 Z7=0 Z15=0 Z8=1 Z16=0 Z9=0 Z17=0
Z2-box = [2..2] Z4-box = [4..6] Z5-box = [5..5] Z8-box = [8..8]
(l2, r2) = (2, 2) (l3, r3) = (2, 2) (l4, r4) = (4, 6) (l5, r5) = (4, 6) (l6, r6) = (4, 6) (l7, r7) = (4, 6) (l8, r8) = (8, 8) (l9, r9) = (8, 8)
(l10 , r10 ) (l11 , r11 ) (l12 , r12 ) (l13 , r13 ) (l14 , r14 ) (l15 , r15 ) (l16 , r16 ) (l17 , r17 )
= (10, 16) = (10, 16) = (10, 16) = (10, 16) = (10, 16) = (10, 16) = (10, 16) = (10, 16)
Z10 -box Z11 -box Z13 -box Z14 -box
= [10..16] = [11..11] = [13..15] = [14..14]
4. GUSFIELD’S Z-ALGORITHM
4.1 The Algorithm
Given a string str[1 . . . n], the goal of Gusfield’s Z-algorithm is to compute {Zi, Zi-box, li, ri} for each position 1 < i ≤ n in O(n)-time. However, we don’t explicitly store all of these values. In particular, we store:
• All successively computed Zi-values starting from i = 2.
– The Zi-box can be computed from Zi in O(1) time and so we do not store it explicitly.
• Variables (l,r), which keep track of the most recently computed (li,ri) values.
– To compute (li, ri) at iteration i we only require the values (lj,rj) for j = i−1. At the start of iteration i we
have (l, r) = (li−1, ri−1) and these values are then used to update (l, r) so that, (l, r) = (li, ri).
Naive computation of the Z-values for a string would require explicitly matching the characters of each substring str[i...]
to the prefix until a mismatch is found.
The Z-algorithm attempts to avoid performing such comparisons where possible by using the information stored in the previously computed Z-boxes. With these ideas in mind we are ready to examine the details of the algorithm.
Question 2: Naive Z-value computation
What would be the worst case time-complexity for naively computing the Z-values for a string str[1 ...n]?
To begin, we compute Z2 by explicit left-to-right comparison of characters str[2 . . . ] with str[1 . . .] until a mismatch is found (recall that Z2 is the length of the matching substring). We then update the values of l and r accordingly:
If Z2 > 0 • set • set else (i.e., • set • set
r (i.e., r2) to Z2 + 1 l (i.e., l2) to 2
if Z2 == 0)
r (i.e., r2) to 0
l (i.e., l2) to 0
The general case
Let’s now imagine that we are attempting to compute Zk at position k, and assume inductively that: • We have correctly computed values Z2 through to Zk−1,
• r=rk−1 and
At this point the following two scenarios can arise:
Case1: Ifk>r
In this case k is outside the right-most Z-box as shown in figure 2, which means that we must again perform explicit comparisons in order to compute Zk. Specifically, we:
• Compute Zk by explicitly comparing characters str[k . . . q − 1] with str[1 . . . q − k] until a mismatch is found at some q ≥ k.
• IfZk >0:
– set r (i.e., rk) to q−1. – set l (i.e., lk) to k.
If Zk == 0 then there is no need to update l and r and they retain the same values as before.
4. GUSFIELD’S Z-ALGORITHM
Figure 2: Case 1 of the Z-Algorithm. As k > r the previously computed Z-values cannot be used to inform the computation of Zk and we must perform explicit comparisons with the prefix until a mismatch is found.
In this case we must compute Zk through naive comparisons as our previously computed Z-values offer us no insight into the structure of str[k . . . n]. However, if k resides inside a previously computed Z-box this is no longer the case.
Case2: Ifk≤r
In this case k is inside the right-most Z-box as shown below in figure 3. That is the character str[k] lies in the substring str[l . . . r] i.e. within the Zl-box.
Figure 3: Case 2 of the Z-Algorithm. As k ≤ r we can now use previously computed Z-values to inform the computation of Zk.
By definition of the Zl-box we know that, or that the yellow regions match in figure 4.
str[1…Zl] = str[l…r],
Figure 4: By definition of the Zl-box the substring str[l . . . r] must match the prefix str[1 . . . Zl]. Both regions are depicted by a yellow box.
This then implies that,
as these characters were found to match in the computation of the Zl-box. By the same logic we can conclude that,
or that the green regions in figure 5 match.
str[k] = str[k − l + 1],
str[k . . . r] = str[k − l + 1 . . . Zl],
4. GUSFIELD’S Z-ALGORITHM
Figure 5: Due to the properties of the Zl-box the substrings str[k − l + 1 . . . Zl] and str[k . . . r] must match. These substrings are depicted by the green boxes, while the yellow boxes denote the Zl-box and its matching prefix.
By our inductive hypothesis we have already correctly computed Zk−l+1, which is shown as a red box in figure 6. We already know that,
str[k . . . r] = str[k − l + 1 . . . Zl],
so it is reasonable to expect that the value of Zk−l+1 might be able to aid in the computation of Zk. This is indeed the case
and is handled by two separate subcases 2a and 2b.
Figure 6: The structure of related Z-boxes in case 2 (k ≤ r) of the Z-algorithm. The Zk−l+1-box is shown as a red box, while the Zl-box and its matching prefix are depicted by the yellow boxes, and the matching substrings str[k−l+1…Zl] and str[k…r] are depicted by the green boxes.
In this case the Zk−l+1-box does not extend to the end of the prefix str[1…Zl] that matches the Zl-box and so there exists some character x at position p, such that,
Case 2a: If Zk−l+1