Advanced algorithms and data
structures
Week 2 lecture notes by
Based on lecture slides by
Copyright By PowCoder代写 加微信 powcoder
Faculty of Information Technology August 2, 2021
1 TheBoyer-MooreandKnuth-Morris-Prattalgorithms …………………….. 2 1.1 TheBoyer-Moorealgorithm………………………………. 2
1.1.1 Right-to-leftscanning …………………………….. 2
1.1.2 Thebadcharacterrule ……………………………. 3
1.1.3 Theextendedbadcharacterrule……………………….. 6
1.1.4 Goodsuffixrule ……………………………….. 9
1.1.5 Galil’soptimisation ……………………………… 14
1.1.6 Summary…………………………………… 17
2 TheKnuth-Morris-Pratt(KMP)algorithm ………………………….. 18 2.1 Thealgorithm …………………………………….. 18
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
1 The Boyer-Moore and Knuth-Morris-Pratt algorithms
Let’s now return to the ideas discussed in week 1 for improving on naive pattern matching. We claimed that by preprocessing the pattern a more intelligent algorithm could avoid unnecessary comparisons and alignments between the pattern and the text. The avoidance of these unnecessary comparisons would then yield a linear time solution to the exact pattern matching problem. Although the linear time solution we discussed in the previous section can be seen to preprocess the pattern1, the actual matching is performed by calculating the Z-values of the text. This method certainly avoids unnecessary comparisons by using the previously computed Z-values, but it never skips over sections of the text where matches cannot occur and is quite different in spirit from the optimisations we initially proposed. In this section we will resolve this discrepancy by examining two algorithms – the Boyer-Moore algorithm and the Knuth-Morris-Pratt (KMP) algorithm – that take this more intuitive approach to the exact pattern matching problem.
Both the Boyer-Moore and KMP algorithms are linear time methods and have a worst case complexity of O(m + n), assuming a pattern and text of length m and n respectively. This can make their examination seem somewhat academic since we have already described a linear time approach based on the Z-algorithm, which is arguably much simpler than either of Boyer-Moore or KMP. However, both Boyer-Moore and KMP can be seen to hold several distinct advantages over a Z-algorithm based approach. In particular, we will see that although the worst case of Boyer-Moore is linear, on average it runs in sublinear time making it the algorithm of choice in many applications. In contrast KMP holds little advantage over the Z-algorithm based approach in terms of performance, but can be easily generalised to more complex pattern matching problems that are not amenable to Z-algorithm methods. Since Boyer-Moore and KMP are very similar in spirit we will dedicate the majority of this section to Boyer-Moore as it is of greater practical importance and is typically the more challenging of the two methods to grasp. We will then conclude with a brief discussion of KMP illuminating many of its inner workings by noting its similarities to Boyer-Moore.
1.1 The Boyer-Moore algorithm
Similar to the naive method, Boyer-Moore successively aligns the pattern with the text and checks for a complete match in each alignment. After a mismatch, or a complete match, the pattern is shifted to the right with respect to the text generating a new alignment. To improve on the naive method the algorithm incorporates three clever ideas:
• right-to-left scanning,
• the bad character shift rule, • the good suffix shift rule,
each of which we explore below.
1.1.1 Right-to-left scanning
For a given alignment between the pattern and the text, naive pattern matching will compare the characters of the pattern to those in text in a left-to-right scan. That is, in an alignment of pat[1 . . . m] with txt[j . . . j + m − 1], we begin by comparing pat[1] and txt[j], followed by pat[2] and txt[j + 1] and so on moving through the pat- tern and text in a left-to-right scan. In contrast the Boyer-Moore algorithm scans for matched characters right to left.
Example 1: Right-to-left scanning
Let’s consider the alignment shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: x p b c t b x a b a b x c t b p q 1234567
pat: tpabxab
Scanning from right to left we would make the following comparisons:
• Compare pat[7] ≡ b with txt[9] ≡ b: match.
1You should convince yourself that Z-algorithm based patter matching does indeed need to preprocess the pattern before it can begin searching for matches in the text.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
• Compare pat[6] ≡ a with txt[8] ≡ a: match.
• Compare pat[5] ≡ x with txt[7] ≡ x: match.
• Compare pat[4] ≡ b with txt[6] ≡ b: match.
• Compare pat[3] ≡ a with txt[5] ≡ t: mismatch.
Obviously this ‘backwards’ scanning alone offers no performance benefits in comparison to the left-to-right scan utilised by the naive method. However, the direction of our scan is not inconsequential and we will return to its importance after discussing the two shift rules that it enables.
1.1.2 The bad character rule
Example 2: A simple bad character shift
If we return to example 1, we find a mismatch when comparing pat[3] ≡ a with txt[5] ≡ t. We now wish to shift the pattern to the right with respect to the text, but would like to shift further than the naive shift of 1 position. By examining the pattern we can see that the rightmost occurrence of the bad character i.e. the mismatched character in the text (txt[5] ≡ t), occurs at position 1 of the pattern (pat[1] ≡ t). As this is the rightmost occurrence of the mismatched character (‘t’) we can safely shift the pattern so as to align pat[1] ≡ t with txt[5] ≡ t in the next alignment. In this case the bad character rule would tell us to shift the pattern 2 places to the right as shown below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: x p b c t b x a b a b x c t b p q
pat: tpabxab #
pat: tpabxab
We will now formalise this idea so that we can realise it algorithmically. Recall that we only wish to preprocess the pattern, and so we want to store the rightmost occurrences of each character that appears in the pattern. Let’s define,
Position of the rightmost occurrence of x in the pattern if x occurs in pattern, 0 otherwise.
For instance, in the previous example we would have: R(t) = 1, R(b) = 7, and R(c) = 0, as ‘c’ only appears in the text and not the pattern. All that remains is to determine how to use the R(x) values to calculate how many places we should shift by when a mismatch occurs.
Let’s consider some general iteration where txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via right- to-left scan, and a mismatch occurs at position k in the pattern, i.e.
x ≡ txt[j + k − 1] ̸= pat[k] ≡ y
as shown in figure 1.
Figure 1: A general iteration of where txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via right-to-left scan and a mismatch occurs at position k in the pattern.
In this case we want to align the rightmost occurrence of x in the pattern with txt[j + k − 1]. Assuming that R(x) < k (see figure 2) we need to shift the pattern k − R(x) places to the right, as shown in figure 3.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Figure 2: The case where the position of the rightmost occurrence of the bad character (‘x’) in the pattern is to the left of the mismatch point k, i.e. k < R(x).
Figure 3: Assuming that R(x) < k, shifting the pattern to the right by k − R(x) places will align the character at position R(x) with the bad character (x) in the text.
If R(x) = 0 then the character x does not appear in the pattern and we can never find a complete match when any character of the pattern is aligned with x. In this case it is safe to shift the whole pattern past x (txt[j + k − 1]), as shown in figure 4. To do so we need to shift the pattern k places to the right or equivalently by k − R(x) places as R(x) = 0. This gives our definition of R(x) operational significance as we do not need to alter the general shift rule to account for this case. Conveniently, this definition is also of conceptual significance as we are 1 indexing, and so position 0 does not exist in the pattern.
Figure 4: If the mismatched character in the text x doesn’t occur in the pattern, then R(x) = 0 and it is safe to shift the whole pattern past the occurrence of x in the text.
But, what should we do if R(x) > k as shown in figure 5? Our general shift rule would tells us to shift the pattern right by k − R(x) places, but k − R(x) < 0 in this case! Obviously the bad character rule can’t be used to help inform our shift here, and we must resort to shifting naively shifting the pattern by a single position to the right.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Figure 5: If the rightmost occurrence of the mismatched character in the text (x), is to the right of the mismatched position (k) in the pattern, i.e. R(x) > k, then shift the pattern to the right by 1 place.
Example 3: The bad character rule
txt[1…17]=xpbctbxabacbxtbpqa andpat[1…7]=tbapxab . The preprocessing of the pattern to compute the R(x) values would yield:
xabcpqtx R(x) 6 7 0 4 0 1 5
These values would then be used to shift the pattern as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
txt: xpbctbxabacbxtbpqa
pat: tbapxab
iteration 1 #
pat: tbapxab
iteration 2 #
pat: tbapxab
iteration 3 #
pat: tbapxab
iteration 4 #
pat: tbapxab iteration 5 #
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Bad character rule summary
Let the pattern and the text be defined over some alphabet א. • Calculate,
Position of the rightmost occurrence of x in the pattern if x occurs in pattern, 0 otherwise,
for each character x ∈ א, by preprocessing the pattern. This can be done in O(m)-time and O(|א|)-space.
• Let x denote the character txt[j + k − 1]. If a mismatch occurs at position k in the pattern during an iteration where txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via right-to-left scan (shown below), then it is safe to shift the pattern to the right by, max(k − R(x), 1) places.
Question 1: Bad character preprocessing
Describe using pseudocode, or otherwise, how to perform the preprocessing required for the bad character rule.
1.1.3 The extended bad character rule
The bad character rule is most effective when mismatches occur towards the right end of the pattern, but reduces to naive shifting when the bad character occurs to the right of the mismatch point in the pattern. This situation becomes increasingly common as we reduce the size of our alphabet, or when the text contains many substrings that almost, but don’t exactly match the pattern. To achieve better performance in these cases we would like to extend the preprocessing of the pattern to enable a more robust shift rule.
Consider iteration 3 in example 3. Here pat[5] ≡ x mismatches with txt[9] ≡ b, but we only shift the pattern by 1 place as R(b) = 7. However, we cannot obtain a complete match until pat[2] ≡ b is aligned with txt[9] ≡ b, and so we could safely shift the pattern by 3 places instead. In general if the character at position k in the pattern mismatches with txt[j + k − 1] ≡ x in the text, we want to align the rightmost occurrence of x that is to the left of k in the pattern with x as shown in figure 6. This will give us the best possible bad character shift and we define the extended bad character rule by simply replacing R(x) with Rk(x) which is defined as,
Position of the rightmost occurrence of x to the left of k in the pattern if x occurs to the left of k in the pattern, Rk(x) = 0 otherwise.
Furthermore, we no longer need to treat the R(x) > k case separately. Shifting by k − Rk (x) will always give a well defined shift since Rk(x) always refers to a character to the left of k. If no such character exists the extended bad character rule instructs us to shift the entire pattern past the mismatched position in the text.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Figure 6: If the mismatched character (x) in the text appears to the left of the mismatch point k in the pattern, then aligning the rightmost such occurrence in the pattern with the mismatched x in the text, is the optimal bad character shift.
Example 4: Extended bad character preprocessing
If pat[1…7] = t b a p x a b , then we could store the corresponding Rk(x) values in the following 2D array:
xabcpqtx R1(x) 0 0 0 0 0 0 0 R2(x) 0 0 0 0 0 1 0 R3(x) 0 2 0 0 0 1 0 R4(x) 3 2 0 0 0 1 0 R5(x) 3 2 0 4 0 1 0 R6(x) 3 2 0 4 0 1 5 R7(x) 6 2 0 4 0 1 5
We should note that every entry in the first row is 0 as if we mismatch on the first character in the pattern i.e. pat[1] ≡ t, then we simply shift past the mismatched character in the text. Furthermore, some of the entries will never be used. For instance R7(b) = 2, but to use this value we would need pat[7] ≡ b to mismatch with a ‘b’ in the text which is impossible.
Extended bad character rule summary
Let the pattern and the text be defined over some alphabet א.
• Calculate Rk(x) for each character x ∈ א, by preprocessing the pattern. This can be done in O(|א|m)
time and space by constructing a 2D array to store the Rk(x) values.
• Let x denote the character txt[j + k − 1]. If a mismatch occurs at position k in the pattern during an iteration where txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via right-to-left scan (shown below), then it is safe to shift the pattern to the right by, k − Rk(x) places.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Question 2: Extended bad character preprocessing
1. Describe using pseudocode, or otherwise, how to perform the preprocessing required for the extended bad character rule. Begin with the 2D array approach suggested above. Hint: you should use dynamic programming.
2. Explain how you could store the Rk(x) values in a more space efficient manner and discuss how this would impact the time complexity of the search phase. Hint: do we need to explicitly store the whole 2D array?
Question 3: Bad character implementation issues
At first glance it would appear that our extended bad character rule elegantly handles all of the cases that might arise when performing exact pattern matching, but what happens if we find an occurrence of the pattern in the text? What would be the bad character in this case, and how many places is it safe to shift the pattern by? Note: This is not an issue in Boyer-Moore as it is handled by the good suffix rule.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
1.1.4 Good suffix rule
The (extended) bad character rule is a significant improvement over naive pattern matching and performs particu- larly well for natural languages, such as English, due to their sizable alphabets, and the variety of characters that appear within words. However, the bad character rule alone does not guarantee linear time pattern matching.
Example 5: The worst case of the bad character rule
1 2 3 4 5 6 7 8 9 10 1 2 3
txt[1…10]=aaaaaaaaaa andpat[1…3]=baa .
In this case when scanning right-to-left every alignment between pat and txt will yield two matches fol-
lowed by a mismatch. The mismatch always occurs at position 1 of the pattern and so both the extended and regular bad character rule would tell us to shift by a single position each iteration, giving (10 − 3 + 1) × 3 = 24 comparisons overall. Evidently this reduces to naive pattern matching with some extra preprocessing!
Example 5 illustrates the worst case of the bad character rule and demonstrates that it alone does not reduce the worst case time complexity of O(mn). In cases like this we match large portions of the pattern to the text, but the bad character rule ignores this information and only considers the character that caused the mismatch in the text. To rectify this we need to develop a second shift rule – the good suffix rule – that considers how to shift the pattern based on the region that matched the text. The bad character and the good suffix rules are then used in conjunction determine the best possible shift for a given alignment.
Let’s develop the good suffix rule by considering an extended example and relating it to the general case as we proceed. Consider the pattern pat[1 . . . 11] = acababacaba and it’s alignment with a substring txt[j . . . j + 10] of a reference text shown in figure 7.
Figure 7: An alignment between txt[j . . . j + 10] and the pattern pat[1 . . . 11] = acababacaba.
If we make a right-to-left scan we discover that,
pat[9 . . . 11] = txt[j + 8 . . . j + 10] and pat[8] ̸= txt[j + 7],
as indicated by the green and red boxes, respectively in figure 8.
Figure 8: The matching regions and mismatched characters are indicated by the green and red boxes respectively.
In this alignment the bad character rule would suggest shifting the pattern right by 1 place as R(b) = 10, while the extended bad character rule would suggest a shift of 2 places as R8(b) = 6. However, we can do better than this.
Let’s first generalise this situation and imagine that during a particular iteration we’ve aligned pat[1 . . . m] with txt[j . . . j + m − 1]. In addition, assume that,
pat[k + 1 . . . m] = txt[j + k . . . j + m − 1] and pat[k] ̸= txt[j + k − 1], 9
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
in complete analogy to our example as shown in figure 9, where we’ve let x and y denote txt[j + k − 1] and pat[k] respectively.
Figure 9: A general alignment between pat[1…m] and txt[j …j + m − 1]. The regions pat[k + 1…m] and txt[j+k . . . j+m−1] are found to match via right-to-left scan, before the characters pat[k] ≡ y and txt[j+k−1] ≡ x are found to mismatch.
Now assume that pattern contains a substring (excluding the suffix itself ) that matches the suffix pat[k + 1 . . . m] that matched the text, as shown in figure 10. If we let p denote the right endpoint of this copy of the good suffix, then shifting the pattern to the right by m − p positions will align it with the matched region in the text, as shown in figure 11. Our good suffix shift is guaranteed to be safe provided that we use the rightmost substring that matches the suffix. However, we can impose an additional constraint on this substring to obtain more useful shifts. In particular, we require that the character (z) immediately preceding the substring (see figure 10) to be different to pat[k] ≡ y, otherwise we would mismatch in the exact same manner after the shift as shown in figure 11. This variant of the good suffix rule is called the strong good suffix rule and we will simply refer to it as the good suffix rule from this point onwards.
Figure 10: The pattern contains a substring that matches the suffix pat[k + 1 . . . m] that matched the text. The substring (denoted by the leftmost green box) that matches its suffix (the rightmost green box in the pattern) ends at position p, and the character immediately preceding this substring (z) is different to the mismatched character pat[k] ≡ y.
1. THE BOYER-MOORE AND KNUTH-MORRIS-PRATT ALGORITHMS
Figure 11: Shifting the pattern to the right by m − p positions will align the substring that matches its suffix with the region that was originally matched in the text.
To enable good suffix shifts we need to store the rightmost substrings that match their suffix. In particular, for each suffix in the pattern, we want to store the right end point p of the rightmost substring that matches the suffix and is immediately preceded by a character that differs from the character that immediately precedes the suffix. To formalise this idea we will make use of the following definition.
Now for each suffix starting
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com