编程代写 FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt

COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
This material has been reproduced and communicated to you by or on behalf of Monash University pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 1 / 30

Prepared by: [ ] Extended by:
FIT3155: Advanced Algorithms and Data Structures
Week 2: Linear-time string pattern matching (Boyer-Moore, Knuth-Morris-Pratt algorithms)
Faculty of Information Technology, Monash University
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 2 / 30

We looked at…
The Z-algorithm
The Z-algorithm is a general (preprocessing) algorithm to Computes
Zi-values of a string str in O(|str|)-time (refer week 1’s lecture). The exact pattern matching problem, addressed using Z-algorithm
Given a reference text txt[1 . . . n] and a pattern pat[1 . . . m], we used the Z-algorithm to find ALL occurrences, if any, of pat in txt in O(m + n) time.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 3 / 30

Week 1 recap: Na ̈ıve approach
1 2 3 4 5 6 7 8 9 10 11 12 13
txt: xabxyabxyabxz
pat: abxyabxz iteration-1 #
pat: abxyabxz iteration-2 # pat: abxyabxz iteration-3 #
pat: abxyabxz iteration-4 #
pat: abxyabxz iteration-5 #
pat: abxyabxz
iteration-6 Overall 20 comparisons in the na ̈ıve approach, on this example.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 4 / 30

Week 1 recap: Main ideas to improve na ̈ıve pattern matching
We imagined ideas to:
shift pat more aggressively (by > 1 positions, where feasible) below
the txt, and
ensure we never miss an occurrence of pat in txt.
We will look at two specific algorithms that implement these ideas efficiently and in concrete ways:
1 Boyer-Moore
2 Knuth-Morris-Pratt
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 5 / 30

2. The Boyer-Moore Algorithm
Defines the standard benchmark for string pattern matching.
Many find/search utilities that come packaged within programming languages/operating systems rely on this algorithm.
Incorporated within GNU’s implementation of grep.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 6 / 30

Boyer-Moore Algorithm – Introduction
Boyer-Moore algorithm incorporate three main ideas:
1 right-to-left scanning
2 bad character shift rule
3 good suffix shift rule
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 7 / 30

Boyer-Moore Algorithm – right-to-left scan
For any comparison of pat[1 . . . m] against txt[j . . . j + m − 1], the
Boyer-Moore algorithm checks/scans for matched characters right to left. Example: right to left scanning (in some arbitrary iteration)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: x p b c t b x a b p q x c t b p q
←−| | | 1234567 pat: tpabxab
Scanning from right to left in the above example: Compare pat[7] ≡ b with txt[9] ≡ b: match. Compare pat[6] ≡ a with txt[8] ≡ a: match. Compare pat[5] ≡ x with txt[7] ≡ x: match. Compare pat[4] ≡ b with txt[6] ≡ b: match. Compare pat[3] ≡ a with txt[5] ≡ t: mismatch.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 8 / 30

Boyer-Moore Algorithm – right-to-left scan
For any comparison of pat[1 . . . m] against txt[j . . . j + m − 1], the
Boyer-Moore algorithm checks/scans for matched characters right to left. Example: right to left scanning (in some arbitrary iteration)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: x p b c t b x a b p q x c t b p q
←−| | | 1234567 pat: tpabxab
Scanning from right to left in the above example: Compare pat[7] ≡ b with txt[9] ≡ b: match. Compare pat[6] ≡ a with txt[8] ≡ a: match. Compare pat[5] ≡ x with txt[7] ≡ x: match. Compare pat[4] ≡ b with txt[6] ≡ b: match. Compare pat[3] ≡ a with txt[5] ≡ t: mismatch.
After a mismatch during right-to-left scanning, Boyer-Moore tries to learn something from the character-pairs that were already matched in the current iteration, to explore if pat can shift rightwards under txt by ≥ 1 position. It does this employing two additional ideas/tricks discussed below.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 8 / 30

Boyer-Moore Algorithm – Bad character shift rule Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: xpbctbxababxctbpq pat: tpabxab
In the above arbitrary iteration that is trying to check if pat[1 . . . 7] matches exactly with txt[3…9],
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 9 / 30

Boyer-Moore Algorithm – Bad character shift rule Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: xpbctbxababxctbpq pat: tpabxab
iter-(i)
In the above arbitrary iteration that is trying to check if pat[1 . . . 7] matches exactly with txt[3…9],
…scanning right-to-left, we found txt[6 . . . 9] matched with pat[4 . . . 7],
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 9 / 30

Boyer-Moore Algorithm – Bad character shift rule Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 txt: x p b c Ot b x a b a b x c t b p q
pat: tpabxab iter-(i) #
In the above arbitrary iteration that is trying to check if pat[1 . . . 7] matches exactly with txt[3…9],
…scanning right-to-left, we found txt[6 . . . 9] matched with pat[4 . . . 7],
… and the scanning stopped at the mismatch when txt[5] ≡ t was compared with pat[3] ≡ a. The mismatched character txt[5] ≡ t is termed the ‘bad character’ in this iteration.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 9 / 30

Boyer-Moore Algorithm – Bad character shift rule Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 txt: x p b c Ot b x a b a b x c t b p q
pat: Ot p a b x a b iter-(i) #
In the above arbitrary iteration that is trying to check if pat[1 . . . 7] matches exactly with txt[3…9],
…scanning right-to-left, we found txt[6 . . . 9] matched with pat[4 . . . 7],
… and the scanning stopped at the mismatch when txt[5] ≡ t was compared with pat[3] ≡ a. The mismatched character txt[5] ≡ t is termed the ‘bad character’ in this iteration.
However, we observe that the rightmost occurrence in the entire pat of the bad (i.e. mismatched) character in txt (txt[5] ≡ t) in this iteration appears at position 1 of pat (i.e., pat[1] ≡ t).
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 9 / 30

Boyer-Moore Algorithm – Bad character shift rule Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
txt: x p b c Ot b x a b a b x c t b p q
pat: tpabxab
pat: −→ Ot p a b x a b
iter-(i+1) shift right by 2 places to line up the two Ot s (bad character from prev iteration)
In the above arbitrary iteration that is trying to check if pat[1 . . . 7] matches exactly with txt[3…9],
…scanning right-to-left, we found txt[6 . . . 9] matched with pat[4 . . . 7],
… and the scanning stopped at the mismatch when txt[5] ≡ t was compared with pat[3] ≡ a. The mismatched character txt[5] ≡ t is termed the ‘bad character’ in this iteration.
However, we observe that the rightmost occurrence in the entire pat of the bad (i.e. mismatched) character in txt (txt[5] ≡ t) in this iteration appears at position 1 of pat (i.e., pat[1] ≡ t).
So, in this scenario, one can safely shift pat by two places to the right under the txt so as to align pat[1] ≡ t under txt[5] ≡ t (instead of na ̈ıvely shifting by only one place).
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 9 / 30

Boyer-Moore Algorithm – Formalizing ‘Bad character’ shift rule
Let pat[1 . . . m] and txt[1 . . . n] be strings from the alphabet א.
Preprocess pat, and store for each character x ∈ א, the rightmost position of occurrence of character x in pat.
􏰥 Call this position, R(x).
􏰥 Note, R(x) = 0 when x does not occur in pat.
R(x) values stored in a rudimentary array data structure 1234567
TheR(x)valuesforpat[1…7]=tbapxab are:
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 10 / 30

Boyer-Moore Algorithm – Formalizing ‘Bad character’ shift rule
Let pat[1 . . . m] and txt[1 . . . n] be strings from the alphabet א.
Preprocess pat, and store for each character x ∈ א, the rightmost position of occurrence of character x in pat.
􏰥 Call this position, R(x).
􏰥 Note, R(x) = 0 when x does not occur in pat.
R(x) values stored in a rudimentary array data structure 1234567
TheR(x)valuesforpat[1…7]=tbapxab are:
Storing R(x) values for pat requires at most O(|א|) space, and one lookup per mismatch.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 10 / 30

Boyer-Moore Algorithm – Formalizing ‘Bad character’ shift rule (continued)
More abstractly, in some iteration, if txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via the right-to-left scan, and
…the kth character of txt counting from position j, i.e. x = txt[j + k − 1], be the bad character that mismatched with the kth character of the pattern y = pat[k], then …using the preprocessed R(·) data structure, (potentially) locate the rightmost occurrence R(x) of the bad character x in pat.
Shift/Jump RULE: Then, if R(x) < k, shift pat rightwards by k − R(x) positions. Else, no choice with the simple R(·) data structure but to shift pat right na ̈ıvely by 1 position. (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 11 / 30 Boyer-Moore Algorithm – Formalizing ‘Bad character’ shift rule (continued) More abstractly, in some iteration, if txt[j . . . j + m − 1] and pat[1 . . . m] are being compared via the right-to-left scan, and ...the kth character of txt counting from position j, i.e. x = txt[j + k − 1], be the bad character that mismatched with the kth character of the pattern y = pat[k], then ...using the preprocessed R(·) data structure, (potentially) locate the rightmost occurrence R(x) of the bad character x in pat. Shift/Jump RULE: Then, if R(x) < k, shift pat rightwards by k − R(x) positions. Else, no choice with the simple R(·) data structure but to shift pat right na ̈ıvely by 1 position. 􏰥 Note: The above rule automatically implies that if x does not occur in pat[1..m] (i.e., R(x) = 0), then the pat gets shifted one position past the point of mismatch in txt. (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 11 / 30 Self study: Bad character shift rule example Consider the pattern and text below. The R(x) values for the pattern are: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 txt: x p b c t b x a b a c b x t b p q a pat: t b a p x a b (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 12 / 30 Self study: Bad character shift rule example Consider the pattern and text below. The R(x) values for the pattern are: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 txt: x p b c t b x a b a c b x t b p q a pat: t b a p x a b pat: tbapxab i=2 # (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 12 / 30 Self study: Bad character shift rule example Consider the pattern and text below. The R(x) values for the pattern are: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 txt: x p b c t b x a b a c b x t b p q a pat: t b a p x a b pat: tbapxab i=2 # pat: tbapxab i=3 # (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 12 / 30 Self study: Bad character shift rule example Consider the pattern and text below. The R(x) values for the pattern are: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 txt: x p b c t b x a b a c b x t b p q a pat: t b a p x a b pat: tbapxab pat: tbapxab i=3 # pat: tbapxab i=4 # (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 12 / 30 Self study: Bad character shift rule example Consider the pattern and text below. The R(x) values for the pattern are: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 txt: x p b c t b x a b a c b x t b p q a pat: t b a p x a b pat: tbapxab pat: tbapxab pat: tbapxab pat: tbapxab i=5 # (FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 12 / 30 Extended bad-character (shift) rule Recall from the bad character shift rule (see slide 11, last overlay), that when R(x) > k, then we shift na ̈ıvely by 1 position.
How can we extend the R(·) data structure so that we can do better than that, when any R(x) > k?
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 13 / 30

Extended bad-character (shift) rule Extended Bad-Character Rule
When a mismatch occurs at some position k in pat[1 . . . m], and the corresponding mismatched character is x = txt[j + k − 1], then shift pat[1..m] to the right so that the closest x in pat that is to the left of position k is now below the (previously mismatched) x in txt.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 13 / 30

Extended bad-character (shift) rule
To achieve this, preprocess pat[1 . . . m] so that… 􏰥 foreachposition1≤k≤minpat,
􏰥 and for each character x ∈ א,
􏰥 the position of the closest occurrence of x to the left of each
position k
…can be efficiently looked up.
If we call such a position (for any combination of k and x) Rk(x), then the extended bad character (shift) rule tells us to shift pat by k − Rk(x) positions, for a mismatch at any position k in pat.
Note: The rudimentary implementation of the Rk(·) data structure is a 2D array (or shift/jump table) of size m × |א|. (Self-study: Think how this can be implemented more space-efficiently? Does its use come by trading-off time with space?)
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 13 / 30

Extended bad-character (shift) rule
Self-study: Example of a basic extended Rk(·) data structure
For pat[1…7] = t b a p x a b , one could store the corresponding Rk(x) values in the following rudimentary 2D matrix of size m × |א|:
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 13 / 30

Boyer-Moore Algorithm – Good suffix rule: Motivation
In some iteration, say txt[j …j + m − 1] and pat[1…m] are being compared via right-to-left scan.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 14 / 30

Boyer-Moore Algorithm – Good suffix rule: Motivation
In some iteration, say txt[j …j + m − 1] and pat[1…m] are being compared via right-to-left scan.
Let the kth character of txt counting from j, i.e. x = txt[j + k − 1], be mismatched with the kth character of the pattern y = pat[k].
The plain R(·) based bad character rule will only suggest a na ̈ıve shift of pat by 1 position to the right.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 14 / 30

Boyer-Moore Algorithm – Good suffix rule: Motivation
In some iteration, say txt[j …j + m − 1] and pat[1…m] are being compared via right-to-left scan.
Let the kth character of txt counting from j, i.e. x = txt[j + k − 1], be mismatched with the kth character of the pattern y = pat[k].
The plain R(·) based bad character rule will only suggest a na ̈ıve shift of pat by 1 position to the right.
The extended Rk(·) based bad character rule suggests a shift of 2 positions to the right.
(FIT3155 S1 2022, Monash University) W2: Boyer-Moore & Knuth-Morris-Pratt 14 / 30

程序代写 CS代考加微信: powcoder QQ: 1823890830 Email: powcoder@163.com

Related Posts