THE UNIVERSITY OF NEW SOUTH WALES
7. STRING MATCHING
Raveen de Silva,
office: K17 202
Course Admin: ,
School of Computer Science and Engineering UNSW Sydney
Term 1, 2022
Table of Contents
1. Introduction
2. Hashing
3. Finite Automata
4. Puzzle
String matching algorithms
Suppose you have an alphabet consisting of d symbols. Strings are just sequences of these symbols.
String matching problem
Determine whether a string B = b0 b1 … bm−1 (the pattern) appears as a (contiguous) substring of a much longer string A = a0 a1 … an−1 (the text).
The “naïve” algorithm for string matching is as follows: for every position i in A, check ai … ai+m−1 against B character-by-character.
Even with early exit, this runs in O(mn) time in the worst case.
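As a concrete sketch, the naïve algorithm might look as follows in Python; the function name and the use of a slice comparison (standing in for the character-by-character check) are illustrative:

```python
def naive_match(text, pattern):
    """Check every alignment of pattern against text: O(mn) time in the worst case."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        # compare the m characters at alignment i against the pattern
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches
```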
Can we do better?
Table of Contents
1. Introduction
2. Hashing
3. Finite Automata
4. Puzzle
String matching with hashing
We now show how hashing can be combined with a recurrence to produce an efficient string matching algorithm.
We compute a hash value for the string B = b0b1b2 . . . bm−1 in the following way.
First, map each symbol to a corresponding integer
i ∈ {0, 1, 2, . . . , d − 1} so as to identify each string with a sequence of these integers.
Hereafter, when we refer to an integer ai or bi , we really mean the ID of the symbol ai or bi.
Polynomial rolling hash
We can therefore identify B with a sequence of IDs
⟨b0, b1, b2, . . . , bm−1⟩, each between 0 and d − 1 inclusive. Viewing these IDs as digits in base d, we can construct a corresponding integer
h(B) = h(b0 b1 b2 … bm−1) = d^(m−1) b0 + d^(m−2) b1 + … + d·bm−2 + bm−1.
This can be evaluated efficiently using Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + … + d(b1 + d·b0) … ))), requiring only m − 1 additions and m − 1 multiplications.
Next we choose a large prime number p and define the hash value of B as H(B) = h(B) mod p.
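A minimal Python sketch of this hash function using Horner’s rule; the parameters d = 256 (taking a character’s ord value as its ID) and p = 10^9 + 7 are illustrative assumptions, not values fixed by the slides:

```python
def poly_hash(s, d=256, p=10**9 + 7):
    """H(s): interpret s as a base-d number and reduce mod p, via Horner's rule."""
    h = 0
    for c in s:
        h = (h * d + ord(c)) % p   # one addition and one multiplication per symbol
    return h
```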
Polynomial rolling hash
Recall that A = a0 a1 a2 … as as+1 … as+m−1 … an−1, where n ≫ m.
We want to efficiently find all s such that the length-m substring as as+1 … as+m−1 equals b0 b1 … bm−1.
For each contiguous substring As = as as+1 … as+m−1 of string A we also compute its hash value as
H(As) = (d^(m−1) as + d^(m−2) as+1 + … + d·as+m−2 + as+m−1) mod p.
Polynomial rolling hash
We can now compare the hash values H(B) and H(As). If they disagree, we know that there is no match.
If they agree, there might be a match. To confirm this, we’ll check it character-by-character.
There are O(n) substrings As to check. If we compute each hash value H(As) in O(m) time, this is no better than the naïve algorithm.
This is where the recurrence comes into play: we do not have to compute the hash value H(As+1) “from scratch”. Instead, we can compute it efficiently from the previous hash value H(As).
Polynomial rolling hash
Recall that As = as as+1 … as+m−1, so
H(As) = (d^(m−1) as + d^(m−2) as+1 + … + as+m−1) mod p.
Multiplying both sides by d gives
d·H(As) = d^m as + d^(m−1) as+1 + … + d·as+m−1
= d^m as + (d^(m−1) as+1 + … + d·as+m−1 + as+m) − as+m
= d^m as + H(As+1) − as+m (mod p),
since As+1 = as+1 … as+m−1 as+m.
Polynomial rolling hash
Rearranging the last equation, we have
H(As+1) = (d·H(As) − d^m as + as+m) mod p.
Thus we can compute H(As+1) from H(As) in constant time.
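The recurrence can be checked with a small sketch (same illustrative d and p as before; poly_hash recomputes a window hash from scratch for comparison):

```python
def poly_hash(s, d=256, p=10**9 + 7):
    """Hash of a window, computed from scratch by Horner's rule."""
    h = 0
    for c in s:
        h = (h * d + ord(c)) % p
    return h

def roll(h, out_char, in_char, dm, d=256, p=10**9 + 7):
    """H(A_{s+1}) = (d*H(A_s) - d^m * a_s + a_{s+m}) mod p, where dm = d^m mod p."""
    return (h * d - dm * ord(out_char) + ord(in_char)) % p
```

For example, rolling the window "hel" of "hello" one step yields the hash of "ell" in constant time rather than O(m).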
When implementing this, we need to make sure that the intermediate values are not too large. In particular, we need to choose p such that (d + 1)p fits in a single register. This conflicts with the desire to choose large primes for better hashing.
1. First compute H(B) and H(A0) in O(m) time using Horner’s rule.
2. Then compute the O(n) subsequent values of H(As) for s > 0 each in constant time using the recurrence above.
3. Compare H(As) with H(B), and if they are equal then confirm the potential match by checking the strings As and B character-by-character.
Since p was chosen large, false positives (where H(As) = H(B) but As ≠ B) are very unlikely, which makes the algorithm fast in the average case.
However, as always when we use hashing, we cannot achieve useful bounds for the worst case performance.
So we now look for algorithms whose worst case performance is guaranteed to be linear.
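Steps 1–3 assemble into the hashing-based matcher commonly known as the Rabin–Karp algorithm; this is a sketch, again with illustrative choices of d and p:

```python
def rabin_karp(text, pattern, d=256, p=10**9 + 7):
    """All start indices of pattern in text; expected O(n + m) time."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    dm = pow(d, m, p)                  # d^m mod p, used in the rolling update
    hb = ht = 0
    for i in range(m):                 # Horner's rule for both initial hashes
        hb = (hb * d + ord(pattern[i])) % p
        ht = (ht * d + ord(text[i])) % p
    matches = []
    for s in range(n - m + 1):
        # hash agreement only signals a candidate; confirm character by character
        if ht == hb and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # H(A_{s+1}) = (d*H(A_s) - d^m * a_s + a_{s+m}) mod p
            ht = (ht * d - dm * ord(text[s]) + ord(text[s + m])) % p
    return matches
```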
Table of Contents
1. Introduction
2. Hashing
3. Finite Automata
4. Puzzle
String matching with finite automata
A string matching finite automaton for a pattern B of length m has:
m + 1 states 0, 1, …, m, which correspond to the number of characters matched thus far, and
a transition function δ(k, a), where 0 ≤ k ≤ m and a ∈ S.
Suppose that the last k characters of the text A match the first k characters of the pattern B, and that a is the next character in the text. Then δ(k, a) is the new state after character a is read, i.e. the largest k′ such that the last k′ characters of A (ending at the new character a) match the first k′ characters of B.
String matching with finite automata
As an example, consider the string B = xyxyxzx. The table defining δ(k,a) would then be as follows:
String matching with finite automata
This table can be visualised as a state transition diagram. From each state k, we draw an arrow to δ(k,a) for each character a that could be encountered next. Arrows pointing to state 0 have been omitted for clarity.
B = xyxyxzx
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk = b0 …bk−1 denote a prefix of length k of the string B.
Being at state k means that so far we have matched the prefix Bk.
If we now see an input character a, then δ(k,a) is the largest l such that the prefix Bl of string B is a suffix of the string Bk a.
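A deliberately naïve Python sketch computing δ(k, a) straight from this definition (far too slow for building the real table, but it pins down exactly what δ means; the function name is illustrative):

```python
def transition(B, k, a):
    """delta(k, a): the largest l such that prefix B[:l] is a suffix of B[:k] + a."""
    s = B[:k] + a
    for l in range(min(k + 1, len(B)), -1, -1):
        if s.endswith(B[:l]):   # l = 0 always succeeds: the empty prefix
            return l
```

For B = xyxyxzx this reproduces the entries of the transition table, e.g. δ(5, z) = 6 and δ(5, y) = 4.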
String matching with finite automata
How do we find δ(k,a), the largest l such that Bl is a suffix of Bk a?
If a happens to match with bk , i.e. Bk a = Bk+1, then l = k + 1 and so δ(k, a) = k + 1.
But what if a ≠ bk? Then we can’t extend our match from length k to k + 1.
We’d like to extend some shorter match instead.
Any other candidates to be extended must be both a proper suffix of Bk and a prefix of B.
The failure function
Definition
Let π(k) be the largest integer l such that the prefix Bl of B is a proper suffix of Bk.
Bl is therefore the longest substring which appears at both the start and the end of Bk (allowing partial but not total overlap).
We will refer to π as the failure function.
The failure function
The transition function δ(k,a) is closely related to this failure function.
If a = bk , then δ(k, a) = k + 1.
Otherwise, we look to extend a shorter match. The next candidate has length π(k), so we check whether a = bπ(k) and if so we have δ(k, a) = π(k) + 1.
Otherwise, the next candidate has length π(π(k)) (why?) and so on.
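This candidate chain k, π(k), π(π(k)), … can be sketched directly; here π is assumed to be stored 1-indexed in a list whose slot 0 is unused, matching the slides’ convention:

```python
def delta_from_pi(B, pi, k, a):
    """delta(k, a) via the failure function: try to extend ever-shorter matches."""
    while True:
        if k < len(B) and a == B[k]:   # extend the current match by one character
            return k + 1
        if k == 0:                     # no shorter match left to fall back on
            return 0
        k = pi[k]                      # next candidate has length pi(k)
```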
We will describe the algorithm and complete the analysis entirely in terms of the failure function.
The failure function
Our first step is to compute the value of the failure function at each index of the pattern B.
We will do this using dynamic programming. Naturally, the subproblems are the π(k) values for each 1 ≤ k ≤ m, and the base case is π(1) = 0.
Suppose we have solved π(1), . . . , π(k). How do we find π(k + 1)?
The failure function
Recall that Bπ(k) = b0 . . . bπ(k)−1 is the longest prefix of B which is a proper suffix of Bk = b0 …bk−1.
To compute π(k + 1), we first try to extend this match, by checking whether bk = bπ(k).
If true, then π(k + 1) = π(k) + 1.
If false, then we cannot extend Bπ(k). What’s the next longest prefix of B which is a proper suffix of Bk? It’s Bπ(π(k))! So we check whether bk = bπ(π(k)).
If true, then π(k + 1) = π(π(k)) + 1.
If false, check whether bk = bπ(π(π(k))) . . .
The failure function
function Compute-Failure-Function(B)
    m ← |B|
    let π[1..m] be a new array, and π[1] ← 0
    l ← 0
    for k = 1 to m − 1 do
        while l ≥ 0 do
            if bk = bl then
                l ← l + 1
                break
            end if
            if l > 0 then
                l ← π[l]
            else
                break
            end if
        end while
        π[k + 1] ← l
    end for
    return π
end function
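A Python transcription of Compute-Failure-Function (strings 0-indexed; π kept 1-indexed in a list whose slot 0 is unused):

```python
def compute_failure_function(B):
    """pi[k] = length of the longest proper suffix of B[:k] that is a prefix of B."""
    m = len(B)
    pi = [0] * (m + 1)   # pi[0] unused; pi[1] = 0 is the base case
    l = 0
    for k in range(1, m):
        while l >= 0:
            if B[k] == B[l]:
                l += 1           # extend the current match of length l
                break
            if l > 0:
                l = pi[l]        # fall back to the next candidate length
            else:
                break            # no match possible at this position
        pi[k + 1] = l
    return pi
```

For the running example B = xyxyxzx this gives π(1), …, π(7) = 0, 0, 1, 2, 3, 0, 1.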
The Knuth-Morris-Pratt algorithm
What is the complexity of this algorithm? There are O(m) values of k, and for each we might try several values of l; is this O(m²)?
No! It is actually linear, i.e. O(m).
Maintain two pointers: the left pointer at k − l (the start point of the match we are trying to extend) and the right pointer at k.
After each ‘step’ of the algorithm (i.e. each comparison between bk and bl), exactly one of these two pointers is moved forwards.
Each can take up to m values, so the total number of steps is O(m). This is an example of amortisation.
The Knuth-Morris-Pratt algorithm
We can now search for the pattern B in the text A. Suppose Bs is the longest prefix of B which is a suffix of Ai = a0 … ai−1.
To answer the same question for Ai+1, we begin by checking whether ai = bs.
If true, then the answer for Ai+1 is s + 1.
If false, check whether ai = bπ(s), and so on.
If the answer for any Ai is m, we have a match!
Reset to state π(m) to detect any overlapping full matches.
By the same two pointer argument, the time complexity is O(n).
The Knuth-Morris-Pratt algorithm
function KMP-Matcher(A, B)
    n ← |A|
    m ← |B|
    π ← Compute-Failure-Function(B)
    s ← 0
    for i = 0 to n − 1 do
        while s ≥ 0 do
            if ai = bs then
                s ← s + 1
                break
            end if
            if s > 0 then
                s ← π[s]
            else
                break
            end if
        end while
        if s = m then
            print “match found starting at index i − m + 1”
            s ← π[m]
        end if
    end for
end function
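An equivalent self-contained Python sketch; it uses the standard while-loop formulation of KMP in place of the slides’ break-based loops, but the behaviour is the same:

```python
def kmp_match(A, B):
    """All start indices of (possibly overlapping) occurrences of B in A, in O(n + m)."""
    n, m = len(A), len(B)
    if m == 0 or m > n:
        return []
    # failure function: pi[k] = longest proper suffix of B[:k] that is a prefix of B
    pi = [0] * (m + 1)
    l = 0
    for k in range(1, m):
        while l > 0 and B[k] != B[l]:
            l = pi[l]
        if B[k] == B[l]:
            l += 1
        pi[k + 1] = l
    # scan the text, maintaining s = length of the current partial match
    matches = []
    s = 0
    for i in range(n):
        while s > 0 and A[i] != B[s]:
            s = pi[s]
        if A[i] == B[s]:
            s += 1
        if s == m:
            matches.append(i - m + 1)   # match ends at index i
            s = pi[m]                   # reset to detect overlapping matches
    return matches
```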
Looking for imperfect matches
Sometimes we are interested not only in perfect matches, but also in matches with a few errors: a small number of insertions, deletions and replacements.
Instance: a text A = a0 a1 a2 … an−1, a pattern B = b0 b1 b2 … bm−1 where m ≪ n, and an integer k ≪ m.
Task: find all matches for B in A which have up to k errors.
Looking for imperfect matches
Solution Outline
Split B into k + 1 substrings of (approximately) equal length. Within any match in A with at most k errors, at least one of these k + 1 parts of B must be matched perfectly.
For each of the k + 1 parts of B, we use the Knuth-Morris-Pratt algorithm to find all perfect matches in A.
Then for each of these matches, we test by brute force whether the remaining parts of B match sufficiently with the appropriate parts of A.
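A Python sketch of this outline, simplified to substitution errors only (handling insertions and deletions needs a more careful verification step); str.find stands in for the per-piece KMP search, and all names are illustrative:

```python
def approx_match(A, B, k):
    """Start indices s where A[s:s+m] differs from B in at most k positions."""
    n, m = len(A), len(B)
    results = set()
    # split B into k+1 pieces of near-equal length
    bounds = [j * m // (k + 1) for j in range(k + 2)]
    for j in range(k + 1):
        lo, hi = bounds[j], bounds[j + 1]
        piece = B[lo:hi]
        i = A.find(piece)
        while i != -1:
            s = i - lo                          # implied start of the full match
            if 0 <= s <= n - m:
                # brute-force verification of the whole window
                errors = sum(A[s + t] != B[t] for t in range(m))
                if errors <= k:
                    results.add(s)
            i = A.find(piece, i + 1)
    return sorted(results)
```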
Table of Contents
1. Introduction
2. Hashing
3. Finite Automata
4. Puzzle
On a rectangular table there are 25 round coins of equal size, and no two of these coins overlap. You observe that in the current arrangement, it is not possible to add another coin without it overlapping one of the existing coins or falling off the table (for a coin to stay on the table, its centre must be within the table).
Show that it is possible to completely cover the table with 100 coins (allowing overlaps).
That’s All, Folks!!