CS代考 THE UNIVERSITY OF NEW SOUTH WALES

THE UNIVERSITY OF NEW SOUTH WALES
9. STRING MATCHING
Raveen de Silva,
office: K17 202
Course Admin: ,
School of Computer Science and Engineering UNSW Sydney
Term 3, 2021

Table of Contents
1. Introduction
2. Hashing
3. Finite Automata 4. Puzzle

String Matching algorithms
Suppose you have an alphabet S = {s0,s1,…,sd−1} of d characters.
You want to determine whether a string B = b0b1 . . . bm−1 appears as a (contiguous) substring of a much longer string A = a0a1 …an−1.
The “naive” string matching algorithm runs in O(nm); we want to do better.

Table of Contents
1. Introduction
2. Hashing
3. Finite Automata 4. Puzzle

Rabin –
We now show how hashing can be combined with recursion to produce an efficient string matching algorithm.
We compute a hash value for the string B = b0b1b2 . . . bm−1 in the following way.
First, map each symbol si to a corresponding integer i: S = { s 0 , s 1 , s 2 , . . . , s d − 1 } −→ { 0 , 1 , 2 , . . . , d − 1 } ,
so as to identify each string with a sequence of these integers.
Hereafter, when we refer to an integer ai or bi , we really mean the ID of the symbol ai or bi.

Rabin –
We can therefore identify B with a sequence of IDs
⟨b0, b1, b2, . . . , bm−1⟩, each between 0 and d − 1 inclusive. Viewing these IDs as digits in base d, we can construct a corresponding integer
h(B) = h(b0b1b2 . . . bm) = dm−1b0+dm−2b1+. . .+d·bm−2+bm−1.
This can be evaluated efficiently using Horner’s rule:
h(B) = bm−1+d(bm−2+d(bm−3+d(bm−4+. . .+d(b1+d·b0) . . .))), requiring only m − 1 additions and m − 1 multiplications.
Next we choose a large prime number p and define the hash valueofB asH(B)=h(B)modp. Werequirethat(d+1)p fits in a register; it will be made clear later why this is important.

Rabin –
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……an−1 where n ≫ m.
We want to find efficiently all s such that the string of length m of the form asas+1 …as+m−1 and string b0b1 …bm−1 are equal.
ForeachcontiguoussubstringAs =asas+1…as+m−1 of string A we also compute its hash value as
H(As) = dm−1as+dm−2as+1+…+d1as+m−2+as+m−1 mod p.

Rabin –
We can now compare the hash values H(B) and H(As) and do a symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol comparison only if we can compute the hash values of substrings As faster than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can compute it efficiently from the hash value H(As ) of As = as as+1 . . . as+m−1 as follows.

Rabin –
Since
H(As)=dm−1as +dm−2as+1 +…d1as+m−2 +as+m−1 modp,
by multiplying both sides by d we obtain
d ·H(As)
=dmas +dm−1as+1 +…+d1as+m−1
=dmas +(dm−1as+1 +…+d1as+m−1 +as+m)−as+m = dmas + H(As+1) − as+m mod p

Rabin –
Consequently,
H(As+1)=d·H(As)−dmas +as+m modp.
To find dmas mod p, we use the precomputed value
dm mod p, multiply it by as and again take the remainder modulo p.
Also, since (−dmas + as+m) mod p and H(As) are each less than p, it follows that
d·H(As)+[(−dmas +as+m)modp]<(d+1)p. Thus, since we chose p such that (d + 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Rabin - Thus, we first compute H(B) and H(A0) using Horner’s rule. The O(n) subsequent values of H(As) for s > 0 are computed
in constant time using the above recursion.
H(As) is compared with H(B), and if they are equal the strings As and B are compared by brute force character by character to confirm whether they are genuinely equal.

Rabin –
Since p was chosen large, the false positives when
H(As) = H(B) but As ̸= B are very unlikely, which makes the algorithm run fast in the average case.
However, as always when we use hashing, we cannot achieve useful bounds for the worst case performance.
So we now look for algorithms whose worst case performance is guaranteed to be linear.

Table of Contents
1. Introduction
2. Hashing
3. Finite Automata 4. Puzzle

String matching with finite automata
A string matching finite automaton for a pattern B of length m has:
m+1 many states 0,1,…,m, which correspond to the number of characters matched thus far, and
a transition function δ(s, c), where 0 ≤ s ≤ m and c ∈ S.
Suppose that the last s characters of the text A match the first s characters of the pattern B, and that c is the next character in the text. Then δ(s,c) is the new state after character c is read, i.e. the largest s′ so that the last s′ characters of A (ending at the new character c) match the first s′ characters of B.

String matching with finite automata
We first suppose that δ(s,c) is given as a pre-constructed table; we will discuss how to construct this table later. As an example, consider the string B = xyxyxzx. The table defining δ(s,c) would then be as follows:
s
0
matched
x
1
y
0
z
0
1
x
1
2
0
2
xy
3
0
0
3
xyx
1
4
0
4
xyxy
5
0
0
5
xyxyx
1
4
6
6
xyxyxz
7
0
0
7
xyxyxzx
1
2
0

String matching with finite automata
This table can be visualised as a state transition diagram. From each state s, we draw an arrow to δ(s,c) for each character c that could be encountered next. Arrows pointing to state 0 have been omitted for clarity.
B = xyxyxzx xx
x xyxyxzx
01234567
x
y y

String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk = b0 …bk−1 denote a prefix of length k of the string B.
Being at state k means that so far we have matched the prefix Bk.
If we now see an input character a, then δ(k,a) is the largest m such that the prefix Bm of string B is a suffix of the string Bk a.
In the particular case where a happens to be bk, i.e. Bk a = Bk+1, then m = k + 1 and so δ(k, a) = k + 1.

String matching with finite automata
But what if a ̸= bk ? We can’t extend our match from length k to k + 1.
We’d like to extend some shorter match.
How do we find δ(k,a), the largest m such that Bm is a suffix
of Bka?
We match the string against itself: we can recursively compute a function π(k) which for each k returns the largest integer m such that the prefix Bm of B is a proper suffix of Bk .

String matching with finite automata
Suppose we have already found that π(k), i.e.
Bπ(k) = b0 . . . bπ(k)−1 is the longest prefix of B which is a proper suffix of Bk.
To compute π(k + 1), we first check whether bk = bπ(k). If true, then π(k + 1) = π(k) + 1.
If false, then we cannot extend Bπ(k). What’s the next longest prefix of B which is a proper suffix of Bk?
It’s Bπ(π(k))! So we check whether bk = bπ(π(k)).
If true, then π(k + 1) = π(π(k)) + 1.
If false, check whether bk = bπ(π(π(k))) . . .

The Knuth-Morris-Pratt algorithm
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20:
function Compute-Prefix-Function(B) m←|B|
let π[1..m] be a new array, and π[1] ← 0 l←0
fork=1tom−1do
while l≥0 do
if bk = bl then
l←l+1
break end if
if l > 0 then l ← π[l]
else break
end if end while
π[k + 1] ← l end for
return π end function

The Knuth-Morris-Pratt algorithm
What is the complexity of this algorithm? There are O(m) values of k, and for each we might try several values l; is this O(m2)?
No! It is actually linear, i.e. O(m).
Maintain two pointers: the left pointer at k + 1 − l (the start point of the match we are trying to extend) and the right pointer at k.
After each ‘step’ of the algorithm (i.e. each comparison between bk and bl), exactly one of these two pointers is moved forwards.
Each can take up to m values, so the total number of steps is O(m). This is an example of amortisation.

The Knuth-Morris-Pratt algorithm
We can now do our search for string B in a longer string A. Suppose Bs is the longest prefix of B which is a suffix of
Ai =a0…ai−1.
To answer the same question for Ai+1, we begin by checking whether ai = bs .
If true, then the answer for Ai+1 is s + 1 If false, check whether ai = bπ(s) . . .
If the answer for any Ai is m, we have a match!
Reset to state π(m) to detect any overlapping full matches.
By the same two pointer argument, the time complexity is O(n).

The Knuth-Morris-Pratt algorithm
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11:
function KMP-Matcher(A,B) n←|A|
m←|B|
π ← Compute-Prefix-Function(B ) s←0
fori=0ton−1do
while s ≥ 0 do
if ai = bs then
s←s+1
break end if

The Knuth-Morris-Pratt algorithm
12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23:
ifs>0then s ← π[s]
else break
end if end while
if s = m then
print match found from index i − m s ← π[m]
end if end for
end function

Looking for imperfect matches
Sometimes we are not interested in finding just the perfect matches, but also in matches that might have a few errors, such as a few insertions, deletions and replacements.
Problem
Instance: a very long string
A = a0a1a2a3 ……asas+1 …as+m−1 ……an−1,
a shorter string B = b0b1b2 . . . bm−1 where m ≪ n, and an integer k ≪ m.
Task: find all matches for B in A which have up to k errors.

Looking for imperfect matches
Solution Outline
Split B into k + 1 substrings of (approximately) equal length. Then any match in A with at most k errors must contain a substring which is a perfect match for a substring of B.
We look for all perfect matches in A for each of the k + 1 parts of B. For every match, we test by brute force whether the remaining parts of B match sufficiently with the appropriate parts of A.

Table of Contents
1. Introduction
2. Hashing
3. Finite Automata 4. Puzzle

PUZZLE!!
On a rectangular table there are 25 round coins of equal size, and no two of these coins overlap. You observe that in current arrangement, it is not possible to add another coin without overlapping any of the existing coins and without the coin falling off the table (for a coin to stay on the table its centre must be within the table).
Show that it is possible to completely cover the table with 100 coins (allowing overlaps).

That’s All, Folks!!