NEW SOUTH WALES
Algorithms: COMP3121/9101
School of Computer Science and Engineering University of New South Wales
9. STRING MATCHING ALGORITHMS
COMP3121/3821/9101/9801 1 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears as a (contiguous) substring of a much longer string A = a0 a1 . . . an−1 .
COMP3121/3821/9101/9801 2 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears as a (contiguous) substring of a much longer string A = a0 a1 . . . an−1 .
The “naive” string matching algorithm does not work well if B is much longer than what can fit in a single register; we need something cleverer.
COMP3121/3821/9101/9801 2 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears as a (contiguous) substring of a much longer string A = a0 a1 . . . an−1 .
The “naive” string matching algorithm does not work well if B is much longer than what can fit in a single register; we need something cleverer.
We now show how hashing can be combined with recursion to produce an efficient string matching algorithm.
COMP3121/3821/9101/9801 2 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many symbols in total.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way. We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0,s1,s2,…,sd−1} −→ {0,1,2,…,d − 1}
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way. We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0,s1,s2,…,sd−1} −→ {0,1,2,…,d − 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits in base d are integers corresponding to each symbol in B:
h(B)=h(b0b1b2…bm)=dm−1b0 +dm−2b1 +…+d·bm−2 +bm−1
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way. We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0,s1,s2,…,sd−1} −→ {0,1,2,…,d − 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits in base d are integers corresponding to each symbol in B:
h(B)=h(b0b1b2…bm)=dm−1b0 +dm−2b1 +…+d·bm−2 +bm−1 This can be done efficiently using the Horner’s rule:
h(B)=bm−1 +d(bm−2 +d(bm−3 +d(bm−4 +…+d(b1 +d·b0)))…)
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0 b1 b2 . . . bm in the following way. We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0,s1,s2,…,sd−1} −→ {0,1,2,…,d − 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits in base d are integers corresponding to each symbol in B:
h(B)=h(b0b1b2…bm)=dm−1b0 +dm−2b1 +…+d·bm−2 +bm−1 This can be done efficiently using the Horner’s rule:
h(B)=bm−1 +d(bm−2 +d(bm−3 +d(bm−4 +…+d(b1 +d·b0)))…) Next we choose a large prime number p such that (d + 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p. COMP3121/3821/9101/9801
3 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m. We want to find efficiently all s such that the string of length m of the form
asas+1…as+m−1 andstringb0b1…bm−1 areequal.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m. We want to find efficiently all s such that the string of length m of the form
asas+1…as+m−1 andstringb0b1…bm−1 areequal.
For each contiguous substring As = as as+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (dm−1as + dm−2as+1 + . . . + d1as+m−2 + as+m−1) mod p
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m. We want to find efficiently all s such that the string of length m of the form
asas+1…as+m−1 andstringb0b1…bm−1 areequal.
For each contiguous substring As = as as+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (dm−1as + dm−2as+1 + . . . + d1as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a symbol-by-symbol matching only if H(B) = H(As).
COMP3121/3821/9101/9801
4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m. We want to find efficiently all s such that the string of length m of the form
asas+1…as+m−1 andstringb0b1…bm−1 areequal.
For each contiguous substring As = as as+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (dm−1as + dm−2as+1 + . . . + d1as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol comparison only if we can compute the hash values of substrings As faster than what it takes to compare strings B and As character by character.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1 where N >> m. We want to find efficiently all s such that the string of length m of the form
asas+1…as+m−1 andstringb0b1…bm−1 areequal.
For each contiguous substring As = as as+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (dm−1as + dm−2as+1 + . . . + d1as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol comparison only if we can compute the hash values of substrings As faster than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash value H(As+1) of As+1 = as+1as+2 …as+m “from scratch”, but we can compute it efficiently from the hash value H(As) of As = asas+1 …as+m−1 as follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Since
H(As) = (dm−1as + dm−2as+1 + . . . d1as+m−2 + as+m−1) mod p
COMP3121/3821/9101/9801 5 / 13
Rabin – Karp Algorithm
Since
H(As) = (dm−1as + dm−2as+1 + . . . d1as+m−2 + as+m−1) mod p
by multiplying both sides by d we obtain
(d · H(As)) mod p =
= (dmas + dm−1as+1 + . . . d · as+m−1) mod p
= (dmas + (dm−1as+1 + . . . d2as+m−2 + d as+m−1 + as+m) mod p − as+m) mod p = (dm as + H(As+1) − as+m) mod p
COMP3121/3821/9101/9801 5 / 13
Rabin – Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
COMP3121/3821/9101/9801 6 / 13
Rabin – Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
Note that
and that the value dm mod p can be precomputed and stored.
(dmas) mod p = ((dm mod p)as) mod p
COMP3121/3821/9101/9801
6 / 13
Rabin – Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
Note that
and that the value dm mod p can be precomputed and stored.
(dmas) mod p = ((dm mod p)as) mod p Also, (−dmas + as+m)mod p < p
COMP3121/3821/9101/9801
6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
Note that
(dmas) mod p = ((dm mod p)as) mod p
and that the value dm mod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain
d · H(As) + (−dmas + as+m) mod p < (d + 1)p
COMP3121/3821/9101/9801
6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
Note that
(dmas) mod p = ((dm mod p)as) mod p
and that the value dm mod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain
d · H(As) + (−dmas + as+m) mod p < (d + 1)p
Thus, since we chose p such that (d + 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d · H(As) − dmas + as+m) mod p.
Note that
(dmas) mod p = ((dm mod p)as) mod p
and that the value dm mod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain
d · H(As) + (−dmas + as+m) mod p < (d + 1)p
Thus, since we chose p such that (d + 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register.
Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
COMP3121/3821/9101/9801 7 / 13
Rabin - Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and B are compared by brute force character by character to see if they are equal.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and B are compared by brute force character by character to see if they are equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As ̸= B are very unlikely, which makes the algorithm run fast in practice.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and B are compared by brute force character by character to see if they are equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As ̸= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst case performance.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and B are compared by brute force character by character to see if they are equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As ̸= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst case performance.
So we now look for algorithms whose worst case performance can be guaranteed.
COMP3121/3821/9101/9801 7 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many states 0, 1, . . . k which correspond to the number of characters matched thus far and a transition function δ(s, c) where s is a state and c is a character red at the moment.
COMP3121/3821/9101/9801 8 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many states 0, 1, . . . k which correspond to the number of characters matched thus far and a transition function δ(s, c) where s is a state and c is a character red at the moment.
We first look at the case when such δ(s, c) is given by a pre-constructed table.
COMP3121/3821/9101/9801 8 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many states 0, 1, . . . k which correspond to the number of characters matched thus far and a transition function δ(s, c) where s is a state and c is a character red at the moment.
We first look at the case when such δ(s, c) is given by a pre-constructed table. To make things easier to describe, we consider the string S = ababaca. The table
defining δ(s, c) would then be
input state a b c
0100a
b,c
a
a a
1120b
b,c
0a1b2a3b4a5c6a7
2300a
3140b
a
b
b
4500a
c
5146c
6700a 7120
state transition diagram for string ababaca
COMP3121/3821/9101/9801 8 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk denote the prefix of the string B consisting of the first k characters of string B.
COMP3121/3821/9101/9801
9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk denote the prefix of the string B consisting of the first k characters of string B.
If we are at a state k this means that so far we have matched the prefix Bk; if we now see an input character a, then δ(k,a) is the largest m such that the prefix Bm of string B is the suffix of the string Bka.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk denote the prefix of the string B consisting of the first k characters of string B.
If we are at a state k this means that so far we have matched the prefix Bk; if we now see an input character a, then δ(k,a) is the largest m such that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1 and Bka = Bk+1.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the table?
Let Bk denote the prefix of the string B consisting of the first k characters of string B.
If we are at a state k this means that so far we have matched the prefix Bk; if we now see an input character a, then δ(k,a) is the largest m such that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1 and Bka = Bk+1.
We do that by matching the string against itself: we can recursively compute a function π(k) which for each k returns the largest integer m such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
The Knuth-Morris-Pratt algorithm
k
p
1: function
Compute − Prefix − Function(B)
B[p+1] B
2: 3: 4: 5: 6: 7:
8:
9: 10: 11: 12: 13: 14:
m ← length[B]
let π[1..m] be a new array π[1]=0
k=0
forq=2tomdo
while k > 0 and B[k + 1] ̸= B[q]
k = π[k]
if B[k + 1] == B[q]
k=k+1 π[q] = k
end for
return π end function
π[q]=p+1
COMP3121/3821/9101/9801
10 / 13
[k+1]
m=length(B)
Assume that length of B is m and that we have already found that π[q − 1] = k; to compute π[q] we check if B[q] = B[k+1]; if true then π[q] = k+1; if not true then we find π[k] = p; if now B[q] = B[p + 1] then π[q] = p + 1.
p=π[k]
B[q]
B[q]= B[p+1] B[q] ≠ B[k+1]
k=π[q-1]
q-1
The Knuth-Morris-Pratt algorithm
1: 2: 3: 4: 5: 6: 7: 8: 9:
10: 11: 12: 13: 14: 15:
We can now do our search for string B in a longer string A:
function KMP − Matcher(A, B)
n ← length[A]
m ← length[B]
π = Compute − Prefix − Function(B) q=0
fori=1tondo
while q > 0 and B[q + 1] ̸= A[i] q = π[q]
if B[q + 1] == A[i]
q=q+1 if q==m
print pattern occurs with shift i − m
q = π[q] end for
end function
COMP3121/3821/9101/9801
11 / 13
Looking for imperfect matches
Sometimes we are not interested in finding just the prefect matches, but also in matches that might have a few errors, such as a few insertions, deletions and replacements.
So assume that we have a very long string
A = a0a1a2a3 ……asas+1 …as+m−1 ……aN−1, a shorter string
B = b0b1b2 …bm−1 where m << N and an integer k << m. We are interested in finding all matches for B in A which allow up to k many errors.
Idea: split B into k + 1 consecutive subsequences of (approximately) equal length. Then any match in A with at most k errors must contain a subsequence which is a perfect match for a subsequence of B. Thus, we look for all perfect matches for all of k + 1 subsequences of B and for every hit we test by brute force if the remaining parts of B have sufficient number of matches in the appropriate parts of A.
COMP3121/3821/9101/9801 12 / 13
PUZZLE!!
On a rectangular table there are 25 non-overlapping round coins of equal size placed in such a way that it is not possible to add another coin without overlapping any of the existing coins and without the coin falling off the table (for a coin to stay on the table its centre must be within the table). Show that it is possible to completely cover the table with 100 coins (of course with overlapping of coins).
COMP3121/3821/9101/9801 13 / 13