Algorithms: COMP3121/9101
THE UNIVERSITY OF
NEW SOUTH WALES
Algorithms:
COMP3121/9101
School of Computer Science and Engineering
University of New South Wales
9. STRING MATCHING ALGORITHMS
COMP3121/3821/9101/9801 1 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.
The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.
We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.
COMP3121/3821/9101/9801 2 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.
The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.
We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.
COMP3121/3821/9101/9801 2 / 13
String Matching algorithms
Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.
The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.
We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.
COMP3121/3821/9101/9801 2 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
We compute a hash value for the string B = b0b1b2 . . . bm in the following way.
We will assume that strings A and B are in an alphabet A with d many
symbols in total.
Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:
A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}
To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:
h(B) = h(b0b1b2 . . . bm) = d
m−1
b0 + d
m−2
b1 + . . .+ d · bm−2 + bm−1
This can be done efficiently using the Horner’s rule:
h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)
Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.
COMP3121/3821/9101/9801 3 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.
We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.
For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as
H(As) = (d
m−1
as + d
m−2
as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p
We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).
Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.
This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.
COMP3121/3821/9101/9801 4 / 13
Rabin – Karp Algorithm
Since
H(As) = (d
m−1as + d
m−2as+1 + . . . d
1as+m−2 + as+m−1) mod p
by multiplying both sides by d we obtain
(d ·H(As)) mod p =
= (dmas + d
m−1as+1 + . . . d · as+m−1) mod p
= (dmas + (d
m−1as+1 + . . . d
2as+m−2 + d as+m−1 + as+m) mod p− as+m) mod p
= (dm as +H(As+1)− as+m) mod p
COMP3121/3821/9101/9801 5 / 13
Rabin – Karp Algorithm
Since
H(As) = (d
m−1as + d
m−2as+1 + . . . d
1as+m−2 + as+m−1) mod p
by multiplying both sides by d we obtain
(d ·H(As)) mod p =
= (dmas + d
m−1as+1 + . . . d · as+m−1) mod p
= (dmas + (d
m−1as+1 + . . . d
2as+m−2 + d as+m−1 + as+m) mod p− as+m) mod p
= (dm as +H(As+1)− as+m) mod p
COMP3121/3821/9101/9801 5 / 13
Rabin – Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Consequently,
H(As+1) = (d ·H(As)− dmas + as+m) mod p.
Note that
(dmas) mod p = ((d
m mod p)as) mod p
and that the value dmmod p can be precomputed and stored.
Also, (−dmas + as+m)mod p < p
Thus, since H(As) < p we obtain
d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p
Thus, since we chose p such that (d+ 1) p fits in a register, all the values
and the intermediate results for the above expression also fit in a single
register.
Thus, for every s except s = 0 the value of H(As) can be computed in
constant time independent of the length of the strings A and B.
COMP3121/3821/9101/9801 6 / 13
Rabin - Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
Rabin – Karp Algorithm
Thus, we first compute H(B) and H(A0) using the Horner’s rule.
Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.
H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.
Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.
However, as always when we use hashing, we cannot guarantee the worst
case performance.
So we now look for algorithms whose worst case performance can be
guaranteed.
COMP3121/3821/9101/9801 7 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.
We first look at the case when such δ(s, c) is given by a pre-constructed table.
To make things easier to describe, we consider the string S = ababaca. The table
defining δ(s, c) would then be
input
state a b c
0 1 0 0 a
1 1 2 0 b
2 3 0 0 a
3 1 4 0 b
4 5 0 0 a
5 1 4 6 c
6 7 0 0 a
7 1 2 0
2
b,c
a
c
b a b a c a
0 1 3 4 5 6 7
a
a
a
a
b
b
b,c
state transition diagram for string ababaca
COMP3121/3821/9101/9801 8 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.
We first look at the case when such δ(s, c) is given by a pre-constructed table.
To make things easier to describe, we consider the string S = ababaca. The table
defining δ(s, c) would then be
input
state a b c
0 1 0 0 a
1 1 2 0 b
2 3 0 0 a
3 1 4 0 b
4 5 0 0 a
5 1 4 6 c
6 7 0 0 a
7 1 2 0
2
b,c
a
c
b a b a c a
0 1 3 4 5 6 7
a
a
a
a
b
b
b,c
state transition diagram for string ababaca
COMP3121/3821/9101/9801 8 / 13
String matching finite automata
A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.
We first look at the case when such δ(s, c) is given by a pre-constructed table.
To make things easier to describe, we consider the string S = ababaca. The table
defining δ(s, c) would then be
input
state a b c
0 1 0 0 a
1 1 2 0 b
2 3 0 0 a
3 1 4 0 b
4 5 0 0 a
5 1 4 6 c
6 7 0 0 a
7 1 2 0
2
b,c
a
c
b a b a c a
0 1 3 4 5 6 7
a
a
a
a
b
b
b,c
state transition diagram for string ababaca
COMP3121/3821/9101/9801 8 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the
table?
Let Bk denote the prefix of the string B consisting of the first k
characters of string B.
If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.
We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the
table?
Let Bk denote the prefix of the string B consisting of the first k
characters of string B.
If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.
We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the
table?
Let Bk denote the prefix of the string B consisting of the first k
characters of string B.
If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.
We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the
table?
Let Bk denote the prefix of the string B consisting of the first k
characters of string B.
If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.
We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
String matching with finite automata
How do we compute the transition function δ, i.e., how do we fill the
table?
Let Bk denote the prefix of the string B consisting of the first k
characters of string B.
If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.
Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.
We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.
COMP3121/3821/9101/9801 9 / 13
The Knuth-Morris-Pratt algorithm
1: function
Compute− Prefix− Function(B)
2: m← length[B]
3: let π[1..m] be a new array
4: π[1] = 0
5: k = 0
6: for q = 2 to m do
7: while k > 0 and
B[k + 1] 6= B[q]
8: k = π[k]
9: if B[k + 1] == B[q]
10: k = k + 1
11: π[q] = k
12: end for
13: return π
14: end function
B[q]= B[p+1] π[q]=p+1
k
m=length(B)
q-1
B[q]
k=π[q-1]
B[q] ≠ B[k+1]
p=π[k]
B[p+1]
p
B[k+1]
Assume that length of B is m and that
we have already found that π[q − 1] =
k; to compute π[q] we check if B[q] =
B[k + 1]; if true then π[q] = k + 1; if
not true then we find π[k] = p; if now
B[q] = B[p+ 1] then π[q] = p+ 1.
COMP3121/3821/9101/9801 10 / 13
The Knuth-Morris-Pratt algorithm
We can now do our search for string B in a longer string A:
1: function KMP−Matcher(A,B)
2: n← length[A]
3: m← length[B]
4: π = Compute− Prefix− Function(B)
5: q = 0
6: for i = 1 to n do
7: while q > 0 and B[q + 1] 6= A[i]
8: q = π[q]
9: if B[q + 1] == A[i]
10: q = q + 1
11: if q == m
12: print pattern occurs with shift i−m
13: q = π[q]
14: end for
15: end function
COMP3121/3821/9101/9801 11 / 13
Looking for imperfect matches
Sometimes we are not interested in finding just the prefect matches, but also in
matches that might have a few errors, such as a few insertions, deletions and
replacements.
So assume that we have a very long string
A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1, a shorter string
B = b0b1b2 . . . bm−1 where m << N and an integer k << m. We are interested
in finding all matches for B in A which allow up to k many errors.
Idea: split B into k + 1 consecutive subsequences of (approximately) equal
length. Then any match in A with at most k errors must contain a
subsequence which is a perfect match for a subsequence of B. Thus, we look
for all perfect matches for all of k + 1 subsequences of B and for every hit we
test by brute force if the remaining parts of B have sufficient number of
matches in the appropriate parts of A.
COMP3121/3821/9101/9801 12 / 13
PUZZLE!!
On a rectangular table there are 25 non-overlapping round coins of
equal size placed in such a way that it is not possible to add another
coin without overlapping any of the existing coins and without the coin
falling off the table (for a coin to stay on the table its centre must be
within the table). Show that it is possible to completely cover the table
with 100 coins (of course with overlapping of coins).
COMP3121/3821/9101/9801 13 / 13