CS计算机代考程序代写 algorithm Algorithms: COMP3121/9101

Algorithms: COMP3121/9101

THE UNIVERSITY OF
NEW SOUTH WALES

Algorithms:
COMP3121/9101

School of Computer Science and Engineering
University of New South Wales

9. STRING MATCHING ALGORITHMS

COMP3121/3821/9101/9801 1 / 13

String Matching algorithms

Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.

The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.

We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.

COMP3121/3821/9101/9801 2 / 13

String Matching algorithms

Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.

The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.

We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.

COMP3121/3821/9101/9801 2 / 13

String Matching algorithms

Assume that you want to find out if a string B = b0b1 . . . bm−1 appears
as a (contiguous) substring of a much longer string A = a0a1 . . . an−1.

The “naive” string matching algorithm does not work well if B is much
longer than what can fit in a single register; we need something cleverer.

We now show how hashing can be combined with recursion to produce
an efficient string matching algorithm.

COMP3121/3821/9101/9801 2 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

We compute a hash value for the string B = b0b1b2 . . . bm in the following way.

We will assume that strings A and B are in an alphabet A with d many
symbols in total.

Thus, we can identify each string with a sequence of integers by mapping each
symbol si into a corresponding integer i:

A = {s0, s1, s2, . . . , sd−1} −→ {0, 1, 2, . . . , d− 1}

To any string B = b0b1 . . . bm−1 we can now associate an integer whose digits
in base d are integers corresponding to each symbol in B:

h(B) = h(b0b1b2 . . . bm) = d
m−1

b0 + d
m−2

b1 + . . .+ d · bm−2 + bm−1

This can be done efficiently using the Horner’s rule:

h(B) = bm−1 + d(bm−2 + d(bm−3 + d(bm−4 + . . .+ d(b1 + d · b0))) . . .)

Next we choose a large prime number p such that (d+ 1) p still fits into a
single register and define the hash value of B as H(B) = h(B) mod p.

COMP3121/3821/9101/9801 3 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Recall that A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1 where N >> m.

We want to find efficiently all s such that the string of length m of the form
asas+1 . . . as+m−1 and string b0b1 . . . bm−1 are equal.

For each contiguous substring As = asas+1 . . . as+m−1 of string A we also
compute its hash value as

H(As) = (d
m−1

as + d
m−2

as+1 + . . .+ d
1
as+m−2 + as+m−1) mod p

We can now compare the hash values H(B) and H(As) and do a
symbol-by-symbol matching only if H(B) = H(As).

Clearly, such an algorithm would be faster than the naive symbol-by-symbol
comparison only if we can compute the hash values of substrings As faster
than what it takes to compare strings B and As character by character.

This is where recursion comes into play: we do not have compute the hash
value H(As+1) of As+1 = as+1as+2 . . . as+m “from scratch”, but we can
compute it efficiently from the hash value H(As) of As = asas+1 . . . as+m−1 as
follows.

COMP3121/3821/9101/9801 4 / 13

Rabin – Karp Algorithm

Since

H(As) = (d
m−1as + d

m−2as+1 + . . . d
1as+m−2 + as+m−1) mod p

by multiplying both sides by d we obtain

(d ·H(As)) mod p =
= (dmas + d

m−1as+1 + . . . d · as+m−1) mod p
= (dmas + (d

m−1as+1 + . . . d
2as+m−2 + d as+m−1 + as+m) mod p− as+m) mod p

= (dm as +H(As+1)− as+m) mod p

COMP3121/3821/9101/9801 5 / 13

Rabin – Karp Algorithm

Since

H(As) = (d
m−1as + d

m−2as+1 + . . . d
1as+m−2 + as+m−1) mod p

by multiplying both sides by d we obtain

(d ·H(As)) mod p =
= (dmas + d

m−1as+1 + . . . d · as+m−1) mod p
= (dmas + (d

m−1as+1 + . . . d
2as+m−2 + d as+m−1 + as+m) mod p− as+m) mod p

= (dm as +H(As+1)− as+m) mod p

COMP3121/3821/9101/9801 5 / 13

Rabin – Karp Algorithm

Consequently,

H(As+1) = (d ·H(As)− dmas + as+m) mod p.

Note that
(dmas) mod p = ((d

m mod p)as) mod p

and that the value dmmod p can be precomputed and stored.

Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Consequently, H(As+1) = (d ·H(As)− dmas + as+m) mod p. Note that (dmas) mod p = ((d m mod p)as) mod p and that the value dmmod p can be precomputed and stored. Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Consequently, H(As+1) = (d ·H(As)− dmas + as+m) mod p. Note that (dmas) mod p = ((d m mod p)as) mod p and that the value dmmod p can be precomputed and stored. Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Consequently, H(As+1) = (d ·H(As)− dmas + as+m) mod p. Note that (dmas) mod p = ((d m mod p)as) mod p and that the value dmmod p can be precomputed and stored. Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Consequently, H(As+1) = (d ·H(As)− dmas + as+m) mod p. Note that (dmas) mod p = ((d m mod p)as) mod p and that the value dmmod p can be precomputed and stored. Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Consequently, H(As+1) = (d ·H(As)− dmas + as+m) mod p. Note that (dmas) mod p = ((d m mod p)as) mod p and that the value dmmod p can be precomputed and stored. Also, (−dmas + as+m)mod p < p Thus, since H(As) < p we obtain d ·H(As) + (−dmas + as+m) mod p < (d+ 1)p Thus, since we chose p such that (d+ 1) p fits in a register, all the values and the intermediate results for the above expression also fit in a single register. Thus, for every s except s = 0 the value of H(As) can be computed in constant time independent of the length of the strings A and B. COMP3121/3821/9101/9801 6 / 13 Rabin - Karp Algorithm Thus, we first compute H(B) and H(A0) using the Horner’s rule. Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

Rabin – Karp Algorithm

Thus, we first compute H(B) and H(A0) using the Horner’s rule.

Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

Rabin – Karp Algorithm

Thus, we first compute H(B) and H(A0) using the Horner’s rule.

Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

Rabin – Karp Algorithm

Thus, we first compute H(B) and H(A0) using the Horner’s rule.

Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

Rabin – Karp Algorithm

Thus, we first compute H(B) and H(A0) using the Horner’s rule.

Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

Rabin – Karp Algorithm

Thus, we first compute H(B) and H(A0) using the Horner’s rule.

Subsequent values of H(As) for s > 0 are computed in constant time
using the above recursion.

H(As) is compared with H(B) and if they are equal the strings As and
B are compared by brute force character by character to see if they are
equal.

Since p was chosen large, the false positives when H(As) = H(B) but
As 6= B are very unlikely, which makes the algorithm run fast in practice.

However, as always when we use hashing, we cannot guarantee the worst
case performance.

So we now look for algorithms whose worst case performance can be
guaranteed.

COMP3121/3821/9101/9801 7 / 13

String matching finite automata

A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.

We first look at the case when such δ(s, c) is given by a pre-constructed table.

To make things easier to describe, we consider the string S = ababaca. The table

defining δ(s, c) would then be

input
state a b c

0 1 0 0 a

1 1 2 0 b

2 3 0 0 a

3 1 4 0 b

4 5 0 0 a

5 1 4 6 c

6 7 0 0 a

7 1 2 0

2

b,c

a

c

b a b a c a
0 1 3 4 5 6 7

a

a

a

a

b
b

b,c

state transition diagram for string ababaca

COMP3121/3821/9101/9801 8 / 13

String matching finite automata

A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.

We first look at the case when such δ(s, c) is given by a pre-constructed table.

To make things easier to describe, we consider the string S = ababaca. The table

defining δ(s, c) would then be

input
state a b c

0 1 0 0 a

1 1 2 0 b

2 3 0 0 a

3 1 4 0 b

4 5 0 0 a

5 1 4 6 c

6 7 0 0 a

7 1 2 0

2

b,c

a

c

b a b a c a
0 1 3 4 5 6 7

a

a

a

a

b
b

b,c

state transition diagram for string ababaca

COMP3121/3821/9101/9801 8 / 13

String matching finite automata

A string matching finite automaton for a string S with k symbols has k + 1 many
states 0, 1, . . . k which correspond to the number of characters matched thus far and a
transition function δ(s, c) where s is a state and c is a character red at the moment.

We first look at the case when such δ(s, c) is given by a pre-constructed table.

To make things easier to describe, we consider the string S = ababaca. The table
defining δ(s, c) would then be

input
state a b c

0 1 0 0 a

1 1 2 0 b

2 3 0 0 a

3 1 4 0 b

4 5 0 0 a

5 1 4 6 c

6 7 0 0 a

7 1 2 0

2

b,c

a

c

b a b a c a
0 1 3 4 5 6 7

a

a

a

a

b
b

b,c

state transition diagram for string ababaca

COMP3121/3821/9101/9801 8 / 13

String matching with finite automata

How do we compute the transition function δ, i.e., how do we fill the
table?

Let Bk denote the prefix of the string B consisting of the first k
characters of string B.

If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.

Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.

We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.

COMP3121/3821/9101/9801 9 / 13

String matching with finite automata

How do we compute the transition function δ, i.e., how do we fill the
table?

Let Bk denote the prefix of the string B consisting of the first k
characters of string B.

If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.

Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.

We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.

COMP3121/3821/9101/9801 9 / 13

String matching with finite automata

How do we compute the transition function δ, i.e., how do we fill the
table?

Let Bk denote the prefix of the string B consisting of the first k
characters of string B.

If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.

Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.

We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.

COMP3121/3821/9101/9801 9 / 13

String matching with finite automata

How do we compute the transition function δ, i.e., how do we fill the
table?

Let Bk denote the prefix of the string B consisting of the first k
characters of string B.

If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.

Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.

We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.

COMP3121/3821/9101/9801 9 / 13

String matching with finite automata

How do we compute the transition function δ, i.e., how do we fill the
table?

Let Bk denote the prefix of the string B consisting of the first k
characters of string B.

If we are at a state k this means that so far we have matched the prefix
Bk; if we now see an input character a, then δ(k, a) is the largest m such
that the prefix Bm of string B is the suffix of the string Bka.

Thus, if a happens to be B[k + 1], then m = k + 1 and so δ(k, a) = k + 1
and Bka = Bk+1.

We do that by matching the string against itself: we can recursively
compute a function π(k) which for each k returns the largest integer m
such that the prefix Bm of B is a proper suffix of Bk.

COMP3121/3821/9101/9801 9 / 13

The Knuth-Morris-Pratt algorithm

1: function
Compute− Prefix− Function(B)
2: m← length[B]
3: let π[1..m] be a new array
4: π[1] = 0
5: k = 0
6: for q = 2 to m do
7: while k > 0 and

B[k + 1] 6= B[q]
8: k = π[k]
9: if B[k + 1] == B[q]

10: k = k + 1
11: π[q] = k
12: end for
13: return π
14: end function

B[q]= B[p+1] π[q]=p+1

k

m=length(B)

q-1

B[q]

k=π[q-1]
B[q] ≠ B[k+1]

p=π[k]

B[p+1]

p

B[k+1]

Assume that length of B is m and that
we have already found that π[q − 1] =
k; to compute π[q] we check if B[q] =
B[k + 1]; if true then π[q] = k + 1; if
not true then we find π[k] = p; if now
B[q] = B[p+ 1] then π[q] = p+ 1.

COMP3121/3821/9101/9801 10 / 13

The Knuth-Morris-Pratt algorithm

We can now do our search for string B in a longer string A:

1: function KMP−Matcher(A,B)
2: n← length[A]
3: m← length[B]
4: π = Compute− Prefix− Function(B)
5: q = 0
6: for i = 1 to n do
7: while q > 0 and B[q + 1] 6= A[i]
8: q = π[q]
9: if B[q + 1] == A[i]

10: q = q + 1
11: if q == m
12: print pattern occurs with shift i−m
13: q = π[q]
14: end for
15: end function

COMP3121/3821/9101/9801 11 / 13

Looking for imperfect matches

Sometimes we are not interested in finding just the prefect matches, but also in
matches that might have a few errors, such as a few insertions, deletions and
replacements.

So assume that we have a very long string
A = a0a1a2a3 . . . . . . asas+1 . . . as+m−1 . . . . . . aN−1, a shorter string
B = b0b1b2 . . . bm−1 where m << N and an integer k << m. We are interested in finding all matches for B in A which allow up to k many errors. Idea: split B into k + 1 consecutive subsequences of (approximately) equal length. Then any match in A with at most k errors must contain a subsequence which is a perfect match for a subsequence of B. Thus, we look for all perfect matches for all of k + 1 subsequences of B and for every hit we test by brute force if the remaining parts of B have sufficient number of matches in the appropriate parts of A. COMP3121/3821/9101/9801 12 / 13 PUZZLE!! On a rectangular table there are 25 non-overlapping round coins of equal size placed in such a way that it is not possible to add another coin without overlapping any of the existing coins and without the coin falling off the table (for a coin to stay on the table its centre must be within the table). Show that it is possible to completely cover the table with 100 coins (of course with overlapping of coins). COMP3121/3821/9101/9801 13 / 13