Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
Algorithms FOURTH EDITION
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt ‣ Boyer-Moore
‣ Rabin-Karp
Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt ‣ Boyer-Moore
‣ Rabin-Karp
Substring search
Goal. Find pattern of length M in a text of length N.
typically N >> M
(in some cases : MI
Nz)
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
Substring search
match
3
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
Substring search
match
4
Substring search applications
Goal. Find pattern of length M in a text of length N.
NEEDLE INAHAYSTACKNEEDLEINA
match
Substring search
Computer forensics. Search memory or disk for signatures,
e.g., all URLs or RSA keys that the user has entered.
pattern
text
typically N >> M
http://citp.princeton.edu/memory
5
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
Identify patterns indicative of spam.
Substring search
match
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
I・dentify patterns indicative of spam. PROFITS
Substring search
match
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
I・dentify patterns indicative of spam. ・ PROFITS
L0SE WE1GHT
Substring search
match
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
I・dentify patterns indicative of spam. ・ PROFITS
・L0SE WE1GHT
herbal Viagra
Substring search
match
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
I・dentify patterns indicative of spam. ・ PROFITS
・L0SE WE1GHT
・herbal Viagra
There is no catch.
Substring search
match
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
Identify patterns indicative of spam.
Substring search
match
・ PROFITS
・L0SE WE1GHT
・herbal Viagra
・There is no catch.
・This is a one-time mailing.
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
typically N >> M
Identify patterns indicative of spam.
Substring search
match
・ PROFITS
・L0SE WE1GHT
・herbal Viagra
・There is no catch.
・This is a one-time mailing.
・This message is sent in compliance with spam regulations.
6
Substring search applications
Goal. Find pattern of length M in a text of length N.
pattern
text
NEEDLE INAHAYSTACKNEEDLEINA
Identify patterns indicative of spam.
Substring search
match
typically N >> M
・ PROFITS
・L0SE WE1GHT
・herbal Viagra
・There is no catch.
・This is a one-time mailing.
・This message is sent in compliance with spam regulations.
6
Substring search applications
Electronic surveillance.
Need to monitor all internet traffic. (security)
7
Substring search applications
Electronic surveillance.
Need to monitor all internet traffic. (security)
No way! (privacy)
7
Substring search applications
Electronic surveillance.
Need to monitor all internet traffic. (security)
No way! (privacy)
Well, we’re mainly interested in “ATTACK AT DAWN”
7
Substring search applications
Electronic surveillance.
Need to monitor all internet traffic. (security)
No way! (privacy)
Well, we’re mainly interested in “ATTACK AT DAWN”
OK. Build a machine that just looks for that.
7
Substring search applications
Electronic surveillance.
Need to monitor all internet traffic. (security)
No way! (privacy)
Well, we’re mainly interested in “ATTACK AT DAWN”
OK. Build a machine that just looks for that.
“ATTACK AT DAWN” substring search
machine
found
7
Substring search applications
Screen scraping. Extract relevant data from web page.
Ex. Find string delimited by and after first occurrence of
pattern Last Trade:.
…
Last Trade:
Trade Time:
http://finance.yahoo.com/q?s=goog
8
Screen scraping: Java implementation
Java library. The indexOf() method in Java’s string library returns the index of the first occurrence of a given string, starting at a given offset.
public class StockQuote
{
public static void main(String[] args)
{
String name = “http://finance.yahoo.com/q?s=”;
In in = new In(name + args[0]);
String text = in.readAll();
String price = text.substring(from + 3, to);
StdOut.println(price);
}
}
int start
int from
int to
= text.indexOf(“Last Trade:”, 0);
= text.indexOf(““, start);
= text.indexOf(““, from);
% java StockQuote goog
582.93
% java StockQuote msft
24.84
9
Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt ‣ Boyer-Moore
‣ Rabin-Karp
Brute-force substring search
Check for pattern starting at each text position.
N =
of txt
i j i+j 0 1 2 3 4 5 6 7 8 9 10 txt ABACADABRAC
6 410
return i when j is M
ABRA
match
entries in red are mismatches
entries in gray are for reference only
entries in black A B R A match the text A B R A
M
=
length length of
pattern
0 2 2
101
213
303
4 1 5
5 05
A B R A pat ABRA
ABRA ABRA
Brute-force substring search
11
Brute-force substring search: Java implementation
Check for pattern starting at each text position.
public static int search(String pat, String txt)
{
int M = pat.length();
int N = txt.length();
for (int i = 0; i <= N - M; i++)
{
int j;
for (j = 0; j < M; j++)
if (txt.charAt(i+j) != pat.charAt(j))
}
return N; }
not found
break;
if (j == M) return i;
index in text where
pattern starts
12
i ji+j 012345678910 ABACADABRAC
437 ADACR 505 ADACR
Brute-force substring search: worst case
Brute-force algorithm can be slow if text and pattern are repetitive.
--T -_ 10
i j i+j 0 1 2 3 4 5 6 7 8 9 10 j i+j 0 1 2 3 4 5 6 7 8 9
i 044AAAAB
txt ABACADABRAC
μ N
txt
AAAAAAAAAB
0 2 2 A B R A pat
pat
entries in red are mismatches
entries in gray are for reference only
101 ABRA 145 AAAAB
213 ABRA 246 AAAAB
303 ABRA 347 AAAAB
4 1 5 entries in black A B R A 448 AAAAB
505 match the text 5 5 10
A B R A
A A A A B
6 4 10
return i when j is M
A B R A
Brute-force substring search (worst case)
Brute-force substring search
match
13
Brute-force substring search: worst case
Brute-force algorithm can be slow if text and pattern are repetitive.
i j i+j 0 1 2 3 4 5 6 7 8 9 10 j i+j 0 1 2 3 4 5 6 7 8 9
txt ABACADABRAC txt AAAAAAAAAB
i 044AAAAB
0 2 2 A B R A pat
pat
1 0 1 1 4 5
A B R A entries in red are AAAAB mismatches
213
ABRA
A A A A B entries in gray are
246 30
347 4 1
5 5 10
6 4 10
Brute-force substring search (worst case)
Brute-force substring search
448 50
AAAAB A B R A
A A A A B A B R A
Worst case. ~ M N char compares.
3 5 5
A B R A for reference only AAAAB
entries in black A B R A
match the text return i when j is M
=
imagine
or
N TN
M
match
109 MINT
- and
13
Backup
I・n many applications, we want to avoid backup in text stream.
・Treat input as stream of data. Abstract model: standard input.
“ATTACK AT DAWN” substring search machine
14
Backup
I・n many applications, we want to avoid backup in text stream.
・Treat input as stream of data. Abstract model: standard input.
“ATTACK AT DAWN” substring search machine
Brute-force algorithm needs backup for every mismatch.
text→
pattern
matched chars
ex
mismatch
backup
14
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
shift pattern right one position
Backup
I・n many applications, we want to avoid backup in text stream.
・Treat input as stream of data. Abstract model: standard input.
“ATTACK AT DAWN” substring search machine
Brute-force algorithm needs backup for every mismatch.
matched chars
mismatch
backup
Approach 1. Maintain buffer of last M characters.
14
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
shift pattern right one position
Backup
I・n many applications, we want to avoid backup in text stream.
・Treat input as stream of data. Abstract model: standard input.
“ATTACK AT DAWN” substring search machine
Brute-force algorithm needs backup for every mismatch.
matched chars
mismatch
backup
Approach 1. Maintain buffer of last M characters. Approach 2. Stay tuned.
14
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
AAAAAAAAAAAAAAAAAAAAAB AAAAAB
shift pattern right one position
Brute-force substring search: alternate implementation
Same sequence of char compares as previous implementation.
・ i points to end of sequence of already-matched chars in text. ・ j stores # of already-matched chars (end of sequence in pattern).
public static int search(String pat, String txt)
{
int i, N = txt.length();
int j, M = pat.length();
for (i = 0, j = 0; i < N && j < M; i++)
{
if (txt.charAt(i) == pat.charAt(j)) j++;
}
if (j == M) return i - M;
else return N;
}
else { i -= j; j = 0; }
explicit backup
15
i j 012345678910 ABACADABRAC
73 ADACR 50 ADACR
Algorithmic challenges in substring search
Brute-force is not always good enough.
Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.
16
Algorithmic challenges in substring search
Brute-force is not always good enough.
Theoretical challenge. Linear-time guarantee.
fundamental algorithmic problem
Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.
16
Algorithmic challenges in substring search
Brute-force is not always good enough.
Theoretical challenge. Linear-time guarantee. Practical challenge. Avoid backup in text stream.
fundamental algorithmic problem often no room or time to save text
Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for each good person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party. Now is the time for all people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good people to come to the aid of their party. Now is the time for all of the good people to come to the aid of their party. Now is the time for all good people to come to the aid of their attack at dawn party. Now is the time for each person to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for many or all good people to come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is the time for all good Democrats to come to the aid of their party.
16
Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt ‣ Boyer-Moore
‣ Rabin-Karp
5.3 SUBSTRING SEARCH
Michael Rabin Dick Karp
Rabin-Karp fingerprint search
Basic idea = modular hashing.
( division
method )
73
Rabin-Karp fingerprint search
B・asic idea = modular hashing. Compute a hash of pat[0..M-1].
pat.charAt(i)
i 01234
2 6 5 3 5 % 997 = 613
txt.charAt(i)
i 01234 5 6 7 8 9 10 11 12 13 14 15 3141592653589793
large
prime
number
( division
method )
k mod of
hlkl
%
=
tQ (
=
" mod " )
0 31 4 1 5 % 997 = 508
1 1 4 1 5 9 % 997 = 201
2 4 1 5 9 2 % 997 = 715
3
4
5
6 returni = 6
1 5 9 2 6 % 997 = 971
5 9 2 6 5 % 997 = 442
9 2 6 5 3 % 997 = 929
2 6 5 3 5 % 997 = 613
Basis for Rabin-Karp substring search
modular hashing with R = 10 and hash(s) = s (mod 997)
match
73
Rabin-Karp fingerprint search
B・asic idea = modular hashing. ・Compute a hash of pat[0..M-1].
For each i, compute a hash of txt[i..M+i-1].
method )
( division
large
prime
numbe
hlkl
=
k mod of
(% = "mod")
pat.charAt(i) tQ i 0 1 2 3 4
2
i 0 3
0 3 1
2
3
4 5
6
1 1
1 1
5 3 5
2 3 4 4 1 5
4 1 5 4 1 5 4 1 5
% 997 = 613
txt.charAt(i)
6 returni = 6
5 6 7 8 9101112131415 9 2 6 5 3 5 8 9 7 9 3
% 997 = 508
9 % 997 = 201
9 2 % 997 = 715
9 2 6 % 997 = 971
9 2 6 5 3 % 997 = 929
2 6 5 3 5 % 997 = 613
1 5
5 9 2 6 5 % 997 = 442
match
Basis for Rabin-Karp substring search
modular hashing with R = 10 and hash(s) = s (mod 997)
73
Rabin-Karp fingerprint search
B・asic idea = modular hashing.
・Compute a hash of pat[0..M-1].
・For each i, compute a hash of txt[i..M+i-1].
If pattern hash = text substring hash, check for a match. pat.charAt(i)
i 0 2
i 0 3
0 3 1
2
3
4 5
1 6
1 1
1 1
2 3 4 5 3 5
2 3 4 4 1 5
4 1 5 4 1 5 4 1 5
% 997 = 613
txt.charAt(i)
5 6 7 8 9101112131415 9 2 6 5 3 5 8 9 7 9 3
6 returni = 6
% 997 = 508
9 % 997 = 201
9 2 % 997 = 715
9 2 6 % 997 = 971
9 2 6 5 3 % 997 = 929
2 6 5 3 5 % 997 = 613
1 5
5 9 2 6 5 % 997 = 442
match
Basis for Rabin-Karp substring search
modular hashing with R = 10 and hash(s) = s (mod 997)
73
- (division method) of strings with Modular hashing
alphabet
(# of distinct characters that can appear in text)
general
μ = length of pattern 1st char of text ( represented in range toil, - - i
'
R
= size
alphabet
of
t. " ""
(to
Rn
tt-Rt ,
)Q text
R-H
=
hash value of the initial M characters of
keychatleng.es -challenge 1 : If M is large , might have
( of length : cost
Yo
.
-
tta, h÷mod
...
chatter
Hashing
-
substring Mt)
numerical overflow m) : costs m
NM
one Hashing N
order
Chatkngel:
If M is large , then the number will overflow.
modular arithmetic identies :
Cll (atb) modof = ((a mudQ) +(bmodal)modQ
(21(d-b)modQ=((amodQ). (bmodal)mudQ
Two
÷method Horner's
÷÷÷.
" ""
Xo
ii.
...
"
.
4to mod Q)Rtt, )mod4)Atta), mod Q
= (to. rn tt- R t ,
-
t ta N)mod Q
,
i.
of:÷÷÷:: . . .
totalcostofMN x; is hash value for tititi,. .. , titu- I
Chalking Avoiding
"
"
tin
= - t,- t..-t-i-
tin
Xit,
Xi (tiRM" tiRM" tinN)mode
...
Titu
Xiti = (
tin- RM" t . . - - t time, - R'
ttm.m.ro/modQ
= ((x tiI" ) team ) mod xi+, ..-RRt Q
= Rm- ' mod Q
Observation !
precomnule I
:
Modular arithmetic
Math trick. To keep numbers small, take intermediate results modulo Q.
Ex.
10000 mod 997 = 30
(mod 997) (mod 997) (mod 997) (mod 997)
(10000 + 535) * 1000 = (30 + 535) * 3 = 1695 = 698
1000 mod 997 = 3
74
Modular arithmetic
Math trick. To keep numbers small, take intermediate results modulo Q.
Ex.
10000 mod 997 = 30
(mod 997) (mod 997) (mod 997) (mod 997)
(10000 + 535) * 1000 = (30 + 535) * 3 = 1695 = 698
1000 mod 997 = 3
(a + b) mod Q = ((a mod Q) + (b mod Q)) mod Q (a * b) mod Q = ((a mod Q) * (b mod Q)) mod Q
two useful modular arithmetic identities
74
Efficiently computing the hash function
Modular hash function. Using the notation ti for txt.charAt(i),
we wish to compute
xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0 (modQ)
Intuition. M-digit, base-R integer, modulo Q.
26535 = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
75
Efficiently computing the hash function
Modular hash function. Using the notation ti for txt.charAt(i),
we wish to compute
xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0 (modQ)
Intuition. M-digit, base-R integer, modulo Q.
Horner's method. Linear-time method to evaluate degree-M polynomial.
pat.charAt()
i01234 26535
0 2 % 997 = 2
RQ
1 2 6
2 2 6
3 2 6
4 2 6
% 997 = (2*10 + 6) % 997 = 26
5 % 997 = (26*10 + 5) % 997 = 265
5 3 % 997 = (265*10 + 3) % 997 = 659
5 3 5 % 997 = (659*10 + 5) % 997 = 613
Computing the hash value for the pattern with Horner’s method
26535 = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
= ((((2) *10 + 6) * 10 + 5) * 10 + 3) * 10 + 5
75
Efficiently computing the hash function
Modular hash function. Using the notation ti for txt.charAt(i),
we wish to compute
xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0 (modQ)
Intuition. M-digit, base-R integer, modulo Q.
Horner's method. Linear-time method to evaluate degree-M polynomial.
pat.charAt()
i01234 26535
0 2 % 997 = 2
RQ
1 2 6
2 2 6
3 2 6
4 2 6
% 997 = (2*10 + 6) % 997 = 26
5 % 997 = (26*10 + 5) % 997 = 265
5 3 % 997 = (265*10 + 3) % 997 = 659
5 3 5 % 997 = (659*10 + 5) % 997 = 613
// Compute hash for M-digit key
private long hash(String key, int M)
{
long h = 0;
for (int j = 0; j < M; j++)
h = (h * R + key.charAt(j)) % Q;
return h;
}
Computing the hash value for the pattern with Horner’s method
26535 = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
= ((((2) *10 + 6) * 10 + 5) * 10 + 3) * 10 + 5
75
Efficiently computing the hash function
Challenge. How to efficiently compute xi+1 given that we know xi. xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0
xi+1 = ti+1 RM–1 +ti+2 RM–2 +...+ti+M R0
i ... 2 3 4 5 6 7 ...
6 5 6 5
current value 1 4 1 5 9 2
text
new value
4 1 5 9 2
41592 -40000 1592 *10 15920 +6 15926
current value
subtract leading digit multiply by radix
add new trailing digit new value
Key computation in Rabin-Karp substring search
76
(move right one position in the text)
Efficiently computing the hash function
Challenge. How to efficiently compute xi+1 given that we know xi. xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0
xi+1 = ti+1 RM–1 +ti+2 RM–2 +...+ti+M R0
i ... 2 current value 1 4 newvalue 4
4 - 4
1 1
3 4 1 5 1 5
1 5 0 0 1 5
5 6 9 2 9 2
9 2 0 0 9 2
7 ...
6 5 6 5
text
5 9 5 9
2 0 + 6 2 6
* 1 0
current value
subtract leading digit multiply by radix
add new trailing digit new value
Key computation in Rabin-Karp substring search
76
(move right one position in the text)
Efficiently computing the hash function
Challenge. How to efficiently compute xi+1 given that we know xi.
Key property.
xi = ti RM–1 +ti+1 RM–2 +...+ti+M–1 R0 xi+1 = ti+1 RM–1 +ti+2 RM–2 +...+ti+M R0
Can update "rolling" hash function in constant time!
xi+1 = (xi – tiRM–1) R + ti+M current subtract
multiply add new
value leading digit by radix trailing digit
(can precompute RM-1)
i ... 2 current value 1 4 newvalue 4
4 - 4
1 1
3 4 1 5 1 5
1 5 0 0 1 5
5 6 9 2 9 2
9 2 0 0 9 2
7 ...
6 5 6 5
text
5 9 5 9
2 0 + 6 2 6
* 1 0
current value
subtract leading digit multiply by radix
add new trailing digit new value
Key computation in Rabin-Karp substring search
76
(move right one position in the text)
Rabin-Karp substring search example
i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3141592653589793
0 3 % 997 = 3
Q
1 3 1
2 3 1
3 3 1
4 3 1
% 997 = (3*10 + 1) % 997 = 31
4 % 997 = (31*10 + 4) % 997 = 314
4 1 % 997 = (314*10 + 1) % 997 = 150
5 6 7 8 9
10
1 5
% 997 = (150*10 + 5) % 997 = 508
9 % 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201
9 2 % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715
9 2 6 % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971
4 1 5
1 4
4 1 5
RM R
2 6 5 % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442
returni-M+1 = 6
2 6
5 3 5 % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613 Rabin-Karp substring search example
1 5
5 9
match
9 2
6 5 3 % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
77
Rabin-Karp substring search example
First R entries: Use Horner's rule.
i 0 3
1 2 3 4 5 6 7 8 9101112131415 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3
0 3
1 3 1 2 3 1 3 3 1 4 3 1
4 1
4 1 5
Horner's rule
RM R
5 6 7 8 9
10
1 4
4 1 5
1 5
5 9
% 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201
2 % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715
2 6 % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971
2 6 5 % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442
2 6 5 3 % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
2 6 5 3 5 % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613
Rabin-Karp substring search example
Q
% 997 = (314*10 + 1) % 997 = 150
% 997 = (150*10 + 5) % 997 = 508
% 997 = 3
% 997 = (3*10 + 1) % 997 = 31
4 % 997 = (31*10 + 4) % 997 = 314
return i-M+1
= 6
1 5
9 9 9
9
match
77
Rabin-Karp substring search example
First R entries: Use Horner's rule.
Remaining entries: Use rolling hash (and % to avoid overflow).
i 0 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3
Q
% 997 = (314*10 + 1) % 997 = 150
% 997 = (150*10 + 5) % 997 = 508
0 3
1 3 1 2 3 1 3 3 1 4 3 1
4 1
4 1 5
Horner's rule
RM R
% 997 = 3
% 997 = (3*10 + 1) % 997 = 31
4 % 997 = (31*10 + 4) % 997 = 314
5 6 7 8 9
10
1 4
4 1 5
1 5
5 9
% 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201
2 % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715
2 6 % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971
2 6 5 % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442
2 6 5 3 % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
2 6 5 3 5 % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613
Rabin-Karp substring search example
return i-M+1
= 6
1 5
9 9 9
9
rolling hash
match
-30 (mod 997) = 997 – 30 10000 (mod 997) = 30
77
Rabin-Karp: Java implementation
public class RabinKarp
{
private long patHash;
private int M;
private long Q;
private int R;
private long RM1;
// pattern hash value
// pattern length
// modulus
// radix
// R^(M-1) % Q
public RabinKarp(String pat) {
M = pat.length();
R = 256;
}
private long hash(String key, int M)
{ /* as before */ }
public int search(String txt)
{ /* see next slide */ }
}
Q = longRandomPrime();
RM1 = 1;
for (int i = 1; i <= M-1; i++)
RM1 = (R * RM1) % Q;
patHash = hash(pat, M);
a large prime
(but avoid overflow)
precompute RM – 1 (mod Q)
78
Rabin-Karp: Java implementation (continued)
Monte Carlo version. Return match if hash match.
public int search(String txt)
{
int N = txt.length();
int txtHash = hash(txt, M);
if (patHash == txtHash) return 0;
for (int i = M; i < N; i++)
{
check for hash collision
using rolling hash function
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;
if (patHash == txtHash) return i - M + 1;
}
return N; }
Las Vegas version. Check for substring match if hash match;
continue search if false collision.
79
Rabin-Karp analysis
Theory. If Q is a sufficiently large random prime (about M N 2),
then the probability of a false collision is about 1 / N.
80
Rabin-Karp analysis
Theory. If Q is a sufficiently large random prime (about M N 2),
then the probability of a false collision is about 1 / N.
Practice. Choose Q to be a large prime (but not so large to-cause overflow).
Und-
Las- Vests Alg
er reasonable assumptions, probability of a collision is about 1 / Q.
.
use
Rabin and u p o n
- Karn to find hash matches,
expected
cost of alg
'
.
..-
O ( N t N Yo M
↳
each hash
match,
check ifsabstrinsoftext}"⇐n actually matches pattern
) Nt
suppose
KIM
of
Ntn- ml=out
80
Rabin-Karp analysis
Theory. If Q is a sufficiently large random prime (about M N 2),
then the probability of a false collision is about 1 / N.
Practice. Choose Q to be a large prime (but not so large to cause overflow).
Under reasonable assumptions, probability of a collision is about 1 / Q.
Monte Carlo version.
・Always runs in linear time.
・Extremely likely to return correct answer (but not always!).
L・as Vegas version.
・Always returns correct answer.
Extremely likely to run in linear time (but worst case is M N).
80
Rabin-Karp fingerprint search
Advantages.
・Extends to 2d patterns.
・Extends to finding multiple patterns.
81
Rabin-Karp fingerprint search
Advantages.
・Extends to 2d patterns.
・Extends to finding multiple patterns.
Disadvantages.
・Arithmetic ops slower than char compares. ・Las Vegas version requires backup. ・Poor worst-case guarantee.
81
Rabin-Karp fingerprint search
Advantages.
・Extends to 2d patterns.
・Extends to finding multiple patterns.
Disadvantages.
Q. How would you extend Rabin-Karp to efficiently search for any
one of P possible patterns in a text of length N ?
・Arithmetic ops slower than char compares. ・Las Vegas version requires backup. ・Poor worst-case guarantee.
81
Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt ‣ Boyer-Moore
‣ Rabin-Karp
Knuth-Morris-Pratt substring search
Intuition. Suppose we are searching in text for pattern BAAAAAAAAA.
i
text
after mismatch A B A A A A B A A A A A A A A A
on sixth char
brute-force backs up to try this
and this
and this
and this
and this
B A A A A A A A A A pattern BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA Text pointer backup in substring searching
but no backup is needed
18
Knuth-Morris-Pratt substring search
I・ntuition. Suppose we are searching in text for pattern BAAAAAAAAA. Suppose we match 5 chars in pattern, with mismatch on 6th char.
i
text
after mismatch A B A A A A B A A A A A A A A A
on sixth char
brute-force backs up to try this
and this
and this
and this
and this
B A A A A A A A A A pattern BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA Text pointer backup in substring searching
but no backup is needed
18
Knuth-Morris-Pratt substring search
I・ntuition. Suppose we are searching in text for pattern BAAAAAAAAA. ・Suppose we match 5 chars in pattern, with mismatch on 6th char.
We know previous 6 chars in text are BAAAAB. i
assuming { A, B } alphabet
text
after mismatch A B A A A A B A A A A A A A A A
on sixth char
brute-force backs up to try this
and this
and this
and this
and this
B A A A A A A A A A pattern BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA BAAAAAAAAA
BAAAAAAAAA Text pointer backup in substring searching
but no backup is needed
18
Knuth-Morris-Pratt substring search
I・ntuition. Suppose we are searching in text for pattern BAAAAAAAAA. ・Suppose we match 5 chars in pattern, with mismatch on 6th char.
・We know previous 6 chars in text are BAAAAB. Don't need to back up text pointer!
i
text
aftermismatch A B A A A A B
onsixthchar B A A A A A
assuming { A, B } alphabet
brute-force backs up to try this
and this
and this
A A A A A
A A A A A
B A A A A A B A A A A A
A A
A A A
A A A A
A A A A
B A A A A B A A A B A A B A
A A A A A A A A A
A A A A pattern A A A A A
A A A A A A
and this
and this
but no backup is needed
Text pointer backup in substring searching
18
Knuth-Morris-Pratt substring search
I・ntuition. Suppose we are searching in text for pattern BAAAAAAAAA. ・Suppose we match 5 chars in pattern, with mismatch on 6th char.
text
aftermismatch A B A A A
onsixthchar B A A A
brute-force backs up to try this
i
A B A A
A A A A A A B A
A A A A A A A A A
A A A A pattern A A A A A
A A A A A A
・We know previous 6 chars in text are BAAAAB. Don't need to back up text pointer!
assuming { A, B } alphabet
B A A B A B
and this
and this
A A A A A
AAAAA
A A
A A A
A A A A
A A A A
and this
and this
but no backup is needed
B A A A A A B A A A A A
Text pointer backup in substring searching
Knuth-Morris-Pratt algorithm. Clever method to always avoid backup. (!)
18
Deterministic finite state automaton (DFA)
DFA is abstract string-searching machine.
・Finite number of states (including start and halt). ・Exactly one transition for each char in alphabet. ・Accept if sequence of transitions leads to halt state.
j
pat.charAt(j)
dfa[][j]
C
graphical representation
=,
XCt FA B,
A B A B A C A113151 B020404
000006
alphabet
012345
B,C
A
A
B,C C
A
B 0A1B2A3B4A5C6
C
B,C
Constructing the DFA for KMP substring search for A B A B A C
19
Deterministic finite state automaton (DFA)
DFA is abstract string-searching machine.
・Finite number of states (including start and halt). ・Exactly one transition for each char in alphabet. ・Accept if sequence of transitions leads to halt state.
X
internal representation X t
j012345
j pat.ch0arAt1(j)2A3B4A5 B
pat.charAt(j) A BAA1B1A3C 1 A 1 1 3 1 5 1
A
variable c
If in state j reading char c:
1
5 dfa[][j]B 0 2 0 4 0 4
C
020404
if j is 6 halt and accept
else move to state dfa[c][j]
dfa[][j] B C00C0000006006
graphical representation
B,CAA AA B,C A AB B
0A1B2A3B4A5C6 C0A1B2A3B4A5C6
B,C C C
B,C B,C C
B,C
Constructing the DFA for KMP substring search for A B A B A C
Constructing the DFA for KMP substring search for A B A B A C
19
Knuth-Morris-Pratt demo: DFA simulation
A A B A C A -A B A B A C A A
012345
pat.charAt(j)
dfa[][j]
A
ABABAC A113151 B020404 C000006
B, C A A
B
0A1B2A3B4A5C6 C
B, C
C
B, C
20
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
21
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
21
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
22
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
22
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
23
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
23
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C B, C
24
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C B, C
24
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
B, C A A
B
012345
pat.charAt(j)
dfa[][j]
A
ABABAC A113151 B020404 C000006
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
25
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
B, C A A
B
012345
pat.charAt(j)
dfa[][j]
A
ABABAC A113151 B020404 C000006
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
25
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
26
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
26
Knuth-Morris-Pratt demo: DFA simulation
A A B A C A -A B A B A C A A
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
27
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
27
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
28
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
28
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C B, C
29
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C B, C
29
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
B, C A A
B
012345
pat.charAt(j)
dfa[][j]
A
ABABAC A113151 B020404 C000006
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
30
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
B, C A A
B
012345
pat.charAt(j)
dfa[][j]
A
ABABAC A113151 B020404 C000006
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
30
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
31
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
31
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
32
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
32
Knuth-Morris-Pratt demo: DFA simulation
AABACAABABACAA
pat.charAt(j)
dfa[][j]
A
012345
ABABAC A113151 B020404 C000006
B, C A A
B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
substring found
B, C
33
Interpretation of Knuth-Morris-Pratt DFA
Q. What is interpretation of DFA state after reading in txt[i]?
A
B, C A A
B
0A1B2A3B4A5C6 C
B, C
C
B, C
34
Interpretation of Knuth-Morris-Pratt DFA
Q. What is interpretation of DFA state after reading in txt[i]?
A. State = number of characters in pattern that have been matched.
A
B, C A A
B
0A1B2A3B4A5C6 C
B, C
C
B, C
34
Interpretation of Knuth-Morris-Pratt DFA
Q. What is interpretation of DFA state after reading in txt[i]?
A. State = number of characters in pattern that have been matched.
Ex. DFA is in state 3 after reading in txt[0..6]. i
012345678 012345 txt BCBAABACA pat ABABAC
A
B, C A A
B
0A1B2A3B4A5C6 C
B, C
C
B, C
34
Interpretation of Knuth-Morris-Pratt DFA
Q. What is interpretation of DFA state after reading in txt[i]?
A. State = number of characters in pattern that have been matched.
Ex. DFA is in state 3 after reading in txt[0..6]. i
length of longest prefix of pat[] that is a suffix of txt[0..i]
012345678 012345 txt BCBAABACA pat ABABAC
B, C A A
B
suffix of txt[0..6]
prefix of pat[]
A
0A1B2A3B4A5C6 C
B, C
C
B, C
34
Knuth-Morris-Pratt substring search: Java implementation
Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern. ・Text pointer i never decrements.
public int search(String txt)
{
int i, j, N = txt.length();
for (i = 0, j = 0; i < N && j < M; i++)
if (j == M) return i - M;
else return N;
}
j = dfa[txt.charAt(i)][j];
no backup
35
Knuth-Morris-Pratt substring search: Java implementation
K・ey differences from brute-force implementation. ・Need to precompute dfa[][] from pattern.
Text pointer i never decrements.
Running time.
no backup
public int search(String txt)
{
int i, j, N = txt.length();
for (i = 0, j = 0; i < N && j < M; i++)
if (j == M) return i - M;
else return N;
}
j = dfa[txt.charAt(i)][j];
35
Knuth-Morris-Pratt substring search: Java implementation
Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.
R・unning time.
Simulate DFA on text: at most N character accesses.
no backup
public int search(String txt)
{
int i, j, N = txt.length();
for (i = 0, j = 0; i < N && j < M; i++)
if (j == M) return i - M;
else return N;
}
j = dfa[txt.charAt(i)][j];
35
Knuth-Morris-Pratt substring search: Java implementation
Key differences from brute-force implementation.
no backup
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.
public int search(String txt)
{
int i, j, N = txt.length();
for (i = 0, j = 0; i < N && j < M; i++)
if (j == M) return i - M;
else return N;
}
j = dfa[txt.charAt(i)][j];
R・unning time.
・Simulate DFA on text: at most N character accesses.
Build DFA: how to do efficiently? [warning: tricky algorithm ahead]
35
Knuth-Morris-Pratt substring search: Java implementation
K・ey differences from brute-force implementation. ・Need to precompute dfa[][] from pattern. ・Text pointer i never decrements.
Could use input stream.
public int search(In in)
{
int i, j;
for (i = 0, j = 0; !in.isEmpty() && j < M; i++)
if (j == M) return i - M;
}
else
return NOT_FOUND;
X
j = dfa[in.readChar()][j];
no backup
j012345 pat.charAt(j) A B A B A C A113151 dfa[][j]B 0 2 0 4 0 4 C000006
B,C
AA AB
0A1B2A3B4A5C6
C
B,C C
B,C
Constructing the DFA for KMP substring search for A B A B A C
36
Knuth-Morris-Pratt demo: DFA construction
Include one state for each character in pattern (plus accept state).
012345
ABABAC A
B C
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
37
Knuth-Morris-Pratt demo: DFA construction
Include one state for each character in pattern (plus accept state).
012345
ABABAC A
B C
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
pat.charAt(j)
0123456
37
Knuth-Morris-Pratt demo: DFA construction
Match transition. If in state j and next char c == pat.charAt(j), go to j+1.
first j characters of pattern next char matches now first j+1 characters of
have already been matched
pat.charAt(j)
A B C
pattern have been matched
012345
ABABAC
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
0123456
38
Knuth-Morris-Pratt demo: DFA construction
Match transition. If in state j and next char c == pat.charAt(j), go to j+1.
first j characters of pattern next char matches now first j+1 characters of
have already been matched
pat.charAt(j)
pattern have been matched
012345
ABABAC
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A135 B24 C6
0A1B2A3B4A5C6
38
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
012345
ABABAC A135 B24 C6
0A1B2A3B4A5C6
39
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
012345
ABABAC A135 B02 4 C06
B, C
0A1B2A3B4A5C6
39
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j). mismatch
t A- A
in text by
l shift
i A
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
A B A B A C A135 B02 4 C06
B, C 0A1B2A3B4A5C6
40
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j). f. mismatch
AC
c shift
by
I
⇒
pat.charAt(j)
dfa[][j]
012345
ABABAC A113 5 B02 4 C00 6
Constructing the DFA for KMP substring search for A B A B A C
j
B, C A
0A1B2A3B4A5C6 C
40
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j). ABB
match at all
£
of 012345
BB £
is
match
at all
pat.charAt(j) A B A B A C A113 5
dfa[][j]B 0 2 4 C00 6
¢
Constructing the DFA for KMP substring search for A B A B A C
B, C A
j
0A1B2A3B4A5C6 C
41
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5 B0204 C000 6
B, C A
j
0A1B2A3B4A5C6 C
B, C
41
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5 B0204 C000 6
B, C A
j
0A1B2A3B4A5C6 C
B, C
42
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
j
012345
ABABAC A11315 B0204 C0000 6
0A1B2A3B4A5C6 C
B, C
C
42
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j). ¢ mismatch
B
t BA BB
A- BAD
¥.
so
→
is →
pat.charAt(j)
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
j
012345
A B A B A C A11315
dfa[][j] B 0 2 0 4 C0000 6
0A1B2A3B4A5C6 C
B, C
C
43
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
012345
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
ABABAC A11315 B02040 C000006
pat.charAt(j)
j
0A1B2A3B4A5C6 C
B, C
C
B, C
43
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j). f- mismatch
A- BABA
t. BABAI
↳AS
B
-
match
A B A B A C A11315 B02040 C000006
of first 4
characters of pattern
pat.charAt(j)
dfa[][j]
I
0 1 2 3 4 5
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
j
0A1B2A3B4A5C6 C
B, C
C
B, C
44
Knuth-Morris-Pratt demo: DFA construction
Mismatch transition: back up if c != pat.charAt(j).
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
A
-- .
012345
ABABAC A113151 B020404 C000006
exercise
-
j B
0A1B2A3B4A5C6 C
B, C
C
B, C
44
Knuth-Morris-Pratt demo: DFA construction
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A
B, C A A
0A1B2A3B4A5C6 C
B, C
C
012345
ABABAC A113151 B020404 C000006
B
B, C
45
How to build DFA from pattern?
Include one state for each character in pattern (plus accept state).
012345
ABABAC A
B C
pat.charAt(j)
dfa[][j]
0123456
46
How to build DFA from pattern?
Match transition. If in state j and next char c == pat.charAt(j), go to j+1.
first j characters of pattern next char matches now first j +1 characters of
have already been matched
pattern have been matched
012345
ABABAC A135
B24 C6
pat.charAt(j)
dfa[][j]
0A1B2A3B4A5C6
47
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
j
pat.charAt(j)
j
012345 ABABAC
B, C A A
0A1B2A3B4A5C6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
j
pat.charAt(j)
j
012345 ABABAC
B, C A A
0A1B2A3B4A5C6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1;
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
j
pat.charAt(j)
j
012345 ABABAC
B, C A A
0A1B2A3B4A5C6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
j
pat.charAt(j)
012345 ABABAC
j
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
B, C A A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A'
j
pat.charAt(j)
012345 ABABAC
j
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
B, C A A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
B, C A A
j
pat.charAt(j)
012345 ABABAC
j
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
B, C A A
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
dfa['B'][5] = 4
j
pat.charAt(j)
012345 ABABAC
j
A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
dfa['B'][5] = 4
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
simulate BABA;
j
pat.charAt(j)
012345 ABABAC
j
B, C A A
A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
dfa['B'][5] = 4
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
simulate BABA; take transition 'B'
j
pat.charAt(j)
012345 ABABAC
j
B, C A A
A
simulation of BABA
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]:
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
dfa['B'][5] = 4
Simulate pat[1..j-1] on DFA and take transition c. still under construction (!)
simulate BABA; take transition 'B'
= dfa['B'][3]
j
pat.charAt(j)
012345 ABABAC
B, C A A
A
simulation of BABA
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c.
Running time. Seems to require j steps.
still under construction (!)
Ex. dfa['A'][5] = 1; simulate BABA;
take transition 'A' = dfa['A'][3]
dfa['B'][5] = 4
simulate BABA; take transition 'B'
= dfa['B'][3]
j
pat.charAt(j)
012345 ABABAC
B, C A A
A
simulation of BABA
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
48
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
B, C A A
X
012345 ABABAC
j
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1;
B, C A A
X
012345 ABABAC
j
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
012345 ABABAC
j
B, C A A
X
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A'
012345 ABABAC
j
B, C A A
X
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1;
from state X, take transition 'A'
= dfa['A'][X]
B, C A A
A
X
012345 ABABAC
j
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1;
from state X, take transition 'A'
= dfa['A'][X]
B, C A A
dfa['B'][5] = 4
A
X
012345 ABABAC
j
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
from state X,
012345 ABABAC
j
B, C A A
A
X
0A1B2A3B4A5C6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
from state X, take transition 'B'
012345 ABABAC
j
B, C A A
A
X
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
from state X, take transition 'B'
= dfa['B'][X]
012345 ABABAC
B, C A A
A
X
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
X' = 0
from state X, take transition 'B'
= dfa['B'][X]
012345 ABABAC
B, C A A
A
X
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
X' = 0
from state X,
from state X, take transition 'B'
= dfa['B'][X]
012345 ABABAC
B, C A A
A
X
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
X' = 0
from state X, take transition 'C'
from state X, take transition 'B'
= dfa['B'][X]
012345 ABABAC
B, C A A
A
X
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j),
then the last j-1 characters of input are pat[1..j-1], followed by c.
state X
To compute dfa[c][j]: Simulate pat[1..j-1] on DFA and take transition c. Running time. Takes only constant time if we maintain state X.
Ex. dfa['A'][5] = 1; from state X,
take transition 'A' = dfa['A'][X]
dfa['B'][5] = 4
X' = 0
from state X, take transition 'C'
= dfa['C'][X]
from state X, take transition 'B'
= dfa['B'][X]
012345 ABABAC
B, C A A
A
X
j B
0 A 1 B 2 A 3 B 4 A 5 C 6 C
B, C
C
B, C
49
Knuth-Morris-Pratt demo: DFA construction in linear time
Include one state for each character in pattern (plus accept state).
012345
ABABAC A
B C
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
50
Knuth-Morris-Pratt demo: DFA construction in linear time
Include one state for each character in pattern (plus accept state).
012345
ABABAC A
B C
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
pat.charAt(j)
0123456
50
Knuth-Morris-Pratt demo: DFA construction in linear time
Match transition. For each state j, dfa[pat.charAt(j)][j] = j+1.
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
first j characters of pattern have already been matched
pat.charAt(j)
now first j+1 characters of
pattern have been matched
012345
ABABAC
A B C
0123456
51
Knuth-Morris-Pratt demo: DFA construction in linear time
Match transition. For each state j, dfa[pat.charAt(j)][j] = j+1.
first j characters of pattern now first j+1 characters of
have already been matched pattern have been matched
pat.charAt(j)
012345
ABABAC
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A135 B24 C6
0A1B2A3B4A5C6
51
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition.
set dfa[c][0] = 0.
For state 0 and char c != pat.charAt(j),
pat.charAt(j)
dfa[][j]
012345
ABABAC A135 B24 C6
Constructing the DFA for KMP substring search for A B A B A C
j
0A1B2A3B4A5C6
52
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition.
set dfa[c][0] = 0.
For state 0 and char c != pat.charAt(j),
012345
pat.charAt(j)
dfa[][j]
ABABAC A135 B02 4 C06
Constructing the DFA for KMP substring search for A B A B A C
j
B, C
0A1B2A3B4A5C6
52
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
X = simulation of empty string
012345
ABABAC A1 3 5 B02 4
C0 6
B, C 0A1B2A3B4A5C6
X
53
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of empty string
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
ABABAC A113 5 B02 4 C00 6
B, C A 0A1B2A3B4A5C6
C
X
53
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of empty string
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
j
ABABAC A113 5 B02 4 C00 6
B, C A 0A1B2A3B4A5C6
C
X
53
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of B
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5 B02 4
C 0 0 6
B, C A
j
0A1B2A3B4A5C6 C
X
54
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of B
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5 B0204 C000 6
B, C A
j
0A1B2A3B4A5C6 C
X
B, C
54
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of B
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5 B0204 C000 6
B, C A
j
0A1B2A3B4A5C6 C
X
B, C
54
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of B A
012345
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
ABABAC A113 5
B 0 2 0 4
C 0 0 0 6
B, C A
j
0A1B2A3B4A5C6
C
B, C
X
55
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
j
X = simulation of B A
012345
ABABAC A11315
B 0 2 0 4 C0000 6
0A1B2A3B4A5C6
C
B, C
X
C
55
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
j
X = simulation of B A
012345
ABABAC A11315
B 0 2 0 4 C0000 6
0A1B2A3B4A5C6
C
B, C
X
C
55
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
X = simulation of B A B
012345
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
ABABAC A11315
B 0 2 0 4
C 0 0 0 0 6
pat.charAt(j)
j
0A1B2A3B4A5C6
C
B, C
X
C
56
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
X = simulation of B A B
012345
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
ABABAC A11315 B02040 C000006
j
0A1B2A3B4A5C6
C
B, C
X
C
B, C
56
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
X = simulation of B A B
012345
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
ABABAC A11315 B02040 C000006
j
0A1B2A3B4A5C6
C
B, C
X
C
B, C
56
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
B, C A A
X = simulation of B A B A
012345
ABABAC A 1 1 3 1 5
B 0 2 0 4 0 C000006
j
0A1B2A3B4A5C6
C
B, C
X
C B, C
57
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A
B, C A A
0A1B2A3B4A5C6
X = simulation of B A B A
012345
ABABAC A113151 B020404 C000006
B
j
C
B, C
X
C B, C
57
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A
B, C A A
0A1B2A3B4A5C6
X = simulation of B A B A
012345
ABABAC A113151 B020404 C000006
B
j
C
B, C
X
C B, C
57
Knuth-Morris-Pratt demo: DFA construction in linear time
Mismatch transition. For each state j and char c != pat.charAt(j), set dfa[c][j] = dfa[c][X]; then update X = dfa[pat.charAt(j)][X].
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A
B, C A A
0A1B2A3B4A5C6 C
B, C
C
X = simulation of B A B A C
012345
ABABAC A113151 B020404 C000006
B
j
X
B, C
58
Knuth-Morris-Pratt demo: DFA construction in linear time
pat.charAt(j)
dfa[][j]
Constructing the DFA for KMP substring search for A B A B A C
A
B, C A A
0A1B2A3B4A5C6 C
B, C
C
012345
ABABAC A113151 B020404 C000006
B
B, C
59
Constructing the DFA for KMP substring search: Java implementation
For each state j:
・Copy dfa[][X] to dfa[][j] for mismatch case. ・Set dfa[pat.charAt(j)][j] to j+1 for match case. ・Update X.
public KMP(String pat)
{
this.pat = pat;
M = pat.length();
dfa = new int[R][M];
dfa[pat.charAt(0)][0] = 1;
for (int X = 0, j = 1; j < M; j++)
{
} }
for (int c = 0; c < R; c++)
dfa[c][j] = dfa[c][X];
dfa[pat.charAt(j)][j] = j+1;
X = dfa[pat.charAt(j)][X];
copy mismatch cases set match case update restart state
60
Constructing the DFA for KMP substring search: Java implementation
F・or each state j:
・Copy dfa[][X] to dfa[][j] for mismatch case. ・Set dfa[pat.charAt(j)][j] to j+1 for match case.
Update X.
copy mismatch cases set match case update restart state
public KMP(String pat)
{
this.pat = pat;
M = pat.length();
dfa = new int[R][M];
dfa[pat.charAt(0)][0] = 1;
for (int X = 0, j = 1; j < M; j++)
{
} }
for (int c = 0; c < R; c++)
dfa[c][j] = dfa[c][X];
dfa[pat.charAt(j)][j] = j+1;
X = dfa[pat.charAt(j)][X];
Running time. M character accesses (but space/time proportional to R M).
60
KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.
61
KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.
61
KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.
Proposition. KMP constructs dfa[][] in time and space proportional to R M.
61
KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars
to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA;
each text char accessed once (in the worst case) when simulating the DFA.
internal representation
012345 next[j] 0 0 0 0 0 3
j
pat.charAt(j) A B A B A C
Proposition. KMP constructs dfa[][] in time and space proportional to R M.
Larger alphabets. Improved version of KMP constructs nfa[] in time and
space proportional to M. graphical representation
mismatch transition (back up at least one state)
0A1B2A3B4A5C6
KMP NFA for ABABAC
NFA corresponding to the string A B A B A C 61
Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
62
Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
– Knuth: inspired by esoteric theorem, discovered linear algorithm
Don Knuth
62
Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
– Knuth: inspired by esoteric theorem, discovered linear algorithm – Pratt: made running time independent of alphabet size
Don Knuth Vaughan Pratt
62
Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
– Knuth: inspired by esoteric theorem, discovered linear algorithm – Pratt: made running time independent of alphabet size
– Morris: built a text editor for the CDC 6400 computer
Don Knuth Jim Morris Vaughan Pratt
62
Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
– Knuth: inspired by esoteric theorem, discovered linear algorithm – Pratt: made running time independent of alphabet size
・– Morris: built a text editor for the CDC 6400 computer Theory meets practice.
SIAM J. COMPUT. Vol. 6, No. 2, June 1977
FAST PATTERN MATCHING IN STRINGS*
DONALD E. KNUTHf, JAMES H. MORRIS, JR.:l: AND VAUGHAN R. PRATT
Abstract. An algorithm is presented which finds all occurrences of one. given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be
extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language {can}*, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.
Key words, pattern, string, text-editing, pattern-matching, trie memory, searching, period of a string, palindrome, optimum algorithm, Fibonacci string, regular expression
Text-editing programs are often required to search through a string of characters looking for instances of a given "pattern" string; we wish to find all positions, or perhaps only the leftmost position, in which the pattern occurs as a contiguoussubstringofthetext.Forexample,ca enarycontainsthepattern
en,butwedonotregardcanaryasasubstring.
Don Knuth Jim Morris Vaughan Pratt
The obvious way to search for a matching pattern is to try searching at every starting position of the text, abandoning the search as soon as an incorrect
62