ITBN710 Lecture 8
Queensland University of Technology
CRICOS No. 00213J
CAB301 Algorithms and Complexity
Lecture 12 — String matching algorithms
Maolin Tang
School of Computer Science
Queensland University of Technology
CRICOS No. 00213J
a university for the worldreal R 2
Aims of this lecture
• In this lecture we study string processing algorithms:
– Brute force algorithm
– Horspool’s algorithm
– Boyer-Moore’ algorithm
CRICOS No. 00213J
a university for the worldreal R 3
References
• A. Levitin. Introduction to the Design and Analysis of Algorithms.
Addison-Wesley, third edition, 2012. Sections 3.2 and 7.2.
• T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein,
Introduction to Algorithms, third edition, 2009. Chapter 32.
CRICOS No. 00213J
a university for the worldreal R 4
Introduction
• String-matching is the problem of finding occurrence(s) of
a pattern string within another string or body of text.
• There are many string-matching problems in the real world.
– Text-editing programs frequently need to a string-matching
algorithm to find all occurrences of a pattern in the text.
– DNA sequencing uses a string matching algorithm for pattern
matching.
– Information security also uses a string matching algorithm for
pattern matching.
• There are a number of efficient string-matching algorithms.
https://xlinux.nist.gov/dads/HTML/pattern.html
https://xlinux.nist.gov/dads/HTML/string.html
CRICOS No. 00213J
a university for the worldreal R
BRUTE FORCE ALGORITHM
5
CRICOS No. 00213J
a university for the worldreal R 6
Brute force algorithm
• Pattern: a string of m characters to search for
• Text: a (long) string of n characters to search in
• Brute force algorithm:
1. Align pattern at beginning of text
2. moving from left to right, compare each character of pattern to
the corresponding character in text until
• all characters are found to match (successful search); or
• a mismatch is detected
3. while pattern is not found and the text is not yet exhausted,
realign pattern one position to the right and repeat step 2.
CRICOS No. 00213J
a university for the worldreal R 7
Brute force algorithm
Text:
Pattern:
CRICOS No. 00213J
a university for the worldreal R 8
Brute force algorithm
CRICOS No. 00213J
a university for the worldreal R 9
Efficiency of the brute force algorithm
• We may have to perform m comparisons before shifting the pattern.
• This can happen for each of the n-m+1 tries.
• Therefore, the worst case complexity is O(mn).
• For random texts, it has been shown to be linear Θ(n).
CRICOS No. 00213J
a university for the worldreal R
HORSPOOL’S ALGORITHM
10
CRICOS No. 00213J
a university for the worldreal R
Horspool’s algorithm
• Starting with the last character, R, of the pattern and moving right to
left, we compare the corresponding pairs of characters in the pattern
and the text.
• If all the pattern’s characters match successfully, a matching
substring is found.
• If a mismatch occurs, we need to shift the pattern to the right as
much as possible without risking the possibility of missing a
matching substring in the text.
• This is achieved through “input enhancement”
11
CRICOS No. 00213J
a university for the worldreal R
How to determine the size of such a shift?
• Consider the following example:
• By looking at the character c of the text that is aligned against the
last character of the pattern.
• Case 1: If there is no c in the pattern, then shift the pattern by its
entire length.
12
CRICOS No. 00213J
a university for the worldreal R
How to determine the size of such a shift?
• Case 2: If there are occurrences of c in the pattern, but it is not the
last one, then shift the pattern such that the rightmost occurrence of
c in the pattern is aligned with the c in the text.
13
CRICOS No. 00213J
a university for the worldreal R
How to determine the size of such a shift?
• Case 3: If c is the last character in the pattern and there is no c
among the rest m-1 characters in the pattern, then shift the pattern
by its entire length.
14
CRICOS No. 00213J
a university for the worldreal R
How to determine the size of such a shift?
• Case 4: If c is the last character in the pattern, but c is also among
the rest m-1 characters in the pattern, then shift the pattern such that
the rightmost occurrence of c in the rest m-1 characters in the
pattern is aligned with the c in the text.
15
CRICOS No. 00213J
a university for the worldreal R
Shift table
16
CRICOS No. 00213J
a university for the worldreal R 17
Pattern: barbaric
After the first for loop:
Table[a]=8; Table[b]=8; Table[c]=8; Table[i]=8; Table[r]=8; Table[*]=8.
The second for loop:
When j=0: P[0]=b; Table[b]=m-1-j=8-1-0=7
When j=1: P[1]=a; Table[a]=m-1-j=8-1-1=6
When j=2: P[2]=r; Table[r]=m-1-j=8-1-2=5
When j=3: P[3]=b; Table[b]=m-1-j=8-1-3=4
When j=4: P[4]=a; Table[a]=m-1-j=8-1-4=3
When j=5: P[5]=r; Table[r]=m-1-j=8-1-5=2
When j=6: P[6]=i; Table[i]=m-1-j=8-1-6=1
CRICOS No. 00213J
a university for the worldreal R
Horspool’s algorithm
18
CRICOS No. 00213J
a university for the worldreal R 19
Text = the_artic_sarcastic_barbaric_bar
Pattern = barbaric
Shift table
Table[a] = 3; Table[b] = 4; Table[i] = 1; Table[r] = 2; Table[*] = 8
–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[i]=) 1 character (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 3)
–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[t]=) 8 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[a]=) 3 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbaric
Pattern found at position 21 (#comparisons = 8)
CRICOS No. 00213J
a university for the worldreal R 20
Text = the_artic_sarcastic_barbaric_bar
Pattern = barbarik
Shift table
Table[a] = 3; Table[b] = 4; Table[i] = 1; Table[r] = 2; Table[*] = 8
–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[i]=) 1 character (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[t]=) 8 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[a]=) 3 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 1)
–> the_artic_sarcastic_barbaric_bar
–> barbarik (Pattern not found)
CRICOS No. 00213J
a university for the worldreal R 21
Efficiency of Horspool’s algorithm
• The worst case efficiency of Horspool’s algorithm is O(mn).
• For random texts, it has been shown to be Θ(n).
CRICOS No. 00213J
a university for the worldreal R
BOYER-MOORE ALGORITHM
22
CRICOS No. 00213J
a university for the worldreal R 23
Boyer–Moore algorithm
• Another string matching algorithm using input enhancement
• It precomputes two tables
– one table calculates how many positions ahead to start the next
search based on the identity of the character that caused the
match attempt to fail.
– the other makes a similar calculation based on how many
characters were matched successfully before the match attempt
failed.
CRICOS No. 00213J
a university for the worldreal R
Bad-symbol shift
• Case 1: b does not occur in x
• Example:
24
CRICOS No. 00213J
a university for the worldreal R
Bad-symbol shift
• Case 2: b occurs in x
• Example:
25
CRICOS No. 00213J
a university for the worldreal R
Bad-symbol shift
•
26
CRICOS No. 00213J
a university for the worldreal R
Good-suffix shift
• Case 1: Suffix u preceded by a character c different from a in
pattern x re-occurs
27
CRICOS No. 00213J
a university for the worldreal R
Good-suffix shift
•
28
CRICOS No. 00213J
a university for the worldreal R
Good-suffix shift
•
29
CRICOS No. 00213J
a university for the worldreal R 30
ALGORITHM BoyerMooreMatching(P[0..m-1], T[0..n-1])
//Input: Pattern (P[0..m-1] and text T[0..n-1])
//Output: The index of the left end of the first matching substring or -1 if there are no matches
create bad-symbol shift table t1 and good-suffix shift table t2
i ← m – 1
while i ≤ n-1 do
k ← 0
while k ≤ m – 1 and P[m – 1 – k] = T[i – k] do
k ← k + 1
if k = m
return i – m + 1
else
if k = 0 // use bad-symbol shift table
i ← i + t1(T[i])
else
d1 ← max {t1(T[i – k]) – k, 1}
d2 ← t2(suff(k))
i ← i + max {d1, d2}
return -1
CRICOS No. 00213J
a university for the worldreal R 31
Text = tgacccttctatgggcgctccgatacgccgacttatccga
Pattern = taattaat
Bad symbol shift table
[a, 1] [t, 3] [*, 8]
Good suffix shift table
[t, 3] [at, 7] [aat, 7] [taat, 4] [ttaat, 4] [attaat, 4] [aattaat, 4]
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 3 characters
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 1 character
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 7 characters
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 7 characters
CRICOS No. 00213J
a university for the worldreal R 32
Text = tgacccttctatgggcgctccgatacgccgacttatccga
Pattern = taattaat
Bad symbol shift table
[a, 1] [t, 3] [*, 8]
Good suffix shift table
[t, 3] [at, 7] [aat, 7] [taat, 4] [ttaat, 4] [attaat, 4] [aattaat, 4]
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 8 characters
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 3 characters
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 8 characters
–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
Pattern not found
CRICOS No. 00213J
a university for the worldreal R 33
Text = bess_knew_about_baobabs
Pattern = baobab
Bad character shift table
[a, 1] [b, 2] [o, 3] [*, 6]
Good suffix shift table
[b, 2] [ab, 5] [bab, 5] [obab, 5] [aobab, 5]
–> bess_knew_about_baobabs
–> baobab
bad symbol shift by 6 characters
–> bess_knew_about_baobabs
–> baobab
good suffix shift by 5 characters
–> bess_knew_about_baobabs
–> baobab
bad symbol shift by 5 characters
–> bess_knew_about_baobabs
–> baobab
Pattern found at position 17
CRICOS No. 00213J
a university for the worldreal R 34
Efficiency of Boyer-Moore algorithm
• The time efficiency of Boyer-Moore algorithm is O(n).
CAB301 Algorithms and Complexity��Lecture 12 — String matching algorithms
Aims of this lecture
References
Introduction
BRUTE FORCE ALGORITHM
Brute force algorithm
Brute force algorithm
Slide Number 8
Efficiency of the brute force algorithm
HORSPOOL’S ALGORITHM
Horspool’s algorithm
How to determine the size of such a shift?
How to determine the size of such a shift?
How to determine the size of such a shift?
How to determine the size of such a shift?
Shift table
Slide Number 17
Horspool’s algorithm
Slide Number 19
Slide Number 20
Efficiency of Horspool’s algorithm
BOYER-MOORE ALGORITHM
Boyer–Moore algorithm
Bad-symbol shift
Bad-symbol shift
Bad-symbol shift
Good-suffix shift
Good-suffix shift
Good-suffix shift
Slide Number 30
Slide Number 31
Slide Number 32
Slide Number 33
Efficiency of Boyer-Moore algorithm