CS计算机代考程序代写 DNA algorithm ITBN710 Lecture 8

ITBN710 Lecture 8

Queensland University of Technology

CRICOS No. 00213J

CAB301 Algorithms and Complexity

Lecture 12 — String matching algorithms

Maolin Tang
School of Computer Science

Queensland University of Technology

CRICOS No. 00213J

a university for the worldreal R 2

Aims of this lecture

• In this lecture we study string processing algorithms:
– Brute force algorithm
– Horspool’s algorithm
– Boyer-Moore’ algorithm

CRICOS No. 00213J

a university for the worldreal R 3

References

• A. Levitin. Introduction to the Design and Analysis of Algorithms.
Addison-Wesley, third edition, 2012. Sections 3.2 and 7.2.

• T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein,
Introduction to Algorithms, third edition, 2009. Chapter 32.

CRICOS No. 00213J

a university for the worldreal R 4

Introduction

• String-matching is the problem of finding occurrence(s) of
a pattern string within another string or body of text.

• There are many string-matching problems in the real world.
– Text-editing programs frequently need to a string-matching

algorithm to find all occurrences of a pattern in the text.
– DNA sequencing uses a string matching algorithm for pattern

matching.
– Information security also uses a string matching algorithm for

pattern matching.
• There are a number of efficient string-matching algorithms.

https://xlinux.nist.gov/dads/HTML/pattern.html
https://xlinux.nist.gov/dads/HTML/string.html

CRICOS No. 00213J

a university for the worldreal R

BRUTE FORCE ALGORITHM

5

CRICOS No. 00213J

a university for the worldreal R 6

Brute force algorithm

• Pattern: a string of m characters to search for
• Text: a (long) string of n characters to search in

• Brute force algorithm:
1. Align pattern at beginning of text
2. moving from left to right, compare each character of pattern to

the corresponding character in text until
• all characters are found to match (successful search); or
• a mismatch is detected

3. while pattern is not found and the text is not yet exhausted,
realign pattern one position to the right and repeat step 2.

CRICOS No. 00213J

a university for the worldreal R 7

Brute force algorithm

Text:
Pattern:

CRICOS No. 00213J

a university for the worldreal R 8

Brute force algorithm

CRICOS No. 00213J

a university for the worldreal R 9

Efficiency of the brute force algorithm

• We may have to perform m comparisons before shifting the pattern.
• This can happen for each of the n-m+1 tries.
• Therefore, the worst case complexity is O(mn).
• For random texts, it has been shown to be linear Θ(n).

CRICOS No. 00213J

a university for the worldreal R

HORSPOOL’S ALGORITHM

10

CRICOS No. 00213J

a university for the worldreal R

Horspool’s algorithm

• Starting with the last character, R, of the pattern and moving right to
left, we compare the corresponding pairs of characters in the pattern
and the text.

• If all the pattern’s characters match successfully, a matching
substring is found.

• If a mismatch occurs, we need to shift the pattern to the right as
much as possible without risking the possibility of missing a
matching substring in the text.

• This is achieved through “input enhancement”

11

CRICOS No. 00213J

a university for the worldreal R

How to determine the size of such a shift?

• Consider the following example:

• By looking at the character c of the text that is aligned against the
last character of the pattern.

• Case 1: If there is no c in the pattern, then shift the pattern by its
entire length.

12

CRICOS No. 00213J

a university for the worldreal R

How to determine the size of such a shift?

• Case 2: If there are occurrences of c in the pattern, but it is not the
last one, then shift the pattern such that the rightmost occurrence of
c in the pattern is aligned with the c in the text.

13

CRICOS No. 00213J

a university for the worldreal R

How to determine the size of such a shift?

• Case 3: If c is the last character in the pattern and there is no c
among the rest m-1 characters in the pattern, then shift the pattern
by its entire length.

14

CRICOS No. 00213J

a university for the worldreal R

How to determine the size of such a shift?

• Case 4: If c is the last character in the pattern, but c is also among
the rest m-1 characters in the pattern, then shift the pattern such that
the rightmost occurrence of c in the rest m-1 characters in the
pattern is aligned with the c in the text.

15

CRICOS No. 00213J

a university for the worldreal R

Shift table

16

CRICOS No. 00213J

a university for the worldreal R 17

Pattern: barbaric

After the first for loop:
Table[a]=8; Table[b]=8; Table[c]=8; Table[i]=8; Table[r]=8; Table[*]=8.

The second for loop:
When j=0: P[0]=b; Table[b]=m-1-j=8-1-0=7
When j=1: P[1]=a; Table[a]=m-1-j=8-1-1=6
When j=2: P[2]=r; Table[r]=m-1-j=8-1-2=5
When j=3: P[3]=b; Table[b]=m-1-j=8-1-3=4
When j=4: P[4]=a; Table[a]=m-1-j=8-1-4=3
When j=5: P[5]=r; Table[r]=m-1-j=8-1-5=2
When j=6: P[6]=i; Table[i]=m-1-j=8-1-6=1

CRICOS No. 00213J

a university for the worldreal R

Horspool’s algorithm

18

CRICOS No. 00213J

a university for the worldreal R 19

Text = the_artic_sarcastic_barbaric_bar
Pattern = barbaric
Shift table
Table[a] = 3; Table[b] = 4; Table[i] = 1; Table[r] = 2; Table[*] = 8

–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[i]=) 1 character (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 3)

–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[t]=) 8 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbaric
Shift Pattern by (Table[a]=) 3 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbaric
Pattern found at position 21 (#comparisons = 8)

CRICOS No. 00213J

a university for the worldreal R 20

Text = the_artic_sarcastic_barbaric_bar
Pattern = barbarik
Shift table
Table[a] = 3; Table[b] = 4; Table[i] = 1; Table[r] = 2; Table[*] = 8

–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[i]=) 1 character (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[t]=) 8 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[a]=) 3 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbarik
Shift Pattern by (Table[c]=) 8 characters (#comparisons = 1)

–> the_artic_sarcastic_barbaric_bar
–> barbarik (Pattern not found)

CRICOS No. 00213J

a university for the worldreal R 21

Efficiency of Horspool’s algorithm

• The worst case efficiency of Horspool’s algorithm is O(mn).
• For random texts, it has been shown to be Θ(n).

CRICOS No. 00213J

a university for the worldreal R

BOYER-MOORE ALGORITHM

22

CRICOS No. 00213J

a university for the worldreal R 23

Boyer–Moore algorithm

• Another string matching algorithm using input enhancement
• It precomputes two tables

– one table calculates how many positions ahead to start the next
search based on the identity of the character that caused the
match attempt to fail.

– the other makes a similar calculation based on how many
characters were matched successfully before the match attempt
failed.

CRICOS No. 00213J

a university for the worldreal R

Bad-symbol shift

• Case 1: b does not occur in x

• Example:

24

CRICOS No. 00213J

a university for the worldreal R

Bad-symbol shift

• Case 2: b occurs in x

• Example:

25

CRICOS No. 00213J

a university for the worldreal R

Bad-symbol shift

26

CRICOS No. 00213J

a university for the worldreal R

Good-suffix shift

• Case 1: Suffix u preceded by a character c different from a in
pattern x re-occurs

27

CRICOS No. 00213J

a university for the worldreal R

Good-suffix shift

28

CRICOS No. 00213J

a university for the worldreal R

Good-suffix shift

29

CRICOS No. 00213J

a university for the worldreal R 30

ALGORITHM BoyerMooreMatching(P[0..m-1], T[0..n-1])
//Input: Pattern (P[0..m-1] and text T[0..n-1])
//Output: The index of the left end of the first matching substring or -1 if there are no matches

create bad-symbol shift table t1 and good-suffix shift table t2
i ← m – 1
while i ≤ n-1 do

k ← 0
while k ≤ m – 1 and P[m – 1 – k] = T[i – k] do

k ← k + 1
if k = m

return i – m + 1
else

if k = 0 // use bad-symbol shift table
i ← i + t1(T[i])

else
d1 ← max {t1(T[i – k]) – k, 1}
d2 ← t2(suff(k))
i ← i + max {d1, d2}

return -1

CRICOS No. 00213J

a university for the worldreal R 31

Text = tgacccttctatgggcgctccgatacgccgacttatccga
Pattern = taattaat
Bad symbol shift table
[a, 1] [t, 3] [*, 8]
Good suffix shift table
[t, 3] [at, 7] [aat, 7] [taat, 4] [ttaat, 4] [attaat, 4] [aattaat, 4]

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 3 characters

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 1 character

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 7 characters

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 7 characters

CRICOS No. 00213J

a university for the worldreal R 32

Text = tgacccttctatgggcgctccgatacgccgacttatccga
Pattern = taattaat
Bad symbol shift table
[a, 1] [t, 3] [*, 8]
Good suffix shift table
[t, 3] [at, 7] [aat, 7] [taat, 4] [ttaat, 4] [attaat, 4] [aattaat, 4]

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 8 characters

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
good suffix shift by 3 characters

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
bad symbol shift by 8 characters

–> tgacccttctatgggcgctccgatacgccgacttatccga
–> taattaat
Pattern not found

CRICOS No. 00213J

a university for the worldreal R 33

Text = bess_knew_about_baobabs
Pattern = baobab
Bad character shift table
[a, 1] [b, 2] [o, 3] [*, 6]
Good suffix shift table
[b, 2] [ab, 5] [bab, 5] [obab, 5] [aobab, 5]

–> bess_knew_about_baobabs
–> baobab
bad symbol shift by 6 characters

–> bess_knew_about_baobabs
–> baobab
good suffix shift by 5 characters

–> bess_knew_about_baobabs
–> baobab
bad symbol shift by 5 characters

–> bess_knew_about_baobabs
–> baobab
Pattern found at position 17

CRICOS No. 00213J

a university for the worldreal R 34

Efficiency of Boyer-Moore algorithm

• The time efficiency of Boyer-Moore algorithm is O(n).

CAB301 Algorithms and Complexity��Lecture 12 — String matching algorithms
Aims of this lecture
References
Introduction
BRUTE FORCE ALGORITHM
Brute force algorithm
Brute force algorithm
Slide Number 8
Efficiency of the brute force algorithm
HORSPOOL’S ALGORITHM
Horspool’s algorithm
How to determine the size of such a shift?
How to determine the size of such a shift?
How to determine the size of such a shift?
How to determine the size of such a shift?
Shift table
Slide Number 17
Horspool’s algorithm
Slide Number 19
Slide Number 20
Efficiency of Horspool’s algorithm
BOYER-MOORE ALGORITHM
Boyer–Moore algorithm
Bad-symbol shift
Bad-symbol shift
Bad-symbol shift
Good-suffix shift
Good-suffix shift
Good-suffix shift
Slide Number 30
Slide Number 31
Slide Number 32
Slide Number 33
Efficiency of Boyer-Moore algorithm