COMP20007 Design of Algorithms
Input Enhancement Part 2: String Searching
Daniel Beck
Lecture 18
Semester 1, 2020
1
String Search – Recap
2
String Search – Recap
• Goal: given text of size n, find a string (pattern) of size m.
2
String Search – Recap
• Goal: given text of size n, find a string (pattern) of size m.
• Brute force algorithm: O(m × n).
2
String Search – Brute Force
JIM SAW ME IN A BARBERSHOP BARBER
3
Input Enhancement
• Longer shifts can be made if we know statistics about the pattern.
4
Input Enhancement
• Longer shifts can be made if we know statistics about the pattern.
• Same way we could sort arrays more efficiently by knowing statistics about the array (Counting Sort).
4
Input Enhancement
• Longer shifts can be made if we know statistics about the pattern.
• Same way we could sort arrays more efficiently by knowing statistics about the array (Counting Sort).
• The Horspool’s algorithm use this idea to make string search faster.
4
Input Enhancement
• Longer shifts can be made if we know statistics about the pattern.
• Same way we could sort arrays more efficiently by knowing statistics about the array (Counting Sort).
• The Horspool’s algorithm use this idea to make string search faster.
• Key idea: scan the text from left to right but scan the pattern from right to left.
4
Longer shifts – Case 1
The last character is not in the pattern.
JIM SAW ME IN A BARBERSHOP BARBER
Shift the whole pattern.
5
Longer shifts – Case 2
The last character does not match but it is in the pattern.
JIM SAW ME IN A BARBERSHOP BARBER
Shift the pattern until the last occurence of the character.
6
Longer shifts – Case 3
The last character matches but one of the m − 1 characters does not match and the last character is unique.
JIM SAW ME IN A BARBERSHOP SEESAW
Shift the whole pattern.
7
Longer shifts – Case 4
The last character matches but one of the m − 1 characters does not match and the last character is not uniques
JIM SAW ME IN A BARBERSHOP REORDER
Shift the pattern until the last occurence of the character.
8
Horspool – Preprocessing
• The number of allowed shifts depends on the character type only.
9
Horspool – Preprocessing
• The number of allowed shifts depends on the character type only.
• Horspool builds a dictionary with all characters in the alphabet and their corresponding allowed skips for a pattern.
9
Horspool – Preprocessing
• The number of allowed shifts depends on the character type only.
• Horspool builds a dictionary with all characters in the alphabet and their corresponding allowed skips for a pattern.
characterc A B C D E F … R … Z shiftt(c) 42661663666
9
Horspool – FindShifts
function FindShifts(P[0..m − 1]) for i ← 0 to a do
Shift[i] ← m
for j ← 0 to m − 2 do Shift[P[j]] ← m − (j + 1)
10
Horspool – Algorithm
function Horspool(P [0..m − 1], T [0..n − 1]) FindShifts(P )
i←m−1
while i < n do
k←0
while k < m and P[m − 1 − k] = T[i − k] do
k←k+1 if k = m then
return i −m+1 else
i ← i + Shift [T [i ]] return −1
◃ We have a match ◃ Start of the match
◃ Slide the pattern along
11
Horspool - Properties
• Worst-case still O(m × n).
• For random strings, it’s linear and faster in practice compared to the brute force version.
12
Other String Search algorithms
• Boyer-Moore: extends Horspool to allow shifts based on suffixes.
13
Other String Search algorithms
• Boyer-Moore: extends Horspool to allow shifts based on suffixes.
• Knuth-Morris-Pratt: also preprocess the patten but builds a finite-state automaton.
13
Other String Search algorithms
• Boyer-Moore: extends Horspool to allow shifts based on suffixes.
• Knuth-Morris-Pratt: also preprocess the patten but builds a finite-state automaton.
• Rabin-Karp: uses hash functions to filter negative matches.
• Needs a rolling hash function to be efficient.
13
Summary and Practical Uses
• String search algorithms can be sped up by input enhancement.
14
Summary and Practical Uses
• String search algorithms can be sped up by input enhancement.
• Allocates extra memory to preprocess the pattern.
14
Summary and Practical Uses
• String search algorithms can be sped up by input enhancement.
• Allocates extra memory to preprocess the pattern. • Horspool uses a dictionary of shifts.
14
Summary and Practical Uses
• String search algorithms can be sped up by input enhancement.
• Allocates extra memory to preprocess the pattern.
• Horspool uses a dictionary of shifts.
• The Boyer-Moore extension is one of the most used string searching algorithms, for instance in the “grep” Linx command line tool.
14
Summary and Practical Uses
• String search algorithms can be sped up by input enhancement.
• Allocates extra memory to preprocess the pattern.
• Horspool uses a dictionary of shifts.
• The Boyer-Moore extension is one of the most used string searching algorithms, for instance in the “grep” Linx command line tool.
Next week: trade memory for speed by storing intermediate solutions.
14