COMP9024: Data Structures and Algorithms

Text Processing
Contents
• Pattern matching
• Tries
• Greedy method
• Text compression
• Dynamic programming
Pattern Matching
Strings
• A string is a sequence of
characters
• Examples of strings:
• Java program
• HTML document
• DNA sequence
• Digitized image
• An alphabet Σ is the set of
possible characters for a family of
strings
• Examples of alphabets:
• ASCII
• Unicode
• {0, 1}
• {A, C, G, T}
• Let P be a string of size m
• A substring P[i .. j] of P is the subsequence of P consisting of the characters with indices between i and j
• A prefix of P is a substring of the
type P[0 .. i]
• A suffix of P is a substring of the
type P[i ..m − 1]
• Given strings T (text) and P
(pattern), the pattern matching
problem consists of finding a
substring of T equal to P
• Applications:
• Text editors
• Search engines
• Biological research
Brute-Force Pattern Matching
• The brute-force pattern matching
algorithm compares the pattern
P with the text T for each
possible shift of P relative to T,
until either
• a match is found, or
• all placements of the pattern
have been tried
• Brute-force pattern matching
runs in time O(nm)
• Example of worst case:
• T = aaa … ah
• P = aaah
• may occur in images and DNA
sequences
• unlikely in English text
Algorithm BruteForceMatch(T, P)
Input: text T of size n and pattern P of size m
Output: starting index of a substring of T equal to P, or −1 if no such substring exists
{ for ( i = 0; i < n − m + 1; i++ )
  { // test shift i of the pattern
    j = 0;
    while ( j < m ∧ T[i + j] = P[j] )
      j = j + 1;
    if ( j = m )
      return i; // match at i
  }
  return −1; // no match anywhere
}
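For concreteness, here is a minimal C version of the pseudocode above (our own sketch, not from the lecture; it assumes plain ASCII strings):

#include <stdio.h>
#include <string.h>

// Brute-force pattern matching: returns the starting index of the first
// occurrence of P in T, or -1 if P does not occur in T.
int brute_force_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    for (int i = 0; i + m <= n; i++) {    // test shift i of the pattern
        int j = 0;
        while (j < m && T[i + j] == P[j])
            j++;
        if (j == m)
            return i;                     // match at i
    }
    return -1;                            // no match anywhere
}

int main(void) {
    printf("%d\n", brute_force_match("abacaabaccabacabaabb", "abacab")); // prints 10
    return 0;
}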
Boyer-Moore Heuristics
• The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: Compare P with a subsequence of T moving
backwards
Character-jump heuristic: When a mismatch occurs at T[i] = c
• If P contains c, shift P to align the last occurrence of c in P with T[i]
• Else, shift P to align P[0] with T[i + 1]
• Example
[Figure: Boyer-Moore run of P = "rithm" on T = "a pattern matching algorithm"; comparing right to left and jumping on mismatches finds the match after only 11 character comparisons.]
Last-Occurrence Function
• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet
Σ to build the last-occurrence function L mapping Σ to integers, where
L(c) is defined as
• the largest index i such that P[i] = c or
• −1 if no such index exists
• Example:
• Σ = {a, b, c, d}
• P = abacab
• The last-occurrence function can be represented by an array indexed by
the numeric codes of the characters
• The last-occurrence function can be computed in time O(m + s), where
m is the size of P and s is the size of Σ
  c     a  b  c  d
  L(c)  4  5  3  −1
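A C sketch of this preprocessing step (our own code; it assumes an ASCII alphabet, so s = 128):

#include <string.h>

#define SIGMA 128  // alphabet size s, assuming ASCII

// Fill L[c] with the largest index i such that P[i] == c, or -1 if c does
// not occur in P.  Runs in O(m + s) time.
void last_occurrence(const char *P, int L[SIGMA]) {
    int m = (int)strlen(P);
    for (int c = 0; c < SIGMA; c++)
        L[c] = -1;                        // default: c does not occur in P
    for (int i = 0; i < m; i++)
        L[(unsigned char)P[i]] = i;       // later occurrences overwrite earlier ones
}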
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
{ L = lastOccurrenceFunction(P, Σ);
  i = m − 1;
  j = m − 1;
  repeat
    if ( T[i] = P[j] )
    { if ( j = 0 )
        return i; // match at i
      else
      { i = i − 1;
        j = j − 1; }
    }
    else // character-jump
    { l = L[T[i]];
      i = i + m − min(j, 1 + l);
      j = m − 1; }
  until ( i > n − 1 );
  return −1; // no match
}
[Figure: Case 1, j ≤ 1 + l: the last occurrence of c = T[i] in P is at an index l with 1 + l ≥ j, so aligning it would not move P forward; P slides right by one position and i advances by m − j.]
[Figure: Case 2, 1 + l ≤ j: P is shifted right to align its last occurrence of c = T[i] (index l) with T[i], and i advances by m − (1 + l).]
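Putting the last-occurrence function and the shift rule together, a self-contained C sketch of the algorithm (our own; ASCII alphabet assumed):

#include <stdio.h>
#include <string.h>

#define SIGMA 128  // alphabet size, assuming ASCII

static int min(int a, int b) { return a < b ? a : b; }

// Boyer-Moore matching with the looking-glass and character-jump
// heuristics: returns the first match index, or -1 if there is none.
int boyer_moore_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    if (m == 0) return 0;                          // empty pattern matches at 0
    int L[SIGMA];
    for (int c = 0; c < SIGMA; c++) L[c] = -1;     // last-occurrence function
    for (int k = 0; k < m; k++) L[(unsigned char)P[k]] = k;
    int i = m - 1, j = m - 1;
    while (i <= n - 1) {
        if (T[i] == P[j]) {
            if (j == 0) return i;                  // match at i
            i--; j--;                              // keep comparing backwards
        } else {                                   // character-jump
            int l = L[(unsigned char)T[i]];
            i = i + m - min(j, 1 + l);
            j = m - 1;
        }
    }
    return -1;                                     // no match
}

int main(void) {
    printf("%d\n", boyer_moore_match("abacaabadcabacabaabb", "abacab")); // prints 10
    return 0;
}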
Example
[Figure: Boyer-Moore run of P = "abacab" on T = "abacaabadcabacabaabb"; the match at index 10 is found after 13 character comparisons.]
Analysis
• Boyer-Moore’s algorithm runs in
time O(nm + s)
• Example of worst case:
• T = aaa … a
• P = baaa
• The worst case may occur in
images and DNA sequences but
is unlikely in English text
• Boyer-Moore’s algorithm is
significantly faster than the
brute-force algorithm on English
text
[Figure: worst-case run with T = "aaaaaaaaa" and P = "baaaaa"; every alignment costs m comparisons and shifts the pattern by only one position, for 24 comparisons on this small example.]
The KMP Algorithm
• Knuth-Morris-Pratt’s algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
• When a mismatch occurs,
what is the most we can shift
the pattern so as to avoid
redundant comparisons?
• Answer: the largest prefix of
P[0..j-1] that is a suffix of
P[1..j-1]
[Figure: a mismatch at text character x while matching P = "abaaba"; the pattern is shifted so that the prefix of P matching a suffix of the matched portion lines up with the text, so there is no need to repeat those comparisons, and comparing resumes at the mismatched text character.]
KMP Failure Function
• Knuth-Morris-Pratt’s algorithm
preprocesses the pattern to find
matches of prefixes of the
pattern with the pattern itself
• The failure function F(j) is
defined as the size of the largest
prefix of P[0..j] that is also a
suffix of P[1..j]
• Knuth-Morris-Pratt’s algorithm
modifies the brute-force
algorithm so that if a mismatch
occurs at P[j] ≠ T[i] we set j ←
F(j − 1)
  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3
[Figure: after a mismatch at P[j], the pattern "abaaba" is shifted so that position F(j − 1) of P lines up with the mismatched text character x.]
The KMP Algorithm (1/2)
Algorithm KMPMatch(T, P)
{ F = failureFunction(P);
i = 0;
j = 0;
while ( i < n )
if ( T[i] = P[j] )
{ if ( j = m − 1 )
return i − j ; // match
else
{ i = i + 1; j = j + 1;}
}
else
if ( j > 0 )
j = F[j − 1];
else
i = i + 1;
return −1; // no match
}
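A C sketch of KMPMatch (our own code; it relies on a kmp_failure helper that fills in the failure function, sketched after the next slide):

#include <stdlib.h>
#include <string.h>

void kmp_failure(const char *P, int m, int *F);  // sketched after the next slide

// KMP matching: returns the index where the first match of P in T starts,
// or -1 if there is no match.  Assumes a non-empty pattern.
int kmp_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    int *F = malloc(m * sizeof *F);      // failure function of P
    kmp_failure(P, m, F);
    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == m - 1) { result = i - j; break; }  // match at i - j
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];                // reuse the partial match; i stays put
        } else {
            i++;
        }
    }
    free(F);
    return result;
}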
The KMP Algorithm (2/2)
• The failure function can be represented by an array and can be computed in
O(m) time
• At each iteration of the while-loop, either
• i increases by one, or the shift amount i − j increases by at least one (observe
that F(j − 1) < j)
• Suppose the loop body were executed 2n times. Since i and i − j both start at 0, never decrease, and at least one of them increases by at least one in each iteration, after 2n iterations we would have i + (i − j) ≥ 2n. Therefore 2i ≥ 2n + j, i.e., i ≥ n + j/2. Since j ≥ 0, this gives i ≥ n, so the loop must terminate within 2n iterations.
• Hence, there are no more than 2n iterations of the while-loop
• Thus, KMP’s algorithm runs in optimal time O(m + n)
Computing the Failure
Function
• The failure function can be
represented by an array and can
be computed in O(m) time
• The construction is similar to the
KMP algorithm itself
• At each iteration of the while-
loop, either
• i increases by one, or
• the shift amount i − j increases
by at least one (observe that F(j
− 1) < j)
• Hence, there are no more than
2m iterations of the while-loop
Algorithm failureFunction(P)
{ F[0] = 0;
  i = 1;
  j = 0;
  while ( i < m )
    if ( P[i] = P[j] )
    { // we have matched j + 1 characters
      F[i] = j + 1;
      i = i + 1;
      j = j + 1; }
    else if ( j > 0 )
      // use the failure function to shift P
      j = F[j − 1];
    else
    { F[i] = 0; // no match
      i = i + 1; }
  return F;
}
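The matching C sketch of failureFunction, which completes the kmp_match example above (our own code):

// F[j] = length of the longest prefix of P[0..j] that is also a suffix of
// P[1..j].  Runs in O(m) time.
void kmp_failure(const char *P, int m, int *F) {
    F[0] = 0;
    int i = 1, j = 0;
    while (i < m) {
        if (P[i] == P[j]) {              // we have matched j + 1 characters
            F[i] = j + 1;
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];                // use the failure function to shift P
        } else {
            F[i] = 0;                    // no match
            i++;
        }
    }
}

With these two functions, kmp_match("abacaabaccabacabaabb", "abacab") returns 10, matching the example on the next slide.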
Example
[Figure: KMP run of P = "abacab" on T = "abacaabaccabacabaabb"; the match is found after 19 character comparisons.]

  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2
Tries
Preprocessing Strings
• Preprocessing the pattern speeds up pattern matching
queries
• After preprocessing the pattern, KMP’s algorithm performs pattern
matching in time proportional to the text size
• If the text is large, immutable, and often searched (e.g., the works of Shakespeare), we may want to preprocess the text instead of the pattern
• A trie is a compact data structure for representing a set of
strings, such as all the words in a text
• A trie supports pattern matching queries in time proportional to the pattern size
Standard Tries
• The standard trie for a set of strings S is an ordered tree such that:
• Each node but the root is labeled with a character
• The children of a node are alphabetically ordered
• The paths from the root to the external nodes yield the strings of S
• Example: standard trie for the set of strings
S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: the standard trie for S; e.g., the root-to-leaf path b–e–a–r spells "bear" and the path s–t–o–p spells "stop".]
Analysis of Standard Tries
• A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
  • n is the total size of the strings in S
  • m is the size of the string parameter of the operation
  • d is the size of the alphabet
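A minimal C sketch of a standard trie with insert and search (our own code; it assumes a lowercase alphabet of size d = 26 and stores the children of each node in a fixed-size array):

#include <stdlib.h>

#define D 26  // alphabet size d, assuming lowercase words

typedef struct TrieNode {
    struct TrieNode *child[D];  // one slot per character, in alphabetical order
    int is_word;                // marks the node ending a string of S
} TrieNode;

static TrieNode *trie_node(void) {
    return calloc(1, sizeof(TrieNode));  // all children NULL, is_word = 0
}

// Insert string s, creating at most one node per character: O(dm) time.
void trie_insert(TrieNode *root, const char *s) {
    for (; *s; s++) {
        int c = *s - 'a';
        if (!root->child[c]) root->child[c] = trie_node();
        root = root->child[c];
    }
    root->is_word = 1;
}

// Search for string s: O(dm) time.
int trie_search(const TrieNode *root, const char *s) {
    for (; *s; s++) {
        root = root->child[*s - 'a'];
        if (!root) return 0;
    }
    return root->is_word;
}

For example, after inserting the eight strings of S above, trie_search finds "stock" but not "sto", which is only a prefix of a string of S.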
Word Matching with a Trie
• We insert the
words of the text
into a trie
• Each leaf stores
the occurrences
of the associated
word in the text
[Figure: the text "see a bear? sell stock! see a bull? buy stock! bid stock, bid stock! hear the bell? stop!" with character indices 0–88, and its word trie; e.g., the leaf for "see" stores occurrences 0 and 24, "stock" stores 17, 40, 51 and 62, "bid" stores 47 and 58, and "stop" stores 84.]
Compressed Tries
• A compressed trie has
internal nodes of degree at
least two
• It is obtained from a standard trie by compressing chains of “redundant” nodes
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop } and the compressed trie obtained from it: under b are e (with children "ar" and "ll"), "id", and u (with children "ll" and "y"); under s are "ell" and "to" (with children "ck" and "p").]
Compact Representation
• Compact representation of a compressed trie for an array of strings:
• Stores at the nodes ranges of indices instead of substrings
• Uses O(s) space, where s is the number of strings in the array
• Serves as an auxiliary index structure
[Figure: compact representation of the compressed trie for the array S[0] = "see", S[1] = "bear", S[2] = "sell", S[3] = "stock", S[4] = "bull", S[5] = "buy", S[6] = "bid", S[7] = "hear", S[8] = "bell", S[9] = "stop"; each node stores a triple (i, j, k) denoting the substring S[i][j..k], e.g., (1, 2, 3) stands for S[1][2..3] = "ar".]
Suffix Trie
• The suffix trie of a string X is the compressed trie of all the suffixes of X
[Figure: the suffix trie of X = "minimize" (character indices 0–7); the children of the root are e, i, mi, nimize and ze; the node i has children mize, nimize and ze, and the node mi has children nimize and ze.]
Pattern Matching Using Suffix Trie
(1/2)
Algorithm suffixTrieMatch(T, P)
{ p = P.length; j = 0; v = T.root();
  repeat
  { childTraversed = false;
    for each child w of v do
    { i = start(w); // start(w) is the start index of the label of w
      if ( P[j] = X[i] ) // process child w
      { childTraversed = true;
        x = end(w) − i + 1; // end(w) is the end index of the label of w
        if ( p ≤ x )
        { // the remaining pattern is no longer than the label of w
          if ( P[j..j + p − 1] = X[i..i + p − 1] ) return i − j;
          else return “P is not a substring of X”; }
        else
        { // the pattern goes beyond the substring stored at w
          if ( P[j..j + x − 1] = X[i..i + x − 1] )
          { p = p − x; // update the remaining pattern length
            j = j + x; // update the pattern start index
            v = w; break; }
          else return “P is not a substring of X”; } } }
  } until childTraversed = false or T.isExternal(v);
  return “P is not a substring of X”; }
Pattern Matching Using Suffix Trie (2/2)
• Input of the algorithm:
• Compact suffix trie T for a text X and pattern P.
• Output of the algorithm:
• Starting index of a substring of X matching P or an indication that P is not
a substring.
• The algorithm assumes the following additional property on the labels
of the nodes in the compact representation of the suffix trie:
• If node v has label (i, j) and Y is the string of length y associated with the
path from the root to v (included), then X[j-y+1..j]=Y.
• This property ensures that we can easily compute the start index of the
pattern in the text when a match occurs.
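The pseudocode above works on the compact suffix trie; as a simpler self-contained illustration, the following C sketch (our own) builds an uncompressed suffix trie, which supports the same substring queries but needs O(n²) rather than O(n) time and space to build:

#include <stdio.h>
#include <stdlib.h>

#define D 26  // assuming lowercase text

typedef struct Node { struct Node *child[D]; } Node;

static Node *new_node(void) { return calloc(1, sizeof(Node)); }

// Build the (uncompressed) suffix trie of X by inserting every suffix.
Node *suffix_trie(const char *X) {
    Node *root = new_node();
    for (int s = 0; X[s]; s++) {           // insert suffix X[s..]
        Node *v = root;
        for (int k = s; X[k]; k++) {
            int c = X[k] - 'a';
            if (!v->child[c]) v->child[c] = new_node();
            v = v->child[c];
        }
    }
    return root;
}

// P is a substring of X iff P spells out a path starting at the root.
int is_substring(const Node *root, const char *P) {
    for (; *P; P++) {
        root = root->child[*P - 'a'];
        if (!root) return 0;
    }
    return 1;
}

int main(void) {
    Node *t = suffix_trie("minimize");
    printf("%d %d\n", is_substring(t, "nimi"), is_substring(t, "mien")); // prints 1 0
    return 0;
}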
Analysis of Suffix Tries
• Compact representation of the suffix trie for a string X of
size n from an alphabet of size d
• Uses O(n) space
• Supports arbitrary pattern matching queries in X in O(dm) time,
where m is the size of the pattern
• Can be constructed in O(n) time
[Figure: the compact representation of the suffix trie of X = "minimize" (indices 0–7); nodes store index ranges into X, such as (1, 1), (2, 7), (6, 7) and (7, 7), instead of explicit substrings.]
Greedy Method and Text Compression
The Greedy Method Technique
• The greedy method is a general algorithm design
paradigm, built on the following elements:
• configurations: different choices, collections, or values to
find
• objective function: a score assigned to configurations,
which we want to either maximize or minimize
• It works best when applied to problems with the
greedy-choice property:
• a globally-optimal solution can always be found by a series
of local improvements from a starting configuration.
Text Compression
• Given a string X, efficiently encode X into a smaller
string Y
• Saves memory and/or bandwidth
• A good approach: Huffman encoding
• Compute frequency f(c) for each character c.
• Encode high-frequency characters with short code
words
• No code word is a prefix for another code
• Use an optimal encoding tree to determine the code
words
Encoding Tree Example
• A code is a mapping of each character of an alphabet to a binary code-
word
• A prefix code is a binary code such that no code-word is the prefix of
another code-word
• An encoding tree represents a prefix code
• Each external node stores a character
• The code word of a character is given by the path from the root to the
external node storing the character (0 for a left child and 1 for a right child)
[Figure: an encoding tree whose external nodes store the characters a, b, c, d and e.]

  character  a   b    c    d   e
  code-word  00  010  011  10  11
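To see how a prefix code is decoded, here is a small C sketch (our own) that hard-codes the example tree above and walks it bit by bit, emitting a character and restarting at the root whenever an external node is reached:

#include <stdio.h>

// Encoding-tree node: external nodes carry a character and have no children.
typedef struct Node { char ch; const struct Node *left, *right; } Node;

// The example tree: a = 00, b = 010, c = 011, d = 10, e = 11.
static const Node A = {'a', 0, 0}, B = {'b', 0, 0}, C = {'c', 0, 0},
                  D = {'d', 0, 0}, E = {'e', 0, 0};
static const Node BC = {0, &B, &C}, ABC = {0, &A, &BC}, DE = {0, &D, &E};
static const Node ROOT = {0, &ABC, &DE};

// Walk left on '0' and right on '1'; a prefix code guarantees that every
// external node reached ends exactly one code-word.
void decode(const char *bits) {
    const Node *v = &ROOT;
    for (; *bits; bits++) {
        v = (*bits == '0') ? v->left : v->right;
        if (!v->left) {              // external node: code-word complete
            putchar(v->ch);
            v = &ROOT;
        }
    }
    putchar('\n');
}

int main(void) {
    decode("010110010");             // prints "bead" (010 11 00 10)
    return 0;
}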
Encoding Tree Optimization
• Given a text string X, we want to find a prefix code for the characters of X that
yields a small encoding for X
• Frequent characters should have short code-words
• Rare characters should have long code-words
• Example
• X = abracadabra
• T1 encodes X into 29 bits
• T2 encodes X into 24 bits
[Figure: two encoding trees T1 and T2 for the characters a, b, c, d and r; T2 gives the frequent character a a shorter code-word.]
Huffman’s Algorithm
• Given a string X, Huffman’s algorithm constructs a prefix code that minimizes the size of the encoding of X
• It runs in time
O(n + d log d), where n
is the size of X and d is
the number of distinct
characters of X
• A heap-based priority
queue is used as an
auxiliary structure
Algorithm HuffmanEncoding(X)
Input string X of size n
Output optimal encoding trie for X
{
C = distinctCharacters(X);
computeFrequencies(C, X);
Q = new empty heap;
for all c ∈ C
{ T = new single-node tree storing c;
Q.insert(getFrequency(c), T); }
while ( Q.size() > 1 )
{ f1 = Q.minKey();
T1 = Q.removeMin();
f2 = Q.minKey();
T2 = Q.removeMin();
T = join(T1, T2);
Q.insert(f1 + f2, T);
}
return Q.removeMin();
}
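A compact C sketch of the algorithm (our own code; for brevity it uses a linear-scan priority queue instead of a heap, so it does not achieve the O(n + d log d) bound, but it builds the same kind of tree):

#include <stdio.h>
#include <stdlib.h>

typedef struct Tree {
    int freq;                        // total frequency of the leaves below
    char ch;                         // character (meaningful at leaves only)
    struct Tree *left, *right;
} Tree;

static Tree *make(int f, char c, Tree *l, Tree *r) {
    Tree *t = malloc(sizeof *t);
    t->freq = f; t->ch = c; t->left = l; t->right = r;
    return t;
}

// Remove and return a minimum-frequency tree from q[0..*n-1] (linear scan).
static Tree *remove_min(Tree **q, int *n) {
    int best = 0;
    for (int k = 1; k < *n; k++)
        if (q[k]->freq < q[best]->freq) best = k;
    Tree *t = q[best];
    q[best] = q[--*n];
    return t;
}

// Build a Huffman tree for X (assumed to have at least two distinct chars).
Tree *huffman(const char *X) {
    int f[256] = {0};
    for (; *X; X++) f[(unsigned char)*X]++;      // compute frequencies
    Tree *q[256]; int n = 0;
    for (int c = 0; c < 256; c++)                // one single-node tree per char
        if (f[c]) q[n++] = make(f[c], (char)c, NULL, NULL);
    while (n > 1) {                              // greedily join two rarest trees
        Tree *t1 = remove_min(q, &n);
        Tree *t2 = remove_min(q, &n);
        q[n++] = make(t1->freq + t2->freq, 0, t1, t2);
    }
    return q[0];
}

static void print_codes(const Tree *t, char *buf, int depth) {
    if (!t->left) { buf[depth] = '\0'; printf("%c: %s\n", t->ch, buf); return; }
    buf[depth] = '0'; print_codes(t->left, buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void) {
    char buf[256];
    print_codes(huffman("abracadabra"), buf, 0);  // one optimal prefix code
    return 0;
}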
Example
X = abracadabra

  character  a  b  c  d  r
  frequency  5  2  1  1  2

[Figure: the greedy rounds: c and d are joined into a tree of weight 2, then b and r into a tree of weight 4, then these two trees into one of weight 6, and finally a (weight 5) is joined with it to form the final tree of weight 11.]
Extended Huffman Tree Example
[Figure: the Huffman tree for a longer example string.]
The Fractional Knapsack Problem
• Given: A set S of n items, with each item i having
  • b_i – a positive benefit
  • w_i – a positive weight
• Goal: Choose items with maximum total benefit but with weight at most W.
• If we are allowed to take fractional amounts, then this is the fractional knapsack problem.
  • In this case, we let x_i denote the amount we take of item i
• Objective: maximize ∑_{i∈S} b_i (x_i / w_i)
• Constraint: ∑_{i∈S} x_i ≤ W
Example
• Given: A set S of n items, with each item i having
  • b_i – a positive benefit
  • w_i – a positive weight
• Goal: Choose items with maximum total benefit but with weight at most W.

  Items:             1      2      3      4      5
  Weight:            4 ml   8 ml   2 ml   6 ml   1 ml
  Benefit:           $12    $32    $40    $30    $50
  Value ($ per ml):  3      4      20     5      50

• Knapsack capacity: 10 ml
• Solution: 1 ml of item 5, 2 ml of item 3, 6 ml of item 4, 1 ml of item 2 (total benefit $124)
The Fractional Knapsack Algorithm
• Greedy choice: Keep taking the item with the highest value (benefit-to-weight ratio)
  • Since ∑_{i∈S} b_i (x_i / w_i) = ∑_{i∈S} (b_i / w_i) x_i
• Run time: O(n log n). Why? (Sorting the items by value dominates.)
• Correctness: Suppose there is a better solution
  • Then there is an item i with higher value than a chosen item j, but x_i < w_i and x_j > 0
  • Substituting some of item j with item i increases the total benefit, contradicting the optimality of that solution
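A C sketch of the greedy algorithm (our own), which sorts the items by value and then fills the knapsack, taking a fraction only of the last item used:

#include <stdio.h>
#include <stdlib.h>

typedef struct { double benefit, weight; } Item;

// Order items by value (benefit/weight), highest first.
static int by_value_desc(const void *a, const void *b) {
    const Item *x = a, *y = b;
    double vx = x->benefit / x->weight, vy = y->benefit / y->weight;
    return (vx < vy) - (vx > vy);
}

// Greedy fractional knapsack: maximum total benefit with weight at most W.
// O(n log n) time, dominated by the sort.
double fractional_knapsack(Item *items, int n, double W) {
    qsort(items, n, sizeof(Item), by_value_desc);
    double total = 0.0;
    for (int i = 0; i < n && W > 0; i++) {
        double take = items[i].weight < W ? items[i].weight : W; // all that fits
        total += take * (items[i].benefit / items[i].weight);
        W -= take;
    }
    return total;
}

int main(void) {
    // The five items from the example slide; knapsack capacity 10 ml.
    Item items[] = {{12, 4}, {32, 8}, {40, 2}, {30, 6}, {50, 1}};
    printf("%.0f\n", fractional_knapsack(items, 5, 10.0)); // prints 124
    return 0;
}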