COMP9024: Data Structures and Algorithms

Text Processing
Contents
• Pattern matching
• Tries
• Greedy method
• Text compression
• Dynamic programming
Pattern Matching
Strings
• A string is a sequence of
characters
• Examples of strings:
• Java program
• HTML document
• DNA sequence
• Digitized image
• An alphabet Σ is the set of
possible characters for a family of
strings
• Examples of alphabets:
• ASCII
• Unicode
• {0, 1}
• {A, C, G, T}
• Let P be a string of size m
• A substring P[i .. j] of P is the subsequence of P consisting of the characters with indices between i and j
• A prefix of P is a substring of the
type P[0 .. i]
• A suffix of P is a substring of the
type P[i ..m − 1]
• Given strings T (text) and P
(pattern), the pattern matching
problem consists of finding a
substring of T equal to P
• Applications:
• Text editors
• Search engines
• Biological research
Brute-Force Pattern Matching
• The brute-force pattern matching
algorithm compares the pattern
P with the text T for each
possible shift of P relative to T,
until either
• a match is found, or
• all placements of the pattern
have been tried
• Brute-force pattern matching
runs in time O(nm)
• Example of worst case:
• T = aaa … ah
• P = aaah
• may occur in images and DNA
sequences
• unlikely in English text
Algorithm BruteForceMatch(T, P)
Input: text T of size n and pattern P of size m
Output: starting index of a substring of T equal to P, or −1 if no such substring exists
{ for ( i = 0; i < n − m + 1; i++ )
  { // test shift i of the pattern
    j = 0;
    while ( j < m ∧ T[i + j] = P[j] )
      j = j + 1;
    if ( j = m )
      return i; // match at i
  }
  return −1; // no match anywhere
}
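For concreteness, here is a minimal C version of the pseudocode above (our own sketch, not from the lecture; it assumes plain ASCII strings):

#include <stdio.h>
#include <string.h>

// Brute-force pattern matching: returns the starting index of the first
// occurrence of P in T, or -1 if P does not occur in T.
int brute_force_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    for (int i = 0; i + m <= n; i++) {    // test shift i of the pattern
        int j = 0;
        while (j < m && T[i + j] == P[j])
            j++;
        if (j == m)
            return i;                     // match at i
    }
    return -1;                            // no match anywhere
}

int main(void) {
    printf("%d\n", brute_force_match("abacaabaccabacabaabb", "abacab")); // prints 10
    return 0;
}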
Boyer-Moore Heuristics
• The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: Compare P with a subsequence of T moving
backwards
Character-jump heuristic: When a mismatch occurs at T[i] = c
• If P contains c, shift P to align the last occurrence of c in P with T[i]
• Else, shift P to align P[0] with T[i + 1]
• Example
[Figure: Boyer-Moore run of P = "rithm" on T = "a pattern matching algorithm"; comparing right to left and jumping on mismatches finds the match after only 11 character comparisons.]
Last-Occurrence Function
• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet
Σ to build the last-occurrence function L mapping Σ to integers, where
L(c) is defined as
• the largest index i such that P[i] = c or
• −1 if no such index exists
• Example:
• Σ = {a, b, c, d}
• P = abacab
• The last-occurrence function can be represented by an array indexed by
the numeric codes of the characters
• The last-occurrence function can be computed in time O(m + s), where
m is the size of P and s is the size of Σ
  c     a  b  c  d
  L(c)  4  5  3  −1
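A C sketch of this preprocessing step (our own code; it assumes an ASCII alphabet, so s = 128):

#include <string.h>

#define SIGMA 128  // alphabet size s, assuming ASCII

// Fill L[c] with the largest index i such that P[i] == c, or -1 if c does
// not occur in P.  Runs in O(m + s) time.
void last_occurrence(const char *P, int L[SIGMA]) {
    int m = (int)strlen(P);
    for (int c = 0; c < SIGMA; c++)
        L[c] = -1;                        // default: c does not occur in P
    for (int i = 0; i < m; i++)
        L[(unsigned char)P[i]] = i;       // later occurrences overwrite earlier ones
}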
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
{ L = lastOccurrenceFunction(P, Σ);
  i = m − 1;
  j = m − 1;
  repeat
    if ( T[i] = P[j] )
    { if ( j = 0 )
        return i; // match at i
      else
      { i = i − 1;
        j = j − 1; }
    }
    else // character-jump
    { l = L[T[i]];
      i = i + m − min(j, 1 + l);
      j = m − 1; }
  until ( i > n − 1 );
  return −1; // no match
}
[Figure: Case 1, j ≤ 1 + l: the last occurrence of c = T[i] in P is at an index l with 1 + l ≥ j, so aligning it would not move P forward; P slides right by one position and i advances by m − j.]
[Figure: Case 2, 1 + l ≤ j: P is shifted right to align its last occurrence of c = T[i] (index l) with T[i], and i advances by m − (1 + l).]
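Putting the last-occurrence function and the shift rule together, a self-contained C sketch of the algorithm (our own; ASCII alphabet assumed):

#include <stdio.h>
#include <string.h>

#define SIGMA 128  // alphabet size, assuming ASCII

static int min(int a, int b) { return a < b ? a : b; }

// Boyer-Moore matching with the looking-glass and character-jump
// heuristics: returns the first match index, or -1 if there is none.
int boyer_moore_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    if (m == 0) return 0;                          // empty pattern matches at 0
    int L[SIGMA];
    for (int c = 0; c < SIGMA; c++) L[c] = -1;     // last-occurrence function
    for (int k = 0; k < m; k++) L[(unsigned char)P[k]] = k;
    int i = m - 1, j = m - 1;
    while (i <= n - 1) {
        if (T[i] == P[j]) {
            if (j == 0) return i;                  // match at i
            i--; j--;                              // keep comparing backwards
        } else {                                   // character-jump
            int l = L[(unsigned char)T[i]];
            i = i + m - min(j, 1 + l);
            j = m - 1;
        }
    }
    return -1;                                     // no match
}

int main(void) {
    printf("%d\n", boyer_moore_match("abacaabadcabacabaabb", "abacab")); // prints 10
    return 0;
}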
Example
[Figure: Boyer-Moore run of P = "abacab" on T = "abacaabadcabacabaabb"; the match at index 10 is found after 13 character comparisons.]
Analysis
• Boyer-Moore’s algorithm runs in
time O(nm + s)
• Example of worst case:
• T = aaa … a
• P = baaa
• The worst case may occur in
images and DNA sequences but
is unlikely in English text
• Boyer-Moore’s algorithm is
significantly faster than the
brute-force algorithm on English
text
[Figure: worst-case run with T = "aaaaaaaaa" and P = "baaaaa"; every alignment costs m comparisons and shifts the pattern by only one position, for 24 comparisons on this small example.]
The KMP Algorithm
• Knuth-Morris-Pratt’s algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
• When a mismatch occurs,
what is the most we can shift
the pattern so as to avoid
redundant comparisons?
• Answer: the largest prefix of
P[0..j-1] that is a suffix of
P[1..j-1]
[Figure: a mismatch at text character x while matching P = "abaaba"; the pattern is shifted so that the prefix of P matching a suffix of the matched portion lines up with the text, so there is no need to repeat those comparisons, and comparing resumes at the mismatched text character.]
KMP Failure Function
• Knuth-Morris-Pratt’s algorithm
preprocesses the pattern to find
matches of prefixes of the
pattern with the pattern itself
• The failure function F(j) is
defined as the size of the largest
prefix of P[0..j] that is also a
suffix of P[1..j]
• Knuth-Morris-Pratt’s algorithm
modifies the brute-force
algorithm so that if a mismatch
occurs at P[j] ≠ T[i] we set j ←
F(j − 1)
  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3
[Figure: after a mismatch at P[j], the pattern "abaaba" is shifted so that position F(j − 1) of P lines up with the mismatched text character x.]
The KMP Algorithm (1/2)
Algorithm KMPMatch(T, P)
{ F = failureFunction(P);
i = 0;
j = 0;
while ( i < n )
if ( T[i] = P[j] )
{ if ( j = m − 1 )
return i − j ; // match
else
{ i = i + 1; j = j + 1;}
}
else
if ( j > 0 )
j = F[j − 1];
else
i = i + 1;
return −1; // no match
}
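A C sketch of KMPMatch (our own code; it relies on a kmp_failure helper that fills in the failure function, sketched after the next slide):

#include <stdlib.h>
#include <string.h>

void kmp_failure(const char *P, int m, int *F);  // sketched after the next slide

// KMP matching: returns the index where the first match of P in T starts,
// or -1 if there is no match.  Assumes a non-empty pattern.
int kmp_match(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    int *F = malloc(m * sizeof *F);      // failure function of P
    kmp_failure(P, m, F);
    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == m - 1) { result = i - j; break; }  // match at i - j
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];                // reuse the partial match; i stays put
        } else {
            i++;
        }
    }
    free(F);
    return result;
}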
The KMP Algorithm (2/2)
• The failure function can be represented by an array and can be computed in
O(m) time
• At each iteration of the while-loop, either
• i increases by one, or the shift amount i − j increases by at least one (observe
that F(j − 1) < j)
• Suppose the loop body were executed 2n times. Since i and i − j both start at 0, never decrease, and at least one of them increases by at least one in each iteration, after 2n iterations we would have i + (i − j) ≥ 2n. Therefore 2i ≥ 2n + j, i.e., i ≥ n + j/2. Since j ≥ 0, this gives i ≥ n, so the loop must terminate within 2n iterations.
• Hence, there are no more than 2n iterations of the while-loop
• Thus, KMP’s algorithm runs in optimal time O(m + n)
Computing the Failure
Function
• The failure function can be
represented by an array and can
be computed in O(m) time
• The construction is similar to the
KMP algorithm itself
• At each iteration of the while-
loop, either
• i increases by one, or
• the shift amount i − j increases
by at least one (observe that F(j
− 1) < j)
• Hence, there are no more than
2m iterations of the while-loop
Algorithm failureFunction(P)
{ F[0] = 0;
  i = 1;
  j = 0;
  while ( i < m )
    if ( P[i] = P[j] )
    { // we have matched j + 1 characters
      F[i] = j + 1;
      i = i + 1;
      j = j + 1; }
    else if ( j > 0 )
      // use the failure function to shift P
      j = F[j − 1];
    else
    { F[i] = 0; // no match
      i = i + 1; }
  return F;
}
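The matching C sketch of failureFunction, which completes the kmp_match example above (our own code):

// F[j] = length of the longest prefix of P[0..j] that is also a suffix of
// P[1..j].  Runs in O(m) time.
void kmp_failure(const char *P, int m, int *F) {
    F[0] = 0;
    int i = 1, j = 0;
    while (i < m) {
        if (P[i] == P[j]) {              // we have matched j + 1 characters
            F[i] = j + 1;
            i++; j++;
        } else if (j > 0) {
            j = F[j - 1];                // use the failure function to shift P
        } else {
            F[i] = 0;                    // no match
            i++;
        }
    }
}

With these two functions, kmp_match("abacaabaccabacabaabb", "abacab") returns 10, matching the example on the next slide.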
Example
[Figure: KMP run of P = "abacab" on T = "abacaabaccabacabaabb"; the match is found after 19 character comparisons.]

  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2
Tries
Preprocessing Strings
• Preprocessing the pattern speeds up pattern matching
queries
• After preprocessing the pattern, KMP’s algorithm performs pattern
matching in time proportional to the text size
• If the text is large, immutable, and often searched (e.g., the works of Shakespeare), we may want to preprocess the text instead of the pattern
• A trie is a compact data structure for representing a set of
strings, such as all the words in a text
• A trie supports pattern matching queries in time proportional to the pattern size
Standard Tries
• The standard trie for a set of strings S is an ordered tree such that:
• Each node but the root is labeled with a character
• The children of a node are alphabetically ordered
• The paths from the root to the external nodes yield the strings of S
• Example: standard trie for the set of strings
S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: the standard trie for S; e.g., the root-to-leaf path b–e–a–r spells "bear" and the path s–t–o–p spells "stop".]
Analysis of Standard Tries
• A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
  • n is the total size of the strings in S
  • m is the size of the string parameter of the operation
  • d is the size of the alphabet
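A minimal C sketch of a standard trie with insert and search (our own code; it assumes a lowercase alphabet of size d = 26 and stores the children of each node in a fixed-size array):

#include <stdlib.h>

#define D 26  // alphabet size d, assuming lowercase words

typedef struct TrieNode {
    struct TrieNode *child[D];  // one slot per character, in alphabetical order
    int is_word;                // marks the node ending a string of S
} TrieNode;

static TrieNode *trie_node(void) {
    return calloc(1, sizeof(TrieNode));  // all children NULL, is_word = 0
}

// Insert string s, creating at most one node per character: O(dm) time.
void trie_insert(TrieNode *root, const char *s) {
    for (; *s; s++) {
        int c = *s - 'a';
        if (!root->child[c]) root->child[c] = trie_node();
        root = root->child[c];
    }
    root->is_word = 1;
}

// Search for string s: O(dm) time.
int trie_search(const TrieNode *root, const char *s) {
    for (; *s; s++) {
        root = root->child[*s - 'a'];
        if (!root) return 0;
    }
    return root->is_word;
}

For example, after inserting the eight strings of S above, trie_search finds "stock" but not "sto", which is only a prefix of a string of S.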
Word Matching with a Trie
• We insert the
words of the text
into a trie
• Each leaf stores
the occurrences
of the associated
word in the text
[Figure: the text "see a bear? sell stock! see a bull? buy stock! bid stock, bid stock! hear the bell? stop!" with character indices 0–88, and its word trie; e.g., the leaf for "see" stores occurrences 0 and 24, "stock" stores 17, 40, 51 and 62, "bid" stores 47 and 58, and "stop" stores 84.]
Compressed Tries
• A compressed trie has
internal nodes of degree at
least two
• It is obtained from a standard trie by compressing chains of “redundant” nodes
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop } and the compressed trie obtained from it: under b are e (with children "ar" and "ll"), "id", and u (with children "ll" and "y"); under s are "ell" and "to" (with children "ck" and "p").]
Compact Representation
• Compact representation of a compressed trie for an array of strings:
• Stores at the nodes ranges of indices instead of substrings
• Uses O(s) space, where s is the number of strings in the array
• Serves as an auxiliary index structure
[Figure: compact representation of the compressed trie for the array S[0] = "see", S[1] = "bear", S[2] = "sell", S[3] = "stock", S[4] = "bull", S[5] = "buy", S[6] = "bid", S[7] = "hear", S[8] = "bell", S[9] = "stop"; each node stores a triple (i, j, k) denoting the substring S[i][j..k], e.g., (1, 2, 3) stands for S[1][2..3] = "ar".]
Suffix Trie
• The suffix trie of a string X is the compressed trie of all the suffixes of X
[Figure: the suffix trie of X = "minimize" (character indices 0–7); the children of the root are e, i, mi, nimize and ze; the node i has children mize, nimize and ze, and the node mi has children nimize and ze.]
Pattern Matching Using Suffix Trie
(1/2)
Algorithm suffixTrieMatch(T, P)
{ p = P.length; j = 0; v = T.root();
  repeat
  { childTraversed = false;
    for each child w of v do
    { i = start(w); // start(w) is the start index of the label of w
      if ( P[j] = X[i] ) // process child w
      { childTraversed = true;
        x = end(w) − i + 1; // end(w) is the end index of the label of w
        if ( p ≤ x )
        { // the remaining pattern is no longer than the label of w
          if ( P[j..j + p − 1] = X[i..i + p − 1] ) return i − j;
          else return “P is not a substring of X”; }
        else
        { // the pattern goes beyond the substring stored at w
          if ( P[j..j + x − 1] = X[i..i + x − 1] )
          { p = p − x; // update the remaining pattern length
            j = j + x; // update the pattern start index
            v = w; break; }
          else return “P is not a substring of X”; } } }
  } until childTraversed = false or T.isExternal(v);
  return “P is not a substring of X”; }
Pattern Matching Using Suffix Trie (2/2)
• Input of the algorithm:
• Compact suffix trie T for a text X and pattern P.
• Output of the algorithm:
• Starting index of a substring of X matching P or an indication that P is not
a substring.
• The algorithm assumes the following additional property on the labels
of the nodes in the compact representation of the suffix trie:
• If node v has label (i, j) and Y is the string of length y associated with the
path from the root to v (included), then X[j-y+1..j]=Y.
• This property ensures that we can easily compute the start index of the
pattern in the text when a match occurs.
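The pseudocode above works on the compact suffix trie; as a simpler self-contained illustration, the following C sketch (our own) builds an uncompressed suffix trie, which supports the same substring queries but needs O(n²) rather than O(n) time and space to build:

#include <stdio.h>
#include <stdlib.h>

#define D 26  // assuming lowercase text

typedef struct Node { struct Node *child[D]; } Node;

static Node *new_node(void) { return calloc(1, sizeof(Node)); }

// Build the (uncompressed) suffix trie of X by inserting every suffix.
Node *suffix_trie(const char *X) {
    Node *root = new_node();
    for (int s = 0; X[s]; s++) {           // insert suffix X[s..]
        Node *v = root;
        for (int k = s; X[k]; k++) {
            int c = X[k] - 'a';
            if (!v->child[c]) v->child[c] = new_node();
            v = v->child[c];
        }
    }
    return root;
}

// P is a substring of X iff P spells out a path starting at the root.
int is_substring(const Node *root, const char *P) {
    for (; *P; P++) {
        root = root->child[*P - 'a'];
        if (!root) return 0;
    }
    return 1;
}

int main(void) {
    Node *t = suffix_trie("minimize");
    printf("%d %d\n", is_substring(t, "nimi"), is_substring(t, "mien")); // prints 1 0
    return 0;
}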
Analysis of Suffix Tries
• Compact representation of the suffix trie for a string X of
size n from an alphabet of size d
• Uses O(n) space
• Supports arbitrary pattern matching queries in X in O(dm) time,
where m is the size of the pattern
• Can be constructed in O(n) time
[Figure: the compact representation of the suffix trie of X = "minimize" (indices 0–7); nodes store index ranges into X, such as (1, 1), (2, 7), (6, 7) and (7, 7), instead of explicit substrings.]
Greedy Method and Text Compression
The Greedy Method Technique
• The greedy method is a general algorithm design
paradigm, built on the following elements:
• configurations: different choices, collections, or values to
find
• objective function: a score assigned to configurations,
which we want to either maximize or minimize
• It works best when applied to problems with the
greedy-choice property:
• a globally-optimal solution can always be found by a series
of local improvements from a starting configuration.
Text Compression
• Given a string X, efficiently encode X into a smaller
string Y
• Saves memory and/or bandwidth
• A good approach: Huffman encoding
• Compute frequency f(c) for each character c.
• Encode high-frequency characters with short code
words
• No code word is a prefix for another code
• Use an optimal encoding tree to determine the code
words
Encoding Tree Example
• A code is a mapping of each character of an alphabet to a binary code-
word
• A prefix code is a binary code such that no code-word is the prefix of
another code-word
• An encoding tree represents a prefix code
• Each external node stores a character
• The code word of a character is given by the path from the root to the
external node storing the character (0 for a left child and 1 for a right child)
[Figure: an encoding tree whose external nodes store the characters a, b, c, d and e.]

  character  a   b    c    d   e
  code-word  00  010  011  10  11
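To see how a prefix code is decoded, here is a small C sketch (our own) that hard-codes the example tree above and walks it bit by bit, emitting a character and restarting at the root whenever an external node is reached:

#include <stdio.h>

// Encoding-tree node: external nodes carry a character and have no children.
typedef struct Node { char ch; const struct Node *left, *right; } Node;

// The example tree: a = 00, b = 010, c = 011, d = 10, e = 11.
static const Node A = {'a', 0, 0}, B = {'b', 0, 0}, C = {'c', 0, 0},
                  D = {'d', 0, 0}, E = {'e', 0, 0};
static const Node BC = {0, &B, &C}, ABC = {0, &A, &BC}, DE = {0, &D, &E};
static const Node ROOT = {0, &ABC, &DE};

// Walk left on '0' and right on '1'; a prefix code guarantees that every
// external node reached ends exactly one code-word.
void decode(const char *bits) {
    const Node *v = &ROOT;
    for (; *bits; bits++) {
        v = (*bits == '0') ? v->left : v->right;
        if (!v->left) {              // external node: code-word complete
            putchar(v->ch);
            v = &ROOT;
        }
    }
    putchar('\n');
}

int main(void) {
    decode("010110010");             // prints "bead" (010 11 00 10)
    return 0;
}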
Encoding Tree Optimization
• Given a text string X, we want to find a prefix code for the characters of X that
yields a small encoding for X
• Frequent characters should have short code-words
• Rare characters should have long code-words
• Example
• X = abracadabra
• T1 encodes X into 29 bits
• T2 encodes X into 24 bits
[Figure: two encoding trees T1 and T2 for the characters a, b, c, d and r; T2 gives the frequent character a a shorter code-word.]
Huffman’s Algorithm
• Given a string X, Huffman’s algorithm constructs a prefix code that minimizes the size of the encoding of X
• It runs in time
O(n + d log d), where n
is the size of X and d is
the number of distinct
characters of X
• A heap-based priority
queue is used as an
auxiliary structure
Algorithm HuffmanEncoding(X)
Input string X of size n
Output optimal encoding trie for X
{
C = distinctCharacters(X);
computeFrequencies(C, X);
Q = new empty heap;
for all c ∈ C
{ T = new single-node tree storing c;
Q.insert(getFrequency(c), T); }
while ( Q.size() > 1 )
{ f1 = Q.minKey();
T1 = Q.removeMin();
f2 = Q.minKey();
T2 = Q.removeMin();
T = join(T1, T2);
Q.insert(f1 + f2, T);
}
return Q.removeMin();
}
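A compact C sketch of the algorithm (our own code; for brevity it uses a linear-scan priority queue instead of a heap, so it does not achieve the O(n + d log d) bound, but it builds the same kind of tree):

#include <stdio.h>
#include <stdlib.h>

typedef struct Tree {
    int freq;                        // total frequency of the leaves below
    char ch;                         // character (meaningful at leaves only)
    struct Tree *left, *right;
} Tree;

static Tree *make(int f, char c, Tree *l, Tree *r) {
    Tree *t = malloc(sizeof *t);
    t->freq = f; t->ch = c; t->left = l; t->right = r;
    return t;
}

// Remove and return a minimum-frequency tree from q[0..*n-1] (linear scan).
static Tree *remove_min(Tree **q, int *n) {
    int best = 0;
    for (int k = 1; k < *n; k++)
        if (q[k]->freq < q[best]->freq) best = k;
    Tree *t = q[best];
    q[best] = q[--*n];
    return t;
}

// Build a Huffman tree for X (assumed to have at least two distinct chars).
Tree *huffman(const char *X) {
    int f[256] = {0};
    for (; *X; X++) f[(unsigned char)*X]++;      // compute frequencies
    Tree *q[256]; int n = 0;
    for (int c = 0; c < 256; c++)                // one single-node tree per char
        if (f[c]) q[n++] = make(f[c], (char)c, NULL, NULL);
    while (n > 1) {                              // greedily join two rarest trees
        Tree *t1 = remove_min(q, &n);
        Tree *t2 = remove_min(q, &n);
        q[n++] = make(t1->freq + t2->freq, 0, t1, t2);
    }
    return q[0];
}

static void print_codes(const Tree *t, char *buf, int depth) {
    if (!t->left) { buf[depth] = '\0'; printf("%c: %s\n", t->ch, buf); return; }
    buf[depth] = '0'; print_codes(t->left, buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void) {
    char buf[256];
    print_codes(huffman("abracadabra"), buf, 0);  // one optimal prefix code
    return 0;
}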
Example
X = abracadabra

  character  a  b  c  d  r
  frequency  5  2  1  1  2

[Figure: the greedy rounds: c and d are joined into a tree of weight 2, then b and r into a tree of weight 4, then these two trees into one of weight 6, and finally a (weight 5) is joined with it to form the final tree of weight 11.]
Extended Huffman Tree Example
[Figure: the Huffman tree for a longer example string.]
The Fractional Knapsack Problem
• Given: A set S of n items, with each item i having
  • b_i – a positive benefit
  • w_i – a positive weight
• Goal: Choose items with maximum total benefit but with weight at most W.
• If we are allowed to take fractional amounts, then this is the fractional knapsack problem.
  • In this case, we let x_i denote the amount we take of item i
• Objective: maximize ∑_{i∈S} b_i (x_i / w_i)
• Constraint: ∑_{i∈S} x_i ≤ W
Example
• Given: A set S of n items, with each item i having
  • b_i – a positive benefit
  • w_i – a positive weight
• Goal: Choose items with maximum total benefit but with weight at most W.

  Items:             1      2      3      4      5
  Weight:            4 ml   8 ml   2 ml   6 ml   1 ml
  Benefit:           $12    $32    $40    $30    $50
  Value ($ per ml):  3      4      20     5      50

• Knapsack capacity: 10 ml
• Solution: 1 ml of item 5, 2 ml of item 3, 6 ml of item 4, 1 ml of item 2 (total benefit $124)
The Fractional Knapsack Algorithm
• Greedy choice: Keep taking the item with the highest value (benefit-to-weight ratio)
  • Since ∑_{i∈S} b_i (x_i / w_i) = ∑_{i∈S} (b_i / w_i) x_i
• Run time: O(n log n). Why? (Sorting the items by value dominates.)
• Correctness: Suppose there is a better solution
  • Then there is an item i with higher value than a chosen item j, but x_i < w_i and x_j > 0
  • Substituting some of item j with item i increases the total benefit, contradicting the optimality of that solution
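A C sketch of the greedy algorithm (our own), which sorts the items by value and then fills the knapsack, taking a fraction only of the last item used:

#include <stdio.h>
#include <stdlib.h>

typedef struct { double benefit, weight; } Item;

// Order items by value (benefit/weight), highest first.
static int by_value_desc(const void *a, const void *b) {
    const Item *x = a, *y = b;
    double vx = x->benefit / x->weight, vy = y->benefit / y->weight;
    return (vx < vy) - (vx > vy);
}

// Greedy fractional knapsack: maximum total benefit with weight at most W.
// O(n log n) time, dominated by the sort.
double fractional_knapsack(Item *items, int n, double W) {
    qsort(items, n, sizeof(Item), by_value_desc);
    double total = 0.0;
    for (int i = 0; i < n && W > 0; i++) {
        double take = items[i].weight < W ? items[i].weight : W; // all that fits
        total += take * (items[i].benefit / items[i].weight);
        W -= take;
    }
    return total;
}

int main(void) {
    // The five items from the example slide; knapsack capacity 10 ml.
    Item items[] = {{12, 4}, {32, 8}, {40, 2}, {30, 6}, {50, 1}};
    printf("%.0f\n", fractional_knapsack(items, 5, 10.0)); // prints 124
    return 0;
}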