Lecture 3 – Linked Lists
Lecture 13
Strings and Sequences
EECS 281: Data Structures & Algorithms
Strings and Sequences
Data Structures & Algorithms
Why Study String Algorithms?
• Bird’s-eye view: strings are character sequences
– Characters taken from an “alphabet”
– Algorithms on strings are array/sequence algorithms
• What makes those arrays/sequences special?
– Typical tasks to solve, defined by typical applications
• Applications of interest
– Human-readable text (ASCII, Unicode)
– Names and labels (people, files, license plates, etc)
– DNA analysis
• Out of scope: sequences of objects, doubles
3
Working With Strings/Sequences
• Given two strings/sequences
– Are they the equal? (==)
– Which would go first in a dictionary?
(<=)
– What do they have in common?
(substrings)
– Find where the strings differ
• Given many strings, order them in
the lexicographic (dictionary) order
– Ordering helps with search
(fast look-up)
4
Working With Strings and Texts
• Given strings p and s
– “Needle” and “haystack”
• Find instances of p in
s: first, last, next, all
– Variant: preprocess s
before p is known,
optimize search time for
many indep. queries (p)
• Text: a string with a
separator character,
which delimits words
– Given a word (or a
phrase), find instances
in text
5
Working With Strings and Texts
• A text corpus: a collection of documents, each containing a
text (+ title and other attributes, such as URL)
– Given a search string consisting of 1+ words, find all documents
containing those words
– Exact matches (names, ID #s)
– Approx matches (misspelled)
– A huge industry is built on this
• Given Project 1 submissions
– Find pairs of similar programs
• Text compression
– Find repeated words
– Replace them by numbers
6
DNA and Genomic Sequences
• DNA structure
– Defined by sequences of
nucleotide base-pairs
– Alphabet: A, C, G, T
– In people: 23x2 chromosomes,
3B base-pairs (characters)
– Genes == subsequences
• String algorithms for
– Comparing chromosomes
– Looking up genes
– Assembling long DNA strands
from short sequences 7
String-related Data Structures
• Individual strings: C vs C++
• Sequences (iterator ranges)
– C++ strings support begin() and end()
• Specialized containers for multiple strings
– Dictionaries: add, remove, look up words
– Strings with shared fragments (many words starting
with “anti-”, “pre-”, “over-”, “semi-”, etc)
Null-terminated Object-oriented
Overhead 1 ptr + 1 char 2+ ptrs: begin/end
Complexity of .size() O(n) O(1)
Alphabet ASCII Configurable
Operations, algorithms pointer arithmetics,
stdlibc functions
methods, operators,
stdlibc++ functions, STL
8
Sequence Equality in STL
• Returns true if the range [first1, last1) is equal to the
range [first2, first2 + (last1 - first1)),
and false otherwise
• Two ranges are considered equal if for every iterator
i in the range [first1, last1), *i == *(first2 + (i - first1))
9
A std::equal() Implementation
1 template
2 bool equal(InputIt1 first1, InputIt1 last1,
3 InputIt2 first2) {
4 for ( ; first1 != last1; ++first1, ++first2) {
5 if (!(*first1 == *first2))
6 return false;
7 } // for
8 return true;
9 } // equal()
From http://en.cppreference.com/w/cpp/algorithm/equal
Job interview question: implement a variant of this
function that takes a customizable object==() comparator
10
http://en.cppreference.com/w/cpp/algorithm/equal
http://en.cppreference.com/w/cpp/algorithm/equal
http://en.cppreference.com/w/cpp/algorithm/equal
Example: Using Sequence Equality
1 #include
2 #include
3 #include
4
5 bool is_palindrome(const std::string &s) {
6 return std::equal(begin(s), begin(s) + s.size() / 2, rbegin(s));
7 } // is_palindrome()
8 void test(const std::string &s) {
9 std::cout << '"' << s << '"'
10 << (is_palindrome(s) ? " is" : " is not") << " a palindrome\n";
11 } // test()
12 int main() {
13 test("radar");
14 test("hello");
15 } // main()
11
A Simple Job Interview Question
• You are shown one fruit at a time: apples,
bananas, oranges, pears (nothing else)
• You can stop when you detect a repeat
• How long does it take in the worst case for n
fruits?
– O(n) time
– O(log n) time
– O(1) time
– Something else?
12
Strings and Sequences
Data Structures & Algorithms
Lexicographic Comparison
Data Structures & Algorithms
Rules of Lex-compare
• Two ranges are compared element by element
• The first mismatching element defines which range is
lexicographically less or greater than the other
• If one range is a prefix of another, the shorter range is
lexicographically less than the other
• If two ranges have equivalent elements and are of the
same length, then the ranges are lexicographically equal
• An empty range is lexicographically less than any non-
empty range
• Two empty ranges are lexicographically equal
15
Dictionary Order
• "" < "a" < "ab" < "b" < "ba" < "bc" < "bc0"
• Overloading comparison operators in C++
– Don’t implement all 6 as independent operators
– Suffices to implement < and == (STL uses these)
– (a != b) is the same as !(a == b)
– (a > b) is the same as (b < a)
– (a <= b) is the same as !(b < a)
– (a >= b) is the same as !(a < b)
• All other comparisons use 1 operator (plus !)
– Bad idea: implementing <= using < and ==
16
Applied Strings
• Common implementation trick for
– int compareHelper(const T& x, const T& y)
– Returns 0 for x == y, -1 for x < y, 1 for x > y
– Comp. operators are implemented by post-processing
• Optimizations for strings
– Short strings (<16B) don’t need dynamic memory – In C, null-termination simplifies op< : "abc" < "abcd" – In C++, strings of different size are != – Fertile ground for vectorized CPU instructions 17 Three-way String Compare 1 int compareHelper(const string& s0, const string& s1) { 2 const size_t len0 = s0.length(), len1 = s1.length(); 3 for(size_t i = 0; i != std::min(len0, len1); ++i) { 4 if (s0[i] < s1[i]) 5 return -1; 6 if (s0[i] > s1[i])
7 return 1;
8 } // for
9 if (len0 < len1) 10 return -1; 11 if (len0 > len1)
12 return 1;
13 return 0;
14 } // compareHelper()
18
Runtime Analysis
• O(n) worst-case time, O(1) extra space
• Cannot do better
• Avg-case complexity for random strings
• O(1) time: only need the first few characters
• Strings are often not random
“c:\Program Files\MySQL\MySQL Server 5.0\bin”
“c:\Program Files\MySQL\MySQL Server 5.5\bin”
“versunken” “versuchen”
“befehlen” “empfehlen”
• What if we compare nearby words in a dictionary?
19
Lex-comparisons in STL
20
Lex-compare: Implementation
1 template
2 bool lexicographical_compare(ForIt1 first1, ForIt1 last1,
3 ForIt2 first2, ForIt2 last2) {
4 while ((first1 != last1) && (first2 != last2)) {
5 if (*first1 < *first2)
6 return true;
7 if (*first2 < *first1)
8 return false;
9 ++first1, ++first2;
10 } // for
11 return (first1 == last1) && (first2 != last2);
12 } // lexicographical_compare()
From http://en.cppreference.com/w/cpp/algorithm/lexicographical_compare
21
http://en.cppreference.com/w/cpp/algorithm/lexicographical_compare
1 #include
2 #include
3 #include
4 #include
5 using namespace std;
6 void print_chars(const vector
7 for_each(begin(vec), end(vec), [] (auto c) { cout << c << ' '; });
8 cout << sep;
9 } // print_chars()
10
11 int main() {
12 vector
13 vector
14 mt19937 g{random_device{}()};
15 while (!lexicographical_compare(begin(v1), end(v1), begin(v2), end(v2))) {
16 print_chars(v1, “>= “); print_chars(v2, “\n”);
17 shuffle(begin(v1), end(v1), g);
18 shuffle(begin(v2), end(v2), g);
19 } // while
20 print_chars(v1, “< "); print_chars(v2, "\n");
21 return 0;
22 } // main()
Lex-compare: Example
22
Application: Removing Duplicates
• Given: a container of string objects
• Need: leave only one copy of each string
• Return all strings, if no duplicates present
• Sort given strings
• Use STL std::sort() from STL with operator<()
• Duplicates will be next to each other
• Compare neighboring strings using operator==()
• Copy only the first of duplicate strings
• Use std::unique() from STL with operator==()
23
Lexicographic Comparison
Data Structures & Algorithms
Searching and String Fingerprints
Data Structures & Algorithms
Finding a Needle in a Haystack
26
String Searching
• Find the first/next occurrence of pattern p
in string s
27
Brute-force String Searching
• Two nested loops – sliding window
• Worst-case time: O(len_needle * len_haystack)
– Finding “aaaaab” in “aaaaaaaaaaaaaaaaaaaaaba”
• Avg-case: O(len_haystack)
– The inner loop performs O(1) iterations
28
Brute-force String Searching
• STL implementations often choose this algorithm
– Simple, lean implementation
– Usually fast in practice: O(len_haystack) time
• However,
– O(len_needle * len_haystack) worst-case time
• Worst cases do appear in realistic strings
– O(len_needle + len_haystack) worst-case time
possible with pre-processing (not as fast on most
strings), e.g., the Knuth-Morris-Pratt (KMP) algorithm
29
Better Than Brute-force
• Idea: speed up the inner loop for known worst
cases
– Perform most eq-comparisons for strings in O(1) time
– O(len_needle * len_haystack) becomes
O(len_haystack)
• Worst-case complexity remains similar, but
worst-case inputs are not as obvious and rare in
practice
• How do we speed up eq-comparisons?
30
Big Idea: String Fingerprints
• For each string, compute a number (int)
– When fingerprints differ, strings must differ
– When fingerprints match, strings rarely differ
– The actual numbers don’t mean anything
– Many different fingerprint functions exist
• Ex: simple, but poor fingerprint function, F()
– Replace each character by its ASCII code
– Add up all codes
31
F("tom marvolo riddle ") == F("i am lord voldemort")
Digression: A Sample Application
• If we only had fingerprint functions with few
collisions (that don’t show up in practice), we
could solve the following problem
• Given n strings, find duplicates by comparing
fingerprints
– When FPs match, we must check strings for equality
(in case FPs match by luck)
– If most strings are different, FPs help a lot
– We can sort FPs to find duplicates
32
Rabin Fingerprint
• Instead of adding up character codes, view
strings as decimal numbers
– Characters: ‘T’, ‘O’, ‘M’, ‘ ’, ‘M’, …
– ASCII codes: 84, 79, 77, 32, 77, …
– Running fingerprints:
T 84
TO 10 * 84 + 79
TOM 10 * (10 * 84 + 79) + 77
TOM🁢 10 * (10 * (10 * 84 + 79) + 77) + 32, …
– Base 10 is used for illustration only (use larger numbers)
• Shuffling the chars usually changes result
33
Sliding Rabin Fingerprint
• Calculate fingerprint fp for first m characters
• Removing a character on the left takes O(1) time
– Precompute/store value of 10m-1 once as p
– Subtract left-most character multiplied by p from fp
• Adding a character on the right takes O(1) time
– Multiply fp by 10†
– Add ASCII code of new right-most character
• Initial calculation of fp takes O(m) time
• Each of n - m slides takes O(1) time
34
† Using 10 for illustration only
Rabin-Karp String Search
1. Compute the FP of the needle in linear time
2. Check if the haystack starts with the needle
• window = prefix_of_the_haystack
• If (FP(window) != FP(needle)) go to Step 3
• If (window == needle), then done
3. Incrementally update the FP of window
• Remove one character on the left
• Add one new character on the right
4. If (FP(window) == FP(needle)),
if (window == needle), then done
5. If more chars remain in haystack, go to Step 3 35
• For long strings, performing +, * operations will
result in overflows
• A common trick when constructing a fingerprint
• Replace conventional arithmetic (+, *)
with modular arithmetic:
• Pick some “largish prime”
(eg. 3355439)
• (x * y) becomes (x * y) % prime
(x + y) becomes (x + y) % prime
• Important: conventional arithmetic rules carry over
(x + y = y + x, x * y = y * x, x * (y + z) = x * y + x * z, etc)
Technicality: Avoiding Overflows
36
Rabin-Karp: Time Complexity
• If different substrings never produce equal FPs, runtime
is O(len_haystack)
• Every FP collision incurs O(len_needle) time to perform
equality check of substrings
• In practice, collisions are very rare and don’t correspond
to meaningful pairs of strings
– It helps to choose larger primes
• Choose base = 2k
– Coprime with the original prime number
– Multiplication by base simplifies to a binary shift (faster)
37
1 int rkSearch(const string &needle, const string &haystack, size_t prime) {
2 constexpr int base = 128;
3 const size_t N = needle.length(), H = haystack.length();
4 const size_t z = static_cast
5 int n = 0, h = 0;
6 for (size_t i = 0; i < N; ++i) {
7 n = (base * n + needle[i]) % prime; // calculate needle fingerprint
8 h = (base * h + haystack[i]) % prime; // calc. window fingerprint
9 } // for
10 for (size_t i = 0; i <= H - N; ++i) {
11 if (n == h) { // check needle fp vs. current window
12 for (size_t j = 0; j < N; ++j) if (haystack[i + j] != needle[j]) break;
13 if (j == N) return i;
14 } // if
15 if (i < H - N) { // slide window
16 h = ((h - haystack[i] * z) % prime * base + haystack[i + N]) % prime;
17 if (h < 0) h = (h + prime);
18 } // if
19 } // for
20 return -1; // needle not found
21 } // rkSearch() 38
Searching and String Fingerprints
Data Structures & Algorithms