
COMP90038 Algorithms and Complexity
Hashing

(Slides from Harald Søndergaard)
Lecture 17
Semester 2, 2021

Announcements
Assignment 2 is out on LMS.
Olya’s consultation sessions (Zoom links via LMS):
Tuesday October 5, 2:30 – 3:30 pm
Tuesday October 12, 2:30 – 3:30 pm

Previous lecture
Precomputation/preprocessing technique: spending space to save time. Examples:
1. Fibonacci numbers
2. Sorting by counting
3. Horspool’s String Search algorithm

Today: Hashing and Hash Tables

Hashing
Hashing is a standard way of implementing the abstract data type “dictionary”.
Implemented well, it makes data retrieval very fast.
A key can be anything, as long as we can map it efficiently to a positive integer. In particular, the set K of keys need not be bounded.
Assume we have a table of size m (the hash table).
The idea is to have a function h : K → {1..m} (the hash function) determine where records are stored: A record with key k should be stored in location h(k).
The address h(k) is the hash address.

The Hash Table
We can think of the hash table as an abstract data structure supporting operations:
find
insert
lookup (search and insert if not there)
initialize
delete
rehash
The challenges:
Design of hash functions
Collision handling
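As a rough sketch, the operations above might be declared as follows in C; the type and function names are my own illustration, not an interface given in the lecture:

typedef struct hashtab hashtab;          /* opaque hash table type */

hashtab *ht_init(int m);                 /* create an empty table with m cells */
int      ht_find(hashtab *t, int key);   /* 1 if key is present, else 0 */
void     ht_insert(hashtab *t, int key); /* store key at its hash address */
int      ht_lookup(hashtab *t, int key); /* search; insert key if absent */
void     ht_delete(hashtab *t, int key); /* remove key, where supported */
hashtab *ht_rehash(hashtab *t, int m2);  /* rebuild into a table with m2 cells */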

The Hash Function
We need to choose the table size m so that it is large enough to allow efficient operation, without taking up excessive memory.
If keys are integers, we could define h(n) = n mod m.
But keys could be other things, such as strings of characters, and the hash function should apply to these and still be easy (cheap) to compute.
The hash function should ideally distribute keys evenly across the hash table.
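For integer keys the hash function is a one-liner in C. A minimal sketch, assuming the table size m = 23 that is also used in the collision example further below:

#include <stdio.h>

#define M 23                            /* table size (a prime) */

int h(int n) { return n % M; }          /* hash address of integer key n */

int main(void) {
    printf("%d %d %d\n", h(19), h(392), h(179));   /* prints: 19 1 18 */
    return 0;
}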


Hashing of Strings
For simplicity assume A ↦ 0, B ↦ 1, etc.
Also assume we have, say, 26 characters, and m = 101.
Then we can think of a string as a long binary string:
M Y K E Y ↦ 0110011000010100010011000 (= 13379736)
13379736 mod 101 = 64
So 64 is the position of string M Y K E Y in the hash table.
We deliberately chose m to be prime.
13379736 = 12×32^4 + 24×32^3 + 10×32^2 + 4×32 + 24
With m = 32, the hash value of any key is the last character’s value!
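A quick way to check these numbers (a hypothetical snippet, not from the slides, where chr gives a character's number, A ↦ 0, ..., Z ↦ 25):

#include <stdio.h>

int chr(char c) { return c - 'A'; }    /* A -> 0, B -> 1, ..., Z -> 25 */

int main(void) {
    unsigned long long v = 0;
    for (const char *s = "MYKEY"; *s != '\0'; s++)
        v = v * 32 + chr(*s);          /* base-32 value of the string */
    printf("%llu\n", v);               /* 13379736 */
    printf("%llu\n", v % 101);         /* 64: the hash address when m = 101 */
    printf("%llu\n", v % 32);          /* 24 = chr('Y'): with m = 32 only the
                                          last character matters */
    return 0;
}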

Handling Long Strings as Keys
More precisely, let chr be the function that gives a character’s number (between 0 and 25 under our simple assumptions), so for example, chr(C) = 2.
Then we have hash(s) = Σ_{i=0}^{|s|−1} chr(s_i) × 32^{|s|−i−1}.
For example,
hash(V E R Y L O N G K E Y) = (21×32^10 + 4×32^9 + ···) mod 101
The stuff between parentheses quickly becomes an impossibly large number!

Horner’s Rule
Fortunately there is a trick, Horner's rule, that allows us to avoid large numbers in the hash calculations. Instead of

21×32^10 + 4×32^9 + 17×32^8 + 24×32^7 + ···

factor out repeatedly:

(···((21×32 + 4)×32 + 17)×32 + ···) + 24

Now utilize these properties of modular arithmetic:

(x + y) mod m = ((x mod m) + (y mod m)) mod m
(x × y) mod m = ((x mod m) × (y mod m)) mod m
So for each sub-expression it suffices to take values modulo m.

Hashing of Strings
Horner’s rule is easily implemented. For example, here it is in C:
unsigned hash(char *v) {
    int h;
    for (h = 0; *v != '\0'; v++)
        h = ((h << 5) + *v) % m;
    return h;
}

Here v is a pointer traversing the string. C uses a special symbol, \0, to mark the end of a string. The (h << 5) means "shift the bits in h 5 places to the left and fill with zeroes", a quick way of calculating 32*h. (The parentheses matter: << binds less tightly than + in C.) The '%' is the modulus operation in C.

The Hash Function and Collisions

Here we assume m = 23 and h(k) = k mod m:

Key     | 19 392 179 359 663 262 639 321  97 468 814
Address | 19   1  18  14  19   9  18  22   5   8   9

The hash function should be as random as possible. In some cases different keys will be mapped to the same hash table address. When this happens we have a collision. Different hashing methods resolve collisions differently.

Separate Chaining

Element k of the hash table is a list of keys with the hash value k.

[Figure: a 23-slot table (slots 0–22); slot 1 holds 392; slot 5 holds 97; slot 8 holds 468; slot 9 holds 262 and 814; slot 14 holds 359; slot 18 holds 179 and 639; slot 19 holds 19 and 663; slot 22 holds 321.]

This gives easy collision handling. The load factor is α = n/m, where n is the number of items stored.

Number of probes in a successful search ≈ 1 + α/2.
Number of probes in an unsuccessful search ≈ α.

Separate Chaining Pros and Cons

Compared with sequential search, chaining reduces the number of comparisons by a factor of m.
It is good in a dynamic environment, when the (number of) keys is hard to predict.
The chains can be ordered, or records may be "pulled up front" when accessed.
Deletion is easy.
However, separate chaining uses extra storage for links.

Open-Addressing Methods

With open-addressing methods (also called closed hashing) all records are stored in the hash table itself (not in linked lists hanging off the table). There are many methods of this type. We only discuss two:

linear probing
double hashing

For these methods, the load factor α ≤ 1.

Linear Probing: Example

In case of collision, try the next cell, then the next, and so on. With m = 5 and after the arrival of a (4), b (5), c (5), d (1), e (1):

[Figure: with cells numbered 1–5, c wraps around to cell 1, d goes to cell 2, and e to cell 3, so the cells hold c, d, e, a, b.]

Linear Probing

In case of collision, try the next cell, then the next, and so on. After the arrival of 19, 392 (1), 179 (18), 663 (19), 639 (18), 321 (22):

Cell: 0 | 1   | 2 .. 17 | 18  | 19 | 20  | 21  | 22
Key:    | 392 |         | 179 | 19 | 663 | 639 | 321

Search proceeds in a similar fashion. If we get to the end of the hash table, we wrap around. For example, if key 20 now arrives, it will be placed in cell 0.
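The example translates directly into code. Below is a minimal linear-probing sketch in C, written under the same assumptions as the slide (table size 23, integer keys, a table that never fills up); it is an illustration, not the lecture's reference code:

#include <stdio.h>

#define M 23          /* table size */
#define EMPTY (-1)    /* marks an unused cell */

int table[M];

int hash(int k) { return k % M; }

/* Insert k, probing successive cells (with wrap-around) until a
   free cell is found. Assumes the table is not full. */
void insert(int k) {
    int i = hash(k);
    while (table[i] != EMPTY)
        i = (i + 1) % M;            /* try the next cell */
    table[i] = k;
}

/* Return the cell holding k, or -1 if k is absent. */
int search(int k) {
    int i = hash(k);
    while (table[i] != EMPTY) {
        if (table[i] == k) return i;
        i = (i + 1) % M;
    }
    return -1;
}

int main(void) {
    int keys[] = { 19, 392, 179, 663, 639, 321, 20 };
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    for (int j = 0; j < 7; j++) insert(keys[j]);
    printf("%d\n", search(20));     /* prints 0: key 20 wrapped around */
    return 0;
}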
Linear Probing

Again let m be the table size, and n be the number of records stored. As before, α = n/m is the load factor. Average number of probes:

Successful search: (1/2)(1 + 1/(1−α))
Unsuccessful search: (1/2)(1 + 1/(1−α)^2)

For successful search:

α       | 0.1  | 0.25 | 0.5  | 0.75 | 0.9  | 0.95
#probes | 1.06 | 1.17 | 1.50 | 2.50 | 5.50 | 10.50

Linear Probing Pros and Cons

Space-efficient.
Worst-case performance is miserable; we must be careful not to let the load factor grow beyond 0.9.
Comparative behaviour, for m = 11113, n = 10000, α = 0.9:

Linear probing: 5.5 probes on average (success)
Binary search: 12.3 probes on average (success)
Linear search: 5000 probes on average (success)

Clustering is a major problem: the collision handling strategy leads to clusters of contiguous cells being occupied.
Deletion is almost impossible.

Double Hashing

To alleviate the clustering problem in linear probing, there are better ways of resolving collisions. One is double hashing, which uses a second hash function s to determine an offset to be used in probing for a free cell. For example, we may choose s(k) = 1 + k mod 97. By this we mean: if h(k) is occupied, next try h(k) + s(k), then h(k) + 2s(k), and so on.

This is another reason why it is good for m to be a prime number. That way, using s(k) as the offset, we will eventually find a free cell if there is one.

Rehashing

The standard approach to avoiding performance deterioration in hashing is to keep track of the load factor and to rehash when it reaches, say, 0.9. Rehashing means allocating a larger hash table (typically about twice the current size), revisiting each item, calculating its hash address in the new table, and inserting it.

This "stop-the-world" operation will introduce long delays at unpredictable times, but it will happen relatively infrequently.

Rabin-Karp String Search

The Rabin-Karp string search algorithm is based on string hashing. To search for a string p (of length m) in a larger string s, we can calculate hash(p) and then check every substring s_i ··· s_{i+m−1} to see if it has the same hash value. Of course, if it has, the strings may still be different; so we need to compare them in the usual way.

If p = s_i ··· s_{i+m−1} then the hash values are the same; otherwise the values are almost certainly going to be different. Since false positives will be so rare, the O(m) time it takes to actually compare the strings can be ignored.

Repeatedly hashing strings of length m seems like a bad idea. However, the hash values can be calculated incrementally. The hash value of the length-m substring of s that starts at position j is

hash(s, j) = Σ_{i=0}^{m−1} chr(s_{j+i}) × a^{m−i−1}

where a is the alphabet size. From that we can get the next hash value, for the substring that starts at position j+1, quite cheaply:

hash(s, j+1) = (hash(s, j) − a^{m−1} × chr(s_j)) × a + chr(s_{j+m})

taken modulo m. Effectively we just subtract the contribution of s_j and add the contribution of s_{j+m}, for the cost of two multiplications, one addition and one subtraction.
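A sketch of this in C, reusing the A ↦ 0, ..., Z ↦ 25 coding and, as an assumption for illustration, taking hash values modulo the prime 101 (the table size from the earlier string-hashing example); the function names are mine, not the lecture's:

#include <stdio.h>
#include <string.h>

#define A 26     /* alphabet size a */
#define Q 101    /* modulus for hash values (a prime) */

int chr(char c) { return c - 'A'; }

/* Return the position of the first occurrence of p in s, or -1. */
int rabin_karp(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    if (m == 0 || m > n) return -1;
    int am = 1;                           /* a^(m-1) mod Q */
    for (int i = 0; i < m - 1; i++) am = (am * A) % Q;
    int hp = 0, hs = 0;
    for (int i = 0; i < m; i++) {         /* Horner's rule, as before */
        hp = (hp * A + chr(p[i])) % Q;
        hs = (hs * A + chr(s[i])) % Q;
    }
    for (int j = 0; j + m <= n; j++) {
        /* On a hash match, confirm with a direct comparison;
           false positives are rare, so this costs little overall. */
        if (hs == hp && strncmp(s + j, p, m) == 0) return j;
        if (j + m < n)                    /* roll: drop s[j], add s[j+m] */
            hs = ((hs - chr(s[j]) * am % Q + Q) * A + chr(s[j + m])) % Q;
    }
    return -1;
}

int main(void) {
    printf("%d\n", rabin_karp("VERYLONGKEY", "LONG"));   /* prints 4 */
    return 0;
}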
Why Not Always Use Hashing?

Some drawbacks:

If an application calls for traversal of all items in sorted order, a hash table is no good.
Also, unless we use separate chaining, deletion is virtually impossible.
It may be hard to predict the volume of data, and rehashing is an expensive "stop-the-world" operation.

When to Use Hashing?

All sorts of information retrieval applications involving thousands to millions of keys. Typical example: symbol tables used by compilers. The compiler hashes all (variable, function, etc.) names and stores information related to each; no deletion in this case.

When hashing is applicable, it is usually superior; a well-tuned hash table will outperform its competitors. Unless you let the load factor get too high, or you botch up the hash function. It is a good idea to print statistics to check that the function really does spread keys uniformly across the hash table.

Summary

Hashing and hash tables: pros and cons
Ways to avoid/reduce collisions: separate chaining and open addressing
Hash table load factor: space vs. efficiency

Next Up

Dynamic programming and optimization.

A reminder that Assignment 2 is out and due on October 13.