COMP90038
Algorithms and Complexity
Lecture 17: Hashing
(with thanks to Harald Søndergaard & Michael Kirley)
Andres Munoz-Acosta
munoz.m@unimelb.edu.au
Peter Hall Building G.83
Recap
• We talked about using some memory space (in the form of extra
tables, arrays, etc.) to speed up our computation.
• Memory is cheap, time is not.
• Sorting by counting
• Horspool’s Algorithm
Sorting by counting
• Let's go through this example carefully:
• The keys are: [1 2 3 4 5]
• The data is: [5 5 1 5 4 1 2 3 5 5 1 5 5 3 5 1 3 5 4 5]
• Let's count the appearances of each key, then add up the occurrences:

Key           1    2    3    4     5
Occurrences   4    1    3    2    10
Cumulation    4    5    8   10    20
             0+4  4+1  5+3  8+2  10+10
Sorting by counting
• Let's sort the data by scanning it right to left: each element is placed at
  the position given by its key's cumulation value, which is then
  decremented. The first four placements are:

Element        Placed at  Cumulation becomes
data[20] = 5   P[20]      Cumulation[5] = 19
data[19] = 4   P[10]      Cumulation[4] = 9
data[18] = 5   P[19]      Cumulation[5] = 18
data[17] = 3   P[8]       Cumulation[3] = 7

• The result, noting that P[1..4] = 1, P[5] = 2, P[6..8] = 3, P[9..10] = 4
  and P[11..20] = 5:

Index     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Unsorted  5 5 1 5 4 1 2 3 5  5  1  5  5  3  5  1  3  5  4  5
Sorted    1 1 1 1 2 3 3 3 4  4  5  5  5  5  5  5  5  5  5  5
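As a sketch of the whole procedure, here is a minimal Python version of sorting by counting (the function and variable names are our own, not the lecture's pseudocode):

```python
def counting_sort(data, keys):
    # Count the occurrences of each key.
    count = {k: 0 for k in keys}
    for x in data:
        count[x] += 1
    # Accumulate: cum[k] = number of items with key <= k,
    # i.e. the last position key k may occupy in the output.
    cum, total = {}, 0
    for k in keys:
        total += count[k]
        cum[k] = total
    # Scan right to left, placing each item at its cumulation
    # position and decrementing; this keeps the sort stable.
    result = [None] * len(data)
    for x in reversed(data):
        result[cum[x] - 1] = x
        cum[x] -= 1
    return result

data = [5,5,1,5,4,1,2,3,5,5,1,5,5,3,5,1,3,5,4,5]
print(counting_sort(data, [1,2,3,4,5]))
# [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
```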
Horspool's algorithm
• Let's go through this example carefully:
• The pattern is TAACG (A=1, T=2, G=3, C=4 → P[.] = [2 1 1 4 3], m = 5)
• The string is GACCGCGTGAGATAACGTCA
• The algorithm creates the table of shifts: every entry starts at m, and
  then, for j = 0..m-2, the entry for P[j] is set to m-1-j:

                  A  T  G  C
After first loop  5  5  5  5
j = 0             5  4  5  5
j = 1             3  4  5  5
j = 2             2  4  5  5
j = 3             2  4  5  1
Horspool's algorithm
• We append the pattern as a sentinel at the end of the data to guarantee
  completion.
• The string is now GACCGCGTGAGATAACGTCATAACG

Index    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
STRING   G A C C G C G T G A G  A  T  A  A  C  G  T  C  A  T  A  A  C  G
Encoded  3 1 4 4 3 4 3 2 3 1 3  1  2  1  1  4  3  2  4  1  2  1  1  4  3

Window (right end)  Outcome
GACCG (4)           Matched CG, then failed (C != A). Is 'CG' somewhere
                    else in the pattern? No, so shift by t(G) = 5.
CGTGA (9)           Failed (A != G), shift by t(A) = 2.
TGAGA (11)          Failed (A != G), shift by t(A) = 2.
AGATA (13)          Failed (A != G), shift by t(A) = 2.
ATAAC (15)          Failed (C != G), shift by t(C) = 1.
TAACG (16)          Found: the pattern ends at index 16 (starts at 12).
Horspool's algorithm
• In the code, i is the index of the window's right end and k counts the
  matched characters. At the end of each iteration of the while True loop,
  [i k] are:

i   k
9   2
11  0
13  0
15  0
16  0
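A compact Python sketch of Horspool's algorithm, matching the trace above (i and k play the same roles as in the slides; the helper name shift_table is our own):

```python
def shift_table(pattern, text):
    m = len(pattern)
    t = {c: m for c in set(text) | set(pattern)}   # default shift is m
    for j in range(m - 1):                         # all but the last char
        t[pattern[j]] = m - 1 - j
    return t

def horspool(pattern, text):
    m, n = len(pattern), len(text)
    t = shift_table(pattern, text)
    i = m - 1                          # right end of the current window
    while i < n:
        k = 0                          # number of matched characters
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1           # start index of the match
        i += t[text[i]]                # shift by the char under the right end
    return -1

print(horspool("TAACG", "GACCGCGTGAGATAACGTCA"))   # 12 (window ends at 16)
```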
Hashing
• Hashing is a standard way of implementing the abstract data type
  "dictionary": a collection of records, each identified by a key. For
  example, a student record:
• Attributes: Student ID, name, date of birth, address, major, etc.
• Implemented well, it makes data retrieval very fast.
• A key identifies each record. It can be anything: integers, alphabetical
  characters, even strings.
• It should map efficiently to a positive integer.
• The set K of keys need not be bounded.
Hashing
• We will store our records in a hash table of size m.
• m should be large enough to allow efficient operation, without taking up
excessive memory.
• The idea is to have a function h that takes the key k, and determines
an index in the hash table. This is the hash function.
• A record with key k should be stored in location h(k).
• The hash address is the value of h(k).
• Two different keys could have the same hash address (a collision).
Hashing
• A few example applications:
• The MD5 algorithm, used for data integrity verification.
• The blockchain structure used in cryptocurrencies.
[Figure: a hash function is a mathematical function that maps data of
arbitrary length (e.g. text) to a fixed-length hash (digest), e.g.
'E83H6F20'.]
The Hash Table
• We can think of the hash table as an abstract data structure supporting
operations:
• Find
• Insert
• Lookup (search and insert if not found)
• Initialize
• Delete
• Rehash
• The challenges in implementing a hash table are:
• Designing a robust hash function
• Handling collisions (different key values mapping to the same address)
The Hash Function
• The hash function:
• Must be easy (cheap) to compute.
• Ideally distribute keys evenly across the hash table.
• Examples:
• If the keys are integers, we could define h(n) = n mod m. If m = 23:

n     19  392  179  359  262  321  97  468
h(n)  19    1   18   14    9   22   5    8

• If the keys are strings, we could define a more complex function.
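A one-liner to verify the table above:

```python
m = 23
for n in [19, 392, 179, 359, 262, 321, 97, 468]:
    print(n, n % m)   # 19, 1, 18, 14, 9, 22, 5, 8
```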
Hashing of strings
• Assume:
• the table of 26 characters below;
• a hash table of size 13;
• the hash function h(s) = (a(c1) + a(c2) + ... + a(ck)) mod 13, where
  c1, ..., ck are the characters of s;
• and the following list of keys:
  [A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED]

char  a    char  a    char  a
A     0    J     9    S    18
B     1    K    10    T    19
C     2    L    11    U    20
D     3    M    12    V    21
E     4    N    13    W    22
F     5    O    14    X    23
G     6    P    15    Y    24
H     7    Q    16    Z    25
I     8    R    17
Calculating the addresses

Key     Character values      SUM  h(s)
A        0                      0    0
FOOL     5 14 14 11            44    5
AND      0 13  3               16    3
HIS      7  8 18               33    7
MONEY   12 14 13  4 24         67    2
ARE      0 17  4               21    8
SOON    18 14 14 13            59    7
PARTED  15  0 17 19  4  3      58    6

• HIS and SOON get the same address!!!
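A sketch of this sum-of-characters hash in Python (the name h_sum is our own), reproducing the table:

```python
def h_sum(s, m=13):
    # a(c) maps A -> 0, B -> 1, ..., Z -> 25; sum the values, then mod m.
    return sum(ord(c) - ord('A') for c in s) % m

for w in ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]:
    print(w, h_sum(w))
# HIS and SOON both hash to 7: a collision.
```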
A more complex hash function
• Assume a binary representation of the 26 characters.
• We need 5 bits per character (values 0 to 31).
• Instead of adding the values, we concatenate the binary strings.
• Our hash table is of size 101 (m is prime).
• Our key will be 'MYKEY'.

char  a  bin(a)    char  a  bin(a)    char  a  bin(a)
A     0  00000     J     9  01001     S    18  10010
B     1  00001     K    10  01010     T    19  10011
C     2  00010     L    11  01011     U    20  10100
D     3  00011     M    12  01100     V    21  10101
E     4  00100     N    13  01101     W    22  10110
F     5  00101     O    14  01110     X    23  10111
G     6  00110     P    15  01111     Y    24  11000
H     7  00111     Q    16  10000     Z    25  11001
I     8  01000     R    17  10001
A more complex hash function
• By concatenating the 5-bit strings, we are effectively multiplying by
  powers of 32. Note that the hash function is a polynomial in 32:
  h(MYKEY) = (12*32^4 + 24*32^3 + 10*32^2 + 4*32 + 24) mod 101

STRING          M         Y       K      E      Y
int             12        24      10     4      24
bin(int)        01100     11000   01010  00100  11000
Index           4         3       2      1      0
32^Index        1048576   32768   1024   32     1
int * 32^Index  12582912  786432  10240  128    24

KEY = 12582912 + 786432 + 10240 + 128 + 24 = 13379736
KEY mod 101 = 64
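The same computation in Python, done by brute force exactly as in the table (the variable names are ours):

```python
digits = [ord(c) - ord('A') for c in "MYKEY"]      # [12, 24, 10, 4, 24]
key = sum(d * 32 ** (len(digits) - 1 - i)          # 12*32^4 + ... + 24*32^0
          for i, d in enumerate(digits))
print(key, key % 101)                              # 13379736 64
```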
Handling Long Strings as Keys
• What would happen if our key were the longer string 'VERYLONGKEY'?
• The number inside the parentheses becomes very large very quickly:
• DEC: 23804165628760600
• BIN: 1010100100100011100001100110100011110100001001000011000
• Calculating this polynomial by brute force is very expensive.
Horner's rule
• Fortunately there is a trick, Horner's rule, that simplifies polynomial
  calculation.
• By factorizing out x we have:
  a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
      = (((a4*x + a3)*x + a2)*x + a1)*x + a0
• We then apply the modulus after every step of this evaluation.
Horner's rule
• We can use the following properties of modular arithmetic:
  (x + y) mod m = ((x mod m) + (y mod m)) mod m
  (x * y) mod m = ((x mod m) * (y mod m)) mod m
• Given that the modulus distributes across addition and multiplication,
  we can reduce after every step of Horner's rule.
• The result of each step will then not exceed m.
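A sketch of the hash function evaluated with Horner's rule, applying the modulus at every step (the name h_horner is our own):

```python
def h_horner(s, m=101, base=32):
    h = 0
    for c in s:
        # Multiply by the base, add the next 'digit', reduce immediately:
        # intermediate values never grow beyond base * m.
        h = (h * base + (ord(c) - ord('A'))) % m
    return h

print(h_horner("MYKEY"))         # 64, same as the brute-force computation
print(h_horner("VERYLONGKEY"))   # no astronomically large intermediates
```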
Handling collisions
• The hash function should be as random as possible.
• However, in some cases different keys will be mapped to the same hash
  table address. For example, with h(n) = n mod 23:

KEY      19  392  179  359  663  262  639  321   97  468  814
ADDRESS  19    1   18   14   19    9   18   22    5    8    9

• When this happens we have a collision.
• Different hashing methods resolve collisions differently.
Separate Chaining
• Each element of the hash table is (the head of) a linked list, which
  makes collision handling very easy.
• Inserting the keys above gives the chains:
  1: 392;  5: 97;  8: 468;  9: 262 → 814;  14: 359;
  18: 179 → 639;  19: 19 → 663
• Exercise: add [83 110 14] to this table.
• The load factor is α = n/m, where n is the number of items stored.
• Average number of probes in a successful search ≈ 1 + α/2.
• Average number of probes in an unsuccessful search ≈ α.
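A minimal separate-chaining sketch in Python, using a list per slot in place of a linked list (the class and method names are ours):

```python
class ChainedHashTable:
    def __init__(self, m=23):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per address

    def insert(self, key):
        self.slots[key % self.m].append(key)

    def search(self, key):
        return key in self.slots[key % self.m]

t = ChainedHashTable()
for k in [19, 392, 179, 359, 663, 262, 639, 321, 97, 468, 814]:
    t.insert(k)
print(t.slots[19])   # [19, 663] -- two keys chained at address 19
```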
Separate chaining: advantages and
disadvantages
• Compared with sequential search, it reduces the number of comparisons
  by a factor of m (the table size).
• Good in a dynamic environment, when the number of keys is hard to
  predict.
• The chains can be ordered, or records may be “pulled up front” when accessed.
• Deletion is easy.
• However, separate chaining uses extra storage for links.
Open-Addressing Methods
• With open-addressing methods (also called closed hashing) all
records are stored in the hash table itself (not in linked lists hanging
off the table).
• There are many methods of this type. We focus on two:
• linear probing
• double hashing
• For these methods, the load factor α ≤ 1.
Linear probing
• In case of collision, try the next cell, then the next, and so on.
• Assume the following data (home address → actual cell) arriving one at
  a time:
  [19(19) 392(1) 179(18) 663(19→20) 639(18→21) 321(22) …]
• Search proceeds in a similar fashion.
• If we get to the end of the table, we wrap around.
• For example, if key 20 arrives now, it will be placed in cell 0.
Linear probing
• The table now has occupied cells:
  1: 392;  5: 97;  8: 468;  9: 262;  10: 814;  14: 359;
  18: 179;  19: 19;  20: 663;  21: 639;  22: 321
• Exercise: add [83(14) 110(18) 497(14)] to the table.
• Again let m be the table size and n the number of records stored.
• As before, α = n/m is the load factor. Then the average number of
  probes is:
• Successful search: 0.5 + 1/(2(1−α))
• Unsuccessful search: 0.5 + 1/(2(1−α)²)
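A linear-probing insertion sketch (the name lp_insert is our own); running it on the keys above reproduces the table:

```python
def lp_insert(table, key):
    # Assumes the table is not full.
    m = len(table)
    i = key % m                      # home address
    while table[i] is not None:      # probe the next cell, wrapping around
        i = (i + 1) % m
    table[i] = key

table = [None] * 23
for k in [19, 392, 179, 359, 663, 262, 639, 321, 97, 468, 814]:
    lp_insert(table, k)
print(table.index(663), table.index(639))   # 20 21
```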
Linear probing: advantages and disadvantages
• Space-efficient.
• Worst-case performance is miserable; we must be careful not to let the load factor grow beyond 0.9.
• Comparative behavior, m = 11113, n = 10000, α = 0.9:
• Linear probing: 5.5 probes on average (success)
• Binary search: 12.3 probes on average (success)
• Linear search: 5000 probes on average (success)
• Clustering (large groups of contiguous keys) is a major problem:
• The collision handling strategy leads to clusters of contiguous cells being occupied.
• Deletion is almost impossible.
Double Hashing
• To alleviate the clustering problem in linear probing, there are better
  ways of resolving collisions.
• One is double hashing, which uses a second hash function s to determine
  an offset to be used in probing for a free cell.
• For example, we may choose s(k) = 1 + k mod 97.
• By this we mean: if h(k) is occupied, next try h(k) + s(k), then
  h(k) + 2s(k), and so on (all modulo m).
• This is another reason why it is good for m to be a prime number. That
  way, probing with offset s(k) visits every cell, so we will eventually
  find a free cell if there is one.
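A double-hashing insertion sketch under the same assumptions (prime m, s(k) = 1 + k mod 97; the name dh_insert and the demo keys are ours):

```python
def dh_insert(table, key):
    m = len(table)                   # assumed prime, and table not full
    i = key % m                      # home address h(k)
    step = 1 + key % 97              # offset s(k)
    while table[i] is not None:      # try h(k)+s(k), h(k)+2s(k), ...
        i = (i + step) % m
    table[i] = key

table = [None] * 101
for k in [15, 116, 217]:             # all three have home address 15
    dh_insert(table, k)
print(table[15], table[35], table[39])   # 15 116 217
```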
Rehashing
• The standard approach to avoiding performance deterioration in
hashing is to keep track of the load factor and to rehash when it
reaches, say, 0.9.
• Rehashing means allocating a larger hash table (typically about twice
the current size), revisiting each item, calculating its hash address in
the new table, and inserting it.
• This “stop-the-world” operation will introduce long delays at
unpredictable times, but it will happen relatively infrequently.
An exam question type
• Using the hash function h(k) = k mod 7, draw the hash table that
  results after inserting, in the given order, the values
  [19, 26, 13, 48, 17],
• when collisions are handled by:
• separate chaining
• linear probing
• double hashing using h'(k) = 5 − (k mod 5)
Solution

Index              0    1    2    3    4    5    6
Separate chaining                 17        19   13
                                            26   48
Linear probing     13   48        17        19   26
Double hashing          48   26   17        19   13

• For example, 48 hashes to 6 (occupied by 13); under double hashing its
  offset is h'(48) = 5 − 3 = 2, so it lands in cell (6 + 2) mod 7 = 1.
Rabin-Karp String Search
• The Rabin-Karp string search algorithm is based on string hashing.
• To search for a string p (of length m) in a larger string s, we can calculate hash(p)
and then check every substring si … si+m-1 to see if it has the same hash value. Of
course, if it has, the strings may still be different; so we need to compare them in
the usual way.
• If p = si … si+m-1 then the hash values are the same; otherwise the
  values are almost certainly different.
• Since false positives are so rare, the O(m) time it takes to actually
  compare the strings can be ignored.
Rabin-Karp String Search
• Repeatedly hashing strings of length m seems like a bad idea. However,
  the hash values can be calculated incrementally. The hash value hj of
  the length-m substring of s that starts at position j is:
  hj = (sj*a^(m-1) + sj+1*a^(m-2) + … + sj+m-1) mod q
• where a is the alphabet size and q is the modulus (the table size).
  From that we can get the next hash value, for the substring that starts
  at position j+1, quite cheaply:
  hj+1 = ((hj − sj*a^(m-1)) * a + sj+m) mod q
• Effectively we just subtract the contribution of sj and add the
  contribution of sj+m, for the cost of two multiplications, one addition
  and one subtraction.
An example
• The data: '31415926535'
• The hash function: h(k) = k mod 11
• The pattern: '26', with hash 26 mod 11 = 4

STRING  3 1 4 1 5 9 2 6 5 3 5

31 mod 11 = 9
14 mod 11 = 3
41 mod 11 = 8
15 mod 11 = 4   hash match, but '15' != '26': false positive
59 mod 11 = 4   hash match, but '59' != '26': false positive
92 mod 11 = 4   hash match, but '92' != '26': false positive
26 mod 11 = 4   hash match and '26' == '26': found!
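A Rabin-Karp sketch for this digit-string example (base 10, modulus 11); the rolling-hash update is the formula from the previous slide, and the function name is ours:

```python
def rabin_karp(pattern, text, base=10, q=11):
    m, n = len(pattern), len(text)
    hp = hs = 0
    for i in range(m):               # hash the pattern and the first window
        hp = (hp * base + int(pattern[i])) % q
        hs = (hs * base + int(text[i])) % q
    high = pow(base, m - 1, q)       # a^(m-1) mod q
    for j in range(n - m + 1):
        # A hash match may be a false positive: confirm by direct comparison.
        if hs == hp and text[j:j + m] == pattern:
            return j
        if j < n - m:                # roll: drop s_j, append s_{j+m}
            hs = ((hs - int(text[j]) * high) * base + int(text[j + m])) % q
    return -1

print(rabin_karp("26", "31415926535"))   # 6: '26' starts at index 6
```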
Why Not Always Use Hashing?
• Some drawbacks:
• If an application calls for traversal of all items in sorted order, a hash table is
no good.
• Also, unless we use separate chaining, deletion is virtually impossible.
• It may be hard to predict the volume of data, and rehashing is an expensive
“stop-the-world” operation.
When to Use Hashing?
• All sorts of information retrieval applications involving thousands to millions of
keys.
• Typical example: Symbol tables used by compilers. The compiler hashes all
(variable, function, etc.) names and stores information related to each – no
deletion in this case.
• When hashing is applicable, it is usually superior; a well-tuned hash table will
outperform its competitors.
• Unless you let the load factor get too high, or you botch up the hash function. It is
a good idea to print statistics to check that the function really does spread keys
uniformly across the hash table.
Next lecture
• Dynamic programming and optimization.