COMP2012 Object-Oriented Programming and Data Structures
Topic 9: Hashing
Dr. Desmond Tsoi
Department of Computer Science & Engineering The Hong Kong University of Science and Technology Hong Kong SAR, China
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 1 / 33
Motivation
How would you find a student record given just the student’s name?
How does an electronic dictionary look up for a word, say, “computer”?
Each machine has an IP address in the Internet. How will an internet service look up an IPv6 address?
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 2 / 33
Part I
Hashing
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 3 / 33
General Idea
data item
A hash table is an array of some fixed size, containing all the data items.
Each item has a key; search is performed based on the keys.
Each key is mapped into some position in the array in the range 0 to m−1, where m is the array size.
The mapping is called hash function. Example applications:
Compilers use hash tables, called symbol tables, to keep track of declared identifiers in a program.
On-line spell checkers: After hashing the entire dictionary, one can check each word in constant time and print out the mis-spelled words in order of their appearance in the document.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 4 / 33
Hash Table
Hash table is a data structure that supports: search, insertion, and deletion (deletion may be unnecessary in some applications).
The implementation of hash tables is called hashing. Hashing is a technique which allows the executions of above
operations in constant time on average.
Unlike other data structures such as linked lists or binary trees, data items are generally not ordered in hash tables.
As a consequence, hash tables don’t support the following operations
find min and find max
finding successor and predecessor
reporting data within a given range
listing out the data in order
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 5 / 33
Unrealistic Solution
Universe of keys U is the set of all possible values of the keys.
Each position, also called a slot, in the hash table T corresponds to a key in U.
T[k] corresponds to a data item with key k.
If the table contains no data with key k, then T[k] = nil.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 6 / 33
Unrealistic Solution ..
Insertion, deletion and search all take constant time.
Problem: it wastes too much space if the universe of keys is large
compared with the actual number of data to be stored.
E.g., in HKUST, student IDs are 8-digit integers. So the key universe has a size of 108, but we only have ∼7000 students (not counting the alumni)!
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 7 / 33
Hash Function
Hash function, h maps the universe of keys U into the slots of a hash table T [0, 1, …, m − 1].
Several keys may be mapped to the same slot.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 8 / 33
Hash Function ..
possible keys hash function hash table
h
{ k 0 , k 1 , . . . , k N − 1 } − −−−−−−→ { 0 , 1 , . . . , m − 1 }
Usually, m ≪ N.
The keys ki are assumed to be natural numbers.
If they are not, they can always be converted or interpreted in natural numbers.
h(ki) = an integer in [0,…,m−1] is called the hash value of ki. With hashing, an item of key k is stored in T[h(k)].
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 9 / 33
Collision Problem
Two keys may be hashed to the same slot.
Question: Can we ensure that any two distinct keys are hashed to different slots? No! If N > m, where
m = size of the hash table, and
N = number of data
Solution:
1. Design a good hash function that is
fast to compute, and
can minimize the number of collisions
2. Design a method to resolve the collisions when they occur.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 10 / 33
Hash Function Design
A simple and reasonable strategy: h(k) = k mod m. e.g. m = 12,k = 100,h(k) = 4
It requires only a single division operation (quite fast).
Certain values of m should be avoided: e.g., if m = 2p, then h(k) is just the p lowest-order bits of k; thus, the hash function does not depend on all the bits.
Similarly, if the keys are decimal numbers, m should not be set to be a power of 10.
It’s a good practice to set the table size m to a prime number.
Good values for m: primes not too close to exact powers of 2
e.g., for a hash table to hold 2,000 numbers, if we don’t mind an
average of 3 numbers being hashed to the same slot, choose m = 701.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 11 / 33
How to Deal with String Keys: Method 1
Add up the ASCII values of the characters in the string.
(For simplicity, we only use their positions in the alphabets here.)
Most hash functions assume that keys are natural numbers. If keys are not natural numbers, a way must be found to interpret them as natural numbers.
Problems:
Different permutations of the same set of characters would have the
same hash value.
If the table size is large, the keys do not distribute well.
e.g., m = 10,007 (a prime number) and all string keys have ≤ 8 characters. Since ASCII values ≤ 127, their numeric keys are in the range between 0 and 127 × 8 = 1, 016.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 12 / 33
How to Deal with String Keys: Method 2
h(key) = key[0] + 27·key[1] + 272·key[2] mod m where m is hash table size.
If the first 3 characters are random and the table size m is 10,007, then it is a reasonably equitable distribution.
Problems:
letters in an English word are not random;
according to some dictionary, there are only 2,851 different combinations for the first 3 letters of English words;
therefore, only at most 28% of the table can actually be hashed to (if m = 10,007).
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 13 / 33
How to Deal with String Keys: Method 3
h(key)
L−1
= 37(L−1−i)·key[i] mod m i=0
= 37(L−1)·key[0]+37(L−2)·key[1]+…+key[L−1] mod m where L is the length of a key.
This hash function involves all characters in the key and the hash values are expected to distribute well.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 14 / 33
Part II
Collision Handling
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 15 / 33
Separate Chaining
Keys: the set of squared numbers {1,4,9,16,…}.
Hash function: h(k) = k mod 10.
Using the idea of equivalence classes.
The hash table is more than a simple array, but a table of linked lists. Keys having the same hash values are chained on a separate linked
list.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 16 / 33
Separate Chaining Operations
To insert a key k: Compute h(k).
If T[h(k)] contains a null pointer, the list (or chain) is empty. Initialize this table entry to point to a linked list with a single node containing k alone.
If T[h(k)] points to a non-empty list, add k to the beginning of the list.
To delete a key k
Compute h(k) to determine which list to traverse.
Search for the key k in the list that T[h(k)] points to. Delete the item with key k if it is found.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 17 / 33
Separate Chaining Features
If the hash function works well, the number of keys in each linked list will be a small constant.
Therefore, we expect that each search, insertion, and deletion can be done in constant time.
Disadvantage: Memory allocations and de-allocations in linked list manipulations slow down the operations.
Advantage: deletion is easy.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 18 / 33
Open Addressing
Instead of putting keys of the same hash value into a chain, open addressing will relocate the key k to be inserted if it collides with an existing key.
Open addressing needs to determine the sequence of slots to be examined for key relocation.
Key k may be stored at an entry different from T[h(k)].
Two issues arise:
what is the relocation scheme?
how to search for k later?
Three common methods for resolving collisions in open addressing 1. linear probing
2. quadratic probing
3. double hashing
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 19 / 33
Basic Strategy of Open Addressing
To insert a key k, compute h0(k).
If T[h0(k)] is empty, insert it there.
If collision occurs, probe alternative cell in the following order: h1(k), h2(k), . . ., until an empty cell is found.
hi(k)=(hash(k)+f(i)) modm, where the function f determines the collision resolution strategy and f(0) = 0.
Different open addressing methods differ in the definition of f ().
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 20 / 33
Linear Probing: Insertion
f(i) = i
hi (k) = (hash(k) + i) mod m
Basic strategy: Table cells are probed sequentially (with wrap-around) until an empty slot is found.
Again let m be the table size and N be the number of items. Let k be the new key to be inserted; compute hash(k). Fori=0tom−1,computej=(hash(k)+i) modm.
If T[j] is empty, then we put k there and stop.
If no empty slot can be found to put k, the table is full; report an error.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 21 / 33
Linear Probing: Example
hash(k) = k mod 10
Insert the following keys: 89, 18, 49, 58, 69 To insert 58, probe T[8],T[9],T[0],T[1] To insert 69, probe T[9],T[0],T[1],T[2]
Table Index
Insert 89
Insert 18
Insert 49
Insert 58
Insert 69
0
1
2
3
4
5
6
7
8
9
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 22 / 33
Primary Clustering
We call a block of contiguously occupied table entries a cluster.
On average, when we insert a new key k, we may hit the middle of a cluster. Therefore, the time to insert k would be proportional to half the size of a cluster.
⇒ the larger the cluster, the slower the performance.
Linear probing has the following disadvantages:
Once h(k) falls into a cluster, this cluster will definitely grow in size by
one. Thus, this may worsen the performance of insertions in the future.
If two clusters are only separated by one entry, they may be merged
together by an insertion to that entry.
⇒ cluster size can increase drastically by 1 insertion.
This means that the performance of insertion can deteriorate
drastically after a single insertion.
Large clusters are easy targets for collisions.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 23 / 33
Quadratic Probing
f(i) = i2
hi (k) = (hash(k) + i2) mod m
Table Index
Insert 89
Insert 18
Insert 49
Insert 58
Insert 69
0
1
2
3
4
5
6
7
8
9
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 24 / 33
Quadratic Probing: Example
Example:
hash(k) = k mod 10
Insert the following keys: 89, 18, 49, 58, 69
Toinsert58,probeT[8],T[9],T[(8+4)mod10]
To insert 69, probe T [9], T [(9 + 1) mod 10],
T[(9+4) mod10]
Two keys with different home positions will have different probe sequences. E.g.,
m = 101,h(k1) = 30,h(k2) = 29
probe sequence for k1: 30,30+1,30+4,30+9
probe sequence for k2: 29,29+1,29+4,29+9
If the table size is prime, then a new key can always be inserted if the table is at least half empty (see proof in reference book by Weiss).
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 25 / 33
Secondary Clustering
Keys that hashed to the same home position will probe the same alternative cells.
Simulation results suggest that it generally causes less than an extra half probe per search.
To avoid secondary clustering, the probe sequence need to be a function of the original key value, not the home position.
Rm 3553, desmond@ust.hk COMP2012 (Fall 2020) 26 / 33
Double Hashing
To alleviate the problem of clustering, the sequence of probes for a key should be independent of its primary home position.
Thus, use two hash functions: hash( ) and hash2( ). f (i) = i × hash2(k)
hi (k) = (hash(k) + i × hash2(k)) mod m e.g.,hash2(k)=R−(k modR),whereR