Hash Tables (Chapter 11 in the textbook)
Based on slides of Dan Suciu
1
Dictionary ADT
create : insert : find : delete :
dictionary
dictionary key values dictionary dictionary key values
dictionary key dictionary
kim chi
spicy cabbage
Kreplach
tasty stuffed dough
Kiwi
Australian fruit
insert(kohlrabi, upscale tuber)
find(kreplach) kreplach: tasty stuffed dough
2
Implementations So Far
unsorted list
sorted array
Binary Search Trees;
Red-Black – worst case
Array of size n where keys are 0,…,n-1
insert
find+O(1)
O(n)
O(log n)
find
O(n)
O(log n)
O(log n)
delete
find+ O(1)
O(n)
O(log n)
If the keys are 0, 1, …, n-1 then we can do all three in O(1) !
But typically the “space” of keys is much larger than n.
3
Hash Tables: Basic Idea
• Useakey(arbitrarystringornumber)toindex directly into an array – O(1) time to access records
– A[h(“kreplach”)] = “tasty stuffed dough”
– Need a hash function, h, to convert the key to an integer
Key
Data
0 = h(“kim chi”)
“kim chi”
spicy cabbage
1 = h(“kreplach”)
“kreplach”
tasty stuffed dough
2 = h(“kiwi”)
“kiwi”
Australian fruit
4
•
Applications
When log(n) is just too big…
– Symbol tables in interpreters
– Real-time databases
• air traffic control
• packet routing
When associative memory is needed…
(standard memory: give location, get value at that location;
associative memory: give value, get locations where the value is stored.)
– Dynamic programming
• cache results of previous computation
• Chess endgames
– Many text processing applications – e.g. Web
•
5
Properties of Good Hash Functions
• Mustreturnnumber0,…,tablesize-1
• Shouldbeefficientlycomputable:O(1)time
• Shouldnotwastespaceunnecessarily
– For every index, there is at least one key that hashes to it
• Load factor lambda = (number of keys / TableSize)
• Shouldminimizecollisions
= different keys hashing to same index
6
Integer Keys
• Hash(x) = x % TableSize (if the key x is a number)
• In theory it is a good idea to make TableSize prime. Why? Keys often have some pattern
• mostly even
• mostly multiples of 10
• in general: mostly multiples of some k
If k is a factor of TableSize, then only (TableSize/k) slots will ever be used!
To be safe: choose TableSize = a prime.
7
String Keys – converting to integers
• If keys are strings, can get an integer by adding up ASCII values of characters
in key
• Problem 1: What if TableSize is 10,000 and all keys are 8 or less characters long?
• Problem 2: What if keys often contain the same characters (“abc”, “bca”, etc.)?
8
for (i=0;i
22
Closed Hashing II: Quadratic Probing • MainIdea:Spreadoutthesearchforanemptyslot–
Increment by i2 instead of i
• hi(X)=(Hash(X)+i2)%TableSize h0(X) = Hash(X) % TableSize
h1(X) = Hash(X) + 1 % TableSize h2(X) = Hash(X) + 4 % TableSize h3(X) = Hash(X) + 9 % TableSize
23
Quadratic Probing Example
insert(14) insert(8) insert(21) insert(2) 14%7 = 0 8%7 = 1 21%7 =0 2%7 = 2
0000 1111 2222 3333 4444 5555 6666
probes: 1 1 3 1 24
14
14
8
14
8
21
14
8
2
21
Problem With Quadratic Probing
insert(14) insert(8) insert(21) insert(2) 14%7 = 0 8%7 = 1 21%7 =0 2%7 = 2
insert(7) 7%7 = 0
0 1 2 3 4 5 6
14
14
8
14
8
21
14
8
2
21
14
8
2
21
0000 1111 2222 3333 4444 5555 6666
probes:1 1 3 1 ??25
Load Factor in Quadratic Probing
• The problem is called secondary clustering (the set of
filled slots ‘bounces’ around the array in a fixed pattern).
• Theorem: If TableSize is prime and 1⁄2, quadratic probing
will find an empty slot; for greater , might not
• With load factors near 1⁄2 the expected number of probes is empirically near optimal – no exact analysis known
26
Closed Hashing III: Double Hashing • Idea:Spreadoutthesearchforanemptyslotby
using a second hash function
– No primary or secondary clustering
• hi(X)=(Hash1(X)+(i-1)*Hash2(X))mod
TableSize
for i = 0, 1, 2, …
• Good choice of Hash2(X) can guarantee does not get “stuck” as long as < 1
– Integer keys:
Hash2(X) = R – (X mod R)
where R is a prime smaller than TableSize
27
Double Hashing Example
insert(14) insert(8) insert(21) insert(2) 14%7 = 0 8%7 = 1 21%7 =0 2%7 = 2
insert(7) 7%7 = 0 5-(7%5)=3
0 1 2 3 4 5 6
5-(21%5)=4 0000 1111 2222 3333 4444 5555 6666
14
14
8
14
8
21
14
8
2
21
14
8
2
21
probes:1 1 2 1 ??28
Load Factor in Double Hashing
• For any < 1, double hashing will find an empty slot (given appropriate table size and hash2)
• Search cost approaches optimal (random re-hash):
– unsuccessful search:
– successful search:
1
1ln 1 1
Note natural logarithm!
1
• No primary clustering and no secondary clustering
• Still becomes costly as nears 1.
29
What to do when the hash table is too full:
Rehash:
• Build a new table with size > 2 * size of old table, and a prime number.
•Take a new hash function (appropriate for the new size). •Insert all the elements from the old table in the new table.
30
Deletion with Separate Chaining
No problem – simply delete element from the linked list
31
Deletion in Closed Hashing
delete(2) find(7) 00
11 22 33 44 55 66
Where is it?!
0
1
2
7
0
1
7
What should we do instead?
32
Lazy Deletion
delete(2) find(7) 00
11 22 33 44 55 66
But now what is the problem?
Indicates deleted value: if you find it, probe again
0
1
2
7
0
1
#
7
33