15_Hash_Functions.pdf
Lecture 15
Dictionaries and Hash Tables
EECS 281: Data Structures & Algorithms
Outline
• Dictionary ADT
• Containers with key lookup
– Bucket-based data structures
• Hashing
– Hash functions
– Compression
• Collision Resolution
2
Dictionary ADT
Data Structures & Algorithms
Dictionary ADT
A container of items (key/value pairs) that
supports two basic operations
• Insert a new item
• Search (retrieve) an item with a given key
Two primary uses
– A set of things: check if something is in the set
– Key-Value storage: look up values by keys
4
Dictionary ADT Operations
• Desirable Operations
– Insert a new item
– Search for item(s) having a given key
– Remove a specified item
– Sort the dictionary
– Select the kth largest item in a dictionary
– Join two dictionaries
• Other basic container operations: construct, test
if empty, destroy, copy…
5
Containers with key lookup
Identifying a container with fast search and
fast insert for arbitrary
• Sorted Vectors: Insert will be O(n)
• Unsorted Vectors: Search will be O(n)
• Sorted/Unsorted Linked Lists: Search is O(n)
• Binary Search Tree (STL map<>)
– Search: average O(log n), worst-case O(n)
– Insert: average O(log n), worst-case O(n)
• Hash Table (STL unordered_map<>)
– Search: average O(1), worst case O(n)
– Insert: average O(1), worst case O(n)
6
What if the set of keys is small?
• Example: calendar for 1..n days
– n could be 365 or 366
– Can look up a particular day in O(1) time
– Every day is represented by a bucket,
i.e., some container
• If we have a range of integers that fits into
memory, everything is easy
– What if we don’t?
7
Dictionary ADT
Data Structures & Algorithms
Hash Table Fundamentals
Data Structures & Algorithms
Hashing Defined
Locate items in a table by key
• Use arithmetic operations to calculate a table index
(bucket) from a given key
Need
• Translation: converts a search key into an integer
• Compression: limits an integer to a valid index
• Collision resolution: resolves search keys that hash to
same table index
10
Direct Addressing
Each key maps to an element of the array
11
Hashed Addressing
Only actual keys map to an element of the array
12
Hash Function
Translation: t(key) Þ hashint
• Converts the key into an integer
Compression: c(hashint) Þ table index
• Maps the hashed integer into the range [0, M)
A hash function combines translation and
compression:
h(key) Þ c(t(key)) Þ table index
Note: std::hash<> provides only translation, not
compression
13
banana
Index
Hashing Example
Hash Function
NULL
NULL
NULL
NULL
NULL
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Key
int translate(const char *c) {
return (*c – ‘a’);
}
14
[1]
3
Translating Floating Points
• key between 0 and 1: [0, 1)
h(key) = ëkey * Mû
• key between s and t: [s, t)
h(key) = ë(key – s) / (t – s) * Mû
• Try: range = [1.38, 6.75), M = 13
compute h(3.65)
15
Translating Integers
Modular hash function
t(key) = key
h(key) = c(t(key)) = key mod M
• Great if keys randomly distributed
– Often, keys are not randomly distributed
– Example: midterm bubbles scores all multiples of 2.5
– Example: pick a number from 1 to 10
• Don’t want to pick a bad M
– BAD: M and key have common factors
17
Translating Strings
• Simple hash sums ASCII character codes
• Problem Case: stop, tops, pots, spot
– Sum of each is equivalent
– All will map to same hash table address
– Multiple collisions
• Solution: Character position is important
– Consider decimal numbers
123 = 1 * 102 + 2 * 101 + 3 * 100
321 = 3 * 102 + 2 * 101 + 1 * 100
18
Better Translation: Rabin Fingerprint
• Instead of adding up character codes, view
strings as decimal numbers
– Characters: ‘T’, ‘O’, ‘M’, ‘ ’, ‘M’, …
– ASCII codes: 84, 79, 77, 32, 77, …
– Running fingerprints:
T 84
TO 10 * 84 + 79
TOM 10 * (10 * 84 + 79) + 77
TOM! 10 * (10 * (10 * 84 + 79) + 77) + 32, …
– Base 10 is used for illustration only (use larger numbers)
• Shuffling the chars usually changes result
19
Compression
c(hashint) Þ index in range [0, M)
– hashint may be < 0 or >= M
Division Method
|hashint| mod M, where M is prime
MAD (multiply and divide) Method
|a * hashint + b| mod M, where a and b are prime
Use this method when you can’t control M
Note: a mod M must not equal 0!
20
Hash Function
21
Translation: t(key) Þ hashint
• Converts the key into an integer
Compression: c(hashint) Þ table index
• Maps the hashed integer into the range [0, M)
A hash function combines translation and
compression:
h(key) Þ c(t(key)) Þ table index
lemon
Index
Hashing Example
Hash Function
NULL
NULL
NULL
NULL
NULL
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Key
int translate(const char *c) {
return (*c – ‘a’);
}
22
[11]
Need compression!
Hash Table Size
• Table of size M
– About the number of elements expected
– If unsure, guess high
• Hash function must return keys as integers
in range [0, M)
23
lemon
Index
Index Compression
Hash Function
NULL
NULL
NULL
NULL
NULL
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Key
int translate(const char *c) {
return (*c – ‘a’);
}
int compression(int i) {
return i % table_size;
}
24
[3]
orange
Index
Collision
Hash Function
NULL
NULL
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Key
int translate(const char *c) {
return (*c – ‘a’);
}
int compression(int i) {
return i % table_size;
}
25
[6]
[!!!]
Hash Tables Summarized
• Efficient ADT for insert, search, and
remove
• Hashing a key h(key) Þ table index
– Maps key to table index in two steps
– Translates key into an integer
t(key) Þ hashint
– Compression maps integer into range [0, M)
c(hashint) Þ table index
• Therefore, h(key) ≡ c(t(key))
26
Hash Table Fundamentals
Data Structures & Algorithms
Hash Table Analysis & Applications
Data Structures & Algorithms
Good Hash Functions
• Benefits of hash tables depend on having “good”
hash functions
• Must be easy to compute
– Will compute a hash for every key
– Will compute same hash for same key
• Should distribute keys evenly in table
– Will minimize collisions
• Collision: two keys map to same index
• Trivial, poor hash function: h(key) { return 0; }
– Easy to compute, but poor distribution maximizes
collisions
29
Complexity of Hashing
For simplicity, assume perfect hashing
(no collisions)
• What is cost of insertion? O( )
• What is cost of search? O( )
• What is cost of removal? O( )
Wouldn’t it be nice to live in a perfect world?
• Recall the Pigeonhole Principle
1
1
1
30
Real-World Hash Tables
• C++ hash table containers (STL, C++11+)
• unordered_set<>, unordered_multiset<>
• unordered_map<>, unordered_multimap<>
• Database indices are built on hash tables
• Compilers use a hash table for identifiers
31
Sample Program (1/2)
1 #include
2 #include
3 #include
4 using namespace std;
5
6 int main() {
7 unordered_map
8 string month;
9 // The simple way to add items
10 // [] causes it to exist!
11 months[“January”] = 31;
12 months[“February”] = 28;
13 months[“March”] = 31;
14 months[“April”] = 30;
15 months[“May”] = 31;
16 months[“June”] = 31;
17 months[“July”] = 31;
18 months[“August”] = 31;
19 months[“September”] = 30;
20 months[“October”] = 31;
21 months[“November”] = 30;
22 months[“December”] = 31; 32
Sample Program (2/2)
23 cout << "Enter a month name: ";
24 cin >> month;
25
26 // The simple (but wrong way) to look up an item
27 // Causes “month” to exist, with 0 days
28 // cout << month << " has " << months[month] << " days\n";
29
30 // The right way to look up an item
31 auto it = months.find(month); // An iterator to a pair
32 if (it == months.end())
33 cout << month << " not found\n";
34 else
35 cout << it->first << " has " << it->second << " days\n";
36
37 return 0;
38 } // main() 33
Major Uses of Hash Tables
• Sets of ints, strings, images, class objects
– Check set membership
– Detect/filter duplicates
– Find unique elements
– Find matching elements, e.g., a + b = 0
• Key-value storage: look up values by keys
– Count things: value_type = int
– Maintain sparse vectors: key_type = int
– Maintain linked structures: key_type = value_type
34
Index
Sorting a Hash Table?
Hash Function
NULL
NULL
[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Key
int translate(const char *c) {
return (*c – ‘a’);
}
int compression(int i) {
return i % table_size;
}
35
Dictionary ADTs & Hashing
Hashing is an efficient implementation of:
• Insert a new item
• Search for an item (or items) having a given key
• Remove a specified item
Hashing is an inefficient implementation of:
• Select the kth largest item in a dictionary
• Sort the items in the dictionary
36
Data Structures for a Book Index
• For each term/phrase store page number(s)
where it is introduced
• Sorted entries?
• How does one
construct such
an index?
37
Composite Hash Functions
• Example: hang some
info over geo-locations
– Hash {long, lat} pairs
• How do we combine
two hash functions?
• H({x, y})
= (h(x) + p * h(y)) % M
• How do we hash
class objects?
38
Is (h1 + p * h2) % M a good hash combiner?
• Recall properties of good hash functions
– Fast computation
– Even distribution of values
– No easily-predictable collisions
((h1 + p*h2) + p*h3) % M = ((h1 + p*h3) + p*h2) % M
• A better combiner for n values (i = 0 .. n – 1)
seed ^= hi + C + (seed << 6) + (seed >> 2) 39
When Not To Use Hash Tables
• In many applications, it is tempting to use
unordered_set<> or unordered_map<>, but
– Significant space overhead of nested
containers
– Every access computes hash function
– O(n) worst-case time (STL implementations)
40
When Not To Use Hash Tables
• When keys are small ints,
use bucket arrays
– Some keys can be coerced
to small integers: enums,
simple fractions, months
• For static data, set membership, or lookup
– Consider sorting + binary search
• For key-value storage where traversals
are needed but not lookup (e.g., sparse vec)
– Store key-value pairs in a list (or vector)
41
Hash Table Analysis & Applications
Data Structures & Algorithms
Word Count Demo
From a web browser:
bit.ly/eecs281-wordcount-demo
From a terminal:
wget bit.ly/eecs281-wordcount-demo -O wordcount.cpp
At the command line:
g++ -std=c++1z wordcount.cpp -o wordcount
./wordcount
Enter filename: wordcount.cpp