CS计算机代考程序代写 data structure database compiler c++ algorithm 15_Hash_Functions.pdf

15_Hash_Functions.pdf

Lecture 15
Dictionaries and Hash Tables

EECS 281: Data Structures & Algorithms

Outline
• Dictionary ADT

• Containers with key lookup

– Bucket-based data structures
• Hashing

– Hash functions
– Compression

• Collision Resolution

Dictionary ADT

Data Structures & Algorithms

Dictionary ADT
A container of items (key/value pairs) that
supports two basic operations

• Insert a new item
• Search (retrieve) an item with a given key

Two primary uses

– A set of things: check if something is in the set
– Key-Value storage: look up values by keys

Dictionary ADT Operations
• Desirable Operations

– Insert a new item
– Search for item(s) having a given key
– Remove a specified item
– Sort the dictionary
– Select the kth largest item in a dictionary
– Join two dictionaries

• Other basic container operations: construct, test
if empty, destroy, copy…

Containers with key lookup
Identifying a container with fast search and
fast insert for arbitrary pairs

• Sorted Vectors: Insert will be O(n)
• Unsorted Vectors: Search will be O(n)
• Sorted/Unsorted Linked Lists: Search is O(n)
• Binary Search Tree (STL map<>)

– Search: average O(log n), worst-case O(n)
– Insert: average O(log n), worst-case O(n)

• Hash Table (STL unordered_map<>)
– Search: average O(1), worst case O(n)
– Insert: average O(1), worst case O(n)

What if the set of keys is small?
• Example: calendar for 1..n days

– n could be 365 or 366
– Can look up a particular day in O(1) time
– Every day is represented by a bucket,

i.e., some container
• If we have a range of integers that fits into

memory, everything is easy

– What if we don’t?

Dictionary ADT

Data Structures & Algorithms

Hash Table Fundamentals

Data Structures & Algorithms

Hashing Defined
Locate items in a table by key
• Use arithmetic operations to calculate a table index

(bucket) from a given key
Need
• Translation: converts a search key into an integer
• Compression: limits an integer to a valid index
• Collision resolution: resolves search keys that hash to

same table index

Direct Addressing

Each key maps to an element of the array

Hashed Addressing

Only actual keys map to an element of the array

Hash Function
Translation: t(key) Þ hashint
• Converts the key into an integer
Compression: c(hashint) Þ table index
• Maps the hashed integer into the range [0, M)

A hash function combines translation and
compression:

h(key) Þ c(t(key)) Þ table index

Note: std::hash<> provides only translation, not
compression

banana
Index

Hashing Example

Hash Function

NULL

NULL
NULL

NULL

[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]

Key

int translate(const char *c) {
return (*c – ‘a’);

}

[1]

Translating Floating Points
• key between 0 and 1: [0, 1)

h(key) = ëkey * Mû

• key between s and t: [s, t)
h(key) = ë(key – s) / (t – s) * Mû

• Try: range = [1.38, 6.75), M = 13
compute h(3.65)

Translating Integers
Modular hash function

t(key) = key
h(key) = c(t(key)) = key mod M

• Great if keys randomly distributed
– Often, keys are not randomly distributed
– Example: midterm bubbles scores all multiples of 2.5
– Example: pick a number from 1 to 10

• Don’t want to pick a bad M
– BAD: M and key have common factors

Translating Strings
• Simple hash sums ASCII character codes

• Problem Case: stop, tops, pots, spot

– Sum of each is equivalent
– All will map to same hash table address
– Multiple collisions

• Solution: Character position is important

– Consider decimal numbers
123 = 1 * 102 + 2 * 101 + 3 * 100

321 = 3 * 102 + 2 * 101 + 1 * 100

Better Translation: Rabin Fingerprint
• Instead of adding up character codes, view

strings as decimal numbers
– Characters: ‘T’, ‘O’, ‘M’, ‘ ’, ‘M’, …
– ASCII codes: 84, 79, 77, 32, 77, …
– Running fingerprints:

T 84
TO 10 * 84 + 79
TOM 10 * (10 * 84 + 79) + 77
TOM! 10 * (10 * (10 * 84 + 79) + 77) + 32, …

– Base 10 is used for illustration only (use larger numbers)
• Shuffling the chars usually changes result

Compression
c(hashint) Þ index in range [0, M)

– hashint may be < 0 or >= M
Division Method

|hashint| mod M, where M is prime
MAD (multiply and divide) Method

|a * hashint + b| mod M, where a and b are prime
Use this method when you can’t control M

Note: a mod M must not equal 0!

Hash Function

Translation: t(key) Þ hashint
• Converts the key into an integer
Compression: c(hashint) Þ table index
• Maps the hashed integer into the range [0, M)

A hash function combines translation and
compression:

h(key) Þ c(t(key)) Þ table index

lemon
Index

Hashing Example

Hash Function

NULL

NULL
NULL

NULL

[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]

Key

int translate(const char *c) {
return (*c – ‘a’);

}

[11]

Need compression!

Hash Table Size
• Table of size M

– About the number of elements expected
– If unsure, guess high

• Hash function must return keys as integers
in range [0, M)

lemon
Index

Index Compression

Hash Function

NULL

NULL
NULL

NULL

[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]

Key

int translate(const char *c) {
return (*c – ‘a’);

}
int compression(int i) {
return i % table_size;

}

[3]

orange
Index

Collision

Hash Function

NULL

[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]

Key

int translate(const char *c) {
return (*c – ‘a’);

}
int compression(int i) {
return i % table_size;

}

[6]

[!!!]

Hash Tables Summarized
• Efficient ADT for insert, search, and

remove

• Hashing a key h(key) Þ table index
– Maps key to table index in two steps
– Translates key into an integer

t(key) Þ hashint
– Compression maps integer into range [0, M)

c(hashint) Þ table index
• Therefore, h(key) ≡ c(t(key))

Hash Table Fundamentals

Data Structures & Algorithms

Hash Table Analysis & Applications

Data Structures & Algorithms

Good Hash Functions
• Benefits of hash tables depend on having “good”

hash functions
• Must be easy to compute

– Will compute a hash for every key
– Will compute same hash for same key

• Should distribute keys evenly in table
– Will minimize collisions

• Collision: two keys map to same index

• Trivial, poor hash function: h(key) { return 0; }
– Easy to compute, but poor distribution maximizes

collisions

Complexity of Hashing
For simplicity, assume perfect hashing

(no collisions)

• What is cost of insertion? O( )
• What is cost of search? O( )
• What is cost of removal? O( )

Wouldn’t it be nice to live in a perfect world?

• Recall the Pigeonhole Principle

1
1
1

Real-World Hash Tables
• C++ hash table containers (STL, C++11+)

• unordered_set<>, unordered_multiset<>
• unordered_map<>, unordered_multimap<>

• Database indices are built on hash tables

• Compilers use a hash table for identifiers

Sample Program (1/2)
1 #include
2 #include
3 #include
4 using namespace std;
5

6 int main() {
7 unordered_map months;
8 string month;
9 // The simple way to add items
10 // [] causes it to exist!
11 months[“January”] = 31;
12 months[“February”] = 28;
13 months[“March”] = 31;
14 months[“April”] = 30;
15 months[“May”] = 31;
16 months[“June”] = 31;

17 months[“July”] = 31;
18 months[“August”] = 31;
19 months[“September”] = 30;
20 months[“October”] = 31;
21 months[“November”] = 30;
22 months[“December”] = 31; 32

Sample Program (2/2)
23 cout << "Enter a month name: "; 24 cin >> month;
25

26 // The simple (but wrong way) to look up an item
27 // Causes “month” to exist, with 0 days
28 // cout << month << " has " << months[month] << " days\n"; 29 30 // The right way to look up an item 31 auto it = months.find(month); // An iterator to a pair
32 if (it == months.end())
33 cout << month << " not found\n"; 34 else 35 cout << it->first << " has " << it->second << " days\n"; 36 37 return 0; 38 } // main() 33 Major Uses of Hash Tables • Sets of ints, strings, images, class objects – Check set membership – Detect/filter duplicates – Find unique elements – Find matching elements, e.g., a + b = 0 • Key-value storage: look up values by keys – Count things: value_type = int – Maintain sparse vectors: key_type = int – Maintain linked structures: key_type = value_type 34 Index Sorting a Hash Table? Hash Function

NULL

[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]

Key

int translate(const char *c) {
return (*c – ‘a’);

}
int compression(int i) {
return i % table_size;

}

Dictionary ADTs & Hashing
Hashing is an efficient implementation of:
• Insert a new item
• Search for an item (or items) having a given key
• Remove a specified item
Hashing is an inefficient implementation of:
• Select the kth largest item in a dictionary
• Sort the items in the dictionary

Data Structures for a Book Index
• For each term/phrase store page number(s)

where it is introduced

• Sorted entries?

• How does one
construct such
an index?

Composite Hash Functions
• Example: hang some

info over geo-locations

– Hash {long, lat} pairs
• How do we combine

two hash functions?
• H({x, y})

= (h(x) + p * h(y)) % M
• How do we hash

class objects?
38

Is (h1 + p * h2) % M a good hash combiner?
• Recall properties of good hash functions

– Fast computation
– Even distribution of values
– No easily-predictable collisions

((h1 + p*h2) + p*h3) % M = ((h1 + p*h3) + p*h2) % M
• A better combiner for n values (i = 0 .. n – 1)
seed ^= hi + C + (seed << 6) + (seed >> 2) 39

When Not To Use Hash Tables
• In many applications, it is tempting to use

unordered_set<> or unordered_map<>, but
– Significant space overhead of nested

containers
– Every access computes hash function
– O(n) worst-case time (STL implementations)

When Not To Use Hash Tables
• When keys are small ints,

use bucket arrays
– Some keys can be coerced

to small integers: enums,
simple fractions, months

• For static data, set membership, or lookup
– Consider sorting + binary search

• For key-value storage where traversals
are needed but not lookup (e.g., sparse vec)
– Store key-value pairs in a list (or vector)

Hash Table Analysis & Applications

Data Structures & Algorithms

Word Count Demo
From a web browser:

bit.ly/eecs281-wordcount-demo

From a terminal:
wget bit.ly/eecs281-wordcount-demo -O wordcount.cpp

At the command line:
g++ -std=c++1z wordcount.cpp -o wordcount

./wordcount

Enter filename: wordcount.cpp

Related Posts