Hash Tables and Hashing
Dictionary data structures, with lookup times
• Sorted array
Copyright By PowCoder代写 加微信 powcoder
Binary search — O(log n)
• Association list
Linear search — O(n)
Nowwhatifwehadaspecialcase? Keys∈{0,1,…,k−1} Could we do better?
• An array of size k, using keys as indices Array indexing — O(1) !
This technique is known as direct addressing
A direct addressing example
Suppose we want to map digits to their names in English:
let digits = [‘zero’, ‘one’, ‘two’, ‘three’, ‘four’,
‘five’, ‘six’, ‘seven’, ‘eight’, ‘nine’]
def get_digit_name(name: nat?) -> str?:
return digits[name]
If these are the keys you need, excellent data structure.
Non-direct addressing example: phone book
A phone book is a dictionary where the keys are names of people and the values are their phone numbers.
How can we use names (i.e., strings) as keys for an array? We’ll need to map these strings to integers ∈ {0, 1, …, k − 1}.
We’ll call a function that maps arbitrary keys to integers a hash function, and the array a hash table.
Idea: use the position of the first letter of the person’s name (A=0, B=1, …)
The “first letter” hash table
We have an array of “buckets”, indexed by the hash of the name, each of which can hold a single key-value association.
Q: If we insert Charles, which bucket would he map to?
To look up a key: hash the key, look in the correct bucket. If it has the right key, we found the value.
To insert a key: hash the key, add key and value to the correct bucket.
Hash collision!
Using our “first letter” hash function h1:
h1(“Allison”) = 0 h1(“Carol”) = 2 h1(“Yves”) = 24
When the hash function gives the same value for two keys, that’s called a hash collision:
h1(“Charles”) = 2
And we need some way to resolve the collision.
Note: while h1 is a terrible hash function (stay tuned), collisions are inevitable regardless of hash function.
Solutions for hash collisions
1. Store a linked list in each bucket (separate chaining)
• That’s in homework 3
2. Use some other free bucket instead (open addressing)
• Use the next free bucket (linear probing) – we’ll see today • Try further and further buckets (quadratic probing)
• Do more hashing to find another bucket (double hashing)
Lots of different options, all with different tradeoffs! Plus variants on top of these options:
• hashing • 2-choice hashing
• Fibonacci hashing
Separate chaining hash table
key val next
Each bucket is a list! All the keys that map to that bucket go into the list.
Linear search in the bucket to find the right key.
key key val val next next
key val next
Linear probing hash table
If your bucket is taken, try the next one! (then the next, …) → Number of elements limited by the number of buckets.
Let’s insert Charles. Bucket 2 is taken. Next bucket is 3, it’s free. Let’s put Charles there!
Lookup: start at “correct” bucket, keep looking until we find the
right key (or find an empty spot → key must be absent) Fun fact: deletion becomes much more complicated…
Linear probing hash table
If your bucket is taken, try the next one! (then the next, …) → Number of elements limited by the number of buckets.
Let’s look up Charles. Bucket 2 has wrong key. Bucket 3 has correct key! We have Charles’s number!
Lookup: start at “correct” bucket, keep looking until we find the
right key (or find an empty spot → key must be absent) Fun fact: deletion becomes much more complicated…
Linear probing hash table
If your bucket is taken, try the next one! (then the next, …) → Number of elements limited by the number of buckets.
Let’s look up Carlos.
Lookup: start at “correct” bucket, keep looking until we find the
right key (or find an empty spot → key must be absent) Fun fact: deletion becomes much more complicated…
Linear probing hash table
If your bucket is taken, try the next one! (then the next, …) → Number of elements limited by the number of buckets.
Let’s look up Carlos. Bucket 2 has wrong key. Bucket 3 has wrong key. Bucket 4 is empty!
No Carlos in the dictionary! Worst case: try all buckets!
Lookup: start at “correct” bucket, keep looking until we find the
right key (or find an empty spot → key must be absent) Fun fact: deletion becomes much more complicated…
Linear probing hash table
If your bucket is taken, try the next one! (then the next, …) → Number of elements limited by the number of buckets.
Let’s insert Carlos. Bucket 2 is taken. Bucket 3 is taken. Bucket 4 is empty! Let’s put Carlos there!
Lookup: start at “correct” bucket, keep looking until we find the
right key (or find an empty spot → key must be absent) Fun fact: deletion becomes much more complicated…
For good performance, we can’t let the table get too full • Separate chaining: the length of the chains is O(n) • Linear probing: the length of the scan is O(n)
For good performance, we can’t let the table get too full • Separate chaining: the length of the chains is O(n) • Linear probing: the length of the scan is O(n)
One way to measure how full the table is is the load factor: load factor = nk
• n: number of entries • k: number of buckets
For separate chaining, we should keep the load factor < 2 For linear probing, we should keep the load factor < 0.75 (Derived from practice)
Our h1 hash function sucks
• Using the first letter limits us to 26 buckets
For a big phonebook, need more buckets to keep load low
• Some letters have more names than others
Buckets are likely to be uneven
So most keys will be in buckets with lots of other keys!
• Etc. (It’s really bad.) Let’s try to do better.
Hash Function Criteria
So what makes a good hash function?
• Large output range: to allow a large number of buckets Say,0...264 −1ratherthan0...25
Use modulo to “scale” it down to # of buckets.
E.g., 100 buckets, h("Charles") = 172350 → 172350 modulo 100 → bucket 50
• Uniform: hashes are spread uniformily over their range To spread things out across all buckets
• Avalanche: similar inputs produce very different hashes E.g. h("George") = 34 but h("Georges") = 165293787
Ideally, 1 bit diff. in input → half the bits diff. in output Inputs are likely to have similarities between each other E.g., a lot of “Smith”s in the phone book
• Etc. (This is a very deep topic.)
Good news!
DSSL2 comes with a way to generate excellent hash functions.
#lang dssl2
import sbox_hash
# careful: want the method itself, NOT its return value
let my_hash_function = SboxHash64().hash
my_hash_function("George") # => 12954263217209680626
my_hash_function(“Georges”) # => 17471623022190201144
# can create multiple different hash functions
# e.g., for multiple hash tables
# each SboxHash64 creation generates a new, random
# hash function
let my_hash_function2 = SboxHash64().hash
my_hash_function2(“George”) # => 910100432649087933
my_hash_function2(“Georges”) # => 16264795133863970133
Complexity of lookup (and insertion)
More good news!
• If you have enough buckets (i.e., load factor is reasonable).
• And your hash function is good (i.e, elements are spread uniformily across buckets).
Then you’ll have O(1) elements per bucket on average. Thus, lookup and insertion will have an average case
asymptotic complexity of O(1)! Pretty great.
• But, worst case is still O(n). It’s very unlikely, though, so we won’t worry too much about it.
• The actual analysis is complicated, so out of scope.
Intuition: expected load (elements per bucket) is constant
When the load factor gets too high, we need to grow the table
• Requires rehashing everything!
With new # of buckets, elements may not map to the same
bucket as before!
And we still want to be able to find them
Even worse than copying all the elements of an array!
• Start with 100 buckets. h(“George”) = 2134
George goes to bucket: 2134 modulo 100 = 34
• Grow to 200 buckets
Where do we look for George now? 2134 modulo 200 = 134 Look for George in bucket 134. Oops, not there!
• Need to hash the key “George” again
Then put George in bucket 134 of the new table
Doubling the size strikes a good balance
• Not wasting too much space on unused buckets
• Not constantly having to grow and rehash
• Turns out doubling is also what we want for dynamic arrays!
We’ll see the math for that one in a few weeks
See hash tables in action
• Separate chaining: https://www.cs.usfca.edu/ ~galles/visualization/OpenHash.html
• Open addressing: https://www.cs.usfca.edu/ ~galles/visualization/ClosedHash.html
He calls them different names, but they’re the same thing.
Next time: Graphs
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com