Hashes
Quite often we modify how we represent our data for convenience
For example, in the Huffman code we took a string and made a tree (input=string, output=tree)
Hashes: who are they?
A hash function takes any input, but then outputs something a fixed size
If done well, the hash can be a fair approximation for the original (typically larger) input
Hashes: who are they?
Let¡¯s take a look at a very simple hash function: f(n) = n mod 10
5 mod 10 = 5
9 mod 10 = 9
10 mod 10 = 0 9856492059823457 mod 10 = 7 (always get a single digit as hash)
Hashes: who are they?
You¡¯ve probably used ¡°hashes¡± in your life without the fancy term:
Name initials (always 2 letters) Verifying identity (last 4 digits)
The same idea applies: not ¡°unique¡± but normally unique enough
Hashes: who are they?
Typically hashes (the output of hash functions) are used to differentiate or verify the inputs
Hash functions are typically one-way meaning performing the hash is easy but un-hashing is hard/impossible (how do you un-hash initials ¡°J.P.¡±?)
Hashes: what¡¯re they good for?
Hashes are used many places:
– Data structures (our focus)
– Twitter (¡°hashtag¡± is a hash) – Security (verification)
– File verification (ensure
downloaded files not corrupted)
Hashes: what¡¯re they good for?
Security often takes advantage that many hash functions are one-way
¡°Bypass codes¡± can utilize hashes
(backwards)
2359234
verific-
9361649
enter in order
hash
5040283
ation
h(9361649) = 2359234 h(5040283) = 9361649
Hashes: what¡¯re they good for?
When downloading large files, you may have seen ¡°SHA¡± or ¡°MD5¡±
This is a short (couple dozen hexadecimal numbers) hash of all the 1s and 0s in the large file
Hashes: what¡¯re they good for?
Suppose we want to store (key,value) pairs (called tables/maps/dictionaries)
These tables use the ¡°key¡± to look up the associated ¡°value¡± efficiently Examples:
key=umnID, value=your name key=house address, value=zip code
Tables
For a table we want 3 (or 2) things: -insert (put/add or update a key) -search (get the value of a key) -delete (remove an existing key)
To be useful, all of these commands must be quite fast (better than O(n), if ¡°n¡± keys are stored)
Tables
Let¡¯s make a table for (address, zip) We¡¯ll assume address are 4 numbers followed by a street name
Pad the left with 1, numbers with
zeros and letters as order in alphabet: 234 Main St –>1 0234 13 01 09 14 19 20
234 M A I N S T
Direct tables
Let¡¯s assume the zipcode of 234 Main St is 12345
We¡¯ll probably need to store many of these (key,value) pairs
The simplest ¡°multiple storage¡± unit is an array… so let¡¯s try that
Direct tables
We assume the keys are unique (no two houses with same address…)
So we could just directly use the key (a number) as the index in the array So for 234 Main St at zip 12345: table[10234130109141920] = 12345 … problems?
Direct tables
Problem 1: size of array
Based on the last index (from a short street name), you will need a very large array (what¡¯s the longest name)
Problem 2: Memory inefficient
The vast majority of indexes will be blank and wasted (no street ¡°aaaaa¡±)
Direct tables
Ignoring these memory issues… (1) How would you code each of
these operations?
(2) What are their runtime?
-search (get the value of a key) -insert (put/add or update a key) -delete (remove an existing key)
Direct tables
search(key)
return table[key]
insert(key, value) table[key] = value
delete(key) table[key] = null
All O(1) very nice!
Direct tables
You may have seen where this is going already…
We have a very large ¡°key¡± that we want to keep unique but find a more efficient indexing system….
Use a hash function!
(Fixed output length, semi-unique)
Hash tables
The only issue we need to resolve is how to handle hashes that ¡°overlap¡± (collide is technical term)
If this were initials ¡°James Parker¡±
and ¡°Jean-Luc Picard¡± are both ¡°J.P.¡± Need to still remember(and not lose one): (James Parker, teacher)
(Jean-Luc Picard, captain)
Hash tables
If there is a hash collision/overlap,
just store them all! (sorta like bucket
sort… but without the sorting)
stored as double linked list
(James Parker, teacher) (Jean-Luc Picard, captain)
Index: 1 2 3 250 676 Hash: aa ab ac … jp … zz Array:
(real key,value)
Hash tables
Book has a nice picture of the full double linked list implementation (double linked list makes easy insert and delete)
array index
Hash tables
How do the functions change when using hash tables?
search(key) insert(key, value) delete(key)
Hash tables: implementation
search(key)
index = hash(key) data = table[index] while data not null:
if data.key = key return data
data = data.next return null
Hash tables: implementation
insert(key, value) object = search(key) if object is null
index = hash(key)
data = (key,value)
data.next = table[index] table[index].previous = data table[index] = data
else
data.value = value
Hash tables: implementation
delete(key)
object = search(key) if object not null
object.previous.next = object.next object.next.previous = object.previous delete object
Note: need to add checks for null
Hash tables: implementation
Runtimes in hash table?
search(key) insert(key, value) delete(key)
Hash tables: analysis
Runtimes in hash table?
search(key) = O(max collision) insert(key, value) = O(1 + search) delete(key) = O(1 + search)
Worst case max collision = n (everyone¡¯s initials are ¡°J.P.¡±)
Hash tables: analysis
Average runtime of search()?
Math! For now we¡¯ll assume the hash function is good at mixing the input
If there are ¡°m¡± possible outputs of the has (for initials aa-zz, m=676), each key will have a probability 1/m of landing in any index
Hash tables: analysis
Average runtime of search() = average over keys:
length of each list with that key
Let Xij = 1 if keyi and keyj
have the same hash (h(ki)=h(kj)) otherwise Xij = 0
(called ¡°indicator function¡±)
Hash tables: analysis
if m proportional to n, this is O(1)
(like bucket sort)
i=1, j sums n-1 i=2, j sums n-2 i=3, j sums n-3 …
i=n-1, j sums 1 i=n, j sums 0 Total: 0+1+2+…+(n-1) =n(n-1)/2
Hash tables: analysis