CS计算机代考程序代写 data structure Hashes

Hashes

Quite often we modify how we represent our data for convenience
For example, in the Huffman code we took a string and made a tree (input=string, output=tree)
Hashes: who are they?

A hash function takes any input, but then outputs something a fixed size
If done well, the hash can be a fair approximation for the original (typically larger) input
Hashes: who are they?

Let¡¯s take a look at a very simple hash function: f(n) = n mod 10
5 mod 10 = 5
9 mod 10 = 9
10 mod 10 = 0 9856492059823457 mod 10 = 7 (always get a single digit as hash)
Hashes: who are they?

You¡¯ve probably used ¡°hashes¡± in your life without the fancy term:
Name initials (always 2 letters) Verifying identity (last 4 digits)
The same idea applies: not ¡°unique¡± but normally unique enough
Hashes: who are they?

Typically hashes (the output of hash functions) are used to differentiate or verify the inputs
Hash functions are typically one-way meaning performing the hash is easy but un-hashing is hard/impossible (how do you un-hash initials ¡°J.P.¡±?)
Hashes: what¡¯re they good for?

Hashes are used many places:
– Data structures (our focus)
– Twitter (¡°hashtag¡± is a hash) – Security (verification)
– File verification (ensure
downloaded files not corrupted)
Hashes: what¡¯re they good for?

Security often takes advantage that many hash functions are one-way
¡°Bypass codes¡± can utilize hashes
(backwards)
2359234
verific-
9361649
enter in order
hash
5040283
ation
h(9361649) = 2359234 h(5040283) = 9361649
Hashes: what¡¯re they good for?

When downloading large files, you may have seen ¡°SHA¡± or ¡°MD5¡±
This is a short (couple dozen hexadecimal numbers) hash of all the 1s and 0s in the large file
Hashes: what¡¯re they good for?

Suppose we want to store (key,value) pairs (called tables/maps/dictionaries)
These tables use the ¡°key¡± to look up the associated ¡°value¡± efficiently Examples:
key=umnID, value=your name key=house address, value=zip code
Tables

For a table we want 3 (or 2) things: -insert (put/add or update a key) -search (get the value of a key) -delete (remove an existing key)
To be useful, all of these commands must be quite fast (better than O(n), if ¡°n¡± keys are stored)
Tables

Let¡¯s make a table for (address, zip) We¡¯ll assume address are 4 numbers followed by a street name
Pad the left with 1, numbers with
zeros and letters as order in alphabet: 234 Main St –>1 0234 13 01 09 14 19 20
234 M A I N S T
Direct tables

Let¡¯s assume the zipcode of 234 Main St is 12345
We¡¯ll probably need to store many of these (key,value) pairs
The simplest ¡°multiple storage¡± unit is an array… so let¡¯s try that
Direct tables

We assume the keys are unique (no two houses with same address…)
So we could just directly use the key (a number) as the index in the array So for 234 Main St at zip 12345: table[10234130109141920] = 12345 … problems?
Direct tables

Problem 1: size of array
Based on the last index (from a short street name), you will need a very large array (what¡¯s the longest name)
Problem 2: Memory inefficient
The vast majority of indexes will be blank and wasted (no street ¡°aaaaa¡±)
Direct tables

Ignoring these memory issues… (1) How would you code each of
these operations?
(2) What are their runtime?
-search (get the value of a key) -insert (put/add or update a key) -delete (remove an existing key)
Direct tables

search(key)
return table[key]
insert(key, value) table[key] = value
delete(key) table[key] = null
All O(1) very nice!
Direct tables

You may have seen where this is going already…
We have a very large ¡°key¡± that we want to keep unique but find a more efficient indexing system….
Use a hash function!
(Fixed output length, semi-unique)
Hash tables

The only issue we need to resolve is how to handle hashes that ¡°overlap¡± (collide is technical term)
If this were initials ¡°James Parker¡±
and ¡°Jean-Luc Picard¡± are both ¡°J.P.¡± Need to still remember(and not lose one): (James Parker, teacher)
(Jean-Luc Picard, captain)
Hash tables

If there is a hash collision/overlap,
just store them all! (sorta like bucket
sort… but without the sorting)
stored as double linked list
(James Parker, teacher) (Jean-Luc Picard, captain)
Index: 1 2 3 250 676 Hash: aa ab ac … jp … zz Array:
(real key,value)
Hash tables

Book has a nice picture of the full double linked list implementation (double linked list makes easy insert and delete)
array index
Hash tables

How do the functions change when using hash tables?
search(key) insert(key, value) delete(key)
Hash tables: implementation

search(key)
index = hash(key) data = table[index] while data not null:
if data.key = key return data
data = data.next return null
Hash tables: implementation

insert(key, value) object = search(key) if object is null
index = hash(key)
data = (key,value)
data.next = table[index] table[index].previous = data table[index] = data
else
data.value = value
Hash tables: implementation

delete(key)
object = search(key) if object not null
object.previous.next = object.next object.next.previous = object.previous delete object
Note: need to add checks for null
Hash tables: implementation

Runtimes in hash table?
search(key) insert(key, value) delete(key)
Hash tables: analysis

Runtimes in hash table?
search(key) = O(max collision) insert(key, value) = O(1 + search) delete(key) = O(1 + search)
Worst case max collision = n (everyone¡¯s initials are ¡°J.P.¡±)
Hash tables: analysis

Average runtime of search()?
Math! For now we¡¯ll assume the hash function is good at mixing the input
If there are ¡°m¡± possible outputs of the has (for initials aa-zz, m=676), each key will have a probability 1/m of landing in any index
Hash tables: analysis

Average runtime of search() = average over keys:
length of each list with that key
Let Xij = 1 if keyi and keyj
have the same hash (h(ki)=h(kj)) otherwise Xij = 0
(called ¡°indicator function¡±)
Hash tables: analysis

if m proportional to n, this is O(1)
(like bucket sort)
i=1, j sums n-1 i=2, j sums n-2 i=3, j sums n-3 …
i=n-1, j sums 1 i=n, j sums 0 Total: 0+1+2+…+(n-1) =n(n-1)/2
Hash tables: analysis