Hash Functions
The primary desire of a hash function is to ¡°distribute¡± all data equally
Helps to avoid hash collisions:
¡°Michael¡± and ¡°Toby¡± collide as the function (hash) has the same output for both
Hash functions: properties
Our proof for O(1) average runtime depends on the function output being spread out
Specifically, if the ¡°fixed length¡± of the hash has ¡°m¡± possible outputs, … then two random inputs have probability 1/m of same output
Hash functions: properties
Our name initials hash function does not have this property…
¡°James Parker¡± ¡ú JP
This has m=26*26 = 676 possible outputs (with the English alphabet)
Hash functions: properties
However, names are not equally distributed across this space (Do you know anyone with the initials QQ? Or OI?)
The most common initial in USA is: JR, so this will have more collisions than QQ (thus not a good hash func)
Hash functions: properties
We might also want additional properties like: locality
(formally: locality-sensitive hashing)
This means if the function ¡°inputs¡± are close, then so are the ¡°outputs¡± Formally: |a-b|<|c-d| means |h(a)-h(b)| < |h(c)-h(d)|
Hash functions: properties
h( )
Hash functions: properties
h( ) h( )
Hash functions: properties
h( ) h( ) h( )
Hash functions: properties
As hash functions try to put things in ¡°buckets¡± (like bucket sort), you can actually use the same method (assuming your key ¡°k¡± is 0<=k<1) (¡°m¡± is possible outputs of function)
h(k) = floor(k*m) =roundDown(k*m)
Hash functions: decimals
Two problems with this approach:
- Might need to transform numbers
to decimals (can be difficult)
(for initials BZ = )
- Assumes inputs are randomly
distributed between 0 and 1
(not true for initials... and would be largely based off first initial)
Hash functions: decimals
The division hash function uses
the remainder(like example last time)
h(k) = k mod 10 (= k % 10)
So...
h(412) = 412 mod 10 = 2 h(32428374) = 32428374 mod 10 = 4
Hash functions: division
While this might work, there is one major issue:
- Might only use part of the number for identification
Using:
m=10 only looks at last digit m=2 only looks at odd/even
Hash functions: division
Depending on your choice of ¡°m¡±, you can get very bad results
You could accidentally have all anagrams (different word order) have the same hash output... h(tag) = 9 h(gat) = 9
h(gta) = 9 h(tga) = 9
Hash functions: division
Prime numbers (as by definition they are not divisible by many numbers) provide a good ¡°m¡±
Note: we will talk later about how to efficiently find prime numbers,
for now just assume they are fairly easy to find
Hash functions: division
Small example:
Bad: m=4 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,4,8}
h(k) = 1 for {} (nothing)
h(k) = 2 for {2,6,10}
h(k) = 3 for {} (nothing)
This is collision-tastic!
Hash functions: division
Small example:
Good: m=5 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,10}
h(k) = 1 for {6}
h(k) = 2 for {2}
h(k) = 3 for {8}
h(k) = 4 for {4}
Much better!
Hash functions: division
The division function is annoying as your choice for ¡°m¡± is picky
The multiplication hash function
adds another constant to get rid of
this issue:
Abuse of notation! ¡°mod 1¡± means take the decimal part... so: 3.1415 mod 1 = 0.1415
24598.62345 mod 1 = 0.62345
Hash functions: multiplication
Picking A=0.25 runs into much the same issue as our ¡°mod 4¡± division: Example: A=0.25, m=10
input = {0,2,4,6,8,10}
h(k) = 0 for inputs {0, 4, 8} h(k) = 5 for inputs {2, 6, 10} all other outputs are unused
Hash functions: multiplication
However, you can pick A to be some irrational number (math def) and this should prevent this issue
(This irrational number should have a ¡°variety¡± of digits, so not 0.1212...)
A full analysis needs number theory
Hash functions: multiplication
The book suggests:
Let¡¯s just go with that... ¡°m¡±=4, inputs = {0,2,4,6,8,10} h(k) = 0 for inputs {0, 2, 10} h(k) = 1 for inputs {4}
h(k) = 2 for inputs {6}
h(k) = 3 for inputs {8}
Hash functions: multiplication
More robust hashing:
m = desired hash length (any num) p = prime (where p > m)
Any A and B:
0 <= B < p 1 <= A < p
Hash functions: multiplication