CS计算机代考程序代写 Hash Functions

Hash Functions

The primary desire of a hash function is to ¡°distribute¡± all data equally
Helps to avoid hash collisions:
¡°Michael¡± and ¡°Toby¡± collide as the function (hash) has the same output for both
Hash functions: properties

Our proof for O(1) average runtime depends on the function output being spread out
Specifically, if the ¡°fixed length¡± of the hash has ¡°m¡± possible outputs, … then two random inputs have probability 1/m of same output
Hash functions: properties

Our name initials hash function does not have this property…
¡°James Parker¡± ¡ú JP
This has m=26*26 = 676 possible outputs (with the English alphabet)
Hash functions: properties

However, names are not equally distributed across this space (Do you know anyone with the initials QQ? Or OI?)
The most common initial in USA is: JR, so this will have more collisions than QQ (thus not a good hash func)
Hash functions: properties

We might also want additional properties like: locality
(formally: locality-sensitive hashing)
This means if the function ¡°inputs¡± are close, then so are the ¡°outputs¡± Formally: |a-b|<|c-d| means |h(a)-h(b)| < |h(c)-h(d)| Hash functions: properties h( ) Hash functions: properties h( ) h( ) Hash functions: properties h( ) h( ) h( ) Hash functions: properties As hash functions try to put things in ¡°buckets¡± (like bucket sort), you can actually use the same method (assuming your key ¡°k¡± is 0<=k<1) (¡°m¡± is possible outputs of function) h(k) = floor(k*m) =roundDown(k*m) Hash functions: decimals Two problems with this approach: - Might need to transform numbers to decimals (can be difficult) (for initials BZ = ) - Assumes inputs are randomly distributed between 0 and 1 (not true for initials... and would be largely based off first initial) Hash functions: decimals The division hash function uses the remainder(like example last time) h(k) = k mod 10 (= k % 10) So... h(412) = 412 mod 10 = 2 h(32428374) = 32428374 mod 10 = 4 Hash functions: division While this might work, there is one major issue: - Might only use part of the number for identification Using: m=10 only looks at last digit m=2 only looks at odd/even Hash functions: division Depending on your choice of ¡°m¡±, you can get very bad results You could accidentally have all anagrams (different word order) have the same hash output... h(tag) = 9 h(gat) = 9 h(gta) = 9 h(tga) = 9 Hash functions: division Prime numbers (as by definition they are not divisible by many numbers) provide a good ¡°m¡± Note: we will talk later about how to efficiently find prime numbers, for now just assume they are fairly easy to find Hash functions: division Small example: Bad: m=4 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,4,8} h(k) = 1 for {} (nothing) h(k) = 2 for {2,6,10} h(k) = 3 for {} (nothing) This is collision-tastic! Hash functions: division Small example: Good: m=5 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,10} h(k) = 1 for {6} h(k) = 2 for {2} h(k) = 3 for {8} h(k) = 4 for {4} Much better! Hash functions: division The division function is annoying as your choice for ¡°m¡± is picky The multiplication hash function adds another constant to get rid of this issue: Abuse of notation! ¡°mod 1¡± means take the decimal part... so: 3.1415 mod 1 = 0.1415 24598.62345 mod 1 = 0.62345 Hash functions: multiplication Picking A=0.25 runs into much the same issue as our ¡°mod 4¡± division: Example: A=0.25, m=10 input = {0,2,4,6,8,10} h(k) = 0 for inputs {0, 4, 8} h(k) = 5 for inputs {2, 6, 10} all other outputs are unused Hash functions: multiplication However, you can pick A to be some irrational number (math def) and this should prevent this issue (This irrational number should have a ¡°variety¡± of digits, so not 0.1212...) A full analysis needs number theory Hash functions: multiplication The book suggests: Let¡¯s just go with that... ¡°m¡±=4, inputs = {0,2,4,6,8,10} h(k) = 0 for inputs {0, 2, 10} h(k) = 1 for inputs {4} h(k) = 2 for inputs {6} h(k) = 3 for inputs {8} Hash functions: multiplication More robust hashing: m = desired hash length (any num) p = prime (where p > m)
Any A and B:
0 <= B < p 1 <= A < p Hash functions: multiplication