CS计算机代考程序代写 python data structure chain c++ algorithm COMP20007 Design of Algorithms

COMP20007 Design of Algorithms
Hashing
Daniel Beck
Lecture 16
Semester 1, 2021
1

Dictionaries – Recap
• Abstract Data Structure: collection of (key, value) pairs.
2

Dictionaries – Recap
• Abstract Data Structure: collection of (key, value) pairs. • Required operations: Search, Insert, Delete
2

Dictionaries – Recap
• Abstract Data Structure: collection of (key, value) pairs. • Required operations: Search, Insert, Delete
• Last lecture: Binary Search Trees (and extensions)
2

Dictionaries – Recap
• Abstract Data Structure: collection of (key, value) pairs. • Required operations: Search, Insert, Delete
• Last lecture: Binary Search Trees (and extensions)
• This lecture: Hash Tables.
2

Hash Tables
• A hash table is a continuous data structure with m preallocated entries.
3

Hash Tables
• A hash table is a continuous data structure with m preallocated entries.
• Average case performance for Search, Insert and Delete: Θ(1)
3

Hash Tables
• A hash table is a continuous data structure with m preallocated entries.
• Average case performance for Search, Insert and Delete: Θ(1)
• Requires a hash function: h(K ) → i ∈ [0, m − 1].
3

Hash Tables
• A hash table is a continuous data structure with m preallocated entries.
• Average case performance for Search, Insert and Delete: Θ(1)
• Requires a hash function: h(K ) → i ∈ [0, m − 1].
• A hash function should:
• Be efficient (Θ(1)).
• Distribute keys evenly (uniformly) along the table.
3

Identity Hash Function
Question: if keys are integers, why do I need a hash function? I could just use the key as the index, no?
4

Identity Hash Function
Question: if keys are integers, why do I need a hash function? I could just use the key as the index, no?
• This is the identity hash function: h(K) = K.
4

Identity Hash Function
Question: if keys are integers, why do I need a hash function? I could just use the key as the index, no?
• This is the identity hash function: h(K) = K.
• NotethatK ∈[0,m−1]. Inotherwordsweneedto
know the maximum number of keys in advance.
4

Identity Hash Function
Question: if keys are integers, why do I need a hash function? I could just use the key as the index, no?
• This is the identity hash function: h(K) = K.
• NotethatK ∈[0,m−1]. Inotherwordsweneedto
know the maximum number of keys in advance.
• Sometimes this is possible: postcodes, for example.
4

Identity Hash Function
Question: if keys are integers, why do I need a hash function? I could just use the key as the index, no?
• This is the identity hash function: h(K) = K.
• NotethatK ∈[0,m−1]. Inotherwordsweneedto
know the maximum number of keys in advance.
• Sometimes this is possible: postcodes, for example. • Many times it is not:
• m is too large (need to preallocate) • Unbounded integers (student IDs) • Non-integer keys (games)
4

Hashing Integers
• For large/unbounded integers, an alternative function is h(K)=K modm
5

Hashing Integers
• For large/unbounded integers, an alternative function is h(K)=K modm
• Allow us to set the size m.
5

Hashing Integers
• For large/unbounded integers, an alternative function is h(K)=K modm
• Allow us to set the size m.
• Small m results in lots of collisions, large m takes excessive memory. Best m will vary.
5

Hashing Strings
• AssumeA 􏰩→0,B 􏰩→1,etc.
• Assume 26 characters and m = 101.
• Each character can be mapped to a binary string of
length 5 (25 = 32).
6

Hashing Strings
• AssumeA 􏰩→0,B 􏰩→1,etc.
• Assume 26 characters and m = 101.
• Each character can be mapped to a binary string of
length 5 (25 = 32).
We can think of a string as a long binary number:
M Y K E Y 􏰩→ 0110011000010100010011000 (= 13379736)
13379736 mod 101 = 64
So 64 is the position of string M Y K E Y in the hash table.
6

Hashing Strings
We deliberately chose m to be prime.
13379736=12×324 +24×323 +10×322 +4×32+24
With m = 32, the hash value of any key is the last character’s value!
7

Hashing Long Strings
Assume chr be the function that gives a character’s number, so for example, chr(c) = 2.
8

Hashing Long Strings
Assume chr be the function that gives a character’s number, so for example, chr(c) = 2.
Then we have
h(s) = (􏰎|s|−1 chr(si) × 32|s|−i−1) mod m,
i=0
where m is a prime number. For example,
h(V E R Y L O N G K E Y) = (21×3210+4×329+· · · ) mod 101
8

Hashing Long Strings
Assume chr be the function that gives a character’s number, so for example, chr(c) = 2.
Then we have
h(s) = (􏰎|s|−1 chr(si) × 32|s|−i−1) mod m,
i=0
where m is a prime number. For example,
h(V E R Y L O N G K E Y) = (21×3210+4×329+· · · ) mod 101
The term between parenthesis can become quite large and result in overflow.
8

Horner’s Rule
Instead of
21×3210 +4×329 +17×328 +24×327 ···
factor out repeatedly:
( ··· ((21×32+4)×32+17)×32+···)+24
9

Horner’s Rule
Instead of
21×3210 +4×329 +17×328 +24×327 ···
factor out repeatedly:
( ··· ((21×32+4)×32+17)×32+···)+24
Now utilize these properties of modular arithmetic: (x+y)modm = ((x modm)+(y modm))modm
(x×y)modm = ((x modm)×(y modm))modm So for each sub-expression it suffices to take values modulo m.
9

Collisions
Happens when the hash function give identical results to two different keys.
10

Collisions
Happens when the hash function give identical results to two different keys.
We saw two solutions:
• Separate Chaining • Linear Probing
10

Collisions
Happens when the hash function give identical results to two different keys.
We saw two solutions: • Separate Chaining
• Linear Probing
Practical efficiency will depend on the table load factor: α = n/m
10

Separate Chaining
Assign multiple records per cell (usually through a linked list)
11

Separate Chaining
Assign multiple records per cell (usually through a linked list) • Assuming even distribution of the n keys.
11

Separate Chaining
Assign multiple records per cell (usually through a linked list) • Assuming even distribution of the n keys.
• A sucessful search requires 1 + α/2 operations on average.
11

Separate Chaining
Assign multiple records per cell (usually through a linked list)
• Assuming even distribution of the n keys.
• A sucessful search requires 1 + α/2 operations on average. • An unsucessful search requires α operations on average.
11

Separate Chaining
Assign multiple records per cell (usually through a linked list)
• Assuming even distribution of the n keys.
• A sucessful search requires 1 + α/2 operations on average. • An unsucessful search requires α operations on average.
• Almost same numbers for Insert and Delete.
11

Separate Chaining
Assign multiple records per cell (usually through a linked list)
• Assuming even distribution of the n keys.
• A sucessful search requires 1 + α/2 operations on average.
• An unsucessful search requires α operations on average.
• Almost same numbers for Insert and Delete.
• Worst case Θ(n) only with a bad hash function (load factor is more of an issue).
11

Separate Chaining
Assign multiple records per cell (usually through a linked list)
• Assuming even distribution of the n keys.
• A sucessful search requires 1 + α/2 operations on average.
• An unsucessful search requires α operations on average.
• Almost same numbers for Insert and Delete.
• Worst case Θ(n) only with a bad hash function (load factor is more of an issue).
• Requires extra memory.
11

Linear Probing
Populate successive empty cells.
12

Linear Probing
Populate successive empty cells.
• Much harder analysis, simplified results show:
• A sucessful search requires (1/2) × (1 + 1/(1 − α)) operations on average.
• An unsucessful search requires (1/2) × (1 + 1/(1 − α)2) operations on average.
12

Linear Probing
Populate successive empty cells.
• Much harder analysis, simplified results show:
• A sucessful search requires (1/2) × (1 + 1/(1 − α)) operations on average.
• An unsucessful search requires (1/2) × (1 + 1/(1 − α)2) operations on average.
• Similar numbers for Insert. Delete virtually impossible.
12

Linear Probing
Populate successive empty cells.
• Much harder analysis, simplified results show:
• A sucessful search requires (1/2) × (1 + 1/(1 − α)) operations on average.
• An unsucessful search requires (1/2) × (1 + 1/(1 − α)2) operations on average.
• Similar numbers for Insert. Delete virtually impossible.
• Does not require extra memory.
12

Linear Probing
Populate successive empty cells.
• Much harder analysis, simplified results show:
• A sucessful search requires (1/2) × (1 + 1/(1 − α)) operations on average.
• An unsucessful search requires (1/2) × (1 + 1/(1 − α)2) operations on average.
• Similar numbers for Insert. Delete virtually impossible.
• Does not require extra memory.
• Worst case Θ(n) with a bad hash function and/or clusters.
12

Double Hashing
A generalisation of Linear Probing.
Apply a second hash function in case of collision.
13

Double Hashing
A generalisation of Linear Probing.
Apply a second hash function in case of collision.
• First try: h(K)
• Second try: (h(K) + s(K)) mod m • Third try: (h(K) + 2s(K)) mod m • …
13

Double Hashing
A generalisation of Linear Probing.
Apply a second hash function in case of collision.
• First try: h(K)
• Second try: (h(K) + s(K)) mod m • Third try: (h(K) + 2s(K)) mod m • …
Another reason to use prime m in h(K): will guarantee to find a free cell if there is one.
13

Double Hashing
A generalisation of Linear Probing.
Apply a second hash function in case of collision.
• First try: h(K)
• Second try: (h(K) + s(K)) mod m • Third try: (h(K) + 2s(K)) mod m • …
Another reason to use prime m in h(K): will guarantee to find a free cell if there is one.
Both Linear Probing and Double Hashing are sometimes referred as Open Addressing methods.
13

Rehashing
• High load factors deteriorate the performance of a hash table (for linear probing, ideally we should have α < 0.9). 14 Rehashing • High load factors deteriorate the performance of a hash table (for linear probing, ideally we should have α < 0.9). • Rehashing allocates a new table (usually around double the size) and move every item from the previous table to the new one. 14 Rehashing • High load factors deteriorate the performance of a hash table (for linear probing, ideally we should have α < 0.9). • Rehashing allocates a new table (usually around double the size) and move every item from the previous table to the new one. • Very expensive operation, but happens infrequently. 14 Summary Hash Tables: 15 Summary Hash Tables: • Implement dictionaries. 15 Summary Hash Tables: • Implement dictionaries. • Allow Θ(1) Search, Insert and Delete in the average case. 15 Summary Hash Tables: • Implement dictionaries. • Allow Θ(1) Search, Insert and Delete in the average case. • Preallocates memory (size m). 15 Summary Hash Tables: • Implement dictionaries. • Allow Θ(1) Search, Insert and Delete in the average case. • Preallocates memory (size m). • Requires good hash functions. 15 Summary Hash Tables: • Implement dictionaries. • Allow Θ(1) Search, Insert and Delete in the average case. • Preallocates memory (size m). • Requires good hash functions. • Requires good collision handling. 15 Pros and Cons If Hash Tables are so good, why bother with BSTs? 16 Pros and Cons If Hash Tables are so good, why bother with BSTs? • Hash Tables ignore key ordering, unlike BSTs. • Queries like “give me all records with keys between 100 and 200” are easy within a BST but much less efficient in a hash table. 16 Pros and Cons If Hash Tables are so good, why bother with BSTs? • Hash Tables ignore key ordering, unlike BSTs. • Queries like “give me all records with keys between 100 and 200” are easy within a BST but much less efficient in a hash table. • Also: memory requirements of a hash table are much higher. 16 Pros and Cons If Hash Tables are so good, why bother with BSTs? • Hash Tables ignore key ordering, unlike BSTs. • Queries like “give me all records with keys between 100 and 200” are easy within a BST but much less efficient in a hash table. • Also: memory requirements of a hash table are much higher. That being said, if hashing is applicable, a well-tuned hash table will typically outperform BSTs. 16 In Practice Python dictionaries (dict type) 17 In Practice Python dictionaries (dict type) • Open addressing using pseudo-random probing • Rehashing happens when α = 2/3 17 In Practice Python dictionaries (dict type) • Open addressing using pseudo-random probing • Rehashing happens when α = 2/3 C++ unordered maps 17 In Practice Python dictionaries (dict type) • Open addressing using pseudo-random probing • Rehashing happens when α = 2/3 C++ unordered maps • Uses chaining. • Rehashing happens when α = 1 17 In Practice Python dictionaries (dict type) • Open addressing using pseudo-random probing • Rehashing happens when α = 2/3 C++ unordered maps • Uses chaining. • Rehashing happens when α = 1 Next lecture: what happens if records/data is too large? 17