The University of 1
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further copying or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.
Copyright By PowCoder代写 加微信 powcoder
Data structures and Algorithms
Lecture 6: Hash tables [GT 6.1-4]
Dr. André van Renssen School of Computer Science
Some content is taken from material provided by the textbook publisher Wiley.
The University of 2
Application: Network Routers
Network routers process multiple streams of packets at high speed. To process a packet with destination k and data payload x, a router must determine which outgoing link to send the packet along
Such a system needs to support:
– destination-based lookups, i.e., get(k) operations that return the outgoing link for destination k
– updates to the routing table, i.e., put(k, c) operations, where c is the new outgoing link for destination k.
Ideally, we would like to achieve O(1) time for both operations.
The University of 3
Recall: Maps
A map models a searchable collection of key-value pairs (a.k.a., items or entries)
The main operations of a map are for searching, inserting, and deleting items
At most one item per key is allowed
The University of 4
Recall: Map Operations
– get(k): if the map M has an entry with key k, return its associated value; else, return null
– put(k, v): insert entry (k, v) into the map M; if key k is not already in M, then return null; else, return old value associated with k
– remove(k): if the map M has an entry with key k, remove it from M and return its associated value; else, return null
– is_empty()
The University of 5
Map Operations (extended)
– entries(): return an iterable collection of the entries in M – keys(): return an iterable collection of the keys in M
– values(): return an iterable collection of the values in M
The University of 6
Operation Output
is_empty() true
put(5,A) null
put(7,B) null
put(2,C) null
put(8,D) null
put(2,E) C
get(4) null
size() 4 remove(5) A
remove(2) E
get(2) null
is_empty() false The University of
(5,A), (7,B)
(5,A), (7,B), (2,C)
(5,A), (7,B), (2,C), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D) (7,B), (2,E), (8,D)
(7,B), (8,D) (7,B), (8,D) (7,B), (8,D)
A Simple List-Based Map
We can implement a map using an unsorted list
– Store the items of the map in a list S (based on a doubly- linked list), in arbitrary order
header nodes/positions
9c6c5c8c entries
The University of 8
Performance of a List-Based Map
Performance:
– put takes O(1) time if the key doesn’t exist yet since we can insert the new item at the beginning or at the end of the sequence
– put, get and remove take O(n) time since in the worst case we must traverse the entire sequence to look for an item with the given key
The unsorted list implementation is effective only for maps of small size or for maps in which puts are the most common operations, while searches and removals are rare (e.g., historical record of logins to a workstation)
The University of 9
Simple Implementation with restricted keys
Maps support the abstraction of using keys as addresses to get items Consider a restricted setting in which a map with n items with keys in
a range from 0 to N − 1, for some N ≥ n.
– ImplementwithanarrayofsizeN
– Keycanbeindexsoentriescanbelocateddirectly – O(1)operations(get,put,remove)
Drawback is that usually N ≫ n, e.g. StudentID is 9 digits, so a Map with StudentID key can be stored in array of 1,000,000,000 entries (way more than the number of students).
The University of 10
Evaluation of this structure
Really good worst-case runtime
Often, bad space utilization when key set is sparse in the space of possible keys, as in StudentID example
Unable to handle more general keys like strings
The University of 11
Hash Functions and Hash Tables
To get around these issues, we use a hash function h to map keys to corresponding indices in an array A.
– h is a mathematical function (always gives same answer for any particular x)
– h is fairly efficient to compute
We call this data structure a hash table
h(x) index in table
The University of 12
Hash function
Hash Functions and Hash Tables
A hash function h maps keys of a given type to integers in a fixed interval [0, N – 1].
– Example: h(x) = x mod N is a hash function for integer keys – Theintegerh(x)iscalledthehashvalueofkeyx
A hash table for a given key type K consists of – Hashfunctionh:K→[0,N-1]
– Array(calledtable)ofsizeN
– Ideally,item(x,o)isstoredatA[h(x)]
The University of 13
We design a hash table for a map storing entries as SIDs (student ids, a nine-digit positive integer).
Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x
The University of 14
Choice of Hash functions
Choosing a good hash function is not straightforward (to be discussed)
For our examples, we usually use very simple (and not good) choices that can be calculated by hand
– e.g.forunboundedintegerkeyinarrayofsize11,wemightuse remainder mod 11 as hash function
– e.g.forStringkey,inarrayofsize10wemightdoanexample where h(S) = (position in alphabet of first character of S) mod 10, so h(“Mary”) = 3 since M is 13-th character in alphabet
The University of 15
Arithmetic modulo N
x mod N is mathematical notation for remainder
– Ifx=c×N+rwith0<=r
[See Theorem 6.3 in GT for a proof] The University of 23
Collision Handling
Collisions occur when two or more elements are hashed to the same location in our array
A good hash function makes collisions rare
However, when collisions do happen we need to have a method for dealing with them:
– Separatechaining – Linearprobing
– Cuckoohashing
The University of 24
Separate Chaining
Let each cell in the table point to a linked list holding the entries that map there
Get, put, and remove operations are delegated to the appropriate list
def get(k):
return A[h(k)].get(k)
def put(k,v):
A[h(k)].put(k,v)
def remove(k):
return A[h(k)].remove(k)
Separate chaining is simple, but requires additional memory outside the table
The University of 25
Performance of Separate Chaining
Assume that our hash function, maps n keys to independent uniform random values in the range [0, N−1].
Let X be a random variable representing the number of items that map to a bucket in the array A, then E(X) = n/N
The parameter n/N, is called the load factor of the hash table, usually written as a.
The expected time for hash table operations is O(1+a) when collisions are handled with separate chaining.
But the worst case time is O(n), which happens when all the items collide into a single chain
The University of 26
Open addressing using Linear Probing
Open addressing: the colliding item is placed in a different cell of the table
Linear probing: handles collisions by placing the colliding item in the next (circularly) available cell
– Eachtablecellinspectedis referred to as a probe
– Collidingitemslumptogether, causing future collisions to cause a longer sequence of probes
Example with h(x) = x mod 13. Suppose we sequentially insert:
18 [5], 41 [2], 22 [9], 44 [5], 59 [7], 32 [6], 31 [5], 63 [11]
The University of 27
0 1 2 3 4 5 6 7 8 9 10 11 12
41 18 44 59 32 22 31 63
0 1 2 3 4 5 6 7 8 9 10 11 12
Chaining versus probing
The University of 28
Search with Linear Probing
How to implement get(k) in a hash table with linear probing:
– Startatcellh(k)
– Probeconsecutivelocations until one of the following occurs
– Anitemwithkeykisfound,or – Anemptycellisfound,or
– N cells have been probed
def get(k): i ¬ h(k) p¬0 repeat
c ¬ A[i] ifc= Æ then
return null
else if c.get_key()=k then
returnc.get_value() else
i ¬ (i+1)modN
p¬p+1 until p=N
return null
The University of 29
Updates with Linear Probing
To handle insertions and deletions, we introduce a special object, called DEFUNCT, which replaces deleted elements
– get(k):mustpassovercells with DEFUNCT and keep probing until the element is found, or until reaching an empty cell
– remove(k):searchforthe entry as in get(k). If found, replace it with the special item DEFUNCT and return element o
The University of Sydney
– put(k,o):searchfortheentryasin get(k), but we also remember the index j of the first cell we find that has DEFUNCT or empty.
If we find key k, we replace the value there with o and return the previous value. If we don’t find k, we store (k, o) in cell with index j
Throw exception if table is full Page 30
Performance of Linear Probing
In the worst case, get, put, and remove take O(n) time.
Fact: Assuming hash values are uniformly randomly distributed, expected number of probes for each get and put is 1/(1-a) where a = n/N is the load factor of the hash table.
Thus, if the load factor is a constant < 1 then the expected running time for the get and put operations is O(1)
In practice, hashing is very fast provided the load factor is not close to 100%, but removals complicate the implementation and degrade the performance.
The University of 31
Hash tables implementations
Recall that load factor of a hash table is defined as a = n/N Experiments and theory suggest that a should be kept not too high:
– Java’s HashSet uses chaining with a < 0.75 and switches from a linked list to a binary search tree if bucket gets too large
– Python’s dict uses open addressing with a < 0.66
When a hash table reaches its load factor, the table is replaced with a larger table (e.g., twice the size) and the elements are hashed over to the new table
The University of 32
Cuckoo hashing
Main problem with the methods we’ve seen so far is that operations take O(n) time in the worst-case.
Cuckoo hashing achieves worst-case O(1) time for lookups and removals, and expected O(1) time for insertions.
In practice Cuckoo hashing is 20-30% slower than linear probing but is still often used due to its worst case guarantees on lookups and its ability to handle removal more gracefully.
The University of 33
The Power of Two Choices
Use two lookup tables, T1 and T2, each of size N
Use two hash functions, h1 and h2, for T1 and T2 respectively
For an item with key k, there are only two possible places where we are allowed to store the item: T1[h1(k)] or T2[h2(k)]
This restriction, simplifies lookup dramatically, while still allowing worst-case expected O(1) running time for get and remove.
The University of 34
An Example of Cuckoo Hashing
Each key in the set S = {2, 3, 5, 8, 9} has two possible locations it can go, one in the table T1 and one in the table T2.
Note that 2 and 8 collide in T2, but that is okay, since there is no collision for 2 in its alternative location in T1.
The University of 35
Pseudo-code for get and remove
def get(k):
if T1[h1(k)] ≠ null and T1[h1(k)].key = k then
return T1[h1(k)].value
if T2[h2(k)] ≠ null and T2[h2(k)].key = k then
return T2[h2(k)].value
return null
def remove(k): temp ← null
if T1[h1(k)] ≠ null and T1[h1(k)].key = k then temp ← T1[h1(k)].value
T1[h1(k)] = null
if T2[h2(k)] ≠ null and T2[h2(k)].key = k then temp ← T2[h2(k)].value
T2[h2(k)] = null
The University of 36
return temp Both are simple and run in O(1) time
High level idea behind put
If a collision occurs in the insertion operation in the cuckoo hashing scheme, then we evict the previous item in that cell and insert the new one in its place.
This forces the evicted item to go to its alternate location in the other table and be inserted there, which may repeat the eviction process with another item, and so on.
Eventually, we either find an empty cell and stop or we repeat a previous eviction, which indicates an eviction cycle.
If we discover an eviction cycle, then we bail out or rehash all the items into larger tables
The University of 37
Intuition for the Name
The name “cuckoo hashing” comes from the way the put(k, v) operation is performed in this scheme, because it mimics the breeding habits of the Common Cuckoo bird.
The Common Cuckoo is a brood parasite—it lays its egg in the nest of another bird after first evicting an egg out of that nest.
The University of 38
Example Eviction Sequence
put(7) generated an eviction sequence of length 3:
The University of 39
Example Eviction Sequence
put(7) generated an eviction sequence of length 3:
The University of 40
Example Eviction Sequence
put(7) generated an eviction sequence of length 3:
The University of 41
Pseudo-code for put
def put(k, v):
# try to fit item into T1
if T1[h1(k)] ≠ null and T1[h1(k)].key = k then
T1[h1(k)] ← (k, v)
# try to fit item into T2
if T2[h2(k)] ≠ null and T2[h2(k)].key = k then T2[h2(k)] ← (k, v)
# start eviction sequence
i←1 repeat
if Ti[hi(k)] = null then Ti[hi(k)] ← (k, v) return
temp ← Ti[hi(k)] Ti[hi(k)] ← (k, v)
(k, v) ← temp
i ← 1ifi=2else2 until a cycle occurs rehash elements
The University of 42
How to detect eviction cycles
Use a counter to keep track of the number of evictions. If we iterate enough times we are guaranteed to have a cycle.
Keep an additional flag for each entry. Every time we evict an entry, we flag it. After a successful put, we need to unflag the entries flagged.
The details of these strategies are not complicated and are left as an exercise for the tutorials.
The University of 43
Performance of Cuckoo Hashing
One can show that “long eviction sequences” happen with very low probability.
Fact: Assuming hash values are uniformly randomly distributed, expected work of n put operations is O(n) provided N > 2n
Fact: Cuckoo hashing achieves worst-case O(1) time for lookups and removals
The University of 44
Another ADT: Set
A set is an unordered collection of elements, without duplicates that typically supports efficient membership tests.
Elements of a set are like keys of a map, but without any auxiliary values.
The University of 45
add(e): add the element e to S (if not already present) remove(e): remove the element e from S (if present) contains(e): returns whether e is present in S iterator(): returns an iterator of the elements of S
There is also support for the traditional mathematical set operations union, intersection, and subtraction of two sets S and T:
S ∪ T = {e ∶ e in S or e in T} S ∩ T = {e ∶ e in S and e in T}
S − T = {e ∶ e in S and e not in T}
addAll(T): Updates S to include all elements of T, effectively replacing S by S ∪ T retainAll(T): Updates S to keep only elements that are also elements in T, effectively replacing S by S ∩ T
removeAll(T): Updates S to remove any elements that are also elements in T,
effectively replacing S by S − T
The University of 46
Set implemented via Map
– Use a Map to store the keys, and ignore the value.
– Allows contains(k) to be answered by get(k)
– Similarly for add and remove
– Using HashMap for Map, gives main Set operations that usually can be performed in O(1) time.
The University of 47
– Like a Set, but allows duplicates
– also called a Bag
– operation count(e) says how many occurrences of e in collection
– remove(e) removes ONE occurrence (provided e is in the collection already)
– Implement by Map where the element is the key, and the associated value is the number of occurrences
The University of 48
Practice vs Theory
In practice hash tables implementation are usually fast and people use them as though put, get, and remove take O(1) time.
The analyses we covered in lecture assume uniformly random hash values, which are not possible to implement in practice. Removing these assumptions is an active area of research well beyond the scope of this class.
In theory we do not know of an implementation of hash tables that can perform put, get, and remove O(1) time in the worst case.
So you cannot use such a data structure in your assignments.
The University of 49
Theory of Hashing
There is rich theory of hashing beyond the basic hashing schemes we covered in class:
– Quadratic probing
– Double hashing
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com