代写代考 The University of 1

The University of 1
COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further copying or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.

Copyright By PowCoder代写 加微信 powcoder

Data structures and Algorithms
Lecture 6: Hash tables [GT 6.1-4]
Dr. André van Renssen School of Computer Science
Some content is taken from material provided by the textbook publisher Wiley.
The University of 2

Application: Network Routers
Network routers process multiple streams of packets at high speed. To process a packet with destination k and data payload x, a router must determine which outgoing link to send the packet along
Such a system needs to support:
– destination-based lookups, i.e., get(k) operations that return the outgoing link for destination k
– updates to the routing table, i.e., put(k, c) operations, where c is the new outgoing link for destination k.
Ideally, we would like to achieve O(1) time for both operations.
The University of 3

Recall: Maps
A map models a searchable collection of key-value pairs (a.k.a., items or entries)
The main operations of a map are for searching, inserting, and deleting items
At most one item per key is allowed
The University of 4

Recall: Map Operations
– get(k): if the map M has an entry with key k, return its associated value; else, return null
– put(k, v): insert entry (k, v) into the map M; if key k is not already in M, then return null; else, return old value associated with k
– remove(k): if the map M has an entry with key k, remove it from M and return its associated value; else, return null
– is_empty()
The University of 5

Map Operations (extended)
– entries(): return an iterable collection of the entries in M – keys(): return an iterable collection of the keys in M
– values(): return an iterable collection of the values in M
The University of 6

Operation Output
is_empty() true
put(5,A) null
put(7,B) null
put(2,C) null
put(8,D) null
put(2,E) C
get(4) null
size() 4 remove(5) A
remove(2) E
get(2) null
is_empty() false The University of
(5,A), (7,B)
(5,A), (7,B), (2,C)
(5,A), (7,B), (2,C), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D)
(5,A), (7,B), (2,E), (8,D) (7,B), (2,E), (8,D)
(7,B), (8,D) (7,B), (8,D) (7,B), (8,D)

A Simple List-Based Map
We can implement a map using an unsorted list
– Store the items of the map in a list S (based on a doubly- linked list), in arbitrary order
header nodes/positions
9c6c5c8c entries
The University of 8

Performance of a List-Based Map
Performance:
– put takes O(1) time if the key doesn’t exist yet since we can insert the new item at the beginning or at the end of the sequence
– put, get and remove take O(n) time since in the worst case we must traverse the entire sequence to look for an item with the given key
The unsorted list implementation is effective only for maps of small size or for maps in which puts are the most common operations, while searches and removals are rare (e.g., historical record of logins to a workstation)
The University of 9

Simple Implementation with restricted keys
Maps support the abstraction of using keys as addresses to get items Consider a restricted setting in which a map with n items with keys in
a range from 0 to N − 1, for some N ≥ n.
– ImplementwithanarrayofsizeN
– Keycanbeindexsoentriescanbelocateddirectly – O(1)operations(get,put,remove)
Drawback is that usually N ≫ n, e.g. StudentID is 9 digits, so a Map with StudentID key can be stored in array of 1,000,000,000 entries (way more than the number of students).
The University of 10

Evaluation of this structure
Really good worst-case runtime
Often, bad space utilization when key set is sparse in the space of possible keys, as in StudentID example
Unable to handle more general keys like strings
The University of 11

Hash Functions and Hash Tables
To get around these issues, we use a hash function h to map keys to corresponding indices in an array A.
– h is a mathematical function (always gives same answer for any particular x)
– h is fairly efficient to compute
We call this data structure a hash table
h(x) index in table
The University of 12
Hash function

Hash Functions and Hash Tables
A hash function h maps keys of a given type to integers in a fixed interval [0, N – 1].
– Example: h(x) = x mod N is a hash function for integer keys – Theintegerh(x)iscalledthehashvalueofkeyx
A hash table for a given key type K consists of – Hashfunctionh:K→[0,N-1]
– Array(calledtable)ofsizeN
– Ideally,item(x,o)isstoredatA[h(x)]
The University of 13

We design a hash table for a map storing entries as SIDs (student ids, a nine-digit positive integer).
Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x
The University of 14

Choice of Hash functions
Choosing a good hash function is not straightforward (to be discussed)
For our examples, we usually use very simple (and not good) choices that can be calculated by hand
– e.g.forunboundedintegerkeyinarrayofsize11,wemightuse remainder mod 11 as hash function
– e.g.forStringkey,inarrayofsize10wemightdoanexample where h(S) = (position in alphabet of first character of S) mod 10, so h(“Mary”) = 3 since M is 13-th character in alphabet
The University of 15

Arithmetic modulo N
x mod N is mathematical notation for remainder
– Ifx=c×N+rwith0<=r M then the probability that two keys collide is 1/N
[See Theorem 6.3 in GT for a proof] The University of 23

Collision Handling
Collisions occur when two or more elements are hashed to the same location in our array
A good hash function makes collisions rare
However, when collisions do happen we need to have a method for dealing with them:
– Separatechaining – Linearprobing
– Cuckoohashing
The University of 24

Separate Chaining
Let each cell in the table point to a linked list holding the entries that map there
Get, put, and remove operations are delegated to the appropriate list
def get(k):
return A[h(k)].get(k)
def put(k,v):
A[h(k)].put(k,v)
def remove(k):
return A[h(k)].remove(k)
Separate chaining is simple, but requires additional memory outside the table
The University of 25

Performance of Separate Chaining
Assume that our hash function, maps n keys to independent uniform random values in the range [0, N−1].
Let X be a random variable representing the number of items that map to a bucket in the array A, then E(X) = n/N
The parameter n/N, is called the load factor of the hash table, usually written as a.
The expected time for hash table operations is O(1+a) when collisions are handled with separate chaining.
But the worst case time is O(n), which happens when all the items collide into a single chain
The University of 26

Open addressing using Linear Probing
Open addressing: the colliding item is placed in a different cell of the table
Linear probing: handles collisions by placing the colliding item in the next (circularly) available cell
– Eachtablecellinspectedis referred to as a probe
– Collidingitemslumptogether, causing future collisions to cause a longer sequence of probes
Example with h(x) = x mod 13. Suppose we sequentially insert:
18 [5], 41 [2], 22 [9], 44 [5], 59 [7], 32 [6], 31 [5], 63 [11]
The University of 27
0 1 2 3 4 5 6 7 8 9 10 11 12
41 18 44 59 32 22 31 63
0 1 2 3 4 5 6 7 8 9 10 11 12

Chaining versus probing
The University of 28

Search with Linear Probing
How to implement get(k) in a hash table with linear probing:
– Startatcellh(k)
– Probeconsecutivelocations until one of the following occurs
– Anitemwithkeykisfound,or – Anemptycellisfound,or
– N cells have been probed
def get(k): i ¬ h(k) p¬0 repeat
c ¬ A[i] ifc= Æ then
return null
else if c.get_key()=k then
returnc.get_value() else
i ¬ (i+1)modN
p¬p+1 until p=N
return null
The University of 29

Updates with Linear Probing
To handle insertions and deletions, we introduce a special object, called DEFUNCT, which replaces deleted elements
– get(k):mustpassovercells with DEFUNCT and keep probing until the element is found, or until reaching an empty cell
– remove(k):searchforthe entry as in get(k). If found, replace it with the special item DEFUNCT and return element o
The University of Sydney
– put(k,o):searchfortheentryasin get(k), but we also remember the index j of the first cell we find that has DEFUNCT or empty.
If we find key k, we replace the value there with o and return the previous value. If we don’t find k, we store (k, o) in cell with index j
Throw exception if table is full Page 30

Performance of Linear Probing
In the worst case, get, put, and remove take O(n) time.
Fact: Assuming hash values are uniformly randomly distributed, expected number of probes for each get and put is 1/(1-a) where a = n/N is the load factor of the hash table.
Thus, if the load factor is a constant < 1 then the expected running time for the get and put operations is O(1) In practice, hashing is very fast provided the load factor is not close to 100%, but removals complicate the implementation and degrade the performance. The University of 31 Hash tables implementations Recall that load factor of a hash table is defined as a = n/N Experiments and theory suggest that a should be kept not too high: – Java’s HashSet uses chaining with a < 0.75 and switches from a linked list to a binary search tree if bucket gets too large – Python’s dict uses open addressing with a < 0.66 When a hash table reaches its load factor, the table is replaced with a larger table (e.g., twice the size) and the elements are hashed over to the new table The University of 32 Cuckoo hashing Main problem with the methods we’ve seen so far is that operations take O(n) time in the worst-case. Cuckoo hashing achieves worst-case O(1) time for lookups and removals, and expected O(1) time for insertions. In practice Cuckoo hashing is 20-30% slower than linear probing but is still often used due to its worst case guarantees on lookups and its ability to handle removal more gracefully. The University of 33 The Power of Two Choices Use two lookup tables, T1 and T2, each of size N Use two hash functions, h1 and h2, for T1 and T2 respectively For an item with key k, there are only two possible places where we are allowed to store the item: T1[h1(k)] or T2[h2(k)] This restriction, simplifies lookup dramatically, while still allowing worst-case expected O(1) running time for get and remove. The University of 34 An Example of Cuckoo Hashing Each key in the set S = {2, 3, 5, 8, 9} has two possible locations it can go, one in the table T1 and one in the table T2. Note that 2 and 8 collide in T2, but that is okay, since there is no collision for 2 in its alternative location in T1. The University of 35 Pseudo-code for get and remove def get(k): if T1[h1(k)] ≠ null and T1[h1(k)].key = k then return T1[h1(k)].value if T2[h2(k)] ≠ null and T2[h2(k)].key = k then return T2[h2(k)].value return null def remove(k): temp ← null if T1[h1(k)] ≠ null and T1[h1(k)].key = k then temp ← T1[h1(k)].value T1[h1(k)] = null if T2[h2(k)] ≠ null and T2[h2(k)].key = k then temp ← T2[h2(k)].value T2[h2(k)] = null The University of 36 return temp Both are simple and run in O(1) time High level idea behind put If a collision occurs in the insertion operation in the cuckoo hashing scheme, then we evict the previous item in that cell and insert the new one in its place. This forces the evicted item to go to its alternate location in the other table and be inserted there, which may repeat the eviction process with another item, and so on. Eventually, we either find an empty cell and stop or we repeat a previous eviction, which indicates an eviction cycle. If we discover an eviction cycle, then we bail out or rehash all the items into larger tables The University of 37 Intuition for the Name The name “cuckoo hashing” comes from the way the put(k, v) operation is performed in this scheme, because it mimics the breeding habits of the Common Cuckoo bird. The Common Cuckoo is a brood parasite—it lays its egg in the nest of another bird after first evicting an egg out of that nest. The University of 38 Example Eviction Sequence put(7) generated an eviction sequence of length 3: The University of 39 Example Eviction Sequence put(7) generated an eviction sequence of length 3: The University of 40 Example Eviction Sequence put(7) generated an eviction sequence of length 3: The University of 41 Pseudo-code for put def put(k, v): # try to fit item into T1 if T1[h1(k)] ≠ null and T1[h1(k)].key = k then T1[h1(k)] ← (k, v) # try to fit item into T2 if T2[h2(k)] ≠ null and T2[h2(k)].key = k then T2[h2(k)] ← (k, v) # start eviction sequence i←1 repeat if Ti[hi(k)] = null then Ti[hi(k)] ← (k, v) return temp ← Ti[hi(k)] Ti[hi(k)] ← (k, v) (k, v) ← temp i ← 1ifi=2else2 until a cycle occurs rehash elements The University of 42 How to detect eviction cycles Use a counter to keep track of the number of evictions. If we iterate enough times we are guaranteed to have a cycle. Keep an additional flag for each entry. Every time we evict an entry, we flag it. After a successful put, we need to unflag the entries flagged. The details of these strategies are not complicated and are left as an exercise for the tutorials. The University of 43 Performance of Cuckoo Hashing One can show that “long eviction sequences” happen with very low probability. Fact: Assuming hash values are uniformly randomly distributed, expected work of n put operations is O(n) provided N > 2n
Fact: Cuckoo hashing achieves worst-case O(1) time for lookups and removals
The University of 44

Another ADT: Set
A set is an unordered collection of elements, without duplicates that typically supports efficient membership tests.
Elements of a set are like keys of a map, but without any auxiliary values.
The University of 45

add(e): add the element e to S (if not already present) remove(e): remove the element e from S (if present) contains(e): returns whether e is present in S iterator(): returns an iterator of the elements of S
There is also support for the traditional mathematical set operations union, intersection, and subtraction of two sets S and T:
S ∪ T = {e ∶ e in S or e in T} S ∩ T = {e ∶ e in S and e in T}
S − T = {e ∶ e in S and e not in T}
addAll(T): Updates S to include all elements of T, effectively replacing S by S ∪ T retainAll(T): Updates S to keep only elements that are also elements in T, effectively replacing S by S ∩ T
removeAll(T): Updates S to remove any elements that are also elements in T,
effectively replacing S by S − T
The University of 46

Set implemented via Map
– Use a Map to store the keys, and ignore the value.
– Allows contains(k) to be answered by get(k)
– Similarly for add and remove
– Using HashMap for Map, gives main Set operations that usually can be performed in O(1) time.
The University of 47

– Like a Set, but allows duplicates
– also called a Bag
– operation count(e) says how many occurrences of e in collection
– remove(e) removes ONE occurrence (provided e is in the collection already)
– Implement by Map where the element is the key, and the associated value is the number of occurrences
The University of 48

Practice vs Theory
In practice hash tables implementation are usually fast and people use them as though put, get, and remove take O(1) time.
The analyses we covered in lecture assume uniformly random hash values, which are not possible to implement in practice. Removing these assumptions is an active area of research well beyond the scope of this class.
In theory we do not know of an implementation of hash tables that can perform put, get, and remove O(1) time in the worst case.
So you cannot use such a data structure in your assignments.
The University of 49

Theory of Hashing
There is rich theory of hashing beyond the basic hashing schemes we covered in class:
– Quadratic probing
– Double hashing

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com