CSE 440: Advanced Algorithms
Lecture 7: Hash Tables (CLRS Ch. 11, Sections 11.1-11.4)
Direct Address Tables
• Works well when the universe of keys is small
• Compare with:
  • Arrays?
  • Linked Lists?
Hash Tables
• Observation:
  • If |U| is too large, then using a direct-address table is
    prohibitively expensive (memory)
  • If |K| << |U|, then too much space is wasted
• Dual goal:
  • Space requirement: Θ(|K|) space used
  • Expected time for dictionary operations: O(1)
• Idea
  • h: U → {0, ..., m-1}  // m = Θ(|K|)
Hash Tables
• Differences in design with direct address tables
• Hash functions
• Key k is stored in slot h(k) -- we call h(k) the hash value of k
• Has to be deterministic!!
• Possible collision: an efficient hash function should reduce its probability
• Collision resolution
  • Closed addressing
  • Open addressing
Hash Function
• Goal:
  • Simple uniform hashing: each key is equally likely to hash to any of the
    m slots, independently of where any other key has hashed to.
• Challenges
  • Requires understanding the distribution from which keys are drawn
  • Keys might not be drawn independently
• Division method
  • Interpret keys as natural numbers
  • h(k) = k mod m
  • Simple solution, but most effective if m is a prime not too close to an
    exact power of 2
• More complex solutions (used for security more than balancing)
  • Universal hashing
  • SHA-256
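The division method above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture; `division_hash` and the choice m = 701 (a prime not too close to a power of 2) are assumptions for the example.

```python
# Sketch of the division method: interpret the key as a natural number,
# then take it mod m. Here m = 701, a prime not too close to a power of 2.
def division_hash(key: str, m: int = 701) -> int:
    # Interpret the UTF-8 bytes of the string as a base-256 natural number,
    # reducing mod m as we go so the intermediate value stays small.
    k = 0
    for byte in key.encode("utf-8"):
        k = (k * 256 + byte) % m
    return k

# The function is deterministic: the same key always hashes to the same slot.
print(division_hash("apple"), division_hash("pear"))
```

Note that the reduction is folded into the loop: `(k * 256 + byte) % m` gives the same result as hashing the full integer at the end, by the usual properties of modular arithmetic.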
Closed Addressing (Chaining)
• Insert: O(1) if no check for duplicates
• Delete: O(1) if the list is doubly linked and the input is the element, not the key
• Main issue is in search
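A minimal chained hash table can be sketched as follows. The class and method names are illustrative (not from CLRS or the slides); for simplicity the chains are Python lists keyed by element rather than doubly linked lists, so delete here is O(chain length) rather than the O(1) element-based delete mentioned above.

```python
# Minimal sketch of closed addressing (chaining): one chain per slot.
class ChainedHashTable:
    def __init__(self, m: int = 8):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain (list) per slot

    def _slot(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value) -> None:
        # O(1) when not checking for duplicates: prepend to the chain.
        self.table[self._slot(key)].insert(0, (key, value))

    def search(self, key):
        # Scans the chain at slot h(key); worst case O(n) if all keys collide.
        for k, v in self.table[self._slot(key)]:
            if k == key:
                return v
        return None

    def delete(self, key) -> bool:
        chain = self.table[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

Usage: `t = ChainedHashTable(); t.insert("x", 1); t.search("x")` returns `1`.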
Analysis of Chaining
• Let n = |K| and m = size of the array
• Load factor: α(T) = α = n/m
• Note: because of chaining, α may be greater than 1
• In fact, α is exactly the average number of elements stored in a chain
• Analysis of search
  • Worst case?
  • Best case?
  • Average case?
Analysis of Chaining
• Assume simple uniform hashing: any given element is equally likely to hash into any of the m slots
• n_j is the length of the list T[j]
  • E[n_j] = α
• If the key does not exist
  • Average case time = Θ(1+α)
• What if the key exists?
  • We will show in the next slides that it is also Θ(1+α)
• If n = O(m)
  • Θ(1+α) = O(1)
Analysis of Chaining
• Number of elements examined
  • = number of elements that appear before x in x's list
  • Which is: the number of elements inserted after x (inserts prepend to the chain)
• Definitions
  • x_i = the i-th element added to the table
  • k_i = x_i.key
  • X_ij = I{h(k_i) = h(k_j)}
• Thus
  • E[X_ij] = Pr(h(k_i) = h(k_j)) = 1/m
• Expected number of elements examined:
  E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} X_ij)] = 1 + (n-1)/(2m) = 1 + α/2 - α/(2n)
  which is Θ(1+α)
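The Θ(1+α) bound for successful search can be checked empirically. The simulation below is illustrative (its parameters m = 128, n = 512 are arbitrary choices, not from the slides): it inserts n keys into m chains by prepending, then averages the number of elements examined to find each key, which should come out near 1 + α/2.

```python
import random

# Illustrative simulation: average cost of a successful search under chaining.
# Each element is preceded, on average, by about half a chain's worth of
# later insertions, so the expected cost is roughly 1 + alpha/2.
random.seed(1)
m, n = 128, 512                                  # alpha = n/m = 4
chains = [[] for _ in range(m)]
for key in range(n):
    chains[random.randrange(m)].insert(0, key)   # prepend, as in chaining insert

total = 0
for chain in chains:
    for pos, _ in enumerate(chain):
        total += pos + 1                         # elements examined to find this key
avg_cost = total / n

alpha = n / m
print(round(avg_cost, 2), 1 + alpha / 2)         # avg_cost should be near 3.0
```

The exact expectation is 1 + α/2 - α/(2n) ≈ 2.996 for these parameters, so the simulated average lands close to 3.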
Open Addressing
• Probing within the array itself
  • No pointers at all
  • α is never greater than 1
• Hash function includes the probe number
  • h: U × {0, 1, ..., m-1} → {0, 1, ..., m-1}
• Probe sequence:
  • ⟨h(k,0), h(k,1), ..., h(k,m-1)⟩
  • Has to be a permutation of ⟨0, 1, ..., m-1⟩
  • Better if uniform hashing: equally likely to be any of the m! permutations
What About Deletion?
• Cannot simply replace the entry with Nil
  • Probing will not work.
• Delete(T, k)
    i = Hash-Search(T, k)
    T[i] = "Delete"
• We modify Hash-Insert accordingly: a slot is available if
    T[j] == nil or T[j] == "Delete"
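The tombstone scheme above can be sketched as follows. This is an illustrative sketch (class and method names are my own, and it uses linear probing for concreteness): delete writes a `DELETED` marker instead of emptying the slot, insert may reuse such a slot, and search skips over it rather than stopping.

```python
# Sketch of open addressing with linear probing and "Delete" tombstones.
DELETED = object()   # tombstone marker, distinct from every real key

class OpenAddressTable:
    def __init__(self, m: int = 16):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key, i: int) -> int:
        return (hash(key) + i) % self.m          # linear probing: (h'(k) + i) mod m

    def insert(self, key) -> int:
        for i in range(self.m):
            j = self._probe(key, i)
            # A tombstone slot may be reused, just like an empty one.
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise RuntimeError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:            # truly empty: key is absent
                return None
            if self.slots[j] == key:             # tombstones do NOT stop the probe
                return j
        return None

    def delete(self, key) -> None:
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED              # setting None would break probing
```

For example, keys 0 and 16 collide in a table of size 16; after deleting 0, a search for 16 still succeeds because the probe walks over the tombstone instead of stopping.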
Probe Sequencing
• Linear probing
  • h(k,i) = (h'(k) + i) mod m
  • Main drawback: primary clustering
    • Long runs of occupied cells, which increase average search time
• Quadratic probing
  • h(k,i) = (h'(k) + c1*i + c2*i^2) mod m
  • Main drawback: secondary clustering
    • The initial probe determines the entire sequence.
• Double hashing
  • h(k,i) = (h1(k) + i*h2(k)) mod m
  • Better performance: more possible probe sequences, less clustering
  • Still not as ideal as uniform hashing
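Double hashing is easy to visualize by printing a full probe sequence. The sketch below is illustrative (the helper name and the particular h1, h2 are my own choices): with m prime and h2(k) never zero, the step size is coprime to m, so the sequence is a permutation of all m slots.

```python
# Illustrative double-hashing probe sequence: h(k, i) = (h1(k) + i*h2(k)) mod m.
def probe_sequence(k: int, m: int = 13):
    h1 = k % m
    h2 = 1 + (k % (m - 1))       # in 1..m-1, never zero; m prime => coprime to m
    return [(h1 + i * h2) % m for i in range(m)]

# For k = 42, m = 13: h1 = 3, h2 = 7, so the sequence steps by 7 mod 13
# and visits every slot exactly once.
print(probe_sequence(42))
```

Compare with linear probing, where every key with the same initial slot follows the identical step-by-1 sequence; here the step size also depends on the key, which is why double hashing produces many more distinct probe sequences.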
Analysis of Open Addressing
• Assume uniform hashing
  • Not true for our probing techniques, but it simplifies the analysis
• Unsuccessful search: E[number of probes] ≤ 1/(1-α)
  • Informal justification (formal proof in textbook -- skip)
    • 1/(1-α) = 1 + α + α² + α³ + ...
    • The first probe always happens
    • The second probe happens with probability α
    • The third probe happens with probability α²
• Successful search is even better
  • E[number of probes] ≤ (1/α) ln(1/(1-α)) (proof skipped)
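The 1/(1-α) bound for unsuccessful search can be sanity-checked numerically. The simulation below is illustrative (function name and parameters are my own): it models uniform hashing by probing the slots of a table with n occupied positions in a fresh random order, counting probes until the first empty slot.

```python
import random

# Illustrative check of E[probes] <= 1/(1 - alpha) for an unsuccessful search
# under uniform hashing (each key gets a uniformly random probe permutation).
random.seed(2)

def avg_unsuccessful_probes(m: int, n: int, trials: int = 2000) -> float:
    total = 0
    for _ in range(trials):
        occupied = set(random.sample(range(m), n))   # n occupied slots
        probes = 0
        for slot in random.sample(range(m), m):      # a random probe permutation
            probes += 1
            if slot not in occupied:                 # first empty slot ends the search
                break
        total += probes
    return total / trials

alpha = 0.5
print(avg_unsuccessful_probes(m=100, n=50), 1 / (1 - alpha))   # bound: 2.0
```

For α = 0.5 the exact expectation is (m+1)/(m-n+1) ≈ 1.98, comfortably under the bound of 2; as α approaches 1 the probe count blows up, which is why open-addressing tables resize well before they fill.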
Back to Dynamic Tables
Discussion
• What about hash tables with closed addressing (e.g., chaining)?
  • Same idea, with a slight change in the definition of α
  • Expand/shrink the table based on a threshold on bucket sizes
  • Examples:
    • Maintain per-bucket/per-table metadata with each insert/delete
    • Sample the sizes of randomly selected buckets
  • Possible improvement
    • Resize each bucket independently
    • Example: Kevin Williams, Joe Foster, Athicha Srivirote, Ahmed Hassan, Joseph Tassarotti, Lewis Tseng, Roberto Palmieri, "On Building Modular and Elastic Data Structures with Bulk Operations", ICDCN'21
• What about hash tables with open addressing?
  • No difference for expansion
  • Contraction depends on the way we handle deletes
Discussion
• What about concurrent dynamic tables?
  • Although the expansion cost is amortized, expansion blocks any concurrent operation until it is finished.
  • Solutions?
    • Amortize the actual resizing, not just its cost
    • Allocate the new table eagerly, move data to the new table lazily
    • Key idea: resizing is a structural change, not a semantic change.
  • Depends on implementation details
  • Example: Yujie Liu, Kunlong Zhang, and Michael Spear, "Dynamic-Sized Nonblocking Hash Tables", PODC'14
    • Freeze objects in the old hash table
    • A new operation (insert/delete/lookup) on a frozen object x first copies x to the new hash table
    • Another resize first completes the previous resize (hopefully by then most elements will already have been moved).
Thanks!