COMP251: Hashing
Jérôme Waldispühl School of Computer Science
McGill University
Based on (Cormen et al., 2002)
Problem Definition
Table S with n records x. Each record x has a key key[x] and satellite data (the information or data associated with x).
We want a data structure (a dynamic set) to store and retrieve these data. Operations:
• insert(S, x): S ← S ∪ {x}
• delete(S, x): S ← S \ {x}
• search(S, k): return the element of S with key k (NIL if none exists)
Direct Address Table
Illustration (CLR, 2005)
• Each slot, or position, corresponds to a key in U.
• If there is an element x with key k, then T[k] contains a pointer to x.
• If T[k] is empty, this is represented by NIL.
All operations run in O(1), but if n (#keys) is much smaller than m (#slots), a lot of space is wasted.
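For concreteness, here is a minimal direct-address table sketch in Python (the class and method names are illustrative, not from the slides); keys are assumed to be integers in {0, ..., m − 1} and are used directly as slot indices.

class DirectAddressTable:
    """Direct addressing: slot k holds the record with key k, or None (NIL)."""

    def __init__(self, m):
        # One slot for every possible key in U = {0, 1, ..., m-1}.
        self.slots = [None] * m

    def insert(self, key, data):
        self.slots[key] = (key, data)   # O(1)

    def delete(self, key):
        self.slots[key] = None          # O(1)

    def search(self, key):
        return self.slots[key]          # O(1); None plays the role of NIL


table = DirectAddressTable(m=10)
table.insert(3, "satellite data for key 3")
print(table.search(3))   # (3, 'satellite data for key 3')
print(table.search(7))   # None (NIL)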
Hash Tables
• Reduce storage to O(n).
• Resolve collisions by chaining.
• Search time is O(1) on average, but not in the worst case.
Hash function:
h : U → {0, 1, ..., m − 1}
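A minimal sketch of a hash table with chaining in Python (the class name and the modular hash function are illustrative assumptions, not from the slides):

class ChainedHashTable:
    """Hash table that resolves collisions by chaining (one list per slot)."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # m slots, each holding a list of (key, data)

    def _h(self, key):
        return key % self.m                   # illustrative hash function h(k) = k mod m

    def insert(self, key, data):
        # Insert at the head of the chain: O(1).
        self.table[self._h(key)].insert(0, (key, data))

    def search(self, key):
        # Scan the chain of the slot that key hashes to.
        for k, data in self.table[self._h(key)]:
            if k == key:
                return (k, data)
        return None                           # NIL

    def delete(self, key):
        chain = self.table[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return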
Analysis of Hashing with Chaining
Insertion: O(1) time (insert at the beginning of the list).
Deletion: search time + O(1) if we use a doubly linked list.
Search: search time = time to compute the hash function + time to scan the list. Assuming the hash function is computed in O(1) time:
• Worst case: the worst search time is O(n). It happens when all keys go to the same slot (one list of size n) and we need to scan the full list ⇒ O(n).
• Average case: it depends on how the keys are distributed among the slots.
Average case Analysis
Assume simple uniform hashing: the n keys are distributed uniformly among the m slots.
Let n be the number of keys, and m the number of slots.
Average number of elements per linked list?
Load factor: α = n/m
Theorem:
The expected time of a search is Θ(1 + α). Note: this is O(1) if α is O(1), but Θ(n) if α is Θ(n).
Average case Analysis
Theorem:
The expected time of a search is Θ(1 + α).
Proof?
Distinguish two cases:
• search is unsuccessful
• search is successful
Unsuccessful search
• Assume that we can compute the hash function in O(1) time.
• An unsuccessful search requires scanning all the keys in the list. Average search time = O(1 + average length of the lists).
Let n_i be the length of the list attached to slot i.
Average value of n_i?
E(n_i) = α = n/m (the load factor) ⇒ O(1) + O(α) = O(1 + α)
Successful search
• Assume the searched key x is equally likely to be any of the n keys stored in the table.
• New keys are inserted at the head of the list ⇒ the keys scanned before reaching x are exactly those inserted into the table after x (and hashing to the same slot).
• Use an indicator variable to count collisions:
X_ij = I{ h(k_i) = h(k_j) };  E(X_ij) = 1/m  (probability of a collision)
Number of keys scanned when searching for the i-th inserted key x_i:
1 + Σ_{j=i+1}^{n} X_ij
Expected number of scanned keys, averaged over the n keys:
E[ (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} X_ij ) ]
  = (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} E[X_ij] )
  = (1/n) Σ_{i=1}^{n} ( 1 + Σ_{j=i+1}^{n} 1/m )
  = 1 + (1/(n·m)) Σ_{i=1}^{n} (n − i)
  = 1 + (n − 1)/(2m)
  = 1 + α/2 − α/(2n)
Search time:
Θ( 1 + 1 + α/2 − α/(2n) ) = Θ(1 + α)
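As a sanity check (my own sketch, not part of the slides), the snippet below fills a chained table with random keys and measures the average number of elements scanned during a successful search; the measured value should be close to 1 + α/2.

import random

def avg_successful_search_cost(n, m, trials=20):
    total = 0.0
    for _ in range(trials):
        keys = random.sample(range(10**9), n)          # n distinct random keys
        table = [[] for _ in range(m)]
        for k in keys:
            table[k % m].insert(0, k)                  # insert at the head of the chain
        # Cost of a successful search for key k = its (1-based) position in the chain.
        cost = sum(table[k % m].index(k) + 1 for k in keys) / n
        total += cost
    return total / trials

n, m = 10_000, 5_000
alpha = n / m
print("measured:", avg_successful_search_cost(n, m))
print("expected ~ 1 + alpha/2 =", 1 + alpha / 2)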
Supplementary material
Designing a hash function
Properties:
1. Uniform distribution of keys into slots
2. Regularity in the key distribution should not affect uniformity.
List of functions:
• Division method
• Multiplication methods
• Open addressing:
• Linear probing
• Quadratic probing
• Double hashing
Binary Numbers (reminder)
Each integer x admits a unique decomposition x = Σ_i a_i·2^i where 0 ≤ a_i < 2.
Example: x = 11 = 1·2^0 + 1·2^1 + 0·2^2 + 1·2^3
The binary number representation of an integer x is its (reversed) sequence of a_i's.
Example: x = 11 → ⟨a_3, a_2, a_1, a_0⟩ = 1011
Binary number operations:
• 101101 >> 1 = 10110 (right shift by k bits): quotient of the division by 2^k
• 101101 << 2 = 10110100 (left shift by k bits): multiplication by 2^k
• 101101 mod 2^2 = 01 (modulo 2^k): remainder of the division by 2^k
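The same operations checked in Python (the 0b literals are just for readability):

x = 0b101101          # 45
print(bin(x >> 1))    # 0b10110    -> quotient of the division by 2^1
print(bin(x << 2))    # 0b10110100 -> multiplication by 2^2
print(bin(x % 2**2))  # 0b1        -> remainder of the division by 2^2 (last 2 bits)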
Division Method
h(k) = k mod d. The divisor d must be chosen carefully!
Example 1: d = 2 and all keys are even? Odd slots are never used...
Example 2: d = 2^r keeps only the r last bits of k:
k = 100010110101101011
r = 2 → 11
r = 3 → 011
r = 4 → 1011
Good heuristic: choose d prime and not too close to a power of 2. Note: easy to implement, but division is slow...
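A minimal illustration in Python; the divisor 701 is an arbitrary prime chosen to follow the heuristic:

def h_division(k, d=701):
    """Division method: h(k) = k mod d, with d prime and not too close to a power of 2."""
    return k % d

print(h_division(100_000))   # 100000 mod 701 = 458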
Multiplication method
h(k) = (A·k mod 2^w) >> (w − r), where m = 2^r slots, w is the word size in bits, and A is an odd constant with 2^(w−1) < A < 2^w.
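A sketch of the multiplication method in Python, assuming w = 64 and r = 14 (so m = 2^14 slots); the odd constant A is an arbitrary illustrative choice in the required range:

w, r = 64, 14                       # word size and table-size exponent: m = 2**r slots
A = 0x9E3779B97F4A7C15              # odd constant with 2**(w-1) < A < 2**w (illustrative choice)

def h_multiplication(k):
    """Multiplication method: keep the top r bits of the low w bits of A*k."""
    return ((A * k) % 2**w) >> (w - r)

print(h_multiplication(123456))     # a slot index in {0, ..., 2**14 - 1}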
Open addressing
All elements are stored in the table itself; the hash function takes the probe number as a second argument:
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}
(key from the universe, probe number) → slot
Constraints:
• n ≤ m (i.e. at least as many slots as keys to store)
• Deletion is difficult
Challenge: How to build the hash function?
Illustration: Where to store key 282?
index:  1    2    3    4    5    6    7
key:    355  –    567  233  282  799  –
Probe sequence: h(282,0) = 3 (full!), h(282,1) = 1 (full!), h(282,2) = 5 ⇒ key 282 goes to slot 5.
Note: Search must use the same probe sequence.
Linear & Quadratic probing
Linear probing:
h(k, i) = (h'(k) + i) mod m
Note: tendency to create clusters.
Quadratic probing:
h(k, i) = (h'(k) + c_1·i + c_2·i^2) mod m
Remarks:
• We must ensure that the probe sequence is a full permutation of ⟨0, …, m − 1⟩.
• Secondary clustering: if 2 distinct keys have the same h' value, they have the same probe sequence.
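A minimal open-addressing sketch in Python using the linear probing function above (the names are illustrative; quadratic probing or double hashing would only change the probe function):

M = 11                                   # table size (illustrative)
table = [None] * M

def h_prime(k):
    return k % M                         # auxiliary hash function h'

def probe(k, i):
    return (h_prime(k) + i) % M          # linear probing: h(k, i) = (h'(k) + i) mod m

def insert(key):
    for i in range(M):                   # try at most m probes
        slot = probe(key, i)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table overflow")

def search(key):
    for i in range(M):                   # same probe sequence as insertion
        slot = probe(key, i)
        if table[slot] is None:          # empty slot: key cannot be further along the sequence
            return None
        if table[slot] == key:
            return slot
    return None

insert(282)
print(search(282))                       # slot where 282 was stored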
Double hashing
h(k,i)=(h1(k)+i⋅h2(k))modm
h2(k) must be relatively prime to m to guarantee that
the probe sequence is a full permutation of ⟨0, 1, . . . , m −1⟩.
Examples:
• m a power of 2 and h2 returning only odd numbers
• m prime and h2(k) = 1 + (k mod (m − 1))
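A possible pair of auxiliary functions for double hashing, following the second example above (the prime table size is an illustrative choice):

M = 13                                    # prime table size (illustrative)

def h1(k):
    return k % M

def h2(k):
    return 1 + (k % (M - 1))              # always in {1, ..., m-1}, hence relatively prime to a prime m

def probe(k, i):
    return (h1(k) + i * h2(k)) % M        # double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m

# The probe sequence of any key visits all m slots exactly once:
print(sorted(probe(282, i) for i in range(M)) == list(range(M)))   # True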
Universal hashing
• Decompose each key x into r + 1 digits x = ⟨x0, x1, ..., xr⟩ with 0 ≤ xi < m.
• a = ⟨a0, a1, ..., ar⟩ denotes a sequence of r + 1 elements randomly chosen from {0, 1, ..., m − 1}.
The class H defined by:
H = ∪_a { h_a }  with  h_a(x) = ( Σ_{i=0}^{r} a_i·x_i ) mod m
is a universal family of hash functions.
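A sketch of such a universal family in Python, assuming m prime; each key is split into base-m digits and a fresh random vector a is drawn when the hash function is created:

import random

M = 701                                          # prime number of slots (illustrative)
R = 3                                            # keys are split into r + 1 = 4 base-m digits

def digits(x, m=M, r=R):
    """Decompose key x into its r+1 base-m digits <x0, ..., xr>."""
    return [(x // m**i) % m for i in range(r + 1)]

def make_hash(m=M, r=R):
    a = [random.randrange(m) for _ in range(r + 1)]      # random a = <a0, ..., ar>
    def h_a(x):
        return sum(ai * xi for ai, xi in zip(a, digits(x, m, r))) % m
    return h_a

h = make_hash()
print(h(123456789))                              # a slot in {0, ..., m-1}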
Cost of Universal Hashing
Theorem:
Using chaining and universal hashing on key k:
• If k is not in the table T, the expected length of the list that k hashes to is ≤ α.
• If k is in the table T, the expected length of the list that k hashes to is ≤ 1 + α.
Proof:
Xk = # of keys (≠ k) that hash to the same slot as k.
Ckl = I{ h(k) = h(l) };  E[Ckl] = Pr{ h(k) = h(l) } ≤ 1/m.
Xk = Σ_{l∈T\{k}} Ckl, and E[Xk] = E[ Σ_{l∈T\{k}} Ckl ] = Σ_{l∈T\{k}} E[Ckl] ≤ Σ_{l∈T\{k}} 1/m.
If k ∉ T, E[Xk] ≤ n/m = α.
If k ∈ T, E[Xk] + 1 ≤ (n − 1)/m + 1 = 1 + α − 1/m < 1 + α.
Proof (universal hashing function)
Let X = ⟨x0, x1, ..., xr⟩ and Y = ⟨y0, y1, ..., yr⟩ be 2 distinct keys. They differ in at least one position; WLOG let 0 be this position.
For how many h_a do X and Y collide?
Σ_{i=0}^{r} a_i·x_i ≡ Σ_{i=0}^{r} a_i·y_i (mod m)
Σ_{i=0}^{r} a_i·(x_i − y_i) ≡ 0 (mod m)
a_0·(x_0 − y_0) ≡ − Σ_{i=1}^{r} a_i·(x_i − y_i) (mod m)
Since m is prime, (x_0 − y_0) has a multiplicative inverse modulo m, hence
a_0 ≡ ( − Σ_{i=1}^{r} a_i·(x_i − y_i) ) · (x_0 − y_0)^{−1} (mod m)
For any choice of ⟨a_1, a_2, ..., a_r⟩ there is exactly one choice of a_0 such that X and Y collide.
#{h that collide} = m × m × … × m × 1 = m^r = |H| / m