Universal Hashing
COMPCSI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides of Kevin Wayne (https://www.cs.princeton.edu/~wayne/kleinberg-tardos/) Designed for the textbook Algorithm Design
by Jon Kleinberg and Eva Tardos.
Auckland, Aug 3, 2020
1
Dictionary data type
Dictionary problem:
Given a universe U of possible elements, maintain a subset S Í U
supporting fast searching, inserting, deleting in S (in constant time).
Main dictionary operations:
Insert(x): insert element x Î U to S.
Delete(x): remove element x from S if x Î S.
Search(x): return true if x Î S and false otherwise.
Applications:
File systems, web caching, databases, Google, checksum P2P network,…
2
C++ unordered_set
3
Dictionary problem
Challenge:
The size of universe U is too large to store in an array of size |U|.
Can we use the following data structures?
Bitvector
Linked list
Binary search tree Hash table
4
Hashing
Hash function:
h: U {0, 1, … , n-1}
Hashing:
Create an array A of length n. For any operation on element x Î S,
operate on the array position A[h(x)] Collision:
Two different elements x and y have the same hash value: h(x) = h(y). Chaining: A[i] is a linked list containing all elements x where h(x) = i.
5
5
Example of chain hashing
0
1 11 31
2
3
4 34 74 54
6 96 7
U = {0, 1, … , 99}
h(x) = x mod 10
S = {96, 54, 31, 11, 74, 34}
8 9
6
Dictionary operations
Search(x):
Compute h(x), then scan through the list A[h(x)]. Return true if
finding x, and false otherwise. Insert(x):
Compute h(x), then scan through the list A[h(x)] to find x. If x is not in the list, add x to the front of the list. Otherwise, do nothing.
Delete(x):
Compute h(x), then scan through the list A[h(x)] to find x. If x is in
the list, remove x from the list. Otherwise, do nothing. Time complexity: O(1 + length of A[h(x)])
7
Deterministic hash function
For any choice of h, we can find a set whose elements have the same hash values (map to one slot).
We end up with a single linked list containing all elements. Hence, it takes O(n) time for the searching operation.
Solutions:
Assume the input S is random.
Ideal hash function: Map m elements uniformly at random to n slots.
8
Hashing performance
Ideal hash function:
Any m elements are uniformly distributed to n slots.
Average length of chain = m/n.
Choose m ~ n, we expect O(1) cost for insert, delete, search operations.
We cannot have an ideal hash function for all possible sets of elements (true randomness might not exist).
Approach: Choose the hash function h at random from a hash family H.
9
Universal hashing
A universal family of hash functions H is a set of hash functions h mapping the universe U to a set {0, 1, … , n-1} such that
For any different pair x ≠ y: Prh[h(x) = h(y)] ≤ 1/n, i.e. collision probability is very tiny.
Can select a random h efficiently
Can compute h(x) efficiently (in constant time)
h is chosen at random.
10
Example
Universal property: If x ≠ y, Prh[h(x) = h(y)] ≤ 1/n.
11
Universal hash family for integer
We construct a universal hash function h that hashes an integer x Î U to a set of {0, 1, … , n-1} as follows:
Pick a large prime number p ≥ n
Pick a random pair of integers 0 ≤ a, b < p ha,b(x) = ((ax + b) mod p) mod n
Universal hash family: H = {ha,b: 0 ≤ a, b < p} Proof of the universal property:
The 13.6 section of Algorithm Design textbook.
Carter & Wegman: “Universal Classes of Hash Functions”. STOC’77.
12
Universal hash for a set of integers
We construct a universal hash function h that hashes a set of integers X ={x1, ... , xd} to a set {0, 1, ... , n-1} as follows:
Choose independently at random d hash functions h1, ... , hd from a universal hash family H.
h(X) = h1(x1) + ... + hd(xd) mod n. Proof of universal property:
Carter & Wegman: “Universal Classes of Hash Functions”. STOC’77.
Practical hash function: For a large prime p and 0 ≤ ai < p, ha0,...,ad(X) = ((a0+a1x1 + a2x2+...+adxd) mod p) mod n
13
Analysis of dictionary operations
Time complexity: O(1 + length of A[h(x)])
Assumption: |S| ≤ n at all time, hence our dictionary uses
O(n) space.
The expectation of the length of the linked list A[h(x)]: 𝐄 lengthof𝐴 h 𝒙 1𝐄|𝒚Î𝑆\𝒙|h𝒚 h𝒙|
1 𝑆 1 ∗ 2 .
Expected time complexity of dictionary operations is O(1).
14
Conclusion
Given a random choice of a hash function h from the universal family such that h: U {0, 1, ... , n-1}.
Our dictionary needs O(n) space and O(1) expected time per operation (search, insert, delete).
Expectation is over the random choice of hash function. Independent of input set.
15
Homework
Write Python script to implement the universal hash family for a set of integers
Source: https://zeroequalsfalse.com/posts/learn-hashtable/
16