程序代写代做代考 database data structure AI chain file system c++ algorithm Universal Hashing

Universal Hashing
COMPCSI 753: Algorithms for Massive Data
Ninh Pham
University of Auckland
Parts of this material are modifications of the lecture slides of Kevin Wayne (https://www.cs.princeton.edu/~wayne/kleinberg-tardos/) Designed for the textbook Algorithm Design
by Jon Kleinberg and Eva Tardos.
Auckland, Aug 3, 2020
1

Dictionary data type
 Dictionary problem:
 Given a universe U of possible elements, maintain a subset S Í U
supporting fast searching, inserting, deleting in S (in constant time).
 Main dictionary operations:
 Insert(x): insert element x Î U to S.
 Delete(x): remove element x from S if x Î S.
 Search(x): return true if x Î S and false otherwise.
 Applications:
 File systems, web caching, databases, Google, checksum P2P network,…
2

C++ unordered_set
3

Dictionary problem
 Challenge:
 The size of universe U is too large to store in an array of size |U|.
 Can we use the following data structures?
 Bitvector
 Linked list
 Binary search tree  Hash table
4

Hashing
 Hash function:
 h: U {0, 1, … , n-1}
 Hashing:
 Create an array A of length n. For any operation on element x Î S,
operate on the array position A[h(x)]  Collision:
 Two different elements x and y have the same hash value: h(x) = h(y).  Chaining: A[i] is a linked list containing all elements x where h(x) = i.
5

5
Example of chain hashing
0
1 11 31
2
3
4 34 74 54
6 96 7
U = {0, 1, … , 99}
h(x) = x mod 10
S = {96, 54, 31, 11, 74, 34}
8 9
6

Dictionary operations
 Search(x):
 Compute h(x), then scan through the list A[h(x)]. Return true if
finding x, and false otherwise.  Insert(x):
 Compute h(x), then scan through the list A[h(x)] to find x. If x is not in the list, add x to the front of the list. Otherwise, do nothing.
 Delete(x):
 Compute h(x), then scan through the list A[h(x)] to find x. If x is in
the list, remove x from the list. Otherwise, do nothing.  Time complexity: O(1 + length of A[h(x)])
7

Deterministic hash function
 For any choice of h, we can find a set whose elements have the same hash values (map to one slot).
 We end up with a single linked list containing all elements. Hence, it takes O(n) time for the searching operation.
 Solutions:
 Assume the input S is random.
 Ideal hash function: Map m elements uniformly at random to n slots.
8

Hashing performance
 Ideal hash function:
 Any m elements are uniformly distributed to n slots.
 Average length of chain = m/n.
 Choose m ~ n, we expect O(1) cost for insert, delete, search operations.
 We cannot have an ideal hash function for all possible sets of elements (true randomness might not exist).
 Approach: Choose the hash function h at random from a hash family H.
9

Universal hashing
 A universal family of hash functions H is a set of hash functions h mapping the universe U to a set {0, 1, … , n-1} such that
 For any different pair x ≠ y: Prh[h(x) = h(y)] ≤ 1/n, i.e. collision probability is very tiny.
 Can select a random h efficiently
 Can compute h(x) efficiently (in constant time)
h is chosen at random.
10

Example
 Universal property: If x ≠ y, Prh[h(x) = h(y)] ≤ 1/n.
11

Universal hash family for integer
 We construct a universal hash function h that hashes an integer x Î U to a set of {0, 1, … , n-1} as follows:
 Pick a large prime number p ≥ n
 Pick a random pair of integers 0 ≤ a, b < p  ha,b(x) = ((ax + b) mod p) mod n  Universal hash family: H = {ha,b: 0 ≤ a, b < p}  Proof of the universal property:  The 13.6 section of Algorithm Design textbook.  Carter & Wegman: “Universal Classes of Hash Functions”. STOC’77. 12 Universal hash for a set of integers  We construct a universal hash function h that hashes a set of integers X ={x1, ... , xd} to a set {0, 1, ... , n-1} as follows:  Choose independently at random d hash functions h1, ... , hd from a universal hash family H.  h(X) = h1(x1) + ... + hd(xd) mod n.  Proof of universal property:  Carter & Wegman: “Universal Classes of Hash Functions”. STOC’77.  Practical hash function: For a large prime p and 0 ≤ ai < p,  ha0,...,ad(X) = ((a0+a1x1 + a2x2+...+adxd) mod p) mod n 13 Analysis of dictionary operations  Time complexity: O(1 + length of A[h(x)])  Assumption: |S| ≤ n at all time, hence our dictionary uses O(n) space.  The expectation of the length of the linked list A[h(x)]: 𝐄 lengthof𝐴 h 𝒙 􏰀 1􏰁𝐄􏰂|𝒚Î𝑆\􏰃𝒙􏰄|h􏰅𝒚􏰆 􏰀 h􏰅𝒙􏰆|􏰇 􏰈 1 􏰁 𝑆 􏰉 1 ∗ 􏰋􏰊 􏰈 2 .  Expected time complexity of dictionary operations is O(1). 14 Conclusion  Given a random choice of a hash function h from the universal family such that h: U  {0, 1, ... , n-1}.  Our dictionary needs O(n) space and O(1) expected time per operation (search, insert, delete).  Expectation is over the random choice of hash function.  Independent of input set. 15 Homework  Write Python script to implement the universal hash family for a set of integers Source: https://zeroequalsfalse.com/posts/learn-hashtable/ 16