CS代考 Algorithms & Data Structures (Winter 2022) Hashing

Algorithms & Data Structures (Winter 2022) Hashing

Announcements

Copyright By PowCoder代写 加微信 powcoder

• Introduction.
• Hash functions.
• Collision resolution.

Hashing – Problem definition
• A map is a set of (key, value) pairs. Each key maps to at most one value.
Taken from Langer2014
• This is also a map (‘function’ = ‘map’), but we will not be considering continuous functions/maps here.
Taken from Langer2014

Hashing – Problem definition
Table S with m records x: X
Information or data associated with x
We want a data structure to store and retrieve these data. Operations:
Satellite data
• insert(S,x):S←S∪{x} • delete(S, x) : S ← S \ {x} • search(S,k)
Dynamic set

Hashing – Real world Examples
• Compilers.
• a compiler that translates a programming language maintains a symbol table, in which the keys of elements are arbitrary character strings corresponding to identifiers in the language.
• Password Authentication
• Keys are usernames, and values are passwords (hash(passwords)). • http://www.md5.cz/
• https://crackstation.net/hashing-security.htm
• Data consistency
• For example, after downloading a file, you can compare the fingerprint of the downloaded file with the fingerprint of the original file. If the two fingerprints differ, the files are surely different, if they are the same, the files are equal with a very high probability.

Hashing – Why using hash
• Hashing is an extremely effective and practical technique.
• Many applications require a dynamic set that supports only the dictionary operations
• INSERT, SEARCH, and DELETE.
• A hash table is an effective data structure for implementing dictionaries.
• Under reasonable assumptions, the average time to search for an element in a hash table is O(1).
• A hash table generalizes the simpler notion of an ordinary array.
• Directly addressing into an ordinary array makes effective use of our ability to examine an arbitrary position in an array in O(1) time.
• The array index is computed from the key.
• To allocate an array that has one position for every possible key.

Direct-address tables – First attempt
Each slot, or position, corresponds to a key in U .
If there is an element x with key k, then T[k] contains a pointer to x. If T [k] is empty, represented by NIL.
All operations in O(1), but:
if n (#keys) < m (#slots), lot of wasted space. Illustration (CLR, 2005) needs to be not too large (space constraint) Hashing – Hash map • Hash map (hash function + hash table). • Direct addressing: • an element with key k is stored in slot k. • Hash function: • an element with key k is stored in slot h(k). • That is, we use a hash function h to compute the slot from the key k. • Hash function: h:U →{0,1,...,m−1} Illustration9(CLR, 2009) Hashing – Hash table • The hash function reduces the range of array indices. Instead of a size of |U|, the array can have size m. • The catch: We reduce the storage requirement while we maintain the benefit of searching in O(1). • This boundary is for the average-case time, whereas for direct addressing it holds for the worst-case time. • The hitch: Two keys may hash to the same slot (i.e., collision). Illustration (CLR, 2009) • Introduction. • Hash functions. • Collision resolution. Hashing – Function • A good hash function. • Satisfies the condition of simple uniform hashing. • Each key is equally likely to hash to any of the m slots. • Nearby/similar keys in U should map to different values. • http://www.md5.cz/ • Independent of any patterns of U. • Can be computed quickly O(1) Taken from Langer2014 Hashing – Function • h:U->{0,1,….m-1}. • Two steps
Taken from Langer2014

Hashing – Function
• Collisions:
• Two keys have same hash code
• Two keys have different hash codes but are compressed to same hash value
Taken from Langer2014

Hashing – Design
• List of functions:
• Division method
• Multiplication methods
• Universal Hashing

Hashing – Division Method
h(k)=kmodd d must be chosen carefully!
Example 1: d = 2r and all keys are even?
Odd slots are never used…
Example 2: d = 2r
r = 2 -> 11
r = 3 -> 011 r = 4 -> 1011
k = 100010110101101011
keeps only r last bits…
Good heuristic: Choose d prime not too close from a power of 2.
Note: Easy to implement, but it depends on value d

Hashing – Multiplication Method
𝑤 = 𝑤𝑜𝑟𝑑 𝑠𝑖𝑧𝑒 𝑐𝑜𝑚𝑝𝑢𝑡𝑒𝑟
𝑘𝐴𝑚𝑜𝑑1=𝑘𝐴 − 𝑘𝐴 𝑘𝑠=2″𝑟# + 𝑟$ (2w-bit value)
h 𝑘 = 𝑚(𝑘𝑠𝑚𝑜𝑑1)
Note: A bit more complicated, but the value of m is not critical
Illustration (CLR, 2009)

Hashing – Universal Hashing
• A malicious adversary who has learned the hash function chooses keys that all map to the same slot, giving worst-case behavior.
• Defeat the adversary using Universal Hashing
• Use a different random hash function each time.
• No single input will always evoke worst-case behavior.
• Ensure that the random hash function is independent of the keys that
are actually going to be stored.
• Ensure that the random hash function is “good” by carefully designing a class of functions to choose from:
• Design a universal class of functions.

Hashing – Universal Hashing
• A finite collection of hash functions H that maps a universe U of keys into the range {0, 1,…, m–1} is
universal if:
for each pair of distinct keys x, y Î U, the number of hash functions h Î H
for which h(x)=h(y) is ≤ |H|/m.
for a hash function h chosen randomly from H, the chance of a collision between two keys is ≤ 1/m.
Universal hash functions give good hashing behavior.

• Introduction.
• Hash functions.
• Collision resolution.

Hashing – Hash table
• The hash function reduces the range of array indices. Instead of a size of |U|, the array can have size m.
• The catch: We reduce the storage requirement while we maintain the benefit of searching in O(1).
• This boundary is for the average-case time, whereas for direct addressing it holds for the worst-case time.
• The hitch: Two keys may hash to the same slot (i.e., collision).
Illustration (CLR, 2009)

Hashing – Collisions – Birthday Problem
• We are 500 people in Comp251. What is the probability that at least one pair of us have the same birthday?
Taken from Langer2014

Hashing – Collisions – Birthday Problem
• We are 500 people in Comp251. What is the probability that at least one pair of us has the same birthday?
Taken from Langer2014
• |U| > m, then there must be at least two people that has the same birthday. In other words, |U| > m, then there must be at least two keys that have the same hash value.
• For n = 23, probability ~ 0.5 (50% chance!)
• The home take message: collisions happen a lot.

Hashing – Collisions
• List of :
Chaining method.
Open addressing. • Linear
• Quadratic
• Double Hashing

Collision resolution – by chaining
• We place all the elements that hash to the same slot into the same linked list.
Pointer to the head of the list
Illustration (CLR, 2009)

Collision resolution – by chaining
• Dictionary Operations.
• CHAINED-HASH-INSERT(T, x)
• insert x at the head of list T[h(x.key)]
• O(1) if we assume that that x is not already present in the table.
• CHAINED-HASH-SEARCH(T, k)
• search for an element with key k in list T[h(k)] • proportional to the length of the list.
• CHAINED-HASH-DELETE(T, x)
• delete x from the list T[h(x.key)]
• O(1) if we assume that the list is doubly linked and a pointer to the object is given.
• Proportional to the length of the list, otherwise.

Collision resolution – by chaining
• Insertion: O(1) time (Insert at the beginning of the list).
• Deletion: Search time + O(1) if we use a double linked list.
• Worst case: Worst search time is O(n).
• Search time = time to compute hash function + time to search the list.
• Assuming the time to compute hash function is O(1).
• Worst time happens when all keys go the same slot (list of size n), and we need to scan the full list => O(n).
• Average case: It depends how keys are distributed among slots. 27

Chaining – Average case Analysis
• Assume a simple uniform hashing: n keys are distributed uniformly among m slots.
• Let n be the number of keys, and m the number of slots.
• Average number of element per linked list?
• The expected time of a search is O(1 + α). • Note:O(1)ifα<1,butO(n)ifαisO(n). • Load factor: • Theorem: Chaining – Average case Analysis • Theorem: • The expected time of a search is O(1 + α). • expectednumberofelementsexaminedtoseewhetheranyhaveakeyequaltok. • Proof? • Distinguish two cases: • search is unsuccessful: • No element in the table has key k • search is successful • Finding an element with key k Chaining – Unsuccessful search • Assume that we can compute the hash function in O(1) time. • An unsuccessful search requires to scan all the keys in the list. • Average search time = O(1 + average length of lists) • Let ni be the length of the list attached to slot i. • Average value of ni ? E ( n i ) = α = mn Þ O( 1 ) + O( α ) = O( 1 + α ) (Load factor) Chaining – Successful search • Each list is not equally likely to be searched. Instead, the probability that a list is searched is proportional to the number of elements it contains. • Assume the position of the searched key x is equally likely to be any of the elements stored in the list. • The number of elements examined during a successful search for an element x is one more than the number of elements that appear before x in x’s list. • new elements are placed at the front of the list, elements before x in the list were all inserted after x was inserted • To find the expected number of elements examined, we take the average, over the n elements x in the table, of 1 plus the expected number of elements added to x’s list after x was added to the list. X =I h(k)=h(k);E(X)= ij{i j}ij (probability of a collision) Chaining – Successful search 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑘𝑒𝑦𝑠 𝑖𝑛𝑠𝑒𝑟𝑡𝑒𝑑 𝑎𝑓𝑡𝑒𝑟 𝑥 = 1 + : 𝑋86 6789: 𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑐𝑎𝑛𝑛𝑒𝑑 𝑘𝑒𝑦𝑠 = 𝐸 𝑛 : 1 + : 87: 6789: (1n"n%+1n"n % E* ∑$1+∑Xij'-= ∑$1+∑E()Xij+,' Bylinearityofexpectation *n$ '-n$ ' i=1 # j=i+1 & 1∑n" ∑n 1% = $1+ ' n$m' i=1 # j=i+1 & =1+𝛼−𝛼 2 2𝑛 Search time: Θ 1+1+𝛼−𝛼 =Θ(1+𝛼) Supplementary material Chaining – Home take message • If the number of hash-table slots is at least proportional to the number of elements in the table, we have n = O(m) and: 𝛼= 𝑛 =𝑂(𝑚)=𝑂(1) 𝑚𝑚 • We can support all dictionary operations in O(1) time on average. • Insertion => O(1) worst case time.
• Deletion => O(1) worst case time (under assumptions). • Searching => O(1) on average time.

Hashing – Collisions
• List of :
Chaining method.
Open addressing. • Linear
• Quadratic
• Double Hashing

Hashing – Open Addressing
• No storage for multiple keys on single slot (i.e. no chaining).
• Idea: Probe the table.
• Insert if the slot is empty,
• Try another hash function otherwise.
• h:Ux{0,…,m-1}->{1,…,m} Universe of keys probe number slot
• Constraints:
• 𝑛 ≤ 𝑚 (i.e. more slots than keys to store)
• Deletion is difficult
• Challenge: How to build the hash function?

Hashing – Open Addressing
Illustration: Where to store key 282?
h(282,0)=3 h(282,1)=1 h(282,2)=5
Note1: Search must use the same probe sequence.
Note2: The probe sequence must be a permutation of the indexes

Hashing – Open Addressing
Linear probing:
h(k,i)=(h'(k)+i)modm
• Tendency to create primary clusters (long runs of occupied slots build up).
• There are only m distinct probe sequences (the initial probe determines
the entire probe sequence) Quadratic probing:
h(k,i)= h'(k)+c ⋅i+c ⋅i2 modm (12)
• The values of c1, c2 and m must be constrained to ensure that we have
a full permutation of ⟨ 0, … , m-1 ⟩.
• There are only m distinct probe sequences
• Secondary clustering: If 2 distinct keys have the same h′ value, then
they have the same probe sequence.

Hashing – Open Addressing
Double hashing:
h(k,i)=(h1(k)+i⋅h2(k))modm • One of the best methods available.
• More permutations.
• Characteristics of randomly chosen permutations.
• Alleviatetheproblemofclustering
• Must have h2(k) be “relatively” prime to m to guarantee that the
probe sequence is a full permutation of ⟨0, 1, . . . , m −1⟩. • Examples:
• m power of 2 and h2 returns odd numbers. • O(m2) probes are used.
• Each possible (h1(k), h2(k)) pairs yields a distinct probe sequence. • m prime number and 1 < h2(k) < m 39 Hashing – Open Addressing Double hashing: h(k,i)=(h1(k)+i⋅h2(k))modm Task: To insert k = 14 (key = value) h# 𝑘 =𝑘𝑚𝑜𝑑𝑚=𝑘𝑚𝑜𝑑13 h% 𝑘 =1+ 𝑘𝑚𝑜𝑑𝑚& =1+(𝑘𝑚𝑜𝑑11) h# 14 =14𝑚𝑜𝑑13=1 h% 14 =1+ 14𝑚𝑜𝑑11 =4 h 14,0 =1+0∗4=1 h 14,1 =1+1∗4=5 h 14,2 =1+2∗4=9 Hashing – Open Addressing We assume uniform hashing: Each key equally likely to have anyone of the m’s permutations as its probe sequence, independently of other keys. Theorem 1: The expected number of probes in an unsuccessful search is at most 1 . Theorem 2: The expected number of probes in a successful search is at most Reminder: α = mn is the load factor 1⋅log# 1 & α %$ 1 − α (' Open Addressing – supplemental material Initial state: n keys are already stored in m slots. Probability 1st slot is occupied: n/m. Probability 2nd slot is occupied knowing 1st is too: (n-1)/(m-1). Probability 3rd slot is occupied knowing 2nd is too : (n-2)/(m-2). Let X be the number of unsuccessful probes. Pr𝑋≥𝑖 =𝑛B𝑛−1B𝑛−2⋯𝑛−𝑖+2 𝑚 𝑚−1 𝑚−2 𝑚−𝑖+2 𝑛<𝑚⇒(𝑛−𝑗)⁄ 𝑚−𝑗 ≤𝑛⁄𝑚,for𝑗≥0 Pr𝑋≥𝑖 ≤ 𝑛⁄𝑚567=𝛼567 9991 𝐸𝑋 =:Pr{𝑋≥𝑖}≤:𝛼567=:𝛼5=1−𝛼42 587 587 58: Open Addressing – home take message The expected number of probes to insert is at most 1/(1 − α). Interpretation: • If α is constant, an unsuccessful search takes O(1) time. • If α = 0.5, then an unsuccessful search takes an average of 1/(1 − 0.5) = 2 probes (1.387 for a successful search). • If α = 0.9, takes an average of 1/(1 − 0.9) = 10 probes (2.559 for a successful search). Proof of Theorem on successful searches: See [CLRS, 2009]. Open Addressing • Dictionary Operations. • HASH-INSERT(T, x) • HASH-SEARCH(T, k) • HASH-DELETE(T, x) Open Addressing - Problem • Dictionary Operations. • HASH-DELETE(T, x) => Works poorly
• get(k5) will fail because slot 1 will be empty even though k5 is in the table 45
Taken from Langer2014

• Introduction.
• Hash functions.
• Collision resolution.

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com