CS计算机代考程序代写 data structure c++ algorithm Hash Tables

Hash Tables

Data Structures and Abstractions
Hash Tables

Lecture 31

*

Important Advice for Assignment2/Project
You must demonstrate progress on the Assignment 2/project to your tutor.
This is to ensure that the following good practices are being followed:
1. The work does not have to be complete but it must build (compile and link) and run.
2. How you have been keeping backups as these are necessary so that you do not lose all your work if something goes wrong.
3. Be aware of the minimum requirements of the assignment, as not meeting the minimum requirements would result in failing the assignment.
If you submit your assignment without your tutor knowing how you are progressing with the assignment in the weeks preceding the due date, your assignment may not be marked. You may also be called in for an interview, quite possibly multiple interviews.
The mark/grade that you get for a piece of assessed work should be a reflection of you actual ability. We want to ensure that this is the case when we give you a grade/mark and so we would be suspicious of any work “appearing by magic”.
If you submit a good assignment but are unable to show that you can design and write good C++ code in the exam, your assignment mark will be in question as we wouldn’t be able to say that the mark/grade we gave you for the assignment reflects your actual ability. The exam is easier and less complex compared to the assignment. You will need to demonstrate that you understand the material that you submitted in your assignment. You tutor can be your witness, and it would be sufficient if the tutor indicates that you actually understood the work you submitted.

Maps
Previously we have looked at the STL map class, where any type can be used as a key into a container of paired values, giving direct access to the second part of the data.
So we can have:

*
map DictionaryType;
Dictionary dictionary; // not really a good way to name
dictionary[“aardvark”] = “A nocturnal mammal of southern Africa”
cout << “arardvark:” << dictionary[“aardvark”]; //Figure out how may string object constructions occurred in the lines //above Hash Tables One way to achieve this kind of direct access for the map class is to use what is known as a hash table. If keys are unique a balanced binary search tree can be used. If keys are not unique and key are unordered as in std::unordered_multimap or std::unordered_multiset , then hashing is used. When storing the data, the key—in this case “aardvark” —is passed through a hash function, to give an index into an ordinary array. [1] [2] The quality of the hash function determines how many different keys hash to the same index value. (technically known as “collisions”) No hash function is perfect under all conditions, therefore there will always be clashes (“collisions”). Therefore there must also be a collision resolution defined. Hash tables will always have empty space. To work most efficiently they are generally required to be no more than half full. * * [1] Why would you want to do something like this? [2] Users of the STL map should not base their code on any underlying implementation of the map. Dealing with Strings The key used in the above example is a string. Obviously you cannot pass a string through a mathematical function. Therefore strings must be mapped to integers before hashing. There are many ways to do this, however it is important to make sure that the method chosen does not promote collisions. * Insertion into a Hash Table Insert (pair) index = HashFunction (pair.key) index = index % arraySize IF array[index] is empty array[index] = pair ELSE HandleCollision (index, pair) ENDIF End Insert * [1] index is wrapped around the array using the mod % operator: * Searching a Hash Table Search (key, found) index = HashFunction (key) index = index mod arraySize IF array[index] is empty found = false ELSE IF array[index].key = key found = true ELSE // another key was hashed to the same index found = CollisionSearch (index, pair) ENDIF ENDIF End Search * Hash Functions The ideal function would: [1] be easy to calculate; never produce the same index from two different keys; spread the records evenly throughout the array; deal with ‘bad’ keys better than others. Of course no function has all of these attributes under all conditions. It may be possible under restricted conditions where all keys are know in advance. Common Hash Functions Truncation Extraction Folding Modular arithmetic Prime number division Mid-square hashing Radix conversion * [1] Ideal function is possible if all keys are known in advance. In the non-ideal case, the load factor needs to be kept below some bound. The load factor is the ratio of number of keys to the number of places (buckets) available to store the keys. A load factor close to 0 is wasting space without an increase in performance. * Radix Conversion The best known of these is Radix Conversion. Choose a low prime [1] number such as 7, 11 or 13 to use as the base of a polynomial. Then use the digits of the key as the factors of the polynomial. Finally modulate by the array size, which should be a prime number. For example, if key = 32934648 array size = 997 Base = 7 Then index = (3 * 77 + 2 * 76 + 9 * 75 + 3 * 74 + 4 * 73 + 6 * 72 + 4 * 7 + 8) MOD 997 = 2866095 MOD 997 = 717 * * [1] Why prime? Collision Resolution Collision resolution needs to: avoid clustering of records; be as simple to code as possible; only fail when the array is actually full; be ‘reversible’ to allow for deletion. As before, no method fulfils all these requirements under every possible condition. Common collision resolutions are: Linear probing Quadratic probing Random probing Linked collisions Overflow containers * Probing Linear probing simply looks for the next empty space in the array. So if index is full, index+1 is checked, then index+2, index+3 etc. This has the disadvantage of increasing clustering and therefore collisions. Quadratic probing looks for the next empty space using ‘square’ jumps. So if index is full, index+1 is checked, then index+4, index+9, index+16 etc. This avoids the clustering of linear probing, but can fail when the array is not full. Random probing uses a random number generator—from a set starting point—for the increments in index. This avoids clustering, but is harder to reverse when a record is deleted. Use pseudo -random number generator. For all three of these, once a collision (keys hashing to the same value) has been detected it is necessary to do a linear search for a space. Similarly searching becomes linear once a collision has been detected. * Linking Collisions After the first collision, a second hash function is used to generate an alternate position and the two are linked. A third collision would then have a link from the second collision and so on. This is harder to code and uses extra memory, but retrieval is faster. * key1 key2 An Overflow Container Instead of using a one dimensional array to store data, a two dimensional structure is used. The records are placed in a linked list from the hashed index. * key2 key1 Readings Reference book, Introduction to Algorithms. Chapter on Hash Tables. Khan Academy Video one particular example of the use Hash functions “Bitcoin: Cryptographic hash functions” Tutorial on Hash functions http://research.cs.vt.edu/AVresearch/hashing/ Sedgewick (Parts 1–4), Ch 14 see https://murdoch.rl.talis.com/items/92774E67-EC42-763E-180D-909194686C4C.html Chapter on Hash Tables in Algorithms, Data Structures, and Problem Solving with C++ by Mark Weiss. http://prospero.murdoch.edu.au/record=b1300079 *