CS计算机代考程序代写 data structure database chain algorithm COMP2521

COMP2521
Data Structures & Algorithms

Week 9.2
Hashing

 

1

In this lecture

Why?

Associate data structures (as opposed to ordered data
structures) are a core data structure that needs
exploration

What?

Hashing
Hash Table ADT
Hash Collisions

2

Ordered & Associative Containers

So far we’ve mainly explored ordered containers (linked lists, arrays, trees).
 

Ordered containers have a sense of order between elements, and typically you
can assign a meaningful index to each element that denotes order.

3 . 1

Ordered & Associative Containers

Associative containers map “keys” to various values, and items in associative
containers have no sense of order. Keys are typically strings.

 
You can think of these a little like structs (conceptually). Identified by name.

 

3 . 2

Hashing
We need a way for this to make sense

courses[“COMP3311”] = “Database Systems”;
printf(“%s\n”, courses[“COMP3311”]);

1
2

courses[h(“COMP3311”)] = “Database Systems”;
printf(“%s\n”, courses[h(“COMP3311”)]);

1
2

Let’s define a function h that converts a string to a number

4 . 1

Hashing

In reality we do this
key = “COMP3311”;
item = {“COMP3311″,”Database Systems”,…};
courses = HashInsert(courses, key, item);
printf(“%s\n”, HashGet(courses, “COMP3311”));

1
2
3
4

To use arbitrary values as keys, we need …

Set of Key (strings) values dom(Key),   each key identifies one Item
An array (of size N ) to store Items
A hash function h() of type dom(Key) → [0..N-1]

requirement: if (x = y)  then h(x) = h(y)
requirement: h(x) always returns same value for given x

One issue: Array is size N, but dom(Key) > N in nearly all cases
Therefore collisions are inevitable (we will discuss this later)

 
4 . 2

Hash Table ADT

typedef struct HashTabRep *HashTable;
// make new empty table of size N
HashTable newHashTable(int);
// add item into collection
void HashInsert(HashTable, Item);
// find item with key
Item *HashGet(HashTable, Key);
// drop item with key
void HashDelete(HashTable, Key);
// free memory of a HashTable
void dropHashTable(HashTable);

1
2
3
4
5
6
7
8
9

10
11

5 . 1

Hash Table ADT Implementation

typedef struct HashTabRep {
Item **items; // array of (Item *)
int N; // size of array
int nitems; // # Items in array
} HashTabRep;

HashTable newHashTable(int N)
{
HashTable new = malloc(sizeof(HashTabRep));
new->items = malloc(N*sizeof(Item *));
new->N = N;
new->nitems = 0;
for (int i = 0; i < N; i++) new->items[i] = NULL;
return new;
}

void HashInsert(HashTable ht, Item it) {
int h = hash(key(it), ht->N);
// assume table slot empty!?
ht->items[h] = copy(it);
ht->nitems++;
}
Item *HashGet(HashTable ht, Key k) {
int h = hash(k, ht->N);
Item *itp = ht->items[h];
if (itp != NULL && equal(key(*itp),k))
return itp;
else
return NULL;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

This assumes no collisions

void HashDelete(HashTable ht, Key k) {
int h = hash(k, ht->N);
Item *itp = ht->items[h];
if (itp != NULL && equal(key(*itp),k)) {
free(itp);
ht->items[h] = NULL;
ht->nitems–;
}
}
void dropHashTable(HashTable ht) {
for (int i = 0; i < ht->N; i++) {
if (ht->items[i] != NULL) free(ht->items[i]);
}
free(ht);
}

// key() and copy() come from Item type; equal() from Key type

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

5 . 2

Hash Functions

Basic mechanism of hash functions

int hash(Key key, int N)
{
int val = convert key to 32-bit int;
return val % N;
}

1
2
3
4
5

If keys are ints, conversion is easy (identity function)
How to convert keys which are strings?   (e.g. “COMP1927” or “John”)

Definitely prefer that  hash(“cat”,N)  ≠  hash(“dog”,N)
Prefer that  hash(“cat”,N)  ≠  hash(“act”,N)  ≠  hash(“tac”,N)

6 . 1

Hash Function Examples

Universal hashing function
int hash(char *key, int N)
{
int h = 0, a = 31415, b = 21783;
char *c;
for (c = key; *c != ‘\0’; c++) {
a = a*b % (N-1);
h = (a * h + *c) % N;
}
return h;
}

1
2
3
4
5
6
7
8
9
10

Hashing function (Postgresql dbms)
hash_any(unsigned char *k, register int keylen, int N)
{
register uint32 a, b, c, len;
// set up internal state
len = keylen;
a = b = 0x9e3779b9;
c = 3923095;
// handle most of the key, in 12-char chunks
while (len >= 12) {
a += (k[0] + (k[1] << 8) + (k[2] << 16) + (k[3] << 24)); b += (k[4] + (k[5] << 8) + (k[6] << 16) + (k[7] << 24)); c += (k[8] + (k[9] << 8) + (k[10] << 16) + (k[11] << 24)); mix(a, b, c); k += 12; len -= 12; } // collect any data from remaining bytes into a,b,c mix(a, b, c); return c % N; } #define mix(a,b,c) \ { \ a -= b; a -= c; a ^= (c>>13); \
b -= c; b -= a; b ^= (a<<8); \ c -= a; c -= b; c ^= (b>>13); \
a -= b; a -= c; a ^= (c>>12); \
b -= c; b -= a; b ^= (a<<16); \ c -= a; c -= b; c ^= (b>>5); \
a -= b; a -= c; a ^= (c>>3); \
b -= c; b -= a; b ^= (a<<10); \ c -= a; c -= b; c ^= (b>>15); \
}

1
2
3
4
5
6
7
8
9

10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

6 . 2

Problems with Hashing

In ideal scenarios, search cost in hash table is O(1).
 

Problems with hashing:

Hash function relies on size of array (⇒ can’t expand)
Changing size of array effectively changes the hash function
If change array size, then need to re-insert all Items

Items are stored in (effectively) random order
If size(KeySpace) size(IndexSpace),  collisions inevitable

Collision:   k ≠ j  &&  hash(k,N) = hash(j,N)
If nitems > nslots,  collisions inevitable

7

Hash Collisions

How do we deal with collisions?
 

Separate Chaining: allow multiple Items at a single array location
E.g. array of linked lists (but worst case is O(N))

Linear Probing: systematically compute new indexes until find a free slot
Need strategies for computing new indexes (aka probing)

Double Hashing: increase the size of the array
Needs a method to “adjust” hash()   (e.g. linear hashing)

8

Hash Collisions – Separate Chaining

Solve collisions by having multiple items per array entry.
Make each element the start of linked-list of Items.
All items in a given list have the same hash() value

9 . 1

Hash Collisions – Separate Chaining

Solve collisions by having multiple items per array entry.
Make each element the start of linked-list of Items.
All items in a given list have the same hash() value

9 . 2

Hash Collisions – Separate Chaining

typedef struct HashTabRep {
List *lists; // array of Lists of Items
int N; // # elements in array
int nitems; // # items stored in HashTable
} HashTabRep;

HashTable newHashTable(int N) {
HashTabRep *new = malloc(sizeof(HashTabRep));
assert(new != NULL);
new->lists = malloc(N*sizeof(List));
assert(new->lists != NULL);
for (int i = 0; i < N; i++) new->lists[i] = newList();
new->N = N; new->nitems = 0;
return new;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Item *HashGet(HashTable ht, Key k) {
int i = hash(k, ht->N);
return ListSearch(ht->lists[i], k);
}

void HashInsert(HashTable ht, Item it) {
Key k = key(it);
int i = hash(k, ht->N);
ListInsert(ht->lists[i], it);
}

void HashDelete(HashTable ht, Key k) {
int i = hash(k, ht->N);
ListDelete(ht->lists[i], k);
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

9 . 3

Hash Collisions – Separate Chaining

Cost analysis:

N array entries (slots), M stored items
Average list length L = M/N
Best case: all lists are same length L
Worst case: one list of length M   (h(k)=0 )
searching within a list of length n :

best: 1, worst: n, average: n/2  ⇒  O(n)
if good hash and M≤N, cost is 1
if good hash and M>N, cost is (M/N)/2

Complexity = O(M/N/2)
Ratio of items/slots is called   load α = M/N

 

9 . 4

Hash Collisions – Linear Probing

Collision resolution by finding a new location for Item

Hash indicates slot i  which is already used
Try next slot, then next, until we find a free slot
Insert item into available slot

 
Because the value at every index stores the original hash value, it’s

easy to tell when we’ve found it

10 . 1

Hash Collisions – Linear Probing

typedef struct HashTabRep {
Item **items; // array of pointers to Items
int N; // # elements in array
int nitems; // # items stored in HashTable
} HashTabRep;

HashTable newHashTable(int N)
{
HashTabRep *new = malloc(sizeof(HashTabRep));
assert(new != NULL);
new->items = malloc(N*sizeof(Item *));
assert(new->items != NULL);
for (int i = 0; i < N; i++) new->items[i] = NULL;
new->N = N; new->nitems = 0;
return new;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

void HashInsert(HashTable ht, Item it)
{
assert(ht->nitems < ht->N);
int N = ht->N;
Key k = key(it);
Item **a = ht->items;
int i = hash(k,N);
for (int j = 0; j < N; j++) { if (a[i] == NULL) break; if (equal(k,key(*(a[i])))) break; i = (i+1) % N; } if (a[i] == NULL) ht->nitems++;
if (a[i] != NULL) free(a[i]);
a[i] = copy(it);
}

Item *HashGet(HashTable ht, Key k)
{
int N = ht->N;
Item **a = ht->items;
int i = hash(k,N);
for (int j = 0; j < N; j++) { if (a[i] == NULL) break; if (equal(k,key(*(a[i])))) return a[i]; i = (i+1) % N; } return NULL; } 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Creation, insertion, search 10 . 2 Hash Collisions - Linear Probing void HashDelete(HashTable ht, Key k) { int N = ht->N;
Item *a = ht->items;
int i = hash(k,N);
for (int j = 0; j < N; j++) { if (a[i] == NULL) return; // k not in table if (equal(k,key(*(a[i])))) break; i = (i+1) % N; } free(a[i]); a[i] = NULL; ht->nitems–;
// clean up probe path
i = (i+1) % N;
while (a[i] != NULL) {
Item it = *(a[i]);
a[i] = NULL; // remove ‘it’
ht->nitems–;
HashInsert(ht, it); // insert ‘it’ again
i = (i+1) % N;
}
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

Deletion is tricky,
Need to ensure no NULL in middle of “probe path”

(i.e. previously relocated items moved to appropriate location)

10 . 3

Hash Collisions – Linear Probing

10 . 4

Hash Collisions – Linear Probing

Search cost analysis:

Cost to reach first Item is O(1)
Subsequent cost depends how much we need to scan
Affected by load α = M/N   (i.e. how “full” is the table)
Average cost for successful search = 0.5*(1 + 1/(1-α))
Average cost for unsuccessful search = 0.5*(1 + 1/(1-α)^2)

Example costs (assuming large table, e.g. N>100 ):

load (α) 0.50   0.67   0.75   0.90

search hit 1.5 2.0 3.0 5.5

search miss 2.5 5.0 8.5 55.5
 

Assumes reasonably uniform data and good hash function.

10 . 5

Hash Collisions – Double Hashing

Double hashing improves on linear probing:

By using an increment which …
Is based on a secondary hash of the key
Ensures that all elements are visited
(can be ensured by using an increment which is relatively prime to N)

Tends to eliminate clusters ⇒ shorter probe paths
To generate relatively prime

Set table size to prime e.g. N=127
Hash2() in range [1..N1] where N1 < 127 and prime 11 . 1 Hash Collisions - Double Hashing typedef struct HashTabRep { Item **items; // array of pointers to Items int N; // # elements in array int nitems; // # items stored in HashTable int nhash2; // second hash mod } HashTabRep; #define hash2(k,N2) (((k)%N2)+1) HashTable newHashTable(int N) { HashTabRep *new = malloc(sizeof(HashTabRep)); assert(new != NULL); new->items = malloc(N*sizeof(Item *));
assert(new->items != NULL);
for (int i = 0; i < N; i++) new->items[i] = NULL;
new->N = N; new->nitems = 0;
new->nhash2 = findSuitablePrime(N);
return new;
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

Item *HashGet(HashTable ht, Key k)
{
Item **a = ht->items;
int N = ht->N;
int i = hash(k,N);
int incr = hash2(k,ht->nhash2);
for (int j = 0, j < N; j++) { if (a[i] == NULL) break; // k not found if (equal(k,key(*(a[i]))) return a[i]; i = (i+incr) % N; } return NULL; } 1 2 3 4 5 6 7 8 9 10 11 12 13 void HashInsert(HashTable ht, Item it) { assert(ht->nitems < ht->N); // table full
Item **a = ht->items;
Key k = key(it);
int N = ht->N;
int i = hash(k,N);
int incr = hash2(k,ht->nhash2);
for (int j = 0, j < N; j++) { if (a[i] == NULL) break; if (equal(k,key(*(a[i])))) break; i = (i+incr) % N; } if (a[i] == NULL) ht->nitems++;
if (a[i] != NULL) free(a[i]);
a[i] = copy(it);
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

11 . 2

Hash Collisions – Double Hashing
   

Search cost analysis:

cost to reach first Item is O(1)
subsequent cost depends how much we need to scan
affected by  load α = M/N   (i.e. how “full” is the table)
average cost for successful search =
Average cost for unsuccessful search =

Costs for double hashing (assuming large table, e.g. N>100 ):
 

load (α) 0.5   0.67   0.75   0.90
search hit 1.4 1.6 1.8 2.6

search miss 1.5 2.0 3.0 5.5
Can be significantly better than linear probing

ln( )
α
1

1−α
1

1−α
1

11 . 3

Hash Collisions – Summary
   

Collision resolution approaches:

chaining: easy to implement, allows α > 1
linear probing: fast if α << 1, complex deletion double hashing: faster than linear probing, esp for α ≅ 1 Only chaining allows α > 1, but performance poor when α >> 1
 

For arrays, once M  exceeds initial choice of N,

Need to expand size of array (N)
Problem: hash function relies on N, so changing array size potentially
requires rebuiling whole table

12

Feedback

13