Extensible and Linear
Introduction
Extensible hash indexes
Linear hash indexes
4.1
In a hash table a hash function maps search key values to array elements, which contain
▪ The data objects themselves – open addressing
▪ Linked lists containing data objects – separate chaining
▪ Let's refer to the array elements as buckets
Hash functions generate a value between 0 and B-1
▪ Where B is the number of buckets
▪ In an in-memory hash table a bucket is either a single array element or a linked list
▪ A record with search key K is stored in bucket h(K)
▪ A collision occurs if two keys are mapped to the same bucket
Range searches are not supported by hash indexes
A common in-memory hash function is K % B
Where B is prime
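As a minimal sketch of the in-memory scheme above (names are illustrative, not from the slides), separate chaining with h(K) = K % B looks like this:

```python
# A minimal in-memory separate-chaining table; a sketch, not the slides' code.
B = 13   # number of buckets; prime, to spread keys more evenly

def h(key: int) -> int:
    """Hash function mapping a search key to a bucket in 0..B-1."""
    return key % B

buckets = [[] for _ in range(B)]   # each bucket chains its (key, record) pairs

def insert(key, record):
    buckets[h(key)].append((key, record))

def lookup(key):
    # A collision puts two keys in the same bucket, so scan the chain for a match.
    return [rec for k, rec in buckets[h(key)] if k == key]

insert(7, "r1")
insert(20, "r2")     # 20 % 13 == 7, so this collides with key 7
print(lookup(20))    # ['r2']
```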
Buckets consist of blocks
▪ Choose B > |data entries| / blocking factor
▪ The blocking factor is the number of entries that can fit on a page
▪ Leave some free space in blocks for future insertions
▪ Hash function h(K) = K modulo B, where K is the search key
▪ Use h to find bucket to insert or remove data entries
To find a bucket
▪ Use a directory – an array of pointers to buckets, or
▪ Store the first block of each bucket consecutively
▪ Using a hash index requires one or two disk I/Os
With repeated insertions buckets fill up
▪ Chain overflow pages to the first block of the bucket
[Figure: h(K) selects a bucket whose first block is chained to overflow blocks]
▪ Reducing the efficiency of the index for those buckets
▪ Combine overflow blocks on removals
As overflow blocks are added a static hash index becomes less efficient
▪ The solution is to re-hash the index to more buckets
▪ Using another hash function to distribute data entries
▪ This is time consuming
▪ And the index is unavailable during the process
An alternative is to use dynamic hashing
▪ The index grows as necessary during use
▪ There are two versions of dynamic hashing
▪ Extensible hashing, and linear hashing
▪ Both systems use a family of hash functions
▪ The range of the hash function is increased
▪ And additional buckets are created as necessary
[Figure: a static hash index accessed via h(K) is re-hashed into a larger index using a second hash function h2]
Consider a function that returns the bit value of some attribute
▪ The bucket can be found using
▪ h = bit value modulo n
▪ where n is the number of buckets
▪ If n = 2^k the bucket can be determined by looking at the last k bits
▪ e.g. if n = 8, then k = 3 and the bit value 0100 1101 maps to bucket 5 (its last three bits are 101)
The range of a hash function can be doubled by increasing k by one
▪ To increase the number of buckets in the index
[Figure: bit value 0100 1101 with bit weights; its last one, two, and three bits are 1, 01, and 101]
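Extracting the last k bits, and doubling the range by using one more bit, can be sketched as (a hypothetical helper, not from the slides):

```python
def bucket(bit_value: int, k: int) -> int:
    """Return the bucket given by the last k bits of the value (n = 2**k buckets)."""
    return bit_value & ((1 << k) - 1)

v = 0b01001101          # the slide's example bit value
print(bucket(v, 3))     # 5: the last three bits are 101
print(bucket(v, 4))     # 13: one more bit doubles the range to 16 buckets
```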
Extensible hashing uses a directory of buckets
▪ The directory is an array of pointers to buckets
▪ It is therefore relatively small and often fits on one page
▪ When the directory increases in size it doubles
▪ New buckets are only created as necessary
▪ When a bucket overflows
The hash function computes a hash value
▪ A sequence of bits; only the right-most i bits are used to access the directory
▪ The directory has 2^i entries
▪ When the directory size doubles, i+1 bits of the hash value are used
i – the number of bits of the hash value used by the directory – the global depth
j – the number of bits used to determine membership in a bucket – stored in the block header – the local depth
Assume that three data entries fit in one bucket
▪ i.e. one disk page
The values shown in the example are the decimal equivalents of the hash values
▪ Not the search key values
[Figure: a directory with global depth i = 2 – the array index is the last two bits of the hash value – where entry 00 → {64, 12}, 01 → {1, 17, 5}, 10 → {10, 6}, 11 → {31, 15}, each bucket with local depth j = 2; only four index pages are currently required]
insert record where h(K) = 1100 1000
▪ global depth i = 1, so only the last bit of the hash value is used to determine which block the record is inserted into
▪ follow the pointer in entry 0 of the directory
[Figure: directory entries 0 and 1; bucket 0 = {0100 0000, 0100 0010, 1100 1000}, bucket 1 = {0000 1101, 0100 0011}, both with local depth j = 1]
insert record where h(K) = 1101 1010
▪ global depth i = 1; the record maps to bucket 0, but the block is full
▪ compare j to i: j = i, so double the directory size
▪ split the block that was full and distribute the new pointers based on the previous index
insert record where h(K) = 1101 1010 (continued)
▪ global depth is now i = 2 (was 1), so the last two bits of the hash value are used
▪ the block that was full is split and its local depth j is incremented to 2
▪ pointers to the block that has split are adjusted
[Figure: bucket 00 = {0100 0000, 1100 1000} with j = 2, bucket 10 = {0100 0010, 1101 1010} with j = 2, and the bucket {0000 1101, 0100 0011} with j = 1 shared by entries 01 and 11]
insert record where h(K) = 1111 1101
▪ global depth i = 2; entries 01 and 11 both point to the bucket with local depth j = 1
▪ the record fits: that bucket becomes {0000 1101, 0100 0011, 1111 1101}
insert record where h(K) = 1001 1001
▪ the record maps to the bucket {0000 1101, 0100 0011, 1111 1101}, which is full, but its local depth j = 1 < i
▪ split the block but don't double the directory size
insert record where h(K) = 1001 1001 (continued)
▪ split the block but don't double the directory size
▪ increment j from 1 to 2 and adjust the pointers to the new block
[Figure: bucket 01 = {0000 1101, 1111 1101, 1001 1001} and bucket 11 = {0100 0011}, both now with j = 2]
insert record where h(K) = 0110 1101
▪ the record maps to bucket 01 = {0000 1101, 1111 1101, 1001 1001}
▪ the block is full, and j = i, so the directory must double
insert record where h(K) = 0110 1101 (continued)
▪ the directory doubles: global depth i = 3 (was 2)
▪ the block that was full is split and its values distributed, incrementing j to 3
[Figure: bucket 001 = {1001 1001} and bucket 101 = {0000 1101, 1111 1101, 0110 1101}, both with j = 3; the remaining buckets keep j = 2 and are each shared by two directory entries]
insert record where h(K) = 1110 1101?
▪ the record maps to bucket 101 = {0000 1101, 1111 1101, 0110 1101}, which is full, and its local depth j = i = 3
▪ so the directory must double again, to global depth i = 4, before the block can be split
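The split-or-double logic traced through these examples can be sketched as follows. This is a minimal illustration; the class and method names are my own, not the slides' code, and it assumes splits eventually separate values (many identical hash values would need overflow chaining instead):

```python
# A sketch of extensible hashing insertion; names are illustrative.
BUCKET_CAP = 3                          # three data entries per bucket, as in the example

class Bucket:
    def __init__(self, depth):
        self.j = depth                  # local depth
        self.items = []                 # hash values stored in the bucket

class ExtensibleIndex:
    def __init__(self):
        self.i = 1                      # global depth
        self.dir = [Bucket(1), Bucket(1)]   # 2**i pointers

    def insert(self, hval):
        b = self.dir[hval & ((1 << self.i) - 1)]   # right-most i bits index the directory
        if len(b.items) < BUCKET_CAP:
            b.items.append(hval)
            return
        if b.j == self.i:               # j = i: double the directory first
            self.dir = self.dir + self.dir
            self.i += 1
        self._split(b)                  # now j < i, so split without doubling
        self.insert(hval)               # retry; assumes the split separated some values

    def _split(self, b):
        b.j += 1                        # increment the local depth
        new = Bucket(b.j)
        old, b.items = b.items, []
        for hv in old:                  # redistribute on the newly used j-th bit
            (new.items if (hv >> (b.j - 1)) & 1 else b.items).append(hv)
        for d in range(len(self.dir)):  # adjust pointers to the block that split
            if self.dir[d] is b and (d >> (b.j - 1)) & 1:
                self.dir[d] = new
```

Replaying the slides' inserts (0100 0000, 0100 0010, 0000 1101, 0100 0011, then 1100 1000 and 1101 1010) doubles the directory to i = 2 and splits bucket 0, matching the figures.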
Removing an entry may result in an empty bucket
▪ Which may be merged with its counterpart
▪ The pointer entries in the directory are reset and
▪ The remaining bucket's local depth is decremented
▪ Which may result in the directory halving
▪ In practice these adjustments may not be performed
If many entries have the same hash value across the entire range of bits, overflow pages must be created
▪ The overflow pages are chained to the primary pages
▪ This can occur if the hash function is poor, or
▪ If there are many insertions with the same search key value (i.e. a skewed distribution)
If the directory resides in main memory, performance is identical to static hashing
▪ One disk read to use index and one to retrieve record
▪ If the directory is not in memory, another read is required
▪ In practice, the directory is often in memory
Many collisions create a large directory
▪ Reducing the chance that it is in main memory
▪ More likely if the number of records per block is small
[Figure: h(K) → directory look-up → read bucket → read record from file]
Increasing the directory size is relatively time consuming and interrupts access to the data file
▪ An alternative is to use linear hashing
Linear hashing is similar to extensible hashing
▪ But does not use a directory
▪ Primary blocks of buckets are stored in sequence like an array
▪ Bucket i is found by adding i to the address of the first bucket
The number of buckets (n) is selected to maintain a maximum average occupancy
▪ When this is exceeded the next bucket is split
▪ Buckets are split in turn, not as a response to overflow
▪ Therefore overflow blocks are added to buckets as needed
▪ The average number of overflow blocks per bucket is less than 1
Only the right-most i bits of h(K) are used to find buckets
▪ The number of bits used is i = ⌈log2 n⌉
The right-most i bits of the hash value are used to map records to n buckets
▪ Values for the i bits range from 0 to 2^i - 1
▪ The value for n may be less than 2^i
▪ e.g. if i = 3 there may be less than 8 buckets
To find the bucket with the data entry for search key K
▪ Find the value m, where m is equal to the last i bits of h(K)
▪ If m < n go to bucket m
▪ If m ≥ n (i.e. n ≤ m < 2^i) then go to bucket m - 2^(i-1)
▪ i.e. change the left-most bit of m to 0
insert record where h(K) = 1001 0100
▪ place the record in bucket 100, as 4 (100) < n
insert record where h(K) = 0101 1110
▪ using 3 bits the record is mapped to bucket 110, which does not exist
▪ so just use i-1 bits (or, equivalently, subtract 2^(i-1) from 110, i.e. ignore the i-th bit)
[Figure: i = 3, n = 6 buckets numbered 000 to 101]
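The look-up rule above can be sketched as (a hypothetical helper, not the slides' code):

```python
import math

def linear_bucket(hval: int, n: int) -> int:
    """Map a hash value to one of n buckets in a linear hash index."""
    i = max(1, math.ceil(math.log2(n)))   # number of bits in use
    m = hval & ((1 << i) - 1)             # m = last i bits of h(K)
    if m < n:
        return m
    return m - (1 << (i - 1))             # bucket m doesn't exist: clear its left-most bit

print(linear_bucket(0b10010100, 6))  # 4: last 3 bits are 100, and 4 < n
print(linear_bucket(0b01011110, 6))  # 2: last 3 bits are 110 = 6 >= n, so 6 - 4 = 2
```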
Buckets are periodically added to the index
▪ Whenever the ratio r / n (r records, n buckets) exceeds a threshold value a new bucket is created
▪ The threshold value is calculated to maintain the desired occupancy
▪ The bucket that is added to the index may not have any relationship to the bucket that was just inserted into
When a bucket m is added, values in a related bucket are distributed between the two buckets
▪ The related bucket is bucket m - 2^(i-1)
▪ e.g. if m = 6 and i = 3 the related bucket is bucket 2 (from 6 - 2^2)
When n > 2^i, i is incremented by one
insert record where h(K) = 1100 1011
▪ the last two bits map to bucket 11, which does not exist, so insert in bucket 01
▪ the bucket is full, so add an overflow block to it
▪ r / n = 7 / 3 < 2.4, so no new bucket is created
▪ why 2.4? index pages contain at most three entries and the max occupancy is 0.8, so max r / n = 0.8 * 3 = 2.4
[Figure: i = 2, n = 3, r = 7 (was 6); bucket 00 = {0100 0000, 0100 0100}, bucket 01 = {0000 1101, 0100 0011, 1001 0101} with overflow {1100 1011}, bucket 10 = {0100 0010}]
insert record where h(K) = 1000 0110
▪ insert in bucket 10
▪ r / n = 8 / 3 = 2.667 > 2.4, so make a new bucket
▪ the next bucket is 11, so distribute the values in bucket 01 between buckets 01 and 11
[Figure: i = 2, n = 4 (was 3), r = 8 (was 7); bucket 01 = {0000 1101, 1001 0101}, new bucket 11 = {0100 0011, 1100 1011}]
insert record where h(K) = 1011 1110
▪ insert in bucket 10
▪ r / n = 9 / 4 = 2.25, so don't make a new bucket
[Figure: i = 2, n = 4, r = 9 (was 8)]
insert record where h(K) = 0101 0110
▪ insert in bucket 10
▪ r / n = 10 / 4 = 2.5 > 2.4, so make a new bucket
▪ n = 2^i, so adding a bucket increases i to 3
▪ the next bucket is 100, so distribute the values in bucket 00 between buckets 000 and 100
[Figure: i = 3 (was 2), n = 5 (was 4), r = 10 (was 9); bucket 000 = {0100 0000}, new bucket 100 = {0100 0100}]
insert record where h(K) = 1101 1101
▪ the last three bits map to bucket 101, but the largest bucket number n - 1 (100) < 101, so insert in bucket 001 (101 - 2^2)
▪ r / n = 11 / 5 = 2.2, so don't make a new bucket
[Figure: i = 3, n = 5, r = 11 (was 10); bucket 001 = {0000 1101, 1001 0101, 1101 1101}]
insert record where h(K) = 1001 1100
▪ insert in bucket 100
▪ r / n = 12 / 5 = 2.4, which does not exceed the threshold, so don't make a new bucket
[Figure: i = 3, n = 5, r = 12 (was 11); bucket 100 = {0100 0100, 1001 1100}]
insert record where h(K) = 1000 1000
▪ insert in bucket 000
▪ r / n = 13 / 5 = 2.6 > 2.4, so make a new bucket
▪ the new bucket is 101, so distribute the values in bucket 001 between buckets 001 and 101
▪ the next time a new bucket is created the values in bucket 010 will be distributed into it
[Figure: i = 3, n = 6 (was 5), r = 13; bucket 000 = {0100 0000, 1000 1000}, new bucket 101 = {0000 1101, 1001 0101, 1101 1101}]
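The growth rule traced in these examples can be sketched as follows. This is an illustrative Python sketch (the class and names are mine, not the slides'), with overflow entries simply extending a bucket's list:

```python
# A sketch of linear hashing growth; names are illustrative.
BUCKET_CAP = 3      # primary blocks hold three entries
THRESHOLD = 2.4     # max occupancy 0.8 * 3 entries, i.e. max r / n = 2.4

class LinearIndex:
    def __init__(self):
        self.i = 2                      # bits of the hash value in use
        self.buckets = [[], [], []]     # n = 3; overflow entries just extend a list here
        self.r = 0                      # number of records

    def bucket_of(self, hval):
        m = hval & ((1 << self.i) - 1)      # last i bits
        if m < len(self.buckets):
            return m
        return m - (1 << (self.i - 1))      # bucket doesn't exist: clear left-most bit

    def insert(self, hval):
        self.buckets[self.bucket_of(hval)].append(hval)
        self.r += 1
        if self.r / len(self.buckets) > THRESHOLD:
            self._grow()

    def _grow(self):
        n = len(self.buckets)           # the new bucket's number
        if n >= (1 << self.i):
            self.i += 1                 # n would exceed 2**i: use one more bit
        self.buckets.append([])
        src = n - (1 << (self.i - 1))   # related bucket: n with its left-most bit cleared
        moved, kept = [], []
        for hv in self.buckets[src]:    # redistribute on the last i bits
            (moved if hv & ((1 << self.i) - 1) == n else kept).append(hv)
        self.buckets[src], self.buckets[n] = kept, moved
```

Seeding the index with the slides' starting state and replaying the inserts reproduces the traced growth: a new bucket 11 first, then i = 3 and a new bucket 100.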
Linear hashing does not require a directory
▪ But linear hashing may result in less space efficiency because buckets are split before they overflow
Multiple collisions in one bucket in extensible hashing will result in a large directory
▪ Which may then not fit on one disk page
Collisions in linear hashing lead to long overflow chains for the bucket with the collisions
▪ Requiring multiple disk reads for that bucket
▪ But no increase in the cost of accessing other buckets
B+ Tree indices
▪ Supports range queries
▪ i.e. inequality selections
▪ Allows searches on a prefix of the search key
▪ Includes concatenated keys
▪ Usually 3 to 6 disk reads to access leaf level
▪ And find RID
▪ Good choice for primary index where the file is sorted on the search key
Hash indices
▪ Does not support range queries
▪ Only supports equalities
▪ Does not allow searches on a search key prefix
▪ Faster than B+ tree
▪ Usually 1 or 2 disk reads to access bucket