
Extensible and Linear Hashing

 Introduction
 Extensible hash indexes
 Linear hash indexes

4.1

 In a hash table a hash function maps search key values to array elements, which contain
▪ The data objects themselves – open addressing
▪ Linked lists containing data objects – separate chaining
▪ Let’s refer to the array elements as buckets
 Hash functions generate a value between 0 and B-1
▪ Where B is the number of buckets
▪ In an in-memory hash table a bucket is either a single array element or a linked list
▪ A record with search key K is stored in bucket h(K)
▪ A collision occurs if two keys are mapped to the same bucket
 Range searches are not supported by hash indexes
▪ A common in-memory hash function is h(K) = K % B
▪ Where B is prime
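A minimal sketch of the separate-chaining scheme described above, assuming an illustrative prime bucket count B = 13 and the hash function h(K) = K % B:

```python
# A minimal in-memory hash table using separate chaining.
# B is a prime number of buckets; h(K) = K % B, as described above.

B = 13
buckets = [[] for _ in range(B)]  # each bucket is a chain of (key, value) pairs

def insert(key, value):
    buckets[key % B].append((key, value))  # colliding keys chain in one bucket

def lookup(key):
    for k, v in buckets[key % B]:          # scan only the bucket h(K) selects
        if k == key:
            return v
    return None

insert(5, "a")
insert(18, "b")    # 18 % 13 == 5, so this collides with key 5
print(lookup(18))  # -> b, found by scanning bucket 5's chain
```

Note that a lookup only ever scans one chain, which is why equality searches are fast but range searches get no help from the index.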

 Buckets consist of blocks
▪ Choose B > |data entries| / blocking factor
▪ The blocking factor is the number of entries that fit on a page
▪ Leave some free space in blocks for future insertions
▪ The hash function is h(K) = K modulo B, where K is the search key
▪ Use h to find the bucket in which to insert or remove data entries
 To find a bucket
▪ Use a directory – an array of pointers to buckets, or
▪ Store the first block of each bucket consecutively
▪ Using a hash index requires one or two disk I/Os
 With repeated insertions buckets fill up
▪ Chain overflow pages to the first block of the bucket
[Figure: h(K) selects a bucket’s primary block, with overflow blocks chained to it]
▪ Overflow chains reduce the efficiency of the index for those buckets
▪ Combine overflow blocks on removals
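The overflow chaining described above can be sketched as follows. This is an assumed in-memory model of the disk layout: each block holds a fixed number of entries (the blocking factor), and a full bucket grows by chaining a new overflow block; each hop along the chain would cost one disk read.

```python
# A sketch of a bucket with chained overflow blocks (assumed layout).

BLOCK_SIZE = 3  # entries per block (the blocking factor)

class Block:
    def __init__(self):
        self.entries = []
        self.overflow = None      # pointer to the next overflow block

def insert(first_block, entry):
    b = first_block
    while len(b.entries) >= BLOCK_SIZE:  # walk to the first non-full block
        if b.overflow is None:
            b.overflow = Block()         # chain a new overflow block
        b = b.overflow
    b.entries.append(entry)

def search(first_block, key):
    b = first_block
    while b is not None:                 # each hop models one disk read
        if key in b.entries:
            return True
        b = b.overflow
    return False

bucket = Block()
for k in [5, 18, 31, 44]:                # the fourth entry overflows
    insert(bucket, k)
print(search(bucket, 44))                # found in the overflow block
```

The longer the chain, the more reads a search of that bucket needs, which is exactly the inefficiency the slide describes.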

 As overflow blocks are added a static hash index becomes less efficient
▪ The solution is to re-hash the index into more buckets
▪ Using another hash function to distribute the data entries
▪ This is time consuming
▪ And the index is unavailable during the process
 An alternative is to use dynamic hashing
▪ The index grows as necessary during use
▪ There are two versions of dynamic hashing
▪ Extensible hashing, and linear hashing
▪ Both schemes use a family of hash functions
▪ The range of the hash function is increased
▪ And additional buckets are created as necessary
[Figure: a static hash index accessed via h(K) is re-hashed into a larger index using a second hash function h2]

 Consider a hash function that returns the bit value of some attribute
▪ The bucket can be found using h = bit value modulo n
▪ where n is the number of buckets
▪ If n = 2^k the bucket can be determined by looking at the last k bits
▪ e.g. if n = 8 then k = 3, and the bit value 0100 1101 maps to bucket 5 (its last three bits are 101)
 The range of a hash function can be doubled by increasing k by one
▪ To increase the number of buckets in the index
[Figure: the bit value 0100 1101; its last one, two and three bits (1, 01 and 101) select the bucket when n = 2, 4 and 8 respectively]
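The last-k-bits rule above can be written directly: taking the bit value modulo n = 2^k is the same as masking off the right-most k bits.

```python
# Selecting a bucket from the last k bits of a hash value, where n = 2**k.

def bucket(bit_value: int, k: int) -> int:
    return bit_value & ((1 << k) - 1)  # keep only the right-most k bits

h = 0b01001101       # the bit value 0100 1101 from the example above
print(bucket(h, 3))  # n = 8 buckets: last 3 bits are 101 -> bucket 5
print(bucket(h, 4))  # doubling to n = 16: last 4 bits are 1101 -> bucket 13
```

Incrementing k by one doubles the range of the function, which is the growth step both dynamic hashing schemes rely on.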


 Extensible hashing uses a directory of pointers to buckets
▪ The directory is an array of pointers
▪ It is therefore relatively small and often fits on one page
▪ When the directory increases in size it doubles
▪ New buckets are only created as necessary
▪ i.e. when a bucket overflows
 The hash function computes a hash value
▪ A sequence of bits; only the right-most i bits are used to access the directory
▪ The directory has 2^i entries
▪ When the directory size doubles, i+1 bits of the hash value are used
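The mechanism above can be sketched in code. This is an assumed minimal implementation, not a production design: buckets hold three entries (as in the slides' examples), the directory is indexed by the right-most i bits, a full bucket with local depth j = i forces the directory to double, and a split assumes one redistribution suffices.

```python
# A sketch of extensible hashing, following the scheme in the slides.

BUCKET_CAPACITY = 3  # three data entries fit in one bucket (one disk page)

class Bucket:
    def __init__(self, depth):
        self.depth = depth                 # local depth j
        self.entries = []                  # hash values stored here

class ExtensibleIndex:
    def __init__(self):
        self.i = 1                         # global depth
        self.directory = [Bucket(1), Bucket(1)]  # 2**i pointers

    def _bucket(self, h):
        return self.directory[h & ((1 << self.i) - 1)]  # last i bits

    def insert(self, h):
        b = self._bucket(h)
        if len(b.entries) < BUCKET_CAPACITY:
            b.entries.append(h)
            return
        if b.depth == self.i:              # j == i: double the directory
            self.directory = self.directory + self.directory
            self.i += 1
        b.depth += 1                       # split the full bucket
        new = Bucket(b.depth)
        mask = 1 << (b.depth - 1)          # the newly significant bit
        old_entries, b.entries = b.entries, []
        for d in range(len(self.directory)):
            if self.directory[d] is b and d & mask:
                self.directory[d] = new    # repoint half the entries
        for e in old_entries + [h]:        # assumes one split suffices
            self._bucket(e).entries.append(e)

idx = ExtensibleIndex()
for h in [0b01000000, 0b01000010, 0b11001000,
          0b00001101, 0b01000011, 0b11011010]:
    idx.insert(h)
print(idx.i)  # -> 2: the last insert forced the directory to double
```

Run against the slides' worked example, the final state matches: bucket 00 holds {0100 0000, 1100 1000}, bucket 10 holds {0100 0010, 1101 1010}, and entries 01 and 11 still share one bucket of local depth 1.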
[Figure: an extensible hash index with global depth i = 2; the directory entries 00, 01, 10 and 11 point to four buckets holding the hash values {64, 12}, {1, 17, 5}, {6} and {31, 15} respectively, each with local depth 2]
▪ i – the number of bits of the hash value being used by the directory – the global depth
▪ j – the number of bits used to determine membership in a bucket – appears in the block header – the local depth
 The array index is the last two bits of the hash value
 Assume that three data entries fit in one bucket
▪ i.e. one disk page
 The values shown are the decimal equivalents of the hash values
▪ Not the search key values
 Only four index pages are currently required

insert record where h(K) = 1100 1000
[Figure: global depth i = 1; entry 0 points to the bucket {0100 0000, 0100 0010} and entry 1 to the bucket {0000 1101, 0100 0011}, both with local depth 1; the new record is placed in bucket 0, giving {0100 0000, 0100 0010, 1100 1000}]
▪ Follow the pointer in index 0 of the directory (as i = 1, only one bit of the hash value is being used)
▪ Only the last bit of the hash value is currently used to determine which blocks records are inserted into

insert record where h(K) = 1101 1010
[Figure: global depth i = 1; bucket 0 {0100 0000, 0100 0010, 1100 1000} is full; bucket 1 is {0000 1101, 0100 0011}; both have local depth j = 1]
▪ The block is full, so compare j to i
▪ j = i, so double the directory size to the entries 00, 01, 10 and 11
▪ Distribute the new pointers based on the previous index
▪ Split the block that was full

insert record where h(K) = 1101 1010 (continued)
[Figure: global depth i is now 2; entry 00 points to the bucket {0100 0000, 1100 1000} with local depth 2; entry 10 points to the new bucket {0100 0010, 1101 1010} with local depth 2; entries 01 and 11 both point to the unsplit bucket {0000 1101, 0100 0011} with local depth 1]
▪ Split the block that was full and increment its local depth j
▪ Use the last two bits of the hash value, as i = 2
▪ Adjust the pointers to the block that has split

insert record where h(K) = 1111 1101
[Figure: global depth i = 2; the record is placed in the shared bucket pointed to by entries 01 and 11, giving {0000 1101, 0100 0011, 1111 1101} with local depth 1; buckets 00 {0100 0000, 1100 1000} and 10 {0100 0010, 1101 1010} are unchanged, each with local depth 2]

insert record where h(K) = 1001 1001
[Figure: global depth i = 2; the record hashes to the shared bucket {0000 1101, 0100 0011, 1111 1101} with local depth 1, which is full]
▪ The block is full but j < i
▪ Split the block but don’t double the directory size
▪ Increment j from 1 to 2 and adjust the pointers to the new block
[Figure: entry 01 now points to the bucket {0000 1101, 1111 1101, 1001 1001} and entry 11 to the new bucket {0100 0011}, both with local depth 2]
insert record where h(K) = 0110 1101
[Figure: global depth i = 2; the record hashes to bucket 01 {0000 1101, 1111 1101, 1001 1001}, which is full and has local depth j = 2]
▪ The block is full, and j = i
▪ Double the directory to eight entries (i = 3), split the block and distribute its values
[Figure: global depth i = 3; entry 001 points to the bucket {1001 1001} and entry 101 to the bucket {0000 1101, 1111 1101, 0110 1101}, both with local depth 3; the remaining buckets keep local depth 2, with two directory entries pointing to each]
insert record where h(K) = 1110 1101?
[Figure: global depth i = 3; the record would hash to bucket 101 {0000 1101, 1111 1101, 0110 1101}, which is full and already has local depth j = i = 3]
 Removing an entry may result in an empty bucket
▪ Which may be merged with its counterpart
▪ The pointer entries in the directory are reset and
▪ The remaining bucket's local depth is decremented
▪ Which may result in the directory halving
▪ In practice these adjustments may not be performed
 If many entries have the same hash value across the entire range of bits overflow pages must be created
▪ The overflow pages are chained to the primary pages
▪ This can occur if the hash function is poor, or
▪ If there are many insertions with the same search key value (i.e. a skewed distribution)
 If the directory resides in main memory, performance is identical to static hashing
▪ One disk read to use the index and one to retrieve the record
▪ If the directory is not in memory, another read is required
▪ In practice, the directory is often in memory
 Many collisions create a large directory
▪ Reducing the chance that it is in main memory
▪ More likely if the number of records per block is small
 Increasing the directory size is relatively time consuming and interrupts access to the data file
▪ An alternative is to use linear hashing
[Figure: h(K) → directory look-up → read bucket → read record from file]
 Linear hashing is similar to extensible hashing
▪ But does not use a directory
▪ Primary blocks of buckets are stored in sequence like an array
▪ Bucket i is found by adding i to the address of the first bucket
 The number of buckets (n) is selected to maintain a maximum average occupancy
▪ When this is exceeded the next bucket is split
▪ Buckets are split in turn, not as a response to overflow
▪ Therefore overflow blocks are added to buckets as needed
▪ The average number of overflow blocks per bucket is less than 1
 Only the right-most i bits of h(K) are used to find buckets
▪ The number of bits used is ⌈log2 n⌉
 The right-most i bits of the hash value are used to map records to n buckets
▪ Values for the i bits range from 0 to 2^i - 1
▪ The value for n may be less than 2^i - 1
▪ e.g. if i = 3 there may be fewer than 8 buckets
 To find the bucket with the data entry for search key K
▪ Find the value m equal to the last i bits of h(K)
▪ If m < n go to bucket m
▪ If m ≥ n (i.e. n ≤ m < 2^i) then go to bucket m - 2^(i-1)
▪ i.e. change the left-most of the i bits of m to 0, or ignore the ith bit
insert record where h(K) = 1001 0100
▪ Place the record in bucket 100, as 4 (100) < n
insert record where h(K) = 0101 1110
▪ Using 3 bits the record is mapped to bucket 110, which does not exist
▪ So just use i - 1 bits (or subtract 2^(i-1) from 110) and place it in bucket 010
[Figure: i = 3, n = 6; buckets 000 to 101 exist]
 Buckets are periodically added to the index
▪ Whenever the ratio r / n (r records, n buckets) exceeds a threshold value a new bucket is created
▪ The threshold value is calculated to maintain the desired occupancy
▪ The bucket that is added to the index may not have any relationship to the bucket that was just inserted into
 When a bucket m is added, values in a related bucket are distributed between the two buckets
▪ The related bucket is bucket m - 2^(i-1)
▪ e.g. if m = 6 and i = 3 the related bucket is bucket 2 (from 6 - 2^2)
 When n > 2^i, i is incremented by one
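The bucket-finding rule above can be sketched directly: take m as the last i bits of h(K); if bucket m exists use it, otherwise drop the leading bit (equivalently, subtract 2^(i-1)).

```python
# A sketch of linear hashing's bucket-lookup rule.

def linear_bucket(h: int, i: int, n: int) -> int:
    m = h & ((1 << i) - 1)       # the right-most i bits of h(K)
    if m < n:
        return m                 # bucket m exists
    return m - (1 << (i - 1))    # change the left-most of the i bits to 0

# With i = 3 and n = 6 buckets, as in the examples above:
print(linear_bucket(0b10010100, 3, 6))  # last 3 bits 100 = 4 < 6 -> bucket 4
print(linear_bucket(0b01011110, 3, 6))  # last 3 bits 110 = 6 >= 6 -> bucket 2
```

Because n ≤ m < 2^i implies the high bit of m is set, subtracting 2^(i-1) always lands on a bucket that does exist.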

insert record where h(K) = 1100 1011
▪ The record maps to bucket 11, which does not exist, so insert it in bucket 01
▪ Bucket 01 {0000 1101, 0100 0011, 1001 0101} is full, so add an overflow block to the bucket
▪ r / n < 2.4, so no new bucket is created
[Figure: i = 2, n = 3, r = 7; buckets 00 {0100 0000, 0100 0100}, 01 {0000 1101, 0100 0011, 1001 0101} with overflow block {1100 1011}, and 10 {0100 0010}]
▪ Why 2.4? Index pages contain at most three entries and the maximum occupancy is 0.8, so the maximum r / n = 0.8 * 3 = 2.4
insert record where h(K) = 1000 0110
▪ Insert in bucket 10
▪ r / n = 2.667, so make a new bucket
▪ The next bucket is 11, so distribute the values in bucket 01
[Figure: i = 2, n = 4, r = 8; the new bucket 11 holds {0100 0011, 1100 1011} and bucket 01 holds {0000 1101, 1001 0101}; the overflow block is no longer needed]
insert record where h(K) = 1011 1110
▪ Insert in bucket 10
▪ r / n = 2.25, so don't make a new bucket
[Figure: i = 2, n = 4, r = 9; bucket 10 is now {0100 0010, 1000 0110, 1011 1110}]
insert record where h(K) = 0101 0110
▪ Insert in bucket 10, which is full, so add an overflow block
▪ r / n = 2.5, so make a new bucket
▪ The next bucket is 100, so distribute the values in bucket 00
▪ n = 2^i, so increase i to 3
[Figure: i = 3, n = 5, r = 10; bucket 000 holds {0100 0000} and the new bucket 100 holds {0100 0100}]
insert record where h(K) = 1101 1101
▪ The record maps to bucket 101, which does not exist, so insert in bucket 001
▪ r / n = 2.2, so don't make a new bucket
[Figure: i = 3, n = 5, r = 11; bucket 001 is now {0000 1101, 1001 0101, 1101 1101}]
insert record where h(K) = 1001 1100
▪ Insert in bucket 100
▪ r / n = 2.4, so don't make a new bucket
[Figure: i = 3, n = 5, r = 12; bucket 100 is now {0100 0100, 1001 1100}]
insert record where h(K) = 1000 1000
▪ Insert in bucket 000
▪ r / n = 2.6, so make a new bucket
▪ The new bucket is 101, so the values in bucket 001 are distributed between 001 and 101
▪ The next time a new bucket is created the values in bucket 010 will be distributed into it
[Figure: i = 3, n = 6, r = 13; bucket 000 holds {0100 0000, 1000 1000}; all of bucket 001's values end in 101, so the new bucket 101 holds {0000 1101, 1001 0101, 1101 1101} and 001 is left empty]
 Linear hashing does not require a directory
▪ But linear hashing may result in less space efficiency because buckets are split before they overflow
 Multiple collisions in one bucket in extensible hashing will result in a large directory
▪ Which may then not fit on one disk page
 Collisions in linear hashing lead to long overflow chains for the bucket with the collisions
▪ Requiring multiple disk reads for that bucket
▪ But no increase in the cost of accessing other buckets
 B+ tree indexes
▪ Support range queries
▪ i.e. inequality selections
▪ Allow searches on a prefix of the search key
▪ Including concatenated keys
▪ Usually 3 to 6 disk reads to access the leaf level
▪ And find the RID
 Hash indexes
▪ Do not support range queries
▪ Only support equalities
▪ Do not allow searches on a prefix of the search key
▪ Faster than a B+ tree
▪ Usually 1 or 2 disk reads to access a bucket
▪ A good choice for a primary index where the file is sorted on the search key