23/3/20
Hash-Based Indexes

Introduction

As for any index, there are 3 alternatives for data entries k*:
1. Data record with key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key k>
§ The choice is orthogonal to the indexing technique.
§ Hash-based indexes are the best for equality selections. They cannot support range searches.
§ Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.
Static Hashing
• The # of primary pages (index data entry pages) is fixed; they are allocated sequentially and never de-allocated; overflow pages are added if needed.
• h(key) mod M = bucket to which the data entry with key k belongs. (M = # of buckets)
[Figure: a key is passed through h, and h(key) mod M selects one of the primary bucket pages 0, 1, …, M-1; each primary bucket page may have a chain of overflow pages.]
Static Hashing (Contd.)

• Buckets contain data entries.
• The hash function works on the search key field of record r. It must distribute values over the range 0 … M-1.
  – h(key) = (a * key + b) usually works well.
  – a and b are constants; a lot is known about how to tune h.
• Long overflow chains can develop and degrade performance.
  – Keep pages about 80% full initially and/or re-hash periodically.
  – Extendible and Linear Hashing: dynamic techniques that fix this problem.
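The scheme above can be sketched in a few lines of code. This is my own illustrative sketch, not code from the slides; the page capacity of 4 and the constants a = 37, b = 11 are assumptions for illustration.

```python
CAPACITY = 4  # data entries per page (assumption for illustration)

class StaticHashIndex:
    """Static hashing: M primary buckets, each a chain of pages."""

    def __init__(self, m, a=37, b=11):
        self.m = m                                 # M = # of buckets, fixed
        self.a, self.b = a, b                      # h(key) = a*key + b
        self.buckets = [[[]] for _ in range(m)]    # one empty primary page each

    def _h(self, key):
        return (self.a * key + self.b) % self.m

    def insert(self, key):
        chain = self.buckets[self._h(key)]
        for page in chain:                         # first page with room wins
            if len(page) < CAPACITY:
                page.append(key)
                return
        chain.append([key])                        # chain grows: overflow page

    def search(self, key):
        # Equality search: hash once, then scan only that bucket's chain.
        return any(key in page for page in self.buckets[self._h(key)])
```

Note how `search` touches exactly one bucket chain: this is why equality selections are cheap, and why a range search (which would have to visit every bucket) is not supported.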
Extendible Hashing

• Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the # of buckets?
  – Reading and writing all pages is expensive!
  – Idea: Use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed!
  – The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow pages are needed!
  – The trick lies in how the hash function is adjusted!
Example

• The directory is an array of size 4 (GLOBAL DEPTH = 2).

[Figure: directory entries 00, 01, 10, 11 point to the data pages: Bucket A (local depth 2): 4* 12* 32* 16*; Bucket B (local depth 2): 1* 5* 21* 13*; Bucket C (local depth 2): 10*; Bucket D (local depth 2): 15* 7* 19*.]

• To find the bucket for r, take the last `global depth' # of bits of h(r); we denote r by h(r).
  – If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
v Insert: If the bucket is full, split it (allocate a new page, re-distribute).
v If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing the global depth with the local depth of the split bucket.)
Insert h(r) = 20 (Causes Doubling)

[Figure: 20 = binary 10100 hashes to directory entry 00, but Bucket A (4* 12* 32* 16*) is full. Bucket A splits: 32* and 16* stay in Bucket A, while 4*, 12*, and 20* go to Bucket A2, the `split image' of Bucket A. Since Bucket A's local depth equaled the global depth, the directory doubles from entries 00-11 to entries 000-111.]
After inserting h(r) = 20

[Figure: GLOBAL DEPTH = 3, directory entries 000-111. Bucket A (local depth 3): 32* 16*; Bucket B (local depth 2): 1* 5* 21* 13*; Bucket C (local depth 2): 10*; Bucket D (local depth 2): 15* 7* 19*; Bucket A2 (local depth 3, `split image' of Bucket A): 4* 12* 20*. Entries 000 and 100 distinguish A from A2; B, C, and D are each pointed to by two directory entries.]
Points to Note

• 20 = binary 10100. The last 2 bits (00) cannot tell us whether r belongs in A or A2; the last 3 bits can tell us which bucket.
  – Global depth of the directory: max # of bits needed to tell which bucket an entry belongs to.
  – Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
• When does a bucket split cause directory doubling?
  – Before the insert, local depth of the bucket = global depth. The insert causes the local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page. (Use of least significant bits enables efficient doubling via copying of the directory!)
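The "last `global depth' bits" rule above is a one-line bit operation. A minimal sketch (my own helper name, for illustration):

```python
def bucket_index(hash_value, depth):
    """Directory slot for an entry: the last `depth` bits of its hash value."""
    return hash_value & ((1 << depth) - 1)

# 20 = binary 10100: the last 2 bits are 00 (slot 0), but the last 3 bits
# are 100 (slot 4) -- which is exactly why depth 3 can separate A from A2.
```

So at global depth 2 both 4* (100) and 20* (10100) map to slot 0, while at depth 3 they map to slot 4, away from 32* and 16* in slot 0.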
Directory Doubling

[Figure: doubling a directory from global depth 2 to 3, illustrated for an entry 6* (6 = binary 110). Using least significant bits, the new directory 000-111 is simply two copies of the old directory 00-11 laid end to end. Using most significant bits, the old entries would have to be interleaved.]

Why use least significant bits in the directory?
§ It enables doubling the directory by simply copying it.
§ With most significant bits, it is hard to decide where to start.
§ The most significant bits are quite biased.
Comments on Extendible Hashing

• If the directory fits in memory, an equality search is answered with one disk access; else two.
  – A 100 MB file with 100 bytes/record and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
  – The directory grows in spurts and, if the distribution of hash values is skewed, the directory can grow large.
  – Multiple entries with the same hash value cause problems!
• Delete: If removing a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
Example: 2N, 2N+1, 2N+2, 2N+4, 2N+5 with N = 100; directory entries 00, 000, 0000, …, 000…000, 2101
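The splitting and directory-doubling machinery described above can be sketched compactly. This is my own illustrative code, not from the slides; it uses the hash value directly (identity hash) and a page capacity of 4 as assumptions, and it reproduces the A/A2 example from the earlier slides.

```python
CAPACITY = 4  # data entries per bucket page (assumption)

class Bucket:
    def __init__(self, local_depth):
        self.d = local_depth          # local depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.gd = 1                                # global depth
        self.dir = [Bucket(1), Bucket(1)]          # directory of size 2

    def _slot(self, h):
        return h & ((1 << self.gd) - 1)            # last `global depth' bits

    def search(self, h):
        return h in self.dir[self._slot(h)].entries

    def insert(self, h):
        b = self.dir[self._slot(h)]
        if len(b.entries) < CAPACITY:
            b.entries.append(h)
            return
        # Bucket full: split it. Double the directory only when the
        # bucket's local depth equals the global depth.
        if b.d == self.gd:
            self.dir = self.dir + self.dir         # double by copying
            self.gd += 1
        b.d += 1
        image = Bucket(b.d)                        # `split image' of b
        bit = 1 << (b.d - 1)                       # newly examined bit
        old = b.entries
        b.entries = [e for e in old if not e & bit]
        image.entries = [e for e in old if e & bit]
        # Fix directory pointers whose new bit selects the split image.
        for i in range(len(self.dir)):
            if self.dir[i] is b and i & bit:
                self.dir[i] = image
        self.insert(h)                             # retry (rarely splits again)
```

Loading the slide's example (4*, 12*, 32*, 16*, 1*, 5*, 21*, 13*, 10*, 15*, 7*, 19*) and then inserting 20* yields exactly the pictured result: global depth 3, Bucket A = {32*, 16*} in slot 000 and Bucket A2 = {4*, 12*, 20*} in slot 100.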
Linear Hashing

§ This is another dynamic hashing scheme, an alternative to Extendible Hashing.
§ LH handles the problem of long overflow chains without using a directory, and handles duplicates.
§ Idea: Use a family of hash functions h0, h1, h2, …
  – h_i(key) = h(key) mod (2^i * N); N = initial # of buckets
  – h is some hash function (its range is not just 0 to N-1)
  – If N = 2^d0 for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
  – h_{i+1} doubles the range of h_i (similar to directory doubling).
Linear Hashing (Contd.)

• The directory is avoided in LH by using overflow pages and by choosing the bucket to split round-robin.
  – Splitting proceeds in `rounds'. A round ends when all N_R initial (for round R) buckets have been split. Buckets 0 to Next-1 have been split; Next to N_R have yet to be split.
  – The current round number is Level.
  – Search: To find the bucket for data entry r, compute h_Level(r):
    • If h_Level(r) is in the range `Next to N_R', r belongs there.
    • Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_Level+1(r) to find out.
Overview of LH File

In the middle of a round:

[Figure: Buckets 0 to Next-1 were split in this round; if h_Level(search key value) falls in this range, we must use h_Level+1(search key value) to decide if the entry is in a `split image' bucket. Next points to the bucket to be split. The buckets from Next up to N_R existed at the beginning of this round; this is the range of h_Level. Beyond them come the `split image' buckets, created (through splitting of other buckets) in this round.]
Linear Hashing (Contd.)

§ Insert: Find the bucket by applying h_Level / h_Level+1.
§ If the bucket to insert into is full:
  § Add an overflow page and insert the data entry.
  § (Maybe) Split the Next bucket and increment Next.
§ Any criterion can be chosen to `trigger' a split; usually, a split is triggered when a new overflow page is created.
§ Since buckets are split round-robin, long overflow chains don't develop!
§ Doubling of the directory in Extendible Hashing is similar; here the switching of hash functions is implicit in how the # of bits examined is increased.
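The insert-and-split procedure above can be sketched as follows. This is my own illustrative code (identity hash, page capacity 4, split triggered whenever an insert would need an overflow page are all assumptions); each bucket is modeled as one flat list covering the primary page plus its overflow chain.

```python
CAPACITY = 4  # data entries per primary page (assumption)

class LinearHash:
    def __init__(self, n0=4):
        self.n0 = n0                   # N: initial # of buckets
        self.level = 0                 # current round number (Level)
        self.next = 0                  # next bucket to split
        self.buckets = [[] for _ in range(n0)]   # primary page + overflow chain

    def _n_level(self):
        return self.n0 * (2 ** self.level)       # N_R for this round

    def _bucket(self, h):
        b = h % self._n_level()                   # h_Level
        if b < self.next:                         # already split this round
            b = h % (2 * self._n_level())         # h_Level+1
        return b

    def insert(self, h):
        b = self._bucket(h)
        trigger = len(self.buckets[b]) >= CAPACITY   # would create an overflow
        self.buckets[b].append(h)
        if trigger:
            self._split()

    def _split(self):
        n = self._n_level()
        old = self.buckets[self.next]
        self.buckets.append([])                   # split image: bucket Next + N_R
        keep, image = [], self.buckets[-1]
        for e in old:                             # redistribute using h_Level+1
            (keep if e % (2 * n) == self.next else image).append(e)
        self.buckets[self.next] = keep
        self.next += 1
        if self.next == n:                        # all N_R buckets split: round over
            self.level += 1
            self.next = 0
```

Run on the slide example (inserting 43*, 37*, 29*, 22*, 66*, 34*, 50* into the pictured initial file), this reproduces the pictured end-of-round state: Level = 1, Next = 0, with 43*, 35*, 11* in bucket 011 and 31*, 7* in bucket 111.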
Example of Linear Hashing: Insert 43*

[Figure: Level = 0, N = 4. Initial primary pages (the actual contents of the linear hashed file), indexed by h0: 00: 32* 44* 36*; 01: 9* 25* 5*; 10: 14* 18* 10* 30*; 11: 31* 35* 7* 11*. A data entry r with h(r) = 5 goes to the primary bucket page for 01. Inserting 43* (h0 = 11) finds bucket 11 full, so an overflow page holding 43* is added and bucket Next = 0 is split: 32* stays in bucket 000, while 44* and 36* move to the new bucket 100. Next becomes 1.]
Insert 37*, then 29*

[Figure: Level = 0. Inserting 37* (h0 = 01) fits in bucket 01 without a split: 9* 25* 5* 37*. Inserting 29* (h0 = 01) finds bucket 01 full, so bucket Next = 1 is split: 9* and 25* stay in bucket 001, while 5*, 37*, 29* move to the new bucket 101. Next becomes 2. Bucket 11 still has its overflow page holding 43*.]
Insert 22*, 66*, 34*

[Figure: Level = 0. 22*, 66*, and 34* all have h0 = 10. The first of them overflows bucket 10 and triggers the split of bucket Next = 2: the entries with h1 = 010 (66* 18* 10* 34*) end up in bucket 010, while those with h1 = 110 (14* 30* 22*) go to the new bucket 110. Next becomes 3.]
Insert 50*

[Figure: 50* has h0 = 10 < Next, so h1 = 010 applies; bucket 010 (66* 18* 10* 34*) is full, so an overflow page holding 50* is added and the last unsplit bucket, Next = 3, is split: 43*, 35*, 11* go to bucket 011 and 31*, 7* to the new bucket 111. All N = 4 original buckets have now been split, so the round ends: Level becomes 1 and Next resets to 0. Final primary pages: 000: 32*; 001: 9* 25*; 010: 66* 18* 10* 34* (overflow: 50*); 011: 43* 35* 11*; 100: 44* 36*; 101: 5* 37* 29*; 110: 14* 30* 22*; 111: 31* 7*.]
LH Described as a Variant of EH

• The two schemes are actually quite similar:
  – Begin with an EH index where the directory has N elements.
  – Use overflow pages; split buckets round-robin.
  – The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, … are the same. So, we need only create directory element N, which now differs from element 0.
• When bucket 1 splits, create directory element N+1, etc.
• So, the directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i'th page is easy), we don't actually need a directory at all!
Summary

• Hash-based indexes: best for equality searches; cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
  – A directory keeps track of the buckets; it doubles periodically.
  – The directory can get large with skewed data, and there is an additional I/O if it does not fit in main memory.
Summary (Contd.)

§ Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
  § Overflow chains are not likely to be long.
  § Duplicates are handled easily.
  § Space utilization could be lower than in Extendible Hashing, since splits are not concentrated on `dense' data areas.
    § The criterion for triggering splits can be tuned to trade slightly longer chains for better space utilization.
§ For hash-based indexes, a skewed data distribution is one in which the hash values of the data entries are not uniformly distributed!