Hash-Based Indexes
Introduction
As for any index, 3 alternatives for data entries k*:
1. Data record with key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key k>
§ Choice orthogonal to the indexing technique
§ Hash-based indexes are the best for equality
selections. Cannot support range searches.
§ Static and dynamic hashing techniques exist;
trade-offs similar to ISAM vs. B+ trees.
Static Hashing
• # primary pages (index data entry pages) is fixed, allocated sequentially, and never de-allocated; overflow pages are added if needed.
• h(k) mod M = bucket to which the data entry with key k belongs. (M = # of buckets)
[Figure: static hashing. key → h → h(key) mod N selects one of the primary bucket pages 0 … N-1; overflow pages are chained off full primary buckets.]
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on the search key field of record r; it must distribute values over the range 0 … M-1.
  – h(key) = (a * key + b) usually works well.
  – a and b are constants; a lot is known about how to tune h.
• Long overflow chains can develop and degrade performance (see the sketch below).
  – Keep pages only 80% full initially and/or re-hash periodically.
  – Extendible and Linear Hashing: dynamic techniques to fix this problem.
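A minimal Python sketch of the static scheme, assuming illustrative values for M, the page capacity, and the constants a and b:

```python
M = 4            # number of primary buckets, fixed at creation
CAPACITY = 4     # data entries per page (illustrative)
A, B = 31, 7     # constants for h(key) = a * key + b (illustrative)

# Each bucket is a chain of pages; page 0 is primary, the rest overflow.
buckets = [[[]] for _ in range(M)]

def h(key: int) -> int:
    return A * key + B

def insert(key: int) -> None:
    pages = buckets[h(key) % M]        # bucket = h(key) mod M
    if len(pages[-1]) == CAPACITY:     # last page in the chain is full:
        pages.append([])               # chain a new overflow page
    pages[-1].append(key)

def search(key: int) -> bool:
    # Equality search: may have to scan the entire overflow chain.
    return any(key in page for page in buckets[h(key) % M])
```

Because M never changes, many keys colliding on one bucket make that chain, and hence every search on it, arbitrarily long.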
Extendible Hashing
• Situation: Bucket (primary page) becomes full.
Why not re-organize file by doubling # of
buckets?
– Reading and writing all pages is expensive!
– Idea: Use directory of pointers to buckets, double #
of buckets by doubling the directory, splitting just
the bucket that overflowed!
– Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow pages needed!
– Trick lies in how hash function is adjusted!
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r). (See the lookup sketch below.)
  – If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
v Insert: If the bucket is full, split it (allocate a new page, re-distribute).
v If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

[Figure: GLOBAL DEPTH = 2; directory entries 00, 01, 10, 11 point to data pages, each with LOCAL DEPTH = 2: Bucket A = {4*, 12*, 32*, 16*}, Bucket B = {1*, 5*, 21*, 13*}, Bucket C = {10*}, Bucket D = {15*, 7*, 19*}.]
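A minimal sketch of the directory lookup just described; the directory representation is an assumption for illustration:

```python
def lookup_bucket(directory, global_depth: int, h_r: int):
    # Index the directory with the last `global depth' bits of h(r).
    index = h_r & ((1 << global_depth) - 1)
    return directory[index]

# With global_depth = 2 and h(r) = 5 = 0b101, index = 0b01 = 1:
# the bucket pointed to by directory entry 01 (Bucket B above).
```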
Insert h(r)=20 (Causes Doubling)
[Figure: inserting 20* (10100) into full Bucket A forces a split. Bucket A's entries plus 20* are redistributed on their last 3 bits: 32* and 16* (…000) stay in Bucket A, while 4*, 12*, 20* (…100) go to a new page, Bucket A2 (the `split image' of Bucket A). Because Bucket A's local depth equalled the global depth (2), the directory must also double, from entries 00-11 to entries 000-111.]
After inserting h(r)=20
[Figure: state after the insert. GLOBAL DEPTH = 3; directory entries 000-111. Bucket A = {32*, 16*} and its split image Bucket A2 = {4*, 12*, 20*} have LOCAL DEPTH = 3; Buckets B = {1*, 5*, 21*, 13*}, C = {10*}, and D = {15*, 7*, 19*} keep LOCAL DEPTH = 2, so two directory entries point to each of them.]
Points to Note
• 20 = binary 10100. The last 2 bits (00) cannot tell us whether r belongs in A or A2; the last 3 bits can.
  – Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
  – Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
• When does a bucket split cause directory doubling?
  – Before the insert, local depth of the bucket = global depth. The insert causes local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page. (Use of least significant bits enables efficient doubling via copying of the directory; see the sketch below.)
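A sketch of this insert/split logic, under assumed bucket capacity and in-memory structures; it omits overflow handling for duplicates and the rare repeated split:

```python
from dataclasses import dataclass, field

CAPACITY = 4  # data entries per bucket page (illustrative)

@dataclass
class Bucket:
    local_depth: int
    entries: list = field(default_factory=list)

@dataclass
class ExtendibleIndex:
    global_depth: int
    directory: list  # 2**global_depth references to Bucket objects

def insert(idx: ExtendibleIndex, h_r: int) -> None:
    bucket = idx.directory[h_r & ((1 << idx.global_depth) - 1)]
    if len(bucket.entries) < CAPACITY:
        bucket.entries.append(h_r)
        return
    if bucket.local_depth == idx.global_depth:
        # Double the directory by copying it over; least-significant-bit
        # indexing is what makes doubling a plain copy.
        idx.directory += idx.directory[:]
        idx.global_depth += 1
    # Split only the overflowed bucket, on one more bit.
    bucket.local_depth += 1
    bit = 1 << (bucket.local_depth - 1)
    image = Bucket(bucket.local_depth)
    old = bucket.entries + [h_r]
    bucket.entries = [e for e in old if not e & bit]
    image.entries = [e for e in old if e & bit]
    # `Fix' the directory pointers that should go to the split image.
    for i, b in enumerate(idx.directory):
        if b is bucket and i & bit:
            idx.directory[i] = image
    # (A real implementation would re-check for overflow and repeat.)
```

Running this on the example state with h(r) = 20 reproduces the split of Bucket A into A = {32*, 16*} and A2 = {4*, 12*, 20*} with a doubled directory.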
Directory Doubling
Why use least significant bits in directory?
§ Hard to decide where to start
§ Quite biased in the most significant bits
[Figure: least significant vs. most significant bits when doubling a directory containing 6* (6 = 110). With least significant bits, the size-4 directory (00, 01, 10, 11) doubles to 000 … 111 by appending a copy of itself, and entry 10 simply becomes entries 010 and 110, both still reaching 6*. With most significant bits, the doubled directory must be re-ordered (000, 100, 010, 110, 001, 101, 011, 111), so doubling cannot be done by a plain copy.]
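A one-line view of why least significant bits allow doubling by copying (illustrative):

```python
# Directory indexed by the last k bits: after doubling, entry i and
# entry i + 2**k must point to the same bucket, so new = old + old.
old = ['A', 'B', 'C', 'D']        # entries 00, 01, 10, 11
new = old + old                   # entries 000 ... 111
assert new[0b010] is new[0b110]   # both still point to bucket C
```

With most significant bits, the doubled directory would have to be interleaved instead, so no simple copy works.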
Comments on Extendible Hashing
• If directory fits in memory, equality search
answered with one disk access; else two.
– Example: 100MB file, 100 bytes/record, 4K pages: 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
– Directory grows in spurts, and, if the distribution of hash
values is skewed, directory can grow large.
– Multiple entries with same hash value cause problems!
• Delete: If removal of data entry makes bucket
empty, can be merged with `split image’. If each
directory element points to same bucket as its split
image, can halve directory.
[Board example of skew: with keys 2^N, 2^(N+1), 2^(N+2), 2^(N+4), 2^(N+5) and N = 100, every hash value ends in a long run of zeros (00, 000, 0000, …, 000…000), so the directory must grow to 2^101 elements before these entries separate.]
Linear Hashing
§ This is another dynamic hashing scheme, an alternative to Extendible Hashing.
§ LH handles the problem of long overflow chains without using a directory, and handles duplicates.
§ Idea: Use a family of hash functions h_0, h_1, h_2, … (a small sketch follows below)
  – h_i(key) = h(key) mod (2^i * N); N = initial # buckets
  – h is some hash function (its range is not 0 to N-1)
  – If N = 2^d0, for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
  – h_i+1 doubles the range of h_i (similar to directory doubling)
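A tiny sketch of this family of functions; the base hash h is an illustrative assumption:

```python
N = 4                        # initial # buckets: N = 2^d0 with d0 = 2

def h(key: int) -> int:      # some base hash; identity is enough for a demo
    return key

def h_i(i: int, key: int) -> int:
    # h_i(key) = h(key) mod (2^i * N), i.e. the last d0 + i bits of h(key)
    return h(key) % (2**i * N)

assert h_i(0, 43) == 0b11    # 43 = 101011: last 2 bits
assert h_i(1, 43) == 0b011   # last 3 bits: the range has doubled
```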
Linear Hashing (Contd.)
• Directory is avoided in LH by using overflow pages, and choosing the bucket to split round-robin.
  – Splitting proceeds in `rounds'. A round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R have yet to be split.
  – The current round number is Level.
  – Search: To find the bucket for data entry r, compute h_Level(r):
    • If h_Level(r) is in the range `Next to N_R', r belongs here.
    • Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_Level+1(r) to find out (see the sketch below).
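A minimal sketch of this search rule, reusing h and N from the sketch above:

```python
def bucket_for(key: int, level: int, next_: int) -> int:
    n_r = 2**level * N            # N_R: # buckets at the start of the round
    b = h(key) % n_r              # h_Level(key)
    if b >= next_:                # in range `Next to N_R': not yet split
        return b
    return h(key) % (2 * n_r)     # h_Level+1 picks bucket b or b + N_R
```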
Overview of LH File
In the middle of a round.
[Figure: buckets 0 to Next-1, already split in this round, come first; bucket Next is the next to be split; buckets from Next up to N_R - 1 existed at the beginning of the round and form the range of h_Level(search key value); after them come the `split image' buckets created (through splitting of other buckets) in this round. If h_Level(search key value) is below Next, must use h_Level+1(search key value) to decide if the entry is in the `split image' bucket.]
Linear Hashing (Contd.)
§ Insert: Find bucket by applying h_Level / h_Level+1:
  § If the bucket to insert into is full:
    § Add an overflow page and insert the data entry.
    § (Maybe) Split the Next bucket and increment Next.
§ Can choose any criterion to `trigger' a split; usually, a split is triggered when a new overflow page is created.
§ Since buckets are split round-robin, long overflow chains don't develop!
§ Doubling of the directory in Extendible Hashing is similar; here, the switching of hash functions is implicit in how the # of bits examined is increased. (A sketch of insert follows below.)
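A compact toy tying the search rule, overflow, and round-robin splitting together; it reuses h and N from above, models pages as plain lists, and triggers a split whenever an insert creates an overflow:

```python
CAPACITY = 4  # data entries per primary page (illustrative)

class LinearHashFile:
    def __init__(self):
        self.level, self.next = 0, 0
        self.buckets = [[] for _ in range(N)]  # overflow chain folded into one list

    def _bucket(self, key: int) -> int:
        n_r = 2**self.level * N
        b = h(key) % n_r                       # h_Level
        return b if b >= self.next else h(key) % (2 * n_r)  # h_Level+1

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._bucket(key)]
        overflow = len(bucket) >= CAPACITY     # would create an overflow page
        bucket.append(key)
        if overflow:                           # the usual split trigger
            self._split()

    def _split(self) -> None:
        n_r = 2**self.level * N
        old = self.buckets[self.next]
        self.buckets.append([])                # `split image': bucket Next + N_R
        self.buckets[self.next] = [k for k in old if h(k) % (2 * n_r) == self.next]
        self.buckets[self.next + n_r] = [k for k in old if h(k) % (2 * n_r) != self.next]
        self.next += 1
        if self.next == n_r:                   # all N_R buckets split: round over
            self.level, self.next = self.level + 1, 0
```

Loading the initial example file and then inserting 43, 37, 29, 34, 66, 22, 50 reproduces the states shown on the following slides.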
Example of Linear Hashing
§ Insert 43*
[Figure: initial file (Level=0, N=4, Next=0). Primary pages: bucket 00 = {32*, 44*, 36*}, 01 = {9*, 25*, 5*}, 10 = {14*, 18*, 10*, 30*}, 11 = {31*, 35*, 7*, 11*}. 43* = 101011 hashes to bucket 11, which is full, so an overflow page holding 43* is added and bucket Next=0 is split: 32* (…000) stays in bucket 000, while 44* and 36* (…100) move to new bucket 100; Next becomes 1.]
§ Insert 37*, 29*
[Figure: 37* = 100101 goes to bucket 01, which still has room. 29* = 11101 also hashes to bucket 01, now full; bucket Next=1 (bucket 01 itself) is split: 9* and 25* stay in bucket 001, while 5*, 37*, 29* move to new bucket 101; Next becomes 2.]
§ Insert 34*, 66*, 22*
[Figure: 34* = 100010 hashes to full bucket 10, so an overflow page is created and bucket Next=2 is split: 10*, 18*, 34* (…010) stay in bucket 010, while 14*, 30* (…110) move to new bucket 110; Next becomes 3. 66* (…010) and 22* (…110) then fit without further splits.]
§ Insert 50*
[Figure: 50* = 110010: h0 gives bucket 10, which is below Next=3, so h1 gives bucket 010; that bucket ({66*, 18*, 10*, 34*}) is full, so an overflow page holding 50* is added and bucket Next=3 is split: 35*, 11*, 43* (…011) stay in bucket 011, while 31*, 7* (…111) move to new bucket 111. Next reaches N=4, so the round ends: Level becomes 1 and Next resets to 0.]
LH Described as a Variant of EH
• The two schemes are actually quite similar:
  – Begin with an EH index where the directory has N elements.
  – Use overflow pages, split buckets round-robin.
  – The first split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, … are the same. So, we need only create directory element N, which differs from element 0, now.
    • When bucket 1 splits, create directory element N+1, etc.
• So, the directory can double gradually. Also, primary bucket pages are created in order: if they are allocated in sequence too (so that finding the i'th page is easy), we don't actually need a directory at all, and that is exactly Linear Hashing!
Summary
• Hash-based indexes: best for equality searches,
cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by
splitting a full bucket when a new data entry is to
be added to it. (Duplicates may require overflow
pages.)
– Directory to keep track of buckets, doubles
periodically.
– Can get large with skewed data; additional I/O if this
does not fit in main memory.
Summary (Contd.)
§ Linear Hashing avoids directory by splitting buckets
round-robin, and using overflow pages.
§ Overflow chains are not likely to be long.
§ Duplicates handled easily.
§ Space utilization could be lower than Extendible
Hashing, since splits not concentrated on `dense’ data
areas.
§ Can tune criterion for triggering splits to trade-off
slightly longer chains for better space utilization.
§ For hash-based indexes, a skewed data distribution is one
in which the hash values of data entries are not uniformly
distributed!