
Hash-Based Indexes


Introduction

As for any index, 3 alternatives for data entries k*:
1. Data record with key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key value k>

§ Choice orthogonal to the indexing technique
§ Hash-based indexes are the best for equality selections. Cannot support range searches.
§ Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.


Static Hashing
• # primary pages (index data entry pages) fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod M = bucket to which data entry with key k belongs. (M = # of buckets)


[Figure: a key is fed through h; h(key) mod M selects one of the M primary bucket pages (0, 1, …, M-1), each of which may have a chain of overflow pages.]

Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on the search key field of record r. Must distribute values over range 0 … M-1 (see the sketch below).
  – h(key) = (a * key + b) usually works well.
  – a and b are constants; lots known about how to tune h.
• Long overflow chains can develop and degrade performance.
  – Keep pages only about 80% full initially, and/or re-hash the file.
  – Extendible and Linear Hashing: dynamic techniques to fix this problem.
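A minimal sketch of a static hash index under the assumptions above; the class name, CAPACITY, and the constants a, b are illustrative choices, not a reference implementation:

```python
CAPACITY = 4          # data entries per primary/overflow page (assumption)

class StaticHashIndex:
    def __init__(self, m, a=31, b=17):
        self.m = m                        # M = # of primary buckets, fixed forever
        self.a, self.b = a, b             # constants of h(key) = a*key + b
        # each bucket is a list of pages; pages after the first form the overflow chain
        self.buckets = [[[]] for _ in range(m)]

    def _bucket(self, key):
        return (self.a * key + self.b) % self.m   # h(key) mod M

    def insert(self, key):
        pages = self.buckets[self._bucket(key)]
        if len(pages[-1]) == CAPACITY:    # last page full:
            pages.append([])              # chain a new overflow page
        pages[-1].append(key)

    def search(self, key):
        # equality search: read the primary page, then walk the overflow chain
        return any(key in page for page in self.buckets[self._bucket(key)])
```

Note how every overflow page adds one more I/O to each equality search on that bucket: this is exactly the long-chain degradation the dynamic schemes below are designed to avoid.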

Extendible Hashing
• Situation: Bucket (primary page) becomes full. Why not re-organize the file by doubling # of buckets?
  – Reading and writing all pages is expensive!
  – Idea: Use a directory of pointers to buckets; double # of buckets by doubling the directory, splitting just the bucket that overflowed!
  – Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. Ensure no overflow page!
  – Trick lies in how the hash function is adjusted!


Example
• Directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # bits of h(r); we denote r by h(r).
  – If h(r) = 5 = binary 101, it is in the bucket pointed to by 01 (see the one-line sketch below the figure).
v Insert: If bucket is full, split it (allocate new page, re-distribute).
v If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

[Figure: GLOBAL DEPTH = 2; DIRECTORY slots 00, 01, 10, 11 point to DATA PAGES Bucket A (4* 12* 32* 16*), Bucket B (1* 5* 21* 13*), Bucket C (10*), and Bucket D (15* 7* 19*), each with LOCAL DEPTH = 2.]
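The lookup rule fits in one line; a sketch (function name hypothetical), assuming h(r) is already an integer hash value:

```python
def directory_slot(h_r, global_depth):
    # keep only the last `global depth' bits of h(r)
    return h_r & ((1 << global_depth) - 1)

assert directory_slot(5, 2) == 0b01   # 5 = binary 101 -> slot 01, i.e., Bucket B
```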

Insert h(r)=20 (Causes Doubling)


[Figure: 20 = binary 10100 hashes to directory slot 00, but Bucket A (4* 12* 32* 16*) is full. Bucket A is split on a third bit into Bucket A (32* 16*) and Bucket A2 (4* 12* 20*), the `split image' of Bucket A, and the DIRECTORY is doubled from slots 00 … 11 to slots 000 … 111.]

After inserting h(r)=20


[Figure: GLOBAL DEPTH = 3; DIRECTORY slots 000 … 111. Bucket A (32* 16*) and its `split image' Bucket A2 (4* 20* 12*) have LOCAL DEPTH = 3, with one directory slot each (000 and 100); Bucket B (1* 5* 21* 13*), Bucket C (10*), and Bucket D (15* 7* 19*) keep LOCAL DEPTH = 2, with two directory slots pointing to each.]

Points to Note
• 20 = binary 10100. The last 2 bits (00) cannot tell us whether r belongs in A or A2; the last 3 bits can.
  – Global depth of directory: max # of bits needed to tell which bucket an entry belongs to.
  – Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
• When does a bucket split cause directory doubling?
  – Before the insert, local depth of bucket = global depth. The insert causes local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointer to the split image page. (Use of least significant bits enables efficient doubling via copying of the directory! See the sketch below.)

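Putting these rules together, a compact sketch of Extendible Hashing insertion (all names hypothetical; integer hash values stand in for data entries, and long runs of duplicate hash values are not handled):

```python
CAPACITY = 4                      # data entries per bucket page (assumption)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []         # integer hash values stand in for data entries

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, h):
        return h & ((1 << self.global_depth) - 1)    # last global-depth bits

    def insert(self, h):
        b = self.directory[self._slot(h)]
        if len(b.entries) < CAPACITY:
            b.entries.append(h)
            return
        if b.local_depth == self.global_depth:       # split needs one more bit:
            self.directory = self.directory * 2      # double by copying (LSB trick)
            self.global_depth += 1
        # split b on one more bit; its image takes the slots whose new bit is 1
        b.local_depth += 1
        image = Bucket(b.local_depth)
        mask = 1 << (b.local_depth - 1)              # the new distinguishing bit
        for i, ptr in enumerate(self.directory):     # re-aim half of b's pointers
            if ptr is b and (i & mask):
                self.directory[i] = image
        old, b.entries = b.entries, []
        for e in old:
            (image if e & mask else b).entries.append(e)
        self.insert(h)            # retry; may split again if entries skew to one side
```

Inserting the hash values 4, 12, 32, 16 and then 20 into a fresh index ends with global depth 3, one bucket holding 32* and 16*, and a split image holding 4*, 12*, 20*, matching the figures above.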

Directory Doubling


Why use least significant bits in directory?
§ With the most significant bits, it is hard to decide where to start
§ Data is often quite biased in the most significant bits
§ Least significant bits allow doubling by simply copying the directory (compare the figure and sketch below)

[Figure: Least Significant vs. Most Significant. Doubling a 2-bit directory to 3 bits, tracking entry 6* (6 = 110). With least significant bits, the new directory is the old one (00, 01, 10, 11) followed by a copy of itself (000, 001, 010, 011, 100, 101, 110, 111), so doubling is a simple copy. With most significant bits, the old slot order (00, 10, 01, 11) must be interleaved into 000, 100, 010, 110, 001, 101, 011, 111, so the directory cannot be doubled by copying.]
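A tiny demonstration of why the least-significant-bit layout doubles by copying (bucket names hypothetical):

```python
directory = ["A", "B", "C", "D"]     # slots 00, 01, 10, 11
directory = directory + directory    # now slots 000 ... 111: a plain copy
# slot i and slot i + 4 agree on their last two bits, so both still point
# at the right bucket; only the split bucket's pointer needs fixing.
assert directory[0b001] == directory[0b101] == "B"
```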

Comments on Extendible Hashing

• If directory fits in memory, equality search answered with one disk access; else two.
  – A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.
  – Directory grows in spurts, and, if the distribution of hash values is skewed, the directory can grow large.
  – Multiple entries with the same hash value cause problems!
• Delete: If removal of a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, can halve the directory.


Example of a skewed distribution: suppose buckets hold 4 data entries and the data entries have hash values 2^N, 2^(N+1), 2^(N+2), 2^(N+4), 2^(N+5), with N = 100. The binary representations of these values end in long runs of zeros (suffixes 00, 000, 0000, …, 000…000), so all five hash to the same bucket until enough bits are examined: the directory must grow to 2^101 elements before 2^100 is separated from the other four.
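A quick check of the arithmetic (a sketch, assuming page capacity 4 as above):

```python
vals = [2**i for i in (100, 101, 102, 104, 105)]
# last 100 bits: all five values collide in one bucket
assert len({v % 2**100 for v in vals}) == 1
# last 101 bits: 2**100 separates; the other four (exactly a full page) still collide
assert len({v % 2**101 for v in vals}) == 2
```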

Linear Hashing
§ This is another dynamic hashing scheme, an alternative to Extendible Hashing.
§ LH handles the problem of long overflow chains without using a directory, and handles duplicates.
§ Idea: Use a family of hash functions h_0, h_1, h_2, … (sketched below)
  – h_i(key) = h(key) mod (2^i N); N = initial # buckets
  – h is some hash function (range is not 0 to N-1)
  – If N = 2^d0, for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
  – h_{i+1} doubles the range of h_i (similar to directory doubling)
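The family is one line of code; a sketch assuming N = 4 initial buckets and, for simplicity, the identity function standing in for the underlying hash function h:

```python
N = 4                                  # initial # buckets, N = 2^d0 with d0 = 2

def h_i(key, i, h=lambda k: k):        # h would be a real hash function
    """h_i(key) = h(key) mod (2^i * N); h_{i+1} doubles the range of h_i."""
    return h(key) % (2 ** i * N)

# With N a power of two, h_i is just the last d0 + i bits of h(key):
assert h_i(43, 0) == 0b11              # 43 = 101011; last 2 bits
assert h_i(43, 1) == 0b011             # last 3 bits
```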

Linear Hashing (Contd.)

• Directory avoided in LH by using overflow pages, and choosing the bucket to split round-robin.
  – Splitting proceeds in `rounds'. A round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R are yet to be split.
  – Current round number is Level.
  – Search: To find the bucket for data entry r, find h_Level(r):
    • If h_Level(r) is in range `Next to N_R', r belongs here.
    • Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out (see the sketch below).
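The search rule as a sketch (names hypothetical; hkey is the hash value h(key)):

```python
def bucket_for(hkey, level, next_, n0):
    n_level = n0 * 2 ** level       # buckets that existed at the start of round `level`
    b = hkey % n_level              # h_Level
    if b < next_:                   # bucket b was already split this round:
        b = hkey % (2 * n_level)    # h_{Level+1} picks b or its image b + n_level
    return b
```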

Overview of LH File
In the middle of a round.

[Figure: Buckets 0 to Next-1 have been split in this round; each split created a `split image' bucket. Next marks the bucket to be split. Buckets from Next onward existed at the beginning of this round: this is the range of h_Level(search key value). If h_Level(search key value) falls among the buckets split in this round, must use h_{Level+1}(search key value) to decide if the entry is in the `split image' bucket.]

Linear Hashing (Contd.)
§ Insert: Find bucket by applying h_Level / h_{Level+1}:
  § If the bucket to insert into is full:
    § Add overflow page and insert data entry.
    § (Maybe) Split the Next bucket and increment Next.
§ Can choose any criterion to `trigger' a split; usually, when a new overflow page is created.
§ Since buckets are split round-robin, long overflow chains don't develop!
§ Doubling of the directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased. A sketch of the insert logic follows below.
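A compact sketch of LH under these rules, assuming 4-entry pages and the `new overflow page' trigger; all names are illustrative:

```python
CAPACITY = 4                                     # data entries per page (assumption)

class LinearHash:
    def __init__(self, n0=4):
        self.n0, self.level, self.next = n0, 0, 0
        self.buckets = [[] for _ in range(n0)]   # each list = primary page + chain

    def _n_level(self):
        return self.n0 * 2 ** self.level         # buckets at the start of this round

    def _bucket_for(self, h):
        b = h % self._n_level()                  # h_Level
        return b if b >= self.next else h % (2 * self._n_level())   # h_{Level+1}

    def insert(self, h):                         # h = hash value of the data entry
        chain = self.buckets[self._bucket_for(h)]
        starts_new_page = chain and len(chain) % CAPACITY == 0
        chain.append(h)
        if starts_new_page:                      # trigger: a new overflow page
            self._split()

    def _split(self):
        n = self._n_level()
        old = self.buckets[self.next]
        self.buckets.append([])                  # split image: bucket next + n
        self.buckets[self.next] = [e for e in old if e % (2 * n) == self.next]
        self.buckets[self.next + n] = [e for e in old if e % (2 * n) != self.next]
        self.next += 1
        if self.next == n:                       # all n buckets split: round over
            self.level, self.next = self.level + 1, 0
```

Note that the bucket that overflows and the bucket that splits are usually different: the overflow page absorbs the insert, while Next advances regardless of where the overflow occurred.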

Example of Linear Hashing
§ Insert 43*

[Figure: Level=0, N=4, Next=0. The actual contents of the linear hashed file; h_0 labels the primary bucket pages: 00: 32* 44* 36*; 01: 9* 25* 5* (data entry r with h(r)=5 sits in this primary bucket page); 10: 14* 18* 10* 30*; 11: 31* 35* 11* 7*. Inserting 43* (binary 101011, h_0 = 11) needs an overflow page on full bucket 11, which triggers a split of bucket Next=0 using h_1. Result: 000: 32*; 100: 44* 36* (the split image); 011 keeps its overflow page holding 43*; Next=1.]

Insert 37*, 29*

[Figure: Before (Level=0, Next=1): 000: 32*; 001: 9* 25* 5*; 010: 14* 18* 10* 30*; 011: 31* 35* 11* 7* with overflow page 43*; 100: 44* 36*. 37* (h_0 = 01) fills bucket 01; 29* (h_0 = 01) then needs an overflow page, triggering a split of bucket Next=1 using h_1: 001 keeps 9* 25*, new bucket 101 gets 5* 37* 29*; Next=2.]

Insert 34*, 66*, 22*

[Figure: 34* (h_0 = 10) needs an overflow page on full bucket 10, triggering a split of bucket Next=2 using h_1: 010 keeps 18* 10* 34*, new bucket 110 gets 14* 30*; Next=3. 66* and 22* are then placed with h_1, since their h_0 value (10) is below Next: 66* joins 010 and 22* joins 110. Result (Level=0, Next=3): 000: 32*; 001: 9* 25*; 010: 18* 10* 34* 66*; 011: 31* 35* 7* 11* with overflow 43*; 100: 44* 36*; 101: 5* 37* 29*; 110: 14* 30* 22*.]

Insert 50* (End of a Round)

[Figure: 50* (h_0 = 10 < Next, so h_1 = 010) needs an overflow page on full bucket 010, triggering the split of bucket Next=3, the last unsplit bucket of the round: 011 keeps 35* 11* 43*, new bucket 111 gets 31* 7*. The round ends: Level=1, Next=0. Result: 000: 32*; 001: 9* 25*; 010: 66* 18* 10* 34* with overflow 50*; 011: 35* 11* 43*; 100: 44* 36*; 101: 5* 37* 29*; 110: 14* 30* 22*; 111: 31* 7*.]
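Assuming the LinearHash sketch from the earlier slide, replaying the example's inserts (hash values used directly) reproduces the Next and Level values shown in the figures:

```python
lh = LinearHash(n0=4)
for h in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 11, 7]:
    lh.insert(h)                      # the initial file: Level=0, Next=0
lh.insert(43)                         # overflow on bucket 11 -> split bucket 0
assert (lh.level, lh.next) == (0, 1)
for h in [37, 29, 34, 66, 22]:
    lh.insert(h)                      # splits of buckets 1 and 2
assert (lh.level, lh.next) == (0, 3)
lh.insert(50)                         # last unsplit bucket splits: round ends
assert (lh.level, lh.next) == (1, 0)
```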

LH Described as a Variant of EH
• The two schemes are actually quite similar:
  – Begin with an EH index where the directory has N elements.
  – Use overflow pages, split buckets round-robin.
  – First split is at bucket 0. (Imagine the directory being doubled at this point.) But elements <1, N+1>, <2, N+2>, … are the same. So, need only create directory element N, which differs from 0, now.
    • When bucket 1 splits, create directory element N+1, etc.
• So, the directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i'th page is easy), we actually don't need a directory at all: that is exactly Linear Hashing!

Summary
• Hash-based indexes: best for equality searches, cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
  – Directory to keep track of buckets, doubles periodically.
  – Can get large with skewed data; additional I/O if this does not fit in main memory.

Summary (Contd.)
§ Linear Hashing avoids a directory by splitting buckets round-robin, and using overflow pages.
  § Overflow chains not likely to be long.
  § Duplicates handled easily.
  § Space utilization could be lower than Extendible Hashing, since splits are not concentrated on `dense' data areas.
    § Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.
§ For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!