Chapter 11: Hash-Based Indexes (Extendible Hashing and Linear Hashing)
Ed Knorr, Slides from Fall 2020
Based on:
• All of Chapter 11 of Ramakrishnan & Gehrke (textbook, pages 370-386)
• Your hashing knowledge from CPSC 221 or equivalent
Some Learning Goals
Compare and contrast the performance of hash-based indexes versus tree-based indexes (e.g., B+ tree) for equality and range searches. Provide the best-case, average-case, and worst-case complexities for such searches.
Explain how collisions are handled for open addressing and chaining implementations of hash structures. [from CPSC 221]
Explain the advantages that dynamic hashing provides over static hashing.
Show how insertions and deletions are handled in extendible hashing.
Build an extendible hash index using a given set of data.
Show how insertions and deletions are handled in linear hashing.
Build a linear hash index using a given set of data.
Describe some of the major differences between extendible hashing and linear hashing (e.g., how the directory is handled, how skew is handled).
Introduction
Hash-based indexes are usually the best choice for equality selections.
No traversal of trees
Direct computation of where k* should be
The 3 alternatives for index data entries still apply.
Static and dynamic hashing techniques exist; their trade-offs are similar to B+ trees.
Question: Hash-based indexes cannot support range searches efficiently. Why not?
Motivation
Udi Manber, Chief Scientist, Yahoo!: “The three most important algorithms at Yahoo,” he said, “were hashing, hashing, and hashing.”
• Quoted in: Skiena, Steven S. The Algorithm Design Manual, 2nd edition, Springer, 2008, p. 92.
Review (on Your Own) from CPSC 221 or Equivalent
1. What are the characteristics of a good hash function?
2. What is the difference between open addressing and chaining?
3. How are collisions handled in the above scenarios?
4. What is the complexity of a good hash function (for an “average” or typical search)?
5. What is the worst case complexity of a good hash function?
6. What “hashing” lesson can we learn from Statistics with respect to “The Birthday Problem”? (i.e., How many people are needed in a group before you’re likely to find two people in that group with the same birthday?)
Static Hashing
# of primary pages fixed, allocated sequentially, never de-allocated; overflow pages, if needed
h(k) mod N = bucket to which the data entry with key k belongs (N = # of buckets or pages or “nodes”; they contain lots of data entries)
[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 … N-1; a primary bucket page may be followed by a chain of overflow pages.]
Static Hashing (cont.)
Buckets contain data entries.
Buckets are analogous to leaf pages in a B+ tree.
A hash function works on a search key field of record r, and must distribute values over the range 0 … N-1.
h(key) = (a * key + b) % N … usually works well
• The modulus operator acts as a compression map to allow the (potentially long) hash value to fit into the given range of cells in the hash structure.
a and b are constants; lots known about how to tune h
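A minimal sketch of such a hash function in Python; the constants a and b and the bucket count below are illustrative choices, not values from the slides:

```python
# A minimal sketch of a static hash function of the form h(key) = (a*key + b) % N.
# The constants a, b and the bucket count N are illustrative, not prescribed here.

def make_hash(a: int, b: int, n_buckets: int):
    """Return a hash function mapping an integer key to a bucket in 0..N-1."""
    def h(key: int) -> int:
        # The modulus acts as the compression map described above.
        return (a * key + b) % n_buckets
    return h

h = make_hash(a=31, b=7, n_buckets=8)
print(h(5), h(13))  # both results land in 0..7
```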
Static Hashing (cont.)
7. In static hashing (e.g., CPSC 221), what is the solution when:
a) A hash table becomes full or nearly full in open addressing?
b) Long overflow chains develop in chained addressing?
Long overflow chains can develop and degrade performance.
Extendible and Linear Hashing are dynamic techniques to fix this problem.
Extendible Hashing (EH)
If a bucket (primary page) becomes full, why not re-organize the file by doubling the total number of buckets?
Reading and writing all pages is expensive (slow)!
Idea: Use a directory of pointers to buckets. Double the number of potential buckets by doubling the directory, and splitting just the bucket that overflowed.
The directory is much smaller than the file; so, doubling it is much cheaper. Typically, only one page of data entries is split. No overflow pages!
The trick lies in how the hash function is adjusted.
Example
• e.g., the directory might start as an array of size 4.
• To find the bucket for key r, take the last ‘global depth’ # of bits of h(r); e.g., suppose h(r) = binary value of r.
• If r = 5, then h(r) = 101; so, the last 2 bits of h(r) map to directory entry 01, which points to a specific bucket.
[Figure: a directory (GLOBAL DEPTH 2) of four pointers, and four data pages, each with LOCAL DEPTH 2:
  00 → Bucket A: 4* 12* 32* 16*
  01 → Bucket B: 1* 5* 21* 13*
  10 → Bucket C: 10*
  11 → Bucket D: 15* 7* 19*]
• Insert: If the bucket is full, split it (allocate a new page, re-distribute).
• If necessary, double the directory. (As we shall see, splitting a bucket does not always require doubling; we can tell by comparing the full bucket’s local depth with the global depth.)
Insert key r = 20 – Slide 1 of 2
• Bucket A is split into two using an extra bit, with entries re-distributed as follows:
  • A keeps the entries divisible by 8 (i.e., binary 1000); they end in these 3 bits: 000.
  • A2 gets the entries divisible by 4 (i.e., binary 100) but not by 8; they end in these 3 bits: 100.
• Note that only one bucket needs to be re-distributed (i.e., re-hashed).
• B, C, & D remain unchanged.
• But, where to link A2?
[Figure: after the split, with the directory still at global depth 2 (00, 01, 10, 11):
  Bucket A (local depth 3): 32* 16*
  Bucket B (local depth 2): 1* 5* 21* 13*
  Bucket C (local depth 2): 10*
  Bucket D (local depth 2): 15* 7* 19*
  Bucket A2 (local depth 3): 4* 12* 20* (‘split image’ of Bucket A)]
Insert r = 20 – Slide 2 of 2
• Double the directory. Add 1 to the global depth.
• Now, we can distinguish between A and A2.
• Note the difference in local depth among the buckets.
• Note that multiple pointers exist to the unsplit buckets.
[Figure: directory at global depth 3:
  000 → Bucket A (local depth 3): 32* 16*
  001 → Bucket B (local depth 2): 1* 5* 21* 13*
  010 → Bucket C (local depth 2): 10*
  011 → Bucket D (local depth 2): 15* 7* 19*
  100 → Bucket A2 (local depth 3): 4* 12* 20* (‘split image’ of Bucket A)
  101 → Bucket B; 110 → Bucket C; 111 → Bucket D]
Points to Note
20 = binary 10100. The last 2 bits (00) tell us that r belongs in A or A2; but, the last 3 bits tell us which of A or A2 the key will reside in. In summary:
Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to
Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket
When does a split bucket cause the doubling of the directory?
Before insertion, suppose the local depth of the bucket = the global depth. The insertion will cause local depth to become > global depth; therefore, the directory is doubled by copying it and adjusting the pointers to the appropriate pages.
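The rules above are enough to sketch extendible-hashing insertion in code. Here is a compact Python sketch, assuming (as in the running example) that h(r) is simply the binary value of the key and that a bucket holds at most 4 data entries; the class and method names are illustrative, and the rare case where a split leaves all entries on one side (requiring a further split) is not handled:

```python
# A sketch of extendible hashing: a directory of pointers, last-bits addressing,
# bucket splits, and directory doubling when local depth equals global depth.

BUCKET_CAPACITY = 4  # data entries per bucket (page), as in the example

class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by last global_depth bits

    def _dir_index(self, key: int) -> int:
        return key & ((1 << self.global_depth) - 1)  # last global_depth bits of h(key)

    def insert(self, key: int) -> None:
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.entries) < BUCKET_CAPACITY:
            bucket.entries.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Local depth would exceed global depth: double the directory by
            # copying it; old and new halves initially share the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split only the overflowing bucket, discriminating on one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)          # the 'split image'
        bit = 1 << (bucket.local_depth - 1)
        for i in range(len(self.directory)):        # re-point half of its slots
            if self.directory[i] is bucket and i & bit:
                self.directory[i] = image
        old_entries, bucket.entries = bucket.entries + [key], []
        for k in old_entries:                       # re-distribute the entries
            (image if k & bit else bucket).entries.append(k)

eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20]:
    eh.insert(k)
print(eh.global_depth)  # 3: inserting 20 doubled the directory, as in the slides
```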
Comments on Extendible Hashing
If the directory already exists in memory, an equality search is answered with one disk access; otherwise, two.
100 MB file, 100 bytes/record ⇒ 1,000,000 records (as data entries) and 25,000 directory elements, using 4K pages.
• Chances are high that the directory will fit in memory (15-bit addresses).
• 15 bits gives 2^15 slots or pointers = 32,768 pointers (about 10 bytes each, so we can keep the directory in the buffer pool).
• That’s enough to point to 32,768 pages (buckets).
• How many pages (at full occupancy) do we get from 1,000,000 records, if we use Alt. 1 (meaning we store the whole record in an “index” page)?
Comments on Extendible Hashing (cont.)
How many pages (at full occupancy) do we get from 1,000,000 records, if we use Alt. 2 (meaning we store <key, rid> pairs as the data entries)?
The directory grows in spurts; and, if the distribution of hash values is skewed, the directory can grow large.
Multiple entries with the same key can cause problems.
Comments on Extendible Hashing, cont.
Deletions: If the removal of a data entry makes a bucket empty, the bucket can be merged with its ‘split image’.
If each directory element points to same bucket as its split image, we can halve the directory.
Former UBC CPSC professor Nick Pippenger was on the team that developed extendible hashing. Their research paper was published in 1979 in ACM’s Transactions on Database Systems ([Fagin, et al., 1979]).
Linear Hashing (LH)
This is another dynamic hashing scheme—an alternative to EH.
LH handles the potential problem of long overflow chains without using a directory; and handles duplicates in a nice way.
Idea: Use a family of hash functions h0, h1, h2, … to map the key to the appropriate bucket:
• hi(key) = h(key) mod (2^i N); N = initial # of buckets
• h0 handles the case with N buckets
• h1 handles the case with 2N buckets
• h2 handles the case with 4N buckets
• hi+1 doubles the range of hi (similar to directory doubling)
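A minimal Python sketch of this hash-function family, assuming (as in the examples that follow) that h(key) is just the key itself and N = 4:

```python
# The linear-hashing family: h_i(key) = h(key) mod (2**i * N), with h(key) = key.
N = 4  # initial number of buckets (an assumption matching the later examples)

def h_i(key: int, i: int) -> int:
    """h_{i+1} doubles the range of h_i."""
    return key % (2**i * N)

print(h_i(5, 0), h_i(5, 1))  # 1 then 5: bucket 01 before a split, bucket 101 after
```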
Linear Hashing (cont.)
Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin.
Important: The bucket being split isn’t necessarily a full one!
Splitting proceeds in ‘rounds’. A round ends when all NLevel initial (for the current round, Level) buckets have been split. Buckets 0 to Next-1 have been split; Next to NLevel-1 are yet to be split.
The current round number is Level.
Search: To find the bucket for data entry r, compute hLevel(r):
• If hLevel(r) is in the range Next to NLevel-1 (the range for hLevel of buckets not yet split), r belongs there.
• Else, r belongs to either bucket hLevel(r) or bucket hLevel(r) + NLevel; we must apply hLevel+1(r) to find out which.
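As a sketch, the search rule in Python (again assuming h(key) = key; level, next_, and n stand for the current Level, Next, and initial bucket count):

```python
# Linear-hashing search: decide which bucket the data entry with hash value
# `key` belongs to, given the current round (level) and split pointer (next_).

def bucket_for(key: int, level: int, next_: int, n: int) -> int:
    n_level = (2**level) * n        # N_Level buckets existed at round start
    b = key % n_level               # h_Level(key)
    if b >= next_:                  # in [Next, N_Level - 1]: not yet split
        return b
    return key % (2 * n_level)      # h_{Level+1}(key): either b or b + N_Level

# With Level = 0, N = 4, Next = 1 (the state in the 43* example later on):
print(bucket_for(44, level=0, next_=1, n=4))  # 4, i.e., bucket 100
```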
Linear Hashing (cont.)
In the middle of a round:
• Buckets 0 to Next-1 have already been split in this round; Next is the bucket to be split next.
• NLevel buckets existed at the beginning of this round.
• If hLevel(search key value) is in the already-split range 0 to Next-1, use hLevel+1(search key value) to decide if the entry is in the ‘split image’ bucket.
• ‘Split image’ buckets were created (through splitting of buckets) in this round.
Linear Hashing (cont.)
Insert: Find the bucket by applying hLevel / hLevel+1. If the bucket to insert into is full:
• Add an overflow page and insert the data entry.
• Split the Next bucket and increment Next.
The implementors can choose any criterion to ‘trigger’ a split.
Unless told otherwise, the rule we’ll follow in this course is: “At most one split per insertion”. In particular:
• Only split when you have to allocate a new (overflow) page, that is, when the data entry that you are trying to insert will not fit in the existing page(s) (i.e., because both the target bucket and the rest of its overflow chain, if any, are full).
• We’ll assume that any cascading effects (e.g., reallocation of overflow pages) won’t cause further splits for the same insertion. A sketch follows.
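Putting these rules together, here is a Python sketch of linear-hashing insertion under the “at most one split per insertion” rule. It assumes h(key) = key, a capacity of 4 data entries per primary page, and N = 4 initial buckets; each bucket is modeled as one Python list standing in for the primary page plus its overflow chain, and all names are illustrative:

```python
# A sketch of linear hashing: round-robin splits, triggered only when an
# insertion overflows a bucket's primary page (entries beyond CAPACITY model
# the overflow chain).

CAPACITY = 4  # data entries per primary page

class LinearHash:
    def __init__(self, n: int = 4):
        self.n = n                  # initial number of buckets
        self.level = 0              # current round number
        self.next = 0               # next bucket to split
        self.buckets = [[] for _ in range(n)]

    def _bucket_index(self, key: int) -> int:
        n_level = (2**self.level) * self.n
        b = key % n_level            # h_Level
        if b < self.next:            # bucket already split this round
            b = key % (2 * n_level)  # h_{Level+1}
        return b

    def insert(self, key: int) -> None:
        b = self._bucket_index(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > CAPACITY:   # needed an overflow page
            self._split()                     # at most one split per insertion

    def _split(self) -> None:
        n_level = (2**self.level) * self.n
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])               # the 'split image' bucket
        for k in old:                         # re-distribute with h_{Level+1}
            self.buckets[k % (2 * n_level)].append(k)
        self.next += 1
        if self.next == n_level:              # all original buckets split:
            self.level += 1                   # the round ends
            self.next = 0

lh = LinearHash()
for k in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11, 43]:
    lh.insert(k)
print(lh.next, lh.buckets[0], lh.buckets[4])  # 1 [32] [44, 36], as in the example
```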
Linear Hashing (cont.)
Since buckets are split round-robin, long overflow chains typically don’t develop.
Over time, the chains will shrink, as the index “matures”.
A doubling of the directory in Extendible Hashing is similar; there, the switching of hash functions is implicit in how the number of bits examined is increased.
Example of a Linear Hashing Index
• Level = 0, N = 4; Next = 0.
• On a split, hLevel+1 is used to re-distribute entries.
• e.g., 9 = (1001)2, so 9* lives in bucket 01.
• (The h1 and h0 column values are for illustration only; the pages show the actual contents of the linear hashed file.)
[Figure: Level = 0, N = 4, Next = 0; each row is a primary bucket page:
  h1   h0   PRIMARY PAGES
  000  00   32* 44* 36*
  001  01   9* 25* 5*          ← data entry r with h(r) = 5
  010  10   14* 18* 10* 30*
  011  11   31* 35* 7* 11*]
Example of Linear Hashing: Add in 43*
• 43 = (101011)2, so h0(43) = 11; bucket 11 (31* 35* 7* 11*) is full.
• Put 43* on a newly allocated overflow page for bucket 11; this triggers the split of bucket Next = 0.
• Bucket 00’s entries (32* 44* 36*) are re-distributed using h1 (e.g., 9 = (1001)2), and Next advances to 1.
[Figure: after the insertion, Level = 0, N = 4, Next = 1:
  h1   h0   PRIMARY PAGES        OVERFLOW PAGES
  000  00   32*
  001  01   9* 25* 5*            ← data entry r with h(r) = 5
  010  10   14* 18* 10* 30*
  011  11   31* 35* 7* 11*       43*
  100  00   44* 36*]
Example of Linear Hashing (cont.)
Let’s insert 37*, 29*, 22*, 66*, and 34*.
37 = (100101)2, 29 = (11101)2, 22 = (10110)2, 66 = (1000010)2, 34 = (100010)2
For reference: 9 = (1001)2 … 25 = (11001)2 … 5 = (101)2 … 14 = (1110)2 … 18 = (10010)2 … 10 = (1010)2 … 30 = (11110)2
[Figure: after those five insertions, Level = 0, N = 4, Next = 3:
  h1   h0   PRIMARY PAGES        OVERFLOW PAGES
  000  00   32*
  001  01   9* 25*
  010  10   66* 18* 10* 34*
  011  11   31* 35* 7* 11*       43*
  100  00   44* 36*
  101  01   5* 37* 29*
  110  10   14* 30* 22*]
(37* fit in bucket 01; 29* overflowed bucket 01 and triggered the split of bucket 01; 22* overflowed bucket 10 and triggered the split of bucket 10; 66* and 34* then fit without splits.)
Example (cont.): End of a Round
Let’s insert 50* = (110010)2.
[Figure: after the insertion, the round has ended: Level = 1, N = 4, Next = 0:
  h1   PRIMARY PAGES        OVERFLOW PAGES
  000  32*
  001  9* 25*
  010  66* 18* 10* 34*      50*
  011  43* 35* 11*
  100  44* 36*
  101  5* 37* 29*
  110  14* 30* 22*
  111  31* 7*]
What Happened?
h0(50) = (110010)2 mod (2^0 × N) = 50 % 4 = 2 = (10)2.
Is bucket 10 in the unsplit range [Next, N0 - 1], i.e., [3, 3]? No; therefore, use h1.
h1(50) = (110010)2 mod (2^1 × N) = 50 % 8 = 2 = (010)2.
So, insert 50* into bucket 010. That bucket was full, so put 50* in a newly allocated overflow page.
The overflow allocation triggers the split of the “Next” bucket (011); redistribute the split bucket’s entries using h1.
Advance Next. Since all N0 original buckets have now been split, the round ends: Next = 0, Level = 1.
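As a quick check of this arithmetic (a one-off computation, not part of an index implementation):

```python
# Verifying the h0/h1 values for key 50 with N = 4 initial buckets.
N = 4
print(50 % (2**0 * N))  # 2 -> binary 10; not in the unsplit range [3, 3], so use h1
print(50 % (2**1 * N))  # 2 -> binary 010; 50* belongs in bucket 010
```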
Summary
Hash-based indexes: best for equality searches, cannot support range searches.
Static hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it.
Lots of duplicates may require overflow pages. Why?
The directory keeps track of the buckets. It doubles periodically.
The index can become large if there’s a lot of skewed data.
• Additional I/O if the directory does not fit in main memory.
Summary (cont.)
A variation of extendible hashing is extensible hashing, where we use the leading bits (on the left) to discriminate, rather than the last bits.
We’ll stick with extendible hashing in this course.
Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
Overflow chains are not likely to be long.
• Over time, they may disappear.
Duplicates are handled easily.
Space utilization could be lower than that for Extendible Hashing since splits are not concentrated on ‘dense’ data areas.