CS3402
1
CS3402 : Chapter 9 Indexing Techniques
Semester B, 2020/2021
Indexing
Q: Given that a file already exists with some primary organization (e.g., unordered, ordered, or hash organizations), how to speed up the retrieval of records?
We will describe additional auxiliary access structures called index.
CS3402
2
Without an index, we may use the linear search (for unordered file or non-order attribute in ordered file)
Indexes as Access Paths
Index are auxiliary files on disk that provide secondary access paths, which provide fast access to records without affecting the physical placement of records in the primary data file on disk.
An index is an ordered sequence of entries
It is ordered by the indexing field value
The pointer provides the address of the required record
Instead of searching the data file sequentially, we search the index
to get the address of the required records
The index file usually occupies considerably less disk blocks than the data file because its entries are much smaller
CS3402 3
Types of Single Level Indexes
There are several types of single-level ordered indexes:
Primary index: Specified on the ordering key field of an ordered
file of records
Cluster index: The ordering field of the index is not a key field and numerous records in the data file can have the same value for the ordering field (repeating value attributes)
Secondary index: The index field is specified on any non-ordering field of a data file
CS3402 4
Dense vs sparse index
A dense index has an index entry for every search key value (and hence every record) in the data file.
Thus larger index size
A sparse index, on the other hand, has index entries for only some of
the search values.
Thus smaller index size
CS3402 5
Primary Index
We can physically order the records of a file on disk based on the values of the ordering field.
This leads to an ordered file.
Ordering key field: The ordering field is also a key field of the file—a field guaranteed to have a unique value in each record.
Q. How to create a primary index on this ordered file?
Assumption: Name is the ordering key field.
CS3402 6
Primary Index
Block anchor: First record in the block
A primary index is an ordered file whose records are of fixed length with two fields:
Primary key value of block anchor => One index entry for each
block in the data file Block pointer
Q: Is the primary index a sparse index?
CS3402 7
Q. # Block Accesses WITHOUT
Indexing?
Supposethat:
Record size fixed and unspanned
R=100 bytes
Block size B=1024 bytes
Number of record r=30,000 records
Then,weget:
Blocking factor Bfr= B/R =10
records/block
Number of file blocks b= (r/Bfr)= (30000/10)= 3,000 blocks
A binary search on the data file need Q: What is the average number
approximate log23000 = 12 block of block accesses needed?
accesses
Note:Onceablockisstoredinthemain memory, the searching cost of the data/records within the block can be ignored (very fast)
CS3402 8
Q. # Block Accesses WITH Indexing?
Assume the Ordering key field size V=9 bytes and assume the record pointer size PR=6 bytes. Then:
An index entry size RI=(V+ PR)=(9+6)=15 bytes
Index blocking factor BfrI= B/RI = 1024/15 = 68
entries/block
The total number of index entries = number of blocks = 3000
Number of index blocks bI= (r/ BfrI)= (3000/68)= 45 blocks
Binary search needs log b = log 45= 6 block accesses 2I2
Q: What is the average number of block accesses
To get the required record = 6 + 1 = 7 block accesses needed?
CS3402 9
Clustering Index
If file records are physically ordered on a nonkey field— which does not have a distinct value for each record—that field is called the clustering field
The data file is called a clustered file.
We can create a different type of index, called a clustering index, to speed up retrieval of all the records that have the same value for the clustering field.
Q. How to create a clustering index?
CS3402 10
Clustering Index
Clustering Index
Defined on an ordered data file, which is ordered on a non-key
CS3402
11
field (repeating values)
A clustering index is also an ordered file with two fields
One index entry for each distinct value of the field
The pointer points to the first block in the data file that has a record with that value for its clustering field
A Clustering Index Example
CS3402 12
Another Clustering Index Example
Reserving the whole block for handling overflow due to record insertion (anchoring block)
CS3402 13
Secondary Index
Secondary Index
A secondary index provides a secondary means of accessing a
CS3402
14
file for which some primary access already exists
There can be several secondary indexes (and hence, indexing
fields) for the same file
The secondary index may be on a key field or a non-key field The index is an ordered file with two fields
The index field is not used for physical ordering of the file.
Example of Secondary Index on Key Field
CS3402 15
Example: Without Secondary Index
Suppose:
Record size (fixed and unspanned)
R=100 bytes
Block size B=1024 bytes Number of record r=30,000 records
Then, we get:
Blocking factor Bfr= B/R =10
records/block
Number of file blocks b= (r/Bfr)= (30000/10)= 3,000 blocks
Without the secondary index, a linear
Q: What is the average number of block
search on the data file requires b/2 = accesses needed?
3000/2 = 1,500 block accesses on
average
CS3402 16
Example: With Secondary Index
Assume the non-ordering key field size V=9 bytes and the record pointer size PR=6 bytes.
An index entry size RI=(V+ PR)=(9+6)=15 bytes
Index blocking factor BfrI= B/RI = 1024/15 = 68 entries/block
In a dense index, the total number of index entries = number of records = r = 30000
Number of index blocks b= (r/ BfrI)= (30000/68)= 442 blocks
Binary search needs log b= log 442= 9 block Q: What is the average n2 umber2 of block
accesses
accesses needed?
To get the required record = 9 + 1 = 10 block accesses
CS3402 17
Example of Secondary Index on non-Key Field
Using blocks of record pointers for handling same value non- key records
CS3402 18
Properties of Index Types
CS3402 19
Multi-Level Indexes
Because a single-level index is an ordered file, we can create a primary index to the index itself
In this case, the original index file is called the first-level index and the index to the index is called the second-level index
We can repeat the process, creating a third, fourth, …, top level until all entries of the top level fit in one disk block
A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
CS3402 20
A Two-Level Primary Index
CS3402 21
Multi-Level Indexes Example
Suppose we consider a multi-level index that the index blocking factor (also called fan out, fo) bfri = 68 index entries per block
Number of blocks = 30000
The number of first level blocks b1 = ceil(30000/68) = 442
The no. of second-level blocks b2 = ceil(b1/fo) = ceil(442/68) = 7 blocks
The no. of third-level blocks b3 = ceil(b2/fo) = ceil(7/68) = 1 Hence the third-level is the top level of the index t =3
To access a record, access one block at each level plus one block from the data file = t + 1 = 3 + 1 = 4 block accesses
CS3402 22
Tree Data Structure
A tree is formed of multi-level nodes
Except the root node, each node in the tree has one parent node and zero or more
child node
A node that does not have any child nodes is called a leaf node
A non-leaf node is called internal node
A sub-tree of a node consists of the node and all its descendant
If the leaf nodes are at different levels, the tree is called unbalanced
CS3402 23
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
Most multi-level indexes use B-tree or B+-tree data structures for solving the insertion and deletion problem
They reserves spaces in each tree node (disk block) to allow for new index entries
In B-Tree and B+-Tree, each node corresponds to a disk block Each node is kept between half-full and completely full
Need to restructure the tree in case of node overflow (insertion when full) or underflow (deletion when half full)
Less than half-full: Wastes of space and the tree will have many levels (higher retrieval cost)
CS3402 24
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
An insertion into a node that is not full is quite efficient
If a node is full the insertion causes a split into two nodes Splitting may propagate to higher tree levels
A deletion is quite efficient if a node does not become less than half full
CS3402
25
Otherwise, if a deletion causes a node to become less than half full, it may merge with neighboring nodes (may propagate to higher levels)
Internal Node in B-tree Index
Each internal node in a B-tree is of the form
Each Pri is a data pointer points to the record whose search key field value = Ki
Within each node, K1 < K2 < ... < Kq-1
For all search key field values X in the subtree pointed by Pi, we have Ki-1 < X < Ki for 1
where q ≤ p
Pi : tree pointer
Ki : search key field value
Within each node, K1 < K2 < ... < Kq-1
For all search key field values X in the subtree pointed by Pi, we
haveKi-1
Pnext>
Pri is a data pointer pointing to the record whose search field is Ki Pnext points to the next leaf node of the tree
Within each node, K1 < K2 < ... < Kq-1 Each leaf node has at least ceil(p/2) values All leaf nodes are at the same level
CS3402 30
B+-tree Index (Example)
CS3402
31
Internal node
B+-tree example
Leaf node
B+-tree Index Example
Suppose the search key field is V = 9 bytes, block size is B = 512 bytes, a data pointer is Pr = 7 bytes, and a tree node pointer is P = 6 bytes
An internal node of B+-tree can have up to p tree pointers and p-1 search field value. These must fit into a single block.
(p * P) + ((p – 1) * V) <= B
Q: What is the maximum number of tree
pointers in an internal node? (p * 6) + (p – 1) * 9) <= 512
p = 34
CS3402 32
B+-tree Index Example
Suppose the search key field is V = 9 bytes, block size is B = 512 bytes, a data pointer is Pr = 7 bytes, and a tree node pointer is P = 6 bytes
The order p for the leaf node
Q: What is the maximum number of data
leaf
pointers in a leaf node? (pleaf *(Pr+V))+P<=B
(pleaf *(7+9))+6<=512 16 * pleaf <= 506
pleaf = 31
CS3402 33
Difference between B-tree and B+-tree
In a B-tree, data pointers exist at all levels of the tree In a B+-tree, data pointers exist at the leaf-level nodes
A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree since its entry is smaller in size
The internal nodes of B+-tree contains tree pointers only
CS3402 34
B+-tree Index Insertion
Step1: Descend to the leaf node where the key fits Step 2:
(Case 1): If the node has an empty space, insert the key into the node.
(Case 2) If the node is already full, split it into two nodes by the middle key value, distributing the keys evenly between the two nodes, so each node is half full.
CS3402
35
(Case 2a) If the node is a leaf, take a copy of the middle key value, and repeat step 2 to insert it into the parent node.
(Case 2b) If the node is a non-leaf, exclude the middle key value during the split and repeat step 2 to insert this excluded value into the parent node.
B+-tree Index Insertion
Example #1: insert 28 into the tree below
CS3402
36
The node has an empty space
B+-tree Index Insertion
Example #1: insert 28 into the tree below
Result: insert 28 into the appropriate leaf node
CS3402 37
B+-tree Index Insertion
Example #2: insert 70 into the tree below
CS3402 38
B+-tree Index Insertion
Example #2: insert 70 into the tree below
CS3402 Split 39
The leaf node is full ->
B+-tree Index Insertion
Example #2: insert 70 into the tree below
Result: split leaf node from the middle, chose the middle key 60, and place it into the parent node.
CS3402 40
B+-tree Index Insertion
Example #3: insert 95 into the tree below
The leaf node is full -> Split
CS3402 41
B+-tree Index Insertion
Example #3: insert 95 into the tree below
Result: split leaf node from the middle, and split the parent node.
CS3402 42
B+-tree Index Deletion
Step1: Descend to the leaf node where the key fits
Step 2: Remove the required key and associated reference from the
node.
(Case 1): If the node still has half-full keys, repair the keys in parent
node to reflect the change in child node if necessary and stop.
(Case 2): If the node is less than half-full, but left or right sibling node is more than half-full, redistribute the keys between this node and its sibling. Repair the keys in the level above to represent that these nodes now have a different “split point” between them.
(Case 3): If the node is less than half-full, and left and right sibling node are just half-full, merge the node with its sibling. Repeat step 2 to delete the unnecessary key in its parent.
CS3402 43
B+-tree Index Deletion
Example #1: Delete 70 from the tree below
CS3402 44
B+-tree Index Deletion
Example #1: Delete 70 from the tree below
Result: Just delete 70 from the leaf node.
CS3402 45
B+-tree Index Deletion
Example #2: Delete 25 from the tree below
CS3402 46
B+-tree Index Deletion
Example #2: Delete 25 from the tree below
CS3402 47
B+-tree Index Deletion
Example #2: Delete 25 from the tree below
Result: Delete 25 from the leaf node and internal node,
replace it with 28 in the internal node.
CS3402 48
B+-tree Index Deletion
Example #3: Delete 60 from the tree below
CS3402 49
B+-tree Index Deletion
Example #3: Delete 60 from the tree below
CS3402
50
Less than half-full
B+-tree Index Deletion
Example #3: Delete 60 from the tree below
Result: Delete 60 from the leaf node, combine leaf nodes and then internal nodes
CS3402 51
References
7e
Chapter 17
Online resource (B+-tree visualization): https://www.cs.usfca.edu/~galles/visualization/BPlusTree. html
CS3402 52