CS计算机代考程序代写 data structure CS3402

CS3402
1
CS3402 : Chapter 9 Indexing Techniques
Semester B, 2020/2021

Indexing
 Q: Given that a file already exists with some primary organization (e.g., unordered, ordered, or hash organizations), how to speed up the retrieval of records?
 We will describe additional auxiliary access structures called index.
CS3402
2
Without an index, we may use the linear search (for unordered file or non-order attribute in ordered file)

Indexes as Access Paths
 Index are auxiliary files on disk that provide secondary access paths, which provide fast access to records without affecting the physical placement of records in the primary data file on disk.
 An index is an ordered sequence of entries :
It is ordered by the indexing field value
The pointer provides the address of the required record
 Instead of searching the data file sequentially, we search the index
to get the address of the required records
 The index file usually occupies considerably less disk blocks than the data file because its entries are much smaller
CS3402 3

Types of Single Level Indexes
 There are several types of single-level ordered indexes:
Primary index: Specified on the ordering key field of an ordered
file of records
Cluster index: The ordering field of the index is not a key field and numerous records in the data file can have the same value for the ordering field (repeating value attributes)
Secondary index: The index field is specified on any non-ordering field of a data file
CS3402 4

Dense vs sparse index
 A dense index has an index entry for every search key value (and hence every record) in the data file.
Thus larger index size
 A sparse index, on the other hand, has index entries for only some of
the search values.
Thus smaller index size
CS3402 5

Primary Index
 We can physically order the records of a file on disk based on the values of the ordering field.
This leads to an ordered file.
 Ordering key field: The ordering field is also a key field of the file—a field guaranteed to have a unique value in each record.
 Q. How to create a primary index on this ordered file?
 Assumption: Name is the ordering key field.
CS3402 6

Primary Index
 Block anchor: First record in the block
 A primary index is an ordered file whose records are of fixed length with two fields:
 Primary key value of block anchor => One index entry for each
block in the data file  Block pointer
 Q: Is the primary index a sparse index?
CS3402 7

Q. # Block Accesses WITHOUT
Indexing?
 Supposethat:
 Record size fixed and unspanned
R=100 bytes
 Block size B=1024 bytes
 Number of record r=30,000 records
 Then,weget:
 Blocking factor Bfr= B/R =10
records/block
 Number of file blocks b= ⌈r/Bfr⌉=
⌈30000/10⌉= 3,000 blocks
 A binary search on the data file need approximate ⌈ log23000 ⌉ = 12 block accesses
 Note:Onceablockisstoredinthemain memory, the searching cost of the data/records within the block can be ignored (very fast)
CS3402 8

Q. # Block Accesses WITH Indexing?
 Assume the Ordering key field size V=9 bytes and assume the record pointer size PR=6 bytes. Then:
 An index entry size RI=(V+ PR)=(9+6)=15 bytes
 Index blocking factor BfrI= B/RI  = 1024/15  = 68
entries/block
 The total number of index entries rI = number of blocks = 3000
Number of index blocks bI= ⌈ rI/ BfrI ⌉ = ⌈ 3000/68 ⌉ = 45 blocks
 Binary search needs ⌈ log2bI⌉ = ⌈ log245 ⌉ = 6 block accesses
To get the required record = 6 + 1 = 7 block accesses
CS3402 9

Clustering Index
 If file records are physically ordered on a nonkey field— which does not have a distinct value for each record—that field is called the clustering field
 The data file is called a clustered file.
 We can create a different type of index, called a clustering index, to speed up retrieval of all the records that have the same value for the clustering field.
 Q. How to create a clustering index?
CS3402 10

Clustering Index
 Clustering Index
Defined on an ordered data file, which is ordered on a non-key
CS3402
11
field (repeating values)
A clustering index is also an ordered file with two fields
One index entry for each distinct value of the field
The pointer points to the first block in the data file that has a record with that value for its clustering field

A Clustering Index Example
CS3402 12

Another Clustering Index Example
Reserving the whole block for handling overflow due to record insertion (anchoring block)
CS3402 13

Secondary Index
 Secondary Index
A secondary index provides a secondary means of accessing a
CS3402
14
file for which some primary access already exists
There can be several secondary indexes (and hence, indexing
fields) for the same file
The secondary index may be on a key field or a non-key field The index is an ordered file with two fields

The index field is not used for physical ordering of the file.

Example of Secondary Index on Key Field
CS3402 15

Example: Without Secondary Index
 Suppose:
Record size (fixed and unspanned)
R=100 bytes
Block size B=1024 bytes Number of record r=30,000 records
 Then, we get:
Blocking factor Bfr= B/R =10
records/block
Number of file blocks b= ⌈ r/Bfr ⌉
= ⌈ 30000/10 ⌉ = 3,000 blocks
Without the secondary index, a linear search on the data file requires b/2 = 3000/2 = 1,500 block accesses on
average
CS3402 16

Example: With Secondary Index
 Assume the non-ordering key field size V=9 bytes and the record pointer size PR=6 bytes.
An index entry size RI=(V+ PR)=(9+6)=15 bytes
Index blocking factor BfrI= B/RI  = 1024/15  = 68 entries/block
In a dense index, the total number of index entries = number of records = r = 30000
Number of index blocks b= ⌈ r/ BfrI ⌉ = ⌈ 30000/68 ⌉ = 442 blocks
Binary search needs ⌈ log2b ⌉ = ⌈ log2442 ⌉ = 9 block accesses
To get the required record = 9 + 1 = 10 block accesses
CS3402 17

Example of Secondary Index on non-Key Field
Using blocks of record pointers for handling same value non- key records
CS3402 18

Properties of Index Types
CS3402 19

Multi-Level Indexes
 Because a single-level index is an ordered file, we can create a primary index to the index itself
In this case, the original index file is called the first-level index and the index to the index is called the second-level index
 We can repeat the process, creating a third, fourth, …, top level until all entries of the top level fit in one disk block
 A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
CS3402 20

A Two-Level Primary Index
CS3402 21

Multi-Level Indexes Example
 Suppose we consider a multi-level index that the index blocking factor (also called fan out, fo) bfri = 68 index entries per block
Number of blocks = 30000
The number of first level blocks b1 = ceil(30000/68) = 442
The no. of second-level blocks b2 = ceil(b1/fo) = ceil(442/68) = 7 blocks
The no. of third-level blocks b3 = ceil(b2/fo) = ceil(7/68) = 1 Hence the third-level is the top level of the index t =3
To access a record, access one block at each level plus one block from the data file = t + 1 = 3 + 1 = 4 block accesses
CS3402 22

Tree Data Structure
 A tree is formed of multi-level nodes
 Except the root node, each node in the tree has one parent node and zero or more
child node
 A node that does not have any child nodes is called a leaf node
 A non-leaf node is called internal node
 A sub-tree of a node consists of the node and all its descendant
 If the leaf nodes are at different levels, the tree is called unbalanced
CS3402 23

Dynamic Multilevel Indexes Using B-Trees and B+-Trees
 Most multi-level indexes use B-tree or B+-tree data structures for solving the insertion and deletion problem
They reserves spaces in each tree node (disk block) to allow for new index entries
 In B-Tree and B+-Tree, each node corresponds to a disk block  Each node is kept between half-full and completely full
Need to restructure the tree in case of node overflow (insertion when full) or underflow (deletion when half full)
Less than half-full: Wastes of space and the tree will have many levels (higher retrieval cost)
CS3402 24

Dynamic Multilevel Indexes Using B-Trees and B+-Trees
 An insertion into a node that is not full is quite efficient
If a node is full the insertion causes a split into two nodes Splitting may propagate to higher tree levels
 A deletion is quite efficient if a node does not become less than half full
CS3402
25
Otherwise, if a deletion causes a node to become less than half full, it may merge with neighboring nodes (may propagate to higher levels)

Internal Node in B-tree Index
 Each internal node in a B-tree is of the form
, P2, , …, ,Pq> where q ≤ p  Each Pi is a tree pointer to another node in the B-tree
 Each Pri is a data pointer points to the record whose search key field value = Ki
Withineachnode,K1
where q ≤ p
Pi : tree pointer
Ki : search key field value
Within each node, K1 < K2 < ... < Kq-1 For all search key field values X in the subtree pointed by Pi, we haveKi-1 ,, …,,
Pnext>
Pri is a data pointer pointing to the record whose search field is Ki Pnext points to the next leaf node of the tree
Within each node, K1 < K2 < ... < Kq-1 Each leaf node has at least ceil(p/2) values All leaf nodes are at the same level CS3402 30 B+-tree Index (Example) CS3402 31 Internal node B+-tree example Leaf node B+-tree Index Example  Suppose the search key field is V = 9 bytes, block size is B = 512 bytes, a data pointer is Pr = 7 bytes, and a tree node pointer is P = 6 bytes  An internal node of B+-tree can have up to p tree pointers and p-1 search field value. These must fit into a single block. (p * P) + ((p – 1) * V) <= B (p * 6) + (p – 1) * 9) <= 512 p = 34 CS3402 32 B+-tree Index Example  Suppose the search key field is V = 9 bytes, block size is B = 512 bytes, a data pointer is Pr = 7 bytes, and a tree node pointer is P = 6 bytes  The order pleaf for the leaf node (pleaf *(Pr+V))+P<=B (pleaf *(7+9))+6<=512 16 * pleaf <= 506 pleaf = 31 CS3402 33 Difference between B-tree and B+-tree  In a B-tree, data pointers exist at all levels of the tree  In a B+-tree, data pointers exist at the leaf-level nodes  A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree since its entry is smaller in size The internal nodes of B+-tree contains tree pointers only CS3402 34 B+-tree Index Insertion  Step1: Descend to the leaf node where the key fits  Step 2: (Case 1): If the node has an empty space, insert the key into the node. (Case 2) If the node is already full, split it into two nodes by the middle key value, distributing the keys evenly between the two nodes, so each node is half full. CS3402 35 (Case 2a) If the node is a leaf, take a copy of the middle key value, and repeat step 2 to insert it into the parent node. (Case 2b) If the node is a non-leaf, exclude the middle key value during the split and repeat step 2 to insert this excluded value into the parent node. B+-tree Index Insertion Example #1: insert 28 into the tree below CS3402 36 The node has an empty space B+-tree Index Insertion Example #1: insert 28 into the tree below Result: insert 28 into the appropriate leaf node CS3402 37 B+-tree Index Insertion Example #2: insert 70 into the tree below CS3402 38 B+-tree Index Insertion Example #2: insert 70 into the tree below CS3402 The leaf node is full -> Split
39

B+-tree Index Insertion
Example #2: insert 70 into the tree below
Result: split leaf node from the middle, chose the middle key 60, and place it into the parent node.
CS3402 40

B+-tree Index Insertion
Example #3: insert 95 into the tree below
The leaf node is full -> Split
CS3402 41

B+-tree Index Insertion
Example #3: insert 95 into the tree below
Result: split leaf node from the middle, and split the parent node.
CS3402 42

B+-tree Index Deletion
 Step1: Descend to the leaf node where the key fits
 Step 2: Remove the required key and associated reference from the
node.
(Case 1): If the node still has half-full keys, repair the keys in parent
node to reflect the change in child node if necessary and stop.
(Case 2): If the node is less than half-full, but left or right sibling node is more than half-full, redistribute the keys between this node and its sibling. Repair the keys in the level above to represent that these nodes now have a different “split point” between them.
(Case 3): If the node is less than half-full, and left and right sibling node are just half-full, merge the node with its sibling. Repeat step 2 to delete the unnecessary key in its parent.
CS3402 43

B+-tree Index Deletion
Example #1: Delete 70 from the tree below
CS3402 44

B+-tree Index Deletion
Example #1: Delete 70 from the tree below
Result: Just delete 70 from the leaf node.
CS3402 45

B+-tree Index Deletion
Example #2: Delete 25 from the tree below
CS3402 46

B+-tree Index Deletion
Example #2: Delete 25 from the tree below
CS3402 47

B+-tree Index Deletion
Example #2: Delete 25 from the tree below
Result: Delete 25 from the leaf node and internal node,
replace it with 28 in the internal node.
CS3402 48

B+-tree Index Deletion
Example #3: Delete 60 from the tree below
CS3402 49

B+-tree Index Deletion
Example #3: Delete 60 from the tree below
CS3402
50
Less than half-full

B+-tree Index Deletion
Example #3: Delete 60 from the tree below
Result: Delete 60 from the leaf node, combine leaf nodes and then internal nodes
CS3402 51

References
 7e
Chapter 17
 Online resource (B+-tree visualization): https://www.cs.usfca.edu/~galles/visualization/BPlusTree. html
CS3402 52