程序代写代做代考 flex PowerPoint Presentation

PowerPoint Presentation

Index Files and
B+Tree Refinements
R & G – Chapter 9-10

1

General characteristics of an index: An Outline
Issues to consider in any index structure (not just B+-trees)
Query support: what class of queries does the index allow?
Choice of Search Key
Affects the queries for which we can use an index.
Data Entry Storage
Affects performance of the index
Variable-length key tricks
Affects performance of the index
Cost Model for Index vs Heap vs Sorted File

Query support

Indexes: Basic Selection
Basic Selection:
Equality selections (op is =)
Range selections (op is one of <, >, <=, >=, BETWEEN)
B+-trees provide both
Linear Hash indexes provide only equality (but are interesting!)

4

Indexes: Other Selections
More Exotic Selections:
2-d box (current map boundaries)
2-d circle (“within 2 miles of Empire State Building”)
Common n-dimensional indexes: R-tree, KD-tree, etc.
Beware of the curse of dimensionality
Near-neighbor queries (“10 restaurants closest to Empire State Building”)
Regular expression matches, genome string matches, etc.
See Postgres’ GiST indexes for a flexible structure developed at Berkeley

Image Url

For Today
In the remainder of our discussion, we’ll focus on traditional 1-d range search
And equality as a special case
As in B+-trees

Search Key and Ordering
Can index on any ordered subset of columns. Order matters!
Determines the queries supported
In an ordered index (e.g. B+-tree) the
keys are ordered lexicographically by
the search key columns:
Ordered by the 1st column
2 items match on 1st column? Ordered by 2nd
Match on 1st and 2nd column? Ordered by 3rd
Etc.
E.g. table to right ordered lexicographically
by the search key
SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400

Search Key and Ordering, Pt 2.
Defn: A composite search key on columns (k1, k2, …, kn) “matches” a query if:
The query is a conjunction of m >= 0 equality clauses of the form:
k1 = AND k2 = AND .. AND km =
and at most 1 additional range clause of the form:
AND km+1 op , where op is one of {<, >}
Why does this “match”? Lookup and scan in lexicographic order
Can do a lookup on equality conjuncts to find start-of-range
Can do a scan of contiguous data entries at leaves
satisfy the m+1st conjunct
or if there is no m+1st conjunct
scan the entire set of matches to the first m conjuncts

8

Search Key and Ordering, Pt 3
Composite Keys: more than one column
Lexicographic order
Search a range?

Legend
Green for rows we visit that are in the range
Red for rows we visit that are not in the range
SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 4
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 5
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400

11

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 6
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 6, cont
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt. 7
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 8
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400

Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.


SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 9
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31
Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.



SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 10
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31

Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.



Search Key and Ordering, Pt 11
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31
Age > 31
Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.





SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 12
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31
Age > 31
Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.





Search Key and Ordering, Pt 13
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31
Age > 31
Salary = 300
Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.






SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Search Key and Ordering, Pt 14
Composite Keys: more than one column
Lexicographic order
Search a range?
:
Age = 31 & Salary = 400
Age = 55 & Salary > 200
Age > 31 & Salary = 400
Age = 31
Age > 31
Salary = 300
Not a lexicographic range. Either visits useless rows or has to “bounce through” the index.







SSN Last Name First Name Age Salary
123 Adams Elmo 31 $300
443 Grouch Oscar 32 $400
244 Oz Bert 55 $140
134 Sanders Ernie 55 $400
176 Grump Donald 79 $300

Data Entry Storage Intro
What is the representation of data in the index?
Actual data or pointer to the data
How is the data stored in the data file?
Clustered or unclustered with respect to the index
Big Impact on Performance
We’ll learn each of these next

Three basic alternatives for data entries in any index
Three basic alternatives for data entries in any index
Alternative 1: By Value
Alternative 2: By Reference
Alternative 3: By List of references
We’ll look in the context of B+-trees, but applies to any index

Alternative 1 Index (B+ Tree)
Record contents are stored in the index file
No need to follow pointers

17

5

24

(2, Joe)
(3, Jim)
(5, Kay)
(7, Dan)
(20, Tim)

Root Node
(24, Kit)

Data Entries
Interior Nodes
uid name
2 Joe
3 Jim
5 Kay
7 Dan
20 Tim
24 Kit

Alternative 2 Index
Alternative 2: By Reference,
We used in slides above
uid name
2 Joe
3 Jim
5 Kay
7 Dan
20 Tim
24 Kit

(2, Joe)
(3, Jim)
(5, Kay)
(7, Dan)
(20, Tim)
(24, Kit)

17

5

24

(2, [1,1])
(3, [1,2])
(5, [2,1])
(7, [2,2])
(20, [3,1])

Root Node
(24, [3,2])

Data Entries
Interior Nodes
Index File
Index Contains
(Key, Record Id) Pairs

25

Alternative 3 Index
Alternative 3: By List of references,
Alternative 3 more compact than alternative 2
For very large rid lists, single data entry spans multiple blocks

(2, Joe)
(2, Jim)
(2, Kay)
(3, Dan)
(3, Tim)
(20, Kit)

17

5

24

(2, {[1,1], [1,2], [2, 1]}
(3, {[2,2], [3, 1]})
(20, {3, 2}])

Root Node

Data Entries
Interior Nodes
Index File
Index Contains
(Key, {list of record Id}) Pairs
Key Record Id
2 {[1,1], [1,2], [1,3]}
3 4

26

Indexing By Reference
Both Alternative 2 and Alternative 3 index data by reference
By-reference is required to support multiple indexes per table
Otherwise we would be replicating entire tuples
Replicating data leads to complexity when we’re doing updates, so it’s something we want to avoid

Alternative 2
Index data entries
Alternative 3
Index data entries

27

Alternative 2 vs Alternative 3 Table Illustration
SSN Last Name First Name Salary
123 Gonzalez Amanda $400
443 Gonzalez Joey $300
244 Gonzalez Jose $140
134 Hong Sue $400

Key Record Id
Gonzalez [3, 1]
Gonzalez [3, 2]
Gonzalez [3, 3]
Hong [3, 4]

Key Record Id
Gonzalez [3, {1, 2, 3}]
Hong [3,4]

Alternative 2
Index data entries
Alternative 3
Index data entries

28

Clustered vs. Unclustered Index
By-reference indexes (Alt 2 and 3) can be clustered or unclustered
Really this is a property of the heap file associated with the index!
Clustered index:
Heap file records are kept mostly ordered according to search keys in index
Heap file order need not be perfect: this is just a performance hint
Cost of retrieving data records through index varies greatly based on whether index is clustered or not!
Note: different definition of “clustering” in AI:
grouping nearby items in n-space

29

Clustered vs. Unclustered Index Visualization 1
To build a clustered index, first sort the heap file
Leave some free space on each block for future inserts
Index entries direct search for data entries

Index
Clustered

Index
Unclustered

Clustered vs. Unclustered Index Visualization 2
To build a clustered index, first sort the heap file
Leave some free space on each block for future inserts
Index entries direct search for data entries

Index

Clustered

Index
Unclustered

Clustered vs. Unclustered Index Visualization 3

Index

Clustered

Index
Unclustered

To build a clustered index, first sort the heap file
Leave some free space on each block for future inserts
Index entries direct search for data entries

32

To build a clustered index, first sort the heap file
Leave some free space on each block for future inserts
Blocks at end of file may be needed for inserts
Order of data records is “close to”, but not identical to, the sort order
Clustered vs. Unclustered Index Visualization 5

Index
Clustered

Clustered vs. Unclustered Index Visualization 6

Index
Clustered

To build a clustered index, first sort the heap file
Leave some free space on each block for future inserts
Blocks at end of file may be needed for inserts
Order of data records is “close to”, but not identical to, the sort order

Clustered vs. Unclustered Indexes Pros
Clustered Index Pros
Efficient for range searches
Potential locality benefits
Sequential disk access, prefetching, etc.
Support certain types of compression
More soon on this topic

Clustered vs. Unclustered Indexes Cons
Clustered Cons
More expensive to maintain
Need to periodically update heap file order
Solution: on the fly or “lazily” via reorganizations
Heap file usually only packed to 2/3 to accommodate inserts

B+Tree Refinement:
Variable-Length Keys

Variable Length Keys & Records
So far we have been using integer keys
What would happen to our occupancy invariant with variable length keys?
What about data in leaf pages:
25

13

Dan Ha

Danielle Yogurt
Davey Jones
David Yu
Diana Murthy

Dan Ha : {3, 14, 30, 50, 75, 90}
Dan Ham: {1}}
Danielle Yogurt: {12, 13}
Dannon Smith: {1}

38

Redefine Occupancy Invariant
Order (d) makes little sense with variable-length entries
Different nodes have different numbers of entries.
Index pages often hold many more entries than leaf pages
Even with fixed length fields, Alternative 3 gives variable length data entries
Use a physical criterion in practice: at-least half-full
Measured in bytes
Many real systems are even sloppier than this
Only reclaim space when a page is completely empty.
Basically the deletion policy we described above…

Prefix Compress Keys?
How can we get more keys on a page?
What if we compress the keys?
Are these the same
David Jones?
Not the same partitioning of possible keys
But why would we care??
Dan

Dani
Dav

Davi
Di

Dan Ha

Danielle Yogurt
Davey Jones
David Yu
Diana Murthy

40

Prefix Key Compression
What if we compress starting at leaf:
On split, determine minimum splitting prefix and copy up
Sarah Z

Sarah Lee

Sarah Manning

Sarah Zhu
Sarita Adve
Saruman The White

Sarah Manning
Sarah Zhu
Sarita Adve
Saruman The White
Sarah Lee

Suffix Key Compression
All keys have large common prefix
Move common prefix to header, leave only (compressed) suffix next to pointer
When might this be especially useful?
Composite Keys. Example?

“Sar”
ah L

ah M

ah Z

i

u

Saruman

Sarah L

Sarah M

Sarah Z

Sarita

Suffix compression
Prefix compression as on previous slide

B+-TREE COSTS

Recall: Cost of Operations
Can we do better with indexes?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

Cost of Operations
Can we do better with indexes?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
Heap File Sorted File Clustered Index
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

Cost of Operations, cont
Heap File Sorted File Clustered Index
Scan all records B*D B*D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block

Clustered vs. Unclustered Index Assumptions
Store data by reference (Alternative 2)
Clustered index with 2/3 full heap file pages
Clustered  Heap file is initially sorted
Fan-out (F): relatively large. Why?
Page of pairs ~ O(R)
Assume static index
Index

1, 2, _
3, 4, _
5, 6, _
7, 8, _
9, 10, _

Scan all the Records
Recall assumption from before regarding clustered indexes: heap file pages only 2/3 full.
Do we need an Index?
No
Cost? = 1.5 * B * D
Why?
Index

1, 2, _
3, 4, _
5, 6, _
7, 8, _
9, 10, _

Cost of Operations: Scan
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block

Cost of Operations: Equality Search?
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Find the record with key 3, pt 1
Search the index:= (logF (BR/E) + 1) * D
BR is the total number of records; E is the #records per leaf
the +1 is an “off by 1” thing: catches the cost of the root
E.g. F = 4, BR/E = 16: root, intermediate, leaf levels.
Log4(16) = 2, and I/O cost is 3!
Index

1, 2, _
3, 4, _
5, 6, _
7, 8, _
9, 10, _

51

Find the record with key 3, pt 2
Search the index:= (logF (BR/E) + 1) * D
Lookup record in heap file by record-id = 1 * D
Recall record-id = Index

1, 2, _
3, 4, _
5, 6, _
7, 8, _
9, 10, _

52

Cost of Operations: Equality Search
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

53

Cost of Operations: Range Search?
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Find keys between 3 and 7
Search the index: = (logF (BR/E) + 1) * D
Scan the leaf level and lookup each matching record in the heap file by record-id
Recall record-id = Heap file access: (3/2 * #pages) * D
Scanning the leaf level is similar to heap file access: assume same (3/2 * #pages) * D
In total (logF (BR/E) + 3 * # pages) * D since one leaf page is overcounted in searching index and scanning leaf level
Index

1, 2, _
3, 4, _
5, 6, _
7, 8, _
9, 10, _
3, 4, _
5, 6, _
7, 8, _

Cost of Operations: Range Search
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D (logF(BR/E)+3*pages)*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Cost of Operations: Insert?
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D (logF(BR/E)+3*pages)*D
Insert 2*D ((log2B) + B)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Cost of Operations: Insert
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D (logF(BR/E)+3*pages)*D
Insert 2*D ((log2B) + B)*D (logF(BR/E)+4)*D
Delete (0.5*B+1)*D ((log2B) + B)*D

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Cost of Operations: Delete
Heap File Sorted File Clustered Index
Scan all records B*D B*D 3/2 * B * D
Equality Search 0.5*B*D (log2B)*D (logF(BR/E)+2)*D
Range Search B*D ((log2B)+pages))*D (logF(BR/E)+3*pages)*D
Insert 2*D ((log2B) + B)*D (logF(BR/E)+4)*D
Delete (0.5*B+1)*D ((log2B) + B)*D (logF(BR/E)+4)*D

Why “+4” in Insert/Delete?
B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Cost of Operations: Big O Notation
Heap File Sorted File Clustered Index
Scan all records O(B) O(B) O(B)
Equality Search O(B) O(log2B) O(logFB)
Range Search O(B) O(log2B) O(logFB)
Insert O(1) O(B) O(logFB)
Delete O(B) O(B) O(logFB)

B: The number of data blocks
R: Number of records per block
D: Average time to read/write disk block
F: Average internal node fanout
E: Average # data entries per leaf

Constant factors
Assume you can do 100 sequential I/Os in the time of 1 random I/O
For a particular lookup, is a B+-tree better than a full-table scan?
Had better be very “selective”
Visit < ~1% of pages! Or do mostly sequential I/O at leaf level Clustered index Or use SSD SSDs make indexes attractive Especially for read-mostly workloads Summary Query Structure Understand composite search keys Lexicographic order and search key prefixes Data Storage Data Entries: Alt 1 (tuples), Alt 2 (recordIds), Alt 3 (lists of recordIds) Clustered vs. Unclustered Only Alt 2 & 3! Summary Cont Variable length key refinements Fill factors for variable-length keys Prefix and suffix key compression B+-tree Cost Model Attractive big-O Don’t forget constant factors of random I/O Hard to beat sequential I/O of scans unless very selective Indexes beyond B+-trees for more complex searches /docProps/thumbnail.jpeg