程序代写代做代考 chain algorithm database CS W4111.001 Introduction to Databases Fall 2020

CS W4111.001 Introduction to Databases Fall 2020
Computer Science Department Columbia University
1

Assumptions in Performance Analysis
• Heapfile:
• Equalitysearchisonanattributethatistheprimarykeyora candidate key for relation, so at most one match possible
• Sortedfile:
• Equalityandrangesearchareoverthesearchkeyattribute(s)on which file is sorted
• Fileiscompactedafterdeletions
• Hashedfile:
• Enoughbucketssothatinitiallywehave,onaverage,80%page occupancy (i.e., every page/bucket is only 80% full); file size is then 1.25*B (i.e., we need 25% more space than if we stored all tuples compactly)
• Wethenassumenooverflowpagesneededforthebuckets
• Equalitysearchisoverthesearchkeyattribute(s)onwhichthe hashed file is organized
16

Cost of Operations
Heap file
Scan
Equality search Range search Insertion Deletion
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk 17
Sorted file
Hashed file

Heap file
Sorted file
Hashed file
Scan
B*D
B*D
1.25*B*D
Equality search
0.5*B*D
(log2B)*D
D
Range search
B*D
(log2B + # of pages with matches)*D
1.25*B*D
Insertion
2*D
cost of equality search + B*D
2*D
Deletion
cost of equality search + D
cost of equality search + B*D
2*D
Cost of Operations
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk
18

Indexes
• An index on a file speeds up selections on the search key fields for the index
• One or more of the attributes of a relation can be the search key for an index on the relation
• Search key is not the same as primary key or candidate key
19

Indexes
• An index on a file speeds up selections on the search key fields for the index
• One or more of the attributes of a relation can be the search key for an index on the relation
• Search key is not the same as primary key or candidate key
• An index organizes a collection of data entries
• A data entry consists of an rid for a tuple together with the value of the search key attribute(s) for the tuple
• An index supports efficient retrieval of all data entries k* with a given value k of the search key for the index
20

Non-leaf pages
Leaf pages, sorted on search key attributes
index entry
P0
K1
P1
K2
P2

Km
Pm
B+ Tree Indexes
Leaf pages contain data entries, and are chained together Non-leaf pages have index entries, only used to direct searches
21

17
Entries < 17 Entries ≥ 17 5 13 27 30 27* 29* 33* 34* 38* 39* 2* 3* 5* 7* 8* 14* 16* 22* 24* Example B+ Tree Note how data entries in leaf level are sorted Root 22 17 Entries < 17 Entries ≥ 17 5 13 27 30 27* 29* 33* 34* 38* 39* 2* 3* 5* 7* 8* 14* 16* 22* 24* Example B+ Tree Note how data entries in leaf level are sorted Root • Search:Find28*(i.e.,ridofalltupleswith28asthevalueof the search key attribute)? 29*? All data entries with search key values > 15 and < 30? • Insertion,deletion:Find(placeof)dataentryinleaf,then proceed with insertion or deletion Sometimes need to adjust parent and propagate changes up tree 23 Hash Indexes • Good for equality selections • Index is a collection of buckets • Each bucket consists of one primary page (or block) plus zero or more overflow pages (or blocks) • Buckets contain data entries • A data entry for a tuple is assigned to one bucket according to a hash function, applied to the search key attribute values for the tuple • No need for “index entries” in this scheme 24 Choosing Indexes • What indexes should we create? • Which relations should have indexes? What attribute(s) should be the search key? Should we build several indexes? • For each index, what kind of an index should it be? • Hash or B+ tree index? • Additional important alternatives covered in COMS W4112-Database System Implementation 25 Choosing Indexes Based on Expected Workload • For each important query in workload: • Which relations does it access? • Which attributes are retrieved? • Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be? 26 Choosing Indexes Based on Expected Workload • For each important query in workload: • Which relations does it access? • Which attributes are retrieved? • Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be? • For each important update in workload: • Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be? • What type is the update (INSERT/DELETE/UPDATE)? • What attributes are affected? 27 Index Selection Guidelines • Attributes in WHERE clause of important queries are candidates for search keys • Exact match condition suggests hash index • Range condition suggests B+ tree index 28 Index Selection Guidelines • Attributes in WHERE clause of important queries are candidates for search keys • Exact match condition suggests hash index • Range condition suggests B+ tree index • We should choose indexes that benefit as many queries as possible • We should also account for performance penalty of updating indexes, which might make certain insertions, deletions, and updates slower • Index selection process can be automated (“self-tuning” or “auto-admin” DBMSs) 29 Overview of Query Optimization • Goal: Choose an efficient execution plan for a query (hence avoiding the really inefficient plans) 2 Overview of Query Optimization • Goal: Choose an efficient execution plan for a query (hence avoiding the really inefficient plans) • Given a SQL query, define and analyze cost of many different execution plans that produce the correct answer for the query, but with potentially very different run times • First, we will focus on how to evaluate individual relational algebra operators efficiently • Afterwards, we will discuss how to evaluate full-fledged SQL queries efficiently 3 Overview of Query Optimization • We will express each alternative query execution plan as a tree of relational algebra operations, annotated with choice of algorithm for each operator • We will discuss two main issues: • What family of execution plans should we consider for a given query? • How do we estimate the cost (or running time) of each execution plan in this search space of possible plans, so that we can choose a good plan? • We will not attempt to find the best possible plan, but rather to find a good plan and avoid the worst plans 4 Database Statistics and Catalogs • Toanalyzeexecutionplans,weneedinformationaboutthe relations and indexes involved • Suchinformationiskeptinthedatabasecatalogs,whichat least contain: • Foreachrelation,#tuplesand#pages • Foreachindex,#distinctvaluesforthesearchkeyattributesofthe index and # pages occupied by index • For each B+ tree index, index height, lowest and highest values for the search key attributes of index • Catalogsareupdatedperiodically • Updatingwheneverdatachangeswouldbetooexpensive;lotsof approximation anyway, so slight inconsistencies are fine • Moredetailedinformation(e.g.,histogramsofthevaluesin some attribute) is often available as well 5 Access Paths: Methods for Retrieving Tuples of a Relation • Alternative access paths: scan relation file (always available); or use indexes that match a selection condition • Consider an attribute A as search key • Case1:AB+treeindexonAmatchesselection condition A op k if op is =, <, >, ≤, ≥
• Case2:AhashindexonAmatchesonly equality selection condition A = k
6

Processing a Selection Condition
• Consider all access paths that could be used for the relation and selection condition: scan plus any indexes that match condition
• Find the most selective access path (i.e., requiring fewest I/Os), retrieve tuples using it, and apply any remaining conditions that don’t match the index
7

Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
8

Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
9

Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
10

Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
3. Use hash table on did to find the set D of rids of all tuples with did=214; retrieve those tuples and filter with other conditions (i.e., name=‘Jane’ and sal>50K)
11

other conditions (i.e., name=‘Jane’ and did=214)
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
3. Use hash table on did to find the set D of rids of all tuples with did=214; retrieve those tuples and filter with other conditions (i.e., name=‘Jane’ and sal>50K)
4. UseB+ treeindexonsaltofindthesetSofridsofall tuples with sal>50K; retrieve those tuples and filter with
12

Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Yet another possible plan uses all three indexes: how?
13

Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Yet another possible plan uses all three indexes:
5. Retrieve three sets of rids using the three indexes that match the selection condition:
• Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’
• Use hash table on did to find the set D of rids of all tuples with did=214
• Use B+ tree index on sal to find the set S of rids of all tuples with sal>50K
• Retrieve and return only the tuples in N ∩ D ∩ S (i.e., the tuples whose rid is in the intersection of the three sets of rids and hence satisfy all three conditions)
Sometimes “index-only” query execution plans are possible
Processing a Selection Condition: Example
14

Factors Influencing Choice of Plan
• # of tuples matching index condition
• index characteristics
For our example, we know (from database catalogs) that Employees has:
• 100,000 tuples in 1,000 blocks (i.e., 100 tuples/block)
• 1,000 distinct name values
• 500 distinct did values
• values of sal between 20K and 250K
Cost of plan (1), Scan? 1,000 I/Os
15

Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
16

Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
17

Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
18

Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• inworstcase,100extraI/Os(eachtuplemightbeinadifferentblock in worst case)
19

Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• inworstcase,100extraI/Os(eachtuplemightbeinadifferentblock in worst case)
Cost of plan (2), B+ tree on name: 102 or 103 I/Os
20

Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
21

Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
22

Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• onaverage,100,000/500=200suchdataentries,basedonthe number of employees and the number of distinct did values
• inworstcase,200extraI/Os(eachtuplemightbeinadifferentblock in worst case)
23

Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• onaverage,100,000/500=200suchdataentries,basedonthe number of employees and the number of distinct did values
• inworstcase,200extraI/Os(eachtuplemightbeinadifferentblock in worst case)
Cost of plan (3), hash table on did: 201 I/Os
Note this is worse than plan (2), because the condition on did is less selective than that on name
24