CS W4111.001 Introduction to Databases Fall 2020
Computer Science Department Columbia University
1
Assumptions in Performance Analysis
• Heapfile:
• Equalitysearchisonanattributethatistheprimarykeyora candidate key for relation, so at most one match possible
• Sortedfile:
• Equalityandrangesearchareoverthesearchkeyattribute(s)on which file is sorted
• Fileiscompactedafterdeletions
• Hashedfile:
• Enoughbucketssothatinitiallywehave,onaverage,80%page occupancy (i.e., every page/bucket is only 80% full); file size is then 1.25*B (i.e., we need 25% more space than if we stored all tuples compactly)
• Wethenassumenooverflowpagesneededforthebuckets
• Equalitysearchisoverthesearchkeyattribute(s)onwhichthe hashed file is organized
16
Cost of Operations
Heap file
Scan
Equality search Range search Insertion Deletion
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk 17
Sorted file
Hashed file
Heap file
Sorted file
Hashed file
Scan
B*D
B*D
1.25*B*D
Equality search
0.5*B*D
(log2B)*D
D
Range search
B*D
(log2B + # of pages with matches)*D
1.25*B*D
Insertion
2*D
cost of equality search + B*D
2*D
Deletion
cost of equality search + D
cost of equality search + B*D
2*D
Cost of Operations
B: Number of data pages the relation occupies on disk
D: Average time to read (write) one data page from (to) disk
18
Indexes
• An index on a file speeds up selections on the search key fields for the index
• One or more of the attributes of a relation can be the search key for an index on the relation
• Search key is not the same as primary key or candidate key
19
Indexes
• An index on a file speeds up selections on the search key fields for the index
• One or more of the attributes of a relation can be the search key for an index on the relation
• Search key is not the same as primary key or candidate key
• An index organizes a collection of data entries
• A data entry consists of an rid for a tuple together with the value of the search key attribute(s) for the tuple
• An index supports efficient retrieval of all data entries k* with a given value k of the search key for the index
20
Non-leaf pages
Leaf pages, sorted on search key attributes
index entry
P0
K1
P1
K2
P2
…
Km
Pm
B+ Tree Indexes
Leaf pages contain data entries, and are chained together Non-leaf pages have index entries, only used to direct searches
21
17
Entries < 17 Entries ≥ 17
5
13
27
30
27* 29*
33* 34* 38* 39*
2* 3*
5* 7* 8*
14* 16*
22* 24*
Example B+ Tree
Note how data entries in leaf level are sorted
Root
22
17
Entries < 17 Entries ≥ 17
5
13
27
30
27* 29*
33* 34* 38* 39*
2* 3*
5* 7* 8*
14* 16*
22* 24*
Example B+ Tree
Note how data entries in leaf level are sorted
Root
• Search:Find28*(i.e.,ridofalltupleswith28asthevalueof the search key attribute)? 29*? All data entries with search key values > 15 and < 30?
• Insertion,deletion:Find(placeof)dataentryinleaf,then proceed with insertion or deletion
Sometimes need to adjust parent and propagate changes up tree
23
Hash Indexes
• Good for equality selections
• Index is a collection of buckets
• Each bucket consists of one primary page (or block) plus zero or more overflow pages (or blocks)
• Buckets contain data entries
• A data entry for a tuple is assigned to one bucket according to a hash function, applied to the search key attribute values for the tuple
• No need for “index entries” in this scheme
24
Choosing Indexes
• What indexes should we create?
• Which relations should have indexes? What attribute(s) should be the search key? Should we build several indexes?
• For each index, what kind of an index should it be?
• Hash or B+ tree index?
• Additional important alternatives covered in
COMS W4112-Database System Implementation
25
Choosing Indexes Based on Expected Workload
• For each important query in workload:
• Which relations does it access?
• Which attributes are retrieved?
• Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be?
26
Choosing Indexes Based on Expected Workload
• For each important query in workload:
• Which relations does it access?
• Which attributes are retrieved?
• Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be?
• For each important update in workload:
• Which attributes are involved in selection/join conditions? • How selective are these conditions likely to be?
• What type is the update (INSERT/DELETE/UPDATE)?
• What attributes are affected?
27
Index Selection Guidelines
• Attributes in WHERE clause of important queries are candidates for search keys
• Exact match condition suggests hash index • Range condition suggests B+ tree index
28
Index Selection Guidelines
• Attributes in WHERE clause of important queries are candidates for search keys
• Exact match condition suggests hash index • Range condition suggests B+ tree index
• We should choose indexes that benefit as many queries as possible
• We should also account for performance penalty of updating indexes, which might make certain insertions, deletions, and updates slower
• Index selection process can be automated (“self-tuning” or “auto-admin” DBMSs)
29
Overview of Query Optimization
• Goal: Choose an efficient execution plan for a query (hence avoiding the really inefficient plans)
2
Overview of Query Optimization
• Goal: Choose an efficient execution plan for a query (hence avoiding the really inefficient plans)
• Given a SQL query, define and analyze cost of many different execution plans that produce the correct answer for the query, but with potentially very different run times
• First, we will focus on how to evaluate individual relational algebra operators efficiently
• Afterwards, we will discuss how to evaluate full-fledged SQL queries efficiently
3
Overview of Query Optimization
• We will express each alternative query execution plan as a tree of relational algebra operations, annotated with choice of algorithm for each operator
• We will discuss two main issues:
• What family of execution plans should we consider for a given query?
• How do we estimate the cost (or running time) of each execution plan in this search space of possible plans, so that we can choose a good plan?
• We will not attempt to find the best possible plan, but rather to find a good plan and avoid the worst plans
4
Database Statistics and Catalogs
• Toanalyzeexecutionplans,weneedinformationaboutthe relations and indexes involved
• Suchinformationiskeptinthedatabasecatalogs,whichat least contain:
• Foreachrelation,#tuplesand#pages
• Foreachindex,#distinctvaluesforthesearchkeyattributesofthe
index and # pages occupied by index
• For each B+ tree index, index height, lowest and highest values for the search key attributes of index
• Catalogsareupdatedperiodically
• Updatingwheneverdatachangeswouldbetooexpensive;lotsof approximation anyway, so slight inconsistencies are fine
• Moredetailedinformation(e.g.,histogramsofthevaluesin some attribute) is often available as well
5
Access Paths: Methods for Retrieving Tuples of a Relation
• Alternative access paths: scan relation file (always available); or use indexes that match a selection condition
• Consider an attribute A as search key
• Case1:AB+treeindexonAmatchesselection
condition A op k if op is =, <, >, ≤, ≥
• Case2:AhashindexonAmatchesonly equality selection condition A = k
6
Processing a Selection Condition
• Consider all access paths that could be used for the relation and selection condition: scan plus any indexes that match condition
• Find the most selective access path (i.e., requiring fewest I/Os), retrieve tuples using it, and apply any remaining conditions that don’t match the index
7
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
8
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
9
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
10
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
3. Use hash table on did to find the set D of rids of all tuples with did=214; retrieve those tuples and filter with other conditions (i.e., name=‘Jane’ and sal>50K)
11
other conditions (i.e., name=‘Jane’ and did=214)
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Alternative access paths and corresponding execution plans:
1. Scan relation and check full selection condition for each retrieved tuple
2. Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’; retrieve those tuples and filter with other conditions (i.e., did=214 and sal>50K)
3. Use hash table on did to find the set D of rids of all tuples with did=214; retrieve those tuples and filter with other conditions (i.e., name=‘Jane’ and sal>50K)
4. UseB+ treeindexonsaltofindthesetSofridsofall tuples with sal>50K; retrieve those tuples and filter with
12
Processing a Selection Condition: Example
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Yet another possible plan uses all three indexes: how?
13
Employees(ssn, name, sal, did), with
B+ tree on name, hash table on did, B+ tree on sal
𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
Yet another possible plan uses all three indexes:
5. Retrieve three sets of rids using the three indexes that match the selection condition:
• Use B+ tree index on name to find the set N of rids of all tuples with name=‘Jane’
• Use hash table on did to find the set D of rids of all tuples with did=214
• Use B+ tree index on sal to find the set S of rids of all tuples with sal>50K
• Retrieve and return only the tuples in N ∩ D ∩ S (i.e., the tuples whose rid is in the intersection of the three sets of rids and hence satisfy all three conditions)
Sometimes “index-only” query execution plans are possible
Processing a Selection Condition: Example
14
Factors Influencing Choice of Plan
• # of tuples matching index condition
• index characteristics
For our example, we know (from database catalogs) that Employees has:
• 100,000 tuples in 1,000 blocks (i.e., 100 tuples/block)
• 1,000 distinct name values
• 500 distinct did values
• values of sal between 20K and 250K
Cost of plan (1), Scan? 1,000 I/Os
15
Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
16
Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
17
Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
18
Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• inworstcase,100extraI/Os(eachtuplemightbeinadifferentblock in worst case)
19
Cost of Plan (2), B+ Tree on Name 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 2 or 3 I/Os to navigate from root of B+ tree to leaf node for name=‘Jane’
• Costofretrievingallleafnodeswithdataentrieswith name=‘Jane’:
• onaverage,100,000/1,000=100suchdataentries,basedonthe number of employees and the number of distinct name values
• wecanassumethatall100dataentriesfitinoneleaf,henceno further I/Os to get them all
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• inworstcase,100extraI/Os(eachtuplemightbeinadifferentblock in worst case)
Cost of plan (2), B+ tree on name: 102 or 103 I/Os
20
Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
21
Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
22
Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• onaverage,100,000/500=200suchdataentries,basedonthe number of employees and the number of distinct did values
• inworstcase,200extraI/Os(eachtuplemightbeinadifferentblock in worst case)
23
Cost of Plan (3), Hash Table on Did 𝜎name=‘Jane’ AND did=214 AND sal>50K(Employees)
• 1I/Otogethashtablebucket/blockwithalldataentriesfor did=214
• Costofretrievingtheactualtuplesassociatedwiththe matching rids, to filter with other conditions
• onaverage,100,000/500=200suchdataentries,basedonthe number of employees and the number of distinct did values
• inworstcase,200extraI/Os(eachtuplemightbeinadifferentblock in worst case)
Cost of plan (3), hash table on did: 201 I/Os
Note this is worse than plan (2), because the condition on did is less selective than that on name
24