INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 13
Query Optimization Part I
Week 7
1
INFO20003 Database Systems © University of Melbourne
Remember this? Components of a DBMS
DBMS
Query processing module
Parser/ Compiler
Optimizer Executor
TODAY & Next time
Concurrency control module
Transaction mgr.
Lock mgr.
Crash recovery module
Concurrency
control module
Log mgr.
Storage module
File and access methods mgr.
Buffer pool mgr. Disk space mgr.
This is one of several possible architectures; each system has its own slight variations.
Index files Heap files
Database
INFO20003 Database Systems © University of Melbourne
4
Coverage
• Overview
• Query optimization • Cost estimation
Readings: Chapter 12 and 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
5
Query Processing Workflow: Review
Query
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Schema
Statistics
Query Plan Evaluator
Select *
From Blah B
Where B.blah = “foo”
INFO20003 Database Systems © University of Melbourne
6
Query Optimization
• Typically there are many ways of executing a given query, all giving the same answer
• Cost of alternative methods often varies enormously
• Query optimization aims to find the execution strategy
with the lowest cost
• We will cover:
–Relational algebra equivalences –Cost estimation
Result size estimation and reduction factors
–Enumeration of alternative plans
INFO20003 Database Systems © University of Melbourne
7
Query Plan
• A tree, with relational algebra operators as nodes and access paths as leaves
• Each operator labeled with a choice of algorithm
SELECT sname from Sailors NATURAL JOIN Reserves WHERE bid = 100 and rating > 5
Plan: bid=100
rating > 5 (On-the-fly (pipeline) (Page-Oriented
sname
(On-the-fly (pipeline))
sid=sid
Nested loops)
Reserves
(Heap Scan)
Sailors
Cost: 500+500*1000 I/O
(Heap Scan)
INFO20003 Database Systems © University of Melbourne
8
Query Optimization
• Overview
• Query optimization • Cost estimation
Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
9
A Familiar Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) Boats (bid: integer, bname: string, color: string)
INFO20003 Database Systems © University of Melbourne
10
Query Optimization Overview
Example:
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Query optimization steps:
1. Query first broken into “blocks”
2. Each block converted to relational algebra
3. Then, for each block, several alternative query plans are considered
4. Plan with the lowest estimated cost is selected
INFO20003 Database Systems © University of Melbourne
11
Step 1: Break query into query blocks
• Query block is any statement starting with select
• Query block = unit of optimization
• Typically inner most block is optimized first, then moving towards outers
SELECT S.sname FROM Sailors S WHERE S.age IN
(SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating)
Outer block
Nested block
INFO20003 Database Systems © University of Melbourne
12
Step 2: Convert query block into relational algebra expression
Query:
SELECT S.sid
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
Relational algebra:
π
σ
S.sid( B.color = “red” (Sailors Reserves Boats))
INFO20003 Database Systems © University of Melbourne
13
Step 3: Relational Algebra Equivalences
(Commute) (Cascade)
INFO20003 Database Systems © University of Melbourne
14
Examples
Selection:
σage<18 ۸ rating>5 (Sailors)
↔ σage<18 (σrating>5 (Sailors))
↔ σrating>5 (σage<18 (Sailors))
Projection:
πage,rating (Sailors) ↔ πage (πrating (Sailors)) ? πage,rating (Sailors) ↔ πage,rating (πage,rating,sid (Sailors))
INFO20003 Database Systems © University of Melbourne
15
Another Equivalence
• A projection commutes with a selection that only uses attributes retained by the projection
πage, rating, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, rating, sid (Sailors))
πage, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, sid (Sailors))
?
INFO20003 Database Systems © University of Melbourne
16
Equivalences Involving Joins
R (S T)≡ (R S) T (R S) ≡ (S R)
(Associative) (Commutative)
• These equivalences allow us to choose different join orders
INFO20003 Database Systems © University of Melbourne
17
Mixing Joins with Selections & Projections
• Converting selection + cross-product to join σS.sid = R.sid (Sailors x Reserves)
↔ Sailors S.sid = R.sid Reserves
• Selection on just attributes of S commutes with R S σS.age<18 (Sailors S.sid = R.sid Reserves)
↔ (σS.age<18 (Sailors)) S.sid = R.sid Reserves
• We can also “push down” projection (but be careful...) πS.sname (Sailors S.sid = R.sid Reserves)
↔ πS.sname (πsname,sid(Sailors) S.sid = R.sid πsid(Reserves))
INFO20003 Database Systems © University of Melbourne
18
Query Optimization
• Overview
• Query optimization • Cost estimation
Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
20
Recall: Query Optimization Overview
1. Query first broken into “blocks”
2. Each block converted to relational algebra
3. Then, for each block, several alternative query plans are considered
4. Plan with lowest estimated cost is selected
π
SELECT S.sname
FROM Reserves R, Sailors S WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
sname rating > 5
sid=sid Reserves Sailors
bid=100
π(sname)σ(bid=100 ∧ rating > 5) (Reserves Sailors)
INFO20003 Database Systems © University of Melbourne
21
Cost-based Query Sub-System
Queries
Steps 3 & 4
3. What plans are considered? 4. What is the cost of a plan?
Schema Statistics
Select *
From Blah B
Where B.blah = “foo”
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Query Plan Evaluator
INFO20003 Database Systems © University of Melbourne
22
Cost Estimation
• For each plan considered, must estimate cost:
–Must estimate size of result for each operation in tree
•Use information about input relations (from the system catalogs), and apply rules (discussed next)
–Must estimate cost of each operation in plan tree
•Depends on input cardinalities
•We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins)
•Next time we will calculate the cost of entire plans…
INFO20003 Database Systems © University of Melbourne
23
Statistics and Catalogs
• To decide on the cost, the optimizer needs information about the relations and indexes involved. This information is stored in the system catalogs.
• Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) per relation
– # distinct key values (NKeys) for each index (or relation attribute)
– low/high key values (Low/High) for each index (or relation attribute) – Index height (Height(I)) for each tree index
– # index pages (NPages(I)) for each index
• Statistics in catalogs are updated periodically
INFO20003 Database Systems © University of Melbourne
24
Result size estimation
SELECT attribute list FROM relation list
WHERE predicate1 AND … AND predicate_k
• Consider a query block:
• Maximum number of tuples in the result is the product of the cardinalities of relations in the FROM clause
• Reduction factor (RF) associated with each predicate reflects the impact of the predicate in reducing the result size. RF is also called selectivity.
INFO20003 Database Systems © University of Melbourne
25
Result size estimation calculations
• Single table selection:
ResultSize = 𝐍𝐍𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊 • Joins (over k tables):
ResultSize = ∏𝒋𝒋=𝟏𝟏..𝒌𝒌 𝑵𝑵𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹𝒋𝒋) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊
• If there are no selections (no predicates), reduction factors are simply ignored, i.e. they are ==1
INFO20003 Database Systems © University of Melbourne
26
Calculating Reduction Factors(RF)
•
Depend on the type of the predicate:
1. Col = value
RF = 1/NKeys(Col)
2. Col > value
RF = (High(Col) – value) / (High(Col) – Low(Col))
3. Col < value
RF = (val – Low(Col)) / (High(Col) – Low(Col))
4. Col_A = Col_B (for joins)
RF = 1/ (Max (NKeys(Col_A), NKeys(Col_B)))
5. In no information about Nkeys or interval, use a “magic number” 1/10
RF = 1/10
INFO20003 Database Systems © University of Melbourne
27
Let’s try it..
Sailors (S): NTuples(S) =1000, Nkeys(rating) = 10 interval [1-10], age interval [0-100], Nkeys(sid)=1000
SELECT * FROM Sailors WHERE rating = 3 AND age > 50;
Calculate result size:
NTuples(S) = 1000
RF(rating) = 1/10 = 0.1
RF(age) = (100-50)/(100-0) = 0.5
ResultSize = NTuples(S)*RF(rating)*RF(age)
= 1000*0.1*0.5= 50 tuples
INFO20003 Database Systems © University of Melbourne
28
What’s examinable
• What is query optimization/describe steps? • Equivalence classes
• Result size estimation
• Important for Assignment 3 as well
INFO20003 Database Systems © University of Melbourne
31
Next Lecture
• Query optimization Part II ‒ Planenumeration
INFO20003 Database Systems © Univers-it-y of Melbourne
32