INFO20003 Database Systems
INFO20003 Database Systems 1© University of Melbourne 2018
INFO20003 Database Systems
Lecture 13
Query Optimization Part I
Semester 2 2018, Week 7
Dr Renata Borovica-Gajic
INFO20003 Database Systems 2© University of Melbourne 2018
A1 Collective Feedback
• Assignment 1 Feedback will be sent by the end of this week
• Best way to prepare for the MST is to look at your mistakes, compare with
the provided solution, and try to mark yourself against the provided
assessments criteria
–Self-reflection is an important part of learning
–Pay attention to business rules that your model does not support
–Let’s have a look at one potential solution
• This assignment was a preview of a business analyst’s life
INFO20003 Database Systems 4© University of Melbourne 2018
Remember this? Components of a DBMS
DBMS
Index files
Heap files
Database
Concurrency
control module
File and access
methods mgr.
Buffer pool mgr.
Disk space mgr.
Storage module Concurrency
control module
Transaction
mgr.
Lock mgr.
Crash recovery
module
Log mgr.
Query processing module
Parser/
Compiler
Optimizer Executor
This is one of several possible architectures;
each system has its own slight variations.
TODAY &
Next time
INFO20003 Database Systems 5© University of Melbourne 2018
Coverage
• Overview
• Query optimization
• Cost estimation
Readings: Chapter 12 and 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems 6© University of Melbourne 2018
Query Processing Workflow: Review
Query Parser
Query Optimizer
Plan
Generator
Plan Cost
Estimator
Query Plan Evaluator
Catalog Manager
Schema Statistics
Select *
From Blah B
Where B.blah = “foo”
Query
INFO20003 Database Systems 7© University of Melbourne 2018
Query Optimization
• Typically there are many ways of executing a given query,
all giving the same answer
• Cost of alternative methods often varies enormously
• Query optimization aims to find the execution strategy
with the lowest cost
• We will cover:
–Relational algebra equivalences
–Cost estimation
Result size estimation and reduction factors
–Enumeration of alternative plans
INFO20003 Database Systems 8© University of Melbourne 2018
Query Plan
• A tree, with relational algebra operators as nodes
• Each operator labeled with a choice of algorithm
Sailors Reserves
sid=sid
bid=100 rating > 5
sname
(Page-Oriented
Nested loops)
(On-the-fly)
(On-the-fly)
Plan:
Cost: 500+500*1000 I/O
SELECT sname from Sailors NATURAL JOIN Reserves
WHERE bid = 100 and rating > 5
(Heap Scan) (Heap Scan)
INFO20003 Database Systems 9© University of Melbourne 2018
Query Optimization
• Overview
• Query optimization
• Cost estimation
Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems 10© University of Melbourne 2018
A Familiar Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Boats (bid: integer, bname: string, color: string)
INFO20003 Database Systems 11© University of Melbourne 2018
Query Optimization Overview
1. Query first broken into “blocks”
2. Each block converted to relational algebra
3. Then, for each block, several alternative query plans are
considered
4. Plan with the lowest estimated cost is selected
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
Example:
Query optimization steps:
INFO20003 Database Systems 12© University of Melbourne 2018
Step 1: Break query into query blocks
• Query block is any statement starting with select
• Query block = unit of optimization
• Typically inner most block is optimized first, then
moving towards outers
SELECT S.sname
FROM Sailors S
WHERE S.age IN
(SELECT MAX (S2.age)
FROM Sailors S2
GROUP BY S2.rating)
Nested block
Outer block
INFO20003 Database Systems 13© University of Melbourne 2018
Step 2: Convert query block into
relational algebra expression
SELECT S.sid
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
p
S.sid( B.color = “red” (Sailors Reserves Boats))s
Query:
Relational algebra:
INFO20003 Database Systems 14© University of Melbourne 2018
Step 3: Relational Algebra Equivalences
(Commute)
(Cascade)
INFO20003 Database Systems 15© University of Melbourne 2018
Examples
σage<18 ٨ rating>5 (Sailors)
↔ σage<18 (σrating>5 (Sailors))
↔ σrating>5 (σage<18 (Sailors)) πage,rating (Sailors) ↔ πage (πrating (Sailors)) ? πage,rating (Sailors) ↔ πage,rating (πage,rating,sid (Sailors)) Selection: Projection: INFO20003 Database Systems 16© University of Melbourne 2018 Another Equivalence • A projection commutes with a selection that only uses attributes retained by the projection πage, rating, sid (σage<18 ٨ rating>5 (Sailors))
↔ σage<18 ٨ rating>5 (πage, rating, sid (Sailors))
πage, sid (σage<18 ٨ rating>5 (Sailors))
↔ σage<18 ٨ rating>5 (πage, sid (Sailors))
?
INFO20003 Database Systems 17© University of Melbourne 2018
R (S T) (R S) T
Equivalences Involving Joins
• These equivalences allow us to choose different join
orders
(Associative)
(R S) (S R) (Commutative)
INFO20003 Database Systems 18© University of Melbourne 2018
Mixing Joins with Selections & Projections
• Converting selection + cross-product to join
• Selection on just attributes of S commutes with R S
• We can also “push down” projection (but be careful…)
σS.sid = R.sid (Sailors x Reserves)
↔ Sailors S.sid = R.sid Reserves
σS.age<18 (Sailors S.sid = R.sid Reserves) ↔ (σS.age<18 (Sailors)) S.sid = R.sid Reserves πS.sname (Sailors S.sid = R.sid Reserves) ↔ πS.sname (πsname,sid(Sailors) S.sid = R.sid πsid(Reserves)) INFO20003 Database Systems 20© University of Melbourne 2018 Query Optimization • Overview • Query optimization • Cost estimation Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems INFO20003 Database Systems 21© University of Melbourne 2018 Recall: Query Optimization Overview 1. Query first broken into “blocks” 2. Each block converted to relational algebra 3. Then, for each block, several alternative query plans are considered 4. Plan with lowest estimated cost is selected SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Reserves Sailors
sid=sid
bid=100 rating > 5
sname
p(sname)s(bid=100 rating > 5) (Reserves Sailors)
π
INFO20003 Database Systems 22© University of Melbourne 2018
Cost-based Query Sub-System
Query Parser
Query Optimizer
Plan
Generator
Plan Cost
Estimator
Query Plan Evaluator
Catalog Manager
Schema Statistics
Select *
From Blah B
Where B.blah = “foo”
Queries
Steps 3 & 4
3. What plans are considered?
4. What is the cost of a plan?
INFO20003 Database Systems 23© University of Melbourne 2018
Cost Estimation
• For each plan considered, must estimate cost:
–Must estimate size of result for each operation in tree
•Use information about input relations (from the system
catalogs), and apply rules (discussed next)
–Must estimate cost of each operation in plan tree
•Depends on input cardinalities
•We’ve already discussed how to estimate the cost of
operations (sequential scan, index scan, joins)
•Next time we will calculate the cost of entire plans…
INFO20003 Database Systems 24© University of Melbourne 2018
Statistics and Catalogs
• To decide on the cost, the optimizer needs information
about the relations and indexes involved. This information
is stored in the system catalogs.
• Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) per relation
– # distinct key values (NKeys) for each index (or relation attribute)
– low/high key values (Low/High) for each index (or relation attribute)
– Index height (Height(I)) for each tree index
– # index pages (NPages(I)) for each index
• Statistics in catalogs are updated periodically
INFO20003 Database Systems 25© University of Melbourne 2018
Result size estimation
• Consider a query block:
• Maximum number of tuples in the result is the product of the
cardinalities of relations in the FROM clause
• Reduction factor (RF) associated with each predicate reflects
the impact of the predicate in reducing the result size. RF is
also called selectivity.
SELECT attribute list
FROM relation list
WHERE predicate1 AND … AND predicate_k
INFO20003 Database Systems 26© University of Melbourne 2018
Result size estimation calculations
• Single table selection:
ResultSize = 𝐍𝑻𝒖𝒑𝒍𝒆𝒔(𝑹) 𝒊=𝟏..𝒏𝑹𝑭𝒊
• Joins (over k tables):
ResultSize = 𝒋=𝟏..𝒌𝑵𝑻𝒖𝒑𝒍𝒆𝒔(𝑹𝒋) 𝒊=𝟏..𝒏𝑹𝑭𝒊
• If there are no selections (no predicates), reduction factors are
simply ignored, i.e. they are ==1
INFO20003 Database Systems 27© University of Melbourne 2018
Calculating Reduction Factors(RF)
1. Col = value
RF = 1/NKeys(Col)
2. Col > value
RF = (High(Col) – value) / (High(Col) – Low(Col))
3. Col < value RF = (val – Low(Col)) / (High(Col) – Low(Col)) 4. Col_A = Col_B (for joins) RF = 1/ (Max (NKeys(Col_A), NKeys(Col_B))) 5. In no information about Nkeys or interval, use a “magic number” 1/10 RF = 1/10 • Depend on the type of the predicate: INFO20003 Database Systems 28© University of Melbourne 2018 Let’s try it.. Sailors (S): NTuples(S) =1000, Nkeys(rating) = 10 interval [1-10], age interval [0-100], Nkeys(sid)=1000 SELECT * FROM Sailors WHERE rating = 3 AND age > 50;
Calculate result size:
NTuples(S) = 1000
RF(rating) = 1/10 = 0.1
RF(age) = (100-50)/(100-0) = 0.5
ResultSize = NTuples(S)*RF(rating)*RF(age)
= 1000*0.1*0.5= 50 tuples
INFO20003 Database Systems 29© University of Melbourne 2018
INFO20003 Database Systems 31© University of Melbourne 2018
What’s examinable
• What is query optimization/describe steps?
• Equivalence classes
• Result size estimation
• Important for Assignment 3 as well
INFO20003 Database Systems 32© University of Melbourne 2018
Next Lecture
– –
• Query optimization Part II
‒ Plan enumeration