程序代写 INFO20003 Database Systems

INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 13
Query Optimization Part I

Copyright By PowCoder代写 加微信 powcoder

INFO20003 Database Systems © University of Melbourne

Remember this? Components of a DBMS
Query processing module
Parser/ Compiler
Optimizer Executor
TODAY & Next time
Concurrency control module
Transaction mgr.
Crash recovery module
Concurrency
control module
Storage module
File and access methods mgr.
Buffer pool mgr. Disk space mgr.
This is one of several possible architectures; each system has its own slight variations.
Index files Heap files
INFO20003 Database Systems © University of Melbourne

• Overview
• Query optimization • Cost estimation
Readings: Chapter 12 and 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne

Query Processing Workflow: Review
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Statistics
Query Plan Evaluator
From Blah B
Where B.blah = “foo”
INFO20003 Database Systems © University of Melbourne

Query Optimization
• Typically there are many ways of executing a given query, all giving the same answer
• Cost of alternative methods often varies enormously
• Query optimization aims to find the execution strategy
with the lowest cost
• We will cover:
–Relational algebra equivalences –Cost estimation
Result size estimation and reduction factors
–Enumeration of alternative plans
INFO20003 Database Systems © University of Melbourne

Query Plan
• A tree, with relational algebra operators as nodes and access paths as leaves
• Each operator labeled with a choice of algorithm
SELECT sname from Sailors NATURAL JOIN Reserves WHERE bid = 100 and rating > 5
Plan: bid=100
rating > 5 (On-the-fly (pipeline) (Page-Oriented
(On-the-fly (pipeline))
Nested loops)
(Heap Scan)
Cost: 500+500*1000 I/O
(Heap Scan)
INFO20003 Database Systems © University of Melbourne

Query Optimization
• Overview
• Query optimization • Cost estimation
Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne

A Familiar Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) Boats (bid: integer, bname: string, color: string)
INFO20003 Database Systems © University of Melbourne

Query Optimization Overview
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Query optimization steps:
1. Query first broken into “blocks”
2. Each block converted to relational algebra
3. Then, for each block, several alternative query plans are considered
4. Plan with the lowest estimated cost is selected
INFO20003 Database Systems © University of Melbourne

Step 1: Break query into query blocks
• Query block is any statement starting with select
• Query block = unit of optimization
• Typically inner most block is optimized first, then moving towards outers
SELECT S.sname FROM Sailors S WHERE S.age IN
(SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating)
Outer block
Nested block
INFO20003 Database Systems © University of Melbourne

Step 2: Convert query block into relational algebra expression
SELECT S.sid
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
Relational algebra:
S.sid( B.color = “red” (Sailors Reserves Boats))
INFO20003 Database Systems © University of Melbourne

Step 3: Relational Algebra Equivalences
(Commute) (Cascade)
INFO20003 Database Systems © University of Melbourne

Selection:
σage<18 ۸ rating>5 (Sailors)
↔ σage<18 (σrating>5 (Sailors))
↔ σrating>5 (σage<18 (Sailors)) Projection: πage,rating (Sailors) ↔ πage (πrating (Sailors)) ? πage,rating (Sailors) ↔ πage,rating (πage,rating,sid (Sailors)) INFO20003 Database Systems © University of Melbourne Another Equivalence • A projection commutes with a selection that only uses attributes retained by the projection πage, rating, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, rating, sid (Sailors))
πage, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, sid (Sailors))
INFO20003 Database Systems © University of Melbourne

Equivalences Involving Joins
R (S T)≡ (R S) T (R S) ≡ (S R)
(Associative) (Commutative)
• These equivalences allow us to choose different join orders
INFO20003 Database Systems © University of Melbourne

Mixing Joins with Selections & Projections
• Converting selection + cross-product to join σS.sid = R.sid (Sailors x Reserves)
↔ Sailors S.sid = R.sid Reserves
• Selection on just attributes of S commutes with R S σS.age<18 (Sailors S.sid = R.sid Reserves) ↔ (σS.age<18 (Sailors)) S.sid = R.sid Reserves • We can also “push down” projection (but be careful...) πS.sname (Sailors S.sid = R.sid Reserves) ↔ πS.sname (πsname,sid(Sailors) S.sid = R.sid πsid(Reserves)) INFO20003 Database Systems © University of Melbourne Query Optimization • Overview • Query optimization • Cost estimation Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems INFO20003 Database Systems © University of Melbourne Recall: Query Optimization Overview 1. Query first broken into “blocks” 2. Each block converted to relational algebra 3. Then, for each block, several alternative query plans are considered 4. Plan with lowest estimated cost is selected SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
π(sname)σ(bid=100 ∧ rating > 5) (Reserves  Sailors)
sname rating > 5
sid=sid Reserves Sailors
INFO20003 Database Systems © University of Melbourne

Cost-based Query Sub-System
Steps 3 & 4
3. What plans are considered? 4. What is the cost of a plan?
Schema Statistics
From Blah B
Where B.blah = “foo”
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Query Plan Evaluator
INFO20003 Database Systems © University of Melbourne

Cost Estimation
• For each plan considered, must estimate cost:
–Must estimate size of result for each operation in tree
•Use information about input relations (from the system catalogs), and apply rules (discussed next)
–Must estimate cost of each operation in plan tree
•Depends on input cardinalities
•We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins)
•Next time we will calculate the cost of entire plans…
INFO20003 Database Systems © University of Melbourne

Statistics and Catalogs
• To decide on the cost, the optimizer needs information about the relations and indexes involved. This information is stored in the system catalogs.
• Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) per relation
– # distinct key values (NKeys) for each index (or relation attribute)
– low/high key values (Low/High) for each index (or relation attribute) – Index height (Height(I)) for each tree index
– # index pages (NPages(I)) for each index
• Statistics in catalogs are updated periodically
INFO20003 Database Systems © University of Melbourne

Result size estimation
SELECT attribute list FROM relation list
WHERE predicate1 AND … AND predicate_k
• Consider a query block:
• Maximum number of tuples in the result is the product of the cardinalities of relations in the FROM clause
• Reduction factor (RF) associated with each predicate reflects the impact of the predicate in reducing the result size. RF is also called selectivity.
INFO20003 Database Systems © University of Melbourne

Result size estimation calculations
• Single table selection:
ResultSize = 𝐍𝐍𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊 • Joins (over k tables):
ResultSize = ∏𝒋𝒋=𝟏𝟏..𝒌𝒌 𝑵𝑵𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹𝒋𝒋) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊
• If there are no selections (no predicates), reduction factors are simply ignored, i.e. they are ==1
INFO20003 Database Systems © University of Melbourne

Calculating Reduction Factors(RF)
Depend on the type of the predicate:
1. Col = value
RF = 1/NKeys(Col)
2. Col > value
RF = (High(Col) – value) / (High(Col) – Low(Col))
3. Col < value RF = (val – Low(Col)) / (High(Col) – Low(Col)) 4. Col_A = Col_B (for joins) RF = 1/ (Max (NKeys(Col_A), NKeys(Col_B))) 5. In no information about Nkeys or interval, use a “magic number” 1/10 INFO20003 Database Systems © University of Melbourne Let’s try it.. Sailors (S): NTuples(S) =1000, Nkeys(rating) = 10 interval [1-10], age interval [0-100], Nkeys(sid)=1000 SELECT * FROM Sailors WHERE rating = 3 AND age > 50;
Calculate result size:
NTuples(S) = 1000
RF(rating) = 1/10 = 0.1
RF(age) = (100-50)/(100-0) = 0.5
ResultSize = NTuples(S)*RF(rating)*RF(age)
= 1000*0.1*0.5= 50 tuples
INFO20003 Database Systems © University of Melbourne

What’s examinable
• What is query optimization/describe steps? • Equivalence classes
• Result size estimation
• Important for Assignment 3 as well
INFO20003 Database Systems © University of Melbourne

Next Lecture
• Query optimization Part II ‒ Planenumeration
INFO20003 Database Systems © Univers-it-y of Melbourne

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com