程序代写代做代考 algorithm concurrency database compiler INFO20003 Database Systems

INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 13
Query Optimization Part I
Week 7
1
INFO20003 Database Systems © University of Melbourne

Remember this? Components of a DBMS
DBMS
Query processing module
Parser/ Compiler
Optimizer Executor
TODAY & Next time
Concurrency control module
Transaction mgr.
Lock mgr.
Crash recovery module
Concurrency
control module
Log mgr.
Storage module
File and access methods mgr.
Buffer pool mgr. Disk space mgr.
This is one of several possible architectures; each system has its own slight variations.
Index files Heap files
Database
INFO20003 Database Systems © University of Melbourne
4

Coverage
• Overview
• Query optimization • Cost estimation
Readings: Chapter 12 and 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
5

Query Processing Workflow: Review
Query
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Schema
Statistics
Query Plan Evaluator
Select *
From Blah B
Where B.blah = “foo”
INFO20003 Database Systems © University of Melbourne
6

Query Optimization
• Typically there are many ways of executing a given query, all giving the same answer
• Cost of alternative methods often varies enormously
• Query optimization aims to find the execution strategy
with the lowest cost
• We will cover:
–Relational algebra equivalences –Cost estimation
Result size estimation and reduction factors
–Enumeration of alternative plans
INFO20003 Database Systems © University of Melbourne
7

Query Plan
• A tree, with relational algebra operators as nodes and access paths as leaves
• Each operator labeled with a choice of algorithm
SELECT sname from Sailors NATURAL JOIN Reserves WHERE bid = 100 and rating > 5
Plan: bid=100
rating > 5 (On-the-fly (pipeline) (Page-Oriented
sname
(On-the-fly (pipeline))
sid=sid
Nested loops)
Reserves
(Heap Scan)
Sailors
Cost: 500+500*1000 I/O
(Heap Scan)
INFO20003 Database Systems © University of Melbourne
8

Query Optimization
• Overview
• Query optimization • Cost estimation
Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
9

A Familiar Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) Boats (bid: integer, bname: string, color: string)
INFO20003 Database Systems © University of Melbourne
10

Query Optimization Overview
Example:
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Query optimization steps:
1. Query first broken into “blocks”
2. Each block converted to relational algebra
3. Then, for each block, several alternative query plans are considered
4. Plan with the lowest estimated cost is selected
INFO20003 Database Systems © University of Melbourne
11

Step 1: Break query into query blocks
• Query block is any statement starting with select
• Query block = unit of optimization
• Typically inner most block is optimized first, then moving towards outers
SELECT S.sname FROM Sailors S WHERE S.age IN
(SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating)
Outer block
Nested block
INFO20003 Database Systems © University of Melbourne
12

Step 2: Convert query block into relational algebra expression
Query:
SELECT S.sid
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”
Relational algebra:
π
σ
S.sid( B.color = “red” (Sailors Reserves Boats))
INFO20003 Database Systems © University of Melbourne
13

Step 3: Relational Algebra Equivalences
(Commute) (Cascade)
INFO20003 Database Systems © University of Melbourne
14

Examples
Selection:
σage<18 ۸ rating>5 (Sailors)
↔ σage<18 (σrating>5 (Sailors))
↔ σrating>5 (σage<18 (Sailors)) Projection: πage,rating (Sailors) ↔ πage (πrating (Sailors)) ? πage,rating (Sailors) ↔ πage,rating (πage,rating,sid (Sailors)) INFO20003 Database Systems © University of Melbourne 15 Another Equivalence • A projection commutes with a selection that only uses attributes retained by the projection πage, rating, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, rating, sid (Sailors))
πage, sid (σage<18 ۸ rating>5 (Sailors))
↔ σage<18 ۸ rating>5 (πage, sid (Sailors))
?
INFO20003 Database Systems © University of Melbourne
16

Equivalences Involving Joins
R (S T)≡ (R S) T (R S) ≡ (S R)
(Associative) (Commutative)
• These equivalences allow us to choose different join orders
INFO20003 Database Systems © University of Melbourne
17

Mixing Joins with Selections & Projections
• Converting selection + cross-product to join σS.sid = R.sid (Sailors x Reserves)
↔ Sailors S.sid = R.sid Reserves
• Selection on just attributes of S commutes with R S σS.age<18 (Sailors S.sid = R.sid Reserves) ↔ (σS.age<18 (Sailors)) S.sid = R.sid Reserves • We can also “push down” projection (but be careful...) πS.sname (Sailors S.sid = R.sid Reserves) ↔ πS.sname (πsname,sid(Sailors) S.sid = R.sid πsid(Reserves)) INFO20003 Database Systems © University of Melbourne 18 Query Optimization • Overview • Query optimization • Cost estimation Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems INFO20003 Database Systems © University of Melbourne 20 Recall: Query Optimization Overview 1. Query first broken into “blocks” 2. Each block converted to relational algebra 3. Then, for each block, several alternative query plans are considered 4. Plan with lowest estimated cost is selected π SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
sname rating > 5
sid=sid Reserves Sailors
bid=100
π(sname)σ(bid=100 ∧ rating > 5) (Reserves  Sailors)
INFO20003 Database Systems © University of Melbourne
21

Cost-based Query Sub-System
Queries
Steps 3 & 4
3. What plans are considered? 4. What is the cost of a plan?
Schema Statistics
Select *
From Blah B
Where B.blah = “foo”
Query Parser
Query Optimizer
Plan Generator
Plan Cost Estimator
Catalog Manager
Query Plan Evaluator
INFO20003 Database Systems © University of Melbourne
22

Cost Estimation
• For each plan considered, must estimate cost:
–Must estimate size of result for each operation in tree
•Use information about input relations (from the system catalogs), and apply rules (discussed next)
–Must estimate cost of each operation in plan tree
•Depends on input cardinalities
•We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins)
•Next time we will calculate the cost of entire plans…
INFO20003 Database Systems © University of Melbourne
23

Statistics and Catalogs
• To decide on the cost, the optimizer needs information about the relations and indexes involved. This information is stored in the system catalogs.
• Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) per relation
– # distinct key values (NKeys) for each index (or relation attribute)
– low/high key values (Low/High) for each index (or relation attribute) – Index height (Height(I)) for each tree index
– # index pages (NPages(I)) for each index
• Statistics in catalogs are updated periodically
INFO20003 Database Systems © University of Melbourne
24

Result size estimation
SELECT attribute list FROM relation list
WHERE predicate1 AND … AND predicate_k
• Consider a query block:
• Maximum number of tuples in the result is the product of the cardinalities of relations in the FROM clause
• Reduction factor (RF) associated with each predicate reflects the impact of the predicate in reducing the result size. RF is also called selectivity.
INFO20003 Database Systems © University of Melbourne
25

Result size estimation calculations
• Single table selection:
ResultSize = 𝐍𝐍𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊 • Joins (over k tables):
ResultSize = ∏𝒋𝒋=𝟏𝟏..𝒌𝒌 𝑵𝑵𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻𝑻(𝑹𝑹𝒋𝒋) ∏𝒊𝒊=𝟏𝟏..𝒏𝒏 𝑹𝑹𝑹𝑹𝒊𝒊
• If there are no selections (no predicates), reduction factors are simply ignored, i.e. they are ==1
INFO20003 Database Systems © University of Melbourne
26

Calculating Reduction Factors(RF)

Depend on the type of the predicate:
1. Col = value
RF = 1/NKeys(Col)
2. Col > value
RF = (High(Col) – value) / (High(Col) – Low(Col))
3. Col < value RF = (val – Low(Col)) / (High(Col) – Low(Col)) 4. Col_A = Col_B (for joins) RF = 1/ (Max (NKeys(Col_A), NKeys(Col_B))) 5. In no information about Nkeys or interval, use a “magic number” 1/10 RF = 1/10 INFO20003 Database Systems © University of Melbourne 27 Let’s try it.. Sailors (S): NTuples(S) =1000, Nkeys(rating) = 10 interval [1-10], age interval [0-100], Nkeys(sid)=1000 SELECT * FROM Sailors WHERE rating = 3 AND age > 50;
Calculate result size:
NTuples(S) = 1000
RF(rating) = 1/10 = 0.1
RF(age) = (100-50)/(100-0) = 0.5
ResultSize = NTuples(S)*RF(rating)*RF(age)
= 1000*0.1*0.5= 50 tuples
INFO20003 Database Systems © University of Melbourne
28

What’s examinable
• What is query optimization/describe steps? • Equivalence classes
• Result size estimation
• Important for Assignment 3 as well
INFO20003 Database Systems © University of Melbourne
31

Next Lecture
• Query optimization Part II ‒ Planenumeration
INFO20003 Database Systems © Univers-it-y of Melbourne
32