程序代写代做代考 concurrency database algorithm compiler INFO20003 Database Systems

INFO20003 Database Systems

INFO20003 Database Systems 1© University of Melbourne 2018

INFO20003 Database Systems

Lecture 13

Query Optimization Part I

Semester 2 2018, Week 7

Dr Renata Borovica-Gajic

INFO20003 Database Systems 2© University of Melbourne 2018

A1 Collective Feedback

• Assignment 1 Feedback will be sent by the end of this week

• Best way to prepare for the MST is to look at your mistakes, compare with

the provided solution, and try to mark yourself against the provided

assessments criteria

–Self-reflection is an important part of learning

–Pay attention to business rules that your model does not support

–Let’s have a look at one potential solution

• This assignment was a preview of a business analyst’s life

INFO20003 Database Systems 4© University of Melbourne 2018

Remember this? Components of a DBMS

DBMS

Index files

Heap files

Database

Concurrency

control module

File and access

methods mgr.

Buffer pool mgr.

Disk space mgr.

Storage module Concurrency

control module

Transaction

mgr.

Lock mgr.

Crash recovery

module

Log mgr.

Query processing module

Parser/

Compiler
Optimizer Executor

This is one of several possible architectures;

each system has its own slight variations.

TODAY &

Next time

INFO20003 Database Systems 5© University of Melbourne 2018

Coverage

• Overview

• Query optimization

• Cost estimation

Readings: Chapter 12 and 15, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 6© University of Melbourne 2018

Query Processing Workflow: Review

Query Parser

Query Optimizer

Plan

Generator

Plan Cost

Estimator

Query Plan Evaluator

Catalog Manager

Schema Statistics

Select *

From Blah B

Where B.blah = “foo”

Query

INFO20003 Database Systems 7© University of Melbourne 2018

Query Optimization

• Typically there are many ways of executing a given query,

all giving the same answer

• Cost of alternative methods often varies enormously

• Query optimization aims to find the execution strategy

with the lowest cost

• We will cover:

–Relational algebra equivalences

–Cost estimation

Result size estimation and reduction factors

–Enumeration of alternative plans

INFO20003 Database Systems 8© University of Melbourne 2018

Query Plan

• A tree, with relational algebra operators as nodes

• Each operator labeled with a choice of algorithm

Sailors Reserves

sid=sid

bid=100 rating > 5

sname

(Page-Oriented

Nested loops)

(On-the-fly)

(On-the-fly)
Plan:

Cost: 500+500*1000 I/O

SELECT sname from Sailors NATURAL JOIN Reserves

WHERE bid = 100 and rating > 5

(Heap Scan) (Heap Scan)

INFO20003 Database Systems 9© University of Melbourne 2018

Query Optimization

• Overview

• Query optimization

• Cost estimation

Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 10© University of Melbourne 2018

A Familiar Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)

Reserves (sid: integer, bid: integer, day: dates, rname: string)

Boats (bid: integer, bname: string, color: string)

INFO20003 Database Systems 11© University of Melbourne 2018

Query Optimization Overview

1. Query first broken into “blocks”

2. Each block converted to relational algebra

3. Then, for each block, several alternative query plans are

considered

4. Plan with the lowest estimated cost is selected

SELECT S.sname

FROM Reserves R, Sailors S

WHERE R.sid=S.sid AND

R.bid=100 AND S.rating>5

Example:

Query optimization steps:

INFO20003 Database Systems 12© University of Melbourne 2018

Step 1: Break query into query blocks

• Query block is any statement starting with select

• Query block = unit of optimization

• Typically inner most block is optimized first, then

moving towards outers

SELECT S.sname

FROM Sailors S

WHERE S.age IN

(SELECT MAX (S2.age)

FROM Sailors S2

GROUP BY S2.rating)
Nested block

Outer block

INFO20003 Database Systems 13© University of Melbourne 2018

Step 2: Convert query block into

relational algebra expression

SELECT S.sid

FROM Sailors S, Reserves R, Boats B

WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red”

p
S.sid( B.color = “red” (Sailors Reserves Boats))s

Query:

Relational algebra:

INFO20003 Database Systems 14© University of Melbourne 2018

Step 3: Relational Algebra Equivalences

(Commute)

(Cascade)

INFO20003 Database Systems 15© University of Melbourne 2018

Examples

σage<18 ٨ rating>5 (Sailors)

↔ σage<18 (σrating>5 (Sailors))

↔ σrating>5 (σage<18 (Sailors)) πage,rating (Sailors) ↔ πage (πrating (Sailors)) ? πage,rating (Sailors) ↔ πage,rating (πage,rating,sid (Sailors)) Selection: Projection: INFO20003 Database Systems 16© University of Melbourne 2018 Another Equivalence • A projection commutes with a selection that only uses attributes retained by the projection πage, rating, sid (σage<18 ٨ rating>5 (Sailors))

↔ σage<18 ٨ rating>5 (πage, rating, sid (Sailors))

πage, sid (σage<18 ٨ rating>5 (Sailors))

↔ σage<18 ٨ rating>5 (πage, sid (Sailors))
?

INFO20003 Database Systems 17© University of Melbourne 2018

R (S T) (R S) T

Equivalences Involving Joins

• These equivalences allow us to choose different join

orders

 (Associative)

(R S) (S R)  (Commutative)

INFO20003 Database Systems 18© University of Melbourne 2018

Mixing Joins with Selections & Projections

• Converting selection + cross-product to join

• Selection on just attributes of S commutes with R S

• We can also “push down” projection (but be careful…)

σS.sid = R.sid (Sailors x Reserves)

↔ Sailors S.sid = R.sid Reserves

σS.age<18 (Sailors S.sid = R.sid Reserves) ↔ (σS.age<18 (Sailors)) S.sid = R.sid Reserves πS.sname (Sailors S.sid = R.sid Reserves) ↔ πS.sname (πsname,sid(Sailors) S.sid = R.sid πsid(Reserves)) INFO20003 Database Systems 20© University of Melbourne 2018 Query Optimization • Overview • Query optimization • Cost estimation Readings: Chapter 15, Ramakrishnan & Gehrke, Database Systems INFO20003 Database Systems 21© University of Melbourne 2018 Recall: Query Optimization Overview 1. Query first broken into “blocks” 2. Each block converted to relational algebra 3. Then, for each block, several alternative query plans are considered 4. Plan with lowest estimated cost is selected SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

Reserves Sailors

sid=sid

bid=100 rating > 5

sname

p(sname)s(bid=100  rating > 5) (Reserves  Sailors)

INFO20003 Database Systems 22© University of Melbourne 2018

Cost-based Query Sub-System

Query Parser

Query Optimizer

Plan

Generator

Plan Cost

Estimator

Query Plan Evaluator

Catalog Manager

Schema Statistics

Select *

From Blah B

Where B.blah = “foo”
Queries

Steps 3 & 4

3. What plans are considered?

4. What is the cost of a plan?

Cost Estimation

• For each plan considered, must estimate cost:

–Must estimate size of result for each operation in tree

•Use information about input relations (from the system

catalogs), and apply rules (discussed next)

–Must estimate cost of each operation in plan tree

•Depends on input cardinalities

•We’ve already discussed how to estimate the cost of

operations (sequential scan, index scan, joins)

•Next time we will calculate the cost of entire plans…

Statistics and Catalogs

• To decide on the cost, the optimizer needs information
about the relations and indexes involved. This information
is stored in the system catalogs.

• Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) per relation

– # distinct key values (NKeys) for each index (or relation attribute)

– low/high key values (Low/High) for each index (or relation attribute)

– Index height (Height(I)) for each tree index

– # index pages (NPages(I)) for each index

• Statistics in catalogs are updated periodically

Result size estimation

• Consider a query block:

• Maximum number of tuples in the result is the product of the

cardinalities of relations in the FROM clause

• Reduction factor (RF) associated with each predicate reflects

the impact of the predicate in reducing the result size. RF is

also called selectivity.

SELECT attribute list

FROM relation list

WHERE predicate1 AND … AND predicate_k

Result size estimation calculations

• Single table selection:

ResultSize = 𝐍𝑻𝒖𝒑𝒍𝒆𝒔(𝑹) 𝒊=𝟏..𝒏𝑹𝑭𝒊

• Joins (over k tables):

ResultSize = 𝒋=𝟏..𝒌𝑵𝑻𝒖𝒑𝒍𝒆𝒔(𝑹𝒋) 𝒊=𝟏..𝒏𝑹𝑭𝒊

• If there are no selections (no predicates), reduction factors are

simply ignored, i.e. they are ==1

Calculating Reduction Factors(RF)

1. Col = value

RF = 1/NKeys(Col)

2. Col > value

RF = (High(Col) – value) / (High(Col) – Low(Col))

3. Col < value RF = (val – Low(Col)) / (High(Col) – Low(Col)) 4. Col_A = Col_B (for joins) RF = 1/ (Max (NKeys(Col_A), NKeys(Col_B))) 5. In no information about Nkeys or interval, use a “magic number” 1/10 RF = 1/10 • Depend on the type of the predicate: INFO20003 Database Systems 28© University of Melbourne 2018 Let’s try it.. Sailors (S): NTuples(S) =1000, Nkeys(rating) = 10 interval [1-10], age interval [0-100], Nkeys(sid)=1000 SELECT * FROM Sailors WHERE rating = 3 AND age > 50;

Calculate result size:

NTuples(S) = 1000

RF(rating) = 1/10 = 0.1

RF(age) = (100-50)/(100-0) = 0.5

ResultSize = NTuples(S)*RF(rating)*RF(age)

= 1000*0.1*0.5= 50 tuples

What’s examinable

• What is query optimization/describe steps?

• Equivalence classes

• Result size estimation

• Important for Assignment 3 as well

Next Lecture

– –

• Query optimization Part II

‒ Plan enumeration

Related Posts