Query Evaluation 1
Introduction
Metrics
Unary Operators
John Edgar 2
select c.cid, c.cname, c.email, p.pid, p.pname, p.price from customer c, sales s, product p
where c.city = ‘Vancouver’ and p.company = ‘lego’ and s.year = 2019 and s.cid = c.cid and s.pid = p.pid
SQL is procedural Operations are specified
select c.cid, c.cname, c.email, p.pid, p.pname, p.price
from (select cid, cname, email from customer where city = ‘Vancouver’) as c natural inner join
(select cid, pid from sales where year = 2019) as s
natural inner join
(select pid, pname, price from products where company = ‘lego’) as p
There are often equivalent queries
That are more or less efficient
Query optimization entails finding the best* query
select c.cid, c.cname, c.email, p.pid, p.pname, p.price from (select c1.cid, s1.pid
from customer c1, sales s1
where c1.city = ‘Vancouver’ and s1.year = 2019 intersect
select s2.cid, p1.pid
from product p1, sales s2
where p1.company = ‘lego’ and s2.year = 2019) as cp
natural inner join customer
natural inner join product
*Actually a good enough query
John Edgar
3
select … from … where
construct parse tree
estimate cost which is best?
generate physical plans
… …
x RS…
generate equivalent logical plans
logical plan 1
logical plan 2
logical plan n
logical plans are relational algebra queries
physical plan 1
physical plan 2
physical plan n
an algorithm is selected for each operator in a logical plan
convert to relational algebra
( (…))
John Edgar
4
What are these algorithms?
select c.cid, c.cname, c.email, p.pid, p.pname, p.price
from (select cid, cname, email from customer where city = ‘Vancouver’) as c natural inner join
(select cid, pid from sales where year = 2019) as s
natural inner join
(select pid, pname, price from products where company = ‘lego’) as p
cid,cname,email,pid,pname,price(cid,cname,email(city = ‘Vancouver’ (Customer))
pid,pname,price(company = ‘lego’ (Product)))
⋈
cid,pid(year = 2019(Sales))
⋈
Order of operations?
Size of input into next operation – intermediate relations? Are results maintained in main memory?
What is the cost metric?
A physical plan is made up of a sequence of steps
▪ Each step corresponds to a relational algebra operation
▪ Input is one or more relations
▪ Output from each operation
is a relation
Some operations require
low level processes
1 2 3
… maybe …
▪ ▪
Scanning a table
Using an index to access a record
John Edgar
5
1.1
main memory
frames
M frames available
Cost metric – number of blocks read or written from disk
DB – on disk
contains
relations
why?
B(R) – number of blocks of R T(R) – number of records of R bf(R) – records per block of R
Metadata – in system catalog
Cost of disk I/O dominates
For simplicity, assume data is accessed one block at a time
V(R,a) – number of distinct values for attribute a in R
John Edgar
7
B(Customer) = 100,000 T(Customer) = 1,000,000 V(Customer, city) = 100
This section covers algorithms for query operations ▪ There is often more than one for an operation
Operations are considered in isolation
cid,cname,email,pid,pname,price(cid,cname,email(city = ‘Vancouver’ (Customer))
pid,pname,price(company = ‘lego’ (Product)))
⋈
cid,pid(year = 2019(Sales))
⋈
Assume performed first
Cost?
▪ Assume that data is read from disk ▪ In practice this is not always the case
Interaction of operations discussed later
▪ And that the result is retained in memory – not written out
John Edgar 8
1.2
A unary operator is an operation with a single operand
▪ For SQL operators the operand is a table
▪ Either a base table or the result of a previous query operation
Unary operations
▪ Selection salary > 100000(Customer)
id
name
salary
…
105041
kbaotbe
8727000
718546
broibe
12707000
206081
skuaete
8623000
728668
sburie
12603000
▪ Projection
▪ Which may include duplicate removal
▪ Sorting
▪ Aggregations AVG(salary)
name,salary(Customer)
85500
John Edgar
10
don’t do this …
SELECT *
FROM Customer
WHERE city = ‘Vancouver’
city = ‘Vancouver’(Customer)
A simple selection has a single condition
▪ Complex selections are considered later
Selections are satisfied by retrieving the matching
records via an access path
▪ Scanning the file and testing each record to determine if it
matches the selection
▪ Or using binary search if the file is sorted and has no index
▪ Using an index on the attribute in the condition
John Edgar 11
which is unusual …
No index on the selection attribute
▪ Linear search by scanning file, cost is B reads
▪ If the selection attribute is a candidate key the scan can be terminated once a match has been found (cost is B/2)
▪ If the file is sorted use binary search to find record(s) ▪ log2(B) + pages of matching records – 1
Index on the selection attribute
▪ The cost is dependent on
▪ The type of index – B+ tree, hash index, …
▪ Theheightoftheindex
▪ The number of records that match the selection
▪ Whether the index is primary or secondary
But it is unusual to have a sorted file with no primary index
John Edgar 12
Compare selections on SIN
First name City Gender
The cost of satisfying a selection with an index is composed of
▪ Number of disk reads to use the index
▪ i.e. to reach the leaf / bucket that contains the data entry
▪ The number of leaves / size of the bucket
▪ Number of blocks of the file with records that match the selection
▪ Generally larger if the index is secondary Assume that indices are
▪ Hash index – extensible or linear ▪ B+ tree index
…(R)
index
John Edgar
13
file
B+ Tree
▪ To find matching RIDs search tree
▪ RIDs reside in leaf nodes
▪ Cost: 1 disk read per level
Additional leaf pages may have to be read
▪ If index is dense or selection is inequality
▪ i.e. entries are on multiple leaves
Extensible hash index ▪ Read directory
▪ Probably1or2blocks
▪ Read bucket ▪ 1block
Linear hash index ▪ Read bucket
▪ Bucketmayhaveoverflow blocks
Hash indexes only used for equality selections
John Edgar
14
Primary index
▪ File is sorted by search key
▪ Matching records are stored in consecutive blocks
▪ Blocks read is number of records records per block
▪ 1+(records–1)bf(R) ▪ Assumesworstcase
Secondary index
▪ Matching records are not
stored consecutively
▪ Assume one disk read for each matching record
▪ Asrecordsarescattered across the file
▪ For large selections could be worse than a file scan
John Edgar
15
Access Method
Candidate Key Selection
Non Candidate Key Selection
Notes
Linear search
B/2
B
Binary search
log2(B)
log2(B) + x
Must be sorted on selection attribute x = blocks of matching records
Primary B+ tree index
tree height + 1
tree height + x
x = blocks of matching records
Secondary B+ tree index
tree height + 1
tree height + w + y
w = leaf nodes of data entries – 1 y = number of matching records
Primary hash index
index height + 1
indexheight+w +x
w = blocks in bucket – 1
x = blocks of matching records
Secondary hash index
index height + 1
indexheight+w +y
w = blocks in bucket – 1
y = number of matching records
Notes: tree height usually 3 to 5; hash index “height” usually 1 or 2; root node of indexes may be resident in main memory which reduces cost by 1; value for w is usually 1 (particularly for a hash index); difference between x and y can be large; details on how to compute these costs follow
John Edgar 16
A complex selection is made of at least two terms connected by and () and or ()
▪ The terms can reference different or the same attributes
▪ If no index on any of the selection attributes scan the file
▪ Use indices on selection attributes where possible
▪ Use of indices is governed by the type of selection and index
▪ Conjunctions are more selective
and clauses or clauses
▪ Disjunctions are less selective
Complex selections are satisfied in much the same
way as simple selections
John Edgar 17
If only one index is available use the index and apply other selections in main memory
▪ Either there is an index on only one of the attributes
▪ Or an index with a compound key that references multiple
selection attributes
▪ Note the restrictions on the use of hash indices
If multiple indexes are available contrast
the intersection
attributes in selection must form prefix of the key
▪ Either use the most selective
▪ Or collect RIDs from leaves or buckets of indexes and take
firstname = “Emma” lastname = “Lee”(Patient) city = “Vancouver” msp = 555123456(Patient)
John Edgar
18
Selections with disjunctions are stated in conjunctive
normal form (CNF)
By the query optimizer
▪ A collection of conjuncts
▪ Each conjunct consists either of a single term, or multiple terms joined by or
▪ e.g.(AB)CD(ACD)(BCD)
▪ This allows each conjunct to be considered independently
A conjunct can only be satisfied by indices if there is an index on all attributes of all of its disjunctive terms
▪ If all the conjuncts contain at least one disjunction with no matching index a file scan is necessary
John Edgar 19
Consider a selection of this form ▪ (a b c) (d e f)(R)
▪ Where each of a to f is an equality selection on an attribute If each of the terms in either of the conjuncts has a
matching index
▪ Use the indexes to find the rids
▪ Take the union of the rids and retrieve those records
▪ For example, if there are indexes just on a, b, c, and e
▪ Use the a, b, and c indexes and take the union of the rids
▪ Retrieve the resulting records and apply the other criteria
John Edgar 20
if there was no index on b a file scan would be necessary
SELECT fName, lName FROM Customer
fName,lName(Customer)
Only selected columns are retained
▪ Reducing the size of the result relation
▪ Projections can always be pipelined from other operations ▪ Unless the SELECT clause includes DISTINCT
A SELECT DISTINCT clause eliminates duplicates ▪ Which requires sorting the relation
▪ Or building a hash table on the relation
John Edgar 21
Processed without writing out the previous result
But what if the relation does not fit in main memory?