Microsoft PowerPoint – 20- QueryProcessingPart3_QryOpt.
© 2018 A. Alawini & A. Parameswaran
Query Processing:
Physical Operators and
Optimization
Abdu Alawini
University of Illinois at Urbana-Champaign
CS411: Database Systems
November 12, 2018
1
© 2018 A. Alawini & A. Parameswaran
Announcements
• HW 4 is due by Friday, 11/16 (23:59)
• PT1, stage 4 feedback and grades will be
posted today
• 3% extra credit is extended to in-class
participation
2
© 2018 A. Alawini & A. Parameswaran 3
The Big Picture: SQL to Logical Query
Plan to Physical Query Plan
STUDENT
Takes COURSE
Merge
Hash
by cid by cid
Execution
Engine
Storage
Subsystem
Web Server /
UI / etc
Physical Query Plan
STUDENT
Takes COURSE
by cid by cid
Logical Query Plan
Table
scan
Sort
scan
Index
scan
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sid
AND Takes.cid = COURSE.cid
by sid by sid
© 2018 A. Alawini & A. Parameswaran
Physical Operators, so far!
Scan Operator
Basics of Costing
One pass implementations of operators
Nested-loop joins
oTwo pass implementations of operators
•Index-based implementations of operators
4
© 2018 A. Alawini & A. Parameswaran
Today’s Lecture!
•Two-pass implementations of operators
• Finishing up with 2-pass alg. based on sorting
• 2-pass alg. based on hashing
•Index-based implementations of operators
•Query Optimization
•Algebraic laws
5
© 2018 A. Alawini & A. Parameswaran
6
Two-Pass Algorithms Based on Sorting
Join R S. Let’s recap what we’ve seen so far — extremes
(a) min(B(R), B(S)) <= M-2: Load smaller table to memory
and load other table block by block. Cost: B(R)+B(S). This is
the one-pass algorithm.
(b) Min(B(R), B(S)) > M-2: Load to memory M-2 blocks of S;
go over every block of R; repeat. Cost: ~B(R)B(S)/M. This is
the nested-loop join algorithm, can operate whenever M >= 3
Nested loop join is the only option
if Min(B(R), B(S))>M-2,
but is too expensive, quadratic (B(R)B(S)).
© 2018 A. Alawini & A. Parameswaran
7
Two-Pass Algorithms Based on Sorting
Join R S: Take 1: Sort both completely, then merge
•Start by sorting both R and S on the join attribute:
•Cost: 4B(R)+4B(S) (because need to write to disk)
•Read both relations in sorted order, match tuples
•Cost: B(R)+B(S)
•Total cost: 5B(R)+5B(S)
•Assumption: B(R) <= M (M-1), B(S) <= M (M-1)
•One difficulty: many tuples in R may match many in S
• If at least one set of tuples fits in M, we are OK
•Otherwise need nested loop, higher cost
• But let’s assume that this is not the case; we are in a good
situation – can we do even better?
© 2018 A. Alawini & A. Parameswaran
8
Two-Pass Algorithms Based on Sorting
Join R S: Take 2
•Start by writing out runs of R and S on the join attribute:
• Cost: 2B(R)+2B(S)
• “Merge” runs of both relations in sorted order, match
tuples
• Cost: B(R)+B(S)
•Total cost: 3B(R)+3B(S)
•Assumption: B(R) + B(S) <= M (M-1)
•Plus there is even more danger of skewed data hurting us
•See Section 15.4.6.
© 2018 A. Alawini & A. Parameswaran
Outline
• Two pass implementations of operators
Two-pass algorithms based on sorting
•Two-pass algorithms based on hashing
• Index-based implementations of operators
•Query Optimization
•Algebraic laws
9
© 2018 A. Alawini & A. Parameswaran
10
Two Pass Algorithms Based on Hashing
•Idea: partition a relation R into M roughly equal
sized buckets, on disk
•Hashing is crucial to effect this partitioning
•These buckets can then be examined “in one
shot”, independent of other buckets
• Everything that needs to be considered together is in
a small number of buckets (one or two)
•Can therefore do the operation by only looking
at one or two buckets at a time.
© 2018 A. Alawini & A. Parameswaran
11
Two Pass Algorithms Based on Hashing
•How to partition a relation R into (M-1) roughly equal
sized buckets (on disk)?
•Each bucket has size approx. B(R)/(M-1)
•Does each bucket fit in main memory ?
• Yes if B(R)/(M-1) <= M, roughly B(R) < M2
M main memory buffers DiskDisk
Relation R
OUTPUT
2INPUT
1
hash
function
h M-1
Partitions
1
2
M-1
. . .
1
2
B(R)
© 2018 A. Alawini & A. Parameswaran
12
Hash Based Algorithms for δ
•Recall: δ (R) = duplicate elimination
•Step 1. Partition R into M-1 buckets.
•Note: duplicate tuples must fall into same bucket
•Step 2. Apply δ to each bucket
•May use one-pass duplicate elimination here (cost B(R))
• This assumes each bucket fits in memory; plus one for output
buffer.
• That is, B(R)/(M-1) <= M-1, or roughly B(R) < M2
•Cost: 3B(R)
•Roughly same costs and constraints as sorting-based
operator.
© 2018 A. Alawini & A. Parameswaran
13
Hash Based Algorithms for
•Recall: (R) = grouping and aggregation
•Step 1. Partition R into M-1 buckets; use grouping
attributes as hash function.
•Step 2. Apply to each bucket
•Note: all tuples of the same group will be in the same bucket.
• Same as one pass
•Cost: 3B(R)
•Assumption: B(R) <= (M-1) (M-1) < M2
•Roughly same as sort-based operator.
© 2018 A. Alawini & A. Parameswaran
14
Two pass Hashing-based Join
•R S
•We’ve already described a one-pass hash-based join:
• Scan S into memory, build buckets in main memory
• Then scan R, hash the tuples of R, output those that match
•Assuming that the smaller table is smaller than the memory
available.
© 2018 A. Alawini & A. Parameswaran
15
Two pass Hashing-based Join
R S
•Step 1:
•Hash S into (M-1) buckets, using join attribute(s) as hash key
• Send all buckets to disk
•Step 2
•Hash R into (M-1) buckets, using join attribute(s) as hash key
• Send all buckets to disk
•Step 3
• Join every pair of buckets with the same bucket number. Use
the one pass algorithm for this.
•Works when for each bucket no. i, either Ri or Si fits in memory.
•min(B(R), B(S))/(M-1) <= (M-2)
min(B(R), B(S)) <= (M-1)(M-2) < M2
• Cost = 3(B(R)+B(S))
© 2018 A. Alawini & A. Parameswaran 16
Hash-Join
Partition both relations
using hash fn h: R tuples in
partition i will only match
S tuples in partition i.
Partitions
of R & S
Input buffer
for Si
Hash table for partition
(B(Ri) < M-1 pages)
B main memory buffersDisk
Output
buffer
Disk
Join Result
B main memory buffers DiskDisk
Original
Relation OUTPUT
2INPUT
1
hash
function
h M-1
Partitions
1
2
M-1
. . .
1
2
R
1
2
S
© 2018 A. Alawini & A. Parameswaran
Read 15.5.6 (not covered in lecture)
17
© 2018 A. Alawini & A. Parameswaran
Sort-based vs Hash-based (for binary ops)
•For sorting-based implementations of binary operations,
size requirement was B(R)+B(S)<=M(M-1)