Microsoft PowerPoint – 21- QryOptimizationPart2Final
© 2018 A. Alawini & A. Parameswaran
Query Optimization
(Part 2)
Abdu Alawini
University of Illinois at Urbana-Champaign
CS411: Database Systems
November 14, 2018
1
© 2018 A. Alawini & A. Parameswaran
Announcements
• HW 4 deadline is extended to Sunday 11/18
(23:59)
2
© 2018 A. Alawini & A. Parameswaran
3
Optimization: Logical Query Plan
•Given a logical plan, how can we make it better?
• Ingredient 1: Algebraic laws: what are the ways in
which an expression or tree may be rewritten without
changing the meaning
• Ingredient 2: Optimizations: if there are multiple ways
to write the same query, which one to choose.
• Rule-based (heuristics): apply laws that seem to result in
cheaper plans
• Cost-based: estimate size and cost of intermediate results,
search systematically for best plan
© 2018 A. Alawini & A. Parameswaran
4
Algebraic Laws
•Laws involving projections
•M( N(R)) = M ∩ N(R)
• M(R S) = N(P(R) Q(S))
• Where N, P, Q are appropriate subsets of attributes of M that are
“needed afterwards”
•Example R(A,B,C,D), S(E, F, G)
• A,B,G(R D=E S) = ? (?(R) ?(S))
© 2018 A. Alawini & A. Parameswaran
Rule-based Optimization
5
16.2.3
© 2018 A. Alawini & A. Parameswaran
6
Heuristic Based Optimizations
• Rewriting of logical plan based on specific algebraic laws
• Results in better execution plans most of the time
• Heuristic number 1:
• Push selections down
• Heuristic number 2:
• (In some cases) push selections up, then down
• But ultimately pushed down
• C (R) S = C (R S) = C (R) C (S)
Typically: apply selections as close to the relations as possible so as to
reduce size of intermediate results
© 2018 A. Alawini & A. Parameswaran 7
Pushing selection down: Simple Setting
Product Company
maker=name
price>100 AND city=“Urbana”
pname
Product Company
maker=name
price>100
pname
city=“Urbana”
The earlier we process selections, less tuples we need to manipulate
higher up in the tree
Uses C (R S) = C (R) C (S)
© 2018 A. Alawini & A. Parameswaran
Estimating Sizes
8
16.4.1
© 2018 A. Alawini & A. Parameswaran
9
Estimating Sizes
•Need size in order to estimate cost
•Example:
•Cost of partitioned hash-join R S is
3B(R) + 3B(S)
• B(R) = T(R)/ block size
• B(S) = T(S)/ block size
•So, we need to estimate T(R), T(S)
© 2018 A. Alawini & A. Parameswaran
10
Estimating Sizes: Projection
Estimating the size of a projection
•Easy: T(L(R)) = T(R)
•This is because a projection doesn’t eliminate
duplicates
•Slightly tricky because the number of tuples in a block
may change, but we will ignore that for now.
© 2018 A. Alawini & A. Parameswaran
11
Estimating Sizes: Selection
Estimating the size of a selection
•S = A=c(R)
• V(R, A) = number of distinct values of A in R
• T(S) can be anything from 0 to T(R) – V(R,A) + 1
• Why 0? Why T(R) – V(R,A) + 1?
• With the assumption of uniform distribution, value:
• T(S) = T(R)/V(R,A)
© 2018 A. Alawini & A. Parameswaran
12
Estimating Sizes
Estimating the size of a join, R S, extremes:
•When the set of A values are disjoint, then
T(R S) = 0
•When A is a key in S and a foreign key in R, then
T(R S) = T(R)
A
A
A
© 2018 A. Alawini & A. Parameswaran
13
Estimating Sizes
Simplifying assumptions:
•Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S • Note: this holds when A is a foreign key in R, and a key in S; but we assume this more broadly (even if this condition is not true) •Preservation of value sets: for any other attribute B, from one of the relations V(R S, B) = V(R, B) (or V(S, B)) A © 2018 A. Alawini & A. Parameswaran 14 Estimating Sizes Assume V(R,A) <= V(S,A) •Then each tuple t in R joins some tuple(s) in S • How many ? • On average T(S)/V(S,A) • t will contribute T(S)/V(S,A) tuples in R S •Hence T(R S) = T(R) T(S) / V(S,A) In general: T(R S) = T(R) T(S) / max(V(R,A),V(S,A)) A A A © 2018 A. Alawini & A. Parameswaran 15 Estimating Sizes Example: •T(R) = 10000, T(S) = 20000 •V(R,A) = 100, V(S,A) = 200 •How large is R S ?A © 2018 A. Alawini & A. Parameswaran 16 Estimating Sizes Example: •T(R) = 10000, T(S) = 20000 •V(R,A) = 100, V(S,A) = 200 •How large is R S ? Answer: T(R S) = 10000 * 20000/200 = 1M A A © 2018 A. Alawini & A. Parameswaran 17 Estimating Sizes Joins on more than one attribute: •T(R S) = T(R) T(S)/[max(V(R,A),V(S,A))max(V(R,B),V(S,B))] A,B © 2018 A. Alawini & A. Parameswaran Three way join • T(R) = 10,000, T(S) = 20,000, T(U) = 10,000 • R(A,B), S(B, C), U(C, D) • V(R, B) = 100, V(S, B) = 50, V(S, C) = 40, V(U, C) = 20 18 © 2018 A. Alawini & A. Parameswaran Three way join •T(R) = 10,000, T(S) = 20,000, T(U) = 10,000 •R(A,B), S(B, C), U(C, D) •V(R, B) = 100, V(S, B) = 50, V(S, C) = 40, V(U, C) = 20 R join S = 10,000 * 20,000/100 S join U = 20,000 * 10,000/40 R join U = 10,000 * 10,000 S join U join R (in any order) = 10,000*20,000*10,000/(40*100) 19 © 2018 A. Alawini & A. Parameswaran Cost-based Optimization 20 16.5 © 2018 A. Alawini & A. Parameswaran 21 Cost-based Optimizations • Main idea: given a “partial query”: apply algebraic laws, until estimated cost of partial plan is minimal • Start from partial plans, build more complete plans • Will see in a few slides • Problem: there are too many ways to apply the laws, hence too many partial plans 16.5.4 © 2018 A. Alawini & A. Parameswaran 22 Partial Plans Focus: Join Trees •R1 R2 …. Rn • Join includes: Njoin, Theta-join, cross product •Why focus on joins? • Super common, very basic operation • Joins are costly, selections/projections can be applied inlined with other operations (e.g., scan), other operators are usually applied once at the top • Join tree: •A plan = a join tree. •Lots of these ! •A partial plan = a subtree of a join tree R S T U 16.6 © 2018 A. Alawini & A. Parameswaran 23 Types of Join Trees: Left deep R S T U © 2018 A. Alawini & A. Parameswaran 24 Types of Join Trees: Right deep R S T U © 2018 A. Alawini & A. Parameswaran 25 Types of Join Trees: Bushy R S T U © 2018 A. Alawini & A. Parameswaran 26 Problem •Given: a query R1 R2 … Rn •Assume we have a function cost() that gives us the cost of every join tree •Find the best join tree for the query © 2018 A. Alawini & A. Parameswaran 27 Dynamic Programming •Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset •In increasing order of set cardinality: •It is a “bottom-up” strategy Step 1: for {R1}, {R2}, …, {Rn} Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} … Step n: for {R1, …, Rn} © 2018 A. Alawini & A. Parameswaran 28 Dynamic Programming •For each subset Q {R1, …, Rn} compute • Size(Q): estimated size of the join of the relations in Q • Plan(Q): a best plan for Q – a particular join tree. • Cost(Q): the least cost of that plan: the cost is going to be the size of intermediate tables used by the plan. • Assumptions: • Final table, initial Ri’s are both ignored (common to all plans) • We’ll restrict ourselves to left-deep plans (plans are just reordering of relations) • Recall that size of tables (B(R) + B(S)) is a lower bound on join cost. • the least cost of a specific join is at least equal to sum of the sizes of the two relations. © 2018 A. Alawini & A. Parameswaran 29 Dynamic Programming •Step 1: For each {Ri} do: • Size({Ri}) = B(Ri) • Plan({Ri}) = Ri • Cost({Ri}) = 0. (remember: cost of intermediate tables) •Step 2: For each {Ri, Rj} do: • Size({Ri,Rj}) = estimate of size of join • Plan({Ri,Rj}) = either Ri Rj, or Rj Ri • Smaller table on the left (convention + other reasons) • Cost = 0. (no intermediate tables used) © 2018 A. Alawini & A. Parameswaran 30 Dynamic Programming • Step i: For each Q {R1, …, Rn} of cardinality i do: • Compute Size(Q) (as we’ve seen in size estimation earlier) • For every pair of subsets Q’, Q’’ s.t. Q = Q’ U Q’’ compute cost(Q’) + cost(Q’’) + size(Q’) + size(Q’’) • If Q’ or Q’’ is a single table, then dont count its size • Cost(Q) = the smallest such sum • Plan(Q) = the corresponding plan • In the end, Return Plan({R1, …, Rn}) © 2018 A. Alawini & A. Parameswaran 31 Dynamic Programming •Example: •Cost(R5 R7) = 0 (no intermediate results) •Cost((R2 R1) R7) = Cost(R2 R1) + Cost(R7) + size(R2 R1) = size(R2 R1) •Cost (((R2 R1) R7) R8) = Cost ((R2 R1) R7) + Cost (R8) + Size((R2 R1) R7) = Size (R2 R1) + 0 + Size((R2 R1) R7) © 2018 A. Alawini & A. Parameswaran 32 Dynamic Programming •Relations: R, S, T, U •Number of blocks: 2000, 5000, 3000, 1000 •Estimated sizes of values sets for the attributes in each relation: R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1000 © 2018 A. Alawini & A. Parameswaran 33 Subquery Size Cost Plan RS RT RU ST SU TU RST RSU RTU STU RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 34 Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST RSU RTU STU RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 35 Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST 300K 30K (TS)R RSU 500K 20K (UR)S RTU 60K 3K (UT)R STU 30K 3K (UT)S RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 36 Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST 300K 30K (TS)R RSU 500K 20K (UR)S RTU 60K 3K (UT)R STU 30K 3K (UT)S RSTU 3K 33K (UTS)R © 2018 A. Alawini & A. Parameswaran 37 Dynamic Programming • Summary: computes optimal plans for subqueries: • In practice: • heuristics for Reducing the Search Space • Restrict to left linear / deep trees • Restrict to trees “without cartesian product”: R(A,B), S(B,C), T(C,D) (R join T) join S has a cartesian product Step 1: {R1}, {R2}, …, {Rn} Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn} … Step n: {R1, …, Rn}