程序代写代做代考 database Microsoft PowerPoint – 21- QryOptimizationPart2Final

Microsoft PowerPoint – 21- QryOptimizationPart2Final

© 2018 A. Alawini & A. Parameswaran

Query Optimization
(Part 2)

Abdu Alawini
University of Illinois at Urbana-Champaign

CS411: Database Systems

November 14, 2018

1

© 2018 A. Alawini & A. Parameswaran

Announcements

• HW 4 deadline is extended to Sunday 11/18
(23:59)

2

© 2018 A. Alawini & A. Parameswaran
3

Optimization: Logical Query Plan

•Given a logical plan, how can we make it better?

• Ingredient 1: Algebraic laws: what are the ways in
which an expression or tree may be rewritten without
changing the meaning

• Ingredient 2: Optimizations: if there are multiple ways
to write the same query, which one to choose.
• Rule-based (heuristics): apply laws that seem to result in

cheaper plans
• Cost-based: estimate size and cost of intermediate results,

search systematically for best plan

© 2018 A. Alawini & A. Parameswaran
4

Algebraic Laws

•Laws involving projections
•M( N(R)) = M ∩ N(R)
• M(R S) = N(P(R) Q(S))

• Where N, P, Q are appropriate subsets of attributes of M that are
“needed afterwards”

•Example R(A,B,C,D), S(E, F, G)
•  A,B,G(R D=E S) =  ? (?(R) ?(S))

© 2018 A. Alawini & A. Parameswaran

Rule-based Optimization

5

 16.2.3

© 2018 A. Alawini & A. Parameswaran
6

Heuristic Based Optimizations

• Rewriting of logical plan based on specific algebraic laws
• Results in better execution plans most of the time
• Heuristic number 1:

• Push selections down
• Heuristic number 2:

• (In some cases) push selections up, then down
• But ultimately pushed down
•  C (R) S = C (R S) =  C (R)  C (S)
Typically: apply selections as close to the relations as possible so as to

reduce size of intermediate results

© 2018 A. Alawini & A. Parameswaran 7

Pushing selection down: Simple Setting

Product Company

maker=name

 price>100 AND city=“Urbana”

pname

Product Company

maker=name

price>100

pname

city=“Urbana”

The earlier we process selections, less tuples we need to manipulate
higher up in the tree

Uses C (R S) =  C (R)  C (S)

© 2018 A. Alawini & A. Parameswaran

Estimating Sizes

8

 16.4.1

© 2018 A. Alawini & A. Parameswaran
9

Estimating Sizes

•Need size in order to estimate cost
•Example:
•Cost of partitioned hash-join R S is

3B(R) + 3B(S)
• B(R) = T(R)/ block size
• B(S) = T(S)/ block size

•So, we need to estimate T(R), T(S)

© 2018 A. Alawini & A. Parameswaran
10

Estimating Sizes: Projection

Estimating the size of a projection

•Easy: T(L(R)) = T(R)
•This is because a projection doesn’t eliminate

duplicates

•Slightly tricky because the number of tuples in a block
may change, but we will ignore that for now.

© 2018 A. Alawini & A. Parameswaran
11

Estimating Sizes: Selection

Estimating the size of a selection

•S = A=c(R)
• V(R, A) = number of distinct values of A in R
• T(S) can be anything from 0 to T(R) – V(R,A) + 1
• Why 0? Why T(R) – V(R,A) + 1?
• With the assumption of uniform distribution, value:

• T(S) = T(R)/V(R,A)

© 2018 A. Alawini & A. Parameswaran
12

Estimating Sizes

Estimating the size of a join, R S, extremes:

•When the set of A values are disjoint, then
T(R S) = 0

•When A is a key in S and a foreign key in R, then
T(R S) = T(R)

A

A

A

© 2018 A. Alawini & A. Parameswaran
13

Estimating Sizes

Simplifying assumptions:

•Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S • Note: this holds when A is a foreign key in R, and a key in S; but we assume this more broadly (even if this condition is not true) •Preservation of value sets: for any other attribute B, from one of the relations V(R S, B) = V(R, B) (or V(S, B)) A © 2018 A. Alawini & A. Parameswaran 14 Estimating Sizes Assume V(R,A) <= V(S,A) •Then each tuple t in R joins some tuple(s) in S • How many ? • On average T(S)/V(S,A) • t will contribute T(S)/V(S,A) tuples in R S •Hence T(R S) = T(R) T(S) / V(S,A) In general: T(R S) = T(R) T(S) / max(V(R,A),V(S,A)) A A A © 2018 A. Alawini & A. Parameswaran 15 Estimating Sizes Example: •T(R) = 10000, T(S) = 20000 •V(R,A) = 100, V(S,A) = 200 •How large is R S ?A © 2018 A. Alawini & A. Parameswaran 16 Estimating Sizes Example: •T(R) = 10000, T(S) = 20000 •V(R,A) = 100, V(S,A) = 200 •How large is R S ? Answer: T(R S) = 10000 * 20000/200 = 1M A A © 2018 A. Alawini & A. Parameswaran 17 Estimating Sizes Joins on more than one attribute: •T(R S) = T(R) T(S)/[max(V(R,A),V(S,A))max(V(R,B),V(S,B))] A,B © 2018 A. Alawini & A. Parameswaran Three way join • T(R) = 10,000, T(S) = 20,000, T(U) = 10,000 • R(A,B), S(B, C), U(C, D) • V(R, B) = 100, V(S, B) = 50, V(S, C) = 40, V(U, C) = 20 18 © 2018 A. Alawini & A. Parameswaran Three way join •T(R) = 10,000, T(S) = 20,000, T(U) = 10,000 •R(A,B), S(B, C), U(C, D) •V(R, B) = 100, V(S, B) = 50, V(S, C) = 40, V(U, C) = 20 R join S = 10,000 * 20,000/100 S join U = 20,000 * 10,000/40 R join U = 10,000 * 10,000 S join U join R (in any order) = 10,000*20,000*10,000/(40*100) 19 © 2018 A. Alawini & A. Parameswaran Cost-based Optimization 20  16.5 © 2018 A. Alawini & A. Parameswaran 21 Cost-based Optimizations • Main idea: given a “partial query”: apply algebraic laws, until estimated cost of partial plan is minimal • Start from partial plans, build more complete plans • Will see in a few slides • Problem: there are too many ways to apply the laws, hence too many partial plans  16.5.4 © 2018 A. Alawini & A. Parameswaran 22 Partial Plans Focus: Join Trees •R1 R2 …. Rn • Join includes: Njoin, Theta-join, cross product •Why focus on joins? • Super common, very basic operation • Joins are costly, selections/projections can be applied inlined with other operations (e.g., scan), other operators are usually applied once at the top • Join tree: •A plan = a join tree. •Lots of these ! •A partial plan = a subtree of a join tree R S T U  16.6 © 2018 A. Alawini & A. Parameswaran 23 Types of Join Trees: Left deep R S T U © 2018 A. Alawini & A. Parameswaran 24 Types of Join Trees: Right deep R S T U © 2018 A. Alawini & A. Parameswaran 25 Types of Join Trees: Bushy R S T U © 2018 A. Alawini & A. Parameswaran 26 Problem •Given: a query R1 R2 … Rn •Assume we have a function cost() that gives us the cost of every join tree •Find the best join tree for the query © 2018 A. Alawini & A. Parameswaran 27 Dynamic Programming •Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset •In increasing order of set cardinality: •It is a “bottom-up” strategy Step 1: for {R1}, {R2}, …, {Rn} Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} … Step n: for {R1, …, Rn} © 2018 A. Alawini & A. Parameswaran 28 Dynamic Programming •For each subset Q {R1, …, Rn} compute • Size(Q): estimated size of the join of the relations in Q • Plan(Q): a best plan for Q – a particular join tree. • Cost(Q): the least cost of that plan: the cost is going to be the size of intermediate tables used by the plan. • Assumptions: • Final table, initial Ri’s are both ignored (common to all plans) • We’ll restrict ourselves to left-deep plans (plans are just reordering of relations) • Recall that size of tables (B(R) + B(S)) is a lower bound on join cost. • the least cost of a specific join is at least equal to sum of the sizes of the two relations. © 2018 A. Alawini & A. Parameswaran 29 Dynamic Programming •Step 1: For each {Ri} do: • Size({Ri}) = B(Ri) • Plan({Ri}) = Ri • Cost({Ri}) = 0. (remember: cost of intermediate tables) •Step 2: For each {Ri, Rj} do: • Size({Ri,Rj}) = estimate of size of join • Plan({Ri,Rj}) = either Ri Rj, or Rj Ri • Smaller table on the left (convention + other reasons) • Cost = 0. (no intermediate tables used) © 2018 A. Alawini & A. Parameswaran 30 Dynamic Programming • Step i: For each Q {R1, …, Rn} of cardinality i do: • Compute Size(Q) (as we’ve seen in size estimation earlier) • For every pair of subsets Q’, Q’’ s.t. Q = Q’ U Q’’ compute cost(Q’) + cost(Q’’) + size(Q’) + size(Q’’) • If Q’ or Q’’ is a single table, then dont count its size • Cost(Q) = the smallest such sum • Plan(Q) = the corresponding plan • In the end, Return Plan({R1, …, Rn}) © 2018 A. Alawini & A. Parameswaran 31 Dynamic Programming •Example: •Cost(R5 R7) = 0 (no intermediate results) •Cost((R2 R1) R7) = Cost(R2 R1) + Cost(R7) + size(R2 R1) = size(R2 R1) •Cost (((R2 R1) R7) R8) = Cost ((R2 R1) R7) + Cost (R8) + Size((R2 R1) R7) = Size (R2 R1) + 0 + Size((R2 R1) R7) © 2018 A. Alawini & A. Parameswaran 32 Dynamic Programming •Relations: R, S, T, U •Number of blocks: 2000, 5000, 3000, 1000 •Estimated sizes of values sets for the attributes in each relation: R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1000 © 2018 A. Alawini & A. Parameswaran 33 Subquery Size Cost Plan RS RT RU ST SU TU RST RSU RTU STU RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 34 Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST RSU RTU STU RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 35 Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST 300K 30K (TS)R RSU 500K 20K (UR)S RTU 60K 3K (UT)R STU 30K 3K (UT)S RSTU Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K © 2018 A. Alawini & A. Parameswaran 36 Table Size R 2K S 5K T 3K U 1K R(a, b) S(b, c) T(c, d) U(d, a) V(R, a) = 100 V(U, a) = 50 V(R, b) = 200 V(S, b) = 100 V(S, c) = 500 V(T, c) = 20 V(T, d) = 50 V(U, d) = 1K Subquery Size Cost Plan RS 50k 0 RS RT 6M 0 RT RU 20k 0 UR ST 30k 0 TS SU 5M 0 US TU 3k 0 UT RST 300K 30K (TS)R RSU 500K 20K (UR)S RTU 60K 3K (UT)R STU 30K 3K (UT)S RSTU 3K 33K (UTS)R © 2018 A. Alawini & A. Parameswaran 37 Dynamic Programming • Summary: computes optimal plans for subqueries: • In practice: • heuristics for Reducing the Search Space • Restrict to left linear / deep trees • Restrict to trees “without cartesian product”: R(A,B), S(B,C), T(C,D) (R join T) join S has a cartesian product Step 1: {R1}, {R2}, …, {Rn} Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn} … Step n: {R1, …, Rn}