程序代写代做代考 database algorithm SQL Microsoft PowerPoint - 20- QueryProcessingPart3_QryOpt.

Microsoft PowerPoint – 20- QueryProcessingPart3_QryOpt.

Query Processing:
Physical Operators and

Optimization
Abdu Alawini

University of Illinois at Urbana-Champaign

CS411: Database Systems

November 12, 2018

Announcements

• HW 4 is due by Friday, 11/16 (23:59)

• PT1, stage 4 feedback and grades will be
posted today

• 3% extra credit is extended to in-class
participation

The Big Picture: SQL to Logical Query
Plan to Physical Query Plan

STUDENT

Takes COURSE

Merge

Hash

by cid by cid

Execution
Engine

Storage
Subsystem

Web Server /
UI / etc

Physical Query Plan

STUDENT

Takes COURSE
by cid by cid

Logical Query Plan

Table
scan

Sort
scan

Index
scan

SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sid

AND Takes.cid = COURSE.cid

by sid by sid

Physical Operators, so far!

Scan Operator
Basics of Costing
One pass implementations of operators
 Nested-loop joins
oTwo pass implementations of operators
•Index-based implementations of operators

Today’s Lecture!

•Two-pass implementations of operators
• Finishing up with 2-pass alg. based on sorting
• 2-pass alg. based on hashing

•Index-based implementations of operators
•Query Optimization
•Algebraic laws

Two-Pass Algorithms Based on Sorting

Join R S. Let’s recap what we’ve seen so far — extremes

(a) min(B(R), B(S)) <= M-2: Load smaller table to memory and load other table block by block. Cost: B(R)+B(S). This is the one-pass algorithm. (b) Min(B(R), B(S)) > M-2: Load to memory M-2 blocks of S;
go over every block of R; repeat. Cost: ~B(R)B(S)/M. This is
the nested-loop join algorithm, can operate whenever M >= 3

Nested loop join is the only option
if Min(B(R), B(S))>M-2,
but is too expensive, quadratic (B(R)B(S)).



Two-Pass Algorithms Based on Sorting

Join R S: Take 1: Sort both completely, then merge
•Start by sorting both R and S on the join attribute:
•Cost: 4B(R)+4B(S) (because need to write to disk)

•Read both relations in sorted order, match tuples
•Cost: B(R)+B(S)

•Total cost: 5B(R)+5B(S)
•Assumption: B(R) <= M (M-1), B(S) <= M (M-1) •One difficulty: many tuples in R may match many in S • If at least one set of tuples fits in M, we are OK •Otherwise need nested loop, higher cost • But let’s assume that this is not the case; we are in a good situation – can we do even better?  © 2018 A. Alawini & A. Parameswaran 8 Two-Pass Algorithms Based on Sorting Join R S: Take 2 •Start by writing out runs of R and S on the join attribute: • Cost: 2B(R)+2B(S) • “Merge” runs of both relations in sorted order, match tuples • Cost: B(R)+B(S) •Total cost: 3B(R)+3B(S) •Assumption: B(R) + B(S) <= M (M-1) •Plus there is even more danger of skewed data hurting us •See Section 15.4.6.  © 2018 A. Alawini & A. Parameswaran Outline • Two pass implementations of operators Two-pass algorithms based on sorting •Two-pass algorithms based on hashing • Index-based implementations of operators •Query Optimization •Algebraic laws 9 © 2018 A. Alawini & A. Parameswaran 10 Two Pass Algorithms Based on Hashing •Idea: partition a relation R into M roughly equal sized buckets, on disk •Hashing is crucial to effect this partitioning •These buckets can then be examined “in one shot”, independent of other buckets • Everything that needs to be considered together is in a small number of buckets (one or two) •Can therefore do the operation by only looking at one or two buckets at a time. © 2018 A. Alawini & A. Parameswaran 11 Two Pass Algorithms Based on Hashing •How to partition a relation R into (M-1) roughly equal sized buckets (on disk)? •Each bucket has size approx. B(R)/(M-1) •Does each bucket fit in main memory ? • Yes if B(R)/(M-1) <= M, roughly B(R) < M2 M main memory buffers DiskDisk Relation R OUTPUT 2INPUT 1 hash function h M-1 Partitions 1 2 M-1 . . . 1 2 B(R) © 2018 A. Alawini & A. Parameswaran 12 Hash Based Algorithms for δ •Recall: δ (R) = duplicate elimination •Step 1. Partition R into M-1 buckets. •Note: duplicate tuples must fall into same bucket •Step 2. Apply δ to each bucket •May use one-pass duplicate elimination here (cost B(R)) • This assumes each bucket fits in memory; plus one for output buffer. • That is, B(R)/(M-1) <= M-1, or roughly B(R) < M2 •Cost: 3B(R) •Roughly same costs and constraints as sorting-based operator. © 2018 A. Alawini & A. Parameswaran 13 Hash Based Algorithms for  •Recall: (R) = grouping and aggregation •Step 1. Partition R into M-1 buckets; use grouping attributes as hash function. •Step 2. Apply  to each bucket •Note: all tuples of the same group will be in the same bucket. • Same as one pass  •Cost: 3B(R) •Assumption: B(R) <= (M-1) (M-1) < M2 •Roughly same as sort-based operator. © 2018 A. Alawini & A. Parameswaran 14 Two pass Hashing-based Join •R S •We’ve already described a one-pass hash-based join: • Scan S into memory, build buckets in main memory • Then scan R, hash the tuples of R, output those that match •Assuming that the smaller table is smaller than the memory available.  © 2018 A. Alawini & A. Parameswaran 15 Two pass Hashing-based Join R S •Step 1: •Hash S into (M-1) buckets, using join attribute(s) as hash key • Send all buckets to disk •Step 2 •Hash R into (M-1) buckets, using join attribute(s) as hash key • Send all buckets to disk •Step 3 • Join every pair of buckets with the same bucket number. Use the one pass algorithm for this. •Works when for each bucket no. i, either Ri or Si fits in memory. •min(B(R), B(S))/(M-1) <= (M-2)  min(B(R), B(S)) <= (M-1)(M-2) < M2 • Cost = 3(B(R)+B(S))  © 2018 A. Alawini & A. Parameswaran 16 Hash-Join Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. Partitions of R & S Input buffer for Si Hash table for partition (B(Ri) < M-1 pages) B main memory buffersDisk Output buffer Disk Join Result B main memory buffers DiskDisk Original Relation OUTPUT 2INPUT 1 hash function h M-1 Partitions 1 2 M-1 . . . 1 2 R 1 2 S © 2018 A. Alawini & A. Parameswaran Read 15.5.6 (not covered in lecture) 17 © 2018 A. Alawini & A. Parameswaran Sort-based vs Hash-based (for binary ops) •For sorting-based implementations of binary operations, size requirement was B(R)+B(S)<=M(M-1)

Related Posts