INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 12
Query Processing Part II
Week 6
1
INFO20003 Database Systems © University of Melbourne
Remember this? Components of a DBMS
Will briefly touch upon …
DBMS
Query processing module
Parser/ Compiler
Optimizer Executor
TODAY Joins
Concurrency control module
Transaction mgr.
Lock mgr.
Crash recovery module
Concurrency
control module
Log mgr.
Storage module
File and access methods mgr.
Buffer pool mgr. Disk space mgr.
This is one of several possible architectures; each system has its own slight variations.
Index files Heap files
Database
INFO20003 Database Systems © University of Melbourne
2
Coverage
• Nested loops join • Sort-merge join
• Hash join
• General joins
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
3
Joins
• •
•
Are very common and can be very expensive (cross product in the worst case)
There are many implementation techniques for join operations
Join techniques we will cover:
1. Nested-loops join 2. Sort-mergejoin 3. Hashjoin
INFO20003 Database Systems © University of Melbourne
4
Equality Joins With One Join Column
Example: SELECT *
FROM Reserves R1, Sailors S1
WHERE R1.sid=S1.sid
• In algebra: RS. They are very common and need to be carefully optimized.
• R X S is large; so, R X S followed by a selection is inefficient.
RS
Left / Outer Right / Inner
• Join is associative and commutative: − AxB == BxA
− Ax(BxC)==(AxB)xC
• Cost metric : Number of pages; Number of I/O
11,80
13,75
12,10
15,80 15,44
13,74
11,20
13,20
12,20
13,75
13,35
12,10
INFO20003 Database Systems © University of Melbourne
5
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Sailors (S):
–80 tuples per page, 500 pages
–NPages(S) = 500, NTuplesPerPage(S) = 80 –NTuples(S) = 500*80 = 40000
• Reserves (R):
–100 tuples per page, 1000 pages –NPages(R) = 1000, NTuplesPerPage(R) =100 –NTuples(R) = 100000
INFO20003 Database Systems © University of Melbourne
6
Simple Nested Loops Join
• For each tuple in the outer relation R, we scan the entire inner relation S
Pseudo code:
foreach tuple r in R do foreach tuple s in S do
if ri == sj then add
• Cost:
RS
…
11,80
13,75
12,10
15,80 15,44
13,74
11,20
13,20
12,20
13,75
13,35
12,10
Cost (SNJL) = NPages(Outer) + NTuples(Outer) * NPages(Inner)
• Our example:
Cost (SNLJ)= 1000+ 100*1000*500
= 50001000 (I/O)
INFO20003 Database Systems © University of Melbourne
7
Page-Oriented Nested Loops Join
• For each page of R
–get each page of S
–write out matching pairs of tuples
Pseudo code:
foreach page bR in R do foreach page bS in S do
foreach tuple r in bR do foreach tuple s in bSdo
if ri == sj then add
• Our example:
Cost (PNLJ)= 1000+1000*500 = 501000 (I/O)
RS
13,75
11,20
p1 p2
p1 p2
11,80
13,20
12,10
12,20
15,80 15,44
13,75
13,74
13,35
12,10
Cost (PNJL) = NPages(Outer) + NPages(Outer) * NPages(Inner)
INFO20003 Database Systems © University of Melbourne
8
Block Nested Loops Join
• Page-oriented NL doesn’t exploit extra memory buffers • Alternative approach:
–Use one page as an input buffer for scanning the inner S, one page as the output buffer, and use all remaining pages to hold ‘block’ of outer R
• For each matching tuple r in R-block, s in S-page, add
R & S Join Result
… …
block of R tuples (B-2 pages)
.. .
Input buffer for S Output buffer
INFO20003 Database Systems © University of Melbourne
12
Block Nested Loops Join Cost
Cost (BNJL) = NPages(Outer) + NBlocks(Outer) * NPages(Inner)
𝑁𝑁𝑁𝑁𝑁𝑁𝑁𝑁𝑁𝑁𝑁𝑁(𝑂𝑂𝑂𝑂𝑂𝑂𝑁𝑁𝑂𝑂) • NBlocks(Outer) = 𝐵𝐵−2
RS
B1 B2
• Our example:
Let’s say we have 102 pages of space in memory, and consider
Reserves (R) as the outer and Sailors (S) as the inner table.
NBlocks(R) = 1000/(102-2) = 10 Cost(BNLJ) = 1000 + 10* 500 = 6000 I/O
11,80
13,75
12,10
15,80 15,44
13,74
11,20
13,20
12,20
13,75
13,35
12,10
INFO20003 Database Systems © University of Melbourne
13
Query Processing: Joins
• Nested loops join • Sort-merge join
• Hash join
• General joins
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
14
Sort-Merge Join (R S) i=j
• Sort R and S on the join column, then scan them to do a merge (on join column), and output result tuples
RS
11,80
12,10
13,74
13,75
15,44
15,80
11,20
12,20
12,10
13,75
13,35
13,20
• Sorted R is scanned once;
• Each S group of the same key values is scanned once per matching R tuple (typically means Sorted S is scanned once too).
• Useful when:
–one or both inputs are already sorted on join attribute(s) –output is required to be sorted on join attributes(s)
INFO20003 Database Systems © University of Melbourne
15
Sort-Merge Join Cost
Cost (SMJ) = Sort(Outer) + Sort(Inner)
+ NPages(Outer) + NPages(Inner)
Sort inputs Merge inputs
Sort(R) = External Sort Cost = 2*NumPasses*NPages(R)
Our example:
Let’s say that both Reserves and Sailors can be sorted in 2 passes, then:
Cost(SMJ) = Sort R + Sort S + NPages(R) + NPages(S) = 2*2*NPages(R)+ 2*2*NPages(S)
+ NPages(R) + NPages (S)
= 5*1000 + 5* 500 = 7500 I/O
INFO20003 Database Systems © University of Melbourne
16
Query Processing: Joins
• Nested loops join • Sort-merge join
• Hash join
• General joins
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
18
Hash-Join
• Partition both relations using hash function h: R tuples in partition l will only match S tuples in partition I
• Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, probe hash table for matches
Original Relation
.. .
Partitions
1 2
INPUT hash
B-1 Join Result
Disk B main memory buffers Disk
Partitions of R & S
hash fn h2
Disk
B main memory buffers Disk
OUTPUT
1
2
function
h B-1
Hash table for partition Ri (k < B-1 pages)
h2
Input buffer Output
for Si
buffer
INFO20003 Database Systems © University of Melbourne
19
Hash-Join Cost
1. Inpartitioningphase,weread+writebothrelations 2. Inmatchingphase,wereadbothrelations
Cost (HJ) = 2 * NPages(Outer) + 2* NPages(Inner) Create partitions
+ NPages(Outer) + NPages(Inner)
Match partitions
• Our example:
Cost(HJ) = 2*NPages(R) + 2*NPages(S) + NPages(R) + NPages(S)
= 3 * 1000 + 3* 500 = 4500 I/Os
INFO20003 Database Systems © University of Melbourne
20
Watch this video if you are confused
https://www.youtube.com/watch?v=o1dMJ6-CKzU
From 0:58
INFO20003 Database Systems © University of Melbourne
21
Query Processing: Joins
• Nested loops join • Sort-merge join
• Hash join
• General joins
Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems
INFO20003 Database Systems © University of Melbourne
24
General Join Conditions
• Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname):
–For Sort-Merge and Hash Join, sort/partition on combination of the two join columns
• Inequality conditions (e.g., R.rname < S.sname): –Hash Join, Sort Merge Join not applicable
–Block NL quite likely to be the best join method here
INFO20003 Database Systems © University of Melbourne
25
Summary
• A virtue of relational DBMSs:
– Queries are composed of a few basic operators
– Implementation of operators can be carefully tuned
– Important to do this
• Many alternative implementations for each operator –No universally superior technique for most operators
• Must consider alternatives for each operation in a query and choose best one based on system statistics...
–Part of the broader task of optimizing a query composed of several operations
INFO20003 Database Systems © University of Melbourne
28
What’s examinable
• Understand alternatives for join operator implementations
–Be able to calculate the cost of alternatives • Important for Assignment 3 as well
INFO20003 Database Systems © University of Melbourne
29
Next Lecture
• Query optimization
‒ How does a DBMS pick a good query plan?
INFO20003 Database Systems © Univers-it-y of Melbourne
30