程序代写代做代考 concurrency database algorithm compiler INFO20003 Database Systems

INFO20003 Database Systems

INFO20003 Database Systems 1© University of Melbourne 2018

INFO20003 Database Systems

Lecture 12

Query Processing Part II

Semester 2 2018, Week 6

Dr Renata Borovica-Gajic

INFO20003 Database Systems 2© University of Melbourne 2018

1. Partition data into B partitions with h1 hash function

2. Load each partition, hash it with another hash function (h2) and eliminate

duplicates

Projection based on External Hashing

(from last lecture)

B main memory buffers DiskDisk

Original

Relation OUTPUT

2
INPUT

hash
function

h1
B-1

Partitions

B-1

. . .

INFO20003 Database Systems 3© University of Melbourne 2018

1. Partitioning phase:

–Read R using one input buffer

–For each tuple:

•Discard unwanted fields

•Apply hash function h1 to choose one of B-1 output buffers

–Result is B-1 partitions (of tuples with no unwanted fields)

•2 tuples from different partitions guaranteed to be distinct

2. Duplicate elimination phase:

–For each partition

•Read it and build an in-memory hash table

–using hash function h2 (<> h1) on all fields

•while discarding duplicates

–If partition does not fit in memory

•Apply hash-based projection algorithm recursively to this

partition (we will not do this…)

Projection based on External Hashing

INFO20003 Database Systems 4© University of Melbourne 2018

Our example:

Cost = ReadTable +

WriteProjectedPages +

ReadProjectedPages

= 1000 + 0.25 * 1000 + 250 = 1500 (I/O)

Cost of External Hashing

Cost = ReadTable +
WriteProjectedPages +
ReadProjectedPages

Read the entire table and project attributes

Write projected pages into corresponding partitions

Read partitions one by one, create another hash

table and discard duplicates within a bucket

INFO20003 Database Systems 5© University of Melbourne 2018

Remember this? Components of a DBMS

DBMS

Index files

Heap files

Database

Concurrency

control module

File and access

methods mgr.

Buffer pool mgr.

Disk space mgr.

Storage module Concurrency

control module

Transaction

mgr.

Lock mgr.

Crash recovery

module

Log mgr.

Query processing module

Parser/

Compiler
Optimizer Executor

This is one of several possible architectures;

each system has its own slight variations.

TODAY

Joins

Will briefly

touch upon …

INFO20003 Database Systems 6© University of Melbourne 2018

Coverage

• Nested loops join

• Sort-merge join

• Hash join

• General joins

Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems

INFO20003 Database Systems 7© University of Melbourne 2018

Joins

• Are very common and can be very expensive (cross
product in the worst case)

• There are many implementation techniques for join
operations

• Join techniques we will cover:
1. Nested-loops join
2. Sort-merge join
3. Hash join

INFO20003 Database Systems 8© University of Melbourne 2018

• In algebra: R S. They are very common and need to be
carefully optimized.

• R X S is large; so, R X S followed by a selection is inefficient.

• Join is associative and commutative:
 AxB == BxA
 Ax(BxC)==(AxB)xC

• Cost metric : Number of pages; Number of I/O

Equality Joins With One Join Column

SELECT *

FROM Reserves R1, Sailors S1

WHERE R1.sid=S1.sid



R S

12,10

11,80

13,74

12,20
13,75

13,35
13,7515,80

15,44

13,20
11,20

Example:

Left / Outer Right / Inner

INFO20003 Database Systems 9© University of Melbourne 2018

Schema for Examples

• Sailors (S):

–80 tuples per page, 500 pages

–NPages(S) = 500, NTuplesPerPage(S) = 80

–NTuples(S) = 500*80 = 40000

• Reserves (R):

–100 tuples per page, 1000 pages

–NPages(R) = 1000, NTuplesPerPage(R) =100

–NTuples(R) = 100000

Sailors (sid: integer, sname: string, rating: integer, age: real)

Reserves (sid: integer, bid: integer, day: dates, rname: string)

INFO20003 Database Systems 10© University of Melbourne 2018

Simple Nested Loops Join

• For each tuple in the outer relation R, we scan the entire inner

relation S

• Cost:

Cost (SNJL) = NPages(Outer) +

NTuples(Outer) * NPages(Inner)

• Our example:

Cost (SNLJ)= 1000+ 100*1000*500

= 50001000 (I/O)

foreach tuple r in R do

foreach tuple s in S do

if ri == sj then add to result

R S

12,10

11,80

13,74

12,20
13,75

13,35
13,7515,80

15,44

13,20
11,20

…

Pseudo code:

INFO20003 Database Systems 11© University of Melbourne 2018

Page-Oriented Nested Loops Join

• For each page of R

–get each page of S

–write out matching pairs of tuples , where r is in R-page

and S is in S-page

Cost (PNJL) = NPages(Outer) +

NPages(Outer) * NPages(Inner)

• Our example:

Cost (PNLJ)= 1000+1000*500 = 501000 (I/O)

foreach page bR in R do

foreach page bS in S do

foreach tuple r in bR do

foreach tuple s in bSdo

if ri == sj then add to result

R S

12,10

11,80

13,74

12,20
13,75

13,35
13,7515,80

15,44

13,20
11,20

Pseudo code:

INFO20003 Database Systems 15© University of Melbourne 2018

Block Nested Loops Join

• Page-oriented NL doesn’t exploit extra memory buffers

• Alternative approach:

–Use one page as an input buffer for scanning the inner S,

one page as the output buffer, and use all remaining pages

to hold ‘block’ of outer R

• For each matching tuple r in R-block, s in S-page, add to result. Then read next R-block, scan S, etc

. . .

R & S
block of R tuples

(B-2 pages)

Input buffer for S Output buffer

. . .

Join Result

INFO20003 Database Systems 16© University of Melbourne 2018

Block Nested Loops Join Cost

Cost (BNJL) = NPages(Outer) +

NBlocks(Outer) * NPages(Inner)

• NBlocks(Outer) =
𝑁𝑃𝑎𝑔𝑒𝑠(𝑂𝑢𝑡𝑒𝑟)

𝐵𝑙𝑜𝑐𝑘𝑠𝑖𝑧𝑒−2

• Our example:

Let’s say we have 102 pages of space in memory, and consider

Reserves (R) as the outer and Sailors (S) as the inner table.

NBlocks(R) = 1000/(102-2) = 10

Cost(BNLJ) = 1000 + 10* 500 = 6000 I/O

R S

12,10

11,80

13,74

12,20
13,75

13,35
13,7515,80

15,44

13,20
11,20

Query Processing: Joins

• Nested loops join

• Sort-merge join

• Hash join

• General joins

Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems

• Sort R and S on the join column, then scan them to do a

merge (on join column), and output result tuples

• Sorted R is scanned once;

• Each S group of the same key values is scanned once per matching

R tuple (typically means Sorted S is scanned once too).

• Useful when:

–one or both inputs are already sorted on join attribute(s)

–output is required to be sorted on join attributes(s)

Sort-Merge Join (R S)
i=j

R S

12,10
12,10
11,80

13,75

12,20
13,74

13,35
13,75

15,80
15,44

13,20

11,20

Sort-Merge Join Cost

Cost (SMJ) = Sort(Outer) + Sort(Inner)

+ NPages(Outer) + NPages(Inner)

Sort(R) = External Sort Cost = 2*NumPasses*NPages(R)

Our example:

Let’s say that both Reserves and Sailors can be sorted in 2 passes,

then:

Cost(SMJ) = Sort R + Sort S + NPages(R) + NPages(S)

= 2*2*NPages(R)+ 2*2*NPages(S)

+ NPages(R) + NPages (S)

= 5*1000 + 5* 500 = 7500 I/O

Sort inputs

Merge inputs

Query Processing: Joins

• Nested loops join

• Sort-merge join

• Hash join

• General joins

Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems

Hash-Join

• Partition both relations

using hash function h:

R tuples in partition l

will only match S

tuples in partition I

• Read in a partition of

R, hash it using h2 (<>

h!). Scan matching

partition of S, probe

hash table for matches

Partitions

of R & S

Input buffer
for Si

Hash table for partition

Ri (k < B-1 pages) B main memory buffersDisk Output buffer Disk Join Result hash fn h2 h2 B main memory buffers DiskDisk Original Relation OUTPUT 2INPUT 1 hash function h B-1 Partitions 1 2 B-1 . . . INFO20003 Database Systems 23© University of Melbourne 2018 Hash-Join Cost 1. In partitioning phase, we read+write both relations 2. In matching phase, we read both relations • Our example: Cost(HJ) = 2*NPages(R) + 2*NPages(S) + NPages(R) + NPages(S) = 3 * 1000 + 3* 500 = 4500 I/Os Cost (HJ) = 2 * NPages(Outer) + 2* NPages(Inner) + NPages(Outer) + NPages(Inner) Create partitions Match partitions INFO20003 Database Systems 24© University of Melbourne 2018 Watch this video if you are confused https://www.youtube.com/watch?v=o1dMJ6-CKzU From 0:58 https://www.youtube.com/watch?v=o1dMJ6-CKzU INFO20003 Database Systems 25© University of Melbourne 2018 INFO20003 Database Systems 27© University of Melbourne 2018 Query Processing: Joins • Nested loops join • Sort-merge join • Hash join • General joins Readings: Chapter 14, Ramakrishnan & Gehrke, Database Systems INFO20003 Database Systems 28© University of Melbourne 2018 General Join Conditions • Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname): –For Sort-Merge and Hash Join, sort/partition on combination of the two join columns • Inequality conditions (e.g., R.rname < S.sname): –Hash Join, Sort Merge Join not applicable –Block NL quite likely to be the best join method here INFO20003 Database Systems 31© University of Melbourne 2018 Summary • A virtue of relational DBMSs: – Queries are composed of a few basic operators – Implementation of operators can be carefully tuned – Important to do this • Many alternative implementations for each operator –No universally superior technique for most operators • Must consider alternatives for each operation in a query and choose best one based on system statistics… –Part of the broader task of optimizing a query composed of several operations INFO20003 Database Systems 32© University of Melbourne 2018 What’s examinable • Understand alternatives for join operator implementations –Be able to calculate the cost of alternatives • Important for Assignment 3 as well INFO20003 Database Systems 33© University of Melbourne 2018 Next Lecture - - • Query optimization ‒ How does a DBMS pick a good query plan?

Related Posts