2021/4/28 Sort-merge Join
Sort-merge Join
Sort-Merge Join Sort-Merge Join on Example
>>
COMP9315 21T1 ♢ Sort-merge Join ♢ [0/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
1/12
2021/4/28 Sort-merge Join
❖ Sort-Merge Join Basic approach:
sortbothrelationsonjoinattribute (reminder:Join [i=j] (R,S))
scan together using merge to form result (r,s) tuples
Advantages:
no need to deal with “entire” S relation for each r tuple
deal with runs of matching R and S tuples
Disadvantages:
costofsortingbothrelations (alreadysortedonjoin key?)
some rescanning required when long runs of S tuples
COMP9315 21T1 ♢ Sort-merge Join ♢ [1/10]
∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
2/12
2021/4/28 Sort-merge Join
<< ∧ >>
❖ Sort-Merge Join (cont) Standard merging requires two cursors:
while (r != eof && s != eof) {
if (r.val ≤ s.val) { output(r.val); next(r); } else { output(s.val); next(s); }
}
while (r != eof) { output(r.val); next(r); }
while (s != eof) { output(s.val); next(s); }
COMP9315 21T1 ♢ Sort-merge Join ♢ [2/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
3/12
2021/4/28 Sort-merge Join
<< ∧ >>
❖ Sort-Merge Join (cont)
Merging for join requires 3 cursors to scan sorted
relations:
r = current record in R relation
s = current record in S relation
ss = start of current run in S relation
COMP9315 21T1 ♢ Sort-merge Join ♢ [3/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
4/12
2021/4/28 Sort-merge Join
<< ∧ >>
❖ Sort-Merge Join (cont) Algorithm using query iterators/scanners:
Query ri, si; Tuple r,s;
ri = startScan(“SortedR”);
si = startScan(“SortedS”);
while ((r = nextTuple(ri)) != NULL
&& (s = nextTuple(si)) != NULL) {
// align cursors to start of next common run
while (r != NULL && r.i < s.j)
r = nextTuple(ri);
if (r == NULL) break;
while (s != NULL && r.i > s.j)
s = nextTuple(si);
if (s == NULL) break;
// must have (r.i == s.j) here
…
COMP9315 21T1 ♢ Sort-merge Join ♢ [4/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
5/12
2021/4/28 Sort-merge Join
❖ Sort-Merge Join (cont) …
// remember start of current run in S
TupleID startRun = scanCurrent(si)
// scan common run, generating result tuples
while (r != NULL && r.i == s.j) {
while (s != NULL and s.j == r.i) {
addTuple(outbuf, combine(r,s));
if (isFull(outbuf)) {
writePage(outf, outp++, outbuf);
clearBuf(outbuf);
}
s = nextTuple(si);
}
r = nextTuple(ri);
setScan(si, startRun);
}
}
COMP9315 21T1 ♢ Sort-merge Join ♢ [5/10]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
6/12
2021/4/28 Sort-merge Join
❖ Sort-Merge Join (cont) Buffer requirements:
for sort phase:
as many as possible (remembering that cost is O(logN) )
if insuf cient buffers, sorting cost can dominate
for merge phase:
one output buffer for result
one input buffer for relation R
(preferably) enough buffers for longest run in S
COMP9315 21T1 ♢ Sort-merge Join ♢ [6/10]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
7/12
2021/4/28 Sort-merge Join
❖ Sort-Merge Join (cont)
Cost of sort-merge join. Step1:sorteachrelation (ifnotalreadysorted):
Cost = 2.bR (1 + logN-1(bR /N)) + 2.bS (1 + logN- 1(bS /N))
(where N = number of memory buffers)
Step 2: merge sorted relations:
if every run of values in S ts completely in buffers, mergerequiressinglescan, Cost=bR+bS
if some runs in of values in S are larger than buffers,
need to re-scan run for each corresponding value from R
COMP9315 21T1 ♢ Sort-merge Join ♢ [7/10]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
8/12
2021/4/28 Sort-merge Join
<< ∧ >> ❖ Sort-Merge Join on Example
SQL query on student/enrolment database:
select E.subj, S.name
from Student S join Enrolled E on (S.id = E.stude)
order by E.subj
And its relational algebra equivalent: Sort[subj] ( Project[subj,name] ( Join[id=stude]
(Student,Enrolled) ) )
Database: rS = 20000, cS = 20, bS = 1000, rE =
80000, cE = 40, bE = 2000
We are interested only in the cost of Join, with N buffers
COMP9315 21T1 ♢ Sort-merge Join ♢ [8/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
9/12
2021/4/28 Sort-merge Join
<< ∧ >> ❖ Sort-Merge Join on Example (cont)
Case1: Join[id=stude](Student,Enrolled)
relations are not sorted on id#
memory buffers N=32; all runs are of length < 30
Cost =
sort(S) + sort(E) + bS + bE
= 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS+bE
= 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
= 6000+12000+1000+2000
= 21,000
COMP9315 21T1 ♢ Sort-merge Join ♢ [9/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
10/12
2021/4/28 Sort-merge Join
<< ∧ ❖ Sort-Merge Join on Example (cont)
Case2: Join[id=stude](Student,Enrolled)
Student and Enrolled already sorted on id# memory buffers N=4 (S input, 2 × E input, output) 5% of the "runs" in E span two pages
there are no "runs" in S, since id# is a primary key
For the above, no re-scans of E runs are ever needed
Cost = 2,000 + 1,000 = 3,000 (regardless of which relation is outer)
COMP9315 21T1 ♢ Sort-merge Join ♢ [10/10]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
11/12
2021/4/28 Sort-merge Join
Produced: 31 Mar 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-sort-merge/slides.html
12/12