2021/4/28 Implementing Join
Implementing Join
Join
Join Example Implementing Join Join Summary
Join in PostgreSQL
>>
COMP9315 21T1 ♢ Implementing Join ♢ [0/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
1/11
2021/4/28 Implementing Join
∧ >>
❖ Join
DBMSs are engines to store, combine and
lter information.
Join (⨝) is the primary means of combining
information.
Join is important and potentially expensive
Most common join condition: equijoin, e.g.
(R.pk = S.fk)
Join varieties (natural, inner, outer, semi, anti) all behave similarly.
We consider three strategies for implementing join
nested loop … simple, widely applicable, inef cient without buffering
sort-merge … works best if tables are sorted on join attributes
hash-based … requires good hash function and suf cient buffering
COMP9315 21T1 ♢ Implementing Join ♢ [1/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
2/11
2021/4/28 Implementing Join
<< ∧ >>
❖ Join Example
Consider a university database with the
schema:
create table Student(
id integer primary key,
name text, …
);
create table Enrolled(
stude integer references Student(id),
subj text references Subject(code), …
);
create table Subject(
code text primary key,
title text, …
);
We use this example for each join implementation, by way of comparison
COMP9315 21T1 ♢ Implementing Join ♢ [2/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
3/11
2021/4/28 Implementing Join
<< ∧ >>
❖ Join Example (cont)
Goal: List names of students in all subjects,
arranged by subject.
SQL query to provide this information:
select E.subj, S.name
from Student S
join Enrolled E on (S.id = E.stude)
order by E.subj, S.name;
And its relational algebra equivalent: Sort[subj] ( Project[subj,name] ( Join[id=stude]
(Student,Enrolled) ) )
To simplify formulae, we denote Student by S
and Enrolled by E COMP9315 21T1 ♢ Implementing Join ♢ [3/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
4/11
2021/4/28 Implementing Join
❖ Join Example (cont) Some database statistics:
<< ∧ >>
Sym
Meaning
Value
rS
# student records
20,000
rE
# enrollment records
80,000
cS
Student records/page
20
cE
Enrolled records/page
40
bS
# data pages in Student
1,000
bE
# data pages in Enrolled
2,000
Also, in cost analyses later, N = number of memory buffers.
COMP9315 21T1 ♢ Implementing Join ♢ [4/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
5/11
2021/4/28 Implementing Join
<< ∧ >>
❖ Join Example (cont)
Relation statistics for Out = Student ⨝
Enrolled
Sym
Meaning
Value
rOut
# tuples in result
80,000
COut
result records/page
80
bOut
# data pages in result
1,000
Notes:
rOut … one result tuple for each Enrolled tuple
COut … result tuples have only subj and name
in analyses, ignore cost of writing result … same in all methods
COMP9315 21T1 ♢ Implementing Join ♢ [5/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
6/11
2021/4/28 Implementing Join
<< ∧ >>
❖ Implementing Join
A naive join implementation strategy
for each tuple TS in Students { for each tuple TE in Enrolled {
if (testJoinCondition(C,TS,TE)) { T1 = concat(TS,TE)
T2 = project([subj,name],T1)
ResultSet = ResultSet ∪ {T2} }}}
Problems:
join condition is tested rE.rS = 16×108 times
tuples scanned = rS + rS.rE = 20000 + 20000.80000 = 1600020000
COMP9315 21T1 ♢ Implementing Join ♢ [6/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
7/11
2021/4/28 Implementing Join
<< ∧ >> ❖ Implementing Join (cont)
An alternative naive join implementation strategy
for each tuple TE in Enrolled { for each tuple TS in Students {
if (testJoinCondition(C,TS,TE)) { T1 = concat(TS,TE)
T2 = project([subj,name],T1)
ResultSet = ResultSet ∪ {T2} }}}
Relatively minor performance difference …
tuples scanned = rE + rE.rS = 80000 + 80000.20000 = 1600080000
Terminology: relation in outer loop is outer; other relation is inner
COMP9315 21T1 ♢ Implementing Join ♢ [7/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
8/11
2021/4/28 Implementing Join
<< ∧ >>
❖ Join Summary
None of nested-loop/sort-merge/hash join is
superior in some overall sense.
Which strategy is best for a given query depends on:
sizesofrelationsbeingjoined, sizeof buffer pool
anyindexingonrelations, whether relations are sorted
which attributes and operations are used in the query
number of tuples in S matching each tuple inR
distribution of data values (uniform, skew, …)
Given query Q, choosing the “best” join strategy is critical;
the cost difference between best and worst case can be very large.
E.g. Join[id=stude](Student,Enrolled): 3,000… 2,000,000 page reads
COMP9315 21T1 ♢ Implementing Join ♢ [8/9]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
9/11
2021/4/28 Implementing Join
❖ Join in PostgreSQL Join implementations are under:
src/backend/executor
PostgreSQL suports the three join methods that we discuss:
nested loop join (nodeNestloop.c) sort-mergejoin (nodeMergejoin.c)
hashjoin (nodeHashjoin.c) (hybridhash join)
The query optimiser chooses appropriate join, by considering
physical characteristics of tables being joined
estimated selectivity (likely number of result tuples)
COMP9315 21T1 ♢ Implementing Join ♢ [9/9]
<< ∧
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
10/11
2021/4/28 Implementing Join
Produced: 22 Mar 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join/slides.html
11/11