2021/4/28 Hash Join
Hash Join
Hash Join
Simple Hash Join Grace Hash Join Hybrid Hash Join Costs for Join Example Join in PostgreSQL
>>
COMP9315 21T1 ♢ Hash Join ♢ [0/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
1/19
2021/4/28 Hash Join
❖ Hash Join Basic idea:
use hashing as a technique to partition relations to avoid having to consider all pairs of tuples
Requires suf cent memory buffers
to hold substantial portions of partitions
(preferably) to hold largest partition of outer relation
Other issues:
worksonlyforequijoin R.i=S.j (butthisisacommoncase) susceptibletodataskew (orpoorhashfunction)
Variations: simple, grace, hybrid.
COMP9315 21T1 ♢ Hash Join ♢ [1/17]
∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
2/19
2021/4/28 Hash Join
❖ Simple Hash Join Basic approach:
hash part of outer relation R into memory buffers (build) scan inner relation S, using hash to search (probe)
ifR.i=S.j,thenh(R.i)=h(S.j) (hashtosamebuffer)
only need to check one memory buffer for each S tuple repeat until whole of R has been processed
No over ows allowed in in-memory hash table works best with uniform hash function
can be adversely affected by data/hash skew
COMP9315 21T1 ♢ Hash Join ♢ [2/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
3/19
2021/4/28 Hash Join
❖ Simple Hash Join (cont) Data ow in hash join:
<< ∧ >>
COMP9315 21T1 ♢ Hash Join ♢ [3/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
4/19
2021/4/28 Hash Join
❖ Simple Hash Join (cont)
Algorithm for simple hash join Join[R.i=S.j](R,S):
for each tuple r in relation R {
if (buffer[h(R.i)] is full) {
for each tuple s in relation S {
for each tuple rr in buffer[h(S.j)] {
if ((rr,s) satisfies join condition) {
add (rr,s) to result
}}}
clear all hash table buffers
}
insert r into buffer[h(R.i)]
}
Best case: # join tests ≤ rS.cR (cf. nested-loop rS.rR) COMP9315 21T1 ♢ Hash Join ♢ [4/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
5/19
2021/4/28 Hash Join
❖ Simple Hash Join (cont)
Cost for simple hash join …
Best case: all tuples of R t in the hash table
Cost=bR+bS
Same page reads as block nested loop, but less join tests
Good case: re ll hash table m times (where m ≥ ceil(bR / (N-3)) ) Cost=bR+m.bS
More page reads than block nested loop, but less join tests
Worst case: everything hashes to same page Cost = bR + bR.bS
COMP9315 21T1 ♢ Hash Join ♢ [5/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
6/19
2021/4/28 Hash Join
❖ Grace Hash Join Basic approach (for R ⨝ S ):
partition both relations on join attribute using hashing (h1) load each partition of R into N-3*buffer hash table (h2) scan through corresponding partition of S to form results repeat until all partitions exhausted
For best-case cost (O(bR + bS) ):
need ≥ √bR buffers to hold largest partition of outer relation
If < √bR buffers or poor hash distribution
need to scan some partitions of S multiple times
COMP9315 21T1 ♢ Hash Join ♢ [6/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
7/19
2021/4/28 Hash Join
❖ Grace Hash Join (cont) Partition phase (applied to both R and S):
<< ∧ >>
COMP9315 21T1 ♢ Hash Join ♢ [7/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
8/19
2021/4/28 Hash Join
❖ Grace Hash Join (cont) Probe/join phase:
<< ∧ >>
The second hash function (h2) simply speeds up the matching process. Without it, would need to scan entire R partition for each record in S partition.
COMP9315 21T1 ♢ Hash Join ♢ [8/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
9/19
2021/4/28 Hash Join
❖ Grace Hash Join (cont) Cost of grace hash join:
#pagesinallpartition lesofRel≅bRel (maybeslightlymore) partition relation R … Cost = read(bR) + write(≅bR) =
2bR
partition relation S … Cost = read(bS) + write(≅bS) = 2bS
probe/join requires one scan of each (partitioned) relation Cost = bR+bS
allhashingandcomparisonoccursinmemory ⇒ tinycost TotalCost = 2bR+2bS+bR+bS = 3(bR+bS)
COMP9315 21T1 ♢ Hash Join ♢ [9/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
10/19
2021/4/28 Hash Join
<< ∧ >>
❖ Hybrid Hash Join
A variant of grace hash join if we have √bR < N < bR+2
create k≪N partitions, 1 in memory, k-1 on disk buffers: 1 input, k-1 output, p = N-k-2 for in-memory
partition
When we come to scan and partition S relation
any tuple with hash 0 can be resolved (using in-memory partition)
other tuples are written to one of k partition les for S Final phase is same as grace join, but with only k-1 partitions. Comparison:
grace hash join creates N-1 partitions on disk
hybrid hash join creates 1 (memory) + k-1 (disk) partitions
COMP9315 21T1 ♢ Hash Join ♢ [10/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
11/19
2021/4/28 Hash Join
❖ Hybrid Hash Join (cont)
First phase of hybrid hash join (partitioning R):
<< ∧ >>
COMP9315 21T1 ♢ Hash Join ♢ [11/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
12/19
2021/4/28 Hash Join
❖ Hybrid Hash Join (cont)
Next phase of hybrid hash join (partitioning S):
<< ∧ >>
COMP9315 21T1 ♢ Hash Join ♢ [12/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
13/19
2021/4/28 Hash Join
❖ Hybrid Hash Join (cont)
Final phase of hybrid hash join ( nishing join):
<< ∧ >>
COMP9315 21T1 ♢ Hash Join ♢ [13/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
14/19
2021/4/28 Hash Join
❖ Hybrid Hash Join (cont) Some observations:
with k partitions, each partition has expected size ceil(bR/k)
holding 1 partition in memory needs ceil(bR/k) buffers
trade-off between in-memory partition space and #partitions
Other notes:
if N = bR+2, using block nested loop join is simpler
cost depends on N (but less than grace hash join) For k partitions, Cost = (3-2/k).(bR+bS)
COMP9315 21T1 ♢ Hash Join ♢ [14/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
15/19
2021/4/28 Hash Join
❖ Costs for Join Example
SQL query on student/enrolment database:
select E.subj, S.name
from Student S join Enrolled E on (S.id = E.stude)
order by E.subj
And its relational algebra equivalent:
Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )
Database: rS = 20000, cS = 20, bS = 1000, rE = 80000, cE = 40, bE = 2000
We are interested only in the cost of Join, with N buffers COMP9315 21T1 ♢ Hash Join ♢ [15/17]
<< ∧ >>
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
16/19
2021/4/28 Hash Join
<< ∧ >>
❖ Costs for Join Example (cont)
Costs for hash join variants on example (N=103):
Hash Join Method
Cost Analysis
Cost
Hybrid Hash Join
(3-2/k).(bS+bE) = 2.8((1000+2000)
assuming k = 10 … and one partition ts in 91 pages
8700
Grace Hash Join
3(bS+bE) = 3(1000+2000)
9000
Simple Hash Join
bS + bE.ceil(bR/(N-3)) =
1000 + ceil(1000/100).2000 = 1000 + 10.2000
21000
Sort-merge Join
sort(S) + sort(E) + bS + bE = 2.1000.2 + 2.2000.2 + 1000 + 2000
11000
Nested-loop Join
bS + bE.ceil(bS/(N-2)) =
1000 + 2000.ceil(1000/101) = 1000 + 10.2000
21000
COMP9315 21T1 ♢ Hash Join ♢ [16/17]
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
17/19
2021/4/28 Hash Join
❖ Join in PostgreSQL
Join implementations are under: src/backend/executor PostgreSQL suports three kinds of join:
nested loop join (nodeNestloop.c) sort-mergejoin (nodeMergejoin.c) hashjoin (nodeHashjoin.c) (hybridhashjoin)
Query optimiser chooses appropriate join, by considering physical characteristics of tables being joined estimated selectivity (likely number of result tuples)
COMP9315 21T1 ♢ Hash Join ♢ [17/17]
<< ∧
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
18/19
2021/4/28 Hash Join
Produced: 25 Apr 2021
https://cgi.cse.unsw.edu.au/~cs9315/21T1/lectures/join-hash/slides.html
19/19