Query Evaluation 4
Assume that both tables to be joined are sorted on the join attribute
▪ The tables may be joined with one pass
▪ Like merging two sorted runs; cost = B(R) + B(S)
Read in pages of R and S and join on x (see the sketch below)
▪ While xr != xs
  ▪ If xr < xs move to the next R record
  ▪ Else move to the next S record
▪ If xr == xs
  ▪ Concatenate r and s, and
  ▪ Add to the output buffer
▪ Repeat until all records have been read
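A minimal Python sketch of this merge step (all names illustrative): it assumes each table is an in-memory list of record dicts with join attribute x and, unlike the outline above, it also handles runs of equal join values on both sides.

```python
# Minimal sketch: merge join of two tables already sorted on join attribute x.
# Tables are lists of dicts; S's attributes are prefixed to avoid name clashes.
def merge_join_sorted(R, S):
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i]["x"] < S[j]["x"]:
            i += 1                                   # xr < xs: advance R
        elif R[i]["x"] > S[j]["x"]:
            j += 1                                   # xs < xr: advance S
        else:                                        # xr == xs: join the groups
            x, j_start = R[i]["x"], j
            while i < len(R) and R[i]["x"] == x:
                j = j_start
                while j < len(S) and S[j]["x"] == x:
                    out.append({**R[i], **{"s_" + k: v for k, v in S[j].items()}})
                    j += 1
                i += 1
    return out

# id values taken from the figure below: matches on 11, 23, 41 and 49
R = [{"x": v} for v in (11, 17, 18, 23, 34, 41, 44, 47, 49, 53, 61)]
S = [{"x": v} for v in (11, 16, 20, 23, 32, 40, 41, 42, 46, 49, 52)]
print(merge_join_sorted(R, S))
```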
[Figure: R and S sorted on id — R: …, 11, 17, 18, 23, 34, 41, 44, 47, 49, 53, 61, …; S: …, 11, 16, 20, 23, 32, 40, 41, 42, 46, 49, 52, …]
But R and S may not be sorted on the join attribute
The sort-merge join (aka sort-join) combines the join operation with the merge step of external merge sort
▪ R and S are processed independently
▪ The first pass makes sorted runs of R and S of size M
▪ Merge the runs of R and S as in external merge sort until the combined number of sorted runs is less than M
  ▪ If M is large or R and S are small this step may not be necessary
▪ The final merge phase of the external sort algorithm is combined with the join, by comparing the runs of R and S
  ▪ Records that do not meet the join condition are discarded
  ▪ Records that meet the condition are concatenated and output
Given sufficient main memory, sort-merge join can be performed in two passes
▪ For a cost of 3(B(R) + B(S))
Main memory must be large enough to allow an input buffer for each sorted run of both R and S
▪ Main memory must be greater than √(B(R) + B(S)) to perform the join in two passes
▪ The initial pass produces B(R) / M + B(S) / M sorted runs of size M
▪ If M is greater than √(B(R) + B(S)) then B(R) / M + B(S) / M must be less than M
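As a rough check (a sketch, not part of the slides), the helper below recomputes the two-pass condition with the numbers from the figure below (B(R) = 49, B(S) = 28); the function name and the M − 1 input-buffer check are illustrative choices.

```python
import math

# Rough two-pass check for sort-merge join
def runs_after_first_pass(B_R, B_S, M):
    return math.ceil(B_R / M) + math.ceil(B_S / M)

for M in (10, 8):
    runs = runs_after_first_pass(49, 28, M)
    # need an input frame per run plus one output frame, i.e. runs <= M - 1;
    # equivalently (roughly) M > sqrt(B(R) + B(S))
    print(M, runs, "two passes" if runs <= M - 1 else "another merge pass needed",
          M > math.sqrt(49 + 28))
```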
Note: the cost to write out the final result is not included
[Figure: main memory with B(R) = 49, B(S) = 28 — with M = 10 > √(B(R) + B(S)) there is an input page for each sorted run of R and S after the initial sort pass; with M = 8 < √(B(R) + B(S)) there are insufficient frames for a page per run, so another merge pass must be performed]
If both relations have a primary tree index on the join attribute a zig-zag join can be performed
▪ Scan the leaves of the two B+ trees in order from the left
  ▪ i.e. from the record with the smallest value for the join attribute
▪ When the search key value of one index is higher, scan the other index
▪ When both indexes contain the same search key value, matching records are retrieved and concatenated
▪ Recall that the index is typically much smaller than the file
Cost = blocks of leaves of both indexes + blocks of matching records
The hash join algorithm has two phases
▪ Partitioning, and
▪ Probing
Partitioning (see the sketch below)
▪ Both relations are partitioned using the same hash function, h, on the join attribute
▪ Records in one partition of R can only match records in the matching partition of S
▪ One input buffer page and M - 1 output buffer pages are used to make M - 1 partitions for each relation
▪ If the largest partition of one relation does not fit in main memory, the relations must be further partitioned
[Figure: partitioning R with M = 7 buffers (S not shown)]
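A minimal Python sketch of the partitioning phase under simplifying assumptions: relations are in-memory lists of record dicts, pages and disk I/O are not modelled, and h and attr are supplied by the caller.

```python
# Sketch of the partitioning phase of hash join
def partition(relation, M, h, attr):
    # 1 input buffer + (M - 1) output buffers, giving M - 1 partitions
    partitions = [[] for _ in range(M - 1)]
    for record in relation:                 # read the relation one page at a time
        bucket = h(record[attr]) % (M - 1)  # same h used for both R and S
        partitions[bucket].append(record)   # a buffer is flushed to disk when full
    return partitions
```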
Probing (see the sketch below)
▪ Read in one partition of R, where R is the smaller relation
▪ To reduce CPU costs, build an in-memory hash table using hash function h2 (h2 ≠ h)
▪ Read the corresponding partition of S into an input buffer one page at a time
▪ Join matching records using the hash table
▪ Repeat for each partition of R
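And a matching sketch of the probing phase; Python's built-in hash stands in for h2, and R is taken to be the smaller relation, as above.

```python
# Sketch of the probing phase of hash join
def probe(R_partitions, S_partitions, attr):
    out = []
    for R_part, S_part in zip(R_partitions, S_partitions):
        # build an in-memory hash table on one partition of R
        table = {}
        for r in R_part:
            table.setdefault(hash(r[attr]), []).append(r)
        # read the corresponding partition of S one page at a time and probe
        for s in S_part:
            for r in table.get(hash(s[attr]), []):
                if r[attr] == s[attr]:      # guard against h2 collisions
                    out.append({**r, **{"s_" + k: v for k, v in s.items()}})
    return out
```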
Cost
▪ If each partition of one relation fits in main memory the overall cost is 3(B(R) + B(S))
[Figure: R and S after partitioning, M = 7]
Relations must be partitioned until the largest partition of the smaller relation (S) fits in main memory
Ideally only one partitioning step is required
▪ Which requires that M - 2 is at least √B(S)
▪ Buffers for R input and for output are also needed
▪ Partitioning produces √B(S) - 1 partitions
  ▪ Of average size B(S) / (√B(S) - 1), i.e. roughly √B(S)
▪ If M - 2 is at least √B(S) the cost of hash join is 3(B(R) + B(S))
If M < √B(S) then B(S) / M must be larger than M, and the partitions are larger than main memory
▪ Therefore the relations must be further partitioned
Assuming that partitions are the same size
Hybrid hash join can be used if M is large
▪ Retain an entire partition of the smaller relation (S) during the partitioning phase
▪ Eliminating the need to write out the partition, and read it back in during the probing phase
▪ Matching R records are joined and written out to the result when R is partitioned
▪ Hence the records of both R and S belonging to that partition are only read once
This approach can be generalized to retain more than one partition where possible
Partition S (the smaller relation) into k partitions
▪ Retain t partitions, S1, ..., St in main memory
▪ The remaining k - t partitions, St+1, ..., Sk are written to disk
Partition R into k partitions
▪ The first t partitions of R are joined with S as R is partitioned, since those t partitions of S are still in main memory
▪ The remaining k - t partitions are written to disk
Join the remaining k - t partitions as normal
The cost improvement is incremental
Cost is B(R) + B(S) + 2 * ((k - t) / k) * (B(R) + B(S))
▪ = (3 - 2 * t / k)(B(R) + B(S)) ≈ (3 - 2 * M / B(S))(B(R) + B(S))
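The cost formula as a small helper (a sketch; writing out the final result is excluded, as elsewhere in these slides):

```python
# Hybrid hash join cost in block I/Os
def hybrid_hash_cost(B_R, B_S, k, t):
    # read both relations once, then write and re-read only the k - t spilled partitions
    return (B_R + B_S) + 2 * ((k - t) / k) * (B_R + B_S)

print(hybrid_hash_cost(100_000, 1_000, k=6, t=1))   # ~269,333 (worked example below)
```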
There must be one main memory buffer for each partition
▪ So k ≤ M, where k = the number of partitions and t = the number of partitions to be retained in main memory
▪ Hybrid hash join is only used where M > √B(S), such that B(S) / k < M
The ratio t / k, the fraction of S kept in main memory, should be as large as possible
▪ The retained partitions must fit in main memory with sufficient buffers for the other k - t partitions
▪ So (t / k) * B(S) + k - t ≤ M
▪ One approach: retain one partition and make as few partitions as possible, i.e. t = 1 and k as small as possible
Statistics
▪ B(R) = 100,000
▪ B(S) = 1,000
▪ M = 200, note that √B(S) = 31.6
Choose values for k and t
▪ k is the number of partitions and t is the number to be retained in main memory
▪ Select t = 1
▪ k should be as small as possible while still allowing
  ▪ one partition to be retained in main memory
  ▪ one output page for each of the other k - t partitions
  ▪ one input page
▪ k = 6: each partition is 167 blocks; 1 is retained, leaving 33 frames for 1 input buffer and 5 output buffers for the other partitions
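A sketch of this choice for t = 1, using the numbers above; smallest_k and its buffer-count check are illustrative, not part of the slides.

```python
import math

# Illustrative helper: with t = 1, pick the smallest k such that one retained
# partition of S, one input page and k - 1 output pages fit in M frames.
def smallest_k(B_S, M, t=1):
    for k in range(2, M + 1):
        if t * math.ceil(B_S / k) + 1 + (k - t) <= M:
            return k
    return None                      # no single-step partitioning possible

k = smallest_k(B_S=1_000, M=200)
print(k, math.ceil(1_000 / k))       # k = 6, partitions of 167 blocks each
```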
[Figure: main memory frames (0 … 172) during hybrid hash join]
▪ Partition S: read in all of S, write out (k - t) / k = 5/6 of S, and retain partition 1 of S (s1)
▪ Partition R: read in all of R, write out (k - t) / k = 5/6 of R, and join partition 1 of R with s1
▪ Read in the second partition of S, then scan and join the second partition of R
▪ Repeat for the remaining four partitions of R and S
Statistics
▪ B(R) = 100,000
▪ B(S) = 1,000
▪ k = 6, t = 1
Cost
▪ Read all of S – cost = B(S) = 1,000
▪ Write out 5/6 of S – cost = B(S) * 5/6 = 833
▪ Read all of R – cost = B(R) = 100,000
▪ Write out 5/6 of R – cost = B(R) * 5/6 = 83,333
▪ Read remaining partitions of S – cost = 833
▪ Scan and probe matching partitions of R – cost = 83,333
▪ Total cost = B(R) + B(S) + 2 * (5/6) * (B(R) + B(S)) = 269,333
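The same total, itemized (a sketch; round() hides the fractional blocks):

```python
# Itemized I/O for this example
B_R, B_S, spilled = 100_000, 1_000, 5 / 6
steps = [B_S,                    # read all of S
         round(B_S * spilled),   # write out 5/6 of S
         B_R,                    # read all of R
         round(B_R * spilled),   # write out 5/6 of R
         round(B_S * spilled),   # read remaining partitions of S
         round(B_R * spilled)]   # scan and probe matching partitions of R
print(sum(steps))                # 269,332 ~= (3 - 2 * 1/6) * (B(R) + B(S))
```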
If the smaller relation fits in main memory the costs are identical: B(R) + B(S)
▪ The smaller relation is read once
▪ The larger relation is scanned once to join the records
Otherwise hybrid hash join is more efficient
▪ Block nested loop reads R once, but S once for each clump of R
▪ Hybrid hash join reads one partition of R and S once
  ▪ Reads the other partitions twice and writes them once
  ▪ And the records of both R and S belonging to a particular partition are only read once, after the partitioning phase
Sort-join in 2 passes requires M > √(B(R) + B(S))
▪ With M = 10, B(R) = 60, B(S) = 80 there are insufficient frames!
Hash join in 2 passes requires M > √B(smaller relation)
▪ With M = 10, B(R) = 60, B(S) = 80 this is OK
But
▪ sort-join is not sensitive to data skew
▪ sort-join results are sorted on the join attribute
[Figure: sorted runs of R and S after the initial sort pass vs. partitions of R and S]
Simple nested loop join (read S for each record)
▪ Retains the original order of R (but an awful algorithm …)
Index nested loop join
▪ Retains the original order of R
Sort join
▪ Ordered by the join attribute
Zig-zag join
▪ Ordered by the join attribute
All other join methods
▪ No order
Order might make an upstream operation more efficient
▪ Such as a join with a third table on the same join attribute
The join process is more complex if the join condition is not simple equality on one attribute
For equalities over several attributes
▪ Sort-merge and hash join must sort (or hash) over all of the
attributes in the selection
▪ An index that matches one of the equalities may be used for the index nested loop join
For inequalities (<, ≤, etc.)
▪ Hash indexes cannot be used for index nested loop joins
▪ Sort-merge and hash joins are not possible
▪ Other join algorithms are unaffected
SELECT fName, lName FROM Patient
INTERSECT
SELECT fName, lName FROM Doctor

πfName,lName(Patient) ∩ πfName,lName(Doctor)

Note that set operations, unlike other operations, remove duplicates by default
Intersection R ∩ S
▪ A join where the condition is equality on all attributes
Cartesian product R × S
▪ A special case of join where there is no join condition
▪ All records are joined to each other
πfName,lName(Patient) ⋈ πfName,lName(Doctor)
Union using sorting
▪ Sort R and S using all fields
▪ Scan and merge the results while removing duplicates
Union using hashing
▪ Partition R and S using a hash function h
▪ For each partition of the smaller relation (S)
▪ Build an in-memory hash table (using h2)
▪ Scan the corresponding partition of R, and for each record probe the hash table; if it is not in the table, add it
Set difference
▪ Similar to union except that for R - S, if a record of R is not in the hash table for S it is added to the result
▪ The result is kept separate from the S hash table (see the sketch below)
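A minimal sketch of the probing step for one pair of matching partitions, assuming records are hashable values (e.g. tuples); a Python set stands in for the in-memory h2 hash table.

```python
# Hash-based union and set difference, one pair of matching partitions
def union_partition(R_part, S_part):
    seen, out = set(), []
    for rec in list(S_part) + list(R_part):   # probe; keep the first occurrence only
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

def difference_partition(R_part, S_part):     # R - S, duplicates removed
    s_table, seen, out = set(S_part), set(), []
    for r in R_part:
        if r not in s_table and r not in seen:
            seen.add(r)
            out.append(r)                     # result kept separate from the S table
    return out
```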
Memory Requirements

Operation   | Algorithm   | M Requirement    | Disk I/O
σ, π        | scan        | 1                | B
γ, δ        | scan        | B                | B
∪, ∩, −, ×  | scan        | min(B(R), B(S))  | B(R) + B(S)
⋈           | nested loop | min(B(R), B(S))  | B(R) + B(S)
⋈           | nested loop | M ≥ 2            | B(R) + B(R) * B(S) / M

δ = duplicate removal, γ = grouping
cost is greater if the M requirement is not met
Operation | M Requirement    | Disk I/O        | Notes
γ, δ      | √B               | 3B              |
∪, ∩, −   | √(B(R) + B(S))   | 3(B(R) + B(S))  |
⋈         | √(B(R) + B(S))   | 3(B(R) + B(S))  | sort-merge join

cost is greater if the M requirement is not met
Operation | M Requirement | Disk I/O                                | Notes
γ, δ      | √B            | 3B                                      |
∪, ∩, −   | √B(S)         | 3(B(R) + B(S))                          | B(S) is smaller relation
⋈         | √B(S)         | 3(B(R) + B(S))                          |
⋈         | > √B(S)       | (3 - 2 * t / k)(B(R) + B(S)) (see note) | hybrid hash join

Assume B(S) ≤ B(R) and B(S) > M; cost is greater if the M requirement is not met
note – the reduction depends on the relative sizes of M and the smaller relation
Hash indexes can be used if there is an equality condition for every attribute in the search key
▪ e.g. a single hash index on {city, street, number}
▪ σcity=”London” ∧ street=”Baker” ∧ number=221(Detective) can be used
▪ σcity=”Los Angeles” ∧ street=”Cahuenga”(Detective) cannot
Tree indexes can be used if there is a selection on each of the first n attributes of the search key
▪ e.g. B+ tree index on {city, street, number}
▪ σcity=”London” ∧ street=”Baker” ∧ number=221(Detective) can be used
▪ σcity=”Los Angeles” ∧ street=”Cahuenga”(Detective) can be used
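A small sketch of these matching rules; search_key, the attribute sets, and the helper names are illustrative.

```python
# Illustrative index-matching checks for conjunctive selections
def hash_index_matches(search_key, equality_attrs):
    # a hash index needs an equality on every attribute of its search key
    return set(search_key) <= set(equality_attrs)

def tree_index_prefix(search_key, selected_attrs):
    # a B+ tree index can use the longest prefix of its search key that is
    # covered by the selection (usable if the prefix length is at least 1)
    n = 0
    for attr in search_key:
        if attr not in selected_attrs:
            break
        n += 1
    return n

key = ("city", "street", "number")
print(hash_index_matches(key, {"city", "street"}))   # False
print(tree_index_prefix(key, {"city", "street"}))    # 2, so the tree index is usable
```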
If an index matches a subset of the conjuncts
▪ Use the index to return a result that contains some
unwanted records
▪ Scan the result for matches to the other conjuncts
▪ σcity=”London” ∧ street=”Baker” ∧ number=221 ∧ fName=”Sherlock”(Detective)
▪ Use the address index and scan result for Sherlocks
If more than one index matches a conjunct
▪ Either use the most selective index, then scan the result,
discarding records that fail to match to the other criteria
▪ Or use all indexes and retrieve the rids
▪ Then take the intersection of the rids and retrieve those records (see the sketch below)
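A sketch of the rid-intersection strategy, with each index modelled as a dict from attribute value to a set of rids and fetch standing in for retrieving a record by rid; the names are hypothetical.

```python
# Probe each matching index for rids, intersect, then fetch only those records
def select_by_rid_intersection(index_probes, fetch):
    rid_sets = [index.get(value, set()) for index, value in index_probes]
    matching_rids = set.intersection(*rid_sets) if rid_sets else set()
    return [fetch(rid) for rid in matching_rids]

# e.g. select_by_rid_intersection([(name_index, "Hammer"), (author_index, "Spillane")], fetch)
```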
Consider the relation and selection shown below
▪ Detective = {id, fName, lName, age, city, street, number, author}
▪ σcity=”New York” ∧ author=”Spillane” ∧ lName=”Hammer”(Detective)
With these indexes
▪ Secondary hash index, {city, street, number} – cannot be used
▪ Secondary B+ tree index, {lName, fName} – can be used
▪ Secondary hash index, {author} – can be used
There are two strategies:
▪ Use the most selective of the two matching indexes, and search the results for the remaining criteria
▪ Use both indexes, and take the intersection of the rids
What if the B+ tree index is primary?
Consider the selections shown below
▪ σ(author=”King” ∨ age>35) ∧ (lName=”Tam” ∨ id=11)(Detective)
▪ σ(author=”King”) ∧ (lName=”Tam” ∨ id=11)(Detective)
Indexes on the relation
▪ Secondary B+ tree index, {lName, fName}
▪ Secondary hash index, {author}
Compare the two selections
▪ In the first selection each conjunct contains a disjunction on an attribute without an index (age, id), so a file scan is required
▪ In the second selection the index on author can be used, and records that don’t meet the other criteria removed
If the smaller relation fits in main memory the costs are identical: B(R) + B(S)
▪ The smaller relation is read once
▪ The larger relation is scanned once to join the records
Otherwise hybrid hash join is more efficient
▪ Block nested loop reads R once, but S once for each clump of R
  ▪ e.g. with B(R) = B(S) = n and 5 clumps, cost = B(R) + 5 * B(S) = 6n
▪ Hybrid hash join reads one partition of R and S once, reads the other partitions twice, and writes them once
  ▪ e.g. retaining 1/5 of each relation, cost = (B(R) + B(S)) / 5 + (B(R) + B(S)) * 12 / 5 = 2n / 5 + 24n / 5 = 26n / 5 = 5.2n
▪ And the records of both R and S belonging to a particular partition are only read once, after the partitioning phase
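A sketch reproducing the figures above, under the assumptions B(R) = B(S) = n, 5 clumps for block nested loop, and 1/5 of each relation retained by hybrid hash join.

```python
# Rough comparison of block nested loop vs. hybrid hash join
def bnl_cost(n, clumps=5):
    return n + clumps * n                               # read R once, S once per clump

def hybrid_cost(n, retained=1 / 5):
    # retained fraction read once; spilled fraction read, written, re-read (3x)
    return 2 * n * retained + 3 * 2 * n * (1 - retained)

n = 1_000
print(bnl_cost(n), hybrid_cost(n))                      # 6000 vs 5200.0, i.e. 6n vs 5.2n
```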