Sorting
>>
Sorting
The Sort Operation
Two-way Merge Sort
Comparison for Sorting
Cost of Two-way Merge Sort
n-Way Merge Sort
Cost of n-Way Merge Sort
Sorting in PostgreSQL
COMP9315 21T1 ♢ Sorting ♢ [0/14]
∧ >>
❖ The Sort Operation
Sorting is explicit in queries only in the order by clause
select * from Students order by name;
Sorting is used internally in other operations:
eliminating duplicate tuples for projection
ordering files to enhance select efficiency
implementing various styles of join
forming tuple groups in group by
Sort methods such as quicksort are designed for in-memory data.
For large data on disks, need external sorts such as merge sort.
COMP9315 21T1 ♢ Sorting ♢ [1/14]
<< ∧ >>
❖ Two-way Merge Sort
Example:
COMP9315 21T1 ♢ Sorting ♢ [2/14]
<< ∧ >>
❖ Two-way Merge Sort (cont)
Requires at least three in-memory buffers:
Assumption: cost of Merge operation on two in-memory buffers ≅ 0.
COMP9315 21T1 ♢ Sorting ♢ [3/14]
<< ∧ >>
❖ Comparison for Sorting
Above assumes that we have a function to compare tuples.
Needs to understand ordering on different data types.
Need a function tupCompare(r1,r2,f) (cf. C’s strcmp)
int tupCompare(r1,r2,f)
{
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
return 0;
}
Assume =, <, > are available for all attribute types.
COMP9315 21T1 ♢ Sorting ♢ [4/14]
<< ∧ >>
❖ Comparison for Sorting (cont)
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
— example multi-attribute sort
select * from Students
order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria)
{
foreach (f,ord) in criteria {
if (ord == ASC) {
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
}
else {
if (r1.f > r2.f) return -1;
if (r1.f < r2.f) return 1;
}
}
return 0;
}
COMP9315 21T1 ♢ Sorting ♢ [5/14]
<< ∧ >>
❖ Cost of Two-way Merge Sort
For a file containing b data pages:
require ceil(log2b) passes to sort,
each pass requires b page reads, b page writes
Gives total cost: 2.b.ceil(log2b)
Example: Relation with r=105 and c=50
⇒ b=2000 pages.
Number of passes for sort: ceil(log22000) = 11
Reads/writes entire file 11 times! Can we do better?
COMP9315 21T1 ♢ Sorting ♢ [6/14]
<< ∧ >>
❖ n-Way Merge Sort
Initial pass uses: B total buffers
Reads B pages at a time, sorts in memory, writes out in order
COMP9315 21T1 ♢ Sorting ♢ [7/14]
<< ∧ >>
❖ n-Way Merge Sort (cont)
Merge passes use:
n = B-1 input buffers,
1 output buffer
COMP9315 21T1 ♢ Sorting ♢ [8/14]
<< ∧ >>
❖ n-Way Merge Sort (cont)
Method:
// Produce B-page-long runs
for each group of B pages in Rel {
read B pages into memory buffers
sort group in memory
write B pages out to Temp
}
// Merge runs until everything sorted
numberOfRuns = ceil(b/B)
while (numberOfRuns > 1) {
// n-way merge, where n=B-1
for each group of n runs in Temp {
merge into a single run via input buffers
write run to newTemp via output buffer
}
numberOfRuns = ceil(numberOfRuns/n)
Temp = newTemp // swap input/output files
}
COMP9315 21T1 ♢ Sorting ♢ [9/14]
<< ∧ >>
❖ Cost of n-Way Merge Sort
Consider file where b = 4096, B = 16 total buffers:
pass 0 produces 256 × 16-page sorted runs
pass 1
performs 15-way merge of groups of 16-page sorted runs
produces 18 × 240-page sorted runs (17 full runs, 1 short run)
pass 2
performs 15-way merge of groups of 240-page sorted runs
produces 2 × 3600-page sorted runs (1 full run, 1 short run)
pass 3
performs 15-way merge of groups of 3600-page sorted runs
produces 1 × 4096-page sorted runs
(cf. two-way merge sort which needs 11 passes)
COMP9315 21T1 ♢ Sorting ♢ [10/14]
<< ∧ >>
❖ Cost of n-Way Merge Sort (cont)
Generalising from previous example …
For b data pages and B buffers
first pass: read/writes b pages, gives b0 = ceil(b/B) runs
then need ceil(lognb0) passes until sorted, where n = B-1
each pass reads and writes b pages (i.e. 2.b page accesses)
Cost = 2.b.(1 + ceil(lognb0)), where b0 and n are defined above
COMP9315 21T1 ♢ Sorting ♢ [11/14]
<< ∧ >>
❖ Sorting in PostgreSQL
Sort uses a merge-sort (from Knuth) similar to above:
backend/utils/sort/tuplesort.c
include/utils/sortsupport.h
Tuples are mapped to SortTuple structs for sorting:
containing pointer to tuple and sort key
no need to reference actual Tuples during sort
unless multiple attributes used in sort
If all data fits into memory, sort using qsort().
If memory fills while reading, form “runs” and do disk-based sort.
COMP9315 21T1 ♢ Sorting ♢ [12/14]
<< ∧ >>
❖ Sorting in PostgreSQL (cont)
Disk-based sort has phases:
divide input into sorted runs using HeapSort
merge using N buffers, one output buffer
N = as many buffers as workMem allows
Described in terms of “tapes” (“tape” ≅ sorted run)
Implementation of “tapes”: backend/utils/sort/logtape.c
COMP9315 21T1 ♢ Sorting ♢ [13/14]
<< ∧ ❖ Sorting in PostgreSQL (cont) Sorting comparison operators are obtained via catalog (in Type.o): // gets pointer to function via pg_operator struct Tuplesortstate { ... SortTupleComparator ... }; // returns negative, zero, positive ApplySortComparator(Datum datum1, bool isnull1, Datum datum2, bool isnull2, SortSupport sort_helper); Flags in SortSupport indicate: ascending/descending, nulls-first/last. ApplySortComparator() is PostgreSQL's version of tupCompare() COMP9315 21T1 ♢ Sorting ♢ [14/14] Produced: 1 Mar 2021