>>
Sorting
• The Sort Operation
• Two-way Merge Sort
• Comparison for Sorting
• Cost of Two-way Merge Sort
• n-Way Merge Sort
• Cost of n-Way Merge Sort
• Sorting in PostgreSQL
COMP9315 21T1 ♢ Sorting ♢ [0/14]
∧ >>
❖ The Sort Operation
Sorting is explicit in queries only in the order by clause
select * from Students order by name;
Sorting is used internally in other operations:
• eliminating duplicate tuples for projection
• ordering files to enhance select efficiency
• implementing various styles of join
• forming tuple groups in group by
Sort methods such as quicksort are designed for in-memory data.
For large data on disks, need external sorts such as merge sort.
COMP9315 21T1 ♢ Sorting ♢ [1/14]
<< ∧ >>
❖ Two-way Merge Sort
Example:

COMP9315 21T1 ♢ Sorting ♢ [2/14]
<< ∧ >>
❖ Two-way Merge Sort (cont)
Requires at least three in-memory buffers:

Assumption: cost of Merge operation on two in-memory buffers ≅ 0.
COMP9315 21T1 ♢ Sorting ♢ [3/14]
<< ∧ >>
❖ Comparison for Sorting
Above assumes that we have a function to compare tuples.
Needs to understand ordering on different data types.
Need a function tupCompare(r1,r2,f) (cf. C’s strcmp)
int tupCompare(r1,r2,f)
{
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
return 0;
}
Assume =, <, > are available for all attribute types.
COMP9315 21T1 ♢ Sorting ♢ [4/14]
<< ∧ >>
❖ Comparison for Sorting (cont)
In reality, need to sort on multiple attributes and ASC/DESC, e.g.
— example multi-attribute sort
select * from Students
order by age desc, year_enrolled
Sketch of multi-attribute sorting function
int tupCompare(r1,r2,criteria)
{
foreach (f,ord) in criteria {
if (ord == ASC) {
if (r1.f < r2.f) return -1;
if (r1.f > r2.f) return 1;
}
else {
if (r1.f > r2.f) return -1;
if (r1.f < r2.f) return 1;
}
}
return 0;
}
COMP9315 21T1 ♢ Sorting ♢ [5/14]
<< ∧ >>
❖ Cost of Two-way Merge Sort
For a file containing b data pages:
• require ceil(log2b) passes to sort,
• each pass requires b page reads, b page writes
Gives total cost: 2.b.ceil(log2b)
Example: Relation with r=105 and c=50 ⇒ b=2000 pages.
Number of passes for sort: ceil(log22000) = 11
Reads/writes entire file 11 times! Can we do better?
COMP9315 21T1 ♢ Sorting ♢ [6/14]
<< ∧ >>
❖ n-Way Merge Sort
Initial pass uses: B total buffers

Reads B pages at a time, sorts in memory, writes out in order
COMP9315 21T1 ♢ Sorting ♢ [7/14]
<< ∧ >>
❖ n-Way Merge Sort (cont)
Merge passes use: n = B-1 input buffers, 1 output buffer

COMP9315 21T1 ♢ Sorting ♢ [8/14]
<< ∧ >>
❖ n-Way Merge Sort (cont)
Method:
// Produce B-page-long runs
for each group of B pages in Rel {
read B pages into memory buffers
sort group in memory
write B pages out to Temp
}
// Merge runs until everything sorted
numberOfRuns = ceil(b/B)
while (numberOfRuns > 1) {
// n-way merge, where n=B-1
for each group of n runs in Temp {
merge into a single run via input buffers
write run to newTemp via output buffer
}
numberOfRuns = ceil(numberOfRuns/n)
Temp = newTemp // swap input/output files
}
COMP9315 21T1 ♢ Sorting ♢ [9/14]
<< ∧ >>
❖ Cost of n-Way Merge Sort
Consider file where b = 4096, B = 16 total buffers:
• pass 0 produces 256 × 16-page sorted runs
• pass 1
◦ performs 15-way merge of groups of 16-page sorted runs
◦ produces 18 × 240-page sorted runs (17 full runs, 1 short run)
• pass 2
◦ performs 15-way merge of groups of 240-page sorted runs
◦ produces 2 × 3600-page sorted runs (1 full run, 1 short run)
• pass 3
◦ performs 15-way merge of groups of 3600-page sorted runs
◦ produces 1 × 4096-page sorted runs
(cf. two-way merge sort which needs 11 passes)
COMP9315 21T1 ♢ Sorting ♢ [10/14]
<< ∧ >>
❖ Cost of n-Way Merge Sort (cont)
Generalising from previous example …
For b data pages and B buffers
• first pass: read/writes b pages, gives b0 = ceil(b/B) runs
• then need ceil(lognb0) passes until sorted, where n = B-1
• each pass reads and writes b pages (i.e. 2.b page accesses)
Cost = 2.b.(1 + ceil(lognb0)), where b0 and n are defined above
COMP9315 21T1 ♢ Sorting ♢ [11/14]
<< ∧ >>
❖ Sorting in PostgreSQL
Sort uses a merge-sort (from Knuth) similar to above:
• backend/utils/sort/tuplesort.c
• include/utils/sortsupport.h
Tuples are mapped to SortTuple structs for sorting:
• containing pointer to tuple and sort key
• no need to reference actual Tuples during sort
• unless multiple attributes used in sort
If all data fits into memory, sort using qsort().
If memory fills while reading, form “runs” and do disk-based sort.
COMP9315 21T1 ♢ Sorting ♢ [12/14]
<< ∧ >>
❖ Sorting in PostgreSQL (cont)
Disk-based sort has phases:
• divide input into sorted runs using HeapSort
• merge using N buffers, one output buffer
• N = as many buffers as workMem allows
Described in terms of “tapes” (“tape” ≅ sorted run)
Implementation of “tapes”: backend/utils/sort/logtape.c
COMP9315 21T1 ♢ Sorting ♢ [13/14]
<< ∧
❖ Sorting in PostgreSQL (cont)
Sorting comparison operators are obtained via catalog (in Type.o):
// gets pointer to function via pg_operator
struct Tuplesortstate { ... SortTupleComparator ... };
// returns negative, zero, positive
ApplySortComparator(Datum datum1, bool isnull1,
Datum datum2, bool isnull2,
SortSupport sort_helper);
Flags in SortSupport indicate: ascending/descending, nulls-first/last.
ApplySortComparator() is PostgreSQL's version of tupCompare()
COMP9315 21T1 ♢ Sorting ♢ [14/14]
Produced: 1 Mar 2021