CS5481 Data Engineering Tutorial 6
1. Suppose you need to sort a relation of 40 gigabytes, with 4 kilobyte blocks, using a memory size of 40 megabytes. Assume that tS = 5ms and tT = 0.1ms. Find the cost of sorting the relation, in seconds, with bb = 1, 9, 30, 300, and 900, where bb is the number of buffer blocks allocated to each input run and output run.
The number of blocks in the main memory buffer available for sorting (M) is (40×106) / (4×103) = 104.
The number of blocks containing records of the given relation (br) is (40×109) / (4×103) = 107.
The initial number of runs is br / M = 107/104 =1,000. Here, tS = 5×10−3 seconds and tT = 10−4 seconds.
For bb = 1
The merge can be done in one pass.
The number of block transfers is 107×3 = 30×106.
The number of disk seeks is 2×1000 + 107 = 10,002,000. Therefore, the cost of sorting the relation is:
10,002,000 × (5×10−3) + (30×106) × 10−4 = 53,010 seconds.
For bb = 9
The merge can be done in one pass.
The number of block transfers is 107×3 = 30×106.
The number of disk seeks is 2×1000 + 107/9 = 1,113,112. Therefore, the cost of sorting the relation is:
1,113,112 × (5×10−3) + (30×106) × 10−4 = 8,566 seconds.
For bb = 30
In each pass, the number of runs decreases by a factor of M / bb − 1= (104/30)-1 = 332. The number of merge passes required is log3321000 = 2.
The number of block transfers is 107×5 = 50×106
The number of disk seeks is 2×1000 + 107/30 ×3 = 1,002,002.
Therefore, the cost of sorting the relation is:
1,002,002 × (5×10−3) + (50×106) × 10−4 = 10,010 seconds.
For bb = 300
In each pass, the number of runs decreases by a factor of (104)/300 -1= 32. The number of merge passes required is log321000 = 2.
The number of block transfers is 107×5 = 50×106
The number of disk seeks is 2×1000 + 107/300 ×3 = 102,002.
Therefore, the cost of sorting the relation is:
102,002 × (5×10−3) + (50×106) × 10−4 = 5,510 seconds.
For bb = 900
In each pass, the number of runs decreases by a factor of (104)/900 -1= 10. The number of merge passes required is log101000 = 3.
The number of block transfers is 107×7 = 70×106
The number of disk seeks is 2×1000 + 107/900 ×5 = 57,560.
Therefore, the cost of sorting the relation is:
57,560 × (5×10−3) + (70×106) × 10−4 = 7,288 seconds.
CS5481
Data Engineering Tutorial 6
2.
(a) (b) (c)
(d) (a)
Apply merge-join to the following numerical example used in the lecture. nstudent =5,000 bstudent =100
ntakes =10,000 btakes =400
What will be the cost if the relations are not sorted and the memory size is still 3 blocks?
What will be the cost if the memory size is increased to 25 blocks and the relations are not sorted?
In order to reduce the number of seeks (without increasing the number of block transfers), what is the number of buffer blocks (i.e., bb) that should be allocated
(i) to each run and the output in the merge step of sorting and,
(ii) for buffering each relation and the output in the join step?
What will be the cost of part (c)?
The initial numbers of runs of student and takes are ⌈(100/3)⌉= 34 and ⌈400/3⌉=134, respectively.
The numbers of merge passes required for student and takes are ⌈log234⌉=6 and ⌈log2134⌉=8, respectively.
The number of block transfers in sorting the two relations inclusive of the output = 100(2*6+ 2) + 400(2*8 + 2)
= 1,400 + 7,200
= 8,600 block transfers.
The total number of block transfers
= 8,600 + 100 + 400
= 9,100 block transfers.
The number of seeks in sorting the two relations inclusive of the output = 34*2+100*(2*6) + 134*2+400*(2*8)
= 7,936
The total number of seeks = 7,936+100+400
= 8,436
The initial numbers of runs of student and takes are ⌈100/25⌉= 4 and ⌈400/25⌉=16, respectively.
The numbers of merge passes required for student and takes are both 1.
The number of block transfers in sorting the two relations inclusive of the output = 100(2 + 2) + 400(2 + 2)
= 400 + 1,600
= 2,000 block transfers.
The total number of block transfers
= 2,000 + 100 + 400
= 2,500 block transfers.
The number of seeks in sorting the two relations inclusive of the output = 4*2+100*2+16*2+400*2
= 1,040
The total number of seeks
= 1,040+100+400
= 1,540
(b)
CS5481
Data Engineering Tutorial 6
(c)
(i)
(ii)
• •
To reduce the number of seeks in the merge step of sorting, 5 buffer blocks are allocated to each run and the output for merging the 4 runs of student.
For takes, to keep one merge pass, the number of buffer blocks cannot be increased.
To reduce the number of seeks in the merge-join step, 8 blocks are allocated for buffering each relation and the output.
(d) The number of seeks in sorting the two relations inclusive of the output = 4*2 + ⌈ (100/5) ⌉*2 + 16*2 + 400*2
= 880
The total number of seeks
= 880 + ⌈ (100/8) ⌉ + ⌈ (400/8) ⌉ = 943
CS5481 Data Engineering Tutorial 6
3. Consider two relations r1(A, B, C) and r2(C, D, E). For r1, it has 20,000 tuples and 25 tuples fit on one block. For r2, it has 45,000 tuples and 30 tuples fit on one block. Which join algorithm gives the lowest worst-case cost estimate for r1 r2? Consider only number of block transfers in this exercise. Specify the minimum amount of memory in number of blocks for the worst-case estimates.
Given: br1 = 20,000/25 = 800, br2 = 45,000/30 = 1,500
For nested-loop join:
• Using r1 as the outer relation, we need 20000 1500 + 800 = 30,000,800 block
transfers.
• If r2 is the outer relation we need 45000 800 + 1500 = 36,001,500 block transfers.
• Memory: 1 block for each input and 1 block for output.
For block nested-loop join:
• If r1 is the outer relation, we need 800 1500 + 800 = 1,200,800 block transfers.
• If r2 is the outer relation we need 1500 800 + 1500 = 1,201,500 block transfers.
• Memory: 1 block for each input and 1 block for output.
For merge-join:
• Memory: 3 blocks
• Assuming that r1 and r2 are not initially sorted on the join key, the total sorting cost
inclusive of the output is Bs
= 1500(2⌈log2(1500/3)⌉+ 2) + 800(2⌈log2(800/3)⌉ + 2)
= 30,000 + 16,000
= 46,000 block transfers.
• The total cost is
Bs + 1500 + 800
= 48,300 block transfers.
For hash-join:
(a) We assume that memory is big enough for every partition of the build relation. Since
r1 is smaller, we use it as the build relation and r2 as the probe relation.
(b) Memory:
M > n (for partitioning) and M > 800/n + 1 (for build and probe),
M = 30 blocks.
(c) The cost is
3(1500+800)
= 6,900 block transfers.