程序代写代做代考 Objectives:

Objectives:
This tutorial will cover:
INFO20003 Tutorial – Week 8 Solutions
(Tutorial: Query optimisation)
I. Estimate cost of single-relation plans – 20 mins
II. Estimate cost of multi-relation plans – 35 mins
Exercises:
1. Single-relation plans:
Consider a relation with this schema:
Employees (eid: integer, ename: string, sal: integer, title: string, age: integer)
Suppose that the following indexes exist:
• An unclustered hash index on eid
• An unclustered B+ tree index on sal
• An unclustered hash index on age
• A clustered B+ tree index on (age, sal)
The Employees relation contains 10,000 pages and each page contains 20 tuples. Suppose there are 500 index pages for B+ tree indexes and 500 index pages for hash indexes. There are 40 distinct values of age, ranging from 20 to 60, in the relation. Similarly, sal ranges from 0 to 50,000 and there are up to 50,000 distinct values. eid is a candidate key; its value ranges from 1 to 200,000 and there are 200,000 distinct values.
For each of the following selection conditions, compute the Reduction Factor (selectivity) and the cost of the cheapest access path for retrieving all tuples from Employees that satisfy the condition:
a. sal > 20,000
The reduction factor (RF) is
RF= High(I)−value =50,000−20,000=0.6
High(I) − Low(I) 50,000 − 0 There are two possible access paths for this query:
• The unclustered B+ tree index on sal, with cost
Cost = product of RFs of matching selects × (NTuples(R) + NPages(I)) = 0.6 × �(20 × 10,000) + 500�
= 120,300 I/Os
• Full table scan, with cost 10,000 I/Os.
Other indexes are not applicable here. Hence the cheapest access path is the full table scan, with cost 10,000.
INFO20003 Tutorial – Week 8 Solutions 1

b. age = 25
The reduction factor is
40
= 11,000 I/Os
RF= 1 =1 NKeys(I) 40
Since we have two indexes on age, a hash index and a B+ tree index, there are three possible access paths for this query:
• The clustered B+ tree index on (age, sal), with cost
Cost = product of RFs of matching conditions × (NPages(R) + NPages(I))
= 1 × (500 + 10,000) 40
= 263 I/Os approx.
• The unclustered hash index on age, with cost
Cost = product of RFs of matching conditions × hash lookup cost × NTuples(R) = 1 ×2.2×(20×10,000)
For a hash index, the size does not matter as for each tuple the cost is 2.2; 1.2 is for the bucket check and 1 to fetch the page from the disk.
• Full table scan, with cost 10,000 I/Os.
Therefore, the cheapest access path here is to use the B+ tree index with cost 263 (approx.).
Note that the full scan cost is the same as in the previous case.
c. age > 30
The reduction factor is
We cannot use the hash index over a range, thus the only options to consider are the full table scan vs. B+ tree index. There are two possible access paths for this query:
• The clustered B+ tree index on (age, sal), with cost
Cost = product of RFs of matching conditions × (NPages(R) + NPages(I)) = 0.75 × (500 + 10,000)
= 7875 I/Os
• Full table scan, with cost 10,000 I/Os.
Therefore, the clustered B+ tree index with cost 7875 is the cheapest access path here.
RF= High(I)−value =60−30=0.75 High(I) − Low(I) 60 − 20
INFO20003 Tutorial – Week 8 Solutions 2

d. eid = 1000
As stated earlier, eid is a candidate key. Therefore, we can expect one record per eid. We can use the primary index (hash index on eid) to achieve a lookup cost of roughly
Cost = hash lookup cost + 1 data page access = 1.2 + 1 = 2.2 This is obviously cheaper than the full table scan (cost 10,000).
e. sal>20,000∧age>30
There are two selection conditions joined with “and”. We calculate the RF for each condition:
RF = High(I)−value =60−30=0.75 age High(I) − Low(I) 60 − 20
RF = High(I) − value = 50,000 − 20,000 = 0.6 sal High(I) − Low(I) 50,000 − 0
The selection condition is the same as age > 30 ∧ sal > 20,000. We can use the clustered B+ tree index, but unlike part c, the RF will be product of the RF for the two conditions, since both are applicable.
Alternatively, we can use the unclustered B+ tree on sal and filter age on-the-fly afterwards. For this access path, the age condition does not match the index, so only the RF on sal will be used.
There are three possible access paths for this query:
• The unclustered B+ tree index on sal, with cost
Cost = product of RFs of matching conditions × (NTuples(R) + NPages(I)) = 0.6 × �(20 × 10,000) + 500�
= 120,300 I/Os (same as part a)
• The clustered B+ tree index on (age, sal), with cost
Cost = product of RFs of matching conditions × (NPages(R) + NPages(I)) = 0.75 × 0.6 × (10,000 + 500)
= 4725 I/Os
• Full table scan, with cost 10,000 I/Os.
Thus the clustered B+ tree index on (age, sal), cost 4725, is the cheapest option here.
INFO20003 Tutorial – Week 8 Solutions 3

2. Multi-relation plans:
Consider the following schema:
FK Emp (eid, sal, age, did )
FK
Dept (did, projid, budget, status)
Proj (projid, code, report)
The number of tuples in Emp is 20,000 and each page can hold 20 records. The Dept relation has 5000 tuples and each page contains 40 records. There are 500 distinct dids in Dept. One page can fit 100 resulting tuples of Dept JOIN Emp. Similarly, Proj has 1000 tuples and each page can contain 10 tuples. Assuming that projid is the candidate key of Proj, there can be 1000 unique values for projid. Sort-Merge Join can be done in 2 passes. Let’s assume that, if we join Proj with Dept, 50 resulting tuples will fit on a page. NLJ in this question means ‘Page oriented NLJ’.
Consider the following query:
SELECT E.eid, D.did, P.projid
FROM Emp AS E, Dept AS D, Proj AS P WHERE E.did = D.did
AND D.projid = P.projid;
For this query, estimate the cost of the following plans, focusing on the join order and join types:
⋈⋈
NLJ
Dept
This left-deep plan is joining Dept with Emp using Nested Loop Join and then joining the results with Proj also using Nested Loop Join. The cost analysis is shown below:
Proj
a.
NLJ
Number of resulting tuples for Dept JOIN Emp
= 1 × NTuples(Dept) × NTuples(Emp)
Emp
NKeys(I)
= 1 × 5000 × 20,000
500
= 200,000 tuples
Number of pages for Dept JOIN Emp = 200‚000 = 2000 pages 100
Cost of scanning Dept = 125 I/O
Cost to join with Emp = NPages(Dept) × NPages(Emp) = 125 × 1000 = 125,000 I/O
Cost to join with Proj = NPages(Dept JOIN Emp) × NPages(Proj) = 2000 × 100 = 200,000 I/O
Total cost = 125 + 125,000 + 200,000 = 325,125 I/O
INFO20003 Tutorial – Week 8 Solutions 4

H⋈J
Number of resulting tuples for Dept JOIN Emp
This left-deep plan is joining Dept with Emp using Nested Loop Join and then joining the results with Proj using Hash Join. The cost analysis is shown below:
Dept
⋈ ⋈
SMJ
Dept
b.
N⋈LJ
NKeys(I)
Proj
= 1 × NTuples(Dept) × NTuples(Emp)
Emp
=
= 200,000 tuples
1 × 5000 × 20,000 500
c.
HJ
Number of pages for Dept JOIN Emp = 200‚000 = 2000 pages 100
Cost of scanning Dept = 125 I/O
Cost to join with Emp = NPages(Dept) × NPages(Emp) = 125 × 1000 = 125,000 I/O
Cost to join with Proj = 2 × NPages(Dept JOIN Emp) + 3 × NPages(Proj) = 2 × 2000 + 3 × 100 = 4300 I/O
Total cost = 125 + 125,000 + 4300 = 129,425 I/O
This left-deep plan is joining Dept with Emp using Sort-Merge Join and then
joining the results with Proj using Hash Join. The number of passes of Sort-Merge Join is 2. The cost analysis is shown below:
Proj
Number of resulting tuples for Dept JOIN Emp
= 1 × NTuples(Dept) × NTuples(Emp)
Emp
NKeys(I)
= 1 × 5000 × 20,000
500
= 200,000 tuples
Number of pages for Dept JOIN Emp = 200‚000 = 2000 pages 100
Cost of sorting Dept = 2 × NPasses × NPages(Dept) = 2 × 2 × 125 = 500 I/O
Cost of sorting Emp = 2 × NPasses × NPages(Emp) = 2 × 2 × 1000 = 4000 I/O
Cost of joining sorted Dept and Emp = NPages(Dept) + NPages(Emp) = 125 + 1000 = 1125 I/O
Total cost of SMJ between Dept and Emp = 500 + 4000 + 1125 = 5625 I/O
Cost to join with Proj = 2 × NPages(Dept JOIN Emp) + 3 × NPages(Proj) = 2 × 2000 + 3 × 100 = 4300 I/O
Total cost = 5625 + 4300 = 9925 I/O
INFO20003 Tutorial – Week 8 Solutions 5

d.
This left-deep plan is joining Proj with Dept using Sort-Merge Join (with 2 passes) and then joining the results with Emp using Hash Join. The cost analysis is shown below:
Emp
H⋈J
S⋈MJ
= 1 × NTuples(Proj) × NTuples(Dept)
Proj
Dept
Number of resulting tuples for Proj JOIN Dept NKeys(I)
= 1 ×1000×5000 1000
= 5000 tuples
Number of pages for Proj JOIN Dept = 5000 = 100 pages 50
Cost of sorting Proj = 2 × NPasses × NPages(Proj) = 2 × 2 × 100 = 400 I/O
Cost of sorting Dept = 2 × NPasses × NPages(Dept) = 2 × 2 × 125 = 500 I/O
Cost of joining sorted Proj and Dept = NPages(Proj) + NPages(Dept) = 100 + 125 = 225 I/O
Total cost of SMJ between Proj and Dept = 400 + 500 + 225 = 1125 I/O
Cost to join with Emp = 2 × NPages(Proj JOIN Dept) + 3 × NPages(Emp) = 2 × 100 + 3 × 1000 = 3200 I/O
Total cost = 1125 + 3200 = 4325 I/O Take Home Questions:
1. Multi Relation Plans with Access Methods:
Consider the following schema:
Student (studentid, name, dob, degreename) StudentSubject (studentid, subjectid, grade)
Subject (subjectid, name, level, coordinatorname, budget)
The number of tuples in Student is 20,000 and each page can hold 20 records. The StudentSubject relation has 50,000 tuples and each page contains 50 records. Subject has 1,000 tuples and each page can contain 10 records. One page can fit 100 resulting tuples of Student JOIN StudentSubject. 100 tuples resulting from the join of StudentSubject and Subject also fit onto a page. Assume that Subject.subjectid and Student.studentid are candidate keys. Sorting can be done in 2 passes.
There are 3 available indexes: an unclustered hash index on Student(degreename), an unclustered B+ tree index on Subject(level), and a clustered B+ tree on StudentSubject(subjectid,studentid). All indexes have 50 pages.
INFO20003 Tutorial – Week 8 Solutions 6

There are 10 distinct values for Subject.level, ranging from 1-10. There are known to be 40 distinct values for Student.degreename.
Consider the following query:
SELECT Stu.studentid, Sub.subjectid
FROM Student AS Stu, Subject AS Sub, StudentSubject AS SS WHERE Stu.studentid = SS.studentid
AND SS.subjectid = Sub.subjectid AND Stu.degreename = ‘CompSci’ AND Sub.level > 9
For this query, estimate the cost of the following plans. If selections are not marked on the tree, assume that they are done on the fly (in memory) after having completed all joins.
H⋈J
S⋈MJ
(heapScan)
Sub SS
(heapScan) (heapScan)
Stu
a.

𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
Sub
(indexScan)
HJ

𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒅𝒅𝒍𝒍𝒅𝒅𝒅𝒅𝒍𝒍𝒍𝒍𝒅𝒅𝒅𝒅𝒅𝒅𝒍𝒍 =”𝒄𝒄𝒄𝒄𝒅𝒅𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄”
SS Stu
(indexScan) (indexScan)
SMJ
b.
a. showing implicit Selections
This first plan is using only HeapScan for access, and selections are only performed on the fly (after all joins completed, see diagram to the left with the implicit selections)
𝝈𝝈 𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍⋈
𝒍𝒍𝒍𝒍>𝟗𝟗 ∧ 𝑺𝑺𝑺𝑺𝑺𝑺.𝒅𝒅𝒍𝒍𝒅𝒅𝒅𝒅𝒍𝒍𝒍𝒍𝒅𝒅𝒅𝒅𝒅𝒅𝒍𝒍
Size of result for (Sub ⋈ SS):
= 1 × NTuples(Sub) × NTuples(SS)
=”𝒄𝒄𝒄𝒄𝒅𝒅𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄”
HJ
NKeys(subjectid)
S⋈MJ
(heapScan)
= 1/1000 × 1000 × 50,000 = 50,000 tuples
= 50,000/100 = 500 pages
Cost of sorting Sub = 2 × NPasses × NPages(Sub) = 2 × 2 × 100 = 400 I/O
Cost of sorting SS = 2 × NPasses × NPages(SS) = 2 × 2 × 1000 = 4000 I/O
[NOTE that for this subject, we consider that ‘heapscan’ is not ‘aware’/does not make use of the underlying sort of this structure, even though it is sorted because a clustered B+ index exists]
Stu Sub SS
(heapScan)
(heapScan)
INFO20003 Tutorial – Week 8 Solutions 7

Cost of joining sorted Sub and SS = NPages(Sub) + NPages(SS) = 100 + 1000 = 1100 I/O
Total cost of SMJ between Sub and SS = 400 + 4000 + 1100 = 5500 I/O
Cost of HJ join with Stu:
= 2 × NPages(Sub ⋈ SS) + 3 × NPages(Stu) [due to pipelining] = 2× 500 + 3× 1000 = 4,000 I/O
Total cost = 5500+ 4,000 = 9,500 I/O
H⋈J
S⋈MJ 𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
Sub
(indexScan)
b.
This second plan uses various indexes to access the data instead of heapscan, and is also performing selections along the way.
𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒅𝒅𝒍𝒍𝒅𝒅𝒅𝒅𝒍𝒍𝒍𝒍𝒅𝒅𝒅𝒅𝒅𝒅𝒍𝒍 =”𝒄𝒄𝒄𝒄𝒅𝒅𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄”
SS Stu (indexScan) (indexScan)
First, we calculate the RFs for selections
• Condition (Sub.level > 9) gives a RF of 1/10
• Condition (Stu.degreename = “compsci”) gives a RF of 1/40
Next, let’s do the calculations involving the selections.

First, selection for ‘Sub.level > 9’ using unclustered B+ tree o Numberoftuplesselected
 Ntuples(Sub) * RFs
 = 1000*1/10
 = 100 tuples = 10 pages
o Costofselection
 (Npages(index) + Ntuples(Sub))*RFs  =(50+1000)*1/10
 = 105 I/Os
Next, selection for ‘Stu.degreename = ‘compsci’’ using unclustered hash

indeox Number of resulting tuples  Ntuples(Stu) * RFs
 = 20,000*1/40
 = 500 tuples = 25 pages o Costofselection
 2.2*Ntuples(Stu)*RFs
 = 2.2*20,000*1/40 = 1100 I/Os
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
Now, let’s consider the child join: (𝝈𝝈
(𝑺𝑺𝑺𝑺𝑺𝑺) ⋈ SS)
(𝑺𝑺𝑺𝑺𝑺𝑺))×
8
• First, calculate the size of the intermediate relation
o =
1 MAX(NKeys(subjectid))
×NTuples(𝝈𝝈 𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
NTuples(SS)
o =1/1000×100×50,000 o =5,000tuples
o =5,000/100=50pages
• Now cost out the join
INFO20003 Tutorial – Week 8 Solutions

o We’reaccessingSSusinganIndexScaninsteadofaHeapScan, so we need to calculate the cost of first access. Note RF =1 since getting everything out!
 = (Npages(index) + Npages(SS))*RFs  = (50+1000)*1
 = 1050 I/Os
o Now consider cost of sorting (𝝈𝝈 (𝑺𝑺𝑺𝑺𝑺𝑺) ) for SMJ
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
 = 2 × NPasses × NPages(𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗 (𝑺𝑺𝑺𝑺𝑺𝑺)) –
NPages(𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗 (𝑺𝑺𝑺𝑺𝑺𝑺) ) [subtract Npages because we replaced the first readin with the Index scan of Sub, so we pipeline from IndexScan to sorting]
 =2×2×10-10=30I/O o CostofsortingSS
 = 0 [sorted from the IndexScan already, since index is
sorted!]
o Costofmergingsorted(𝝈𝝈
(𝑺𝑺𝑺𝑺𝑺𝑺))andSS
 = NPages(𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗 (𝑺𝑺𝑺𝑺𝑺𝑺)) + NPages(SS) –
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
Npages(SS) [due to pipelining of SS from IndexScan]  =10+1000-1000=10I/O
o TotalcostofSMJbetween(𝝈𝝈 (𝑺𝑺𝑺𝑺𝑺𝑺))andSS
 = 30+0+10=40I/O
Great, lets now think about the parent join: ((𝝈𝝈 (𝑺𝑺𝑺𝑺𝑺𝑺) ⋈ SS) ⋈
𝝈𝝈𝑺𝑺𝑺𝑺𝑺𝑺.𝒅𝒅𝒍𝒍𝒅𝒅𝒅𝒅𝒍𝒍𝒍𝒍𝒅𝒅𝒅𝒅𝒅𝒅𝒍𝒍 (𝑺𝑺𝑺𝑺𝑺𝑺) ) =”𝒄𝒄𝒄𝒄𝒅𝒅𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄”
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
• Cost of the upper join:
o = 2× NPages((𝝈𝝈 ) ⋈ SS) +2 × NPages(𝝈𝝈 )
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗
𝑺𝑺𝑺𝑺𝑺𝑺.𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍𝒍>𝟗𝟗 𝑺𝑺𝑺𝑺𝑺𝑺.𝒅𝒅𝒍𝒍𝒅𝒅𝒅𝒅𝒍𝒍𝒍𝒍𝒅𝒅𝒅𝒅𝒅𝒅𝒍𝒍
[due to pipelining, and replacing access of Student with index scan
cost instead of heapscan] o =2×50+2×25
o =150I/O
So, the total cost is then:
• =1100+105+1050+40+150 • = 2445 I/O
INFO20003 Tutorial – Week 8 Solutions
9
=”𝒄𝒄𝒄𝒄𝒅𝒅𝒄𝒄𝒄𝒄𝒄𝒄𝒄𝒄”