The University of British Columbia
Computer Science 404: Database Systems Midterm #3, March 11, 2015. Instructor: E.M. Knorr
Time: 48 minutes. Only a simple, non-programmable, non-communicating calculator is permitted. Closed book. No help sheets. No cell phones, smartphones, or other devices.
Name (PRINT): _______________________ (Last) _______________________ (First)    Student No: _______________

Signature: _______________________
The examination has 7 pages, but that includes this cover sheet. Check that you have a complete paper.
Print your name and ID at the top of this page, and provide your signature. Have your student ID ready.
You do not need to print your name on any pages other than the cover page.
A simple, non-programmable, non-communicating calculator is permitted. No other aids are allowed.
Work quickly and do the easy questions first. Part marks are available.
The marks for each question are given in braces. Do not spend too much time on any one question.
To minimize disruptions during the exam, please avoid asking the invigilators for help, tips, or explanations. Please write down any reasonable assumptions that you are making, if you believe that a question is ambiguous.
Marks

Question    Max.    Achieved
   1          4
   2          4
   3          6
   4          8
   5          6
   6          4
   7          8
 Total       40
Question 1 is a multiple-choice question {4 marks: you get 3 if you get only 1 part wrong, 2 if you get 2 wrong, and 0 otherwise}. Circle all TRUE answers. There may be as few as 0 answers to circle for a given numbered question, and as many as 5 to circle, so be sure to read all parts of each question. In case of ambiguity, please write down any reasonable assumptions that you need to make.
1. {4 marks} Consider the following SQL statements that query the Sailors table from class. Suppose we have only 3 indexes (all are Alternative 2) on the Sailors table. These indexes are: (a) a B+ tree index on the 10 different rating values, (b) a hash index on the primary key sid, and (c) a B+ tree index on the sailor's first name, which is the field called sname. Assume we have 500 data pages for Sailors, and there are 80 records per page, so there are 40,000 records in all. There are other fields in the data table, besides those listed in the indexes above.
Which queries below can be answered using only an index from the ones shown above (i.e., without ever visiting the data table)? You don’t have to specify which index.
a) SELECT age
   FROM Sailors
   WHERE sid = 15000;

b) SELECT count(*)
   FROM Sailors
   WHERE sname = 'Angela';

c) SELECT count(*)
   FROM Sailors;

d) SELECT *
   FROM Sailors
   WHERE sid = 15690 OR sid = 2250;

e) SELECT count(*)
   FROM Sailors
   WHERE rating >= 10;
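Note (illustration only, not part of the question): a query can be answered from an Alternative 2 index alone when every column it touches is part of that index's search key, since the data entries hold only the key and an rid. The Python sketch below of that covering check is hypothetical; the index and column names come from the question above.

# Hypothetical covering-index check; the column sets come from the question.
indexes = {
    "btree_rating": {"rating"},
    "hash_sid":     {"sid"},
    "btree_sname":  {"sname"},
}

def index_only(select_cols, where_cols):
    # True if some index's key columns cover every column the query touches.
    # COUNT(*) contributes no SELECT columns of its own.
    needed = set(select_cols) | set(where_cols)
    return any(needed <= key_cols for key_cols in indexes.values())

print(index_only(select_cols=[], where_cols=["sname"]))     # query (b): covered by the sname index
print(index_only(select_cols=["age"], where_cols=["sid"]))  # query (a): age is not in any index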
2. {4 marks} Suppose you have B = 35 buffer pages of size 4K each, and you plan to use external mergesort to sort a file. You want to complete the sort in 3 passes. Let's call it 3-phase multiway mergesort (3PMMS). What is the largest file, measured in terms of the number of 4K pages, that you can sort in 3 phases (passes)? Show your work.
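Note (illustration only): in the usual class-style analysis, pass 1 produces sorted runs of B pages each and every later pass merges up to B - 1 runs, so a P-pass sort handles at most about B * (B - 1)^(P - 1) pages. A minimal Python sketch of that formula, assuming this (B - 1)-way merge convention (your class's 3PMMS convention may differ slightly):

def max_sortable_pages(B, passes):
    # Largest file (in pages) sortable in `passes` passes with B buffer pages,
    # assuming pass 1 builds runs of B pages and each later pass is a (B-1)-way merge.
    return B * (B - 1) ** (passes - 1)

print(max_sortable_pages(35, 3))   # capacity for B = 35 buffer pages and 3 passes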
3. {6 marks} Suppose we start with the following Linear Hashing index. As usual, assume that a split occurs every time we try to enter a new index entry into a bucket that is already full (and has no spare space on an overflow page, if any), just like the examples in class. For example, if the main bucket is full but there is already an overflow page with room for another entry, then we don't split; but if the main bucket is full and the overflow pages are also full, then we trigger a split. (In case it matters, let us assume that we don't do more than one split for each newly inserted key.)
There is a maximum of 3 entries per bucket. We'll hash on the last n bits of the key value (i.e., key mod 2^n, the same hash function as in class).
Insert the following 4 keys into the linear hashing index in this order: 10, 18, 17, 64.
Show what the linear hash structure looks like after inserting the above 4 keys, by drawing the final state of the linear hashing index.
Level 0, next = 1 (next points to bucket 01)

Bucket 000:  0, 16, 32
Bucket 01:   1, 5
Bucket 10:   2, 6
Bucket 11:   3, 7
Bucket 100:  4, 12
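Note (illustration only, not required for your answer): the following Python sketch mimics the insertion/split rule described above, starting from the drawn state (5 buckets, next = 1, level 0, and N0 = 4 buckets at the start of the round). All names are made up for this sketch, and overflow pages are assumed to hold 3 entries each, like primary pages.

CAP = 3                       # entries per page (primary or overflow; assumed the same)
N0  = 4                       # number of buckets at the start of the current round

# Each bucket is a list of pages; each page is a list of keys (from the drawing above).
buckets = [[[0, 16, 32]],     # bucket 000
           [[1, 5]],          # bucket 01
           [[2, 6]],          # bucket 10
           [[3, 7]],          # bucket 11
           [[4, 12]]]         # bucket 100 (split image of bucket 00)
level, nxt = 0, 1

def bucket_of(key):
    b = key % (N0 * 2 ** level)              # h_level: last n bits of the key
    if b < nxt:                              # this bucket was already split this round,
        b = key % (N0 * 2 ** (level + 1))    # so use h_{level+1} instead
    return b

def insert(key):
    global level, nxt
    pages = buckets[bucket_of(key)]
    must_split = all(len(p) == CAP for p in pages)   # primary AND overflow pages all full
    if must_split:
        pages.append([])                     # new overflow page for the incoming key
    next(p for p in pages if len(p) < CAP).append(key)
    if must_split:                           # split bucket `nxt` (not necessarily this bucket)
        old = [k for p in buckets[nxt] for k in p]
        buckets.append([[]])                 # new image bucket at position N0*2^level + nxt
        buckets[nxt] = [[]]
        nxt += 1
        for k in old:                        # redistribute the split bucket with h_{level+1}
            pages2 = buckets[k % (N0 * 2 ** (level + 1))]
            if len(pages2[-1]) == CAP:
                pages2.append([])
            pages2[-1].append(k)
        if nxt == N0 * 2 ** level:           # finished a round of splits
            level, nxt = level + 1, 0

for k in (10, 18, 17, 64):
    insert(k)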
4. {8 marks} Note: There may be more information here than you need for this next set of questions. Consider the following Admissions table for a hospital or outpatient clinic. As usual, assume 4K = 4096 bytes for the page size; a tuple cannot span pages. All ID fields are integers (4 bytes), and dates and times are both stored in integer format (e.g., 20150311 is March 11, 2015, and 163045 is 4:30:45 PM).

Admissions (admID, admDate, admTime, admDoctorID, patientID, reasonCode, …)

This table contains 3 years of patient admission data. For example, admID 15399 might be a patient coming in for a cataract treatment. The reasonCode is a 5-byte text field. admID is the primary key.
There is no patient name in this table, and there is no index on patientID in this table.
The Admissions table has 70 rows per page. There are 105,000 rows in total in the table.
Suppose there is an Alternative 2 B+ tree clustering index on the Admissions table on the composite key: admDate, reasonCode, and admDoctorID (i.e., we have a composite index of 3 fields). All rid and child pointers are 10 bytes long, and as usual, we’ll ignore other pointers and overhead bytes.
a) {5 marks} Compute the number of leaf and internal pages in this clustering index. In other words, compute the number of index pages at each level. Show your work.
b) {3 marks} Estimate the cost in page I/Os for creating a report that lists all the admission dates (admDate) and reason codes (reasonCode). In other words, there are just these 2 fields together on each line of output. Don’t worry about the row order. Don’t write the SQL.
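Note (illustration only): one common way to set up the arithmetic for part (a) is to compute the entry size at each level, divide it into the 4K page size to get a fanout, and then divide the row count by that fanout level by level until a single (root) page remains. The Python sketch below assumes packed pages, ignores per-page overhead (as the question instructs), and treats internal entries as the same size as leaf entries (a key plus a 10-byte pointer); the 13-byte key comes from the question (admDate 4 + reasonCode 5 + admDoctorID 4). The function name is made up, and your class convention for internal-node fanout may differ slightly.

import math

def pages_per_level(n_records, page_bytes=4096, key_bytes=13, ptr_bytes=10):
    # Alt-2 leaf entry = composite key + rid; internal entry = key + child pointer.
    entry = key_bytes + ptr_bytes
    fanout = page_bytes // entry                 # entries that fit on one page
    levels = [math.ceil(n_records / fanout)]     # leaf pages
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / fanout))  # pages at the next level up
    return levels

print(pages_per_level(105_000))   # [leaf pages, next level up, ..., root]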
5. {6 marks} This question continues with Question 4's Admissions table, but is separately numbered for convenience in marking. Again, you might not need all of the information presented here.
Suppose we also have a Patient table with the following schema. ID fields are integers (4 bytes) and text fields in this table are 25 bytes each. All indexes are Alt. 2, as usual.
Patient (patID, patName, patAddress, patCity, …) – Note: patID is the primary key.
We still have the B+ tree clustering index from Question 4 on the Admissions table. Secondly, there is an unclustered hash index on patID in the Patient table. Thirdly, there is an unclustered B+ tree index on patName in the Patient table.
For simplicity, assume that it takes 1.2 page I/Os on average to look up a key in the hash index on patID.
There are 50,000 rows in the Patient table, and there are 50 rows per page. Every person in the Admissions table is in the Patient table.
Estimate the best cost in page I/Os for computing the answer to the following SQL query, using INL (index nested loop) join. List any reasonable assumptions that you make. Note that the order in the FROM clause or the WHERE clause does not necessarily tell you what the outer table is. Don’t worry about sorting the rows of output.
SELECT P.patName, P.patCity, A.admDate, A.admTime, A.reasonCode
FROM   Patient P, Admissions A
WHERE  P.patID = A.patientID;
Here is some sample output from the query:
Amanda Lee     Richmond    20150228    142618    ABD15
Greg Heffley   Langley     20141209    112547    C5X20
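Note (illustration only): a generic index-nested-loop cost model is cost = (pages of outer) + (tuples of outer) x (index probe cost + cost to fetch the matching inner record(s)). The Python sketch below is hypothetical; the figures plugged into the example call are simply the table and index statistics given in Questions 4 and 5, and deciding whether that plan is actually the best one is what the question asks you to do.

def inl_cost(outer_pages, outer_tuples, probe_io, fetch_io_per_match, matches_per_tuple=1):
    # Scan the outer once, then probe the inner's index once per outer tuple
    # and fetch the matching inner record(s).
    return outer_pages + outer_tuples * (probe_io + matches_per_tuple * fetch_io_per_match)

# Example with Admissions as the outer: 105,000 rows at 70 rows/page = 1,500 pages,
# ~1.2 I/Os to probe the unclustered hash index on Patient.patID, plus 1 I/O to
# fetch the matching Patient record (patID is the primary key, so one match each).
print(inl_cost(outer_pages=1_500, outer_tuples=105_000, probe_io=1.2, fetch_io_per_match=1))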
6. {4 marks} For convenience, the same Admissions and Patient tables are repeated here. Nothing has changed.
Admissions (admID, admDate, admTime, admDoctorID, patientID, reasonCode, …)
The Admissions table has 70 rows per page. There are 105,000 records in total in the table.
Patient (patID, patName, patAddress, patCity, …)
There are 50,000 rows in the Patient table, and there are 50 rows per page. Every person in the Admissions table is in the Patient table.
Estimate the best cost in page I/Os for computing the block nested loop (BNL) join for these 2 tables, if you had B=17 buffer pages of size 4K.
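Note (illustration only): the usual block-nested-loop cost model reads the outer once and scans the inner once per block of outer pages, where a block is B - 2 pages (one page is reserved for the inner input and one for the output). A small Python sketch, using the page counts implied by the statistics above (Patient: 50,000/50 = 1,000 pages; Admissions: 105,000/70 = 1,500 pages); the function name is made up.

import math

def bnl_cost(outer_pages, inner_pages, B):
    # Read the outer once; scan the inner once per block of (B - 2) outer pages.
    return outer_pages + math.ceil(outer_pages / (B - 2)) * inner_pages

print(bnl_cost(1_000, 1_500, B=17))   # Patient as the outer
print(bnl_cost(1_500, 1_000, B=17))   # Admissions as the outer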
7. {8 marks} This question is about external mergesort. We are trying to sort an 80,000-page file (4K pages). We have B = 10,000 buffer pages of size 4K each. There are 1000 pages per cylinder.
All of the sorted runs, including those from Phase 1, and the final sorted run, will occupy contiguous (consecutive) cylinders. You can assume that you are the only user of the (single) disk drive, and that it has lots of free space on it.
Just like in some of our exercises, we will count only two kinds of seeks. We want to know how many SHORT seeks (defined as seeks to a neighbouring cylinder, i.e., the next cylinder over) and how many LONG seeks (defined as any seek that is more than 1 cylinder away from the current position, i.e., beyond a current neighbour) are needed to sort the file and produce the final sorted run.
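Note (illustration only): the definitions above amount to classifying each head movement by how far it travels. A tiny Python helper (hypothetical) that counts both kinds of seeks over a sequence of cylinder positions visited by the disk head:

def count_seeks(cylinders_visited):
    # SHORT = move to an adjacent cylinder; LONG = move of more than one cylinder.
    short = long_ = 0
    for cur, nxt in zip(cylinders_visited, cylinders_visited[1:]):
        dist = abs(nxt - cur)
        if dist == 1:
            short += 1
        elif dist > 1:
            long_ += 1
    return short, long_

print(count_seeks([0, 1, 2, 50, 51]))   # (3, 1) for this made-up access pattern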
a) Estimate the number of short seeks (SS) and long seeks (LS) that are needed to create the sorted runs for this file during the SORT phase (i.e. during Phase 1).
b) For the MERGE phase (i.e., Phase 2), we are going to do an 8-way external mergesort. The 8 input buffers are 1000 pages in size. The single output buffer is 2000 pages in size.
Compute the number of short seeks (SS) and the number of long seeks (LS) to read the sorted runs (for input), as well as the number of short seeks and the number of long seeks to write the sorted output file. Justify your answer, by showing your work. List any reasonable assumptions that you make, if any.