The University of British Columbia
Computer Science 404: Database Systems Midterm #3, November 15, 2018. Instructor: E.M. Knorr
Time: 75 minutes. Only a simple, non-programmable, non-communicating calculator is permitted. Closed book. No notes. No cell phones. Have your student ID card ready.
Name Student # (PRINT) (Last) (First)
Signature
The examination has 10 pages and that includes this
cover sheet. Check that you have a complete paper. Print your name and ID at the top of this page, and
provide your signature.
You do not need to print your name on any pages other than the cover page because we use serial numbers.
A simple, non-programmable, non-communicating calculator is permitted. No other aids are allowed.
Work quickly and do the easy questions first. Part marks are available.
The marks for each question are given in braces. Do not spend too much time on any one question.
To minimize disruptions during the exam, please avoid asking the invigilators for help, tips, or explanations.
Write down any reasonable assumptions that you are making, if you believe that a question is ambiguous.
Marks
Question
Max.
Achieved
1
4
2
4
3
4
4
4
5
9
6
5
7
5
8
8
9
3
10
8
11
9
Total
63
Questions 1-3 are multiple-choice questions. Circle all TRUE answers. There may be as few as 0 TRUE answers to the question, and as many as 5 TRUE answers—so be sure to read all parts of the question! For each question, you’ll get 3 marks if you only get 1 wrong, 2 marks if you get 2 wrong, and 0 marks otherwise.
In case of ambiguity or “technical loopholes”, please write down any reasonable assumptions that
you need to make.
1. {4 marks} Which of the following statements about joins and query trees are true? As usual, SMJ means Sort-Merge Join and HJ means Hash Join.
a) SMJ and HJ always return the same number of rows (in the result) for a given SQL join.
b) Usually, in SMJ, we can get a smaller I/O cost estimate if we choose the smaller table for the
left (outer) table.
c) If the buffer pool has a very small number of pages, then HJ on two large tables is likely to
require a lot of page I/Os
d) If the number of pages for two relations is M and N, respectively, and the number of buffer
pool pages (B) is one-half of the smaller of M and N, then it is possible to perform SMJ in a
total of M + N page I/Os .
e) If the number of pages for two relations is M and N, respectively, and the number of buffer
pool pages (B) is one-half of the smaller of M and N, then it is possible to perform HJ in a total of M + N page I/Os .
2. {4 marks} Which of the following SQL statements will probably be able to make use of an Alt. 2 clustered hash index on the Sailors table for the following composite search key: rating & age together (e.g., the index allows us to look up an 18-year-old sailor with rating 10)? Assume that there are 10 different values for rating (1-10) and 50 different values for age (15-64)— and all values are equally distributed. Also, assume that the index is much smaller than the table, and that there are many thousands of sailors.
a) SELECT count(*) FROM Sailors;
b) SELECT sname FROM Sailors WHERE age >= 40;
c) SELECT * FROM Sailors WHERE rating <> 47; — unequal d) SELECT count(*) FROM Sailors WHERE age = 20 AND rating = 9; e) SELECT age, rating FROM Sailors;
3. {4 marks} This time, suppose we have an Alt. 2 unclustered B+ tree index on the Reserves table for the composite search key bid & day (e.g., the index allows us to look up the reservation record(s) for all sailors who had a reservation for boat ID #333 on September 1, 2018). Which of the following SQL statements will probably be able to make use of this B+ tree index? Assume that the bid values are in the range 1-500 and there are 1000 different values for day—and all are equally distributed. There are 5,000,000 rows in the Reserves table and there are 100 rows per page.
a) SELECT day, bid
b) SELECT bid
c) SELECT day
d) SELECT count(*)
e) SELECT *
FROM Reserves;
FROM Reserves WHERE bid > 10;
FROM Reserves WHERE bid < 10;
FROM Reserves WHERE day = 20180831;
FROM Reserves WHERE bid = 17 AND day = 20180831;
Page 2
Note: In this exam, unless mentioned otherwise, assume that: all indexes are Alt. 2; the page size is 4K (i.e., 4096 bytes); no record or data entry (i.e., key + rid) or index entry (i.e., key + pointer) can span across two pages; all pointers (including each rid) are 10 bytes long; the pages are filled to capacity; and as usual, we’ll ignore housekeeping bytes (e.g., overhead bytes, sibling pointers) on all the pages. Don’t worry about whether there is an odd or an even number of data entries or records in your answer; just compute one of these two. Always show your work.
4. {4 marks} Let us re-visit Multiple Choice Question #3. We want to create an Alt. 2 unclustered B+ tree index on the Reserves table for the composite search key bid & day (e.g., the index allows us to look up the reservation record(s) for all sailors who had a reservation for boat ID #333onSeptember1,2018). Assumethateachofbidanddayisa4-byteintegerfield. Recall that there are 5,000,000 rows in the Reserves table and there are 100 rows per page. There are 500 different bid values and 1000 different day values in Reserves.
Compute the number of pages at each level of the B+ tree index.
Page 3
5. {9 marks}
a) Let us continue to work with the details in Question 4. The index there is an unclustered B+ tree index. Assuming the large buffer pool is empty, what is the expected number of page I/Os required to answer this query? Show your work.
SELECT * FROM Reserves WHERE bid = 300 AND day = 20181114;
b) Suppose the B+ tree index in Question 4 on the previous page is clustered instead of unclustered. Using the clustered index, and using the same information written in Question 4, answer the following question. Assuming the large buffer pool is empty, what is the expected number of page I/Os required to answer this SQL query? Show your work.
SELECT * FROM Reserves WHERE bid = 300;
Page 4
For the following questions about joins on the next few pages, assume that you do not retain any pages in the buffer pool between steps, unless the question tells you to use pipelining.
6. {5 marks} Suppose there are 640 Sailors pages and 64,000 Reserves pages. We want to do a Sort-Merge Join (SMJ) on the sid field in each table, with Sailors as the outer table. If we have B = 16 buffer pages, how many page I/Os are required to do the join? Show your work.
Page 5
7. {5 marks} Suppose there are 750 Sailors pages and 75,000 Reserves pages. We want to do a BNL join on the sid field in each table, with Sailors as the outer table. What is the minimum number of buffer pages (B) needed in order to complete the BNL join in no more than 301,000 page I/Os? Show your work. Show that a smaller number for B will not work.
Page 6
8. {8 marks} Suppose there are 300 Sailors pages and 60,000 Reserves pages. This time, we want to do a Hash Join (HJ) on the sid field in each table. (Ignore any fudge factor.)
a) If we have B = 20 buffer pages, how many page I/Os are required to do the hash join? Justify your answer. Be sure to list the partition sizes at each stage.
b) Suppose we have B = 50 buffer pages and there are 64,000 Reserves pages. Assuming we’re joining Sailors and Reserves using HJ, what is the maximum number of pages that the Sailors table can be, in order for us to expect to finish in this many page I/Os:
3*(#ofSailorspages + #ofReservespages)?
c) Suppose we have B = 1,000 buffer pages and there are 64,000 Reserves pages. Assuming we’re joining Sailors and Reserves using HJ, how big can the Sailors table be, in order for us to expect to finish in this many page I/Os:
3*(#ofSailorspages + #ofReservespages)?
Page 7
9. Read this page carefully because the following pages also use this information.
The following database schema is about athletes (people who play sports), sporting locations (where people play a sport, e.g., specific soccer fields, swimming pools, gyms), and participation events (i.e., a single instance of one athlete playing one sport at a particular place & time). When joined with the other table(s), here’s an example of the kind of information you can get from a Participation record: Brenda went swimming at the UBC Aquatic Centre on Nov. 6, 2018 (20181106) at 7 PM (1900). Another example: Ben played badminton at the UBC Student Recreation Centre on 20181029 at 1845.
Athlete(aID, aName, aAddress, aTown, aAge, aSex, ...)—Assume that there are 200 bytes per record, and there are 60,000 athletes.
Location(locID, locName, locAddress, locTown, ...)—Assume that there are 400 bytes per record, and there are 1,000 records in this table.
Participation(aID, locID, pDate, pTime, pSport, ...)—Assume that there are 100 bytes per record, and there are 2,000,000 records in this table.
Suppose the optimizer knows (from the catalog tables) that:
There are 75 different ages in the Athlete table, from ages 5-79 inclusive, and there is a uniform distribution over these ages. The number of males and females is the same.
There are 300 towns (or cities). Assume a uniform distribution.
There are 75 different sports in the Participation table. Examples are hockey, running,
golf, swimming, etc. Assume a uniform distribution.
There is exactly one year worth of data in the Participation table (i.e., 365 days in total).
Assume a uniform distribution.
o The primary key of each of the above 3 tables is underlined. Each primary key has an Alt. 2 unclustered hash index on it. It costs 1.2 I/Os on average to probe a hash index (i.e., to reach the first bucket).
o There is a clustering B+ tree index on pSport. It has 4 levels, and there are 19,608 leaf pages in it.
o Numeric fields, including dates, are 4 bytes long; strings (like aTown, locName, pSport, etc.) are 30 bytes long; and rids and other pointers are 10 bytes long. Any field that ends in “ID” is a numeric field.
Question 9: {3 marks} To the nearest integer, how many records from the Participation table do we expect to match the following (combined) set of WHERE-clause conditions?
WHERE (pSport = “Swimming” OR pSport = “Running”) AND pDate = 20180725
Page 8
10. {8 marks} Consider the following SQL query and its query tree. Compute the total number of page I/Os (total of the left and right children) that are required to answer the query using an index nested loop (INL) join, with Participation as the outer table. Do not use temporary tables; use pipelining. Use the clustered B+ tree index on pSport to help you solve this problem. There are only a very small number of buffer pages; so, you won’t be able to keep pages in memory. Show your work.
SELECT
FROM
WHERE
aName, aTown
Participation P, Athlete A
P.aID = A.aID
and aAge = 25
and pSport = “Tennis”;
π aName,aTown
σaAge = 25 (on-the-fly)
►◄
INL join on aID (on-the-fly)
(pipeline)
σ pSport = “Tennis” Athlete
Participation
Page 9
11. {9 marks} Consider the same SQL query from Question 10, and the same information from Question 9. This time, let us make Athlete the outer table, and let us assume that there is an unclusteredhashindexontheaIDfieldfortheParticipationtable. Answerthequeryusing an index nested loop (INL) join, with Participation as the inner (right) table. (a) Draw an appropriate query plan and tree for this situation. (b) Then, compute the total number of page I/Os (total of the left and right children) that are needed to answer the query. Do not use temporary tables; use pipelining. There are only a very small number of buffer pages; so, you won’t be able to keep pages in memory. Show your work. One decimal place of precision is fine.
SELECT
FROM
WHERE
aName, aTown
Athlete A, Participation P
A.aID = P.aID
and aAge = 25
and pSport = “Tennis”;
Page 10