The University of British Columbia
Computer Science 404: Database Systems Midterm #4, March 27, 2015. Instructor: E.M. Knorr
Time: 48 minutes. Only a simple, non-programmable, non-communicating calculator is permitted. Closed book. No help sheets. No cell phones, no smartphones, or other devices.
Name Student No (PRINT) (Last) (First)
Signature
The examination has 8 pages, but that includes
this cover sheet. Check that you have a complete paper.
Print your name and ID at the top of this page, and provide your signature. Have your student ID ready.
You do not need to print your name on any pages other than the cover page.
A simple, non-programmable, non-communicating calculator is permitted. No other aids are allowed.
Work quickly and do the easy questions first. Part marks are available.
The marks for each question are given in braces. Do not spend too much time on any one question.
To minimize disruptions during the exam, please avoid asking the invigilators for help, tips, or explanations. Please write down any reasonable assumptions that you are making, if you believe that a question is ambiguous.
Marks
Question
Max.
Achieved
1
4
2
11
3
4
4
4
5
9
6
3
7
10
Total
45
Question 1 is multiple choice {i.e., 4 marks¡ªyou get 3 if you only get 1 wrong, 2 if you get 2 wrong, and 0 otherwise}. Circle all TRUE answers. There may be as few as 0 answers to circle for the question, and as many as 5 to circle¡ªso be sure to read all parts of the question. (In case of ambiguity, please write down any reasonable assumptions that you need to make.)
1. {4 marks} Which of the following statements are usually true about data warehousing?
a) Data warehouses tend to support lots of end-user transactions throughout the day¡ªusually many more than for OLTP.
b) In terms of pages, the amount of data in a data warehouse tends to be much larger compared to an OLTP database.
c) To achieve better performance in a data warehouse/OLAP application, the DW design is often highly normalized.
d) A data warehouse tends not to have indexes.
e) The same end-users for OLTP databases are likely to be end-users for OLAP, as well.
Page 2
2. {11 marks} Consider the following query tree. We have 5,000,000 Reserves tuples (50,000 pages) and 40,000 Sailors tuples (500 pages). All the sailors have sids in the range of 1-40,000¡ªand the optimizer knows this. Note that S.sid is a primary key. There is an unclustered hash index on S.sid and an unclustered B+ tree index on R.sid. Use 1.2 page I/Os for each use (if any) of the hash index, and 3 page I/Os for each use (if any) of the B+ tree index. Assume that the indexes can hold several hundred data entries per leaf page or bucket.
a) {7 marks} We want to look up all sailors having sids in the range of 750 to 849, inclusive. Estimate the number of page I/Os required for the plan corresponding to the query tree below. We want the cheapest plan using INL. We will pipeline the results (no temporary tables). Show all of your work. List any reasonable assumptions, if you need to make any. ¡°sid in¡± is treated just like ¡°or¡± conditions.
*
sid (INL join)
sid in (750, 751, … 849) Reserves Sailors
b) {2 marks} Would it make a difference to the right child calculations if the B+ tree index were clustered on R.sid instead of unclustered? If so, how would your calculation for that part change? If clustering won¡¯t make a difference, briefly explain why (one sentence is fine).
c) {2 marks} Would it make a difference to the left child calculations if there were no index at all on Sailors? If so, how would that part of your calculation change? If it won¡¯t make a difference, briefly explain why (one sentence is fine).
Page 3
3. {4 marks} We want to do a Sort-Merge Join (SMJ) on 2 tables: X and Y. If we have 370 buffers, and if X is 24,000 pages long, then what is the largest possible size for table Y, so that we require no more than 120,150 page I/Os to do the complete SMJ? Show your work. List any assumptions.
4. {4 marks} We want to do a Hash Join (HJ) on 2 tables: R (20,000 tuples at 100 tuples/page) and S (60,000 tuples at 200 tuples/page). What is the cost of doing the hash join if we have B=31 buffers (4K pages, as usual? Be sure to show the expected page sizes for the partitions for each table. Show all your work.
Page 4
5. {9 marks}
(a) {2 marks} We have 250,000 customer records. Suppose we have an association table (a simple lookup table or array) that has 250,000 rows: 4 bytes for each customer number (1-250,000) in column 1 of the lookup table, and 10 bytes for that same customer¡¯s rid (stored in column 2 of the lookup table). How many 4K pages do we need to store this table? Don¡¯t split a pair/entry over 2 pages.
b) {2 marks} We¡¯re going to see which customers purchase which candies. Each type of candy (e.g., ¡°licorice nibs¡±) will have a bitmap vector. How many 4K pages do we need for a single bitmap vector? Ignore rids for this part.
c) {5 marks} Suppose we have x types of candy. DBAs are evaluating which case takes up more space: Case (1): x bitmap vectors, plus the lookup table from Part (a), plus exactly one page to store all of the candy names; or Case (2): just use an Alternative 2 B+ tree index (in the usual way) to index the candies. Note that we expect duplicate keys in Case (2); that¡¯s OK.
For the B+ tree index, let us assume that each candy¡¯s name (Cname) takes up 15 bytes. What should the value of x be in order for the B+ tree index on the search key Cname (i.e., Case (2)) to take up less space than the bitmap vector solution (i.e., all of Case (1)). Provide the smallest such value for x. Show all your work. List any reasonable assumptions, if you make any.
Page 5
6. {3 marks} Consider the following query tree/plan. Just write the corresponding SQL statement for it
on this page. T1 and T2 are temporary tables.
S.rating, R.day, R.bid sid (BNL join)
T1
S.years >= 58
Sailors
T2
R.bid > 30
Reserves
Page 6
7. {10 marks} Similar to our examples in class, suppose there are 100,000 Reserves records (1,000 pages), and 40,000 Sailors records (500 pages). Let us assume that a field in Sailors is the sailor¡¯s years of experience (S.years). There is a uniform distribution of years of experience between 0 and 59; and the optimizer knows this. The boats are numbered from 1-300, and the optimizer knows this. There is no index on either table. We are going to project as early as possible, and we want to minimize the number of page I/Os that we do.
Sailors: 50 bytes/tuple: (sid = 4 bytes, rating = 4 bytes, years = 4 bytes, sname = 38 bytes) Ratings: 40bytes/tuple: (bid=4bytes,sid=4bytes,day=10bytes,rname=22bytes)
S.rating, R.day, R.bid sid (BNL join)
T1
S.years >= 58
Sailors
T2
R.bid > 30
Reserves
Use the query tree above with temporary tables T1 and T2. Compute the cheapest plan for the whole query, using BNL join. Project as early as possible. Assume there is no pipelining. You don¡¯t have to re-draw the tree. Show all your calculations. There are B=25 buffers of size 4K.
a) Do your left child calculations here:
THERE IS MORE ACTION ON THE NEXT PAGE.
Page 7
b) Do your right child calculations here:
c) Do any remaining work here:
THIS IS THE END OF THE EXAM.
Page 8