CS W186 Spring 2020 Final 2
Do not share this exam until solutions are released.
• The final has 11 questions, each with multiple parts, and worth a total of 157 points. Taking the exam:
• You have 180 minutes to complete the final.
Copyright By PowCoder代写 加微信 powcoder
• You may print this exam or download it onto an electronic device to work on it.
• For each question, submit only your final answer on Gradescope.
• For numerical answers, do not input any suffixes (i.e. if your answer is 5 I/Os, only input 5 and not 5 I/Os or 5 IOs) and do not use LaTeX.
• For decimal answers, exclude leading zeroes (i.e. if the answer is 0.75, input just .75)
• Questions tagged with [EXPLANATION] require you to show work and not doing so will result in no credit. You can do this by inputting an explanation in the text field or submitting a file (photo, PDF) of your work. The text field supports LaTeX by using $$insert expression here$$.
• Make sure to submit on Gradescope at the end of the exam. Aids:
• This exam is open-book, open-calculator, and open-Internet. • You must work individually on this exam.
Grading Notes:
• All I/Os must be written as integers. There is no such thing as 1.02 I/Os – that is actually 2 I/Os. • 1 KB = 1024 bytes. We will be using powers of 2, not powers of 10
• Unsimplified answers, like those left in log format, will receive a point penalty.
1 Pre-Exam (1 point)
1. (1 point) Please read and sign the Honor Code Statement on Gradescope.
2 Separated Quarantine Life (15 points)
You did a great job setting up the student roster with Octograder in the beginning of the semester, so the CS W186 TAs find you again, this time to help generate project grades. They hand you the following schema:
Throughout this question, assume the following:
• students contains all enrolled students. Each student has unique sid, edx id, and email, but the
database uses edx id as the unique identifier for students.
• assignment group represents the 5 projects in this class: it is one of proj1, proj2, proj3, proj4, proj5.
• assignment name is the parameter that you have to enter when running the turn in.py script: it is one of proj1, proj2, proj3 part1, proj3 part2, proj4 part1, proj4 part2, proj4 part3, proj5.
• profile is either public or hidden.
• results contains two rows for every student submission: one with public profile and one with hidden
profile for that assignment name.
• score and weight are both between 0 and 1, inclusive. The weights of all (assignment name, profile) pairs for an assignment group add up to 1. Given the submissions of a project from a student, the score is the sum of score * weight for all (assignment name, profile) pairs for that assignment group, which gives a score out of 1.
• We only deal with the raw scores; ignore the effects of late submissions, academic dishonesty, etc.
Page 2 of 33
1. (4 points) [EXPLANATION] You try to build the schema, but you get a warning saying Schema Error. Fix the schema so that there are no errors. Your fix should stick to the assumptions from the previous page.
Make sure to clearly state which line(s) you are changing, some examples of valid answers: “add … after line 4”, “change line 17 to …”, “delete … from line 24”.
(Note: there are two schema errors and three changes to make. You need all of them to get full credit.)
The schema is fixed, and you are ready to generate project grades. Due to the difficulties in getting help in OH because of the special circumstances of remote learning this semester, the TAs try to come up with ways to help student grades. Currently there are four proposals in how to generate the total project grades:
Proposal 1: Stick with the current policy: for every student, use the latest submission for every assignment and sum them up to get a score out of 5.
Proposal 2: For every student, use the highest score from every (assignment name, profile) pair and sum them up to get a score out of 5.
Proposal 3: For every student, use the latest submission for every assignment to get 5 scores for the 5 projects, then drop the lowest one, to get a score out of 4.
Proposal 4: Grade on completion: for every student, give 1 if they submitted anything for a project and 0 otherwise, and sum them up to get a score out of 5.
For questions 2 to 8, select a proposal if the query outputs result for that proposal. The output should have two columns, edx id and total score, and it should have one row and only one row for anyone who has made at least one submission for any assignment. The students who didn’t submit anything (i.e. results doesn’t have any row for them) don’t need to be in the output.
Select “Error” if the query errors, or select “Incorrect Result” if the query outputs some results but don’t match any of the proposals.
Page 3 of 33
2. (1.5 points) Query:
A. Proposal 1: latest submission
B. Proposal 2: highest score from every (assignment name, profile) pair C. Proposal 3: latest submission and drop lowest
D. Proposal 4: completion
F. Incorrect Result
3. (1.5 points) Query:
A. Proposal 1: latest submission
B. Proposal 2: highest score from every (assignment name, profile) pair C. Proposal 3: latest submission and drop lowest
D. Proposal 4: completion
F. Incorrect Result
Page 4 of 33
4. (1.5 points) Query:
A. Proposal 1: latest submission
B. Proposal 2: highest score from every (assignment name, profile) pair C. Proposal 3: latest submission and drop lowest
D. Proposal 4: completion
F. Incorrect Result
5. (1.5 points) Query:
A. Proposal 1: latest submission
B. Proposal 2: highest score from every (assignment name, profile) pair C. Proposal 3: latest submission and drop lowest
D. Proposal 4: completion
F. Incorrect Result
Page 5 of 33
After an hour of meeting over Zoom, the TAs still cannot decide on a proposal. The professors decide to be nice and take the maximum score of using proposals 1, 2, and 3, after scaling them to be out of 1 (unfortunately, proposal 4 was eliminated from consideration). The scores from proposals 1, 2, and 3 as described above are already generated. The scores from proposals 1 and 2 are divided by 5, and scores from proposal 3 are divided by 4, so that they are all average project scores out of 1. They are stored in relations P1(edx id, weighted score), P2(edx id, weighted score), and P3(edx id, weighted score) for proposals 1, 2, 3, respectively.
6. (5 points) [EXPLANATION] Select the queries that output the name, email, and the maximum weighted score from proposals 1, 2, and 3. The output should have exactly one row for every enrolled student. For students who didn’t submit anything (i.e. they are not in P1, P2, or P3), they should have a score of 0 in the output.
For the queries that are not selected, briefly explain what is wrong (i.e. what error it throws or how is the output different from the expected output). No need to have explanation for the correct queries.
There may be zero, one, or multiple correct answers.
Page 6 of 33
3 Maintaining (at Least 6 ft. of) Free Space (14 points)
On Netflick’s new spin-off reality show, Too Sick to Handle, contestants compete to see if they can hone their apocalypse survival skills. $10, 000, 000 is in the jackpot, as long as contestants obey one rule: during the duration of the month-long “Quarantine”, they must stay at least 6 feet apart from other people at all times. To make sure the contestants don’t break the rules, the producers of Too Sick to Handle employ an all-knowing AI, Dana, to watch over them, and have asked you to help program her.
In Dana’s database of contestants, we start with the following schema.
CREATE TABLE Contestants ( cid INTEGER PRIMARY KEY, firstName CHAR[15], lastName CHAR[15]
Throughout this question, assume integers are 4 bytes each and pages are 1 KiB (1024 B) each.
1. (1 point) How many contestants records can fit on each page? Page headers consist of a bitmap and an integer to store the bitmap’s length.
2. (2 points) Assume now that each contestant record is 12B, and we have unpacked page with bits [0, 1, 2, 3, 4, 5, 6, 9] in the bitmap on (signalling presence of a record) and an integer to store the bitmap’s length. Rounded down to the nearest byte, how much free space, not including the bitmap, is left on the page in bytes?
There should be a deduction from the prize pool whenever the AI Dana catches contestants incurring an infraction. You decide to add some more info for each contestant, which is as follows:
CREATE TABLE Contestants ( cid INTEGER PRIMARY KEY, firstName CHAR[15], lastName CHAR[15], occupation VARCHAR, address VARCHAR, numInfractions INTEGER
Page 7 of 33
For problems 3-7, assume that each unpacked page’s footer contains an 10-byte area for the record count and pointer to free space, as well as a slot directory where it takes 4 bytes per record to store the pointer/offset (in bytes) to the record in the page and another 4 bytes per record to store the length (in bytes) of that record.
3. (1 point) Assume each record header uses 1 byte pointers. What is the minimum possible record size in bytes?
4. (2 points) [EXPLANATION] How much total free space does the following page have? Assume we have a slot directory independent of the schema above that contains of the following entries: [0, 100], [100, 100], [200, 100], [300, 100], [400, 100], [500, 20], [544, 256], [800, 108].
5. (1 point) True or False: Using the slot directory in question 4, we can insert a record of total length 44B in the page without any reorganization.
6. (1 point) True or False: Using the slot directory in question 4, we can insert a record of total length 38B in the page without any reorganization.
7. (1 point) What is the minimum number of records we would have to delete from the page to have enough space to insert a record of length 384B?
In designing the AI Dana, you also consider the pros and cons of file organizations by comparing a page directory with a linked list implementation.
8. (3 points) [EXPLANATION] Suppose that you want to insert a record into your table, and in your page directory implementation, the first page with free space for the record is referenced in the 10th header page. In order for a linked list implementation to have a faster insert (in terms of I/O’s) even with the worst case number of I/Os, which page (e.g. 1st/2nd/3rd) should be the first page in the linked list that has enough free space for the record? Assume buffer memory is not an issue and full pages in the linked list implementation are added to the beginning of the list for efficiency. Write your answer as a number (i.e. if it’s the 3rd page, write 3).
Page 8 of 33
The producers tell you that the more contestants you have, the more interesting the “Quarantine” will be. You start looking at file and index layouts to make efficient queries to the Contestants table, as they expect thousands of contestants to join the show.
9. (2 points) As contestants balance maintaining distance with maintaining their normal routines, there likely will be many updates per contestant (who are identified by their cid) to their infraction count. The producers also tell you that they will be holding special challenges for contestants within certain ranges of infraction numbers incurred, of whom you will need to help the producers identify.
If you can only have one index, which combination of choices (one letter per each column) would optimize the performance in this scenario?
File Layout
(a) Sorted File on cid
(b) Sorted File on numInfractions (c) Heap File
(d) Unclustered Index on cid
(e) Unclustered Index on numInfractions (f) Clustered Index on cid
(g) Clustered Index on numInfractions
Page 9 of 33
4 B+ Trees Get Degrees (15 points)
For questions 1-3 we have a clustered alternative 2 B+ tree of height=2 (remember the height of the root is 0). Assume all matches for this B+ tree occur on the same data page. Also assume we have 10 buffer pages, use LRU as our eviction policy, and the buffer is empty at the start of each question. How many I/Os do each of the following operations require?
1. (2 points) get(key) where key does not appear in the table
2. (2 points) [EXPLANATION] get(key) where key appears in the table 5 times (assume order d=3)
3. (2 points) [EXPLANATION] put(key, value) where no splits happen
For questions 4-6, we want to see how modifications to the B+ tree affect the I/O count. For each modifi- cation, how much does the I/O count for the operation in question 2 (get(key) where key occurs 5 times) change by? Use +/- signs if the change is non-zero. If there is no change, put 0. For example, if the answer to question 2 is 10 I/Os, and you think after a modification it will only take 8 I/Os, your answer should be -2. These questions are independent of each other, so each question is only making one modification to the B+ tree presented at the start of the problem.
4. (2 points) How does the I/O count change if we make the tree an alternative 1 tree? Remember that we are assuming all records occur on the same page.
5. (2 points) How does the I/O count change if we make the tree an alternative 3 tree?
Page 10 of 33
6. (2 points) How does the I/O count change if we make the tree unclustered (assume all 5 matches occur on different data pages)?
7. (3 points) We will now start with the following B+ tree:
We now insert 8 and then 6. The skeleton of the B+ tree after those insertions is below. Some keys are provided, some slots have letters in them, and some slots are empty even though a key should be there. Fill out the answer sheet with what keys belong in the slots that have letters in them. If a letter is in a place where there is no key, just write NONE. If a node only has one key, it belongs in the leftmost slot.
Page 11 of 33
5 Hash Headaches, Stanford Sorts (17 points)
Jealous of the success of Postgres (which began as a research project at Berkeley), students at Stanford decide to develop their own database: Regres. Worried that they might injure themselves without adult supervision you decide to advise them with your knowledge from CS W186!
One of the students is experimenting with different hash functions for external hash on a set of random uniformly distributed unsigned 32-bit integers. We say a hash function causes data skew if we expect more keys to hash to certain values than others. For example, consider a hash function that returns 1 if the key is a multiple of 10 and 0 otherwise. This function causes data skew as roughly 10% of the possible keys would have a hash value of 1, while about 90% would hash to 0.
1. (2 points) The student asks you to consider the following hash function: the hash value is computed as the number of digits in the base 10 representation of the integer. Possible values are in the range of [1, 10] inclusive. 0 is considered to be 1 digit. Does this hash function cause significant data skew? And if so, which hash value has the most keys hash to it? Also include the percentage of keys which hash to it. In the example above, an acceptable answer would be: “Yes this function would cause data skew, roughly 90% of keys hash to the value of 0.” You may round to the percentage to the nearest multiple of 5.
Hint: The smallest unsigned 32-bit integer is 0, the largest is 232 − 1 = 4, 294, 967, 295 (roughly 4 billion).
2. (2 points) The student decides that writing good hash functions based only on digits is tricky and instead creates a program that returns a random number in the range [1, 186] uniformly as a hash value, regardless of what the key passed in is. On average, would this approach cause significant data skew? What might go wrong if they used this approach in external hashing when their goal is to group together duplicate values?
3. (3 points) You decide to give the students a working implementation of one of the perfectly uniform hash functions frequently mentioned in CS 186 exams. The students test it out on a small input and it seems to work great. The next day the students call you to complain that there’s a problem with the hash function you gave them: if they try to use it to group duplicates from a large input their code never stops running! Close examination reveals that the only hash function the students use for the entire external hashing procedure is the perfectly uniform one you gave them, and that the most frequently repeated key can easily have all duplicates fit in memory at the same time. Explain why their external hashing worked for small inputs, but not for large ones. Your answer should be around 2 to 4 sentences of explanation.
Page 12 of 33
The students tell you they’ve been working for 12 months optimizing the in-memory sorting and merging functions used for external sort. Their optimized versions of in-memory sort (used in pass 0) and in-memory merge (used in the remaining passes) are 12× faster than the default implementations! For example, if the default implementation of either algorithm takes 12 seconds to do something, the optimized implementation will take 1 second.
You decide to test out a small example to see how much this would improve the overall speed of external sort. You find that disk I/Os will take roughly 15 milliseconds to complete, the default implementation of in-memory sort on 4 pages of records takes 4 milliseconds to complete, and the default implementation of in-memory merge takes 1 millisecond for each page being merged to complete (merging 10 pages takes 10 milliseconds). For the following questions consider running external sort on 12 pages of records with access to 4 memory buffers.
Assume the students are using the same implementation of external sort presented in class, and that disk operations are blocking (multiple disk I/Os can’t be run in parallel). The times for in-memory sort and in-memory merge assume that the pages have already been loaded into memory and don’t include the time to write pages back to disk.
4. (2 points) How many I/Os will it take to run external sort and how long in milliseconds will it take to complete these I/Os using the default implementation?
5. (2 points) How many separate in-memory sorts will take place and how many milliseconds in total will be spent performing in-memory sorts? Two in-memory sorts are considered separate from each other if they take place over different pages of records.
6. (1 point) How many milliseconds will be spent performing in-memory merge using the default imple- mentation?
Page 13 of 33
7. (2 points) [EXPLANATION] Sum together the times from the previous three questions to get the total run-time of external sort with the default implementations of in-memory sort and in-memory merge. Then compute the total run-time (in milliseconds) of external sort with the optimized (12× faster) versions of in-memory sort and in-memory merge. How many times faster is external sort with the optimizations (compute the default runtime divided by the optimized runtime)? If you don’t have access to a calculator feel free to leave things in fractional form. Otherwise round your answer to two decimal digits.
8. (3 points) The students suspect that in another twenty-four months they can produce custom hardware to perform the in-memory sort and in-memory merge over 1000000x faster! Based on your answer to question 7, is it worthwhile for the students to focus on this approach if their goal is to increase the overall speed of their external sorts?
(There are multiple correct ways to answer this, any answer that draws a rational conclusion about improvements to the speed of external sort based on the results of question 5.7 is eligible for full credit. Your answer should have around 2-4 sentences of justification).
Page 14 of 33
6 Join Me for Virtual Commencement (22 points)
Due to COVID-19, UC Berkeley, the number one public university, has moved commencement online! You have been asked by the chancellor to match diplomas and graduating students in Berkeley’s database.
You are given the Students(sid,
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com