CS 4320/Homework 2
An online shop stores information on books in a table with the following columns: – Title (type text)
– Author (type text)
– Year (type integer)
Copyright By PowCoder代写 加微信 powcoder
– Category (type text)
For the following questions, we make the following assumptions: – We store information on 100,000 books
– Storing a text field consumes 40 bytes
– Storing an integer field consumes 8 bytes
– One disk page stores 8,192 bytes
Question 1 [40 Points] Assume the table storing books is indexed by an unclustered B+ tree index on the “Category” column. We assume the following:
– The index tree has four levels (entries on the last level point to the actual data) – Each reference to data or to an index node consumes 8 bytes of storage
– We do not merge index or data entries with the same key
– There are 100 categories and books are uniformly distributed over categories
Calculate the total cost (i.e., covering cost for reading data from the index and from the table), measured as the number of page reads, when using the index to retrieve all books of the category “SQL”.
First, we consider cost for reading data from the index itself. We traverse the index from the root node to a leaf. Hence, we read four nodes (i.e., four pages) until reaching the first leaf node. Next, we scan the entries in the index leaf nodes that belong to the relevant category (note that leaf nodes are connected via pointers for B+ trees).
The index leaf nodes contain pairs of keys and record IDs. According to the assumptions, each key consumes 40 bytes and each reference consumes 8 bytes. Hence, we can fit floor(8,192/48) = 170 entries on each index leaf page. We have 100,000 books and 100 categories. As we assume a uniform distribution, we assume 1,000 books per category. 1,000 entries fit on ceil(1,000/170)=6 pages. Hence, we need to read five more leaf nodes after the first one (assuming that the relevant entries align well with the page boundaries). In total, we read 4 + 5 = 9 pages from the index.
Each index entry references one record in the table. The index is unclustered. As seen in class, we pessimistically assume that each entry is located on a different page. Hence, we need one page access for each book in the relevant category. As justified before, we calculate with 1,000 relevant books. Hence, we access 1,000 pages from the table. In total, we read 1,009 pages from hard disk.
Question 2 [20 Points] Answer the previous question now under the assumption that the index is clustered (instead of unclustered). More precisely, assume that the index leaf pages contain directly the data (instead of references to the data).
Again, we traverse the index from the root to a leaf node. As we have four tree levels, we read four pages in the process. Now, the leaf pages contain data entries. Each row contains three text fields and one integer field. According to the assumptions, each row consumes 3*40+8 = 128 bytes of space. This means we can fit 8,192/128 = 64 entries on each page. Therefore, we can fit 1,000 books on ceil(1,000/64)=16 pages. As the index is clustered, all relevant entries are consecutive. Hence, we only need to read 15 more pages for a total of 4+15 = 19 pages. Using the clustered index is significantly cheaper than using the unclustered index.
Question 3 [12 Points] Assume now that we have three (unclustered) indexes on the books table:
– A hash index with the composite key (Author, Year)
– A B+ tree index with the composite key (Author, Year) – A B+ tree index with the key (Author)
For each of the following queries, decide for each index whether it could be used (according to the criteria seen in class). Justify your answer for each index with one to two sentences. Note that we do not ask whether using the index is efficient, just whether it could be used.
Q3.1) [3 Points] Query:
SELECT * FROM Books WHERE Year = 2021
The hash index cannot be used since the query does not restrict all parts of the composite key. The first tree index cannot be used since the restricted column in the query is not a prefix of its composite key. The last index cannot be used either (as it indexes a column that is not used by the query).
Q3.2) [3 Points] Query:
SELECT * FROM Books WHERE Author = ‘ ’
The hash index cannot be used since the query does not restrict all parts of the composite key. The first tree index can be used as the restricted column is a prefix of its composite key. The last index can be used as well since its key column is restricted in the query.
Q3.3) [3 Points] Query:
SELECT * FROM Books WHERE Author = ‘ ’ and Year < 1990
The hash index cannot be used as hash indexes only support equality predicates. The first tree index can be used as it indexes all relevant predicates. The second tree index can be used as well (entries retrieved from the index can still be filtered based on the second predicate on the year).
Q3.4) [3 Points] Query:
SELECT * FROM Books WHERE Author = ‘ ’ and Title = ‘Å
relational model of data for large shared data banks’
The hash index cannot be used as the query does not restrict the year column (which is part of the composite index key). The first tree index can be used as one predicate refers to a prefix of its composite key. The second tree index can be used as its key column is restricted in the query.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com