CS 4320/5320 Homework 2

CS 4320/5320 Homework 2

Spring 2015 Due March 4, 2013

This assignment is due on Wednesday, 2015-03-04 at 23:59. It is out of 80 points and counts for 10% of your overall grade. As before, you must work with one partner for this homework; you may work with the same partner as for Homework 1 or with a different partner if you wish.

There is both a coding part (in Java) and a written part to this assignment.

1 Coding Part: B+ Tree (50 points)

A Note on Academic Integrity: You may discuss your solutions with other students in the class, but do not copy any code from any other students in the class. We will be running submissions through an automated tool that detects code similarity; if you copy code, you will get caught.

In this section, you are asked to implement a basic BPlusTree structure as described in the textbook (pages 344-356). For simplicity, you can assume for this project:

• no duplicate keys will be inserted
• keys will implement interface Comparable

1.1 Description

A B+ tree is balanced tree structure in which the index nodes direct the search and the leaf nodes contain the data entries.

(1) Constraints and Properties of B+ trees:

  • Keys are sorted in the nodes.
  • Nodes are sorted by their keys. For Index Nodes, the B+ tree asserts that for all keys k and adjacent pointers, before(k) and after(k), it must be that:

    max(keys(before(k))) < k ≤ min(keys(after(k)))
    In other words, k is the key that divides the keys in pages before(k) and after(k).

  • The tree is balanced: all paths from the root to the leaf nodes must be of equal length.
  • The nodes are sufficiently filled: For any non-root node, given Order D, D ≤ node.keys.size() ≤

    2*D. The node is underflowed if node.keys.size() < D, and overflowed if node.keys.size() > 2*D.

1

(2) In this assignment, you are asked to write a B+ tree that has Comparable keys and stores values (of any Class). It has search, insert and delete functionality.

(a) Search: Given a key, return the associated value.
(b) Insert: Give a key/value pair, insert it into the correct leaf node, without violating the B+ tree

constraints noted above. In insertion, there are three distinct cases to think about: • the target node has available space for one more key.

• the target node is full, but its parent has space for one more key. (leaf overflow)

• the target node and its parent are both full. (leaf overflow and index node overflow)

(c) Delete: Given a key, delete the corresponding key/value pair without violating the B+ tree constraints noted above.
There are also three main cases to think about (after deletion). Here is how your tree should handle them:

  • the target node is not underflowed (at least half full), there is nothing left to do.
  • the target node is underflowed. Identify a target sibling as described below. If

    node.keys.size()+sibling.keys.size() <= 2*D, merge them.

  • the target node is underflowed and you cannot merge with the target sibling, redistribute with

    the sibling instead.

    The sibling of a node N is a node that is immediately to the left or right of N and has the same parent as N. In this assignment you should handle underflow as follows. Identify a target sibling for the underflowed node. By default, this is the left sibling of the underflowed node. If the underflowed node does not have a left sibling, then we use its right sibling. Once the target sibling is identified, we try to merge with it; if this is impossible because the total number of keys would exceed 2 ∗ D (which happens e.g. when the sibling is full and contains 2 ∗ D entries by itself), we redistribute with the sibling.

    Additionally, as per your textbook’s description of B+ trees, each Leaf Node should keep track of the leaves to its left (lesser key values) and right (greater key values). Note that a leaf node directly to the left (or right) of another need not be its sibling – it may be its cousin, for example.

1.2 Skeleton Code

We provide you with the following skeleton code:

  1. (1)  BPlusTree.java : main class to implement, see the specification as described in the comments in the skeleton code.
  2. (2)  Node.java: general node class for the B+ tree.
  3. (3)  IndexNode.java: subclass of Node.java, represents the index (i.e. interior) nodes of the B+tree.
  4. (4)  LeafNode.java: subclass of Node.java, represents the leaf nodes of the B+tree.
  5. (5)  Utils.java: contains methods aiding testing and debugging your code
  6. (6)  Tests.java: JUnit test class containing some test examples. You should write more test cases to test your tree, although you will not be graded on the additional test cases you write..

1.3 To Do

Implement all the methods in BPlusTree.java. Add any helper methods you need in any of the Java classes.

  1. (1)  search(), insert() and delete() are the main functions that will be called when we test your tree.
  2. (2)  splitLeafNode(), splitIndexNode() are helper functions that may be called by insert. They split the provided leaf/index node, and return the splitkey/newnode pair to be inserted in the node’s parent. For example, say node n contains 1/3/4/5/7. After splitting by splitkey 4, n contains 1/3 and a new node m contains 4/5/7. Entry(4, m) is returned.
  3. (3)  handleLeafNodeUnderflow(), handleIndexNodeUnderflow() are helper functions that may be called by delete(). While not required, we advise that you might find writing “delete” easier if you structure your code to use such functions. They take in two adjacent nodes from the same parent, left (contains smaller keys) and right (contains larger keys), and their parent node. They return -1 if underflow is handled by redistribution indicating no further deletion needed. Otherwise, they return the original splitkey position in their parent for the two merged nodes, so that the parent node knows which key/value pair to delete.
  4. (4)  helper methods’ signatures can be modified if you wish. But the signatures of the main methods (search, insert, delete) must not be changed.

1.4 Grading

We will be running unit tests like the one provided in the skeleton code to test your B+ tree’s search, insert and delete. If your code does not run or fails all our test cases, it will receive no more than 5 points. The points breakdown is approximately as follows:

  1. (1)  5 points on Search
  2. (2)  15 points on Insert
  3. (3)  20 points on Delete
  4. (4)  5 points on logic and style (will be used mostly to allow us to penalize excessively inefficient and/or unreadable code)

2 Written Part (30 points)

For all questions in this part, please write down your reasoning as well as your answer. Just giving a final answer won’t give you full points even if correct, and is guaranteed to give you zero points if incorrect.

2.1 The Cost of Joins (9 points)

Assume a system has the following costs (measured in IOs): • 1.3 per hash-based lookup

• 4 per lookup with B+ index

Assume we want to join relations R and S on a particular attribute A. S has A as primary key, and R has A as a foreign key (so when computing the join, each element of R matches exactly one element of S). Assume a uniform distribution on the foreign keys, i.e. each tuple in S matches the same number of tuples in R.

Both relations can be sorted in 2 passes.
In terms of the number of tuples and pages of R and S, what is the cost of joining the relations using:

(a) Index Nested Loop Join (Clustered B+ Tree Index on S.A) (2 points) (b) Index Nested Loop Join (Unclustered B+ Tree Index on R.A) (2 points)

(c) Index Nested Loop Join (Hash Index on S.A) (2 points) (d) Sort-Merge Join (3 points)

2.2 External Sort (4 points)

Suppose you have a file with 30,000 pages and 4 available buffer pages. Using a two-way external sort, i.e. a sort that only does two-way merging as described in Section 13.2 of your textbook, how many passes are required to sort the file completely and what is the total I/O cost? (2 points) Is it possible to sort this file in just two passes using a two-way external sort given you can have as many buffer pages as you want? How many buffer pages will that take? (2 points)

2.3 The Best Way to Find What You’re Looking For (8 points)

Consider a relation R (a, b, c) containing 8,000,000 records, where each data page of the relation holds 16 records. R is organized as a sorted file with secondary indexes. Assume that R.a is a candidate key for R, with values lying in range 0 to 7,999,999, and that R is stored in R.a order. For each of the following relational algebra queries, state which of the following three approaches is most likely to be cheapest. Remember to give your reasoning to get full credit (2 points each):

• Access the sorted file for R directly.
• Use a (clustered) B+ tree index on attribute R.a. • Use a linear hashed index on attribute R.a.

(a) σa<216,000 (R) (b) σa=216,000 (R)

(c) σa>216,000∧a<218,000 (R) (d) σa̸=216,000 (R)

2.4 Query Optimization (13 points)

Consider a relation with this schema:
Companies(cid : integer, cname : string, age : integer, state : string, category : string)

Suppose that the following indexes, all using Alternative (2) from your textbook for data entries, exist: a hash index on cid, an unclustered B+ tree index on age, an (unclustered) hash index on state and a clustered B+ tree index on ⟨state, age⟩. Each Company record is 160 bytes long. For simplicity you can assume that each index data entry is 20 bytes long regardless of whether the index is on one or two attributes. The Companies relation contains 64,000 pages. There are 16 tuples per page.

(a)

For each of the following conditions, find the most selective path for retrieving all Company tuples that satisfy that condition (only). Explain why this path is the most selective and state its cost. You may assume the cost of navigating a tree index root-to-leaf is 2 I/Os and that the reduction factor (RF) for each term that matches an index is 0.1, regardless of whether the condition has = or some other operator such as <.

(i) age > 10 (2 points)
(ii) state =“CA” (2 points)

(iii) age > 5 AND category=“Technology” (3 points)
(iv) state=“NY” AND age > 5 AND category=“Finance” (3 points)

Suppose that, for the above selection condition (ii), you want to compute the average age for each State (i.e., group by state). For selection condition (ii), describe the least expensive evaluation method and give its cost. (3 points)

Submission Instructions

(b)

3

Submit the two following files:

• a .pdf file containing all your answers to the written questions. As before these must be typeset; scans of handwritten answers are not acceptable and will receive zero points.

• a .zip file containing: all your java files. A README file that includes your name and netid and explains your coding logic and any known bugs.

See CMS for the applicable late submission policy.