CS 4320/Homework 3
Consider the following query plan, executed on the database of an online shop:

π[Zip]  (on-the-fly)
└─ ⋈  Hash Join
   ├─ ⋈  Index-Nested Loops Join
   │  ├─ σ[Year=2020]  (materialized)
   │  │  └─ Account
   │  └─ Customer
   └─ Address
The plan generates zip codes of the addresses associated with customers who opened an account in 2020. All joins are natural joins (i.e., the join condition requires that all columns with the same name in different tables have the same value). In the following, we ask you to estimate processing costs for this plan, counting the number of page I/Os and using the methods seen in class.
The three tables above have been created via the following SQL commands:
– Create table address(aid int primary key, zip int, city text, street text, housenr int);
– Create table customer(cid int primary key, name text, aid int, foreign key (aid) references address(aid));
– Create table account(login text primary key, password text, year int, cid int, foreign key (cid) references customer(cid));
A hash index has been created on the “cid” column of the “Customer” table. Use the following information for your cost calculations:
– The database stores 100,000 accounts
– The database stores 50,000 customers
– The database stores 25,000 addresses
– Text fields consume 40 bytes of storage
– Integer fields consume 8 bytes of storage
– Each disk page stores 8,192 bytes
– The buffer pool can store 100 pages
– The database stores information on ten years
– The data is distributed uniformly
– Retrieving one customer via the hash index on the “Customer” table for a given customer ID costs two page accesses
– We keep only one representative among different columns with the same name after each natural join (the other columns are projected out)
Q1 [60 Points] Calculate the size of all base tables and intermediate results, measured in disk pages. Show and justify your calculations (this enables us to give partial credit if the reasoning is correct but the calculations are erroneous).
First, we calculate the size of the base tables. The “Account” table has two text and two integer columns. We count 40 bytes per text column and 8 bytes per integer column, so each entry consumes 96 bytes in total. With a page size of 8,192 bytes, we can store floor(8,192/96)=85 entries per page. We have 100,000 accounts in total, consuming ceil(100,000/85)=1,177 pages.
The “Customer” table has one text and two integer columns, hence 40+2*8=56 bytes per entry. This means we can fit floor(8,192/56)=146 entries on a page. We have 50,000 customers in total, consuming ceil(50,000/146)=343 pages.
The “Address” table has three integer columns and two text columns, i.e. 3*8+2*40=104 bytes per entry. Hence, we can fit floor(8,192/104)=78 entries on one page. This means we need ceil(25,000/78)=321 pages to store the entire relation.
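These page counts can be checked mechanically. The following Python sketch (our own illustration, not part of the assignment; the pages() helper and PAGE_SIZE constant are invented names) applies the same floor/ceil reasoning:

import math

PAGE_SIZE = 8192  # bytes per disk page (given)

def pages(n_rows, row_bytes):
    rows_per_page = PAGE_SIZE // row_bytes    # floor: entries that fit on one page
    return math.ceil(n_rows / rows_per_page)  # ceil: pages needed for all entries

print(pages(100_000, 2 * 40 + 2 * 8))  # Account: 96 B/entry, 85 per page -> 1177
print(pages(50_000, 1 * 40 + 2 * 8))   # Customer: 56 B/entry, 146 per page -> 343
print(pages(25_000, 2 * 40 + 3 * 8))   # Address: 104 B/entry, 78 per page -> 321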
We filter the “Account” table to entries from the year 2020. The database stores entries for ten years and we assume a uniform data distribution, so we expect about 1/10th of the entries, i.e., 10,000 accounts, to pass the filter condition. The filter does not change the entry size, so we can still fit 85 entries on one page (see the calculation for the “Account” table above) and therefore need ceil(10,000/85)=118 pages to store the first intermediate result.
After the first join (between accounts and customers), we keep all columns from “Account” and all columns from “Customer”, except for the duplicate “cid” column (which is already present in the “Account” table). This leaves three text columns and three integer columns, so each entry consumes 3*40+3*8=144 bytes and we can fit floor(8,192/144)=56 entries per page. Due to the foreign key constraint, each account is associated with exactly one customer, so the join produces one output row per remaining account. The 10,000 remaining accounts (after filtering by year) therefore fit on ceil(10,000/56)=179 pages.
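The intermediate-result sizes can be retraced the same way (again a sketch under the stated assumptions, with invented names):

import math

PAGE_SIZE = 8192

def pages(n_rows, row_bytes):
    return math.ceil(n_rows / (PAGE_SIZE // row_bytes))

filtered = 100_000 // 10                 # uniform over ten years -> 10,000 entries
print(pages(filtered, 2 * 40 + 2 * 8))   # entry size unchanged (96 B) -> 118 pages
# Account ⋈ Customer keeps one "cid" copy: 3 text + 3 int columns = 144 B/entry.
print(pages(filtered, 3 * 40 + 3 * 8))   # 56 entries per page -> 179 pages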
Following the conventions introduced in class, we do not count the cost of writing out the final join result. Hence, we do not need to calculate the size of the final result.
Q2 [40 Points] Calculate the cost of each operation that appears in the plan (measured by the number of pages). Sum them to obtain the total execution costs.
First, we need to read the “Account” table from hard disk to apply the filter by year. Based on the previously calculated table size, reading costs 1,177 page I/Os. Next, we materialize the filtered table. The filtered result consumes 118 pages. Hence, we pay 118 page I/Os for materialization.
Next, we execute an index-nested loops join. We pay 118 page I/Os again for reading the outer operand. For each index lookup, we pay two page I/Os (as per the assumptions). As we expect 10,000 remaining accounts after filtering, the lookups account for 20,000 page I/Os. Writing the join result to hard disk costs 179 page I/Os.
The second join is a hash join. The cost for this join operator is three times the combined size of the input relations. Here, we have a combined size of 179 + 321 = 500 pages. Hence, the total join cost is 1,500 page I/Os.
In total, the cost is 1,177 + 118 + 118 + 20,000 + 179 + 1,500 = 23,092 page I/Os.
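As a sanity check, the following Python sketch (illustrative variable names, not part of the required answer) retraces the cost sum:

scan_account = 1177             # read "Account" to apply the filter
materialize  = 118              # write the filtered result to disk
inlj_read    = 118              # read the filtered result as the outer operand
inlj_lookups = 10_000 * 2       # two page I/Os per index lookup, one per account
inlj_write   = 179              # write the first join result to disk
hash_join    = 3 * (179 + 321)  # 3 x (combined input sizes) = 1,500

print(scan_account + materialize + inlj_read
      + inlj_lookups + inlj_write + hash_join)  # -> 23092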
Q3 [10 Points] Propose one improvement to the given query plan (i.e., propose a change that does not change the final result but likely reduces processing costs). You do not need to calculate the cost savings of your proposed change. Instead, justify in one to three sentences why the proposed change is likely to reduce costs.
E.g., we can use pipelining and avoid writing the filtered “Account” table to hard disk. This removes the cost of first writing the intermediate result to disk and then reading it back. A quick estimate of the savings appears in the sketch below. Other valid propositions: using the index-nested loops join with the unclustered index is not very efficient; using a hash join instead, for instance, would likely be cheaper.
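Although Q3 does not require calculating the savings, a rough estimate under the Q2 numbers (a sketch, assuming the filtered result is consumed directly by the join):

materialize_write = 118  # no longer written to disk
outer_read        = 118  # no longer read back for the join
print(materialize_write + outer_read)  # -> 236 page I/Os saved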
Submit answers to all questions as a single .pdf document on CMS. This homework can be submitted by groups of up to three students.