CSC 555: Mining Big Data
Project, Phase 2 (due Sunday March 22nd)
In this part of the project, you will execute queries using Hive, Pig and Hadoop Streaming and develop a custom version of KMeans clustering. The schema is available below, but don't forget to change it to the correct delimiter:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql
The data is available at (this is Scale1, the smallest denomination of this benchmark)
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/
In your submission, please note what cluster you are using. Please be sure to submit all code (pig, python and Hive). You should also submit the command lines you use and a screenshot of a completed run (just the last page, do not worry about capturing the whole output). An answer without corresponding code will not be counted.
I highly recommend creating a small sample input (e.g., by running head lineorder.tbl > lineorder.tbl.sample, you can create a small version of lineorder with a few lines) and testing your code with it. You can run head -n 500 lineorder.tbl to get a specific number of lines.
NOTE: the total number of points adds up to 70.
Part 1: Data Transformation (15 pts)
Transform the part.tbl table into a '*'-separated file using Hive, MapReduce with Hadoop Streaming, and Pig (i.e., 3 different solutions).
Do not use sed to answer this question because: 1) that is an answer found on Stack Overflow, and 2) sed is a Linux utility, so you would not actually be using Hive to perform the transformation.
In all solutions you must switch the first and last columns in your output. You do not need to transform the column values in any way, just swap their positions. This means you do not have to use SELECT TRANSFORM or Python in your Hive solution.
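For the Hadoop Streaming solution, a map-only job is enough. The sketch below is a minimal mapper, assuming the input is '|'-delimited and that each .tbl row ends with a trailing '|' (as SSBM data files typically do); adapt it to what you actually see in the data.

```python
#!/usr/bin/env python
# Map-only Hadoop Streaming mapper: read '|'-delimited part.tbl rows,
# swap the first and last columns, and emit '*'-separated output.
import sys

def transform(line, in_delim='|', out_delim='*'):
    # .tbl files usually end each row with a trailing delimiter; strip it
    # first so it does not produce an empty last field.
    fields = line.rstrip('\n').rstrip(in_delim).split(in_delim)
    if len(fields) > 1:
        fields[0], fields[-1] = fields[-1], fields[0]
    return out_delim.join(fields)

if __name__ == '__main__':
    for line in sys.stdin:
        print(transform(line))
```

When you submit the job, set the number of reducers to zero (e.g., with the streaming jar's -numReduceTasks 0 option) so the mapper output is written directly.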
Part 2: Querying (25 pts)
Implement the following query:
select c_nation, AVG(lo_extendedprice)
from customer, lineorder
where lo_custkey = c_custkey
and c_region = 'AMERICA'
and lo_discount = 5
group by c_nation;
using Hive, MapReduce with Hadoop Streaming and Pig (i.e., 3 different solutions). In Hive, this merely requires pasting the query into the Hive prompt and timing it. In Hadoop Streaming, this will require a total of 2 passes (one for the join and another one for the GROUP BY).
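The first streaming pass tags each record with its source table and joins on the customer key in the reducer; the sketch below covers only the simpler second pass. It is a minimal illustration, assuming (as a design choice, not part of the assignment) that the join pass emitted tab-separated "c_nation, lo_extendedprice" pairs, which Hadoop sorts by key before the reducer runs.

```python
#!/usr/bin/env python
# Reducer sketch for the second streaming pass: GROUP BY c_nation with AVG.
# Input lines look like "c_nation\tlo_extendedprice", already sorted by key.
import sys

def average_by_key(lines):
    results = []
    current_key, total, count = None, 0.0, 0
    for line in lines:
        key, value = line.rstrip('\n').split('\t')
        if key != current_key:
            # Key changed: emit the average for the previous group.
            if current_key is not None:
                results.append((current_key, total / count))
            current_key, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    if current_key is not None:
        results.append((current_key, total / count))
    return results

if __name__ == '__main__':
    for nation, avg in average_by_key(sys.stdin):
        print('%s\t%s' % (nation, avg))
```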
Part 3: Clustering (30 pts)
Create a new numeric file with 250,000 rows and 4 columns, separated by spaces. You can generate the numeric data however you prefer, but submit the code that you used.
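One way to generate such a file is sketched below; the output filename (points.txt) and the value range 0-100 are arbitrary choices, not requirements.

```python
#!/usr/bin/env python
# Generate 250,000 rows of 4 space-separated random numbers.
import random

def make_row(columns=4, low=0.0, high=100.0):
    # One row: 'columns' uniform random values, space-separated.
    return ' '.join('%.4f' % random.uniform(low, high) for _ in range(columns))

if __name__ == '__main__':
    with open('points.txt', 'w') as f:
        for _ in range(250000):
            f.write(make_row() + '\n')
```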
(5 pts) Run Mahout synthetic-data clustering as you did on sample data in a previous assignment. This entails running the same clustering command, but substituting your own input data and the right number of clusters.
(25 pts) Using Hadoop Streaming, perform three iterations manually using 5 centers (initially with randomly chosen centers). As discussed in class, this requires passing a text file with cluster centers using the -file option, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that particular point. Your reducer then computes the new centers; at that point the iteration is done, and the output of the reducer (the new centers) can be given to the next pass of the same code.
The only difference between the first and subsequent iterations is that in the first iteration you have to pick the initial centers. In the 2nd iteration, the centers will be given to you by the previous pass of KMeans, and so on.
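The core of one KMeans iteration can be sketched as below. This is a minimal outline, assuming centers.txt holds one space-separated center per line and that data points are space-separated; the streaming driver loops (reading sys.stdin in the mapper and the sorted pairs in the reducer) are omitted and left to you.

```python
#!/usr/bin/env python
# Building blocks for one KMeans iteration under Hadoop Streaming.
# Mapper side: load the centers file shipped with -file, assign each point
# to its nearest center. Reducer side: average points to get new centers.

def load_centers(path='centers.txt'):
    # One space-separated center per line.
    with open(path, 'r') as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]

def nearest(point, centers):
    # Squared Euclidean distance suffices for picking the closest center.
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))

def reduce_new_centers(pairs):
    # pairs: iterable of (center_index, point). Returns index -> new center,
    # where each new center is the mean of its assigned points.
    sums, counts = {}, {}
    for idx, point in pairs:
        if idx not in sums:
            sums[idx] = [0.0] * len(point)
            counts[idx] = 0
        for d, v in enumerate(point):
            sums[idx][d] += v
        counts[idx] += 1
    return dict((idx, [s / counts[idx] for s in sums[idx]]) for idx in sums)
```

In the actual mapper you would print "nearest_index<TAB>point" for each input line, and in the reducer you would print the new centers so that file can be fed to the next pass.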
Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Project Phase 2” at the top.