CSC 555: Mining Big Data
Project, Phase 2 (due Sunday March 24th)
In this part of the project, you will execute queries using Hive, Pig and Hadoop streaming and develop a custom version of KMeans clustering. The schema is available below, but don’t forget to apply the correct delimiter:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql
The data (this is Scale1, the smallest scale of this benchmark) is available at:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/
In your submission, please note which cluster you are using. Please be sure to submit all code (Pig, Python and Hive). You should also submit the command lines you use and a screenshot of a completed run (just the last page; do not worry about capturing the whole output). An answer without code will not receive credit.
I highly recommend creating a small sample input (e.g., by running head lineorder.tbl > lineorder.tbl.sample) and testing your code with it. You can run head -n 500 lineorder.tbl to get a specific number of lines.
NOTE: The points in this phase add up to 70 because Phase 1 is worth 30 points of the project.
Part 1: Data Transformation (15 pts)
Transform the part.tbl table into a file that uses ‘*’ as the column separator. Use Hive, MapReduce with Hadoop streaming, and Pig (i.e., 3 different solutions).
In all solutions you must switch odd and even columns (i.e., switch the positions of columns 1 and 2, columns 3 and 4, etc.). You do not need to transform the column values in any way; just produce a new data file with the reordered columns.
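For the Hadoop streaming solution, the column swap could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it assumes the input delimiter is '|' and that each line may end with a trailing delimiter (check both against the schema and the data), and the names swap_adjacent_columns and run are hypothetical. If a row has an odd number of columns, the last column has no partner and stays in place.

```python
def swap_adjacent_columns(line, in_sep="|", out_sep="*"):
    """Swap columns 1<->2, 3<->4, ...; an unpaired last column stays put."""
    fields = line.rstrip("\n").split(in_sep)
    # Assumption: a trailing delimiter leaves one empty field at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    for i in range(0, len(fields) - 1, 2):
        fields[i], fields[i + 1] = fields[i + 1], fields[i]
    return out_sep.join(fields)


def run(stream_in, stream_out):
    """Map-only job body: rewrite every non-empty input line."""
    for line in stream_in:
        if line.strip():
            stream_out.write(swap_adjacent_columns(line) + "\n")
```

In an actual mapper script you would call run(sys.stdin, sys.stdout) after importing sys. No reducer is needed for this transformation, since every output line depends on exactly one input line.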
Part 2: Querying (25 pts)
Implement the following query:
select lo_quantity, c_nation, sum(lo_revenue)
from customer, lineorder
where lo_custkey = c_custkey
and c_region = 'AMERICA'
and lo_discount BETWEEN 3 and 5
group by lo_quantity, c_nation;
using Hive, MapReduce with Hadoop streaming, and Pig (i.e., 3 different solutions). In Hive, this merely requires pasting the query into the Hive prompt and timing it. In Hadoop streaming, this will require a total of 2 passes (one for the join and another one for the GROUP BY).
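The two streaming passes could be organized along the lines of the sketch below. It is a local simulation, not a finished solution: sorted() stands in for the Hadoop shuffle, the delimiter is assumed to be '|', the two tables are told apart by column count (8 for customer, 17 for lineorder), and the column positions used are assumptions that you must verify against the schema file. All function names are hypothetical.

```python
from itertools import groupby


def join_map(lines, sep="|"):
    """Pass 1 mapper: key every row by custkey and apply the filters early."""
    for line in lines:
        f = line.rstrip("\n").split(sep)
        if f and f[-1] == "":          # assumed trailing delimiter
            f.pop()
        if len(f) == 8 and f[5] == "AMERICA":          # customer: c_region
            yield f"{f[0]}\tC\t{f[4]}"                 # c_custkey, c_nation
        elif len(f) == 17 and 3 <= int(f[11]) <= 5:    # lineorder: lo_discount
            yield f"{f[2]}\tL\t{f[8]}\t{f[12]}"        # lo_custkey, qty, revenue


def join_reduce(sorted_lines):
    """Pass 1 reducer: pair each customer with its buffered lineorder rows."""
    rows = (l.rstrip("\n").split("\t") for l in sorted_lines)
    for _, group in groupby(rows, key=lambda r: r[0]):
        nation, orders = None, []
        for rec in group:
            if rec[1] == "C":
                nation = rec[2]
            else:
                orders.append((rec[2], rec[3]))
        if nation is not None:
            for qty, rev in orders:
                yield f"{qty},{nation}\t{rev}"


def sum_reduce(sorted_lines):
    """Pass 2 reducer: sum revenue per (lo_quantity, c_nation) key."""
    rows = (l.rstrip("\n").split("\t") for l in sorted_lines)
    for key, group in groupby(rows, key=lambda r: r[0]):
        yield f"{key}\t{sum(int(r[1]) for r in group)}"


def demo():
    """Run both passes on a few made-up rows."""
    def lo(cust, qty, disc, rev):      # build a 17-column lineorder row
        f = ["0"] * 17
        f[2], f[8], f[11], f[12] = cust, qty, disc, rev
        return "|".join(f) + "|"
    rows = ["1|n|a|c|UNITED STATES|AMERICA|p|m|", "2|n|a|c|CHINA|ASIA|p|m|",
            lo("1", "10", "4", "100"), lo("1", "10", "3", "50"),
            lo("1", "10", "7", "999"), lo("2", "10", "4", "70")]
    joined = join_reduce(sorted(join_map(rows)))       # pass 1
    return list(sum_reduce(sorted(joined)))            # pass 2
```

The pass 1 reducer buffers the lineorder rows for each key because the order of values within a key is not guaranteed by default. The second pass can use an identity mapper, since the first pass already emits the grouping key.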
Part 3: Clustering (30 pts)
Create a new numeric file with 25,000 rows and 3 columns, separated by spaces. You can generate the numeric data however you prefer, but submit whatever code you used.
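One possible way to generate such a file is sketched below. The value range, the fixed seed, the output precision, and the function names are all arbitrary choices made for illustration; any other generator is equally acceptable.

```python
import random


def generate_points(n_rows=25000, n_cols=3, seed=555, low=0.0, high=100.0):
    """Return n_rows points with n_cols uniform random coordinates each."""
    rng = random.Random(seed)          # fixed seed keeps the file reproducible
    return [[rng.uniform(low, high) for _ in range(n_cols)]
            for _ in range(n_rows)]


def format_point(point):
    """Render one point as a space-separated line (without the newline)."""
    return " ".join(f"{v:.4f}" for v in point)


def write_points(points, path):
    """Write one point per line to a local file."""
    with open(path, "w") as out:
        for point in points:
            out.write(format_point(point) + "\n")
```

For example, write_points(generate_points(), "points.txt") produces the local file, which you would then copy into HDFS before running the clustering jobs.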
(5 pts) Cluster this data using Mahout synthetic clustering, as you did on sample data in a previous assignment. This entails running the same clustering command, but substituting your own input data for the sample.
(25 pts) Using Hadoop streaming, manually perform four KMeans iterations with 6 centers (initially chosen at random). This requires passing a text file with the cluster centers using the -file option, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that point. Your reducer then computes the new centers. At that point the iteration is done, and the reducer output with the new centers can be given to the next pass of the same code.
The only difference between the first and subsequent iterations is that in the first iteration you have to pick the initial centers. Starting from the 2nd iteration, the centers are given to you by the previous pass of KMeans.
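The core logic of one iteration might look like the sketch below. It is a simplified local illustration, not a required design: it assumes squared Euclidean distance, uses the index of the nearest center as the key, and all function names are hypothetical. In the real mapper, the centers would be read with load_centers(open('centers.txt', 'r')).

```python
from itertools import groupby


def load_centers(lines):
    """Parse one space-separated center per line."""
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))


def kmeans_map(lines, centers):
    """Mapper: emit center_index <TAB> original point."""
    for line in lines:
        if line.strip():
            point = [float(v) for v in line.split()]
            yield f"{closest_center(point, centers)}\t{line.strip()}"


def kmeans_reduce(sorted_lines):
    """Reducer: average the points of each key and emit only the new center."""
    rows = (l.rstrip("\n").split("\t") for l in sorted_lines)
    for _, group in groupby(rows, key=lambda r: r[0]):
        sums, count = None, 0
        for _, coords in group:
            point = [float(v) for v in coords.split()]
            sums = point if sums is None else [s + p
                                               for s, p in zip(sums, point)]
            count += 1
        yield " ".join(f"{s / count:.4f}" for s in sums)
```

Note that in this sketch a center that attracts no points produces no output line, so the next iteration would start with fewer than 6 centers. Decide how you want to handle that case (for example, by re-seeding the missing center).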
Extra credit (7 pts): Create the equivalent of the KMeans driver from Mahout. That is, write a Python script that automatically executes the Hadoop streaming command, then gets the new centers from HDFS and repeats the command. This will be easiest to do if you write your reducer to output just the centers (without the key) to HDFS. This way, all you have to do is execute the get command to fetch the new centers (you can hard-code the HDFS output locations into your script).
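A driver of this kind could be structured roughly as follows. Everything cluster-specific here is a placeholder assumption: the streaming jar path, the HDFS directories, and the script names kmeans_mapper.py and kmeans_reducer.py are hypothetical and must be replaced with your own. The sketch uses hadoop fs -getmerge instead of a plain get so that several reducer output files are concatenated into one local centers.txt.

```python
import subprocess

# Assumptions: replace these placeholders with the values for your cluster.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"
INPUT_DIR = "/data/kmeans/input"
OUTPUT_ROOT = "/data/kmeans/out"


def streaming_command(iteration):
    """Build the Hadoop streaming command for one KMeans pass."""
    return ["hadoop", "jar", STREAMING_JAR,
            "-input", INPUT_DIR,
            "-output", f"{OUTPUT_ROOT}/iter{iteration}",
            "-mapper", "python kmeans_mapper.py",
            "-reducer", "python kmeans_reducer.py",
            "-file", "kmeans_mapper.py",
            "-file", "kmeans_reducer.py",
            "-file", "centers.txt"]


def fetch_command(iteration, local_path="centers.txt"):
    """Build the command that copies the new centers from HDFS."""
    return ["hadoop", "fs", "-getmerge",
            f"{OUTPUT_ROOT}/iter{iteration}", local_path]


def run_iterations(n_iterations=4, runner=subprocess.run):
    """Run each pass, then fetch its centers as input for the next pass."""
    for i in range(1, n_iterations + 1):
        runner(streaming_command(i), check=True)   # stop on a failed job
        runner(fetch_command(i), check=True)
```

Calling run_iterations() assumes an initial centers.txt already exists in the working directory. Writing each pass to its own output directory avoids the error Hadoop raises when the output directory already exists.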
Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Project Phase 2” at the top.