
CSC 555: Mining Big Data
Project, Phase 2 (due Friday, March 16th)

In this part of the project, you will run various queries using Hive, Pig, and Hadoop Streaming. The schema is available below, but don't forget to apply the correct delimiter:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql
The data is available at:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC553/data/ (this is Scale4)

In your submission, please note which instance and which cluster you are using (you can reuse your existing cluster for most of the questions). Please be sure to submit all code (Pig, Python, and Hive). You should also submit the command lines you use and a screenshot of a completed run (just the last page; do not worry about capturing the whole output). An answer submission with a screenshot/results but without the code will not receive credit.

I highly recommend creating a small sample input (e.g., by running head lineorder.tbl > lineorder.tbl.sample) and testing your code with it; you can use head -n 100 to get the first 100 lines.

Part 1: Data Transformation
Use Scale4 data to perform data processing, unless otherwise specified.

A. Transform the lineorder.tbl table into a comma-separated file. Use Hive, MapReduce with Hadoop Streaming, and Pig (i.e., 3 different solutions).

B. Extract three of the numeric columns (lo_quantity, lo_linenumber, lo_revenue) for rows where lo_discount is between 6 and 8 into a space-separated text file (for K-Means clustering later). Use Hive, MapReduce with Hadoop Streaming, and Pig (3 different solutions). Sketches of streaming mappers for both A and B appear below.
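For the Hadoop Streaming solutions, a minimal sketch of the two mappers is shown here (both are map-only jobs, run with zero reduce tasks). This is just one possible starting point, not the required implementation: the script names are made up, the 0-based column positions for 1-B are assumed from the standard SSBM lineorder layout, and "between 6 and 8" is treated as inclusive; verify all of these against SSBM_schema_hive.sql before running.

#!/usr/bin/env python
# csv_mapper.py -- sketch for 1-A: turn pipe-delimited rows into CSV.
import sys

for line in sys.stdin:
    if not line.strip():
        continue  # skip empty lines
    # SSBM .tbl rows end with a trailing '|'; strip it before splitting.
    fields = line.rstrip('\n').rstrip('|').split('|')
    print(','.join(fields))

#!/usr/bin/env python
# extract_mapper.py -- sketch for 1-B: emit lo_quantity, lo_linenumber
# and lo_revenue (space-separated) when lo_discount is between 6 and 8.
# Column positions are assumptions; check them against the schema file.
import sys

QUANTITY, LINENUMBER, DISCOUNT, REVENUE = 8, 1, 11, 12

for line in sys.stdin:
    fields = line.rstrip('\n').rstrip('|').split('|')
    if len(fields) <= REVENUE:
        continue  # skip malformed or empty rows
    if 6 <= int(fields[DISCOUNT]) <= 8:
        print('%s %s %s' % (fields[QUANTITY], fields[LINENUMBER], fields[REVENUE]))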

Part 2: Querying
All queries from the SSBM benchmark are available here:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/SSBM_queries_all.sql

Using Scale4 data, perform the following data processing, and don't forget to time your results.

A. Run SSBM queries 2.1, 3.3 and 4.3 using Hive only (if you have issues running the queries, try placing the lineorder table first in the FROM clause of the query).

B. Create a pre-join (i.e., a new data file) that corresponds to the query below. You can think of it as a materialized view. What is the size of the new file? Use Hive and Pig (2 different solutions, and be sure to report the file size for both).

SELECT lo_partkey, lo_suppkey, s_suppkey, d_year, lo_revenue


FROM lineorder, dwdate, supplier
WHERE lo_orderdate = d_datekey and lo_suppkey = s_suppkey;

Part 3: Clustering
Using the file you have created in 1-B, run K-Means clustering using 11 clusters.

A. Use Mahout synthetic clustering, as you did in a previous assignment on sample data. This entails running the same clustering command, but substituting your own input data for the sample.

NOTE: if you get a java.lang.OutOfMemoryError, you will need to reconfigure Hadoop to supply the Java virtual machine with more memory. You can do this by editing mapred-site.xml (the mapper should not need much RAM):

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>
</property>

The amount of memory can be tweaked (you can go higher, but keep in mind how much physical memory your machine has). If you still run out of memory in 3-A, submit a screenshot of that change and you will get full credit for the question.

B. Using Hadoop Streaming, perform three iterations manually (initially with randomly chosen centers). This would require passing a text file with the cluster centers using the -file option, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that particular point. Your reducer would then compute the new centers; at that point the iteration is done, and the output of the reducer can be given to the next pass as its centers file. A sketch of such a mapper and reducer appears after the note below.

NOTE: Not attempting to answer this question will result in an additional grade penalty.
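A minimal sketch of one such iteration follows. It is one possible approach, not the required solution: it assumes the 1-B output format (three space-separated numbers per line), a centers.txt with one center per line in the same format (11 lines for 11 clusters), and squared Euclidean distance; the script names are made up.

#!/usr/bin/env python
# kmeans_mapper.py -- assign each point to its nearest center.
# centers.txt is shipped to every node with the -file option.
import sys

# Load the current centers, one per line of space-separated numbers.
centers = []
with open('centers.txt', 'r') as f:
    for line in f:
        parts = line.split()
        if parts:
            centers.append([float(x) for x in parts])

def sq_dist(a, b):
    # Squared Euclidean distance (the square root is not needed
    # when only comparing distances).
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 3:
        continue  # skip malformed rows
    point = [float(x) for x in parts]
    # Key = index of the nearest center; value = the point itself.
    nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
    print('%d\t%s' % (nearest, ' '.join(parts)))

#!/usr/bin/env python
# kmeans_reducer.py -- average the points assigned to each center and
# emit one new center per line (the same format as centers.txt).
import sys

cur_key, sums, count = None, None, 0

for line in sys.stdin:
    key, _, value = line.partition('\t')
    point = [float(x) for x in value.split()]
    if key != cur_key:
        if sums is not None:
            # Finished one cluster: emit its new center.
            print(' '.join(str(s / count) for s in sums))
        cur_key, sums, count = key, [0.0] * len(point), 0
    sums = [s + p for s, p in zip(sums, point)]
    count += 1

if sums is not None:
    print(' '.join(str(s / count) for s in sums))

Passing both scripts with -mapper and -reducer (and -file for each script plus centers.txt), saving the reducer output as the next iteration's centers.txt, and rerunning the job twice more would complete the three manual iterations.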

Part 4: Performance
Compare the performance of the following combinations.

A. All three of your solutions to Part 1-A with:
a. Scale4: a single-node cluster and a cluster of at least 4 nodes

B. Both of your solutions to Part 2-B with:
a. Scale4: a single node and a cluster of at least 4 nodes

C. Summarize the results and the cluster performance/scaling in at least a paragraph.

Extra Credit

Research and describe the most affordable way to build a 10-Petabyte drive. The drive should be built to own, not to rent (Dropbox or similar services don't count, even if they do say "unlimited" storage).

Submit a single document containing your written answers. Be sure that this document contains your name and "CSC 555 Project Phase 2" at the top.