In this assignment, we are going to practice MapReduce on a file named CarSale.csv, which
is located in our database. In case you don’t have the file, you can get it from
http://ip-address/dataset/CarSale.csv
Take a look at the columns of this file to get to know them better.
In this homework, we will do some real data science work with messy data. In both questions,
try to pass data that is as clean as possible into your reduce step.
Please note that you need to run your code on HDFS in both questions.
Note: If you would like to have only one output file after MapReduce, add this line to your
bash command (before the input line):
-Dmapred.reduce.tasks=1
(40 pts) Write MapReduce code to find the maximum and average price for each car make
(column 15). Please keep these notes in mind:
Technically this question asks something similar to
select make, avg(price), max(price)
group by make
You can make multiple dictionaries to track each value separately for each car
maker. However, it’s much more efficient to use only one dictionary.
You may store a list as each dictionary value in order to keep multiple entries for
each key (car maker).
Remember to use the right type for your price. The average price should be a
decimal number, so an integer is not a good choice!
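The notes above can be sketched roughly as follows. This is only an illustration, not the required solution: the function names map_line and reduce_pairs are made up for this example, the price column position is a placeholder (the assignment does not state it), and the make column assumes that "column 15" counts from 1. Lowercasing the make is one possible cleaning choice. In a real streaming job the map step would read lines from standard input and print key/value pairs, and the reduce step would read those pairs back.

```python
import csv
import math

# Assumed 0-based column positions, for illustration only.
MAKE_COL = 14    # "column 15" in the assignment, assuming it counts from 1
PRICE_COL = 4    # hypothetical: check the CSV header for the real position


def map_line(line, make_col=MAKE_COL, price_col=PRICE_COL):
    """Map step: return (make, price) for a usable row, else None."""
    try:
        rows = list(csv.reader([line]))
    except csv.Error:
        return None                            # malformed CSV line
    if not rows or len(rows[0]) <= max(make_col, price_col):
        return None                            # blank or truncated row
    fields = rows[0]
    make = fields[make_col].strip().lower()    # cleaning choice: ignore case and spacing
    try:
        price = float(fields[price_col])       # float, so the average keeps decimals
    except ValueError:
        return None                            # header row or non-numeric price
    if not make or not math.isfinite(price):
        return None
    return make, price


def reduce_pairs(pairs):
    """Reduce step: one dictionary, each value is the list [sum, count, max]."""
    stats = {}
    for make, price in pairs:
        entry = stats.setdefault(make, [0.0, 0, price])
        entry[0] += price                      # running sum
        entry[1] += 1                          # running count
        entry[2] = max(entry[2], price)        # running maximum
    # Turn each [sum, count, max] into (average, max).
    return {make: (total / count, highest)
            for make, (total, count, highest) in stats.items()}
```

Keeping the sum, count, and maximum in one list per make means a single dictionary lookup per row, which is the point of the "only one dictionary" note.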
(60 pts) Now add one more attribute to the group by: accident status. We would
like to see the maximum and average price of car makes based on their accident status.
Only consider the rows that state the accident status (either false/true).
Make sure to catch both uppercase and lowercase spellings of the accident status.
Technically this question asks something similar to
select make, has accident, avg(price), max(price)
group by make, has accident
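One way to handle the accident status, sketched under assumptions: the helper names parse_accident and reduce_by_make_and_accident are hypothetical, and the records are assumed to arrive as (make, raw accident status, price) tuples with the price already converted to a float. Rows whose status is anything other than true/false (in any letter case) are skipped, and the dictionary key becomes the pair (make, status).

```python
def parse_accident(raw):
    """Return True or False for a stated status, or None if it is missing or unclear."""
    text = raw.strip().lower()        # catches "True", "TRUE", "false", ...
    if text == "true":
        return True
    if text == "false":
        return False
    return None


def reduce_by_make_and_accident(records):
    """Group by (make, accident status); each value is the list [sum, count, max]."""
    stats = {}
    for make, raw_status, price in records:
        status = parse_accident(raw_status)
        if status is None:
            continue                  # accident status not stated: ignore the row
        entry = stats.setdefault((make, status), [0.0, 0, price])
        entry[0] += price
        entry[1] += 1
        entry[2] = max(entry[2], price)
    # Turn each [sum, count, max] into (average, max).
    return {key: (total / count, highest)
            for key, (total, count, highest) in stats.items()}
```

Using a tuple as the key keeps the single-dictionary approach from the first question while adding the second group-by attribute.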
To help you, here is one possible structure that you might get from the output: