Given two CSV files, read the data from each, filter the records that satisfy the conditions below, and write them out in the specified format. parking_violations.csv is the dataset of parking violations; open-violations.csv is the dataset of violations that are still open, i.e., whose fines have not yet been paid.
A Spark program can read an input file with the following:
import sys
from csv import reader

lines = sc.textFile(sys.argv[1], 1)               # read the input file named on the command line
lines = lines.mapPartitions(lambda x: reader(x))  # parse each partition's lines as CSV rows
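The mapPartitions(... reader(x)) idiom is used instead of a plain split(',') because CSV fields may themselves contain quoted commas. A quick local check of the difference (the sample field values here are made up for illustration):

```python
from csv import reader

# csv.reader handles fields that contain quoted commas, which a naive
# str.split(',') would break apart into extra pieces.
rows = list(reader(['1307964308,GBH2444,"NEW YORK, NY",74']))
naive = '1307964308,GBH2444,"NEW YORK, NY",74'.split(',')

print(rows[0])    # ['1307964308', 'GBH2444', 'NEW YORK, NY', '74']
print(len(naive)) # 5 -- the quoted field was split incorrectly
```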
Task 1: Find all parking violations that have been paid, i.e., that do not occur in open-violations.csv.
Output: A key-value* pair per line, where:
key = summons_number
values = plate_id, violation_precinct, violation_code, issue_date
(*Note: separate the key and value by the tab character ('\t'), and separate elements within the key/value by a comma followed by a space. This applies to all tasks below.)
Your output format should conform to the format of following examples:
1307964308 GBH2444, 74, 46, 2016-03-07
4617863450 HAM2650, 0, 36, 2016-03-24
To complete this task,
1) Write a MapReduce job and run it on Hadoop using 2 reducers.
2) Write a Spark program.
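For the MapReduce part of Task 1, the mapper can tag each record with its source file and the reducer can drop any summons number that also appears in open-violations.csv (a reduce-side join). A minimal sketch, assuming summons_number is column 0 in both files and that plate_id, violation_precinct, violation_code, and issue_date are columns 1-4 of the parking file (adjust the indices to the real schema):

```python
#!/usr/bin/env python
# Hadoop Streaming sketch for Task 1 (reduce-side join).
# Assumptions (not from the source): summons_number is column 0 in both
# files, and plate_id, violation_precinct, violation_code, issue_date
# are columns 1-4 of the parking file -- adjust to the real schema.
import csv
import os

def map_line(line, input_file):
    """Emit a (summons_number, tagged_value) pair for one input line.
    In a streaming job, input_file comes from the
    mapreduce_map_input_file environment variable."""
    row = next(csv.reader([line]))
    if 'open' in os.path.basename(input_file):
        return (row[0], 'O')                     # ticket is still open
    return (row[0], 'P|' + ', '.join(row[1:5]))  # parking-record payload

def reduce_group(summons_number, tagged_values):
    """Return the output line for one summons number, or None if the
    ticket also appears in open-violations.csv (i.e., is unpaid).
    Hadoop groups the mapper output by key before calling the reducer."""
    vals = list(tagged_values)
    if 'O' in vals:
        return None
    payloads = [v[2:] for v in vals if v.startswith('P|')]
    return ('%s\t%s' % (summons_number, payloads[0])) if payloads else None
```

In Spark, the same result can be obtained by keying both RDDs on summons_number and using subtractByKey (or a left outer join) before formatting the output lines.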
Task 2: Find the frequencies of the violation types in parking_violations.csv, i.e., for each violation code, count the number of violations issued with that code.
Output: A key-value pair per line, where:
key = violation_code
value = number of violations
Here are sample output lines, one key-value pair per line:
1 159
46 100
To complete this task,
1) Write a MapReduce job and run it on Hadoop using 2 reducers.
2) Write a Spark program.
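The Task 2 counting logic is the classic word-count pattern. A sketch of the mapper and the per-key sum, assuming violation_code is column 2 of parking_violations.csv (the real column index may differ):

```python
#!/usr/bin/env python
# Hadoop Streaming sketch for Task 2 (word-count pattern).
# Assumption (not from the source): violation_code is column 2 of
# parking_violations.csv -- adjust the index to the real schema.
import csv
from itertools import groupby

def map_line(line):
    """Emit a (violation_code, 1) pair for one CSV line."""
    row = next(csv.reader([line]))
    return (row[2], 1)

def reduce_all(sorted_pairs):
    """Sum the counts per code. The input must be sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    for code, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield '%s\t%d' % (code, sum(c for _, c in group))
```

The reducer count is set on the job's command line, e.g. -D mapreduce.job.reduces=2 for a streaming job. In Spark, the same counts fall out of lines.map(lambda r: (r[2], 1)).reduceByKey(lambda a, b: a + b).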