INFS 5095 – Big Data Basics
Practical Test 3 (SP2 2021)
Due: by 11pm on Sunday 6 June
General instructions
• This exercise is worth 10% of your final grade and it is due no later than 11pm on Sunday 6 June.
• The exercise will be marked out of 20.
• You will need to submit your work via learnonline; upload the two required files together as a single zip file.
Assessment task
In this assessment you are required to answer five questions using Spark. For each question you will need to write code which uses the appropriate transformations and actions.
Our main input file for this assessment is called DataCoSupplyChainDataset.csv, a subset of a dataset from Kaggle, which contains supply chains used by a company called DataCo Global. There is a second file provided, called DescriptionDataCoSupplyChain.csv, which describes the columns in the main dataset.
You should use the following template file to write your code: test3_solutions.py. See the video instructions provided with the assessment instructions for an example of how to use the template.
Q1. Load the data, convert it to a DataFrame, and apply appropriate column names and variable types.
Q2. Determine what proportion of all transactions is attributed to each transaction type in the dataset, e.g. Cash = x%, Debit = y%, etc.
This question uses the Type field.
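As a rough illustration of the arithmetic behind Q2 (not a model answer), the sketch below counts each label and divides by the overall total. It is plain Python on made-up values rather than Spark code: the sample list and the "Transfer" label are placeholders, and `proportions` is a hypothetical helper name. In Spark the counting step would typically be expressed with a transformation such as `reduceByKey`.

```python
from collections import Counter

# Made-up sample of Type values (placeholders, not taken from the real dataset).
types = ["Cash", "Debit", "Debit", "Transfer", "Debit", "Cash", "Transfer", "Debit"]

def proportions(values):
    """Return {label: percentage of all records} for an iterable of labels."""
    counts = Counter(values)            # label -> number of occurrences
    total = sum(counts.values())        # total number of records
    return {label: 100.0 * n / total for label, n in counts.items()}

result = proportions(types)
for label, pct in sorted(result.items()):
    print(f"{label} = {pct:.1f}%")
```

The percentages should always add up to 100, which is a quick sanity check worth applying to your own output.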
Q3. Determine which three products had the least amount of sales. This question uses the Order Item Total and Product Name fields.
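One possible reading of Q3 is "sum Order Item Total per Product Name, then take the three smallest totals"; that interpretation is an assumption here. The sketch below shows the pattern in plain Python on made-up rows, with `least_sold` as a hypothetical helper name. In Spark, the summing step would usually be a `reduceByKey` and the selection step something like `takeOrdered` with a key function.

```python
from collections import defaultdict

# Made-up (Product Name, Order Item Total) pairs, for illustration only.
rows = [
    ("Tent", 300.0), ("Gloves", 20.0), ("Tent", 250.0), ("Socks", 15.0),
    ("Gloves", 25.0), ("Kayak", 900.0), ("Cap", 30.0), ("Socks", 10.0),
]

def least_sold(pairs, n=3):
    """Sum the amounts per product and return the n products with the smallest totals."""
    totals = defaultdict(float)
    for name, amount in pairs:
        totals[name] += amount          # accumulate sales per product
    # Sort ascending by total and keep the first n entries.
    return sorted(totals.items(), key=lambda kv: kv[1])[:n]

print(least_sold(rows))
```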
Q4. For each number of items bought, determine the average item cost.
This question uses the Order Item Product Price and Order Item Quantity fields.
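An average per key can be built from a running (sum, count) pair for each key, followed by one division at the end. The sketch below shows that pattern in plain Python on made-up rows; `average_by_key` is a hypothetical helper name, and grouping price by quantity is an assumed reading of Q4. In Spark the same idea is commonly written as a `map` to `(key, (value, 1))`, a `reduceByKey` that adds the pairs element-wise, and a `mapValues` that divides.

```python
from functools import reduce

# Made-up (Order Item Quantity, Order Item Product Price) pairs, for illustration only.
rows = [(1, 100.0), (2, 50.0), (1, 200.0), (3, 30.0), (2, 70.0), (1, 300.0)]

def average_by_key(pairs):
    """Return {key: mean of values} using a running (sum, count) per key."""
    def step(acc, pair):
        key, value = pair
        total, count = acc.get(key, (0.0, 0))
        acc[key] = (total + value, count + 1)   # update the running sum and count
        return acc

    sums = reduce(step, pairs, {})
    # Divide each sum by its count only once, at the end.
    return {key: total / count for key, (total, count) in sums.items()}

print(average_by_key(rows))
```

Carrying the count alongside the sum matters because averages cannot be combined directly: averaging two partial averages gives the wrong answer unless both partitions hold the same number of records.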
Q5. What is the most frequently occurring customer name in Puerto Rico? (Repeat transactions by the same customer should not count as separate customers.)
This question uses the Customer Country, Customer Fname and Customer Id fields.
IMPORTANT HINTS
• Q4 will probably be easier if you don’t use the mean action.
• In Q5 you can tell the max action which field to compare on, e.g. max(lambda x: x[5])
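The Q5 hint relies on passing a key function to `max`, so that records are compared on one chosen field rather than as whole tuples. The sketch below shows this in plain Python on made-up rows, together with one way to make sure repeat transactions by the same customer are counted once. The sample data, the "Other Country" value and the helper name `most_common_name` are all placeholders.

```python
from collections import Counter

# Made-up (Customer Country, Customer Fname, Customer Id) rows, for illustration only.
rows = [
    ("Puerto Rico", "Mary", 1), ("Puerto Rico", "Mary", 1), ("Puerto Rico", "Mary", 1),
    ("Puerto Rico", "John", 2), ("Puerto Rico", "John", 3),
    ("Other Country", "Mary", 4), ("Puerto Rico", "Ana", 5),
]

def most_common_name(records, country):
    """Return (name, number of distinct customers) for the most frequent first name."""
    # A set of (name, id) pairs keeps each customer once, however many orders they placed.
    customers = {(fname, cid) for ctry, fname, cid in records if ctry == country}
    name_counts = Counter(fname for fname, _ in customers)
    # The key function tells max to compare on the count, not on the name.
    return max(name_counts.items(), key=lambda x: x[1])

print(most_common_name(rows, "Puerto Rico"))
```

Without the de-duplication step, "Mary" would win in this sample on the strength of a single customer's three transactions, which is exactly what the question says should not happen.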
Submission instructions
You should submit two files in total:
– A .py file, which should be the template file filled with the appropriate code.
– The output file results.txt, generated by the template file.
Make sure to comment your code sufficiently; the quality of your comments will count towards your final mark. Good programmers are good commenters too: your code should be readable by a stranger who wants to use it, or by yourself in a year’s time.
Once finished, zip the two files together and upload your zip file to learnonline.
Distribution of marks
Q1 – 2 Marks
Q2 – 3 Marks
Q3 – 3 Marks
Q4 – 6 Marks
Q5 – 5 Marks
Overall code presentation – 1 Mark
Total of 20 Marks