Data Cleansing (70%)
For this assessment, you are required to write Python (Python 2/3) code to analyze your dataset and to find and fix the problems in the data. The input and output of this task are shown in Table 1.
Table 1. The input and output of the task
Input | Output | Jupyter notebook
<student_no>.csv | <student_no>_solution.csv | <student_no>_ass2.ipynb
Exploring and understanding the data is one of the most important parts of the data wrangling process. You are required to apply both graphical and non-graphical EDA methods to understand the data first and then find the data problems. As a starting point, here is all we know about the dataset at hand:
The dataset is about delivering packages using drones in Victoria, Australia. The description of each data column is shown in Table 2.
Table 2. Description of the columns
COLUMN | DESCRIPTION
Id | A unique id for the delivery
Drone type | A categorical attribute for the type of the drone. We know that each type of drone has three phases of flight (namely takeOff, onRoute, and Landing). The drone may have different speeds at different phases. takeOff and Landing phases only take five minutes.
Post type | A categorical attribute for the type of delivery (0:normal, 1:express)
Package weight | The weight of the package
Origin region | A categorical attribute representing the region for the origin of the delivery
Destination region | A categorical attribute representing the region for the destination of the delivery
Origin latitude | Latitude of the origin
Origin longitude | Longitude of the origin
Destination latitude | Latitude of the destination
Destination longitude | Longitude of the destination
Distance | Distance of the journey
Departure date | Date of the departure
Departure time | Time of the departure. We know that the delivery company has a specific rule to define morning (6:00:00 – 11:59:59), afternoon (12:00:00 – 20:59:59), and night (21:00:00 – 5:59:59)
Travel time | Travel time (i.e., duration) of the journey
Delivery time | The time of the delivery
Delivery price | Delivery fare. We know that the fare has a linear relation with some of the attributes of the dataset.
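The time-of-day rule in the Departure time row can be turned into a small helper when checking the time columns. The sketch below is illustrative only: the function names are hypothetical, times are assumed to be stored as HH:MM:SS strings, and the second helper rests on the assumption (not stated in the specification) that the delivery time equals the departure time plus a travel time expressed in minutes.

```python
from datetime import datetime, timedelta

def time_period(departure_time):
    """Map an HH:MM:SS departure time to morning, afternoon, or night."""
    hour = datetime.strptime(departure_time, "%H:%M:%S").hour
    if 6 <= hour < 12:       # 6:00:00 to 11:59:59
        return "morning"
    if 12 <= hour < 21:      # 12:00:00 to 20:59:59
        return "afternoon"
    return "night"           # 21:00:00 to 5:59:59, wrapping past midnight

def expected_delivery_time(departure_time, travel_minutes):
    """Assumption: delivery time = departure time + travel time (minutes)."""
    start = datetime.strptime(departure_time, "%H:%M:%S")
    end = start + timedelta(minutes=travel_minutes)
    # Only the clock time is kept, so the result wraps past midnight.
    return end.strftime("%H:%M:%S")

print(time_period("21:00:00"))
print(expected_delivery_time("23:50:00", 20))
```

Comparing the value returned by such a helper with the recorded delivery time is one possible way to flag rows that deserve a closer look. It does not say which of the three time columns is the faulty one.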
Note 1: the output CSV file must have exactly the same columns as the input.
Note 2: the radius of the earth is 6378 km.
Note 3: as EDA is part of this assessment, no further information will be given publicly regarding the data. However, you can brainstorm with the teaching team during tutorials and consultation sessions.
Note 4: there is at least one error in the dataset from each category of the data anomalies (i.e., syntactic, semantic, and coverage).
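The earth radius in Note 2 suggests that the Distance column can be checked against the origin and destination coordinates. The specification does not say which formula was used, so the sketch below assumes the haversine great-circle distance. The function name and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6378.0  # radius given in Note 2

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees.

    Assumption: the haversine formula is used; the specification only
    supplies the earth radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))

# Hypothetical origin and destination in Victoria, Australia.
print(round(haversine_km(-37.80, 144.90, -37.70, 145.10), 3))
```

A recomputed distance that differs noticeably from the recorded one may point to an error in the Distance column or in one of the coordinates, for example a latitude with the wrong sign or swapped latitude and longitude values.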
Documentation (30%)
The cleaning task must be explained in a well-formatted report (with appropriate sections and subsections). Remember that the report must explain the complete EDA used to examine the data, your methodology for finding the data anomalies, and your suggested approach to fixing them.