FIT5196-S2-2018 assessment 3
This is an individual assessment worth 30% of your total mark for
FIT5196.
Due date: 11:55 pm, Friday, 26 October 2018
For this assessment, you are required to write Python (Python 2/3) code to integrate several
datasets into a single schema, and to find and fix possible problems in the data. The inputs and
outputs of this assessment are shown below:
Table 1. The input and output of the task
Inputs Output Jupyter notebook
Vic_suburb_boundary.zip, GTFS_Melbourne_Train_Information.zip
Each of you is given 7 datasets in various formats; the data is about housing in Victoria,
Australia. Your assessment is to perform the following tasks.
Task 1: Data Integration (65%)
In this task, you are required to integrate these 7 datasets into one with the following schema.
Table 2. Description of the final schema
COLUMN DESCRIPTION
Property_id A unique id for the property
lat The property latitude
lng The property longitude
addr_street The property address
suburb (15/100) The property suburb. Default value: “not available”
price The property price
property_type The type of the property
year The year in which the property was sold
bedrooms Number of bedrooms
bathrooms Number of bathrooms
parking_space The number of parking spaces at the property
Shopping_center_id (5/100) The closest shopping center to the property. Default value: “not available”
Distance_to_sc (5/100) The Euclidean distance from the closest shopping center to the property. Default value: 0
Train_station_id (10/100) The closest train station to the property. Default value: 0
Distance_to_train_station (5/100) The Euclidean distance from the closest train station to the property. Default value: 0
travel_min_to_CBD (15/100) The average travel time (in minutes) from the closest train station to “Flinders Street” station on weekdays (i.e. Monday to Friday), for trips departing between 7 and 9 am. For example, if 3 trips depart from the closest train station to Flinders Street station on weekdays between 7 and 9 am and take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3 = 7. If there are direct trips (no transfer between trains) between the closest station and Flinders Street station, only the direct trips should be averaged. Default value: 0
Transfer_flag (15/100) A Boolean attribute indicating whether there is a direct trip to Flinders Street station from the closest station between 7 and 9 am on weekdays. This flag is 0 if there is a direct trip (i.e. no transfer between trains is required to get from the closest train station to Flinders Street station) and 1 otherwise. Default value: -1
Hospital_id (5/100) The closest hospital to the property. Default value: “not available”
Distance_to_hospital (5/100) The Euclidean distance from the closest hospital to the property. Default value: 0
Supermarket_id (5/100) The closest supermarket to the property. Default value: “not available”
Distance_to_supermaket (5/100) The Euclidean distance from the closest supermarket to the property. Default value: 0
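As a rough illustration of the travel_min_to_CBD and Transfer_flag rows above, the sketch below averages the durations of direct weekday trips departing in the 7 to 9 am window. It is only a sketch under stated assumptions: the tuple layout of `example_trips`, the helper names `to_minutes` and `travel_summary`, and the choice to treat the window as inclusive at both ends are all hypothetical and not part of the specification. Extracting such trips from the GTFS files is not shown, and the case with no direct trip (where transfers would have to be considered) is simply left at the default travel time.

```python
from statistics import mean


def to_minutes(hhmmss):
    """Convert an 'HH:MM:SS' string to minutes since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 60 + m + s / 60


def travel_summary(direct_trips, start="07:00:00", end="09:00:00"):
    """Return (travel_min_to_CBD, Transfer_flag) for one closest station.

    direct_trips: iterable of (is_weekday, departure, arrival) tuples for
    trips that reach Flinders Street without changing trains (assumed layout).
    """
    lo, hi = to_minutes(start), to_minutes(end)
    durations = [
        to_minutes(arrival) - to_minutes(departure)
        for is_weekday, departure, arrival in direct_trips
        if is_weekday and lo <= to_minutes(departure) <= hi
    ]
    if not durations:
        # No direct trip in the window: flag 1. The travel time is left at
        # the default 0 here; averaging trips with transfers is not sketched.
        return 0, 1
    return mean(durations), 0


# Invented trips mirroring the 6, 7, and 8 minute example in the table.
example_trips = [
    (True, "07:10:00", "07:16:00"),
    (True, "08:00:00", "08:07:00"),
    (True, "08:30:00", "08:38:00"),
    (False, "07:30:00", "07:50:00"),  # weekend trip, ignored
    (True, "10:00:00", "10:30:00"),   # outside the window, ignored
]
print(travel_summary(example_trips))  # (7.0, 0)
```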
Task 2: Data Reshaping (15%)
In this task, you need to study the effect of different normalization/transformation methods (i.e.
standardization, min-max normalization, and log, power, and square-root transformations) on the
“price” attribute, and to observe and explain their effect on the price distribution. You also need
to compare the methods with one another, assuming that we want to build a linear model of price
using “bedrooms”, “bathrooms”, “parking_space”, and “property_type” as predictors, and
recommend which one(s) you think would work better on this data.
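To make the methods named in Task 2 concrete, here is a minimal sketch of each one applied to a short list of prices, together with a simple skewness measure for comparing the resulting distributions. The price values are invented for illustration, the function names are hypothetical, and the population standard deviation and the exponent `p=2` are assumptions rather than choices made by the specification. The sketch does not say which method suits the real data.

```python
import math
from statistics import mean, pstdev


def standardize(xs):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log_transform(xs):
    """Natural log; requires strictly positive values."""
    return [math.log(x) for x in xs]


def sqrt_transform(xs):
    """Square root; requires non-negative values."""
    return [math.sqrt(x) for x in xs]


def power_transform(xs, p=2):
    """Raise every value to the power p (p=2 is an arbitrary example)."""
    return [x ** p for x in xs]


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    return mean([z ** 3 for z in standardize(xs)])


# Invented, right-skewed prices (not taken from the assignment data).
prices = [300_000, 450_000, 520_000, 610_000, 750_000, 2_400_000]

for name, values in [
    ("raw", prices),
    ("standardized", standardize(prices)),
    ("min-max", min_max(prices)),
    ("log", log_transform(prices)),
    ("sqrt", sqrt_transform(prices)),
    ("power (p=2)", power_transform(prices)),
]:
    print(f"{name:>13}: skewness = {skewness(values):.2f}")
```

Standardization and min-max normalization are linear rescalings, so they change the scale of the values but not the shape of the distribution; the log, square-root, and power transformations are non-linear and do change the shape.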
Task 3: Documentation (20%)
The main focus of the documentation is the quality of your explanation for Task 2. However, as in
the previous assignments, your notebook file should be well formatted, with proper sections and
subsections.
Note 1: The output CSV file must have exactly the columns specified in the schema.
If you decide not to calculate a required attribute, you must still include that
attribute in your final dataframe, with its default value in every row. Please
note that an output file that is not in the correct format, as specified in the integrated
schema, will not be marked.
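One way to satisfy Note 1 is to keep the schema and its default values in one place and build every output row from them. The sketch below uses only the standard library and writes to an in-memory buffer in place of the real output file. The column names and defaults are copied from Table 2, while the names `SCHEMA`, `DEFAULTS`, and `to_schema_row`, the sample record, and the use of an empty string for missing columns that have no stated default are assumptions.

```python
import csv
import io

# Column names copied from Table 2, in schema order.
SCHEMA = [
    "Property_id", "lat", "lng", "addr_street", "suburb", "price",
    "property_type", "year", "bedrooms", "bathrooms", "parking_space",
    "Shopping_center_id", "Distance_to_sc", "Train_station_id",
    "Distance_to_train_station", "travel_min_to_CBD", "Transfer_flag",
    "Hospital_id", "Distance_to_hospital", "Supermarket_id",
    "Distance_to_supermaket",
]

# Default values stated in Table 2 (other columns have no stated default).
DEFAULTS = {
    "suburb": "not available",
    "Shopping_center_id": "not available",
    "Distance_to_sc": 0,
    "Train_station_id": 0,
    "Distance_to_train_station": 0,
    "travel_min_to_CBD": 0,
    "Transfer_flag": -1,
    "Hospital_id": "not available",
    "Distance_to_hospital": 0,
    "Supermarket_id": "not available",
    "Distance_to_supermaket": 0,
}


def to_schema_row(record):
    """Return a dict with exactly the schema columns, in schema order.

    Missing attributes get their Table 2 default; columns without a stated
    default fall back to an empty string (an assumption of this sketch).
    """
    return {col: record.get(col, DEFAULTS.get(col, "")) for col in SCHEMA}


# Hypothetical partial record: only a few attributes were calculated.
row = to_schema_row(
    {"Property_id": 1, "lat": -37.81, "lng": 144.96, "suburb": "Melbourne"}
)

buffer = io.StringIO()  # stands in for the output CSV file
writer = csv.DictWriter(buffer, fieldnames=SCHEMA)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```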
Note 2: the radius of the earth is still 6378 km!
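Table 2 calls the distance columns "Euclidean", while Note 2 supplies an earth radius, which suggests a distance computed on the sphere. The specification does not name a formula, so the haversine formula used below is an assumption, as are the function names and the sample coordinates. The sketch also shows how the closest facility and its distance could be picked from a list of candidates.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6378  # value given in Note 2


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lng2 - lng1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest(lat, lng, places):
    """Return (id, distance_km) of the nearest place.

    places: iterable of (id, lat, lng) tuples (assumed layout).
    """
    return min(
        ((pid, haversine_km(lat, lng, plat, plng)) for pid, plat, plng in places),
        key=lambda pair: pair[1],
    )


# Invented coordinates for illustration only.
hospitals = [("hospital_A", -37.81, 144.96), ("hospital_B", -37.90, 145.00)]
nearest_id, distance_km = closest(-37.80, 144.90, hospitals)
print(nearest_id, round(distance_km, 3))
```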
Note 3: In Table 2, the numbers in the format (a/b) next to some of the column names are
the marks allocated to those columns. For example, the column “suburb” carries
15% of the total mark for Task 1.