FIT5196-S2-2018 assessment 3

This is an individual assessment worth 30% of your total mark for FIT5196.

Due date: 11:55 pm, Friday, 26 October 2018

For this assessment, you are required to write Python (Python 2/3) code to integrate several
datasets into a single schema and to find and fix possible problems in the data. The inputs and
outputs of this assessment are shown below:

Table 1. The input and output of the task

Inputs: .rar, Vic_suburb_boundary.zip, GTFS_Melbourne_Train_Information.zip

Output: _solution.csv

Jupyter notebook: _ass3.ipynb

Each of you is given 7 datasets in various formats. The data describes housing in Victoria,
Australia. Your assessment is to perform the following tasks.

Task 1: Data Integration (65%)
In this task, you are required to integrate these 7 datasets into one with the following schema.

Table 2. Description of the final schema

COLUMN DESCRIPTION

Property_id A unique id for the property

lat The property latitude

lng The property longitude

addr_street The property address

suburb (15/100) The property suburb. Default value: “not available”

price The property price

property_type The type of the property

year The year in which the property was sold

bedrooms Number of bedrooms

bathrooms Number of bathrooms

parking_space The number of parking spaces at the property

Shopping_center_id (5/100) The closest shopping center to the property. Default value: “not
available”

Distance_to_sc (5/100) The Euclidean distance from the closest shopping center to the property.
Default value: 0

Train_station_id (10/100) The closest train station to the property. Default value: 0

Distance_to_train_station (5/100) The Euclidean distance from the closest train station to the
property. Default value: 0

travel_min_to_CBD (15/100) The average travel time (in minutes) from the closest train station to
the “Flinders street” station on weekdays (i.e. Monday-Friday) for trips departing between 7 and
9 am. For example, if 3 trips depart from the closest train station to the Flinders street station
on weekdays between 7 and 9 am and take 6, 7, and 8 minutes respectively, then the value of this
column for the property should be (6+7+8)/3. If there are direct trips (no change of train)
between the closest station and the Flinders street station, only the average of the direct trips
should be calculated. Default value: 0

Transfer_flag (15/100) A Boolean attribute indicating whether there is a direct trip to the
Flinders street station from the closest station between 7 and 9 am on weekdays. This flag is 0
if there is a direct trip (i.e. no transfer between trains is required to get from the closest
train station to the Flinders street station) and 1 otherwise. Default value: -1

Hospital_id (5/100) The closest hospital to the property. Default value: “not available”

Distance_to_hospital (5/100) The Euclidean distance from the closest hospital to the property.
Default value: 0

Supermarket_id (5/100) The closest supermarket to the property. Default value: “not available”

Distance_to_supermaket (5/100) The Euclidean distance from the closest supermarket to the
property. Default value: 0
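The worked example in the travel_min_to_CBD row, together with the Transfer_flag rule, can be sketched as follows. This is only an illustration under stated assumptions, not the required solution: the `Trip` record, the `travel_summary` function, and the sample trips are hypothetical, the 7-9 am window is assumed to include both endpoints, and the trips are assumed to have already been filtered to weekday services from the closest station to the Flinders street station.

```python
from collections import namedtuple

# Hypothetical, already-filtered record: one weekday trip from the closest
# station to the Flinders street station. Times are minutes after midnight.
Trip = namedtuple("Trip", ["depart_min", "arrive_min", "direct"])


def travel_summary(trips, window=(7 * 60, 9 * 60)):
    """Return (travel_min_to_CBD, Transfer_flag) for one closest station.

    Falls back to the Table 2 defaults (0 and -1) when no trip departs
    inside the window. The inclusive window is an assumption.
    """
    start, end = window
    in_window = [t for t in trips if start <= t.depart_min <= end]
    if not in_window:
        return 0, -1
    direct = [t for t in in_window if t.direct]
    # When at least one direct trip exists, only direct trips are averaged.
    chosen = direct if direct else in_window
    durations = [t.arrive_min - t.depart_min for t in chosen]
    flag = 0 if direct else 1
    return sum(durations) / len(durations), flag


# Three made-up direct trips taking 6, 7 and 8 minutes.
trips = [
    Trip(7 * 60 + 5, 7 * 60 + 11, True),
    Trip(7 * 60 + 40, 7 * 60 + 47, True),
    Trip(8 * 60 + 30, 8 * 60 + 38, True),
]
avg_minutes, transfer_flag = travel_summary(trips)
print(avg_minutes, transfer_flag)  # 7.0 0
```

With real GTFS data, the `direct` field and the trip durations would have to be derived from the timetable files rather than supplied by hand.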

Task 2: Data Reshaping (15%)

In this task, you need to study the effect of different normalization/transformation methods (i.e.
standardization, min-max normalization, and log, power and square-root transformations) on the
“price” attribute, and to observe and explain their effect on the price distribution. You also
need to compare the methods with each other, assuming that we want to build a linear model on
price using “bedrooms”, “bathrooms”, “parking_space”, and “property_type” as the predictors, and
recommend which one(s) you think would work better on this data.
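As a starting point for the comparison, the sketch below applies each listed method to a small list of prices and reports the skewness of the result. Everything in it is an assumption made for illustration: the prices are made up, the function names are hypothetical, squaring is used as one possible reading of "power transformation", and only the standard library is used to keep the example self-contained.

```python
import math
import statistics


def standardize(xs):
    """Z-score: subtract the mean, divide by the population std deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def minmax(xs):
    """Rescale linearly so the values fall in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def skewness(xs):
    """Population skewness (third standardized moment)."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / len(xs)


# Made-up, right-skewed prices; real values would come from the "price" column.
prices = [350000, 420000, 480000, 510000, 650000, 900000, 2500000]

transforms = {
    "raw": prices,
    "standardized": standardize(prices),
    "minmax": minmax(prices),
    "log": [math.log(p) for p in prices],
    "sqrt": [math.sqrt(p) for p in prices],
    "power (square)": [p ** 2 for p in prices],  # one possible power transform
}

for name, values in transforms.items():
    print(f"{name:>15}: skewness = {skewness(values):.3f}")
```

Standardization and min-max normalization are linear rescalings, so they change the scale of price but not the shape of its distribution. The log and square-root transformations compress the right tail of this sample, and squaring stretches it.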

Task 3: Documentation (20%)

The main focus of the documentation is the quality of your explanation for task 2. As in the
previous assignments, your notebook file should also be well formatted, with proper sections and
subsections.

Note 1: the output csv file must have exactly the columns specified in the schema.
If you decide not to calculate any of the required attributes, then you must still include that
attribute in your final dataframe, with the default value in every row. Please
note that an output file that is not in the correct format, as specified in the integrated
schema, won’t be marked.
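One way to guard against a malformed output file is to keep the schema in a single place and fill in defaults from it. The sketch below is an assumption-laden illustration: the names `SCHEMA`, `complete_row`, and `to_csv` are hypothetical, the column names and defaults are transcribed from Table 2 exactly as printed there, and `None` (written as an empty cell) is an arbitrary placeholder for columns that Table 2 lists without a default.

```python
import csv
import io

# (column name, default) pairs in Table 2 order. None marks columns for
# which Table 2 gives no default; that placeholder is an assumption.
SCHEMA = [
    ("Property_id", None), ("lat", None), ("lng", None),
    ("addr_street", None), ("suburb", "not available"), ("price", None),
    ("property_type", None), ("year", None), ("bedrooms", None),
    ("bathrooms", None), ("parking_space", None),
    ("Shopping_center_id", "not available"), ("Distance_to_sc", 0),
    ("Train_station_id", 0), ("Distance_to_train_station", 0),
    ("travel_min_to_CBD", 0), ("Transfer_flag", -1),
    ("Hospital_id", "not available"), ("Distance_to_hospital", 0),
    ("Supermarket_id", "not available"), ("Distance_to_supermaket", 0),
]
COLUMNS = [name for name, _ in SCHEMA]


def complete_row(partial):
    """Return a full row, using the schema default for any missing column."""
    unknown = set(partial) - set(COLUMNS)
    if unknown:
        raise KeyError("columns not in schema: %s" % sorted(unknown))
    return {name: partial.get(name, default) for name, default in SCHEMA}


def to_csv(rows):
    """Serialize rows to CSV text with the header in schema order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(complete_row(row))
    return buffer.getvalue()


# A made-up property for which only a few attributes were computed.
text = to_csv([{"Property_id": 1, "lat": -37.81, "lng": 144.96}])
print(text.splitlines()[0])
```

Rejecting unknown keys in `complete_row` catches misspelled column names early, before they silently turn into extra or missing columns in the output file.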

Note 2: the radius of the earth is still 6378 km!
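Note 2 fixes the radius but does not say which distance formula to use with it, and Table 2 speaks of Euclidean distance. The sketch below uses the haversine great-circle formula, which is one common way to turn two latitude/longitude pairs and a radius into kilometres. Treat that choice as an assumption; the function names, the facility tuples, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6378.0  # value given in Note 2


def haversine_km(lat1, lng1, lat2, lng2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def closest(lat, lng, facilities, default_id="not available"):
    """Return (facility_id, distance_km) for the nearest facility.

    facilities is an iterable of (id, lat, lng) tuples. With no facilities,
    the Table 2 style defaults (default_id, 0) are returned.
    """
    candidates = [
        (haversine_km(lat, lng, f_lat, f_lng), f_id)
        for f_id, f_lat, f_lng in facilities
    ]
    if not candidates:
        return default_id, 0
    dist, f_id = min(candidates)
    return f_id, dist


# Made-up facilities and property location.
hospitals = [("A", -37.80, 144.96), ("B", -37.90, 145.10)]
hospital_id, distance = closest(-37.8136, 144.9631, hospitals)
print(hospital_id, round(distance, 3))
```

The same `closest` helper could serve the shopping center, train station, hospital, and supermarket columns, with `default_id` set to 0 for Train_station_id to match its default in Table 2.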

Note 3: In Table 2, the numbers in the format (a/b) after some of the column names are
the marks allocated to that column. For example, column “suburb” carries
15% of the total mark of task 1.