CSIT 5800 Introduction to Big Data¶
Assignment 1 – Data Pre-processing and Exploratory Analysis¶

Description¶
In this assignment, you will have an opportunity to apply the data pre-processing techniques that you learned in class to a problem. In addition, you will do exploratory analysis on the given dataset.
To get started on this assignment, download the given dataset and carefully read the description on this page. Please note that your program must be implemented entirely in Python.


Intended Learning Outcomes¶
• Upon completion of this assignment, you should be able to:
1. Demonstrate your understanding of how to pre-process data using the algorithms / techniques described in class.
2. Use simple descriptive statistical approaches to understand your data.
3. Construct a Python program to analyse the data and draw simple conclusions from it.

Required Libraries¶
The following libraries are required for this assignment:
1. NumPy – Numerical Python
2. SciPy – Scientific Python
3. Matplotlib – Python 2D plotting library
4. Seaborn – Visualization library based on Matplotlib
5. Pandas – Python data analysis library

Dataset ~ House Prices (house-train.csv)¶
This dataset consists of sales prices of houses in Ames, Iowa (The Ames Housing Dataset). The training dataset has 1460 instances with unique Ids, sales prices, and 79 more features.
• Pricing: Monetary values, one of which is the sale price we are trying to determine.
Examples: SalePrice, MiscVal
• Dates: Time-based data about when the house was built, remodeled, or sold.
Examples: YearBuilt, YearRemodAdd, GarageYrBlt, YrSold
• Quality/Condition: Categorical assessments of various features of the house, most likely from the property assessor.
Examples: PoolQC, SaleCondition, GarageQual, HeatingQC
• Property Features: Categorical collection of additional features and attributes of the building.
Examples: Foundation, Exterior1st, BsmtFinType1, Utilities
• Square Footage: Area measurements of sections of the building and of features such as porches and the lot (all in square feet).
Examples: TotalBsmtSF, GrLivArea, GarageArea, PoolArea, LotArea
• Room/Feature Count: Quantitative counts (rather than categories) of features such as rooms; prime candidates for feature engineering.
Examples: FullBath, BedroomAbvGr, Fireplaces, GarageCars
• Neighborhood: Information about the neighborhood, zoning, and lot.
Examples: MSSubClass, LandContour, Neighborhood, BldgType
You may refer to the data description for more details (data_description.txt).

Steps:¶
1. Importing data and exploring the features.
2. Cleaning data: Handling missing values.
3. Transforming Categorical data.
4. Creating new features and dropping redundant features.
5. Analysing data statistically.
6. Transforming Numerical Data: Normalization.

Step 1: Importing data and exploring the features¶
Step 1.1¶
To start working with the House Prices dataset, you will need to import the required libraries, and read the data into a pandas DataFrame.
• Import the following libraries using import statements.
▪ pandas (for data manipulation)
▪ numpy (for multidimensional array computation)
▪ seaborn and matplotlib.pyplot (both for data visualization)
• Read the csv file ‘train.csv’ using Pandas’ read_csv function (pandas.read_csv)
Note: Run a code cell by clicking on the cell and using the keyboard shortcut Shift + Enter.
In [ ]:
# Put your statements here
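A minimal sketch of this step, assuming the conventional aliases pd and np. So that the snippet runs without the assignment file, it reads a tiny in-memory CSV with made-up values; in the notebook you would pass the path of the dataset file to pd.read_csv instead.

```python
import io

import numpy as np
import pandas as pd
# The plotting libraries are conventionally imported as:
#   import seaborn as sns
#   import matplotlib.pyplot as plt

# Hypothetical three-row stand-in for the dataset file (values are made up).
toy_csv = io.StringIO(
    "Id,LotArea,Street,SalePrice\n"
    "1,8000,Pave,200000\n"
    "2,9500,Pave,180000\n"
    "3,12000,Grvl,220000\n"
)

# read_csv accepts a file path or any file-like object.
trainData = pd.read_csv(toy_csv)
print(trainData.shape)
```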

Step 1.2¶
Use the head function (pandas.DataFrame.head) of the pandas library to preview the first 10 rows of the data.
In [ ]:
# Put your statement here

Step 1.3¶
Use the tail function (pandas.DataFrame.tail) of the pandas library to preview the last 10 rows of the data.
In [ ]:
# Put your statement here

Step 1.4¶
Display information on the DataFrame using the info function (pandas.DataFrame.info) of the pandas library.
In [ ]:
# Put your statement here
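The preview and summary calls from Steps 1.2 to 1.4 can be sketched on a small stand-in frame. The frame `toy` and its values are hypothetical; in the notebook the same methods would be called on the DataFrame you loaded in Step 1.1.

```python
import pandas as pd

# Hypothetical 25-row frame standing in for the housing data.
toy = pd.DataFrame({
    "Id": range(1, 26),
    "SalePrice": range(100000, 125000, 1000),
})

first_rows = toy.head(10)  # first 10 rows
last_rows = toy.tail(10)   # last 10 rows
toy.info()                 # prints column names, dtypes and non-null counts
```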

Step 1.5¶
Exploring the data
Step 1.5.0¶
Use select_dtypes function (pandas.DataFrame.select_dtypes) of pandas library to get the features (excluding SalePrice and Id) that are numerical (i.e. not categorical).
In [ ]:
numerical_features = trainData.select_dtypes(exclude=["object"]).columns
numerical_features = numerical_features.drop("SalePrice")
numerical_features = numerical_features.drop("Id")
print("# of numerical features: " + str(len(numerical_features)))
print(numerical_features)
# Run the above code

Step 1.5.1¶
Use select_dtypes function (pandas.DataFrame.select_dtypes) of pandas library to get the features that are categorical.
In [ ]:
# Put your statements here
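As a hedged sketch of the selection idea, the snippet below picks the non-numeric columns of a made-up frame. It uses exclude=["number"] rather than include=["object"], which is an assumption on my part: it also covers string columns that are not stored with the object dtype.

```python
import pandas as pd

# Hypothetical frame with one numerical and two categorical columns.
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 12000],
    "Street": ["Pave", "Pave", "Grvl"],
    "SaleCondition": ["Normal", "Abnorml", "Normal"],
})

# Everything that is not numeric is treated as categorical here.
categorical_features = toy.select_dtypes(exclude=["number"]).columns
print("# of categorical features: " + str(len(categorical_features)))
print(categorical_features)
```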

Step 1.6¶
Evaluate the data quality & perform missing values assessment using isnull function (pandas.isnull) and sum function (pandas.DataFrame.sum) of pandas library.
In [ ]:
# Put your statements here
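One way to read this step: isnull marks each missing cell as True, and sum counts the True values per column. The frame below is a made-up example, and the filtering and sorting at the end are optional conveniences rather than part of the task.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "SalePrice": [200000, 180000, 220000, 140000],
})

# Count missing values per column, then keep only the affected columns.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```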

What is your observation? (Write your observation here.)

Step 1.7¶
Evaluate the distribution of categorical features using describe function (pandas.DataFrame.describe) of the pandas library.
Note: if you cannot see all the features, use the command
pd.options.display.max_columns = 81
In [ ]:
# Put your statements here
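By default describe summarizes only numerical columns, so the categorical ones have to be requested explicitly. The sketch below does this on a made-up frame by excluding numeric dtypes; passing an include argument for the categorical dtype is an equally valid choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8000, 9500, 12000, 7000],
})

pd.options.display.max_columns = 81  # show every column when printing

# For non-numeric columns describe reports count, unique, top and freq.
summary = toy.describe(exclude=[np.number])
print(summary)
```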

What is/are your observation(s)? (Write your observation(s) here.)

Step 2: Cleaning data: Handling missing values¶

Step 2.1 Not Really NA Values¶
According to the data description, NA actually has a particular meaning for many features.
But the value “NA” will be regarded as missing values in DataFrame.
We need to replace those by another value.

Step 2.1.0¶
Considering the feature Alley, data description says NA means “no alley access”. Use fillna function (pandas.DataFrame.fillna) of pandas library to replace those NA values with “None”.

In [ ]:
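# Assign the result back: calling fillna(..., inplace=True) on a selected
# column may not update trainData in recent pandas versions.
trainData["Alley"] = trainData["Alley"].fillna("None")
# Run the above code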

Step 2.1.1¶
Similarly, for features:
BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2,
the data description says NA for basement features is “no basement”.
Use fillna function (pandas.DataFrame.fillna) of pandas library to replace those NA values with “No”.

In [ ]:
# Put your statements here

Step 2.1.2¶
Similarly, for features:

• Fence : data description says NA means “no fence”
• FireplaceQu : data description says NA means “no fireplace”
• Functional : data description says NA means typical
• GarageType etc : data description says NA for garage features is “no garage”
• PoolQC : data description says NA for pool quality is “no pool”
• MiscFeature: Miscellaneous feature not covered in other categories, NA means no miscellaneous features
Use fillna function (pandas.DataFrame.fillna) of pandas library to replace those NA values with “No”.

In [ ]:
#Put your statements here
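When several columns share the same replacement value, fillna can be applied to all of them at once. This is a sketch on a made-up frame; the list name `na_means_absent` is hypothetical, and only two of the features listed above are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NA in Fence/PoolQC means "absent", NA in LotFrontage is truly missing.
toy = pd.DataFrame({
    "Fence": ["MnPrv", np.nan, np.nan],
    "PoolQC": [np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0],
})

# Fill only the columns where NA carries the meaning "no such feature".
na_means_absent = ["Fence", "PoolQC"]
toy[na_means_absent] = toy[na_means_absent].fillna("No")
print(toy)
```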

Step 2.2¶
Besides the features above, there are two other features with missing values:
MasVnrType and MasVnrArea 

Step 2.2.0¶
If we look at those instances with missing values of MasVnrType, we can observe that those instances will also have values of MasVnrArea missing.
In [ ]:
tempData = trainData.isnull()[["MasVnrType", "MasVnrArea"]]
tempData.loc[tempData.MasVnrType == True]
# Run the above code

Step 2.2.1¶
First, let’s explore the MasVnrType feature. For this feature, evaluate the distribution using countplot function (seaborn.countplot) of seaborn library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 2.2.2¶
Use the most common value of the feature to impute the missing values. Again, fillna function (pandas.DataFrame.fillna) of the pandas library can be used.
In [ ]:
# Put your statement here
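Imputing with the most common value can be sketched with Series.mode, which ignores missing entries by default and returns the most frequent value(s). The column values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"MasVnrType": ["None", "BrkFace", np.nan, "None", np.nan]})

# mode() returns a Series (there may be ties); take the first entry.
most_common = toy["MasVnrType"].mode()[0]
toy["MasVnrType"] = toy["MasVnrType"].fillna(most_common)
print(toy["MasVnrType"].value_counts())
```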

Step 2.2.3¶
Then, let’s look at the feature: MasVnrArea.
Since we have replaced the missing values of MasVnrType with “None”, we should look at the values of MasVnrArea for those instances whose MasVnrType is “None”.
To do this, evaluate the distribution using the countplot function (seaborn.countplot) of the seaborn library.
Note: to get those instances whose value of MasVnrType is “None”:
trainData.loc[trainData.MasVnrType == "None"]
In [ ]:
# Put your statement here

What is your observation? (Write your observation here.)

Step 2.2.4¶
Use the most common value of the feature to impute the missing values. Again, fillna function (pandas.DataFrame.fillna) of the pandas library can be used.
In [ ]:
# Put your statement here

Step 2.3¶
The feature LotFrontage also has missing values.
Step 2.3.1¶
For the feature LotFrontage, evaluate its distribution using hist function (pandas.DataFrame.hist) of pandas library.
In [ ]:
# Put your statements here

Step 2.3.2¶
Compute the mean OR median of the feature LotFrontage using the mean (pandas.DataFrame.mean) or median (pandas.DataFrame.median) function of the pandas library.
Note: You have to skip all the missing values when computing the mean or median.
In [ ]:
# Put your statements here

Step 2.3.3¶
Use the mean / median to impute the missing values of the feature LotFrontage. The fillna function (pandas.DataFrame.fillna) of the pandas library can be used.
In [ ]:
# Put your statement here
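A sketch of numerical imputation on a made-up column, using the median as the fill value (the mean works the same way). pandas skips missing values by default when computing either statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0, np.nan]})

# median() ignores NaN by default (skipna=True).
median_frontage = toy["LotFrontage"].median()
toy["LotFrontage"] = toy["LotFrontage"].fillna(median_frontage)
print(toy)
```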

Step 2.4¶
Since there is only one missing instance in the feature ‘Electrical’, we will keep the feature and just delete that instance.
In [ ]:
trainData = trainData.drop(trainData.loc[trainData['Electrical'].isnull()].index)
# Run the above code

Step 2.5¶
For the last feature with missing values, GarageYrBlt, we will simply drop the feature using the drop function (pandas.DataFrame.drop) of the pandas library, since the feature ‘YearBuilt’ already exists.
In [ ]:
# Put your statement here
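Dropping a column can be sketched as follows on a made-up frame. The columns keyword is one of two equivalent spellings; drop(labels, axis=1) does the same thing.

```python
import pandas as pd

# Hypothetical frame containing the redundant column.
toy = pd.DataFrame({
    "YearBuilt": [1995, 2003],
    "GarageYrBlt": [1995.0, 2004.0],
    "SalePrice": [200000, 180000],
})

# drop returns a new DataFrame, so assign the result back.
toy = toy.drop(columns=["GarageYrBlt"])
print(toy.columns.tolist())
```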

Step 3: Transforming data¶

Step 3.1¶
Using the pandas.DataFrame.replace function, transform the following numerical features to categorical.
• MSSubClass:
20 : "SC20", 30 : "SC30", 40 : "SC40", 45 : "SC45", 50 : "SC50", 60 : "SC60", 70 : "SC70", 75 : "SC75", 80 : "SC80", 85 : "SC85", 90 : "SC90", 120 : "SC120", 150 : "SC150", 160 : "SC160", 180 : "SC180", 190 : "SC190"
• MoSold:
1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 7 : "Jul", 8 : "Aug", 9 : "Sep", 10 : "Oct", 11 : "Nov", 12 : "Dec"
In [ ]:
# Put your statements here
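DataFrame.replace accepts a nested dictionary of the form {column: {old value: new value}}, so several columns can be recoded in one call. The sketch below uses a made-up frame and only a few of the codes listed above.

```python
import pandas as pd

# Hypothetical frame with two numerically coded categorical features.
toy = pd.DataFrame({"MSSubClass": [20, 60, 20], "MoSold": [2, 12, 7]})

# Outer keys are column names, inner dictionaries map old values to new ones.
toy = toy.replace({
    "MSSubClass": {20: "SC20", 60: "SC60"},
    "MoSold": {2: "Feb", 7: "Jul", 12: "Dec"},
})
print(toy)
```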

Step 3.2¶
Using the pandas.DataFrame.replace function, transform the values of the following categorical features to numerical values.
• Alley : {"Grvl" : 1, "Pave" : 2}
• BsmtCond : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• BsmtExposure : {"No" : 0, "Mn" : 1, "Av" : 2, "Gd" : 3}
• BsmtFinType1 : {"No" : 0, "Unf" : 1, "LwQ" : 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6}
• BsmtFinType2 : {"No" : 0, "Unf" : 1, "LwQ" : 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6}
• BsmtQual : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• ExterCond : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• ExterQual : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• FireplaceQu : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• Functional : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod" : 5, "Min2" : 6, "Min1" : 7, "Typ" : 8}
• GarageCond : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• GarageQual : {"No" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• HeatingQC : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• KitchenQual : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
• LandSlope : {"Sev" : 1, "Mod" : 2, "Gtl" : 3}
• LotShape : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4}
• PavedDrive : {"N" : 0, "P" : 1, "Y" : 2}
• PoolQC : {"No" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4}
• Street : {"Grvl" : 1, "Pave" : 2}
• Utilities : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4}
In [ ]:
# Put your statements here
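Because several features share the same quality scale, one dictionary can be reused across columns. This is a sketch on a made-up frame; the name `quality_scale`, the explicit object dtype, and the final cast to int are my own choices, added so the result is numeric regardless of how pandas handles the replaced values.

```python
import pandas as pd

# Hypothetical frame; explicit object dtype so the columns can hold the integer codes.
toy = pd.DataFrame(
    {"ExterQual": ["TA", "Gd", "Ex"], "PavedDrive": ["Y", "N", "P"]},
    dtype=object,
)

# Shared ordinal scale, reused for every column that uses these labels.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = toy.replace({
    "ExterQual": quality_scale,
    "PavedDrive": {"N": 0, "P": 1, "Y": 2},
})
toy = toy.astype(int)  # make the recoded columns explicitly numeric
print(toy)
```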

Step 4: Creating new features and dropping redundant features¶

Step 4.0¶
We can create new features by combining some existing features. For example, we can combine GrLivArea with TotalBsmtSF to form a new feature called TotalSF.
• Define a new feature ‘TotalSF’ and assign it the sum of GrLivArea and TotalBsmtSF.
In [ ]:
trainData["TotalSF"] = trainData["GrLivArea"] + trainData["TotalBsmtSF"]
# Run the above code

Step 4.1¶
Similarly, we can create more new features from the following:

• Overall quality of the house: product of OverallQual and OverallCond
• Overall quality of the garage: product of GarageQual and GarageCond
• Overall quality of the exterior: product of ExterQual and ExterCond
• Overall kitchen score: product of KitchenAbvGr and KitchenQual
• Overall fireplace score: product of Fireplaces and FireplaceQu
• Overall garage score: product of GarageArea and GarageQual
• Overall pool score: product of PoolArea and PoolQC
• Total number of bathrooms: sum of BsmtFullBath, half of BsmtHalfBath, FullBath and half of HalfBath
• Total SF for 1st + 2nd floors: sum of 1stFlrSF and 2ndFlrSF
• Total SF for porch: sum of OpenPorchSF, EnclosedPorch, 3SsnPorch, and ScreenPorch


In [ ]:
# Put your statements here
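Products and weighted sums of columns are plain element-wise arithmetic in pandas. The sketch below builds three of the features listed above on a made-up frame; the new column names OverallGrade, TotalBath and AllFlrsSF are hypothetical, since the assignment does not prescribe names.

```python
import pandas as pd

# Hypothetical two-row frame with the source columns.
toy = pd.DataFrame({
    "OverallQual": [7, 5], "OverallCond": [5, 6],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "1stFlrSF": [856, 1200], "2ndFlrSF": [854, 0],
})

# Product of two quality ratings.
toy["OverallGrade"] = toy["OverallQual"] * toy["OverallCond"]

# Half baths count as 0.5 each.
toy["TotalBath"] = (toy["BsmtFullBath"] + 0.5 * toy["BsmtHalfBath"]
                    + toy["FullBath"] + 0.5 * toy["HalfBath"])

# Column names starting with a digit need bracket access.
toy["AllFlrsSF"] = toy["1stFlrSF"] + toy["2ndFlrSF"]
print(toy[["OverallGrade", "TotalBath", "AllFlrsSF"]])
```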

Step 4.2¶
• The feature “Id” can be dropped (using drop function (pandas.DataFrame.drop) of the pandas library).
In [ ]:
# Put your statement here

Step 5: Analysing data statistically and graphically¶

Step 5.1¶
Use the describe function (pandas.DataFrame.describe) of the pandas library to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution.
In [ ]:
# Put your statement here

Step 5.2¶
We can explore the correlations between features.
Step 5.2.1¶
Use the pandas.DataFrame.corr function to compute the pairwise correlations.
In [ ]:
# Put your statements here
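A sketch of the correlation call on a made-up frame. The numeric_only=True argument is an assumption about your data: it is needed if non-numeric columns remain in the frame, and it can be omitted once every column is numeric.

```python
import pandas as pd

# Hypothetical frame in which SalePrice is an exact linear function of GrLivArea.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [100000, 150000, 200000, 250000],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix)
```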

Step 5.2.2¶
Display the pairwise correlation with a heatmap using seaborn.heatmap.
In [ ]:
# Put your statements here

Give two observations. (Write your observation(s) here.)

Step 5.3¶
Using Scatter Plot
Step 5.3.1¶
Explore SalePrice with respect to GrLivArea using the scatter function (matplotlib.pyplot.scatter) of the matplotlib library.
In [ ]:
# Put your statements here
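A scatter plot of one feature against SalePrice can be sketched as below, using made-up values. The matplotlib.use("Agg") line only selects a non-interactive backend so the snippet also runs outside a notebook; inside Jupyter it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values standing in for the housing data.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 4500],
    "SalePrice": [100000, 150000, 210000, 160000],
})

# x: living area, y: sale price; each point is one house.
points = plt.scatter(toy["GrLivArea"], toy["SalePrice"])
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
```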

What is your observation? (Write your observation here.)
What can we do? (Write your answer here.)

Step 5.3.2¶
Explore SalePrice with respect to TotalSF using scatter function (matplotlib.pyplot.scatter) of matplotlib library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 5.3.3¶
Explore SalePrice with respect to YearBuilt using scatter function (matplotlib.pyplot.scatter) of matplotlib library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 5.4¶
Using Count Plot
Step 5.4.1¶
Explore MoSold (Month Sold) using countplot function (seaborn.countplot) of seaborn library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 5.5¶
Using Box Plot
Step 5.5.1¶
Explore the feature OverallQual with respect to SalePrice using the boxplot function (seaborn.boxplot) of the seaborn library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 5.5.2¶
Explore the new feature (Total Number of Bathrooms) created in step 4.1 with respect to SalePrice using boxplot function (seaborn.boxplot) of seaborn library.
In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 5.5.3¶
Explore Neighborhood with respect to SalePrice using the boxplot function (seaborn.boxplot) of the seaborn library.
Note: you may want to change the size of the plot using the matplotlib.pyplot.subplots function

f, ax = plt.subplots(figsize=(26, 12))
before creating the box plot.

In [ ]:
# Put your statements here

What is your observation? (Write your observation here.)

Step 6: Normalization¶

Step 6.1¶
Explore the distribution of SalePrice using seaborn.displot function.
In [ ]:
# Put your statements here

Step 6.2¶
We can also get the skewness and kurtosis using pandas.DataFrame.skew and pandas.DataFrame.kurt functions.
In [ ]:
# Put your statements here
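Skewness measures the asymmetry of a distribution and kurtosis the heaviness of its tails. The sketch below computes both on a made-up price series with one very large value, which pulls the skewness above zero.

```python
import pandas as pd

# Hypothetical prices with a long right tail.
prices = pd.Series([100000, 120000, 130000, 150000, 400000], name="SalePrice")

# Positive skewness indicates a longer right tail.
print("Skewness:", prices.skew())
print("Kurtosis:", prices.kurt())
```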

Step 6.3¶
Apply log transformation to SalePrice using numpy.log function.
In [ ]:
# Put your statement here
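A log transformation compresses large values more than small ones, which typically reduces right skew. The sketch below applies numpy.log to a made-up price column and compares the skewness before and after; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with a long right tail.
toy = pd.DataFrame({"SalePrice": [100000, 120000, 130000, 150000, 400000]})

skew_before = toy["SalePrice"].skew()

# Natural logarithm, applied element-wise; all values must be positive.
toy["SalePrice"] = np.log(toy["SalePrice"])

skew_after = toy["SalePrice"].skew()
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
```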

Step 6.4¶
Plot the distribution of SalePrice using the seaborn.displot function again.
In [ ]:
# Put your statement here

Bonus Tasks¶
You may consider working on the following:
1. Investigating whether normalization should be performed on any other features.
2. Performing more/other exploratory data analysis to explore other factors contributing to a higher/lower SalePrice.
Note: The bonus tasks are worth at most 10 points, depending on the amount and quality of the work. This assignment is worth 100 points, so the maximum score including the bonus is 110 points. However, the maximum total score for the two assignments together is 200.
In [ ]: