
L3 – Preprocessing

Data Preprocessing with Pandas¶

Import Modules¶

In [18]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Import data¶
The data comes from the Australian Bureau of Statistics; some cells have been manually removed (set to NaN) for the purposes of this notebook.

In [19]:

df = pd.read_csv('./asset/Median Price of Established House Transfers.txt', sep='\t') # row 3 has a null value
df.head()

Out[19]:

City Quarter Price
0 Sydney Q1 700.0
1 Rest of NSW Q1 370.0
2 Melbourne Q1 525.0
3 Rest of Vic. Q1 NaN
4 Brisbane Q1 460.0

Finding and Handling Missing Data¶

In [20]:

# find the rows where Price is null
df[pd.isnull(df['Price'])]

Out[20]:

City Quarter Price
3 Rest of Vic. Q1 NaN
13 Rest of NT Q1 NaN
21 Adelaide Q2 NaN
27 Darwin Q2 NaN
40 Hobart Q3 NaN
49 Brisbane Q4 NaN

In [21]:

index_with_null = df[pd.isnull(df['Price'])].index
index_with_null

Out[21]:

Int64Index([3, 13, 21, 27, 40, 49], dtype='int64')

We can specify a value (e.g., 0) to replace those null values via the fillna() method.

In [22]:

df2 = df.fillna(0) # the price of row 3 is set to 0.0
df2.loc[index_with_null]

Out[22]:

City Quarter Price
3 Rest of Vic. Q1 0.0
13 Rest of NT Q1 0.0
21 Adelaide Q2 0.0
27 Darwin Q2 0.0
40 Hobart Q3 0.0
49 Brisbane Q4 0.0
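
fillna() also accepts a dict mapping column names to fill values, so we can fill only the Price column and leave any other columns untouched. A minimal sketch (not executed as part of this notebook):

In [ ]:

# Fill nulls only in the Price column
df2 = df.fillna({'Price': 0})
df2.loc[index_with_null]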

We can also propagate non-null values forward or backward.

In [23]:

df2 = df.ffill() # ffill() is the modern equivalent of fillna(method='pad'): propagate the last valid value down each column
df2.head() # the price of row 3 is the same as that of row 2

Out[23]:

City Quarter Price
0 Sydney Q1 700.0
1 Rest of NSW Q1 370.0
2 Melbourne Q1 525.0
3 Rest of Vic. Q1 525.0
4 Brisbane Q1 460.0
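
Backward propagation works the same way with bfill(); a minimal sketch (not executed here):

In [ ]:

# Backward-fill: propagate the next valid value up each column
df2 = df.bfill()
df2.head() # the price of row 3 would be copied from row 4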

We can even drop the rows (or columns) with null values.

In [24]:

df2 = df.dropna(axis=0) # with axis=1, columns containing null values are dropped instead
df2.head() # Note that row 3 is dropped

Out[24]:

City Quarter Price
0 Sydney Q1 700.0
1 Rest of NSW Q1 370.0
2 Melbourne Q1 525.0
4 Brisbane Q1 460.0
5 Rest of Qld. Q1 400.0
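
dropna() can also be restricted to particular columns via subset, or keep rows that have a minimum number of non-null values via thresh; a minimal sketch (not executed here):

In [ ]:

# Drop a row only if its Price is null, ignoring other columns
df2 = df.dropna(subset=['Price'])

# Keep only rows with at least 2 non-null values
df2 = df.dropna(thresh=2)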

Obviously, none of the above solutions is appropriate here.

A better way to deal with the null values is to replace each of them with the mean price of the corresponding city over the whole year.

In [25]:

df["Price"] = df.groupby("City")["Price"].transform(lambda x: x.fillna(x.mean()))
df.loc[index_with_null]

Out[25]:

City Quarter Price
3 Rest of Vic. Q1 294.666667
13 Rest of NT Q1 422.500000
21 Adelaide Q2 420.433333
27 Darwin Q2 579.166667
40 Hobart Q3 359.000000
49 Brisbane Q4 468.633333
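
We can sanity-check the imputation: each filled value should equal the mean of that city's observed prices. A minimal sketch (not executed here):

In [ ]:

# Map each previously-null row's city to its mean price
city_means = df.groupby('City')['Price'].mean()
df.loc[index_with_null, 'City'].map(city_means) # should match the Price values above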

Binning¶

Equal-width Partitioning¶

We use the table with all null values filled in.
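
As a quick aside, pd.cut (used in the next cell) derives the bin edges simply by splitting the range of the data into equal widths; a minimal sketch of that computation (not executed here):

In [ ]:

# Equal-width edges by hand: 5 intervals of width (max - min) / 5
lo, hi = df['Price'].min(), df['Price'].max() # 240 and 835 for this data
np.linspace(lo, hi, num=6) # 6 edges delimiting 5 bins: 240, 359, 478, 597, 716, 835

Note that pd.cut also lowers the first edge by 0.1% of the range (here 240 becomes 239.405) so that the minimum value falls inside the first bin.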

In [26]:

pd.cut(df['Price'], 5).head() # equally partition Price into 5 bins

Out[26]:

0 (597, 716]
1 (359, 478]
2 (478, 597]
3 (239.405, 359]
4 (359, 478]
Name: Price, dtype: category
Categories (5, object): [(239.405, 359] < (359, 478] < (478, 597] < (597, 716] < (716, 835]]

In [27]:

# We can label the bins and add them as a new column
df['Bin'] = pd.cut(df['Price'], 5, labels=["Very Low", "Low", "Medium", "High", "Very High"])
df.head()

Out[27]:

City Quarter Price Bin
0 Sydney Q1 700.000000 High
1 Rest of NSW Q1 370.000000 Low
2 Melbourne Q1 525.000000 Medium
3 Rest of Vic. Q1 294.666667 Very Low
4 Brisbane Q1 460.000000 Low

Equal-depth Partitioning¶

In [28]:

pd.qcut(df['Price'], 5).head() # Note the difference from the Equal-width Partitioning case

Out[28]:

0 (555.1, 835]
1 (336.4, 395]
2 (440.5, 555.1]
3 [240, 336.4]
4 (440.5, 555.1]
Name: Price, dtype: category
Categories (5, object): [[240, 336.4] < (336.4, 395] < (395, 440.5] < (440.5, 555.1] < (555.1, 835]]

In [29]:

# Let's check the depth of each bin
df['Bin'] = pd.qcut(df['Price'], 5, labels=["Very Low", "Low", "Medium", "High", "Very High"])
df.groupby('Bin').size()

Out[29]:

Bin
Very Low 12
Low 12
Medium 12
High 12
Very High 12
dtype: int64

Smoothing¶

Smoothing by Bin Means¶

In [30]:

df.head()

Out[30]:

City Quarter Price Bin
0 Sydney Q1 700.000000 Very High
1 Rest of NSW Q1 370.000000 Low
2 Melbourne Q1 525.000000 High
3 Rest of Vic. Q1 294.666667 Very Low
4 Brisbane Q1 460.000000 High

In [31]:

df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean')
df.head()

Out[31]:

City Quarter Price Bin Price-Smoothing-mean
0 Sydney Q1 700.000000 Very High 634.763889
1 Rest of NSW Q1 370.000000 Low 372.416667
2 Melbourne Q1 525.000000 High 519.961111
3 Rest of Vic. Q1 294.666667 Very Low 270.805556
4 Brisbane Q1 460.000000 High 519.961111

Smoothing by Bin Max¶

In [32]:

df['Price-Smoothing-max'] = df.groupby('Bin')['Price'].transform('max')
df.head()

Out[32]:

City Quarter Price Bin Price-Smoothing-mean Price-Smoothing-max
0 Sydney Q1 700.000000 Very High 634.763889 835.0
1 Rest of NSW Q1 370.000000 Low 372.416667 387.5
2 Melbourne Q1 525.000000 High 519.961111 555.0
3 Rest of Vic. Q1 294.666667 Very Low 270.805556 302.0
4 Brisbane Q1 460.000000 High 519.961111 555.0

Normalization¶

In [33]:

df = pd.read_csv('./asset/Median Price of Established House.txt', sep='\t')
df.head()

Out[33]:

City Q1 Q2 Q3 Q4
0 Sydney 700.0 750.0 750.0 835.0
1 Rest of NSW 370.0 375.0 373.0 385.0
2 Melbourne 525.0 555.5 540.0 580.0
3 Rest of Vic. 300.0 295.0 287.0 302.0
4 Brisbane 460.0 474.9 471.0 485.0

Standard Scaler¶

In [34]:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(df[df.columns[1:5]]) # exclude the first (City) column
df_standard = pd.DataFrame(x_scaled)
df_standard.insert(0, 'City', df.City)
df_standard

Out[34]:

City 0 1 2 3
0 Sydney 2.150188 2.360452 2.379522 2.611997
1 Rest of NSW -0.544010 -0.477782 -0.506402 -0.488501
2 Melbourne 0.721446 0.888355 0.771978 0.855048
3 Rest of Vic. -1.115507 -1.083272 -1.164729 -1.060370
4 Brisbane 0.190771 0.278324 0.243785 0.200499
5 Rest of Qld. -0.299083 -0.250723 -0.299718 -0.309361
6 Adelaide -0.176620 -0.212880 -0.155804 -0.195676
7 Rest of SA -1.401255 -1.340605 -1.409688 -1.246400
8 Perth 0.925552 0.771041 0.848528 0.682798
9 Rest of WA -0.401136 -0.462645 -0.452817 -0.485056
10 Hobart -0.568503 -0.704841 -0.682466 -0.626300
11 Rest of Tas. -1.523719 -1.499546 -1.333139 -1.349750
12 Darwin 1.252122 0.986747 0.963353 0.872273
13 Rest of NT -0.176620 -0.099351 -0.127481 -0.247351
14 Canberra 0.966373 0.846727 0.925078 0.786148

Robust Scaler¶

In [35]:

from sklearn import preprocessing

scaler = preprocessing.RobustScaler()
x_scaled = scaler.fit_transform(df[df.columns[1:5]]) # exclude the first (City) column
df_robust = pd.DataFrame(x_scaled)
df_robust.insert(0, 'City', df.City)
df_robust

Out[35]:

City 0 1 2 3
0 Sydney 1.686391 1.837838 1.804905 2.213333
1 Rest of NSW -0.266272 -0.189189 -0.249591 -0.186667
2 Melbourne 0.650888 0.786486 0.660490 0.853333
3 Rest of Vic. -0.680473 -0.621622 -0.718256 -0.629333
4 Brisbane 0.266272 0.350811 0.284469 0.346667
5 Rest of Qld. -0.088757 -0.027027 -0.102452 -0.048000
6 Adelaide 0.000000 0.000000 0.000000 0.040000
7 Rest of SA -0.887574 -0.805405 -0.892643 -0.773333
8 Perth 0.798817 0.702703 0.714986 0.720000
9 Rest of WA -0.162722 -0.178378 -0.211444 -0.184000
10 Hobart -0.284024 -0.351351 -0.374932 -0.293333
11 Rest of Tas. -0.976331 -0.918919 -0.838147 -0.853333
12 Darwin 1.035503 0.856757 0.796730 0.866667
13 Rest of NT 0.000000 0.081081 0.020163 0.000000
14 Canberra 0.828402 0.756757 0.769482 0.800000

Histograms¶

In [36]:

df = pd.read_csv('./asset/Median Price of Established House.txt', sep='\t')
df.head()

Out[36]:

City Q1 Q2 Q3 Q4
0 Sydney 700.0 750.0 750.0 835.0
1 Rest of NSW 370.0 375.0 373.0 385.0
2 Melbourne 525.0 555.5 540.0 580.0
3 Rest of Vic. 300.0 295.0 287.0 302.0
4 Brisbane 460.0 474.9 471.0 485.0

In [37]:

# use bins=x to control the number of bins
df.hist(column=['Q1', 'Q3'], bins=6, alpha=0.5, figsize=(16, 6))

Out[37]:

[1x2 array of AxesSubplot objects; histograms of Q1 and Q3 are displayed inline]
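
Returning to the two scalers above: both are simple affine transforms. StandardScaler computes (x - mean) / std per column, while RobustScaler computes (x - median) / IQR. A minimal sketch reproducing the results with plain pandas (not executed here):

In [ ]:

quarters = df[df.columns[1:5]] # the four quarterly price columns

# StandardScaler: (x - mean) / std, using the population std (ddof=0) as sklearn does
standard_manual = (quarters - quarters.mean()) / quarters.std(ddof=0)

# RobustScaler: (x - median) / IQR, where IQR is the 75th minus the 25th percentile
iqr = quarters.quantile(0.75) - quarters.quantile(0.25)
robust_manual = (quarters - quarters.median()) / iqr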

Scatter¶

In [38]:

df.plot.scatter(x='Q1', y='Q3');

A scatter matrix provides a better way to discover relationships in the data.

In [39]:

from pandas.plotting import scatter_matrix # pandas.tools.plotting was removed in newer pandas versions

scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='hist') # set the diagonal figures to be histograms

Out[39]:

[4x4 array of AxesSubplot objects; the scatter matrix of Q1-Q4 is displayed inline]
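
scatter_matrix also accepts diagonal='kde', which draws kernel density estimates instead of histograms on the diagonal; a minimal sketch (not executed here):

In [ ]:

# Kernel density estimates on the diagonal instead of histograms
scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='kde')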


Further reading¶
https://blog.modeanalytics.com/python-data-cleaning-libraries/
