COMP9321 Data Services Engineering
Term1, 2021
Week 3: Data Wrangling/Pre-processing
2
Removing Unnecessary Data
• Some times you don’t need all the data in the tables so it might help you achieve better performance if you remove the irrelevant data.
• Some columns or rows might be useless for you in the analysis due to having many missing values and replacing them with default values would produce wrong insights.
• Sometimes you are restricted from storage capacity perspective and hence you need to keep what is relevant to the job and drop the others.
• Python has a very good function Drop() to help you with this
3
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Dropping Columns/Raws with NaN values
• Dropping Columns with all NaN values Example:
ohio Colorado Utah 0 NaN 12 11 1NaN 337 2NaN 444 3 NaN 32 22
data.dropna(axis=1, how=’all’) Colorado Utah
0 1211 1 33 7 2 44 4 3 3222
4
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Dropping Columns/Raws with NaN values
• Dropping Raws with all NaN values
Example:
ohio Colorado Utah
data2.dropna(axis=0, how=’all’)
ohio Colorado Utah 1 12.0 33.0 7.0
2 23.0 44.0 4.0
3 34.0 32.0 22.0
0 NaN
1 12.0
2 23.0
3 34.0 32.0 22.0
NaN NaN 33.0 7.0 44.0 4.0
5
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Dropping Columns that are not needed
Example:
ohio Colorado Utah
to_drop = [‘ohio’, ‘Utah’]
data2.drop(to_drop, inplace=True, axis=1)
Colorado 1 33.0 2 44.0 3 32.0
0 NaN
1 12.0
2 23.0
3 34.0 32.0 22.0
NaN NaN 33.0 7.0 44.0 4.0
6
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Dropping Rows on a Condition
• To drop a row based on a condition you use df = df.drop(df[
Example:
ohio Colorado Utah
0 32 0 10.0 1 12.0 33.0 7.0 2 23.0 44.0 4.0 3 34.0 32.0 22.0
df.drop(df[df.Colorado == 0].index, inplace=True)
ohio Colorado Utah 1 12.0 33.0 7.0
2 23.0 44.0 4.0
3 34.0 32.0 22.0
7
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Dropping Duplicate Rows
• To drop duplicate rows we use drop_duplicates function
Example:
ohio Colorado Utah
032 0 10.0 1 12.0 33.0 7.0 2 23.0 44.0 4.0 3 34.0 32.0 22.0 4 12.0 33.0 7.0
df.drop_duplicates()
ohio Colorado Utah 032 0 10.0
1 12.0 33.0 7.0
2 23.0 44.0 4.0
3 34.0 32.0 22.0
8
Formatting data
• Data read from source may not have the correct format (e.g., reading integer as a string)
• Some strings in the data have spacing which might not play well with your analysis at some point.
• The date/time format may not appropriate for your analysis
• Some times the data is generated by a computer program, so it probably has some computer-generated column names, too. Those can be hard to read and understand while working.
9
Formatting data Examples
• Example1 (change data type on read):
df = pd.read_csv(‘mydata.csv’, dtype={‘Integer_Column’: int}) • Example2 (change data type in dataframe)
df[‘column’] = df[‘column’].to_numeric()
df[‘column’] = df[‘column’].astype(str)
• Example3 (Spacing within the values): data[‘Column_with_spacing’].str.strip()
10
Formatting data Examples
• Example4 (unnecessary time item in the date field): df[‘MonthYear’] = pd.to_datetime(df[‘MonthYear’]) df[‘MonthYear’] = df[‘MonthYear’].apply(lambda x: x.date()) • Example5 (rename columns)
data = data.rename(columns = {‘Bad_Name1’:Better_Name1’, ‘Bad_Name2’:’Better_name2’})
11
Manipulating the data
• Merging Data
• Applying a function to data
• Pivot tables
• Change the index of a dataframe
• Groupby
12
Merging Data
• Sometimes in order to have complete dataset you need to Concatenate two datasets when reading from source.
Example: Dataset1=pd.read_csv(‘datasets/project1/dataset1.csv’) Dataset2=pd.read_csv(‘datasets/project1/dataset2.csv’)
Full_data=pd.concat[Dataset1, Dataset2] axis=0, ignore_index=True)
Merging Data (Cont’d)
• Sometimes in order to have complete dataset you need to merge two Dataframes
13
https://s3.amazonaws.com/assets.datacamp.com/production/course_3639/slides/ch3_slides.pdf
Merging Data (Cont’d)
14
https://s3.amazonaws.com/assets.datacamp.com/production/course_3639/slides/ch3_slides.pdf
Patching your Data
combine_first can do some sort of “patching” missing data in the calling object with data from the object you pass
15
https://s3.amazonaws.com/assets.datacamp.com/production/course_3639/slides/ch3_slides.pdf
16
http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb
Applying a function to the entire dataset
• Sometimes You need to apply a function on the level of the entire dataset (e.g., removing, adding, averaging)
def cleaning_function(row_data): # Computation steps
# Computation steps
df.apply(cleaning_function, axis=1)
Applying a Function to Columns
• Sometimes You need to apply a function on the level of Columns Example:
1 2 3 4 5 6 7 8
Original Dataframe xyz
a 22 34 23 b 33 31 11 c 44 16 21 d 55 32 22 e 66 33 27 f 77 35 11
1 2
# Apply a function to one column and assign it back to the column in dataframe
df[‘z’] = df[‘z’].apply(np.square, axis=1)
1 2 3 4 5 6 7 8
xyz
a 22 34 529 b 33 31 121 c 44 16 441 d 55 32 484 e 66 33 729 f 77 35 121
17
Pandas: Apply a function to single or selected columns or rows in Dataframe
18
*http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
** https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Pivot Tables
• Summary tables
• Introduce new columns from calculations • Table can have multiple Indexes
• Excel is famous for it
19
http://pbpython.com/pandas-pivot-table-explained.html
Pivot Table Example
Pivot Table Example
20
21
Groupby
• Groupby splits the data into different groups depending on a variable of your choice.
• The output from a groupby and aggregation operation is it a Pandas Series or a Pandas Dataframes?
▪ As a rule of thumb, if you calculate more than one column of results, your result will be a Dataframe. For a single column of results, the agg function, by default, will produce a Series.
22
Groupby Example
• If our dataset is tweets extracted from Twitter and we want to group all the tweets by the username and count the number of tweets each user has
Our_grouped_tweets= df.groupby(‘username’) [‘tweets’].count()
23
Indexing the Dataframe
• Sometimes it is helpful to use a uniquely valued identifying field of the data as its index
• How to check uniqueness? (df[‘Unique_column’].is_unique)
• How to set the index? (df = df.set_index(‘ Unique_column’))
• Is it necessary to have unique vales in column? No, but it will affect the performance
• Pandas supports three types of Multi-axes indexing: .loc() Label based
.iloc() Integer based
.ix() Both Label and Integer based
24
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
Sorting Data
• Sometimes it is required to sort the data according to one or multiple columns.
• Pandas allow this using the function .sort_values()
Example:
df = pd.DataFrame({‘col1’ : [‘A’, ‘A’, ‘B’, np.nan, ‘D’, ‘C’],’col2′ : [2, 1, 9, 8, 7, 4], ‘col3’: [0, 1, 9, 4, 2, 3]})
df.sort_values(by=[‘col1’]) col1 col2 col3
0A20 1A11 2B99 5C43 4D72
3 NaN 8 4
25
Questions?
Useful Read
• Python for Data Analysis, Wes McKinney
• https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-
basic-techniques-that-make-your-data-better/
• https://pandas.pydata.org/pandas-docs/stable/tutorials.html
• https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data- manipulation/
• https://www.dataquest.io/blog/machine-learning-preparing-data/
• https://thispointer.com/pandas-apply-a-function-to-single-or-selected-columns-or-rows-in-
dataframe/
• https://datacarpentry.org/python-ecology-lesson/05-merging-data/index.html
26