2021S2-workshop-week2-lab-solution
Elements Of Data Processing (2021S2) – Week 2¶
Pandas¶
Libraries contain useful resources, such as classes and subroutines, that you can use in your programs.
Pansas is a library that contains high-level data structures and manipulation tools for faster analysis. As with most libraries, an API reference is available which details all of the functionality provided by pandas. This lab will focus on the two most important data structures provided by pandas, the Series and Data Frame.
It’s worth reading through the Intro to Data Structures article on the pandas website to familiarise yourself with these two data structures. There are also a number of step-by-step tutorials available online, such as this one by DataCamp which are worth following.
Before we can use any of the Pandas functions, we must import the Pandas library using the following line.
In [4]:
import pandas as pd
Series¶
One-dimensional array-like object containing the array of data and an associated array of data labels called index.
The basic method to create a Series:
– s = Series(data, index=index)
Here, data can be different things, including:
– a list
– an array
– a dictionary
Example 1 : Create a Basic Series Object¶
In [5]:
# Series constructor with data as a list of integers
l = [4,3,-5,9,1,7]
s = pd.Series(l)
In [6]:
# The default indexing starts from zero
s.index
Out[6]:
RangeIndex(start=0, stop=6, step=1)
In [7]:
# Retrieve the values of the series
s.values
Out[7]:
array([ 4, 3, -5, 9, 1, 7], dtype=int64)
In [8]:
# Create your own index using lists
newIndex = [‘a’,’b’,’c’,’d’,’e’,’f’]
s.index = newIndex
In [9]:
# Verify the index
s
Out[9]:
a 4
b 3
c -5
d 9
e 1
f 7
dtype: int64
In [10]:
# Create a series from a python dict
Aus_Emission = {‘1990’:15.45288167, ‘2000’:17.20060983, ‘2007’:17.86526004,
‘2008’:18.16087566,’2009′:18.20018196,’2010′:16.92095367,
‘2011’:16.86260095, ‘2012’:16.51938578, ‘2013’:16.34730205}
co2_Emission = pd.Series(Aus_Emission)
In [11]:
# Retrieve the values of the series
co2_Emission.values
Out[11]:
array([15.45288167, 17.20060983, 17.86526004, 18.16087566, 18.20018196,
16.92095367, 16.86260095, 16.51938578, 16.34730205])
In [12]:
# Verify the series object
co2_Emission
Out[12]:
1990 15.452882
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
dtype: float64
Slicing¶
Slicing allows you to take part of a Series or DataFrame, in order to visualise it separately or perform more detailed analysis. You can select sections of list-like types (arrays, tuples, NumPy arrays) by using various slice notations:
In [13]:
# Slicing the series using a boolean array operation
co2_Emission[co2_Emission>16.0]
Out[13]:
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
dtype: float64
In [14]:
# Slicing the series using a time period
co2_Emission[:’2000′]
Out[14]:
1990 15.452882
2000 17.200610
dtype: float64
In [15]:
# Doubling the values of the series object
doubled = co2_Emission*2
doubled
Out[15]:
1990 30.905763
2000 34.401220
2007 35.730520
2008 36.321751
2009 36.400364
2010 33.841907
2011 33.725202
2012 33.038772
2013 32.694604
dtype: float64
In [16]:
# Finding the average value of the series
co2_Emission.mean()
Out[16]:
17.05889462333333
In [17]:
# Defining the column name
co2_Emission.name = ‘CO2 Emission’
In [18]:
# Defining the name of the index
co2_Emission.index.name = ‘Year’
In [19]:
# Verifying the series object
co2_Emission
Out[19]:
Year
1990 15.452882
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
Name: CO2 Emission, dtype: float64
Exercise 1 ¶
Pandas Series objects have both ndarray-like and dict-like properties. Given the co2_Emission series object, do the following:
Similar to the average of the series object, retrieve the maximum, median and cumulative sum of CO2 emission between 1960 to 2013 (max(), median() and cumsum() methods).
Retrieve the CO2 emissions in Australia between 2000 to 2010.
Given the population of Australia in 2013 was 23117353, retrieve the CO2 emission per capita for that year.
In [20]:
###answer here
# max
co2_Emission.max()
Out[20]:
18.20018196
In [21]:
#median
co2_Emission.median()
Out[21]:
16.92095367
In [22]:
# cumulative sum
print(“Hello: \n”+ str( co2_Emission.cumsum()))
Hello:
Year
1990 15.452882
2000 32.653492
2007 50.518752
2008 68.679627
2009 86.879809
2010 103.800763
2011 120.663364
2012 137.182750
2013 153.530052
Name: CO2 Emission, dtype: float64
In [23]:
# co2 emission between 2000 to 2010
co2_Emission[‘2000′:’2010’]
Out[23]:
Year
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
Name: CO2 Emission, dtype: float64
In [24]:
# computing the co2 emission per capita
p = 23117353
co2_Emission[‘2013’]/p
Out[24]:
7.071441981268357e-07
DataFrames¶
DataFrames represents tabular data structure and can contain multiple rows and columns. They can be thought of as a dictionary of Series objects, and are one of the most important data structures you will use to store and manipulate information in data science.
A DataFrame has both row and column indices.
The Pandas DataFrame structure contains many useful methods to aid your analysis. The API reference is available which details all of the functionality provided by pandas. You will particularly need con consult the DataFrame reference page.
In [25]:
# as before, begin by importing the pandas library
import pandas as pd
In [26]:
# create a new series of the population
Aus_Population = {‘1990’:17065100, ‘2000’:19153000, ‘2007’:20827600,
‘2008’:21249200,’2009′:21691700,’2010′:22031750,
‘2011’:22340024, ‘2012’:22728254, ‘2013’:23117353}
population = pd.Series(Aus_Population)
In [27]:
# we will reuse the emissions data
Aus_Emission = {‘1990’:15.45288167, ‘2000’:17.20060983, ‘2007’:17.86526004,
‘2008’:18.16087566,’2009′:18.20018196,’2010′:16.92095367,
‘2011’:16.86260095, ‘2012’:16.51938578, ‘2013’:16.34730205}
co2_Emission = pd.Series(Aus_Emission)
In [28]:
# verify the values in the series
population
Out[28]:
1990 17065100
2000 19153000
2007 20827600
2008 21249200
2009 21691700
2010 22031750
2011 22340024
2012 22728254
2013 23117353
dtype: int64
In [29]:
# create a DataFrame object from the series objects
australia = pd.DataFrame({‘co2_emission’:co2_Emission, ‘Population’:population})
australia
Out[29]:
co2_emission Population
1990 15.452882 17065100
2000 17.200610 19153000
2007 17.865260 20827600
2008 18.160876 21249200
2009 18.200182 21691700
2010 16.920954 22031750
2011 16.862601 22340024
2012 16.519386 22728254
2013 16.347302 23117353
In [30]:
# create a DataFrame from a csv file
countries = pd.read_csv(‘data/countries.csv’,encoding = ‘ISO-8859-1’)
In [31]:
# check the top 10 countries in the DataFrame
countries.head(10) # the default value is set to 5
Out[31]:
Country Region IncomeGroup
0 Afghanistan South Asia Low income
1 Albania Europe & Central Asia Upper middle income
2 Algeria Middle East & North Africa Upper middle income
3 American Samoa East Asia & Pacific Upper middle income
4 Andorra Europe & Central Asia High income
5 Angola Sub-Saharan Africa Upper middle income
6 Antigua and Barbuda Latin America & Caribbean High income
7 Argentina Latin America & Caribbean Upper middle income
8 Armenia Europe & Central Asia Lower middle income
9 Aruba Latin America & Caribbean High income
In [32]:
# count the number of countries in each region
countries.Region.value_counts()
Out[32]:
Europe & Central Asia 58
Sub-Saharan Africa 48
Latin America & Caribbean 42
East Asia & Pacific 37
Middle East & North Africa 21
South Asia 8
North America 3
Name: Region, dtype: int64
In [33]:
# set the name of countries as the index
countries.set_index(‘Country’)
Out[33]:
Region IncomeGroup
Country
Afghanistan South Asia Low income
Albania Europe & Central Asia Upper middle income
Algeria Middle East & North Africa Upper middle income
American Samoa East Asia & Pacific Upper middle income
Andorra Europe & Central Asia High income
… … …
Virgin Islands (U.S.) Latin America & Caribbean High income
West Bank and Gaza Middle East & North Africa Lower middle income
Yemen, Rep. Middle East & North Africa Lower middle income
Zambia Sub-Saharan Africa Lower middle income
Zimbabwe Sub-Saharan Africa Low income
217 rows × 2 columns
In [34]:
# create a new DataFrame for the CO2 emission from a csv file
emission = pd.read_csv(‘data/emission.csv’,encoding = ‘ISO-8859-1’)
#emission.head()
In [35]:
# Create a subset of emission dataset for Year 2010
yr2010 = emission[‘2010’]
names = emission[‘Country’]
yr2010.index = names
type(yr2010)
Out[35]:
pandas.core.series.Series
In [36]:
# Sort column values using sort_values
yr2010.sort_values()
Out[36]:
Country
Burundi 0.022480
Congo, Dem. Rep. 0.030197
Chad 0.043463
Rwanda 0.057354
Central African Republic 0.059398
…
Sint Maarten (Dutch part) NaN
South Sudan NaN
St. Martin (French part) NaN
Tuvalu NaN
Virgin Islands (U.S.) NaN
Name: 2010, Length: 217, dtype: float64
In [37]:
#Sort column values to find the top countries
yr2010.sort_values(ascending = False)
Out[37]:
Country
Qatar 41.131162
Trinidad and Tobago 36.073741
Kuwait 29.294309
Aruba 24.182702
Bahrain 23.101200
…
Sint Maarten (Dutch part) NaN
South Sudan NaN
St. Martin (French part) NaN
Tuvalu NaN
Virgin Islands (U.S.) NaN
Name: 2010, Length: 217, dtype: float64
Exercise 2 ¶
Retrieve the mean, median of CO2 emission generated in 2012 by all countries.
Retrieve the top 5 countries with the most CO2 emission in 2012. How about the 5 countries with the least emission? (remember that sort_values has an ascending parameter that is set to True by default).
Retrieve the sum of CO2 emission for all years and find the 2 years with the maximum CO2 emission.
In [38]:
##answer here
# CO2 emission generated in 2012 by all countries
yr2012 = emission[‘2012’]
yr2012.index = names
type(yr2012)
yr2012.mean()
Out[38]:
5.161476912617648
In [39]:
yr2012.median()
Out[39]:
2.7530064495
In [40]:
# top 5 CO2 emitting countries
sorted2012 = yr2012.sort_values(ascending=False)
top5 = sorted2012[:5]
top5
Out[40]:
Country
Qatar 46.697477
Curacao 39.638551
Trinidad and Tobago 33.819694
Kuwait 29.578729
Brunei Darussalam 23.800884
Name: 2012, dtype: float64
In [41]:
sorted2012 = yr2012.sort_values()
least5 = sorted2012[:5]
least5
Out[41]:
Country
Burundi 0.027888
Congo, Dem. Rep. 0.036414
Chad 0.042682
Mali 0.061677
Central African Republic 0.063505
Name: 2012, dtype: float64
In [42]:
res = emission.sum()[1:] # columns To ignore header
res.sort_values(ascending=False)[:2]
Out[42]:
2012 1052.94129
2007 1025.977262
dtype: object
More Sort Operations¶
Pandas allows you to sort your DataFrame by rows/columns as follows:
In [43]:
# Sort column values of a DataFrame
sorted2012 = emission.sort_values( by = ‘2012’,ascending = False )
sorted2012
Out[43]:
Country 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
158 Qatar 24.712027 58.522169 53.672746 46.684906 43.504331 41.131162 42.213752 46.697477 40.462355 NaN NaN NaN
50 Curacao NaN NaN NaN NaN NaN NaN NaN 39.638551 34.161635 NaN NaN NaN
196 Trinidad and Tobago 13.879875 18.844281 36.816763 35.455298 33.952984 36.073741 35.137310 33.819694 34.520237 NaN NaN NaN
105 Kuwait 23.466084 27.759023 29.636849 30.581264 30.300789 29.294309 28.102662 29.578729 27.258964 NaN NaN NaN
28 Brunei Darussalam 24.105188 14.255144 22.474463 23.950011 20.311171 20.856947 24.272670 23.800884 18.918736 NaN NaN NaN
… … … … … … … … … … … … … …
157 Puerto Rico NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
163 San Marino NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
182 St. Martin (French part) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
201 Tuvalu NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
212 Virgin Islands (U.S.) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
217 rows × 13 columns
In [44]:
# Sort column values using two columns
sorted2012 = emission.sort_values( by = [‘2012′,’2013’],ascending = [False, True] )
sorted2012
Out[44]:
Country 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
158 Qatar 24.712027 58.522169 53.672746 46.684906 43.504331 41.131162 42.213752 46.697477 40.462355 NaN NaN NaN
50 Curacao NaN NaN NaN NaN NaN NaN NaN 39.638551 34.161635 NaN NaN NaN
196 Trinidad and Tobago 13.879875 18.844281 36.816763 35.455298 33.952984 36.073741 35.137310 33.819694 34.520237 NaN NaN NaN
105 Kuwait 23.466084 27.759023 29.636849 30.581264 30.300789 29.294309 28.102662 29.578729 27.258964 NaN NaN NaN
28 Brunei Darussalam 24.105188 14.255144 22.474463 23.950011 20.311171 20.856947 24.272670 23.800884 18.918736 NaN NaN NaN
… … … … … … … … … … … … … …
157 Puerto Rico NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
163 San Marino NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
182 St. Martin (French part) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
201 Tuvalu NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
212 Virgin Islands (U.S.) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
217 rows × 13 columns
Slicing using the .loc and .iloc method¶
Slicing allows you to take part of your DataFrame. You can use the .iloc method to select data using row/column numbers, or use .loc to select data using row/column headings. See this article for more examples
In [45]:
# Slicing using a range of rows and range of columns
emission.iloc[2:5,2:6]
Out[45]:
2000 2007 2008 2009
2 2.819778 3.195865 3.168524 3.430129
3 NaN NaN NaN NaN
4 8.018181 6.350868 6.296125 6.049173
In [46]:
# Slicing using specific rows and specific columns
emission.loc[[3,5],[‘Country’,’1990′]]
Out[46]:
Country 1990
3 American Samoa NaN
5 Angola 0.459698
In [47]:
# Specific rows and all columns
emission.loc[[3,5],:]
Out[47]:
Country 1990 2000 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
3 American Samoa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 Angola 0.459698 0.633625 1.311096 1.295687 1.354389 1.369339 1.382752 1.472255 1.384496 NaN NaN NaN
In [48]:
# All rows and specific columns
emission.loc[:,[‘Country’,’1990′]]
Out[48]:
Country 1990
0 Afghanistan 0.216661
1 Albania 1.615624
2 Algeria 3.007911
3 American Samoa NaN
4 Andorra NaN
… … …
212 Virgin Islands (U.S.) NaN
213 West Bank and Gaza NaN
214 Yemen, Rep. 0.802312
215 Zambia 0.300362
216 Zimbabwe 1.478723
217 rows × 2 columns
Exercise 3 ¶
Create a DataFrame object that has the name, region and IncomeGroup of the top 10 emitting countries in 2012.
In [68]:
##answer here
# create a DataFrame for the countries
countries = pd.read_csv(‘data/countries.csv’,encoding = ‘ISO-8859-1’, index_col = ‘Country’)
# finding the top 10 emitting countries in 2012
sorted2012 = emission[‘2012’].sort_values(ascending=False)[:10]
print(countries)
print(sorted2012)
Region IncomeGroup
Country
Afghanistan South Asia Low income
Albania Europe & Central Asia Upper middle income
Algeria Middle East & North Africa Upper middle income
American Samoa East Asia & Pacific Upper middle income
Andorra Europe & Central Asia High income
… … …
Virgin Islands (U.S.) Latin America & Caribbean High income
West Bank and Gaza Middle East & North Africa Lower middle income
Yemen, Rep. Middle East & North Africa Lower middle income
Zambia Sub-Saharan Africa Lower middle income
Zimbabwe Sub-Saharan Africa Low income
[217 rows x 2 columns]
Country
Qatar 46.697477
Curacao 39.638551
Trinidad and Tobago 33.819694
Kuwait 29.578729
Brunei Darussalam 23.800884
Bahrain 21.810997
Sint Maarten (Dutch part) 21.489636
Luxembourg 20.084219
United Arab Emirates 19.252223
Saudi Arabia 19.188937
Name: 2012, dtype: float64
In [50]:
sorted2012 = emission[‘2012’].sort_values(ascending=False)[:10]
cntry = sorted2012.index
# Find the information from countries dataframe
cntryInfo = (countries.loc[c,:] for c in cntry)
# create a new dataframe form cntryInfo
top10 = pd.DataFrame(cntryInfo)
top10
Out[50]:
Region IncomeGroup
Qatar Middle East & North Africa High income
Curacao Latin America & Caribbean High income
Trinidad and Tobago Latin America & Caribbean High income
Kuwait Middle East & North Africa High income
Brunei Darussalam East Asia & Pacific High income
Bahrain Middle East & North Africa High income
Sint Maarten (Dutch part) Latin America & Caribbean High income
Luxembourg Europe & Central Asia High income
United Arab Emirates Middle East & North Africa High income
Saudi Arabia Middle East & North Africa High income
In [51]:
#Another solution
sorted2012 = emission[‘2012’].sort_values(ascending=False)[:10]
cntry = sorted2012.index
top10 = countries.loc[cntry,:]
top10
Out[51]:
Region IncomeGroup
Country
Qatar Middle East & North Africa High income
Curacao Latin America & Caribbean High income
Trinidad and Tobago Latin America & Caribbean High income
Kuwait Middle East & North Africa High income
Brunei Darussalam East Asia & Pacific High income
Bahrain Middle East & North Africa High income
Sint Maarten (Dutch part) Latin America & Caribbean High income
Luxembourg Europe & Central Asia High income
United Arab Emirates Middle East & North Africa High income
Saudi Arabia Middle East & North Africa High income
In [69]:
#Yet another solution
top10 = pd.merge(emission[‘2012’], countries, left_on=[‘Country’], right_on=[‘Country’])
top10 = top10.sort_values(ascending=False, by=’2012′)[:10]
top10
Out[69]:
2012 Region IncomeGroup
Country
Qatar 46.697477 Middle East & North Africa High income
Curacao 39.638551 Latin America & Caribbean High income
Trinidad and Tobago 33.819694 Latin America & Caribbean High income
Kuwait 29.578729 Middle East & North Africa High income
Brunei Darussalam 23.800884 East Asia & Pacific High income
Bahrain 21.810997 Middle East & North Africa High income
Sint Maarten (Dutch part) 21.489636 Latin America & Caribbean High income
Luxembourg 20.084219 Europe & Central Asia High income
United Arab Emirates 19.252223 Middle East & North Africa High income
Saudi Arabia 19.188937 Middle East & North Africa High income
Groupby¶
The Groupby method lets you separate the data into different groups based off shared characteristics. For example, we could group countries by region or income range and then analyse those groups individually. The official documentation on groupby can be found here. This tutorial is also well worth reading.
Groupby¶
Exercise 4 ¶
Using Countries data frame, group the rows using the Region column.
Show the size of each group
Find the number of high income and low income countries by region
In [47]:
##answer here
# Groupby using the Region column
grouped = countries.groupby(‘Region’)
type(grouped)
# Show the size of each group
for region,group in grouped:
print (region + ‘:\t’ + str(group.shape[0]))
#print (group.shape[0])#, type(group))
East Asia & Pacific: 37
Europe & Central Asia: 58
Latin America & Caribbean: 42
Middle East & North Africa: 21
North America: 3
South Asia: 8
Sub-Saharan Africa: 48
In [48]:
# Find the number of high income and low income countries by region
for region,group in grouped:
count = group[‘IncomeGroup’].value_counts()
high_low = count.filter(items = [‘High income’, ‘Low income’] , axis=0)
for i in high_low.index:
print(‘\t’ + i + ‘: ‘ + str(high_low[i]))
High income: 13
Low income: 1
High income: 37
High income: 16
Low income: 1
High income: 8
High income: 3
Low income: 2
High income: 1
Low income: 27
In [ ]:
In [ ]:
In [ ]:
In [ ]: