2021S2-workshop-week2-lab
Elements Of Data Processing (2021S2) – Week 1
Pandas
Libraries contain useful resources, such as classes and subroutines, that you can use in your programs.
Pandas is a library that contains high-level data structures and manipulation tools for faster analysis. As with most libraries, an API reference is available which details all of the functionality provided by pandas. This lab will focus on the two most important data structures provided by pandas, the Series and the DataFrame.
It’s worth reading through the Intro to Data Structures article on the pandas website to familiarise yourself with these two data structures. There are also a number of step-by-step tutorials available online, such as this one by DataCamp, which are worth following.
Before we can use any of the Pandas functions, we must make sure the library is installed and then import it using the following lines.
In [1]:
pip install pandas
Requirement already satisfied: pandas in c:\users\chris\anaconda3\lib\site-packages (1.2.3)
Requirement already satisfied: pytz>=2017.3 in c:\users\chris\anaconda3\lib\site-packages (from pandas) (2021.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\chris\anaconda3\lib\site-packages (from pandas) (1.19.2)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\chris\anaconda3\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\chris\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [ ]:
import pandas as pd
Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
The basic method to create a Series:
– s = pd.Series(data, index=index)
Here, data can be different things, including:
– a list
– an array
– a dictionary
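As a minimal sketch of the three options above, the cell below builds one Series from each kind of input. The variable names and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index 0, 1, 2, ...
from_list = pd.Series([4, 3, -5])

# From a NumPy array, with an explicit index of labels
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])

# From a dictionary: the keys become the index
from_dict = pd.Series({'a': 10, 'b': 20})

print(from_list.index.tolist())   # [0, 1, 2]
print(from_array['y'])            # 2.5
print(from_dict.index.tolist())   # ['a', 'b']
```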
Example 1: Create a Basic Series Object
In [ ]:
# Series constructor with data as a list of integers
l = [4,3,-5,9,1,7]
s = pd.Series(l)
In [ ]:
# The default indexing starts from zero
s.index
In [ ]:
# Retrieve the values of the series
s.values
In [ ]:
# Create your own index using lists
newIndex = ['a', 'b', 'c', 'd', 'e', 'f']
s.index = newIndex
In [ ]:
# Verify the index
s
In [ ]:
# Create a series from a python dict
Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566, '2009':18.20018196, '2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}
co2_Emission = pd.Series(Aus_Emission)
In [ ]:
# Retrieve the values of the series
co2_Emission.values
In [ ]:
# Verify the series object
co2_Emission
Slicing
Slicing allows you to take part of a Series or DataFrame, in order to visualise it separately or perform more detailed analysis. As with other list-like types (lists, tuples, NumPy arrays), you can select sections of a Series using various slice notations:
In [ ]:
# Slicing the series using a boolean array operation
co2_Emission[co2_Emission>16.0]
In [ ]:
# Slicing the series by index label: everything up to and including '2000'
co2_Emission[:'2000']
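One detail worth seeing in isolation is that slicing by label includes the end label, whereas slicing by position (with .iloc) excludes the end position. The sketch below uses a small made-up Series rather than the emission data.

```python
import pandas as pd

# Made-up values, used only to illustrate slicing
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['1990', '2000', '2007', '2008'])

by_label = s[:'2000']       # label slice: the end label '2000' is included
by_position = s.iloc[:2]    # positional slice: the end position 2 is excluded

print(by_label.index.tolist())        # ['1990', '2000']
print(by_position.index.tolist())     # ['1990', '2000']
print(s.loc['2000':'2008'].tolist())  # [2.0, 3.0, 4.0]
```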
In [ ]:
# Doubling the values of the series object
doubled = co2_Emission*2
doubled
In [ ]:
# Finding the average value of the series
co2_Emission.mean()
In [ ]:
# Defining the name of the series (used as the column name in a DataFrame)
co2_Emission.name = 'CO2 Emission'
In [ ]:
# Defining the name of the index
co2_Emission.index.name = 'Year'
In [ ]:
# Verifying the series object
co2_Emission
Exercise 1
Pandas Series objects have both ndarray-like and dict-like properties. Given the co2_Emission series object, do the following:
Similar to the average of the series object, retrieve the maximum, median and cumulative sum of CO2 emission between 1990 and 2013 (max(), median() and cumsum() methods).
Retrieve the CO2 emissions in Australia between 2000 and 2010.
Given the population of Australia in 2013 was 23117353, retrieve the CO2 emission per capita for that year.
In [ ]:
###answer here
DataFrames
A DataFrame represents a tabular data structure and can contain multiple rows and columns. It can be thought of as a dictionary of Series objects, and is one of the most important data structures you will use to store and manipulate information in data science.
A DataFrame has both row and column indices.
The Pandas DataFrame structure contains many useful methods to aid your analysis. An API reference is available which details all of the functionality provided by pandas. You will particularly need to consult the DataFrame reference page.
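To make the "dictionary of Series" idea concrete, here is a minimal sketch with made-up data. Each Series becomes a column, the rows are aligned on the index labels, and any label missing from one Series is filled with NaN.

```python
import pandas as pd

# Made-up Series with partly different index labels
height = pd.Series({'ann': 160, 'bob': 175})
weight = pd.Series({'bob': 70, 'cat': 55})

# Each Series becomes a column; rows are aligned on the union of the labels
people = pd.DataFrame({'height': height, 'weight': weight})

print(people.index.tolist())          # row index: ['ann', 'bob', 'cat']
print(people.columns.tolist())        # column index: ['height', 'weight']
print(people.loc['bob', 'weight'])    # 70.0
print(people['height'].isna().sum())  # 1 (no height was given for 'cat')
```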
In [ ]:
# as before, begin by importing the pandas library
import pandas as pd
In [ ]:
# create a new series of the population
Aus_Population = {'1990':17065100, '2000':19153000, '2007':20827600,
                  '2008':21249200, '2009':21691700, '2010':22031750,
                  '2011':22340024, '2012':22728254, '2013':23117353}
population = pd.Series(Aus_Population)
In [ ]:
# recreate the CO2 emission series from the earlier example
Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566, '2009':18.20018196, '2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}
co2_Emission = pd.Series(Aus_Emission)
In [ ]:
# verify the values in the series
population
In [ ]:
# create a DataFrame object from the series objects
australia = pd.DataFrame({'co2_emission':co2_Emission, 'Population':population})
australia
In [ ]:
# create a DataFrame from a csv file
countries = pd.read_csv('data/countries.csv', encoding='ISO-8859-1')
In [ ]:
# check the first 10 rows of the DataFrame
countries.head(10) # without an argument, head() returns the first 5 rows
In [ ]:
# count the number of countries in each region
countries.Region.value_counts()
In [ ]:
# set the name of countries as the index
# (set_index returns a new DataFrame; countries itself is not modified)
countries.set_index('Country')
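Note that set_index does not change the DataFrame it is called on unless you assign the result. The sketch below shows this on a small made-up frame (the names df and indexed are illustrative, not part of the lab data).

```python
import pandas as pd

# Hypothetical stand-in for the countries data
df = pd.DataFrame({'Country': ['A', 'B'], 'Region': ['North', 'South']})

indexed = df.set_index('Country')   # returns a new DataFrame

print(df.columns.tolist())         # ['Country', 'Region'] (original unchanged)
print(indexed.columns.tolist())    # ['Region']
print(indexed.loc['B', 'Region'])  # South
```

Assigning the result back (for example df = df.set_index('Country')) keeps the new index for later cells.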
In [ ]:
# create a new DataFrame for the CO2 emission from a csv file
emission = pd.read_csv('data/emission.csv', encoding='ISO-8859-1')
#emission.head()
In [ ]:
# Create a subset of emission dataset for Year 2010
yr2010 = emission['2010']
names = emission['Country']
yr2010.index = names
type(yr2010)
In [ ]:
# Sort column values using sort_values
yr2010.sort_values()
In [ ]:
#Sort column values to find the top countries
yr2010.sort_values(ascending = False)
Exercise 2
Retrieve the mean and median of the CO2 emission generated in 2012 by all countries.
Retrieve the top 5 countries with the most CO2 emission in 2012. How about the 5 countries with the least emission? (remember that sort_values has an ascending parameter that is set to True by default).
Retrieve the sum of CO2 emission for all years and find the 2 years with the maximum CO2 emission.
In [ ]:
##answer here
More Sort Operations
Pandas allows you to sort your DataFrame by rows/columns as follows:
In [ ]:
# Sort column values of a DataFrame
sorted2012 = emission.sort_values(by='2012', ascending=False)
sorted2012
In [ ]:
# Sort by two columns: '2012' descending, with ties broken by '2013' ascending
sorted2012 = emission.sort_values(by=['2012', '2013'], ascending=[False, True])
sorted2012
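The effect of the second sort column only shows up when the first column has ties. As a small sketch with made-up numbers (not taken from emission.csv):

```python
import pandas as pd

# Made-up data with a tie in the first sort column
df = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    '2012': [5.0, 9.0, 5.0],
    '2013': [4.0, 8.0, 3.0],
})

# Sort by '2012' descending; break ties by '2013' ascending
ranked = df.sort_values(by=['2012', '2013'], ascending=[False, True])

print(ranked['Country'].tolist())  # ['B', 'C', 'A']
```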
Slicing using .loc and .iloc
Slicing allows you to take part of your DataFrame. You can use .iloc to select data by integer row/column position, or use .loc to select data by row/column label. See this article for more examples.
In [ ]:
# Slicing using a range of rows and range of columns
emission.iloc[2:5,2:6]
In [ ]:
# Slicing using specific rows and specific columns
emission.loc[[3,5],['Country','1990']]
In [ ]:
# Specific rows and all columns
emission.loc[[3,5],:]
In [ ]:
# All rows and specific columns
emission.loc[:,['Country','1990']]
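With the default integer index, row labels and row positions happen to coincide, which can hide the difference between .loc and .iloc. The sketch below uses a made-up frame whose row labels are deliberately not 0, 1, 2, ... so the two can be told apart.

```python
import pandas as pd

# Made-up frame with non-default row labels, so labels and positions differ
df = pd.DataFrame(
    {'Country': ['A', 'B', 'C', 'D'], '1990': [1.0, 2.0, 3.0, 4.0]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[1:3, 0:2]           # rows at positions 1 and 2; end excluded
by_label = df.loc[[20, 40], ['Country']]  # rows labelled 20 and 40

print(by_position['Country'].tolist())  # ['B', 'C']
print(by_label['Country'].tolist())     # ['B', 'D']
print(df.loc[20:30, '1990'].tolist())   # [2.0, 3.0] (label slices include the end)
```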
Exercise 3
Create a DataFrame object that has the name, region and IncomeGroup of the top 10 emitting countries in 2012.
In [ ]:
##answer here
Groupby
The groupby method lets you separate the data into different groups based on shared characteristics. For example, we could group countries by region or income range and then analyse those groups individually. The official documentation on groupby can be found here. This tutorial is also well worth reading.
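As a minimal sketch of the idea, the cell below groups a small made-up DataFrame by one column and then analyses each group. The column names team and score are hypothetical and are not part of the lab data.

```python
import pandas as pd

# Made-up data; the column names are illustrative only
df = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue', 'red'],
    'score': [10, 20, 30, 60, 50],
})

# Split the rows into groups that share the same 'team' value
grouped = df.groupby('team')

# Aggregate each group separately
mean_score = grouped['score'].mean()
print(mean_score['red'])   # 30.0
print(mean_score['blue'])  # 40.0

# Pull out one group as its own DataFrame
print(grouped.get_group('blue')['score'].tolist())  # [20, 60]
```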
Exercise 4
Using the countries DataFrame, group the rows using the Region column.
Show the size of each group.
Find the number of high income and low income countries by region.
In [ ]:
##answer here