5AAVC210 Introduction to Programming WEEK 7
Pandas: Python Data Analysis Library
pandas is an open source library providing easy-to-use data structures and data analysis tools for the Python programming language.
Like BeautifulSoup, pandas has to be imported before you can use it:
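A minimal sketch of the import, assuming pandas is already installed. The alias pd is a widely used convention, not a requirement:

```python
import pandas as pd  # "pd" is the conventional short alias

# Print the installed version to confirm that the import worked
print(pd.__version__)
```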
Pandas data structures
Pandas data structures: Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
You can think of a Series as a “column” of data, like a single column in an Excel spreadsheet.
A Series can stand alone, or it can be a column in a DataFrame.
This is a Series
This is also a Series
Create a Series
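One way to create a Series is to pass a Python list to pd.Series. The values and labels below are made up for illustration:

```python
import pandas as pd

# A Series built from a plain list; pandas adds a default index 0, 1, 2, ...
ages = pd.Series([23, 35, 41, 19])

# A Series with our own labels as the index (hypothetical names)
scores = pd.Series([88.5, 92.0, 79.5], index=["Ann", "Bob", "Cy"])

print(ages)
print(scores["Bob"])  # look up a value by its label
```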
Pandas data structures: DataFrames
A pandas DataFrame is essentially an in-memory representation of a spreadsheet, such as an Excel sheet, that you work with from Python:
This whole structure is a DataFrame
Create a DataFrame
In the real world, a DataFrame is usually created by loading a dataset from a file or database, including (but not limited to) Excel, CSV and MySQL.
However, to explain the idea, we can represent a DataFrame with a Python dict, treating the column names as the keys and the list of items under each column as the values:
Since we haven’t provided any row index values, pandas automatically generates a sequence (0…6) as the row index.
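A small sketch of this approach. The column names and values are hypothetical; seven rows are used so that the default row index runs from 0 to 6:

```python
import pandas as pd

# Hypothetical data: each key is a column name,
# each list holds the values in that column
data = {
    "name": ["Ann", "Bob", "Cy", "Dee", "Eli", "Fay", "Gus"],
    "age": [23, 35, 41, 19, 52, 30, 27],
    "city": ["London", "Leeds", "York", "Bath", "Hull", "Derby", "Leeds"],
}

df = pd.DataFrame(data)
print(df)
```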
Viewing the Data of a DataFrame
You might have hundreds or thousands of rows in your DataFrame, so you need a way of viewing specific data.
You can use head() to view the first five rows (or specify a number to see that many rows):
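For example, using a made-up DataFrame with 100 rows:

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows
df = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

first_five = df.head()    # first 5 rows by default
first_three = df.head(3)  # first 3 rows

print(first_five)
print(first_three)
```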
Likewise, tail() will give you the last five rows (or the specified number of rows):
What if we want to see the row index and the column names?
Most of the time, the given dataset already contains a row index. In those cases, we don’t need pandas to generate a separate row index for us.
We can set one of our columns as the row index, replacing the auto-generated row index.
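For example, again with a made-up DataFrame of 100 rows:

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows
df = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

last_five = df.tail()   # last 5 rows by default
last_two = df.tail(2)   # last 2 rows

print(last_five)
print(last_two)
```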
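A DataFrame exposes these through its index and columns attributes. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [23, 35, 41]})

print(df.index)    # the row labels (auto-generated here)
print(df.columns)  # the column names
```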
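One way to do this is with set_index. The column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical data with a column that identifies each row
df = pd.DataFrame({
    "constituency": ["Alpha", "Beta", "Gamma"],
    "votes": [1200, 950, 1430],
})

# set_index returns a new DataFrame; the original df is left unchanged
indexed = df.set_index("constituency")

print(indexed)
print(indexed.loc["Beta"])  # rows can now be looked up by constituency
```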
Let’s find a dataset. The 2017 UK general election results are available as an .xls Excel spreadsheet:
https://files.datapress.com/london/dataset/general-election-results-2017/2017-06-29T11:17:24.85/2017%20General%20Election%20Results.xls
We tell Pandas to get it:
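A sketch of how this might look with pd.read_excel. The helper name load_results is hypothetical, and the sketch assumes that the URL is still reachable, that the first sheet holds the results table, and that the xlrd package (which pandas needs for .xls files) is installed:

```python
import pandas as pd

RESULTS_URL = (
    "https://files.datapress.com/london/dataset/"
    "general-election-results-2017/2017-06-29T11:17:24.85/"
    "2017%20General%20Election%20Results.xls"
)

def load_results(source=RESULTS_URL):
    """Read the spreadsheet at `source` into a DataFrame.

    Assumption: the default (first) sheet contains the data.
    """
    return pd.read_excel(source)

# Usage (needs network access and xlrd):
# df = load_results()
# print(type(df))
```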
We get back a DataFrame object:
The describe() method shows some basic statistical details about the data:
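A short sketch on made-up numbers, since the real column names of the election spreadsheet are not shown here:

```python
import pandas as pd

# Hypothetical numeric data
df = pd.DataFrame({
    "votes": [100, 200, 300, 400],
    "turnout": [60.0, 65.0, 70.0, 75.0],
})

# One row per statistic: count, mean, std, min, quartiles, max
summary = df.describe()
print(summary)
```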
We can select columns like a dictionary:
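For example, with hypothetical column names:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "party": ["Red", "Blue", "Green"],
    "votes": [210, 150, 40],
    "seats": [3, 2, 0],
})

votes = df["votes"]             # one column name gives a Series
subset = df[["party", "votes"]]  # a list of names gives a DataFrame

print(votes)
print(subset)
```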
We can sort the data:
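One way to sort is sort_values, shown here on made-up data:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "party": ["Red", "Blue", "Green"],
    "votes": [150, 210, 40],
})

by_votes = df.sort_values("votes")                    # smallest first
top_first = df.sort_values("votes", ascending=False)  # largest first

print(top_first)
```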
Sometimes you will see ‘NaN’ when you query data. This stands for Not a Number and is used in Pandas to denote a null or missing value:
Dropping the ‘NaN’ rows gives us only winning parties:
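A sketch of both ideas on made-up data. The column names are hypothetical; here a missing value in seats_won stands for a party that did not win:

```python
import pandas as pd

# Hypothetical data: None becomes NaN in a numeric column
df = pd.DataFrame({
    "party": ["Red", "Blue", "Green"],
    "seats_won": [3, None, 1],
})

print(df)  # the Blue row shows NaN under seats_won

# Keep only the rows where seats_won is not missing
winners = df.dropna(subset=["seats_won"])
print(winners)
```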
Questions?
Algorithmic thinking
When you come to write a computer program, the first thing to do is to develop a step-by-step solution to the problem:
break it down into a series of small, more manageable problems
consider how similar problems have been solved previously
focus only on the important information, not the details
design simple steps or rules to solve each of the smaller problems
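As an illustration of these steps, here is one possible breakdown of a made-up problem, "which party received the most votes?", into two small functions. The function names and data are hypothetical:

```python
def total_votes_by_party(results):
    """Smaller problem 1: add up the votes for each party."""
    totals = {}
    for party, votes in results:
        totals[party] = totals.get(party, 0) + votes
    return totals

def party_with_most_votes(totals):
    """Smaller problem 2: pick the party with the largest total."""
    return max(totals, key=totals.get)

# Only the important information is kept: (party, votes) pairs
results = [("Red", 120), ("Blue", 150), ("Red", 90), ("Green", 40)]

totals = total_votes_by_party(results)
winner = party_with_most_votes(totals)
print(totals)
print(winner)
```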