Tutorial_01_Part_B (1)
QBUS2820 – Predictive Analytics
Tutorial 1 – Part B¶
Download the DirectMarketing.xlsx file from Blackboard and place it in the same folder as your Python file or Jupyter Notebook.
In [3]:
import pandas as pd
marketing = pd.read_excel(‘DirectMarketing.xlsx’)
marketing.head()
Out[3]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
0 Old Female Own Single Far 47500 0 High 6 755 1 0 0 1 0 0 0 1
1 Middle Male Rent Single Close 63600 0 High 6 1318 0 0 1 0 0 1 0 1
2 Young Female Rent Single Close 13500 0 Low 18 296 1 0 1 0 1 0 0 0
3 Middle Male Own Married Close 85600 1 High 18 2436 0 1 1 1 0 1 0 1
4 Middle Female Own Single Close 68400 0 High 12 1304 1 0 1 1 0 1 0 1
Handling Missing or Corrupt Data¶
We can check if there is missing or corrupt cells in our Dataframe by the following code. This tells us how many missing values there are in each column.
In [4]:
marketing.isnull().sum()
Out[4]:
Age 0
Gender 0
OwnHome 0
Married 0
Location 0
Salary 0
Children 0
History 303
Catalogs 0
AmountSpent 0
Gender_b 0
Married_b 0
Location_b 0
Ownhome_b 0
Age_y 0
Age_m 0
Hist_m 0
Hist_h 0
dtype: int64
Then we can remove any rows containing the missing values
In [5]:
marketing_clean = marketing.dropna()
Lets check that the bad rows have been removed
In [6]:
marketing_clean.isnull().sum()
Out[6]:
Age 0
Gender 0
OwnHome 0
Married 0
Location 0
Salary 0
Children 0
History 0
Catalogs 0
AmountSpent 0
Gender_b 0
Married_b 0
Location_b 0
Ownhome_b 0
Age_y 0
Age_m 0
Hist_m 0
Hist_h 0
dtype: int64
Indexing and Slicing¶
Often we are only interested in a single variable or subset of variables from our dataset. We might even only be interested in a set of observations or just a single cell.
We can use Panda’s indexing and slicing functions to extract subsections of data from a Dataframe.
You can index a Dataframe with two functions:
loc, indexing by labels in the index
iloc, indexing by physical position
A dataframe has a special column called the “index”. This is a unique identifier for each row/observation in our data. This index is often numerical but can be a string or date etc. If you want to extract some rows based on this index column then use loc. If you want to index based on the actual row position then use the iloc function.
Both loc and iloc use the square bracket index notation. We can specify the rows and columns we wish to index with the following syntax: [ROWS, COLUMNS].
If we want every column or row we can use “:” in place of ROWS or COLUMNS. We can also specify a range by “start:end” or even just specifying a start or end point e.g. “start:” or “:end”
For example let’s extract the first 3 rows of the dataframe.
In [56]:
first_three = marketing.iloc[0:3, :]
first_three.head()
Out[56]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
0 Old Female Own Single Far 47500 0 High 6 755 1 0 0 1 0 0 0 1
1 Middle Male Rent Single Close 63600 0 High 6 1318 0 0 1 0 0 1 0 1
2 Young Female Rent Single Close 13500 0 Low 18 296 1 0 1 0 1 0 0 0
As mentioned we do not need to specify the start point of this range e.g.
In [57]:
alternative = marketing.iloc[:3, :]
alternative.head()
Out[57]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
0 Old Female Own Single Far 47500 0 High 6 755 1 0 0 1 0 0 0 1
1 Middle Male Rent Single Close 63600 0 High 6 1318 0 0 1 0 0 1 0 1
2 Young Female Rent Single Close 13500 0 Low 18 296 1 0 1 0 1 0 0 0
We can index columns much the same and even get a subset of columns and rows e.g.
In [58]:
subset = marketing.iloc[:3, 5:10]
subset.head()
Out[58]:
Salary Children History Catalogs AmountSpent
0 47500 0 High 6 755
1 63600 0 High 6 1318
2 13500 0 Low 18 296
If you would like to index columns or rows that are not sequential then you can use an array of index values e.g. let’s grab columns 0, 1, 2, 5 and 9.
In [59]:
column_index = [0, 1, 2, 5, 9]
subset = marketing.iloc[:, column_index]
subset.head()
Out[59]:
Age Gender OwnHome Salary AmountSpent
0 Old Female Own 47500 755
1 Middle Male Rent 63600 1318
2 Young Female Rent 13500 296
3 Middle Male Own 85600 2436
4 Middle Female Own 68400 1304
We can also index based on column names. For small selections of data this is the reccomended method because it makes your code more readable! Instead of some opaque numerical indices you are explicity referencing the column by name.
For example to extract the Salary column you can do either of the following (You can also use an array of column names to index your data):
In [60]:
salary = marketing[‘Salary’]
salary.head()
salary2 = marketing.loc[:, ‘Salary’]
salary.head()
Out[60]:
0 47500
1 63600
2 13500
3 85600
4 68400
Name: Salary, dtype: int64
Querying¶
One of the most powerful features of Pandas is the ability to query the data. Querying allows us to extract data that matches criteria.
There are two ways to query a dataframe:
.query()
index notation
Query Function¶
The query function requires a query expression string. You can write it like a maths expression.
Let’s find all rows in our data frame with Salary greater than \$90,000.
In [63]:
big_earners = marketing.query(“Salary > 90000”)
big_earners.head()
Out[63]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
11 Middle Male Own Married Far 111800 3 High 18 2065 0 1 0 1 0 1 0 1
13 Middle Male Own Married Close 111400 0 High 12 2136 0 1 1 1 0 1 0 1
14 Old Female Own Married Far 110000 0 High 24 5564 1 1 0 1 0 0 0 1
17 Middle Male Own Married Close 90100 0 High 18 1956 0 1 1 1 0 1 0 1
25 Middle Female Own Married Close 92300 3 Low 18 655 1 1 1 1 0 1 0 0
Index Query Style¶
Instead of writing a query expression string, you can find evaluate a query against a column, then use this as an index list into the dataframe.
The example below is equivalent to the query function from above
In [64]:
big_earners = marketing[marketing[‘Salary’] > 90000]
big_earners.head()
Out[64]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
11 Middle Male Own Married Far 111800 3 High 18 2065 0 1 0 1 0 1 0 1
13 Middle Male Own Married Close 111400 0 High 12 2136 0 1 1 1 0 1 0 1
14 Old Female Own Married Far 110000 0 High 24 5564 1 1 0 1 0 0 0 1
17 Middle Male Own Married Close 90100 0 High 18 1956 0 1 1 1 0 1 0 1
25 Middle Female Own Married Close 92300 3 Low 18 655 1 1 1 1 0 1 0 0
Multiple Criteria¶
You can join multiple queries together by using AND (both criteria must be true) or OR (one of the criteria must be true) joins.
For example let’s find big spenders with relatively low incomes.
In [65]:
addicts = marketing[(marketing[‘Salary’] <= 50000) & (marketing['AmountSpent'] > 2000)]
addicts.head()
Out[65]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
126 Middle Female Own Married Far 44200 0 High 24 2524 1 1 0 1 0 1 0 1
805 Middle Male Own Single Far 49700 0 High 24 2844 0 0 0 1 0 1 0 1
Creating Columns¶
You can insert your own data into a dataframe.
Let’s look at a simple example of creating a hybrid column of data from our marketing dataset.
I am interested to know the salary/amountspent ratio for each customer. So first let’s create a series with the ratio.
In [66]:
ratio = marketing[‘Salary’] / marketing[‘AmountSpent’]
Then insert this series into the marketing dataframe
In [68]:
marketing[‘SalarySpendRatio’] = ratio
marketing.head()
Out[68]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h SalarySpendRatio
0 Old Female Own Single Far 47500 0 High 6 755 1 0 0 1 0 0 0 1 62.913907
1 Middle Male Rent Single Close 63600 0 High 6 1318 0 0 1 0 0 1 0 1 48.254932
2 Young Female Rent Single Close 13500 0 Low 18 296 1 0 1 0 1 0 0 0 45.608108
3 Middle Male Own Married Close 85600 1 High 18 2436 0 1 1 1 0 1 0 1 35.139573
4 Middle Female Own Single Close 68400 0 High 12 1304 1 0 1 1 0 1 0 1 52.453988
In [ ]: