QBUS2820 – Predictive Analytics
Tutorial 1 – Part A
Data Handling with Pandas
https://pandas.pydata.org/pandas-docs/stable/
Pandas is a library for data manipulation. Its key data structure, the DataFrame, can hold columns of different data types: for example, a single DataFrame can contain integers, strings and floating point numbers at the same time, much like an Excel spreadsheet. A NumPy array, by contrast, stores all of its elements as one data type. A Python list can mix types, but it has no column structure or labelled access.
Pandas is already installed and available in Anaconda/Spyder. To begin using it, we first import it as follows:
In [15]:
import pandas as pd
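The point about mixed data types can be checked with a small sketch. The three-row table and its column names (units, label, price) are hypothetical and exist only for illustration; they are not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one integer, one string and one float column.
mixed = pd.DataFrame({
    "units": [1, 2, 3],
    "label": ["a", "b", "c"],
    "price": [0.5, 1.5, 2.5],
})
print(mixed.dtypes)  # each column keeps its own data type

# A NumPy array built from mixed values falls back to one common type.
coerced = np.array([1, "a", 0.5])
print(coerced.dtype.kind)  # 'U' means every element was stored as a string
```

Each DataFrame column keeps its own type, while the NumPy array converts everything to strings.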
Download the DirectMarketing.xlsx file from Blackboard and place it in the same folder as your Python file or Jupyter Notebook.
The data set "DirectMarketing.xlsx" comes from the book Business Analytics for Managers by . It contains information on 1000 customers in a customer database for the company Direct Marketing (DM).
We can load the file into Python and convert it to a DataFrame with:
In [16]:
marketing = pd.read_excel('DirectMarketing.xlsx')
We can check the first 5 rows of the marketing DataFrame by using its head method.
The documentation for head is at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
In [17]:
marketing.head()
Out[17]:
      Age  Gender  OwnHome  Married  Location  Salary  Children  History  Catalogs  AmountSpent
0     Old  Female      Own   Single       Far   47500         0     High         6          755
1  Middle    Male     Rent   Single     Close   63600         0     High         6         1318
2   Young  Female     Rent   Single     Close   13500         0      Low        18          296
3  Middle    Male      Own  Married     Close   85600         1     High        18         2436
4  Middle  Female      Own   Single     Close   68400         0     High        12         1304
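head also accepts the number of rows to return, and tail does the same from the end of the table. The sketch below retypes four columns of the five rows printed above by hand, so it runs without the Excel file; the variable names sample, first_two and last_two are chosen here for illustration.

```python
import pandas as pd

# Four columns of the five rows shown above, retyped so no file is needed.
sample = pd.DataFrame({
    "Age": ["Old", "Middle", "Young", "Middle", "Middle"],
    "Salary": [47500, 63600, 13500, 85600, 68400],
    "Catalogs": [6, 6, 18, 18, 12],
    "AmountSpent": [755, 1318, 296, 2436, 1304],
})

first_two = sample.head(2)  # first two rows
last_two = sample.tail(2)   # last two rows
print(first_two)
print(last_two)
print(sample.shape)         # (number of rows, number of columns)
```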
To see full information about a DataFrame, including the number of entries and the data type of each column, use the info() method:
In [18]:
marketing.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Age 1000 non-null object
Gender 1000 non-null object
OwnHome 1000 non-null object
Married 1000 non-null object
Location 1000 non-null object
Salary 1000 non-null int64
Children 1000 non-null int64
History 697 non-null object
Catalogs 1000 non-null int64
AmountSpent 1000 non-null int64
dtypes: int64(4), object(6)
memory usage: 78.2+ KB
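The info() output reports 697 non-null entries for History out of 1000 rows, so 303 values in that column are missing. Missing values can also be counted directly. The sketch below uses a hypothetical five-row table in which two History entries are deliberately set to None; these values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical data: two of the five History values are missing.
toy = pd.DataFrame({
    "History": ["High", None, "Low", None, "High"],
    "Catalogs": [6, 6, 18, 18, 12],
})

missing_per_column = toy.isna().sum()            # missing entries per column
non_null_history = toy["History"].notna().sum()  # the count that info() reports
print(missing_per_column)
print(non_null_history)
```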
View summary statistics of the numeric columns of the marketing DataFrame by using the describe method:
In [19]:
marketing.describe()
Out[19]:
              Salary    Children     Catalogs  AmountSpent
count    1000.000000  1000.00000  1000.000000  1000.000000
mean    56103.900000     0.93400    14.682000  1216.770000
std     30616.314826     1.05107     6.622895   961.068613
min     10100.000000     0.00000     6.000000    38.000000
25%     29975.000000     0.00000     6.000000   488.250000
50%     53700.000000     1.00000    12.000000   962.000000
75%     77025.000000     2.00000    18.000000  1688.500000
max    168800.000000     3.00000    24.000000  6217.000000
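The summary above covers only the four integer columns, because describe skips text columns by default. Passing include="all" also summarises the text columns. This is a minimal sketch on two columns of the five rows printed earlier, retyped by hand so it runs without the Excel file.

```python
import pandas as pd

# Gender and AmountSpent for the first five customers, retyped by hand.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "AmountSpent": [755, 1318, 296, 2436, 1304],
})

numeric_summary = sample.describe()           # numeric columns only
full_summary = sample.describe(include="all") # adds unique, top and freq rows
print(numeric_summary)
print(full_summary)
```

For a text column the useful rows are unique (number of distinct values), top (most frequent value) and freq (how often it occurs); the numeric rows are left as NaN for that column.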
Pandas Series
A Series is a single column of a Pandas DataFrame.
We can extract a column by name:
In [36]:
catalogs = marketing['Catalogs']
catalogs.head()
Out[36]:
0 6
1 6
2 18
3 18
4 12
Name: Catalogs, dtype: int64
We can also extract a column by position using iloc (short for integer location). The square brackets [] select a slice of the data. The first argument selects the rows, given as a single integer or a range; here it is ":", which means all rows. The second argument selects the column. Indexing in Python starts at zero, so position 8 is the ninth column, Catalogs.
In [35]:
catalogs = marketing.iloc[:, 8]
catalogs.head()
Out[35]:
0 6
1 6
2 18
3 18
4 12
Name: Catalogs, dtype: int64
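Counting columns by hand is error-prone, so it can help to look the position up from the name. The sketch below rebuilds the first row of the table with the same ten column names and confirms that Catalogs sits at position 8. The variable names are chosen here for illustration.

```python
import pandas as pd

columns = ["Age", "Gender", "OwnHome", "Married", "Location",
           "Salary", "Children", "History", "Catalogs", "AmountSpent"]
first_row = ["Old", "Female", "Own", "Single", "Far", 47500, 0, "High", 6, 755]
one_row = pd.DataFrame([first_row], columns=columns)

position = one_row.columns.get_loc("Catalogs")  # zero-based column position
by_position = one_row.iloc[:, position]         # select by position
by_name = one_row["Catalogs"]                   # select by name
print(position)
print(by_position.equals(by_name))
```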
To find the number of unique values for a categorical column you can use nunique()
In [34]:
catalogs.nunique()
Out[34]:
4
To see the unique values and the number of entries for each value you can use value_counts()
In [23]:
marketing['Catalogs'].value_counts()
Out[23]:
12 282
6 252
24 233
18 233
Name: Catalogs, dtype: int64
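The four counts above add up to 1000 (282 + 252 + 233 + 233), one per customer. To see proportions instead of counts, value_counts accepts normalize=True. A minimal sketch on the five Catalogs values shown in the head output, which contain only three of the four distinct values:

```python
import pandas as pd

# Catalogs values of the first five customers.
catalogs_sample = pd.Series([6, 6, 18, 18, 12], name="Catalogs")

distinct = catalogs_sample.nunique()                     # number of distinct values
counts = catalogs_sample.value_counts()                  # occurrences of each value
shares = catalogs_sample.value_counts(normalize=True)    # fraction of rows per value
print(distinct)
print(counts)
print(shares)
```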
Pyplot
https://matplotlib.org/api/pyplot_api.html
Pyplot is the plotting module of the Matplotlib library. It follows many of the conventions of MATLAB's plotting functions and provides a simple interface for common plotting tasks.
To begin using it we first import it as follows
In [24]:
import matplotlib.pyplot as plt
Pyplot requires a figure to draw each plot upon. A Figure can contain a single plot or you can subdivide it into many plots. You can think of a figure as a “blank canvas” to draw on.
By default Pyplot will draw on the last figure you created.
Creating a new figure is simple
In [25]:
my_figure = plt.figure()
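As noted above, a figure can be subdivided into several plots. One way to do this is the figure's add_subplot method, which takes the grid size and the position within the grid. This is a sketch with made-up points; the names split_figure, left and right are chosen here for illustration.

```python
import matplotlib.pyplot as plt

split_figure = plt.figure()

# A grid of 1 row and 2 columns; the third number picks the cell (counted from 1).
left = split_figure.add_subplot(1, 2, 1)
right = split_figure.add_subplot(1, 2, 2)

left.plot([0, 1, 2], [0, 1, 4])
right.plot([0, 1, 2], [4, 1, 0])
left.set_title("Left plot")
right.set_title("Right plot")
```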
Line Plot
Let's create a simple line plot from some randomly generated data.
The plot function makes a line by taking a set of (x,y) points and joining them together.
The plot function takes one or two input arrays. If you provide one input it assumes that the input is for the y-axis and then evenly spaces each point on the x-axis. If you provide two inputs it places each point in the exact coordinates that you have specified.
In [26]:
import numpy as np
x = np.arange(100)
y = np.linspace(0, 100, 100) + np.random.randn(100)
plt.plot(x, y, label="My Variable")
my_figure
Out[26]:
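The difference between passing one array and two arrays to plot can be checked on a small example. The values here are made up for illustration. With one input, the x-coordinates default to 0, 1, 2, ...; with two inputs, the given x-coordinates are used as they are.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([1.0, 4.0, 9.0, 16.0])
x = np.array([10, 20, 30, 40])

plt.figure()
line_one, = plt.plot(y)      # one input: x defaults to 0, 1, 2, 3
line_two, = plt.plot(x, y)   # two inputs: x is taken from the first array

print(line_one.get_xdata())
print(line_two.get_xdata())
```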
Histogram
Let’s create a simple histogram of a variable from the direct marketing dataset.
The variable of interest is the AmountSpent of each person.
In [27]:
my_figure2 = plt.figure()
amount_spent = marketing['AmountSpent']
n, bins, patches = plt.hist(amount_spent)
my_figure2
Out[27]:
This plot is a little hard to read so let’s make some adjustments so that it is more presentable. You can see the full parameters for the hist() function here https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist
In [28]:
my_figure3 = plt.figure()
n, bins, patches = plt.hist(amount_spent, rwidth=0.95, bins=10, align="mid",
                            label="Number of Customers", color=[0, 0, 1, 1])
# Colour is specified as Red, Green, Blue, Alpha, each between 0 and 1
plt.xticks(bins)
plt.xlabel("Amount Spent")
plt.ylabel("Number of Customers")
plt.title("Histogram of Amount Spent")
plt.legend()
my_figure3
Out[28]:
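The hist function returns three values, which the cells above store as n, bins and patches: the count in each bin, the bin edges, and the drawn bars. A minimal sketch using the five AmountSpent values from the head output and four bins (the bin count is chosen here for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# AmountSpent of the first five customers.
spent_sample = np.array([755, 1318, 296, 2436, 1304])

plt.figure()
n, bins, patches = plt.hist(spent_sample, bins=4)

print(n)             # number of observations in each bin
print(bins)          # bin edges: always one more edge than there are bins
print(len(patches))  # one bar per bin
```

Because bins holds the edges, passing it to plt.xticks, as in the cell above, places a tick at every bin boundary.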