Assignment 2: MAF 3003, Summer 2020¶
Store Transaction Analysis¶
Managerial Problem. A retailer of high-end electronic equipment wishes to draw insights on the purchases that occurred during the month of April. They provided you a relational database with the transactions, amount purchased (in dollars), and some basic customer information they obtained at each purchase, specifically the age and whether they were satisfied or not with the service.
The managerial questions the company is interested in are provided below. Please answer them in the corresponding Jupyter cell using Python. The data is located on the table transactions (all in small letters) in the relational database transactions.db available on Quercus. They have not provided any additional information on the columns or column names.
For qualitative questions, answer them as comments in the appropriate cells.
Note: Before answering the questions, please go to to the Jupyter’s menu and execute “Kernel”, “Restart & Run all.” This will ensure that the notebook and modules are appropriately loaded.
This assignment is worth 4 points (out of 30)
In [31]:
# Load modules required for the assignment
import pandas as pd # for data manipulation
import numpy as np # for stats and numerical analysis
import matplotlib.pyplot as plt # for plotting and data visualization
import sqlite3 as sqldatabase # for importing data using SQL
Question 1¶
Read the full data into a dataframe named dfRetailer using SQL. Print the column names and the total number of transactions that occurred during that period. [0.25 pts]
In [32]:
connector = sqldatabase.connect(“transactions.db”)
query = “select * from transactions”
dfRetailer = pd.read_sql(query, connector)
dfRetailer
Out[32]:
Age
Satisfied
Amount Spent
0
30
0
228.01
1
28
1
132.37
2
29
1
407.91
3
34
0
350.91
4
39
0
509.87
…
…
…
…
292
35
0
632.91
293
Not informed
0
394.43
294
34
1
303.13
295
29
1
373.06
296
35
0
334.67
297 rows × 3 columns
In [33]:
dfRetailer.describe()
Out[33]:
Satisfied
count
297.000000
mean
0.515152
std
0.500614
min
0.000000
25%
0.000000
50%
1.000000
75%
1.000000
max
1.000000
Question 2¶
Were the customers who performed the 5 largest transactions satisfied with their purchase? [1.0 pt]
(a) First, ensure that the columns used for this question are properly cleaned (i.e., possible type errors and empty fields are addressed). If there are empty fields, apply either deletion or imputation, clearly justifying your choice. [0.5ps]
(b) To answer the question, generate a table (i.e., a dataframe) with the rows having the 5 largest transactions, and discuss whether those customers were satisfied. [0.5pts]
In [34]:
# (a)
# 1. Check Column Names – No errors
dfRetailer.columns
Out[34]:
Index([‘Age’, ‘Satisfied’, ‘Amount Spent’], dtype=’object’)
In [35]:
# 2. Check Data Types
dfRetailer.dtypes
Out[35]:
Age object
Satisfied int64
Amount Spent object
dtype: object
In [40]:
# 2.1 Age should be changed from string to integer
dfRetailer[“Age”] = dfRetailer[“Age”].astype(“int64”)
—————————————————————————
ValueError Traceback (most recent call last)
1 # 2.1 Age should be changed from string to integer
—-> 2 dfRetailer[“Age”] = dfRetailer[“Age”].astype(“int64”)
~\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors)
5696 else:
5697 # else, only a single dtype is given
-> 5698 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
5699 return self._constructor(new_data).__finalize__(self)
5700
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, copy, errors)
580
581 def astype(self, dtype, copy: bool = False, errors: str = “raise”):
–> 582 return self.apply(“astype”, dtype=dtype, copy=copy, errors=errors)
583
584 def convert(self, **kwargs):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
440 applied = b.apply(f, **kwargs)
441 else:
–> 442 applied = getattr(b, f)(**kwargs)
443 result_blocks = _extend_blocks(applied, result_blocks)
444
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors)
623 vals1d = values.ravel()
624 try:
–> 625 values = astype_nansafe(vals1d, dtype, copy=True)
626 except (ValueError, TypeError):
627 # e.g. astype_nansafe can fail on object-dtype of strings
~\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
872 # work around NumPy brokenness, #1987
873 if np.issubdtype(dtype.type, np.integer):
–> 874 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
875
876 # if we have a datetime/timedelta array of objects
pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()
ValueError: invalid literal for int() with base 10: ‘Not informed’
Question 3¶
What is the percentage of transactions that were satisfactory to customers? Please make sure you use the cleaned data from Q2. [1 pts]
Hint: To solve this question, use Boolean operators that we saw in Lecture 3.
In [ ]:
Question 4¶
What is the average age of customers who spent more \$500? Assume that each transaction is a different customer. [1.0pts]
(a) First, ensure that the columns used for this question are properly cleaned (i.e., possible type errors and empty fields are addressed). If there are empty fields, apply either deletion or imputation, clearly justifying your choice. [0.5ps]
(b) Answer the question using SQL, Python, or a combination of both. [0.5pts]
In [ ]:
Question 5¶
One of the strategic managers proposed a special discount to customers who are 20 years-old or younger, so as to bring more sales for the month of May. Estimate the probability that a customer of that age range will perform any purchase in the system using the data that they provided you. Based on that information, what would you suggest to the manager? [0.75pts]
Hint: The age distribution is very close to a Normal.
In [ ]: