Tutorial_02_task_solution (1)
Tutorial 2 Tasks Solution¶
Task 1 – Modelling¶
How do customer attributes and marketing efforts affect spending?¶
In order to optimally target customers the sales and marketing teams want to estimate how much a particular customer will spend, given a set of attributes.
This is the updated version. 12/04/2018.
1 – Inspect and visualize the data to investigate which variables do you think would be helpful to explain output variable ‘AmountSpent’¶
In [1]:
import pandas as pd # This helps us load, inspect, and manipulate data
import matplotlib.pyplot as plt # This helps us plot and visualize the data
import seaborn as sns # This helps us plot and visualize the data in a more fancy way
import statsmodels.formula.api as smf # This is the package that contains most of the standard statistical models
In [2]:
# Load in our data
marketing = pd.read_excel(‘DirectMarketing.xlsx’)
In [3]:
# Quickly inspect the data
marketing.head()
Out[3]:
Age Gender OwnHome Married Location Salary Children History Catalogs AmountSpent
0 Old Female Own Single Far 47500 0 High 6 755
1 Middle Male Rent Single Close 63600 0 High 6 1318
2 Young Female Rent Single Close 13500 0 Low 18 296
3 Middle Male Own Married Close 85600 1 High 18 2436
4 Middle Female Own Single Close 68400 0 High 12 1304
In [4]:
# Plot a pairwise plot to quickly inspect the relationship between each pair of variables
# This only compares the numerical data
sns.pairplot(marketing)
plt.show()
From the pairwise plot, it seems like Salary is related to AmountSpent, but we’re not really sure for the other two.
In [5]:
# Let’s investigate the relationship across the age groups
age_group = marketing.groupby(‘Age’)
In [6]:
# I am only interested in the AmountSpent attribute, and I want to summarize the AmountSpent for different age groups
age_group[‘AmountSpent’].describe()
Out[6]:
count mean std min 25% 50% 75% max
Age
Middle 508.0 1501.690945 959.273125 157.0 815.75 1320.0 2008.5 5878.0
Old 205.0 1432.126829 1045.930027 65.0 638.00 1120.0 1985.0 6217.0
Young 287.0 558.623693 450.145949 38.0 248.50 422.0 699.0 3688.0
From the above, it seems like the average spending varies quite a lot between different age groups, BUT the error (standard deviation) is also high, so we cannot be sure that this difference is significant.
In [7]:
# Let’s investigate the relationship across the gender groups
gender_group = marketing.groupby(‘Gender’)
In [8]:
# I am only interested in the AmountSpent attribute, and I want to summarize the AmountSpent for different gender groups
gender_group[‘AmountSpent’].describe()
Out[8]:
count mean std min 25% 50% 75% max
Gender
Female 506.0 1025.339921 910.529013 47.0 364.25 706.0 1442.0 5830.0
Male 494.0 1412.850202 972.769951 38.0 688.75 1216.0 1902.0 6217.0
From the above, it seems like the average spending varies quite a lot between different gender groups, BUT the error (standard deivation) is also high, so we cannot be sure that this difference is significant.
2 – Build a linear regression model with dependant variable AmountSpent and independant variables Salary, Children, Gender and Catalogs.¶
In [9]:
# Write down our formula that represents the model we are trying to model
# In plain English, the formula reads “AmountSpent is explained by Salary and Children and Gender and Catalogs”
formula = ‘AmountSpent ~ Salary + Children + Gender + Catalogs’
# First, we will initlize the ordinary least squares function
# OLS: Ordinary Least Squares
# OLS is the algorithm to train our linear model, but it is not the linear model itself
model = smf.ols(formula, marketing)
In [10]:
# Now, train the model by using the fit() method
results = model.fit()
In [11]:
# Summarize my linear regression result
results.summary()
Out[11]:
OLS Regression Results Dep. Variable: AmountSpent R-squared: 0.659
Model: OLS Adj. R-squared: 0.657
Method: Least Squares F-statistic: 480.4
Date: Fri, 13 Apr 2018 Prob (F-statistic): 1.40e-230
Time: 09:38:24 Log-Likelihood: -7748.7
No. Observations: 1000 AIC: 1.551e+04
Df Residuals: 995 BIC: 1.553e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -431.2400 54.680 -7.887 0.000 -538.541 -323.939
Gender[T.Male] -41.7490 37.152 -1.124 0.261 -114.655 31.157
Salary 0.0206 0.001 33.534 0.000 0.019 0.022
Children -200.9904 17.207 -11.681 0.000 -234.756 -167.224
Catalogs 47.7775 2.756 17.337 0.000 42.370 53.186
Omnibus: 257.549 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 995.294
Skew: 1.180 Prob(JB): 7.49e-217
Kurtosis: 7.279 Cond. No. 2.01e+05
3 – Which variables are significant and which are not?¶
From the above, Salary, Children and Catalogs are significant, but Gender is probably not that significant so we could drop it from the model.
4 – What is the $R^2$ value of the model? How could you potentially improve the model?¶
From the summary table or the rsquared attribute of the regression results object http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
Let’s try with adding more variables. However, we must be aware not to overfit our model. We will discuss this further in the module on Model Selection.
In [12]:
results.rsquared
Out[12]:
0.65886832396594674
In [18]:
formula_big = ‘AmountSpent ~ Salary + Children + Gender + Catalogs + History + Age’
model_big = smf.ols(formula_big, marketing)
results_big = model_big.fit()
results_big.rsquared
Out[18]:
0.72929213520421476
5 – To test your model you should predict sales amount with following features: 50,000 salary, 0 children, 12 sent catalogs and female.¶
In [13]:
# Create a dataframe containing the attribute of this particular customer
customers = pd.DataFrame({‘Salary’: [50000],
‘Children’: [0],
‘Gender’: [‘Female’],
‘Catalogs’: [12]})
In [17]:
# Use my ordinary least squares trained model to predict on this customer
results.predict(customers)
Out[17]:
0 1075.776908
1 -468.131111
dtype: float64
6 – Predict on multiple people at the same time (make up a few people that you might be interested in, does the result make sense to you?)¶
In [15]:
# Create a dataframe containing the attribute of this particular customer
customers = pd.DataFrame({‘Salary’: [50000, 10000],
‘Children’: [0, 1],
‘Catalogs’: [10, 0],
‘Gender’: [‘Female’, ‘Male’]})
In [16]:
# Use my ordinary least squares trained model to predict on these customer
results.predict(customers)
Out[16]:
0 1075.776908
1 -468.131111
dtype: float64