Predictive Analytics (QBUS2820)
Tutorial 7: Variable Selection
In this tutorial we will study how to implement variable selection in Python.
Data
Standardising the predictors
Variable selection in OLS
Regularisation methods
This notebook relies on the following imports and settings.
In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_context('notebook')
sns.set_style('ticks')
Data
Our analysis will be based on a customer acquisition and profitability dataset.
In [6]:
data = pd.read_excel('CustomerAcquisition.xls', index_col=[0])
data.head()
Out[6]:
Acquisition First_Purchase CLV Duration Censor Acq_Expense Acq_Expense_SQ Industry Revenue Employees Ret_Expense Ret_Expense_SQ Crossbuy Frequency Frequency_SQ
Customer
1 1 433.64 0.0000 384 0 760.36 578147.33 1 30.16 1240 2309.79 5335129.84 5 2 4
2 0 0.00 0.0000 0 0 147.70 21815.29 1 39.80 166 0.00 0.00 0 0 0
3 0 0.00 0.0000 0 0 252.56 63786.55 1 54.93 1016 0.00 0.00 0 0 0
4 1 225.84 5.7316 730 1 609.73 371770.67 1 45.83 122 2192.59 4807450.91 2 12 144
5 1 363.04 0.0000 579 0 672.36 452067.97 1 69.03 313 801.14 641825.30 4 7 49
The objective of this type of analysis is to predict the profitability of a customer, which involves both classification and regression tasks. We apply classification methods to predict whether a customer will be acquired and retained, and regression methods to estimate the value of acquired and retained customers to the company. Here, we focus on the regression task.
The response variable for this tutorial is the customer lifetime value (CLV, a measure of profitability) of customers who were acquired and remained as customers by the end of the observation window (730 days). These customers are identified with a value of one for the Censor variable.
In [7]:
data = data[data['Censor']==1]
The predictors are the value of the first purchase, dollars spent on marketing expenses for acquiring and retaining the customer, industry, revenue and number of employees of the customer, the number of product categories the customer purchased, and the number of times the customer purchased during the observation window.
Below, we record the names of the response and predictor variables for subsequent use. This is efficient because the code from here onwards is fully automated, allowing us to quickly apply the same type of analysis to different data.
In [8]:
response = ['CLV']
exclude = response + ['Acquisition', 'Duration', 'Censor'] # list of variables that are NOT predictors
predictors = [x for x in list(data.columns) if x not in exclude] # build the list of predictors
data = data[response + predictors] # exclude the variables we are not going to use
data.head()
Out[8]:
CLV First_Purchase Acq_Expense Acq_Expense_SQ Industry Revenue Employees Ret_Expense Ret_Expense_SQ Crossbuy Frequency Frequency_SQ
Customer
4 5.7316 225.84 609.73 371770.67 1 45.83 122 2192.59 4807450.91 2 12 144
9 6.9161 599.30 452.35 204620.52 1 17.98 1782 1340.75 1797610.56 1 11 121
10 6.0839 271.18 786.72 618928.36 1 38.91 539 2265.93 5134438.76 1 14 196
14 8.7544 678.06 472.13 222906.74 0 58.91 1507 2430.31 5906406.70 5 2 4
16 6.5147 178.28 610.91 373211.03 1 33.78 41 2071.53 4291236.54 6 8 64
List of the potential predictors:
In [9]:
predictors
Out[9]:
['First_Purchase',
 'Acq_Expense',
 'Acq_Expense_SQ',
 'Industry',
 'Revenue',
 'Employees',
 'Ret_Expense',
 'Ret_Expense_SQ',
 'Crossbuy',
 'Frequency',
 'Frequency_SQ']
We are ready to split the data into training and test sets.
In [10]:
train = data.sample(frac=0.5, random_state=1)
test = data[data.index.isin(train.index)==False].copy()
Standardising the predictors
The regularisation methods require us to standardise the predictors by subtracting the mean (so each predictor has zero mean) and dividing by the standard deviation (so each predictor has unit standard deviation after standardisation). We must apply the same transformation to the predictors in the test data, using the mean and standard deviation computed from the training data only.
Alternatively, you can use the StandardScaler from scikit-learn for this task.
We don't need to standardise the response variable, as Lasso, Ridge and ElasticNet can calculate the intercept term automatically (fit_intercept=True by default). Hence, the predictions will be on the original scale of the response variable.
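A minimal sketch of that scikit-learn alternative is given below (assuming the train/test split created above; note that StandardScaler divides by the population standard deviation, so its output differs very slightly from the pandas .std() approach used in this notebook):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training predictors only, then apply the same transformation to the test predictors
train_scaled = scaler.fit_transform(train[predictors])
test_scaled = scaler.transform(test[predictors])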
In [11]:
mu=train[predictors].mean() # mean for each feature
sigma=train[predictors].std() # std for each feature
In [12]:
mu.head()
Out[12]:
First_Purchase 424.133824
Acq_Expense 563.333824
Acq_Expense_SQ 328524.992647
Industry 0.705882
Revenue 40.867500
dtype: float64
In [13]:
sigma.head()
Out[13]:
First_Purchase 149.298535
Acq_Expense 106.521648
Acq_Expense_SQ 119095.465998
Industry 0.459033
Revenue 16.173683
dtype: float64
Standardise the predictors in both the training and test data. Note that we must use the same mean and standard deviation, calculated from the training data only.
In [14]:
train[predictors]=(train[predictors]-mu)/sigma
test[predictors]=(test[predictors]-mu)/sigma
Means of the predictors (in the training data) after standardisation. Observe that the data have been centred (i.e. each predictor now has zero mean).
In [15]:
train[predictors].mean().head()
Out[15]:
First_Purchase 9.959354e-17
Acq_Expense -1.369819e-15
Acq_Expense_SQ -2.693924e-16
Industry -9.469549e-17
Revenue -4.179663e-16
dtype: float64
Standard deviations of the predictors (in the training data) after standardisation:
In [16]:
train[predictors].std().head()
Out[16]:
First_Purchase 1.0
Acq_Expense 1.0
Acq_Expense_SQ 1.0
Industry 1.0
Revenue 1.0
dtype: float64
Variable selection in OLS
We will demonstrate variable selection in OLS based on hypothesis testing and the AIC/BIC criteria.
Note that, unlike the regularisation methods, OLS does not technically require standardised data, but standardising often improves numerical stability.
In [17]:
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
x_with_intercept = sm.add_constant(train[predictors], prepend=True)
ols = sm.OLS(train[response],x_with_intercept)
est = ols.fit()
print(est.summary())
OLS Regression Results
==============================================================================
Dep. Variable: CLV R-squared: 0.865
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 32.51
Date: Mon, 20 Sep 2021 Prob (F-statistic): 2.44e-20
Time: 11:25:23 Log-Likelihood: -26.423
No. Observations: 68 AIC: 76.85
Df Residuals: 56 BIC: 103.5
Df Model: 11
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
———————————————————————————-
const 6.4117 0.048 134.447 0.000 6.316 6.507
First_Purchase 0.6429 0.263 2.446 0.018 0.116 1.169
Acq_Expense 0.2627 0.453 0.579 0.565 -0.646 1.171
Acq_Expense_SQ -0.5408 0.469 -1.154 0.253 -1.479 0.398
Industry 0.2754 0.051 5.438 0.000 0.174 0.377
Revenue 0.0103 0.095 0.108 0.914 -0.180 0.200
Employees -0.1068 0.218 -0.490 0.626 -0.544 0.330
Ret_Expense 1.9527 0.288 6.780 0.000 1.376 2.530
Ret_Expense_SQ -1.3464 0.287 -4.690 0.000 -1.921 -0.771
Crossbuy 0.3309 0.050 6.653 0.000 0.231 0.431
Frequency 0.8512 0.215 3.960 0.000 0.421 1.282
Frequency_SQ -0.5336 0.218 -2.447 0.018 -0.970 -0.097
==============================================================================
Omnibus: 0.598 Durbin-Watson: 1.659
Prob(Omnibus): 0.742 Jarque-Bera (JB): 0.179
Skew: 0.077 Prob(JB): 0.914
Kurtosis: 3.199 Cond. No. 25.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As the p-value for "Revenue" is large, let's try removing this predictor from the model.
In [18]:
predictors_new = ['First_Purchase',
                  'Acq_Expense',
                  'Acq_Expense_SQ',
                  'Industry',
                  # 'Revenue',
                  'Employees',
                  'Ret_Expense',
                  'Ret_Expense_SQ',
                  'Crossbuy',
                  'Frequency',
                  'Frequency_SQ']
In [19]:
x_with_intercept = sm.add_constant(train[predictors_new], prepend=True)
ols = sm.OLS(train[response],x_with_intercept)
est = ols.fit()
print(est.summary())
OLS Regression Results
==============================================================================
Dep. Variable: CLV R-squared: 0.865
Model: OLS Adj. R-squared: 0.841
Method: Least Squares F-statistic: 36.39
Date: Mon, 20 Sep 2021 Prob (F-statistic): 3.94e-21
Time: 11:25:23 Log-Likelihood: -26.430
No. Observations: 68 AIC: 74.86
Df Residuals: 57 BIC: 99.27
Df Model: 10
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
———————————————————————————-
const 6.4117 0.047 135.628 0.000 6.317 6.506
First_Purchase 0.6666 0.144 4.626 0.000 0.378 0.955
Acq_Expense 0.2490 0.432 0.577 0.566 -0.615 1.113
Acq_Expense_SQ -0.5244 0.440 -1.192 0.238 -1.405 0.356
Industry 0.2754 0.050 5.486 0.000 0.175 0.376
Employees -0.1251 0.137 -0.915 0.364 -0.399 0.149
Ret_Expense 1.9576 0.282 6.941 0.000 1.393 2.522
Ret_Expense_SQ -1.3517 0.280 -4.822 0.000 -1.913 -0.790
Crossbuy 0.3310 0.049 6.715 0.000 0.232 0.430
Frequency 0.8509 0.213 3.993 0.000 0.424 1.278
Frequency_SQ -0.5323 0.216 -2.466 0.017 -0.964 -0.100
==============================================================================
Omnibus: 0.611 Durbin-Watson: 1.664
Prob(Omnibus): 0.737 Jarque-Bera (JB): 0.183
Skew: 0.074 Prob(JB): 0.912
Kurtosis: 3.207 Cond. No. 23.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
With "Revenue" removed, both AIC and BIC decreased. Now the p-value of "Acq_Expense" is quite high, so next we try removing this predictor as well.
In [20]:
predictors_new = ['First_Purchase',
                  # 'Acq_Expense',
                  'Acq_Expense_SQ',
                  'Industry',
                  # 'Revenue',
                  'Employees',
                  'Ret_Expense',
                  'Ret_Expense_SQ',
                  'Crossbuy',
                  'Frequency',
                  'Frequency_SQ']
In [21]:
x_with_intercept = sm.add_constant(train[predictors_new], prepend=True)
ols = sm.OLS(train[response],x_with_intercept)
est = ols.fit()
print(est.summary())
OLS Regression Results
==============================================================================
Dep. Variable: CLV R-squared: 0.864
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 40.86
Date: Mon, 20 Sep 2021 Prob (F-statistic): 7.00e-22
Time: 11:25:40 Log-Likelihood: -26.628
No. Observations: 68 AIC: 73.26
Df Residuals: 58 BIC: 95.45
Df Model: 9
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
———————————————————————————-
const 6.4117 0.047 136.415 0.000 6.318 6.506
First_Purchase 0.6720 0.143 4.701 0.000 0.386 0.958
Acq_Expense_SQ -0.2730 0.059 -4.626 0.000 -0.391 -0.155
Industry 0.2725 0.050 5.487 0.000 0.173 0.372
Employees -0.1297 0.136 -0.956 0.343 -0.401 0.142
Ret_Expense 1.9495 0.280 6.961 0.000 1.389 2.510
Ret_Expense_SQ -1.3460 0.279 -4.832 0.000 -1.904 -0.788
Crossbuy 0.3322 0.049 6.784 0.000 0.234 0.430
Frequency 0.8588 0.211 4.062 0.000 0.436 1.282
Frequency_SQ -0.5474 0.213 -2.570 0.013 -0.974 -0.121
==============================================================================
Omnibus: 0.616 Durbin-Watson: 1.654
Prob(Omnibus): 0.735 Jarque-Bera (JB): 0.164
Skew: 0.020 Prob(JB): 0.921
Kurtosis: 3.237 Cond. No. 13.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Both AIC and BIC decreased again, hence we should remove "Acq_Expense". Your task: try removing "Employees" as well (a sketch of one possible approach is given below).
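As a minimal sketch of that task (one possible approach, not necessarily the intended solution), drop "Employees" from the predictor list, refit the model and compare the AIC and BIC with those of the previous fit:
predictors_new = ['First_Purchase',
                  'Acq_Expense_SQ',
                  'Industry',
                  'Ret_Expense',
                  'Ret_Expense_SQ',
                  'Crossbuy',
                  'Frequency',
                  'Frequency_SQ']
x_with_intercept = sm.add_constant(train[predictors_new], prepend=True)
est = sm.OLS(train[response], x_with_intercept).fit()
print(est.aic, est.bic)  # compare with the AIC/BIC of the previous model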
Regularisation methods
Let's find the Lasso regression coefficients with the shrinkage parameter alpha = 0.01, for example.
See Lasso for how to use this function.
If the predictor data $X$ have not been standardised, you can set normalize=True and $X$ will be standardised before the Lasso regression (note that the normalize argument is deprecated in recent versions of scikit-learn, so standardising the data yourself, as above, is the safer option).
In [22]:
from sklearn import linear_model
Lasso_reg = linear_model.Lasso(alpha=0.01)
Lasso_reg.fit(train[predictors], np.ravel(train[response]))
pd.DataFrame(Lasso_reg.coef_.round(3), index = predictors).T
Out[22]:
First_Purchase Acq_Expense Acq_Expense_SQ Industry Revenue Employees Ret_Expense Ret_Expense_SQ Crossbuy Frequency Frequency_SQ
0 0.488 -0.0 -0.301 0.254 0.043 0.0 1.201 -0.617 0.328 0.452 -0.174
In [22]:
from sklearn import linear_model
Lasso_reg = linear_model.Lasso(alpha=0.02)
Lasso_reg.fit(train[predictors], np.ravel(train[response]))
pd.DataFrame(Lasso_reg.coef_.round(3), index = predictors).T
Out[22]:
First_Purchase Acq_Expense Acq_Expense_SQ Industry Revenue Employees Ret_Expense Ret_Expense_SQ Crossbuy Frequency Frequency_SQ
0 0.465 -0.0 -0.302 0.238 0.044 0.0 0.571 -0.0 0.319 0.25 -0.0
With $\alpha=0.01$, two predictors, "Acq_Expense" and "Employees", are removed.
With $\alpha=0.02$, four predictors, "Acq_Expense", "Employees", "Ret_Expense_SQ" and "Frequency_SQ", are removed.
How can we find the optimal $\alpha$?
The LassoCV function is the most convenient implementation, with built-in cross-validation-based selection of the tuning parameter $\alpha$.
In [23]:
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=10)
lasso.fit(train[predictors], np.ravel(train[response])) # np.ravel converts the response to a 1-D array, as scikit-learn expects
Out[23]:
LassoCV(cv=10)
In [24]:
print("LASSO Lambda: {0}".format(lasso.alpha_))
LASSO Lambda: 0.0032143930090217892
In [25]:
pd.DataFrame(lasso.coef_.round(3), index = predictors).T
Out[25]:
First_Purchase Acq_Expense Acq_Expense_SQ Industry Revenue Employees Ret_Expense Ret_Expense_SQ Crossbuy Frequency Frequency_SQ
0 0.525 0.0 -0.288 0.264 0.037 -0.012 1.694 -1.1 0.331 0.729 -0.435
Homework task:
Implement the BIC method (lecture slide 50) for selecting $\alpha$
In [ ]:
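A minimal sketch of one possible approach is shown below. It assumes the common Gaussian BIC formulation $n \log(\text{RSS}/n) + \log(n)\,\text{df}$, where df is the number of non-zero coefficients plus the intercept; the definition on lecture slide 50 may differ, so adapt accordingly.
# Sketch: select the Lasso shrinkage parameter alpha by BIC (assumed formulation, see above)
alphas = np.logspace(-4, 0, 50)                    # grid of candidate alpha values
y = np.ravel(train[response])
n = len(y)
bic_values = []
for alpha in alphas:
    fit = linear_model.Lasso(alpha=alpha).fit(train[predictors], y)
    rss = np.sum((y - fit.predict(train[predictors]))**2)
    df = np.sum(fit.coef_ != 0) + 1                # non-zero coefficients plus the intercept
    bic_values.append(n*np.log(rss/n) + np.log(n)*df)
best_alpha = alphas[int(np.argmin(bic_values))]
print(best_alpha)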