BUSS6002_MKBA_STUDENTID
Marketing and Business Analytics Individual Assignment¶
This assignment accounts for 22% of your final mark: 15% from the Marketing component and 7% from the Business Analytics component.
This is an individual assignment.
The assignment is due before 17:00 26th October 2018. The late penalty for the assignment is 10% of the assigned mark per day, starting after 5pm on the due date. The closing date 5pm 2nd November 2018 is the last date on which an assessment will be accepted for marking.
Please only include your student ID in the submitted report, and do NOT include your name.
Background¶
You are employed as a Data Scientist at a telecommunications company. One of the biggest issues facing the company is minimising customer churn. Customer churn is when a customer changes provider. The business is interested in analysing and predicting churn since the cost of acquiring new customers is higher than retaining existing customers.
This is particularly problematic for the telecommunications industry as changing telecommunications provider is relatively easy. There are also a large number of price competitive providers to choose from, which encourages churning.
Your job is to help the marketing department to undertake the following:
Investigate why and which customers churn
Discover retention opportunities and strategies
Identify current customers that are likely to churn so that the retention strategy can be applied
Identify potential customers so that the incentive strategy can be applied
Files¶
Part 1:
churn_survey.json
Part 2:
tweets.db
Submission Instructions¶
This assessment will be automatically marked. Any deviation from the stated output form will be marked as an incorrect answer. It is your responsibility to check that the output matches the given template or example for each question.
What to Submit¶
Submit only your .ipynb file. You can choose to use this file as a template OR start from scratch. Just make sure that running your notebook generates all the answers!
Filename¶
The filename must be “BUSS6002_MKBA_STUDENTID.ipynb”
Loading the data files when marking¶
We will run your notebook in the same directory as the data files. We will assume the original file names.
Output¶
The output for each question should be saved to the same directory as the notebook.
Checking the format of your output files¶
We have created Ed Challenges for each question. The challenges will tell you if the FORMAT of your output file is correct. They do not tell you if your answer is correct. Please test your output files on Ed before submitting your assignment.
Timeout¶
We will automatically run your notebook. Each notebook will be given a maximum of 1 minute to run. Please ensure that any model training or optimisation completes comfortably within this time frame.
Marking Criteria¶
Correctness of results
Set your Student ID here¶
In [ ]:
YOURSTUDENTID = ...
Helper Function¶
In [ ]:
# This function is used to save answers with a non-tabular output
def write_txt(student_id, part_number, data):
    # str() lets numeric answers (e.g. a tweet count) be written as well
    with open("ID_{0}_Q_{1}.txt".format(student_id, part_number), "w") as file:
        file.write(str(data))
Part 1 – Understanding Churn and Identifying Retention Strategies (20 Marks)¶
Data¶
The marketing team has collected information from a subset of current and past customers. Use the churn_survey.json file to understand why customers churn.
Data Dictionary¶
Variable Description
Churn Whether the customer churned or not
Contract The contract term of the customer
Dependents Whether the customer has dependents or not
DeviceProtection Whether the customer has purchased the device protection service or not
Gender Customer gender
InternetService Customer’s internet service type
MonthlyCharges The amount charged to the customer monthly
MultipleLines Whether the customer has multiple lines or not
OnlineBackup Whether the customer has purchased the additional online backup service or not
OnlineSecurity Whether the customer has purchased the additional online security service or not
PaperlessBilling Whether the customer has paperless billing or not
Partner Whether the customer has a partner or not
PaymentMethod The customer’s payment method
PhoneService Whether the customer has a phone service or not
SeniorCitizen Whether the customer is a senior citizen or not
StreamingMovies Whether the customer has purchased the additional streaming movie service or not
StreamingTV Whether the customer has purchased the additional streaming TV service or not
TechSupport Whether the customer has purchased the additional tech support service or not
Tenure Number of months the customer has stayed with the company
TotalCharges The total amount charged to the customer
In [ ]:
# Use this area to load the data
Questions¶
1.1 What is the most popular payment method? (1 Mark)¶
Output your answer as a .txt file containing the name of the most popular payment method.
FILENAME: ID_STUDENTID_Q_1_1.txt
In [ ]:
# WRITE YOUR CODE HERE
method_name = ...
# This will save your answer to a .txt file
write_txt(YOURSTUDENTID, "1_1", method_name)
1.2 What is the mean amount spent per month for the churn and non-churn customers? (1 Mark)¶
Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $ sign.
Churn MonthlyCharges
No 00.0000
Yes 00.0000
FILENAME: ID_STUDENTID_Q_1_2.csv
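As a minimal sketch of the layout above, assuming the survey has been loaded into a pandas DataFrame and that a comma-separated file is what the marker expects, the snippet below builds the two-column table from made-up rows. The values and the names `toy`, `means` and `csv_text` are illustrative only and are not taken from churn_survey.json.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from churn_survey.json.
toy = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "Yes"],
    "MonthlyCharges": [20.0, 30.0, 70.5, 89.5],
})

# Mean monthly charge per churn status, with Churn kept as a regular column.
means = toy.groupby("Churn")["MonthlyCharges"].mean().reset_index()

# float_format pads every number to four decimal places.
csv_text = means.to_csv(index=False, float_format="%.4f")
print(csv_text)
```

Passing a file name as the first argument of `to_csv` writes the same text to disk instead of returning it.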
In [ ]:
# WRITE YOUR CODE HERE
1.3 What is the standard deviation of amount spent per month for the churn and non-churn customers? (1 Mark)¶
Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $ sign.
Churn MonthlyCharges
No 00.0000
Yes 00.0000
FILENAME: ID_STUDENTID_Q_1_3.csv
In [ ]:
# WRITE YOUR CODE HERE
1.4 What is the percentage of contract type for the churn and non-churn customers? (2 Marks)¶
Output your answer as a .csv file with the following format to two decimal places. Do not include the % symbol.
Churn Month-to-month One year Two year
No 00.00 00.00 00.00
Yes 00.00 00.00 00.00
This percentage should be relative to the churn status, NOT the entire sample, i.e. the top left cell is the percentage of non-churning customers who are on month-to-month contracts, so each row sums to 100 up to rounding.
FILENAME: ID_STUDENTID_Q_1_4.csv
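One way to obtain percentages that are relative to the churn status is a row-normalised cross-tabulation. The sketch below assumes the data is in a pandas DataFrame with `Churn` and `Contract` columns; the six toy rows and the names `toy` and `pct` are hypothetical and only illustrate the normalisation.

```python
import pandas as pd

# Hypothetical toy rows, not taken from churn_survey.json.
toy = pd.DataFrame({
    "Churn": ["No", "No", "No", "No", "Yes", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "Two year",
                 "Month-to-month", "Month-to-month"],
})

# normalize="index" divides each row by its own total,
# so the shares are relative to the churn status rather than the whole sample.
pct = pd.crosstab(toy["Churn"], toy["Contract"], normalize="index") * 100
pct = pct.round(2)
print(pct)
```

Using `normalize="all"` instead would give percentages of the entire sample, which is not what the question asks for.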
In [ ]:
# WRITE YOUR CODE HERE
1.5 Which feature and value is most correlated with MonthlyCharges? (2.5 Marks)¶
Output your answer as a .csv file with the following format containing the most correlated feature name and value.
Feature Value
FEATURE_NAME FEATURE_VALUE
FILENAME: ID_STUDENTID_Q_1_5.csv
In [ ]:
# WRITE YOUR CODE HERE
1.6 What is the count, mean, std, min, quartiles and max of time before a customer churns? (2.5 Marks)¶
Output your result as a two column .csv with the following format to four decimal places
Tenure
count 0.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
FILENAME: ID_STUDENTID_Q_1_6.csv
In [ ]:
# WRITE YOUR CODE HERE
1.7 What is the proportion of purchase for each account addon for male and female customers? (4 Marks)¶
Output your result as a .csv with the following format to four decimal places
Gender ADDON1 ADDON2 …
Female 0.0000 0.0000 ..
Male 0.0000 0.0000 ..
Please use the original name of the addon from the data. You must use your understanding of the data and the problem to determine where you can find this information in the dataset.
FILENAME: ID_STUDENTID_Q_1_7.csv
In [ ]:
# WRITE YOUR CODE HERE
1.8 Of the listed account addons, which addon/addons could be offered to churning customers for free or at a discounted rate in order to best retain them? (6 Marks)¶
Output your file as a single column .csv with the following format
Addon
ADDONX
ADDONY
…
where ADDONX is the name of one addon that you suggest. You must suggest at least 1 account addon, up to the total number of addons listed in the dataset. You must exercise your best judgement and use supporting evidence from the data to obtain a list of suggested addons. These addons should reflect the interests of the churning customers, i.e. which addons they actually care about.
FILENAME: ID_STUDENTID_Q_1_8.csv
In [ ]:
# WRITE YOUR CODE HERE
Part 2 – Churn Intervention (24 Marks)¶
The marketing team wants to know if it is possible to identify customers on social media before they churn, and if they can identify potential customers that want to move away from their existing provider. If a customer is identified before they churn, then the retention strategy that you developed in Part 1.8 can be applied to stop the customer leaving.
The marketing team has outsourced the collection of data and labels. The data was collected from Twitter and includes tweets referencing your company and competitors. The data is available in the tweets.db SQLite file.
To achieve the goals of the project you will need to do some EDA to understand the nature of the data and attempt to build a classifier to predict if an individual is likely to churn based on what they wrote in their tweet.
Data¶
Schema¶
The schema for the tweets.db file is below:
churn
Column Description
tid Tweet ID
churn Churn status
set Training or Hidden
tweets
Column Description
tid Tweet ID
uid User ID
date Datetime of the tweet
text Content of the tweet
Training and Hidden Sets¶
The data has been divided into two sets:
Set Tweets Target
Training Yes Yes
Hidden Yes No
The churn labels for the training set have been made available. However, the marketing team wants to know how well your classifier will work on future and unseen data before deploying it. They will assess your classification performance on the hidden set.
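The two tables can be combined on `tid` to pull out the tweets of one set. The sketch below uses an in-memory stand-in for tweets.db with the table and column names from the schema; the column types, the toy rows, and the assumption that the `set` column stores the strings 'Training' and 'Hidden' are guesses that should be checked against the real file.

```python
import sqlite3

# In-memory stand-in for tweets.db (hypothetical types and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (tid INTEGER, uid INTEGER, date TEXT, text TEXT);
    CREATE TABLE churn (tid INTEGER, churn INTEGER, "set" TEXT);
    INSERT INTO tweets VALUES (1, 10, '2018-10-01', 'toy tweet one');
    INSERT INTO tweets VALUES (2, 11, '2018-10-02', 'toy tweet two');
    INSERT INTO churn VALUES (1, 1, 'Training');
    INSERT INTO churn VALUES (2, NULL, 'Hidden');
""")

# "set" is an SQL keyword, so the column name is double-quoted in the query.
query = """
    SELECT t.tid, t.text, c.churn
    FROM tweets AS t
    JOIN churn AS c ON c.tid = t.tid
    WHERE c."set" = ?
"""
training_rows = conn.execute(query, ("Training",)).fetchall()
conn.close()
print(training_rows)
```

For the real data, `sqlite3.connect("tweets.db")` would replace the in-memory connection, and inspecting the distinct values of the `set` column first confirms which labels are actually used.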
2.1 How many tweets in the training set contain at least one of the strings “AT&T”, “Verizon” or “T-Mobile” (1.5 Marks)¶
Output the number of tweets as an integer to a .txt file. Your search should be invariant to capitalisation.
FILENAME: ID_STUDENTID_Q_2_1.txt
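A search that is invariant to capitalisation can be sketched by lower-casing both the tweet and the search strings before comparing. The tweets below are made up, and `mentions_any` is a hypothetical helper name rather than part of the assignment template.

```python
# Hypothetical tweet texts, used only to illustrate the matching rule.
toy_tweets = [
    "Thinking of leaving at&t soon",
    "VERIZON coverage is great here",
    "No carrier mentioned in this one",
    "t-mobile or Verizon, cannot decide",
]
providers = ["AT&T", "Verizon", "T-Mobile"]

def mentions_any(text, keywords):
    """Return True if text contains at least one keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

# Each tweet is counted once, even if it mentions several providers.
number_matching = sum(mentions_any(t, providers) for t in toy_tweets)
print(number_matching)
```

Replacing `any` with `all` gives the stricter rule in which every search string must appear in the tweet.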
In [ ]:
# WRITE YOUR CODE HERE
number_tweets = ...
# This will save your answer to a .txt file
# str() is needed because file.write only accepts strings
write_txt(YOURSTUDENTID, "2_1", str(number_tweets))
2.2 Find all tweets in the training set containing the strings “AT&T”, “Verizon” and “switch” (2.5 Marks)¶
Output the tweets as a two column .csv file with the following format:
tid text
tweet_id1 text1
tweet_id2 text2
tweet_id3 text3
… …
The first column should be the tweet id and the second column should be the original text of the tweet. Your search should be invariant to capitalisation.
FILENAME: ID_STUDENTID_Q_2_2.csv
In [ ]:
# WRITE YOUR CODE HERE
2.3 Identify Churning Customers via Logistic Regression (Total 10 Marks)¶
Train a Logistic Regression Classifier to identify tweets from churning customers
Requirements
The original features must be the tweet text data
Use dimension reduction to reduce the dimensionality of the problem. In class you have learnt about PCA. However PCA will not work for TF or TF-IDF data as it is sparse. You must find an alternative method in scikit-learn that works with sparse data.
Maximum of 5 components
In Q2.3.5 your marks will be assigned based on your classifier's performance on the hidden set. Make sure you tune your model thoroughly in section Q2.3.3.
2.3.1 Transform Features (1.5 Marks)¶
Given the original text data, use an sklearn vectoriser to convert the text to a numeric representation.
Output your fitted vectoriser as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_1.pickle
In [ ]:
# WRITE YOUR CODE HERE
In [ ]:
# This code will save your Transformer/Vectoriser object to a file
import pickle
filename = "ID_{0}_Q_2_3_1.pickle".format(YOURSTUDENTID)
# MYTRANSFORMEROBJECT must be a sklearn transformer or vectoriser
with open(filename, "wb") as f:
    pickle.dump(MYTRANSFORMEROBJECT, f)
2.3.2 Dimension Reduction (1.5 Marks)¶
Reduce the dimensionality of your features to a maximum of 5 components.
Output your fitted dimensionality reducing object as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_2.pickle
In [ ]:
# WRITE YOUR CODE HERE
In [ ]:
# This code will save your Dimensionality Reducer object to a file
import pickle
filename = "ID_{0}_Q_2_3_2.pickle".format(YOURSTUDENTID)
# MYREDUCEROBJECT must be a valid dimensionality reducer from sklearn
with open(filename, "wb") as f:
    pickle.dump(MYREDUCEROBJECT, f)
2.3.3 Tuning (2 Marks)¶
Tune your model hyper-parameters for best performance. Make sure to tune thoroughly!
Output your fitted GridSearchCV or RandomizedSearchCV object as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_3.pickle
In [ ]:
# WRITE YOUR CODE HERE
In [ ]:
# This code will save your GridSearchCV or RandomizedSearchCV to a file
import pickle
filename = "ID_{0}_Q_2_3_3.pickle".format(YOURSTUDENTID)
# MYGRIDSEARCHOBJECT must be GridSearchCV or RandomizedSearchCV
with open(filename, "wb") as f:
    pickle.dump(MYGRIDSEARCHOBJECT, f)
2.3.4 Output Model (1 Mark)¶
Output your trained logistic regression model as a pickle file. In the next part you will be competing against other students. So make sure you tune your model as best you can!
FILENAME: ID_STUDENTID_Q_2_3_4.pickle
In [ ]:
# WRITE YOUR CODE HERE
In [ ]:
# This code will save your LogisticRegression to a file
import pickle
filename = "ID_{0}_Q_2_3_4.pickle".format(YOURSTUDENTID)
# MYLOGISTICREGRESSION must be of type sklearn.linear_model.LogisticRegression
with open(filename, "wb") as f:
    pickle.dump(MYLOGISTICREGRESSION, f)
2.3.5 Predicting Churn for the Hidden Customers (4 Marks)¶
We will assign marks to this question based on the relative performance of each student's classifier. You must try to tune your classifier in Question 2.3 as best you can!
Output your predictions as a two column .csv file with the following format:
tid Churn
tweet_id1 0
tweet_id2 1
tweet_id3 0
… …
where the Churn column holds the predicted class, i.e. 1 is “Churn” and 0 is “Not churn”.
FILENAME: ID_STUDENTID_Q_2_3_5.csv
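The two-column layout can be produced with the standard csv module, as in the sketch below. It assumes a comma-separated file with a header row is what the marker expects; the tweet IDs and predictions are made up, and the names `tids`, `preds` and `csv_text` are illustrative.

```python
import csv
import io

# Hypothetical tweet IDs and predicted labels (1 = churn, 0 = not churn).
tids = [101, 102, 103]
preds = [0, 1, 0]

# An in-memory buffer keeps the sketch free of side effects;
# open(filename, "w", newline="") would write to disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["tid", "Churn"])
writer.writerows(zip(tids, preds))
csv_text = buffer.getvalue()
print(csv_text)
```

Checking the resulting file against the Ed Challenge for this question confirms whether the delimiter and header match the expected format.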
In [ ]:
# WRITE YOUR CODE HERE
2.4 Prediction Competition (Total 10 Marks)¶
We will assign marks to this question based on the relative performance of each student's classifier.
Your goal is to build the most accurate classification pipeline for the hidden data. You should do your own research to find suitable preprocessing steps and classifier. You are allowed to use any preprocessing you like and any sklearn compatible classifier i.e. it must support the following functions:
fit
predict
You must output your classifier (as a pickle file) and predictions (as csv) using the format from Question 2.3.4 and 2.3.5.
Good luck!
FILENAMES:
ID_STUDENTID_Q_2_4_1.pickle
ID_STUDENTID_Q_2_4_1.csv
In [ ]:
# WRITE YOUR CODE HERE