Assignment 9: Clustering Customers of a Comic Book Store
In this assignment, you will be solving a traditional problem in quantitative marketing: customer segmentation. Having a properly segmented database is extremely important for defining marketing campaigns, as it allows companies to define value-centric actions targeted towards customers of different profiles. While there are several ways to cluster customers, in this example we will use the Recency, Frequency and Monetary Value (RFM) paradigm. This way of thinking about customer data reflects the engagement between a customer and a company by reducing their interactions to three values:
The Recency of interactions: The time between two subsequent purchases or, more generally, between two interactions between the customer and your organization.
The Frequency of interactions: The raw number of interactions over a predefined period. This can be, for example, how many times a customer visits your website every month, how many purchases the customer makes at your store, etc.
The Monetary Value of the interactions: The total monetary value (not necessarily positive) of the customer's interactions with your organization over the same period as before.
Additionally, this dataset has a Cost of Service (CoS) variable (not included in the MV calculations for this example), which shows how much each interaction with the customer costs. This can be useful information, as a customer may make purchases of small monetary value but spend many hours at the store occupying the service personnel's time; on average, such customers may even end up being a net cost to the company! The information comes from a local comic book store and summarizes the interactions of customers holding a loyalty card.
In this assignment, we will cluster the customers using these four variables and derive a commercial strategy from our results.
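For reference, here is a minimal sketch of how such a table could be built from a raw transaction log. The log and its columns (customer_id, date, amount, service_cost) are purely hypothetical; the assignment's CSV already comes aggregated per customer.
import pandas as pd

# Hypothetical transaction log (illustrative only)
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2021-01-05', '2021-01-20', '2021-02-10',
                            '2021-01-02', '2021-03-01']),
    'amount': [12.5, 30.0, 8.0, 55.0, 5.0],
    'service_cost': [2.0, 3.5, 1.0, 6.0, 0.5],
})

# One row per customer with the four summary variables
rfm_sketch = transactions.sort_values('date').groupby('customer_id').agg(
    Recency=('date', lambda d: d.diff().dt.days.mean()),  # mean days between visits
    Frequency=('date', 'count'),                          # number of interactions
    MV=('amount', 'sum'),                                 # total monetary value
    CoS=('service_cost', 'sum'),                          # total cost of service
)
print(rfm_sketch)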
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
from sklearn.decomposition import PCA
%matplotlib inline
# Uncomment if working in the cloud
#!gdown https://drive.google.com/uc?id=1VL-LjrjgCtGWkDw914MVLj2sEttlL2Uv
Task 1: Studying the data [10 pts]
Import the data and present the descriptive statistics of all variables. Written answer: What can you say about the variables you have? Why should you normalize the data? Normalize the data so you can create clusters.
# Read the data
RFM_data = pd.read_csv('RFM_Assignment_09.csv')
RFM_data.describe()
Recency Frequency MV CoS
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 30.116293 17.164128 8806.972152 8805.228492
std 17.742131 29.419952 3391.111080 3285.384316
min -4.942628 -1.237633 760.115629 944.136187
25% 18.221827 3.710640 8065.394231 9957.361435
50% 28.588186 6.030687 10462.416975 10000.463454
75% 35.880252 9.999136 10608.634596 10041.297949
max 67.258701 102.189321 11139.504803 11054.536104
Written answer: The variables are on very different scales (MV and CoS are in the thousands, while Recency and Frequency are in the tens), and Recency and Frequency even take slightly negative minimum values, which suggests some noise in the data. The data is not normalized, so we should apply a transformation to bring all variables to the same scale. Otherwise, the clustering method will be dominated by the variables with the largest magnitudes, which is not what we want.
# Normalize and apply to the data
normalizer = StandardScaler()
RFM_data = pd.DataFrame(normalizer.fit_transform(RFM_data), index=RFM_data.index, columns=RFM_data.columns)
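A quick sanity check: after StandardScaler, every column should have (approximately) zero mean and unit standard deviation.
# Verify the transformation: means ~0 and standard deviations ~1
RFM_data.describe().round(2)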
Task 2: K-Means Clustering and Silhouette Analysis [40 pts]
Now we can perform the cluster analysis. The single most important question in cluster analysis is determining the number of clusters to create. Following the labs (or this tutorial), try between 3 and 8 clusters, using a seed of 10, and plot their silhouette analyses (we will plot the clusters themselves in the next question). Written answer: What is the optimal number of clusters using the silhouette method?
range_n_clusters = [3, 4, 5, 6, 7, 8]
for n_clusters in range_n_clusters:
    # Create a figure with a single axis for the silhouette plot
    fig, ax1 = plt.subplots(1, 1)
    ax1.set_xlim([0, 1])
    # The (n_clusters + 1) * 10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(RFM_data) + (n_clusters + 1) * 10])

    # Initialize the clusterer with the n_clusters value and a random
    # generator seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(RFM_data)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the
    # formed clusters.
    silhouette_avg = silhouette_score(RFM_data, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is : %.4f" % silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(RFM_data, cluster_labels)

    y_lower = 10
    # Iterate over the clusters
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot (10 blank rows between clusters)
        y_lower = y_upper + 10

    ax1.set_title("Silhouette plot for the various clusters (n_clusters=%i)." % n_clusters)
    ax1.set_xlabel("Silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # Vertical line marking the average silhouette score over all samples
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1])
For n_clusters = 3 The average silhouette_score is : 0.7441
For n_clusters = 4 The average silhouette_score is : 0.7899
For n_clusters = 5 The average silhouette_score is : 0.7249
For n_clusters = 6 The average silhouette_score is : 0.6557
For n_clusters = 7 The average silhouette_score is : 0.6513
For n_clusters = 8 The average silhouette_score is : 0.6143
Written answer: We want clusters of similar width, with silhouettes extending beyond the average. It is a hard choice in this case, as 3, 4 and 5 clusters could all be argued as relevant. However, 3 and 4 clusters lead to one very large cluster, while 5 leads to clusters that fall far from the average. Note that 4 clusters also attains the highest average silhouette score (0.7899). The next question will help shed some light on this.
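As a compact summary of the numbers above, the sketch below recomputes the average silhouette score for each number of clusters and plots it, making the peak at 4 clusters easy to see. It reuses only objects already defined in this notebook.
# Recompute and plot the average silhouette score per number of clusters
avg_scores = []
for n_clusters in range_n_clusters:
    labels = KMeans(n_clusters=n_clusters, random_state=10).fit_predict(RFM_data)
    avg_scores.append(silhouette_score(RFM_data, labels))

plt.plot(range_n_clusters, avg_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Average silhouette score')
plt.title('Average silhouette score per number of clusters')
plt.show()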
Task 3: Plotting the clusters [25 pts]
Now we will visualize what we just did. For this we will use a common trick in clustering: apply a PCA transform to reduce the data to a few variables (two or three) and plot those. Apply a PCA transform with two components to the data and create a scatterplot, using a different colour for each of the clusters from the previous answer. Note that the clusters must still be calculated over the unrotated data. Use only your results for 3, 4 and 5 clusters. Written answer: How many clusters would you use considering the results of Task 2 and these plots?
# Calculate the PCA projection (for plotting only; clusters are fit on the unrotated data)
PCA_transformer = PCA(2)
PCA_data = PCA_transformer.fit_transform(RFM_data)
# For three clusters.
n_clusters = 3
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(RFM_data)

for i in range(n_clusters):
    color = cm.nipy_spectral(float(i) / n_clusters)
    plt.scatter(PCA_data[cluster_labels == i, 0], PCA_data[cluster_labels == i, 1],
                color=color, label='Cluster %i' % (i + 1))

plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('PCA-transformed plot for %i clusters' % n_clusters)
plt.legend()
plt.show()
# For four clusters.
n_clusters = 4
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(RFM_data)

for i in range(n_clusters):
    color = cm.nipy_spectral(float(i) / n_clusters)
    plt.scatter(PCA_data[cluster_labels == i, 0], PCA_data[cluster_labels == i, 1],
                color=color, label='Cluster %i' % (i + 1))

plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('PCA-transformed plot for %i clusters' % n_clusters)
plt.legend()
plt.show()
# For five clusters.
n_clusters = 5
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(RFM_data)

for i in range(n_clusters):
    color = cm.nipy_spectral(float(i) / n_clusters)
    plt.scatter(PCA_data[cluster_labels == i, 0], PCA_data[cluster_labels == i, 1],
                color=color, label='Cluster %i' % (i + 1))

plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('PCA-transformed plot for %i clusters' % n_clusters)
plt.legend()
plt.show()
Written answer: The PCA plots are fairly clear: there are four distinct, well-defined clusters, so the optimal number is four. Note that a business (but not technical) argument could be made for five clusters, in order to target the two different groups within the bottom-left cluster independently, but there is no low-density area that justifies treating these two groups as separate clusters.
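One caveat worth checking: a 2-D PCA view is only as trustworthy as the variance it retains. A quick check on the transformer fitted above:
# Share of the total variance captured by each of the two components
print(PCA_transformer.explained_variance_ratio_)
print('Total retained: %.2f' % PCA_transformer.explained_variance_ratio_.sum())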
Task 4: Deploying the model [25 pts]
The objective of any cluster analysis over customer data is to create a reasonable segmentation of your customers. Using the number of clusters you decided on in Task 3 and a table of the per-cluster averages of each variable, name the different clusters and think about what a company would do with a customer in each cluster. (Hint: For example, a cluster with high frequency, low recency, and low monetary value contains desirable customers for whom it would be a good plan to try to increase their monetary value while keeping their high engagement. You could name them "Diamonds in the rough". Pandas' groupby can probably help.)
# For four clusters.
n_clusters = 4
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
RFM_data['KMeans_Clusters'] = clusterer.fit_predict(RFM_data)

# Table with the (standardized) averages per variable for each cluster
RFM_data.groupby(['KMeans_Clusters']).mean()
Recency Frequency MV CoS
KMeans_Clusters
0 -0.248960 -0.363576 0.540266 0.363838
1 -1.576063 2.815243 -2.258159 -2.375193
2 1.674413 -0.554165 -0.235917 0.668710
3 1.614325 1.122078 -2.339649 -2.349483
Written answer: A potential interpretation of this output follows.
Cluster 0 has low recency, low frequency, a high monetary value and a medium cost of service. These are valuable customers who do not engage with the company very much (although when they do, they come in short bursts), so it would be a good idea to target them with actions aimed at increasing their engagement. These are the "growth potential" customers.
Cluster 1 has the lowest recency, a very high frequency, a very low monetary value, but also a very low cost of service. These are customers who engage with the company very often, leaving very little profit but also not costing much. For them, increasing value should be the core organizational goal: they are the ones who can become loyal, highly profitable customers after a successful upselling campaign.
Cluster 2 has the highest recency, a low frequency, a below-average monetary value and a high cost of service. These customers are complex: they do not visit the organization often, but when they do they are unprofitable and carry a high service cost. They could be targeted with low-cost activities aimed at increasing their value (but NOT their engagement, as it can be unprofitable), or one could even consider not targeting them at all.
Cluster 3 also has a high recency, now paired with a very high frequency, although with the lowest monetary value, again paired with a low cost of service. These are customers who necessarily need to be targeted with value-increasing measures: they bring very little value to the organization, but at the same time almost no cost. They also visit the organization very often, so there are many opportunities to apply such measures.
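To sanity-check these names against real units, a possible final step (a sketch, assuming normalizer is the StandardScaler fitted in Task 1 and clusterer the 4-cluster KMeans from the cell above) is to map the cluster centres back to the original scale and attach the cluster sizes:
# Cluster centres back in the original units, plus cluster sizes
centers = pd.DataFrame(normalizer.inverse_transform(clusterer.cluster_centers_),
                       columns=['Recency', 'Frequency', 'MV', 'CoS'])
centers['Size'] = RFM_data['KMeans_Clusters'].value_counts().sort_index().values
centers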