MS6711 Data Mining
Exercise 4
1. Briefly explain the roles of the three important components of the time frame for predictive models.
2. One basic rule of predictive modelling is that all data used as inputs must occur earlier in time than any of the data used to create the outputs. Explain what will happen if this rule is violated.
3. Suppose that you are asked to build a model to classify credit card customers' behaviour over the next 3 months into one of three segments: revolvers are cardholders who maintain large outstanding balances and pay lots of interest; transactors are cardholders who pay off their bills every month and hence pay no interest; and convenience users are cardholders who charge a lot and then pay off the balance over several months. You have the past 24 months of data at hand. Assume that it will take you 3 weeks to develop the model and that the data for a month will not be available until the first week of the following month. Describe briefly how you would divide the data into the inputs and the output of your model.
4. Churn is generally the action of a customer leaving the company for some reason. Customers can leave for many different reasons, and we can broadly categorize churn by who initiates the action, the company or the customer. We call it voluntary churn if the customer initiates the action; if the company decides to terminate its service to the customer, it is called involuntary churn. Suppose you are now in the month of February and you are asked to predict which customers are likely to churn voluntarily in April. You have been given the historical data of the customers from February of last year to January of the current year. Describe briefly how you would divide the data into the inputs and the output of your model.
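As a rough illustration of the data division that Questions 3 and 4 ask for, the sketch below splits an ordered list of monthly snapshots into the three time-frame components of a predictive model: an input window, a latency period (covering model-building time and the delay before a month's data becomes available), and a target window. The month labels and the particular window lengths are illustrative assumptions only; choosing suitable lengths is exactly what the two questions ask you to justify.

```python
def split_time_frame(months, input_len, latency_len, target_len):
    """Split chronologically ordered month labels (oldest first) into the
    three time-frame components of a predictive model:
      - inputs : months whose data feed the model inputs,
      - latency: months skipped to allow for model building and the delay
                 before a month's data becomes available,
      - target : months whose behaviour defines the output to be predicted.
    """
    assert input_len + latency_len + target_len <= len(months)
    target = months[-target_len:]
    latency = months[-(latency_len + target_len):-target_len]
    inputs = months[-(input_len + latency_len + target_len):-(latency_len + target_len)]
    return inputs, latency, target


# Illustrative 12-month history as in Question 4 (labels only, no real data).
history = ["Feb", "Mar", "Apr", "May", "Jun", "Jul",
           "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"]
inputs, latency, target = split_time_frame(history, input_len=9,
                                            latency_len=2, target_len=1)
print("inputs :", inputs)   # months used to build the input variables
print("latency:", latency)  # gap for model building / data availability
print("target :", target)   # month whose behaviour is the model output
```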
5. Suppose that a model for predicting whether a customer will respond to a mailing campaign has been developed, with the following results for 10 records:
ID    Predicted-Yes probability    Predicted-No probability
 1    0.7                          0.3
 2    0.2                          0.8
 3    0.1                          0.9
 4    0.3                          0.7
 5    0.7                          0.3
 6    0.2                          0.8
 7    0.1                          0.9
 8    0.7                          0.3
 9    0.4                          0.6
10    0.0                          1.0
(a) Classify each record as either 'Yes' or 'No' according to the following profit matrix so that the total expected profit is maximized (see the sketch after this question).
                    Predicted
                    Yes    No
    Actual  Yes      1      0
            No       0      1
(b) Classify each record as either 'Yes' (the target level) or 'No' using a cutoff value of 0.5. Are the results different from those in Part (a)?
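A minimal sketch of the decision rules behind this question, assuming the probabilities in the table above: Part (a) picks, for each record, the prediction with the higher expected profit under the given profit matrix, while Part (b) simply compares the Predicted-Yes probability against a cutoff. The profit matrix is passed in as an argument, so the same helper can be reused with the asymmetric matrix of Question 6.

```python
def classify_by_expected_profit(p_yes, profit):
    """profit[actual][predicted] gives the profit of a prediction.
    Expected profit of predicting 'Yes' = p_yes * profit['Yes']['Yes']
                                        + (1 - p_yes) * profit['No']['Yes'],
    and similarly for predicting 'No'."""
    ep_yes = p_yes * profit["Yes"]["Yes"] + (1 - p_yes) * profit["No"]["Yes"]
    ep_no = p_yes * profit["Yes"]["No"] + (1 - p_yes) * profit["No"]["No"]
    return "Yes" if ep_yes >= ep_no else "No"


def classify_by_cutoff(p_yes, cutoff=0.5):
    return "Yes" if p_yes >= cutoff else "No"


# Predicted-Yes probabilities for records 1-10 from the table above.
p = {1: 0.7, 2: 0.2, 3: 0.1, 4: 0.3, 5: 0.7,
     6: 0.2, 7: 0.1, 8: 0.7, 9: 0.4, 10: 0.0}

# Symmetric profit matrix of Part (a); swap in the matrix of Question 6 to redo that part.
profit_a = {"Yes": {"Yes": 1, "No": 0}, "No": {"Yes": 0, "No": 1}}

for rec_id, p_yes in p.items():
    print(rec_id, classify_by_expected_profit(p_yes, profit_a),
          classify_by_cutoff(p_yes))
```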
6. Refer to Question 5. Classify each record as either 'Yes' or 'No' according to the following profit matrix so that the total expected profit is maximized.
                    Predicted
                    Yes    No
    Actual  Yes      2      0
            No      -1      0
7. A classification model has been trained to predict whether a customer will respond to a marketing campaign. Among the 10,000 records used for training, 4,000 customers responded to the campaign. The performance of the model is recorded as follows:
Percentile    Responders in each percentile
    10        750
    20        700
    30        650
    40        600
    50        450
    60        260
    70        210
    80        180
    90        150
   100         50
At each decile, compute the cumulative captured response rate, cumulative response rate, and cumulative lift value.
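A minimal sketch of the decile computations in Question 7, assuming the percentiles are ten equal deciles of 1,000 records each and that the overall response rate is 4,000/10,000. Cumulative captured response is the share of all responders found so far, cumulative response rate is responders so far divided by records so far, and cumulative lift is that rate divided by the overall rate.

```python
responders_per_decile = [750, 700, 650, 600, 450, 260, 210, 180, 150, 50]
records_per_decile = 1_000                       # 10 equal deciles of 10,000 records
total_responders = sum(responders_per_decile)    # 4,000
overall_rate = total_responders / 10_000

cum_resp = 0
for decile, resp in enumerate(responders_per_decile, start=1):
    cum_resp += resp
    cum_records = decile * records_per_decile
    captured = cum_resp / total_responders       # cumulative captured response
    cum_rate = cum_resp / cum_records            # cumulative response rate
    lift = cum_rate / overall_rate               # cumulative lift
    print(f"{decile * 10:>3}%  captured={captured:.3f}  "
          f"rate={cum_rate:.3f}  lift={lift:.2f}")
```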
8. Refer to Question 7.
(a) Construct a confusion matrix for a naïve model.
(b) If all of the observations in the top five deciles (the top 50%) are classified as responders, construct a confusion matrix for this data mining model.
(c) Compare the classification accuracy of the naïve model with that of the data mining model described in (b).
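A minimal sketch of how the confusion matrices in Parts (a) and (b) fall out of the decile table, under the assumption that the naïve model assigns every record to the majority class (non-responder) and that "top 50%" means the first five deciles, i.e. 5,000 records; everything else follows by subtraction from the totals.

```python
total_records = 10_000
total_responders = 4_000
responders_per_decile = [750, 700, 650, 600, 450, 260, 210, 180, 150, 50]

# Data mining model: classify the first five deciles (5,000 records) as responders.
classified_yes = 5 * 1_000
true_pos = sum(responders_per_decile[:5])        # responders found in the top 50%
false_pos = classified_yes - true_pos
false_neg = total_responders - true_pos
true_neg = total_records - true_pos - false_pos - false_neg

print("TP, FP, FN, TN:", true_pos, false_pos, false_neg, true_neg)
accuracy = (true_pos + true_neg) / total_records
print("accuracy of the data mining model:", accuracy)

# Naive model (assumed): predict the majority class, "no response", for everyone.
naive_accuracy = (total_records - total_responders) / total_records
print("accuracy of the naive model:", naive_accuracy)
```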
9. Using the data in the following table (target level: A), construct a confusion matrix.
Actual class   Predicted class     Actual class   Predicted class     Actual class   Predicted class
A              B                   B              B                   B              B
A              B                   A              B                   B              B
B              B                   A              B                   B              A
B              A                   B              A                   B              B
A              B                   B              A                   B              B
10. Refer to Question 9. Compute the sensitivity, specificity, and classification accuracy. (Sensitivity = correctly predicted positives / total actual positives; specificity = correctly predicted negatives / total actual negatives.)
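A minimal sketch for checking Questions 9 and 10, assuming the (Actual, Predicted) pairs below are read row by row from the table above and that A is the positive (target) level.

```python
# (actual, predicted) pairs from the table in Question 9; A is the target level.
pairs = [
    ("A", "B"), ("B", "B"), ("B", "B"),
    ("A", "B"), ("A", "B"), ("B", "B"),
    ("B", "B"), ("A", "B"), ("B", "A"),
    ("B", "A"), ("B", "A"), ("B", "B"),
    ("A", "B"), ("B", "A"), ("B", "B"),
]

tp = sum(1 for a, p in pairs if a == "A" and p == "A")
fn = sum(1 for a, p in pairs if a == "A" and p == "B")
fp = sum(1 for a, p in pairs if a == "B" and p == "A")
tn = sum(1 for a, p in pairs if a == "B" and p == "B")

sensitivity = tp / (tp + fn)        # correctly predicted positives / actual positives
specificity = tn / (tn + fp)        # correctly predicted negatives / actual negatives
accuracy = (tp + tn) / len(pairs)
print(tp, fn, fp, tn, sensitivity, specificity, accuracy)
```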
11. Consider the following classification score for each object:
Target    Score
  0       0.3580
  1       0.6034
  1       0.9444
  0       0.0481
  0       0.5605
  0       0.3910
  1       0.4040
  1       0.8266
  1       0.8700
  0       0.4946
If 1 is the target level of interest, what is the area under the ROC curve?
12. A classification model was applied to a training data set with 2,250 records. Suppose 1,000 of these belong to the level of interest. If the rank sum of the classification scores of these 1,000 records is 1,500,000, what will be the area under the ROC curve of the classification model? What is the meaning of the area under the ROC curve in this context?
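Questions 11 and 12 both rest on the rank-sum (Wilcoxon-Mann-Whitney) relation AUC = (R1 - n1(n1 + 1)/2) / (n1 n0), where R1 is the sum of the ranks of the n1 target-level scores among all n1 + n0 scores. The sketch below applies that relation to the score list of Question 11; ties, of which there are none here, would be given average ranks.

```python
def auc_from_ranks(targets, scores):
    """Area under the ROC curve via the rank-sum relation:
    AUC = (R1 - n1*(n1 + 1)/2) / (n1 * n0),
    where R1 is the rank sum of the target-level (positive) scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: rank for rank, idx in enumerate(order, start=1)}
    n1 = sum(targets)
    n0 = len(targets) - n1
    r1 = sum(ranks[i] for i, t in enumerate(targets) if t == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)


# Targets and scores from the table in Question 11 (target level = 1).
targets = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
scores = [0.3580, 0.6034, 0.9444, 0.0481, 0.5605,
          0.3910, 0.4040, 0.8266, 0.8700, 0.4946]
print(auc_from_ranks(targets, scores))

# The same relation applies directly to Question 12, where n1, n0 and the
# rank sum R1 are given instead of the individual scores.
```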
13. Suppose that you can adjust the cutoff value of a model developed for classifying records as fraudulent. Describe how moving the cutoff up or down would affect
(a) the classification error rate for records that are truly fraudulent;
(b) the classification error rate for records that are truly non-fraudulent.
14. Suppose that in an undersampled data set, 1,000 records are in class A and 500 records are in class B (the target level). If the population proportions for A and B are actually 0.95 and 0.05 respectively, what are the adjusted numbers of records for each class in the population? Assume that all records of class B are included in the sample.
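A minimal sketch of the arithmetic behind Question 14, under the stated assumption that every class-B record is in the sample, so the sampled B count already equals the population B count and the A count is scaled up to restore the 0.95 : 0.05 proportion.

```python
sample_a, sample_b = 1_000, 500     # undersampled counts
prop_a, prop_b = 0.95, 0.05         # true population proportions

# All of class B is in the sample, so B's population count equals its sample count;
# class A is scaled up so that the two classes sit in the true proportion.
pop_b = sample_b
pop_a = pop_b * prop_a / prop_b
weight_a = pop_a / sample_a          # sampling weight applied to each class-A record
print(pop_a, pop_b, weight_a)
```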
15. A large number of insurance records are to be examined to develop a model for predicting fraudulent claims. Of the claims in the historical database, 1% were judged to be fraudulent. A sample is taken to develop a model, and oversampling is used to provide a balanced sample in light of the very low response rate. When applied to this sample of size 800, the model correctly classifies 310 frauds and 270 non-frauds. It misses 90 frauds and incorrectly classifies 130 records as frauds when they are not.
(a) Produce the confusion matrix and the misclassification rate for the sample as it stands.
(b) Taking the oversampling into account, produce the adjusted confusion matrix and the misclassification rate.
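A minimal sketch of the adjustment in Part (b), assuming the usual approach for a balanced oversample: the fraud row of the confusion matrix is kept as it stands, and the non-fraud row is scaled up so that frauds make up 1% of the adjusted total.

```python
# Confusion matrix on the balanced (oversampled) sample; rows are actual classes.
tp, fn = 310, 90        # actual frauds:     correctly / incorrectly classified
fp, tn = 130, 270       # actual non-frauds: classified as fraud / as non-fraud

sample_frauds = tp + fn              # 400
sample_nonfrauds = fp + tn           # 400
misclass_sample = (fn + fp) / (sample_frauds + sample_nonfrauds)

# Adjust for oversampling: keep the fraud row, and scale the non-fraud row so
# that frauds form 1% of the adjusted total (non-frauds are 99x the frauds).
scale = (sample_frauds * 0.99 / 0.01) / sample_nonfrauds
fp_adj, tn_adj = fp * scale, tn * scale
misclass_adj = (fn + fp_adj) / (sample_frauds + fp_adj + tn_adj)
print(misclass_sample, misclass_adj)
```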
Exercises for SAS EM
1. Refer to the SAS data set Custdet1.Sas7bdat.
(a) Create a binary variable with the value 1 for customers who have purchased kitchen products, dish products, or flatware. Report the distribution of the created binary variable (see the code sketch after part (e)).
(b) Use the Metadata node to change the role of the variable created in part (a) to Target and change other related settings for the variable. You may assume that the value 1 is the level of interest.
(c) Use the MultiPlot and StatExplore nodes to examine the distributions and statistics of the other variables in the data set. Pay particular attention to the shape of each distribution, unusual values, and the number of levels.
(d) Apply appropriate transformations to the variables.
(e) Partition the data set into training (60%), validation (35%), and test (15%) sets.
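Part (a) amounts to deriving a 0/1 purchase flag and reporting its frequency distribution; in SAS EM this would typically be done in a SAS Code or Transform Variables node before the Metadata node of part (b). The pandas sketch below shows the same derivation outside SAS EM. The column names KITCHEN, DISHES and FLATWARE are hypothetical placeholders, since the exercise does not list the actual variable names in Custdet1.

```python
import pandas as pd

# Read the SAS data set; the path is illustrative.
df = pd.read_sas("Custdet1.Sas7bdat", format="sas7bdat")

# Hypothetical purchase-amount columns; replace with the actual variable names.
purchase_cols = ["KITCHEN", "DISHES", "FLATWARE"]

# Binary target: 1 if the customer bought anything in those categories, else 0.
df["KITCHEN_FLAG"] = (df[purchase_cols].fillna(0).sum(axis=1) > 0).astype(int)

# Distribution of the created binary variable.
print(df["KITCHEN_FLAG"].value_counts(normalize=True))
```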
2. The SAS data set Customer_Demographic_Exercise.Sas7bdat contains historical records about the customers of a mobile phone company. The data set has already been pre-processed so that it contains no missing values or outliers. The column Churn_reason shows the reason each customer gave for not renewing or for dropping their mobile phone service; however, if the reason description is ACTIVATION, the customer has not churned.
(a) The column Tot_Invoice_Amt minus the column Tot_Paid_Amt gives the outstanding amount of a customer. Compute the total amount of loss to the mobile phone company due to churn (see the code sketch after part (c)).
(b) Create a new variable that takes only the values Churn and nonChurn. A record has the value Churn if the respective customer dropped the mobile service; otherwise the value is nonChurn.
(c) Use the Metadata node to change the role of the variable created in part (b) to Target and change other related settings for the variable. You may assume that the value Churn is the level of interest.
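Parts (a) and (b) are derived-column computations that would normally sit in a SAS Code or Transform Variables node before the Metadata node of part (c). The pandas sketch below illustrates the logic using the column names given in the question (Tot_Invoice_Amt, Tot_Paid_Amt, Churn_reason); the file path and the bytes-decoding step are assumptions about how the data set is read.

```python
import pandas as pd

# Read the SAS data set; the path is illustrative.
df = pd.read_sas("Customer_Demographic_Exercise.Sas7bdat", format="sas7bdat")

# SAS character columns may come back as bytes; decode if needed.
reason = df["Churn_reason"]
if reason.map(lambda v: isinstance(v, bytes)).any():
    reason = reason.str.decode("utf-8")
reason = reason.str.strip().str.upper()

# Part (b): ACTIVATION means the customer has not churned.
df["CHURN_FLAG"] = ["nonChurn" if r == "ACTIVATION" else "Churn" for r in reason]

# Part (a): outstanding amount per customer and total loss due to churn.
df["OUTSTANDING"] = df["Tot_Invoice_Amt"] - df["Tot_Paid_Amt"]
total_loss = df.loc[df["CHURN_FLAG"] == "Churn", "OUTSTANDING"].sum()
print(df["CHURN_FLAG"].value_counts())
print("Total loss due to churn:", total_loss)
```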