MS6711 Data Mining
Exercise 5
Logistic Regression Models
• A logistic regression model for Class A has the following estimates of coefficients for each variable:
Constant    X1      X2     X3
1.2         -1.3    0.6    0.4
Find the odds ratio and the probability of Class A for each of the following samples, given as (X1, X2, X3):
• (1,-1,-1)
• (-1,1, 0)
• (0,0,0)
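The calculation asked for above can be sketched roughly as follows. This is a minimal sketch rather than a model answer: it assumes that "odds ratio" here means the odds p / (1 - p) of Class A for a single sample, and that each sample is listed in the order (X1, X2, X3). The function name is illustrative.

```python
import math

# Coefficient estimates from the exercise: constant, then X1, X2, X3.
INTERCEPT = 1.2
COEFS = (-1.3, 0.6, 0.4)

def odds_and_probability(sample):
    """Return (odds, probability) of Class A for a sample (x1, x2, x3)."""
    logit = INTERCEPT + sum(b * x for b, x in zip(COEFS, sample))
    odds = math.exp(logit)            # assumed meaning: odds = p / (1 - p)
    return odds, odds / (1.0 + odds)  # p = odds / (1 + odds)

for sample in [(1, -1, -1), (-1, 1, 0), (0, 0, 0)]:
    odds, prob = odds_and_probability(sample)
    print(sample, round(odds, 4), round(prob, 4))
```

The same function works for any other sample, because only the linear predictor changes.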
• A logistic regression model for determining whether a customer will default on a loan (Target level: Yes; Non-target level: No) has the following estimates of coefficients for each variable:
Variable (range or coding)       Coefficient
Income ($0-$100,000)             -0.0003
Sex (Male = 1, Female = 0)       0.6
Age (18-65)                      0.04
(Age)^2 (18-65)                  -0.0025
• Interpret the meaning of the coefficients in the model.
• Suppose the intercept term in the above model is estimated as 3. How likely is it that a female customer with an income of $8,000 and an age of 34 will default on a loan? Show your calculations clearly.
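One way to lay out the calculation for the second part is sketched below. It assumes that income enters the model in raw dollars, that the last coefficient multiplies the square of Age, and that the predicted probability refers to the target level Yes. The function name and argument names are illustrative.

```python
import math

def default_probability(income, male, age, intercept=3.0):
    """Return (logit, probability of default) from the exercise's estimates."""
    logit = (intercept
             - 0.0003 * income      # Income in dollars (assumed)
             + 0.6 * male           # Male = 1, Female = 0
             + 0.04 * age
             - 0.0025 * age ** 2)   # squared-age term
    return logit, 1.0 / (1.0 + math.exp(-logit))

logit, prob = default_probability(income=8000, male=0, age=34)
print(round(logit, 2), round(prob, 4))
```

Writing the linear predictor term by term makes it easy to show each step of the hand calculation alongside the code.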
• Refer to the SAS data set Custdet1.Sas7bdat.
• Create a binary variable with value 1 for the customers who have purchased kitchen products, dish products, or flatware.
• The variable Valratio is highly skewed. Determine an appropriate transformation for this variable. You may assume that the ultimate goal is to predict the value of the binary variable created in part (a).
• Build a logistic regression model to predict the buyers of dining wares (kitchen products, dishes, or flatware) without using oversampling. (Hint: You need to define a binary target variable with value 1 for customers who have purchased any one of the three items and value 0 for customers who have never purchased any of them.)
• Suppose a net profit of $90 is generated if a target buyer is correctly identified and a loss of $20 is incurred if a customer is misclassified as a target buyer. Build a logistic regression model as in (c) (without oversampling), but this time try to maximize the profit. Compare the result with that in (c). Do you obtain the same model?
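For the profit-based question, the link between the profit figures and the classification cutoff can be sketched as follows. The sketch assumes that a customer who is not targeted contributes zero profit; the function names are illustrative and this is not the procedure of any particular software package.

```python
def break_even_cutoff(profit_tp, loss_fp):
    """Probability above which targeting has positive expected profit.

    Expected profit of targeting = p * profit_tp - (1 - p) * loss_fp,
    which is positive exactly when p > loss_fp / (profit_tp + loss_fp).
    """
    return loss_fp / (profit_tp + loss_fp)

def expected_profit(p, profit_tp=90.0, loss_fp=20.0):
    """Expected profit of targeting a customer with buying probability p."""
    return p * profit_tp - (1.0 - p) * loss_fp

cutoff = break_even_cutoff(90.0, 20.0)
print(round(cutoff, 4), round(expected_profit(0.5), 2))
```

Under these assumptions the profit-maximizing cutoff is well below 0.5, which is why the profit-based classification can differ from the accuracy-based one.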
• Refer to the SAS data set Customer_Demographic_Exercise.Sas7bdat.
• Build logistic regression models to classify churned customers. Evaluate the performance of your models and select the one which performs the best. Select the columns for your model carefully. Not all columns are needed in building the model.
• Suppose the mobile phone company plans a customer retention campaign that gives away $50 worth of free calls. The cost to produce and mail out the offer is $2, making the total cost per customer $52. The estimated revenue for each correct prediction of a positive response is $101, which includes the unpaid bills. If the offer is sent to a customer who was not going to defect, no additional revenue is gained. If the company's aim is to maximize the return on investment, suggest how the developed logistic regression model could be used.
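One possible use of the model scores is to rank customers by predicted churn probability and mail the offer only as far down the list as the cumulative expected profit keeps rising. The sketch below assumes the $101 is gross revenue per correctly targeted churner and the $52 is paid for every offer sent. The scores are hypothetical and the function name is illustrative.

```python
def best_mailing_depth(scores, revenue=101.0, cost=52.0):
    """Return (number of customers to mail, cumulative expected profit)."""
    ranked = sorted(scores, reverse=True)   # highest churn probability first
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, p in enumerate(ranked, start=1):
        running += p * revenue - cost       # expected profit of one more offer
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit

# Hypothetical predicted churn probabilities for six customers.
scores = [0.10, 0.55, 0.90, 0.30, 0.70, 0.50]
print(best_mailing_depth(scores))
```

With these assumptions an offer pays off in expectation only when the predicted probability exceeds 52 / 101, so the customer scored 0.50 is left out.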
Decision Tree Models
• Why is tree pruning useful in decision tree induction?
• What are the advantages and disadvantages of using a decision tree model for classification compared with a logistic regression model?
• Given a training data set as below:
Variable 1    Variable 2    Target
A             TRUE          1
A             TRUE          2
A             FALSE         2
A             FALSE         2
A             FALSE         1
B             TRUE          1
B             FALSE         1
B             TRUE          1
B             FALSE         1
C             TRUE          2
C             TRUE          2
C             FALSE         1
C             FALSE         1
C             FALSE         1
• Suppose the measurement level of Variable 1 is nominal. Compute the entropy of every possible split of Variable 1 into two branches.
• Compute the entropy for a split on Variable 2 with two branches, one for each level of the variable.
• Compare the entropies computed above and construct the first-level split of a decision tree.
• Repeat Question 3, but use the Gini index instead of the entropy.
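The entropy and Gini computations in the two questions above can be checked with a short script such as the one below. It is a sketch under stated assumptions: entropy uses base-2 logarithms, and the impurity of a split is taken as the size-weighted average of the impurities of its two branches. The function names are illustrative.

```python
import math
from collections import Counter

# Training data from the exercise: (Variable 1, Variable 2, Target).
ROWS = [
    ("A", True, 1), ("A", True, 2), ("A", False, 2), ("A", False, 2),
    ("A", False, 1),
    ("B", True, 1), ("B", False, 1), ("B", True, 1), ("B", False, 1),
    ("C", True, 2), ("C", True, 2), ("C", False, 1), ("C", False, 1),
    ("C", False, 1),
]

def entropy(targets):
    """Base-2 entropy of a list of target values."""
    n = len(targets)
    return -sum((c / n) * math.log2(c / n) for c in Counter(targets).values())

def gini(targets):
    """Gini index of a list of target values."""
    n = len(targets)
    return 1.0 - sum((c / n) ** 2 for c in Counter(targets).values())

def split_impurity(rows, in_left, impurity):
    """Size-weighted impurity of the two branches defined by in_left(row)."""
    left = [r[2] for r in rows if in_left(r)]
    right = [r[2] for r in rows if not in_left(r)]
    n = len(rows)
    return sum(len(b) / n * impurity(b) for b in (left, right) if b)

# Variable 1: each two-branch split isolates one level from the other two.
for level in "ABC":
    e = split_impurity(ROWS, lambda r: r[0] == level, entropy)
    g = split_impurity(ROWS, lambda r: r[0] == level, gini)
    print(f"{{{level}}} vs rest: entropy={e:.4f}, gini={g:.4f}")

# Variable 2: one branch per level (TRUE / FALSE).
e2 = split_impurity(ROWS, lambda r: r[1], entropy)
g2 = split_impurity(ROWS, lambda r: r[1], gini)
print(f"Variable 2: entropy={e2:.4f}, gini={g2:.4f}")
```

The split with the lowest weighted impurity is the candidate for the first level of the tree.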
• Suppose a decision tree is being built to determine the credit risk level of customers. Two possible splits are given below:
Split A (Age)
Node              High Risk    Low Risk
Parent            19           21
Age > 50          8            11
Age <= 50         11           10

Split B (Salary)
Node              High Risk    Low Risk
Parent            19           21
Salary > 65K      18           5
Salary <= 65K     1            16
• Use the Gini Index to calculate the purity of the two splits.
• Which is a better split? Why?
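The comparison of the two splits can be sketched from the node counts alone. The snippet assumes that the purity of a split is measured by the size-weighted Gini index of its two child nodes (lower means purer). The function names are illustrative.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted Gini index over the child nodes of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# (High Risk, Low Risk) counts in the two child nodes of each split.
split_a = [(8, 11), (11, 10)]   # Split A: Age
split_b = [(18, 5), (1, 16)]    # Split B: Salary

print(round(gini((19, 21)), 4))          # parent node before splitting
print(round(weighted_gini(split_a), 4))
print(round(weighted_gini(split_b), 4))
```

Comparing each weighted value with the parent node's Gini index shows how much impurity each split removes.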
• Consider the following decision tree for classifying responses to credit card insurance promotion:
• Construct a confusion matrix showing the frequency of all events for the above decision tree at a cutoff value of 0.5.
• Generate a set of IF-THEN rules for each leaf node.
• A financial services company offers a home equity line of credit to its clients. The company has extended several thousand lines of credit in the past, and some of these accepted applicants have defaulted. The following decision tree model was built to predict whether an applicant will default (Default = 1, Not default = 0).
• Write two rules from the above decision tree that can be used to predict which applicant is likely to default at a cutoff value of 0.5.
• Use the Gini index to measure the diversity due to the split of the variable ‘Debt to income ratio’.
• Use the Entropy to measure the diversity due to the split of the variable ‘Debt to income ratio’.
• Construct a confusion matrix showing the frequency of all events for the above decision tree at a cutoff value of 0.5.
• Suppose the company received three applications with the details in the following table. What is the predicted score of default for each applicant?
Variables                             Application 1    Application 2    Application 3
Debt to income ratio (%)              67               55               60
Number of delinquent trade lines      2                0                0
Age of oldest trade line in months    250              150              200
Number of trade lines                 15               11               15
• Refer to the SAS data set Simulate5.sas7bdat.
• Use Var1 and Var2 as inputs for a decision tree model to classify the observations into one of the two classes of Var3.
• Define a new variable equal to Var1 - Var2 and build a decision tree model to classify the observations into one of the two classes of Var3.
• Compare the results of the above two decision tree models. Why does the decision tree model in (b) perform better than that in (a)?
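The effect asked about in the last part can be illustrated on simulated data. The sketch below does not use Simulate5.sas7bdat: it generates hypothetical points whose class is 1 exactly when Var1 > Var2, which is only an assumption about how such data might be structured. It then compares the best single axis-parallel split on Var1 with the best single split on Var1 - Var2.

```python
import random

def best_stump_accuracy(values, labels):
    """Best accuracy of a single split 'value > t', in either polarity."""
    n = len(values)
    best = 0.0
    for t in set(values):
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        best = max(best, hits / n, 1.0 - hits / n)
    return best

random.seed(0)
# Hypothetical data: the class is 1 exactly when var1 > var2.
var1 = [random.random() for _ in range(200)]
var2 = [random.random() for _ in range(200)]
var3 = [int(a > b) for a, b in zip(var1, var2)]
diff = [a - b for a, b in zip(var1, var2)]

print("split on Var1       :", best_stump_accuracy(var1, var3))
print("split on Var1 - Var2:", best_stump_accuracy(diff, var3))
```

A full tree on Var1 and Var2 can approximate a diagonal boundary only with a staircase of many axis-parallel splits, whereas the derived variable allows a single split.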
• Refer to the SAS data set Custdet1.Sas7bdat.
• Build a decision tree model to classify the buyers of dining wares (kitchen products, dishes, or flatware purchases) without using oversampling.
• Compare the model derived in (a) above with the logistic regression model. Which is the better model in terms of classification accuracy?
Neural Network Models
• Draw the nodes and node connections for a fully connected feed-forward network that accepts three input values, has one hidden layer of five nodes and an output layer containing two nodes.
• What are the roles of an activation function in a neural network?
• Describe briefly how to use validation data to prevent a neural network model from overfitting the training data.
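One common way to use validation data for this purpose is early stopping, sketched below. The validation losses are hypothetical numbers, and the `patience` parameter (how many non-improving epochs to tolerate) is an illustrative choice rather than something prescribed by the exercise.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break               # validation error stopped improving
    return best_epoch

# Hypothetical validation losses, one per training epoch.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.44]
print(early_stopping_epoch(val_losses))
```

The training error would typically keep falling over the same epochs, so the stopping rule relies on the validation loss alone.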
• Consider the feed-forward network with three input values, one hidden layer of two nodes and an output layer containing one node. Apply the input instance [0.5, 0.2, 1.0] to the feed-forward neural network. Assume all weight values and all biases are equal to 0.2. Specifically,
• Compute the input to the two nodes in the hidden layer, using linear combination function for each node.
• Use a logistic activation function to compute the output of the two hidden nodes.
• Use the output values computed in part b to determine the input and output values for the output node. Use linear combination function and logistic activation function for your computations.
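The three steps above can be traced with the following sketch. It follows the stated setup (every weight and every bias equal to 0.2, linear combination followed by a logistic activation). The function names are illustrative.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, n_hidden=2, weight=0.2, bias=0.2):
    """Forward pass through one hidden layer and a single output node."""
    # Linear combination at each hidden node: bias + sum of weight * input.
    hidden_in = [bias + sum(weight * x for x in inputs) for _ in range(n_hidden)]
    hidden_out = [logistic(z) for z in hidden_in]
    # Linear combination at the output node, then logistic activation.
    output_in = bias + sum(weight * h for h in hidden_out)
    return hidden_in, hidden_out, output_in, logistic(output_in)

hidden_in, hidden_out, output_in, output = forward([0.5, 0.2, 1.0])
print(hidden_in, hidden_out, output_in, output)
```

Because all weights and biases are equal, both hidden nodes receive the same input and produce the same output.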
• Assuming that 10,000 cases are available, describe how you would use a feed forward neural network to predict the class (Target level: Yes; Non-target level: No) in the data with the following inputs: age (18-65), income (0-100,000), sex (male, female), credit grade (1, 2, 3, 4, 5; 1 the lowest and 5 the highest). Your answer should address at least the following issues:
• Interval inputs to the network and their representations.
• Categorical inputs to the network and their representations.
• Outputs from the network and their representations.
• Preparation of training, validation and test data files.
• Number of hidden units.
• Training of the network, including stopping criteria.
• Estimation of the accuracy of the predictions.
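Some of the issues listed above (input representation and data partitioning) can be sketched in code. The choices shown are assumptions, not the only reasonable answer: min-max scaling for interval inputs, a 0/1 dummy for sex, the credit grade treated as an ordinal value scaled to [0, 1] (one dummy per grade is an alternative), and a 60/20/20 split into training, validation and test sets. The function names are illustrative.

```python
import random

def encode(age, income, sex, grade):
    """Map one raw case to network inputs in [0, 1]."""
    return [
        (age - 18) / (65 - 18),          # interval input, min-max scaled
        income / 100_000,                # interval input, min-max scaled
        1.0 if sex == "male" else 0.0,   # binary input as a 0/1 dummy
        (grade - 1) / 4,                 # ordinal grade 1..5 scaled to [0, 1]
    ]

def split_cases(n_cases, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition case indices into train, validation and test."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n_cases)
    n_valid = round(fractions[1] * n_cases)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

train, valid, test = split_cases(10_000)
print(len(train), len(valid), len(test))
print(encode(age=34, income=8000, sex="female", grade=3))
```

The validation set would be used for choosing the number of hidden units and for the stopping criterion, leaving the test set for the final accuracy estimate.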
• Refer to the data Custdet1.Sas7bdat.
• Build a neural network model to predict the buyers of dining wares (kitchen products, dishes, or flatware purchases) without using oversampling.
• Compare the model derived in (a) above with the logistic regression model and the decision tree model. Which is the better model in terms of classification accuracy?
Ensemble Models
• Refer to the data Custdet1.Sas7bdat. Combine the developed logistic model, decision tree model, and neural network model to predict the buyers of dining wares. Check whether this combined model outperforms any one of the individual models.
• Refer to the data Custdet1.sas7bdat. Build a logistic regression model with bagging to predict the buyers of dining wares (kitchen products, dishes, or flatware purchases) without using oversampling.
• Refer to the data Custdet1.sas7bdat. Build a random forest model to predict the buyers of dining wares (kitchen products, dishes, or flatware purchases) without using oversampling.