
CLASSIFICATION – 1 REGRESSION

1
Chapter 4:
Introduction to Classification

2
Contents
Introduction to DM classification task.
Preparing data set for classification task.
Assessing the performance of classification models.
Using SAS EM nodes to prepare data for classification.

3
Introduction
What is classification?
Examine the attributes (input variables) of a newly presented object and assign it to one of a predefined set of classes.
Characterized by a well-defined definition of the classes (target variable) and a training data set consisting of pre-classified examples.
The approach is often called supervised learning.

Introduction
What is the difference between classification and prediction?
Some DM practitioners regard predicting class labels as classification and predicting continuous values as prediction; others use “estimation” for predicting continuous values.
Examples of classification:
Classify customers on a contact list into likely to respond group and unlikely to respond group.
Classify an insurance claim into fraudulent claim group or non-fraudulent claim group.
Classify loan applicants as low, medium, or high risk.

4

5
Preparing Data
Modelling time dependent data
Much of the business data used for classification, such as monthly sales, is time dependent.
A developed classification model is often used to classify the future status of a given object.
Need to consider the following:
Is the target variable readily defined?
Is the time frame correctly defined?
How to name the time dependent variables in a model?
How to prevent a model from overtraining on past data?

6
Preparing Data
Modelling time dependent data
Defining target variable
Some target variables are clearly defined in data set.
Examples
Customers who responded to a mail campaign or not.
Customers who renewed an insurance policy or not.
Customers who repaid a loan or not.
Some target variables are not readily defined
Examples
Customers who used their credit cards in three consecutive months or not.
Customers who spent less than a predetermined amount in a specific period of time.

7
Preparing Data
Modelling time dependent data
Defining target variable
Values of all input variables of a predictive model must be observed at least one time unit before the corresponding value of the target variable is observed.

[Figure: months of historical data on a timeline; the input variables are observed in months prior to the month in which the target variable is observed.]

8
Preparing Data
Modelling time dependent data
Defining target variable
E.g. Suppose we would like to identify the customers who will not use their credit card in the future three months.
Step one: Divide past time into a recent 3-month zone and earlier months zone.

[Figure: the historical data are split into earlier months and the most recent 3 months, with the present marked and the future 3 months ahead.]

9
Preparing Data
Modelling time dependent data
Defining target variable
E.g.
Step Two: Define the value of target variable

[Figure: for training, the earlier months (6 months) form the input and the recent 3 months define the target; for application, the most recent 6 months form the input and the future 3 months are predicted.]

10
Preparing Data
Modelling time dependent data
Defining time frame
E.g. Suppose we want to identify customers who are likely to make a purchase next month. Assume the current month is February, and past data from last March through January are available.
One model training approach would be predicting who had made a purchase in January (Target), and using all of the data from last March to December as input.
[Figure: months Mar through Dec are used as the model inputs; Jan is the target.]

11
Preparing Data
Modelling time dependent data
Defining time frame
E.g. (Continued)
To use the developed model to predict the purchases in March, data from May through February would be used as inputs.

Is this approach workable?
Will the February data be available before March? Very unlikely!
The developed model is useless
[Figure: to score March with this model, inputs from May through February would be required, but the February data would not be available in time.]

12
Preparing Data
Modelling time dependent data
Defining time frame
Time can be divided into three periods:
Past – consists of what has already happened and data that has already been collected.
Present – the time period when the model is being built; data about the present is usually not available.
Future – the time period for prediction.
When making a prediction, a model uses data from the past for making predictions about the future.

13
[Figure: the timeline is divided into past, present, and future; for model development the past is further divided into a past (input) period, a latency period, and an output period.]

Preparing Data
Modelling time dependent data
Defining time frame
When the model is in use:

[Figure: when the model is in use, the input data end at the present and the prediction period starts after it.]

For model development, the past data are divided into three components.
Latency must be allowed for the time needed to get the data, train the model, deploy the output, etc.
Length of latency period = Length of present period.
Length of output period = Length of future period.

14
Preparing Data
Modelling time dependent data
Naming time dependent variables
One commonly made mistake in modelling time dependent variables is naming the variables according to the calendar month (or day, quarter, etc.) names.
For example
When using January to June monthly sales data as input for model training, one may inappropriately name each month’s sales data as Sales_Jan, Sales_Feb, Sales_Mar, …, Sales_Jun.
[Figure: a model trained with inputs Sales_Jan, Sales_Feb, …, Sales_Jun produces its output in August.]

15
Preparing Data
Modelling time dependent data
Naming time dependent variables
For example (Cont’d)
When the developed model is being deployed in August, we have to use the model with sales data from the months February to July as inputs to predict the outcome in October.
Sales data for February would have to be inconveniently assigned the name Sales_Jan, March data the name Sales_Feb, and so forth.

And then again in September, and so forth.

Model trained in July:      Sales_Jan, Sales_Feb, …, Sales_Jun
Model deployed in August:   Sales in Feb, Sales in Mar, …, Sales in Jul

16
Preparing Data
Modelling time dependent data
Naming time dependent variables
Time dependent variables should be named in a way that is independent of the calendar period names.
A better approach is to append 01 to the variable name (e.g. Sales_01) for the most recent input month, 02 for the second most recent month, and so on (see the sketch below).
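A minimal sketch of this renaming step in plain Python/pandas (not SAS EM); the column names and values below are made up for illustration:

```python
# Illustrative sketch: rename calendar-month sales columns to relative-lag
# names so the model is independent of calendar time. Data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Sales_Jan": [10, 12], "Sales_Feb": [11, 13], "Sales_Mar": [9, 14],
    "Sales_Apr": [12, 15], "Sales_May": [13, 11], "Sales_Jun": [14, 16],
})

# Input months ordered from oldest to most recent.
input_months = ["Sales_Jan", "Sales_Feb", "Sales_Mar",
                "Sales_Apr", "Sales_May", "Sales_Jun"]

# Sales_01 = most recent input month, Sales_02 = second most recent, ...
lag_names = {month: f"Sales_{len(input_months) - i:02d}"
             for i, month in enumerate(input_months)}
df = df.rename(columns=lag_names)

print(df.columns.tolist())  # ['Sales_06', 'Sales_05', ..., 'Sales_01']
```

At deployment time the same relative names (Sales_01, Sales_02, …) are simply filled with whichever calendar months are the most recent ones.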

17
Preparing Data
Modelling time dependent data
Preventing model from overtraining
Training data sets often consist of records of individual items covering the same period of time.

Using the same history window for every record has the drawback that the model may learn a feature tied to a particular time unit in the past. That pattern will not occur at the same input position when the model is deployed in the future.

18
Preparing Data
Modelling time dependent data
Preventing model from overtraining
Use individual records more than once in the training data set, each time with a different (shifted) time window, as in the sketch below.
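The sketch below illustrates the idea with made-up data: each object's history is cut into several shifted windows, and every window becomes one training record with relative input names and its own target. The 12-unit history and 8-unit window follow the slide diagrams; everything else (names, values) is hypothetical.

```python
# Illustrative sketch: 12 monthly values per object, an 8-unit window of
# 7 inputs + 1 target, shifted one unit at a time.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
history = pd.DataFrame(rng.integers(0, 100, size=(2, 12)),
                       columns=[f"t{u:02d}" for u in range(1, 13)])  # t01 = oldest

window, rows = 8, []
for obj_id, series in history.iterrows():
    # Slide the window across the history: each position yields one record.
    for start in range(0, 12 - window + 1):
        chunk = series.iloc[start:start + window]
        row = {"object": obj_id}
        # Relative names: In_01 = most recent input unit within this window.
        for lag, value in enumerate(chunk.iloc[:-1][::-1], start=1):
            row[f"In_{lag:02d}"] = value
        row["Target"] = chunk.iloc[-1]   # last unit of the window is the target
        rows.append(row)

training = pd.DataFrame(rows)   # each object now contributes 5 records
print(training.shape)           # (10, 9): 2 objects x 5 windows, id + 7 inputs + target
```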

19
Preparing Data
Creating a balanced sample
A model trained on a larger business data set tends to do a better job of classification because it has more examples from which to learn.
However, the number of records for each target level is also important.
It is possible that a trained model based on a very large data set is totally useless.

20
Preparing Data
Creating a balanced sample
Suppose models based on the following two data sets both give 99% overall model accuracy on classifying credit card transaction fraud. But which of the models is more likely to be useful in identifying fraud?
Model based on data set 1: 1 million records of which 1 percent was fraud.
Model based on data set 2: 50,000 records of which 20 percent was fraud.

21
Preparing Data
Creating a balanced sample
The target event level is often underrepresented in the data set.
E.g. loan default, customer defection, credit card fraud.
If there are not enough records of target event level in the data, most classification models may not be able to identify the features relating to the target event level.

22
Preparing Data
Creating a balanced sample
Classification models do best at distinguishing between target levels when these levels have roughly the same number of records.
For binary levels, a 50/50 split would be ideal, but a 20 (event level)/80 split is tolerable.
Actual number of records of each target level is also important.

23
Preparing Data
Creating a balanced sample
Undersampling
A process of creating a sample with balanced (or close to balanced) levels by taking roughly equal numbers of records from the different levels.
Often includes all records from the rare level, but only some of the records from the more frequent level (see the sketch below).

[Figure: the full data set is undersampled to produce a data set with balanced target levels.]
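A minimal sketch of undersampling outside SAS EM (the Sample node in Example 4.2 plays the same role); the column name `target` and event level 1 are assumptions:

```python
# Illustrative sketch: keep all rare event-level records and an equal-sized
# random sample of the frequent level to get a roughly 50/50 data set.
import pandas as pd

def undersample(df: pd.DataFrame, target: str = "target",
                event_level=1, random_state: int = 42) -> pd.DataFrame:
    events = df[df[target] == event_level]
    non_events = df[df[target] != event_level]
    # Keep every event record; sample the same number of non-events.
    sampled_non_events = non_events.sample(n=len(events), random_state=random_state)
    # Concatenate and shuffle the balanced sample.
    return pd.concat([events, sampled_non_events]).sample(frac=1, random_state=random_state)
```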

24
Preparing Data
Creating a balanced sample
Consequences of undersampling
The reported statistics (probabilities, accuracy, lift, etc.) do not reflect what is actually happening in the population.
Statistics have to be rescaled to reflect the actual performance of the model.
In most cases this cannot be done accurately because of model complexity.

25
Preparing Data
Partitioning the model data set
The normal way
Training set – to build the models (50%-80%).
Validation set – to fine-tune or compare the models (10%-25%).
Test set – to assess the effectiveness of the developed model (10%-25%).
If available, the test data set should come from a time period different to the training and validation sets.

[Figure: the model data set split into training, validation, and test sets.]

Data partitioning should be performed before the records in the test set are used for any purpose (such as missing value imputation).

Preparing Data
Partitioning the model data set
If the model data set is not large enough for partitioning into the three subsets, the cross-validation method can be used.
Partition the whole data set into m subsets (m folds) of equal size. Each subset is used in turn for validation while the remaining subsets are used for training.
Thus the training procedure is executed a total of m times on different training sets.
The m accuracy measures (for training and validation respectively) are averaged to yield an overall error estimate.
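A short sketch of m-fold cross-validation using scikit-learn (not part of the SAS EM workflow; the classifier and data below are placeholders):

```python
# Illustrative sketch: 10-fold cross-validation with a placeholder model/data.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(500, 5)
y = np.random.randint(0, 2, 500)

m = 10
scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=m, shuffle=True, random_state=1).split(X, y):
    model = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[valid_idx], model.predict(X[valid_idx])))

print(f"{m}-fold CV accuracy: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
```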
26

Preparing Data
Partitioning the model data set
Variations of m-fold cross-validation
Leave-one-out cross-validation: an n-fold cross-validation, where n is the number of records in the model data set.
Advantage: no random sampling is involved, so the same result is obtained each time.
Bootstrap: The model data set is sampled n times, with replacement, to give another data set of n records. This sample will be used as the training set. Records in the original model data set that have not been picked will be used for validation.
It is expected the sampled training set will contain 63.2% of the original records.
An error measure = 0.632 * error_validation + 0.368 * error_training, where error_validation is the error on the records not picked for the bootstrap sample.
The whole bootstrap procedure is repeated several times, and the results averaged.
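A sketch of the bootstrap procedure just described, with a placeholder classifier and data; the 0.632/0.368 weighting matches the error measure above:

```python
# Illustrative sketch of the 0.632 bootstrap error estimate.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = rng.integers(0, 2, 300)

n, estimates = len(X), []
for _ in range(50):                                  # repeat the bootstrap several times
    boot = rng.integers(0, n, n)                     # sample n records with replacement
    oob = np.setdiff1d(np.arange(n), boot)           # records never picked -> validation
    model = DecisionTreeClassifier(max_depth=3).fit(X[boot], y[boot])
    err_train = np.mean(model.predict(X[boot]) != y[boot])
    err_valid = np.mean(model.predict(X[oob]) != y[oob])
    estimates.append(0.632 * err_valid + 0.368 * err_train)

print("Bootstrap error estimate:", np.mean(estimates))
```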
27

28
Preparing Data
Partitioning the model data set
With undersampling
The undersampled data set is not a representative sample of the model data set.
Evaluation statistics computed from an undersampled test data set could be misleading. They do not reflect the actual performance of the model in deployment.
[Figure: the model data set is undersampled first, and the undersampled data set is then partitioned into training, validation, and test sets.]

29
Preparing Data
Partitioning the model data set
With two-stage undersampling
The test set is a representative sample of the model data set.
Evaluation statistics computed from the test set are therefore correct.

[Figure: two-stage approach - the test set is drawn as a random (representative) sample of the model data set, and only the remaining records are undersampled to form the training and validation sets.]

30
Assessing Models
Each classification tool (logistic, decision tree, or neural network) can generate many models under different settings of the tool.
For each classification tool, we would like to identify the best performing model.
The best model from each classification tool is then compared with the others so that the best of the best can be identified.
If a validation data set is available, we should use it for model assessment.
Once the best model has been determined, the validation data set can be added back to the training data set. A final model is built by re-estimating the best model using the combined data set.

31
Assessing Models
A model may be developed for:
Classification purpose
Each item is individually evaluated and classified into one of the target levels.
E.g. High/Low risk, Fraud/ Not Fraud etc.
Assessment is focused on the accuracy of the model in identifying an item that belongs to the target event level.
Selection purpose
Each item is individually scored, but the focus is on the overall outcome of a selected group of items.
E.g. Select 10,000 customers from 100,000 customers for a marketing activity.
Assessment is focused on the number (or proportion) of items belonging to the target event level that are contained in the selected group.

32
Assessing Models
For classification purpose
Most classification tools return a score for each item to indicate how likely the item falls into the target event level.
Scores are often scaled so that:
score at each level varies between 0 and 1, and
sum of the scores over all levels equals 1.
An item is classified to the target event level if the score of the item is higher than the cutoff score.
For binary target, the default cutoff score is 0.5.
In some applications, the cutoff score can be set higher or lower than 0.5. The higher the cutoff, the fewer items will be classified into the target event level, and vice versa.

33
Assessing Models
For classification purpose
The classification results of a model can be summarized in a confusion matrix.
A confusion matrix is a two-way table of actual target levels versus classified target levels.
E.g. Suppose there are 1000 records in a training (or validation, or test) set, of which 300 actually belong to Class A (the event level, say). At cutoff 0.5, a model classifies 550 records to Class A and 450 to Class B.
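A small sketch that tabulates this example's confusion matrix in Python; the cell counts (250, 50, 300, 400) are the ones shown in the slide's matrix:

```python
# Sketch of the example's confusion matrix: 1,000 records, 300 actually in
# Class A; the model assigns 550 to Class A at cutoff 0.5.
import pandas as pd

actual     = ["A"] * 300 + ["B"] * 700
classified = (["A"] * 250 + ["B"] * 50      # actual A: 250 correct, 50 wrong
              + ["A"] * 300 + ["B"] * 400)  # actual B: 300 wrong, 400 correct

confusion = pd.crosstab(pd.Series(actual, name="Actual"),
                        pd.Series(classified, name="Classified"),
                        margins=True)
print(confusion)
# Classified    A    B   All
# Actual
# A           250   50   300
# B           300  400   700
# All         550  450  1000
```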

34
Assessing Models
For classification purpose
The performance of competing models is often compared with each other and with a naive model.
Naive model: a model that classifies items at random. Every item has an equal chance (= number of items classified into the target event level / total number of items to be classified) of being classified into the target event level.
E.g. Confusion matrix of the naïve model for classifying the same number of items into Class A:

Cell frequency (165) = column total of the cell (550)*row total of the cell (300) / grand total (1000)

35
Assessing Models
For classification purpose
Accuracy
Measures the proportion of items that are correctly classified:

A useful measure when each level of the target is equally important. The higher the model accuracy, the better the model. However, it can be misleading if the target event level is rare.


36
Assessing Models
For classification purpose
Captured response rate
Measures the proportion of items that belong to the target event level and are correctly classified at a specified cutoff value:

Lift = 0.83 / 0.55 = 1.51.
The considered model at 0.5 cutoff performs 51% better than the naive model.
The maximum lift value equals 1 / the proportion of the target event level in the data set.

37
Assessing Models
For classification purpose
Captured response rate
Also known as sensitivity, true positive rate, or recall.
If other measurements are the same, a model with higher captured response rate is preferred.
Captured response rate of a model goes up as cutoff value goes down.
Model with extremely high captured response rate should be scrutinized carefully.

38
Assessing Models
For classification purpose
Specificity
Measures the rate of correctly classifying items of the non-event level. Defined as the number of correctly classified non-event items / the total actual number of non-event items.
Also called the true negative rate.
If other measurements are the same, a model with higher specificity is preferred.
Specificity of a proper DM model goes up as the cutoff value increases.

Assessing Models
For classification purpose
Other rates
Misclassification Rate = (FP+FN)/(TP+TN+FP+FN): measures how often the model is wrong. Also known as Error Rate; equivalent to 1 – Accuracy.
False Positive Rate = FP/(FP+TN): measures how often the model classifies true non-event as event. Equivalent to 1 – Specificity.
Precision = TP/(TP+FP): measures how often the model is correct when it classifies an item as event. Also known as Response Rate.
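A short sketch computing the rates defined on the last few slides from the example's confusion matrix (event level = Class A, so TP = 250, FN = 50, FP = 300, TN = 400):

```python
# Sketch: rates from the example's confusion matrix.
TP, FN, FP, TN = 250, 50, 300, 400
total = TP + FN + FP + TN

accuracy          = (TP + TN) / total          # 0.65
captured_response = TP / (TP + FN)             # sensitivity / recall = 0.83
specificity       = TN / (TN + FP)             # 0.57
misclassification = (FP + FN) / total          # 1 - accuracy = 0.35
false_positive    = FP / (FP + TN)             # 1 - specificity = 0.43
precision         = TP / (TP + FP)             # response rate = 0.45

# Naive model capturing the same number of classified items: 165/300 = 0.55
naive_captured = (TP + FP) / total             # 550/1000 = 0.55
lift = captured_response / naive_captured      # about 1.51
print(round(accuracy, 2), round(captured_response, 2), round(lift, 2))
```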

39

40
Assessing Models
For classification purpose
Receiver Operating Characteristic (ROC) Charts plot Sensitivity against 1-Specificity (False Positive Rate) for different cutoff values.

[Figure: ROC curves plotting Sensitivity against 1-Specificity as the cutoff moves from higher to lower values; an almost perfect model bows toward the top-left corner.]
For the naive model, Sensitivity + Specificity always equals 1 (the diagonal line).

Assessing Models
For classification purpose
Area under the ROC (AUC)
Often used as a measure of quality of classification models.
Naïve classifier has an AUC of 0.5.
A perfect classification model has an AUC of 1.
In practice, most classification models have an AUC between 0.5 and 1. The higher the AUC, the better the model fits the data set.
An AUC of 0.8, for example, means that a randomly selected case from the event group has a higher score than a randomly selected case from the non-event group 80% of the time.

41

Assessing Models
For classification purpose
Example: Consider the following scores:

Rank the scores from the lowest (1) to the highest (20).
Sum of ranks (R) for Event =1 : 2+4+7+8+10+12+15+16+17+19+20=130
Mann-Whitney Statistic (or Wilcoxon Rank Sum)
U = R – n1(n1+1) / 2 = 130 – 11(11+1)/2 = 64
AUC = U/(n1n0)= 64 / (11)(9) = 0.6465
If an object (Y1) is randomly selected from the group of Event = 1 and an object (Y0) is randomly selected from the group of Event = 0, then P(score of Y1 > score of Y0) = 0.6465.
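A sketch of the same rank-sum calculation; the slide's table of 20 scores is not reproduced here, so the scores below are hypothetical, but the formulas are the ones above and the result matches the pairwise-comparison definition of AUC:

```python
# Sketch of the Mann-Whitney / rank-sum calculation of AUC.
# Hypothetical scores: U = R - n1(n1+1)/2, AUC = U / (n1*n0).
import numpy as np
from scipy.stats import rankdata

event  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.4, 0.3])

ranks = rankdata(scores)                 # rank 1 = lowest score (ties get average ranks)
n1, n0 = event.sum(), (1 - event).sum()
R = ranks[event == 1].sum()              # sum of ranks of the event-level records
U = R - n1 * (n1 + 1) / 2
auc = U / (n1 * n0)

# AUC equals P(score of a random event > score of a random non-event)
pairwise = np.mean([s1 > s0 for s1 in scores[event == 1] for s0 in scores[event == 0]])
print(round(auc, 4), round(pairwise, 4))   # the two values agree
```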

42

Assessing Models
For classification purpose
Selecting cutoff value
For most applications, we want to select a cutoff value for a model so that the sensitivity (true positive rate) is high and 1 - specificity (false positive rate) is low.

43
Consider choosing a cutoff value such that the difference between sensitivity and 1 - specificity is maximized.

Assessing Models
For classification purpose
Selecting cutoff value
If it is important for both the sensitivity (true positive rate) and the specificity (true negative rate) to be as high as possible, choose a cutoff value such that the absolute difference between sensitivity and specificity is minimized (see the sketch below).
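A sketch of both cutoff-selection rules (maximise sensitivity - (1 - specificity), or minimise |sensitivity - specificity|) scanned over candidate cutoffs; `y_true` and `scores` are placeholder validation-set arrays:

```python
# Sketch: scan candidate cutoffs and apply the two selection rules above.
import numpy as np

def pick_cutoffs(y_true: np.ndarray, scores: np.ndarray):
    best_youden, best_balanced = None, None
    max_diff, min_gap = -np.inf, np.inf
    for c in np.unique(scores):
        pred = (scores >= c).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1)); fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0)); fp = np.sum((pred == 1) & (y_true == 0))
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        if sens - (1 - spec) > max_diff:            # maximise sensitivity - (1 - specificity)
            max_diff, best_youden = sens - (1 - spec), c
        if abs(sens - spec) < min_gap:              # minimise |sensitivity - specificity|
            min_gap, best_balanced = abs(sens - spec), c
    return best_youden, best_balanced
```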
44

45
Assessing Models
For selection purpose
Modelling effort is focused on the set of items with the top X% of scores; the overall model accuracy is not necessarily of interest.
The score of each item is not as important in this case, though higher scores make one feel more confident in the result of the selection.
No need to decide a cutoff value directly.
E.g. Suppose you want to select 10,000 customers from a customer base of 100,000 for a marketing activity; the modelling effort should be focused on the top 10% of scores, not on the overall model accuracy.

46
Assessing Models
For selection purpose
Cumulative Gains Chart
It shows the ‘gains’ in targeting a given percentage of the total number of customers using the highest scores.
Also called Cumulative % of Captured Responses Chart.

The values on the y-axis correspond to the cumulative captured response rate among the individuals in the cumulative deciles.
Note that SAS EM defines its Gain chart differently, as (Lift - 1) * 100%.

Assessing Models
For selection purpose
Cumulative Gains Chart
A 25% gain at the first decile means that if you score a data set with the model and sort all of the records by score, you would expect the top 10% of records to contain approximately 25% of all actual target events.
The diagonal line is the baseline from the naive model: if you select 10% of records from the scored data set at random, you would expect to capture approximately 10% of all actual target events.
The further above the baseline a curve lies, the greater the gain.
Curves of multiple models can be plotted together for comparison purpose.
47

Assessing Models
For selection purpose
Cumulative Lift chart
It shows the lift achieved by targeting increasing percentages of the items, ordered by decreasing score.
48

49
Assessing Models
For selection purpose
Steps to construct Gains and Lift charts:
Sort the records in the data set by score, from the highest to the lowest.
Group the sorted records into 10 groups (deciles) of equal size.
Deciles are most commonly used, but other groupings, such as 20 equal-size groups, can be used as needed.
Compute the cumulative gain (cumulative captured response rate) and the cumulative lift value for each decile.
Plot each value against the decile (see the sketch below).
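A sketch of these steps in Python; `scores` is a placeholder array of model scores and `is_event` a matching 0/1 array of actual target levels:

```python
# Sketch: cumulative gains and lift by decile, following the steps above.
import numpy as np
import pandas as pd

def gains_lift_table(scores: np.ndarray, is_event: np.ndarray, groups: int = 10) -> pd.DataFrame:
    df = pd.DataFrame({"score": scores, "event": is_event}).sort_values("score", ascending=False)
    # Assign decile 1..groups to the sorted records (roughly equal-sized groups).
    sizes = np.diff(np.linspace(0, len(df), groups + 1).astype(int))
    df["decile"] = np.repeat(np.arange(1, groups + 1), sizes)

    overall_rate = df["event"].mean()
    out = df.groupby("decile")["event"].agg(["count", "sum"]).cumsum()
    out["cum_gain"] = out["sum"] / df["event"].sum()          # cumulative % of captured responses
    out["cum_response_rate"] = out["sum"] / out["count"]
    out["cum_lift"] = out["cum_response_rate"] / overall_rate
    return out
```

Plotting cum_gain and cum_lift against the decile reproduces the gains and lift charts.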

50
Assessing Models
For selection purpose
Steps to construct gains and lift charts:
Example: Suppose a data set of 5,000 records is scored by a DM model and 1,000 of the items actually belong to the target level.


51
Assessing Models
For selection purpose
Response rate
Measures the proportion of items classified as the target event that really are target events. Also known as the precision rate.
In general (when the number of non-events is larger than the number of events), the response rate of a classification model decreases as more items are selected.
A good classification model should have a higher response rate than that of a naïve model.
A model with an extremely high response rate for a large proportion of selected items should be scrutinized carefully.
When two DM models are compared, the model with the higher response rate for a specified number of selected items is preferred.

52
Assessing Models
For selection purpose
E.g. A data set of 5,000 records is scored by a DM model:
Assume 1,000 of the records actually belong to the target event level (say A).
Suppose that among the highest 10% of scores, 250 records actually belong to the target level.

53
Assessing Models
For selection purpose
E.g. (Continued)

Lift values computed from response rate and captured response rate are identical for the same confusion matrix.

Assessing Models
For selection purpose
Cumulative % of Responses Chart
It shows, for increasing percentages of items ordered by decreasing score, the cumulative percentage of the selected items that really are target events (the cumulative response rate).
54

55
Assessing Models
Adjustment for undersampling
It may not be possible to use a non-undersampled test data set to evaluate the actual performance of a model.
Scores and accuracy measures need to be adjusted to restore the balance of the levels that were underrepresented (often the non-event level) or overrepresented (often the event level of interest) by the sampling process.
If the classification scores are not to be used for other purposes, such as calculating the expected profit, it is not necessary to adjust them.

Assessing Models
Adjustment for undersampling
Confusion matrix adjustment
E.g. Suppose the proportion of the target level (1) in the entire population is 2% and the data set was undersampled. Suppose all target level objects are included in the undersampled data set and the confusion matrix looks like this:

56

57
Assessing Models
Adjustment for undersampling
Confusion matrix adjustment
E.g. (Cont’d)
The adjusted total number of records (X) in the training set is
1000 / X = 0.02, so X = 50000.
The adjusted total number of non-responders (0) in the training set is
50000 × 0.98 = 49000.
The adjusted confusion matrix for the data set is:

Assessing Models
Adjustment for undersampling
Confusion matrix adjustment
E.g. (Cont’d)
For the considered DM model:
Adjusted captured response rate (= 850 / 1000) = Unadjusted captured response rate
Adjusted specificity = Unadjusted specificity
Adjust response rate (= 850 / 5750) ≠ Unadjusted response rate (= 850 / 1250)
Adjusted accuracy (=(850 + 44100) / 50000) ≠ Unadjusted accuracy (= (850 + 3600) / 5000)
Adjusted lift ≠ Unadjusted lift
For the confusion matrix of a naïve model, all its cell values will be adjusted too. The adjusted number of responses will be smaller and hence the adjusted lift value will be larger.
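A small sketch reproducing the adjustment arithmetic above (the event row is kept as-is because all events were retained; the non-event row is rescaled to the population):

```python
# Sketch of the confusion-matrix adjustment for undersampling.
import numpy as np

sampled = np.array([[850, 150],     # actual 1: classified 1 / classified 0
                    [400, 3600]])   # actual 0
event_prop = 0.02                   # proportion of level 1 in the population

n_events = sampled[0].sum()                          # 1000
pop_total = n_events / event_prop                    # 50000
pop_non_events = pop_total - n_events                # 49000

adjusted = sampled.astype(float)
adjusted[1] *= pop_non_events / sampled[1].sum()     # rescale the non-event row

print(adjusted)                                                           # [[850, 150], [4900, 44100]]
print("adjusted response rate:", adjusted[0, 0] / adjusted[:, 0].sum())   # 850 / 5750
print("adjusted accuracy:", (adjusted[0, 0] + adjusted[1, 1]) / pop_total)
```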

58

Assessing Models
Adjustment for undersampling
Score adjustment
It is almost impossible to recover the true scores of a data set that has undergone undersampling.
The following score adjustment can be used for all classification models when the number of target events in the population and the number of target events in the undersampled data set are identical (i.e. all event records were kept):
Let p1 and p0 be the respective classification scores for the target event level (1) and the non-target event level (0) of a record in the undersampled data set.
The adjusted score for p1 is p1 / (p1 + p0 * (np / ns)), where np and ns are the numbers of records belonging to the non-target event level (0) in the population and in the undersampled data set respectively.
59

Assessing Models
Adjustment for undersampling
Classification scores adjustment
E.g. (Cont’d) If a record in an undersampled dataset receives a score of 0.8 for the target level (1) and a score of 0.2 for the non-target level (0), the adjusted scores are:
For target level (1) : 0.8 / (0.8 + 0.2 *12.25) = 0.2462
For non-target level (0): 0.2 * 12.25 / (0.8 + 0.2 *12.25) =0.7538
For classification purposes, the cutoff value needs to be set to a lower value if the scores are adjusted.
For selection purposes, score adjustment is not needed because the adjustment does not affect the ranking of the scores.
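A sketch of the score adjustment, reproducing the worked numbers above (p1 = 0.8, p0 = 0.2, np/ns = 49000/4000):

```python
# Sketch: adjust the scores of an undersampled record back toward the population.
def adjust_scores(p1: float, p0: float, n_pop_nonevent: float, n_sample_nonevent: float):
    ratio = n_pop_nonevent / n_sample_nonevent       # np / ns
    denom = p1 + p0 * ratio
    return p1 / denom, p0 * ratio / denom            # adjusted (p1, p0), still summing to 1

adj1, adj0 = adjust_scores(0.8, 0.2, 49000, 4000)
print(round(adj1, 4), round(adj0, 4))                # 0.2462 0.7538
```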

60

61
Assessing Models
Classification By Profit
If profit information is available, an item can be classified into the level that generates the highest predicted profit.
The performance of a model can also be assessed based on the total predicted profit generated by the model.
A model with a higher total predicted profit is preferred to one with a lower total predicted profit.
For an undersampled data set, adjust the score first, then compute the predicted profit.

62
Assessing Models
Classification By Profit
E.g. Suppose the following profit matrix is available for all items:

Suppose an item receives a score of 0.4 for level 1 and 0.6 for level 0.
Predicted profit if the item is classified to level 1:
0.4×(7) + 0.6×(- 4) = 0.4
Predicted profit if the item is classified to level 0:
0.4×(- 2) + 0.6×(0) = -0.8
This item will be classified into level 1.
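A sketch of this expected-profit classification using the slide's profit matrix and scores:

```python
# Sketch: classify the item into the level with the highest expected profit.
import numpy as np

profit = np.array([[7, -2],     # actual 1: classified 1 / classified 0
                   [-4, 0]])    # actual 0
p1, p0 = 0.4, 0.6               # the item's scores for levels 1 and 0

expected_if_1 = p1 * profit[0, 0] + p0 * profit[1, 0]   # 0.4*7 + 0.6*(-4) = 0.4
expected_if_0 = p1 * profit[0, 1] + p0 * profit[1, 1]   # 0.4*(-2) + 0.6*0 = -0.8
decision = 1 if expected_if_1 >= expected_if_0 else 0
print(round(expected_if_1, 2), round(expected_if_0, 2), decision)   # 0.4 -0.8 1
```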

Assessing Models
Classification By Profit
In general, if the following profit matrix is available for all items:

The decision is to find a cutoff for the target event level (1) such that an item is classified as 1 when the expected profit of doing so is at least that of classifying it as 0:
p1 * Dtp + p0 * Dfp >= p1 * Dfn + p0 * Dtn
Or equivalently, when
p1 >= (Dtn - Dfp) / [(Dtp - Dfn) + (Dtn - Dfp)]

63

Assessing Models
Classification By Profit
Profit matrix may be different for each item. Consider the following profit matrix for item j:

where C is the cost and Rj is the revenue that will be generated by object j.
If Pj is the score for object j belonging to level 1, then object j will be classified into level 1 if
Pj * (Rj - C) + (1 - Pj) * (-C) >= 0
Or equivalently,
Pj >= C / Rj
64
                  Classified
                  1          0
Actual   1        Rj - C     0
         0        -C         0

Assessing Models
Classification By Profit
The value of Rj is often only known for the records belonging to the target event level in the training data.
For a new item, value of Rj needs to be estimated.
Steps to estimate Rj using training data set:
Obtain Pj from the developed model.
Estimate Rj using only the training data that belong to the target event level.
Heckman (Nobel prize winner in Economics, 2000) suggests that Pj should be included as one of the independent variables in order to counteract the sample selection bias.
65

66
Assessing Models
Selecting the best model
Models must be compared based on the same training set (or validation set if available).
Confusion matrix, captured response rate, and lift are compared as appropriate.
[Figure: Models A, B, and C are each built on the training set and compared on the validation set.]

67
Assessing Models
Selecting the best model
Combining results from different models
For a categorical target variable, take a majority vote of the results or average the scores.
Most useful when the considered models perform differently.
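A sketch of the two combination rules (score averaging and majority vote) applied to three models' scores for the same items; the scores are placeholders:

```python
# Sketch: combine three models' event scores for the same three items.
import numpy as np

scores = np.array([[0.80, 0.35, 0.55],    # model A's event scores for 3 items
                   [0.70, 0.45, 0.40],    # model B
                   [0.60, 0.30, 0.65]])   # model C

avg_score = scores.mean(axis=0)                  # average the scores
vote = (scores >= 0.5).sum(axis=0) >= 2          # majority vote at cutoff 0.5

print(avg_score.round(2))   # [0.7  0.37 0.53]
print(vote.astype(int))     # [1 0 1]
```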
[Figure: Models A, B, and C built on the training set are combined into an ensemble model, which is assessed on the validation set.]

68
Examples
Example 4.1
Simulated data CLUSTERSIM2.SAS7BDAT
Open the EM project Project2 that was created in Chapter 2.
Import SAS data Clustersim2 into the project.
Create a diagram and name it ClusterSim2. Drag the data set Clustersim2 into the diagram.
For demonstration purpose, levels B and C of variable Class are to be combined and set as the target level of interest. The other two levels are also to be combined.
Drag a Replacement node into the diagram and connect the Input Data node to it.
Replace the B and C levels of Class by Yes and its A and D levels by No.

Examples
Example 4.1
To change the Role of Rep_Class from Input to Target:
Drag a Metadata node into the diagram and connect the Replacement node to it.
Set the following Variables properties of the Metadata node:
Set the New Role of Rep_class to Target.
Set the New Level of Rep_class to Binary.
Set the New Order of Rep_class to Descending.

69

Examples
Example 4.1
To explore the distributions of Rep_class and its relationship with the other variables:
Connect the Metadata node to a StatExplore node.
Set Chi-Square Statistics | Interval Variables to Yes so that the chi-square statistics between the target variable and each input interval variable (in binned form) can be explored.
The Chi-Square statistics indicate that the three interval variables are strongly associated with Rep_Class.
Also connect the Metadata node to MultiPlot node.
The Yes level of Rep_Class is associated with higher values of col1; it is also associated with the lower and higher values of col2, and with mid-values of col3.
70

71
Examples
Example 4.1
Partition the data set.
Connect Metadata node to a Data Partition node.
For the Data Partition node:
Set Partition Role of Rep_Class as Stratification.
Select Partition Method | Stratified
If the ratio of the target levels is even, simple random sampling can be used as well.
Set Data Set Percentages as follows:
Training – 50
Validation – 25
Test – 25
Run the Data Partition node and view the results to confirm that the partition is properly done.

Examples
Example 4.2: LoanDefault_Sample contains 37,580 loan records from Jun 2007 – Dec 2011, including the loan status (Fully Paid or Charge Off).
Import the data set into Project2.
In the Variables window, set the following Level or Role:
Earliest_Cr_Line : Input Role
Grade, Sub_Grade : Ordinal Level
Loan_Status: Target Role, Binary Level, Ascending Order
Int_Rate, Issue_Date, Revol_Util, Policy_Code, and Pub_Rec_Bankruptcies: Rejected Role

72

Examples
Example 4.2:
Create a new diagram named LoanDefault.
Drag the data set LoanDefault_Sample into the diagram. Run the Input Data node.
Connect the Input Data node to StatExplore node. Set Interval Variables of Chi-Square Statistics property to Yes.
Connect the Input Data node to a Transform Variables node. Define the following new variables:
Trn_Int_Rate = input(compress(int_rate,'%'),8.)
Trn_Revol_Util = input(compress(revol_util,'%'),8.)
Trn_Revol_Util contains missing values.
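For reference, an illustrative pandas equivalent of these two transformations (strip the '%' sign and convert to numeric; missing values remain missing); this is a sketch outside SAS EM, not part of the node setup:

```python
# Sketch: pandas equivalent of the two SAS EM transformations above.
import pandas as pd

df = pd.DataFrame({"int_rate": ["13.5%", "7.9%"], "revol_util": ["56%", None]})
df["Trn_Int_Rate"] = pd.to_numeric(df["int_rate"].str.replace("%", "", regex=False))
df["Trn_Revol_Util"] = pd.to_numeric(df["revol_util"].str.replace("%", "", regex=False))
```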

73

Examples
Example 4.2:
Connect the Transform Variables node to a Data Partition node.
Set the Role of Loan_Status to Stratification, the Partitioning Method property to Stratified.
Set the Training and Testing properties to 80 and 20 respectively.
The test set will be used later for checking the generalization power of a selected classification model.
Run the node.
74

Examples
Example 4.2:
Connect the Data Partition node to an Impute node.
Set Missing cutoff to 99.
Set all Default Input Methods to None.
Set Method for mths_since_last_delinq and Trn_Revol_Util to Tree.
Set the following for Indicator Variables: Type to Unique; Source to Imputed Variables; Role to Input.
An indicator variable will be created for each imputed variable, with 1 = imputed and 0 = not imputed for each record.
Run the node.
75

Examples
Example 4.2
Connect the Impute node to a Variable Selection node. Set the following properties of the node:
Set Target Model to Chi-Square.
All variables that pass the Chi-Square test will be selected.
Set Hide Rejected Variables to No.
Run the node.
Connect the Variable Selection node to another Variable Selection node.
76

Examples
Example 4.2
Set the following properties of the second Variable Selection node:
Set Use to No for all variables with Input Role.
Set Use to Yes for all variables with Rejected Role.
Set Target Model to R-Square.
All variables that pass the R-Square test will be selected.
Set Rejects Unused Variables to No.
Set both Use AOV16 Variables and Use Group Variables properties to Yes.
Set both Hides Rejected Variables and Hides Unused Variables to No.
Run the node.
77

Examples
Example 4.2
Connect the second Variable Selection node to another Data Partition node to create a training data set and a validation data set.
Set the Role of Loan_Status to Stratification, the Partitioning Method property to Stratified.
Set the Training and Validation properties to 70 and 30 respectively. Run the node.
Connect the above Data Partition node to a SAS Code node. Then connect the last Variable Selection node to the same SAS Code node as well.
Name the SAS Code node as SASCode1.
The order of connection is important; otherwise the correct training set will not be passed to SASCode1.
Run the node.
78

Examples
Example 4.2
Connect the second Variable Selection node to a Sample node to perform an undersampling process for a balanced sample. Set the following properties of the Sample node:
Set the Sample Role of variable Loan_Status as Stratification, and the Sample Method property as Stratify.

79

Set the Criterion property as Level Based.
Set Level Selection as Event, Level Proportion as 100, and Sample Proportion as 50.
The Sample node will only import the training data set and export the sample data set. The test data set will not be exported from the node.
Run the node.

Examples
Example 4.2
Connect the Sample node to a new SAS Code node.
Name this SAS Code node as SASCode2.
Then connect the last Variable Selection node to SASCode2 node so that the earlier derived test data set can be passed through SASCode2.
Run the node.
Save the project for later uses.

80


Earlier months                          Most recent 3 months    Target variable
No credit card                          –                       Excluded
With credit card but no transaction     –                       Excluded
With credit card                        With transactions       No (0)
With credit card                        No transaction          Yes (1)

Naming of time dependent sales variables (training vs. deployment):
                 Input months                                                   Target month
Months:          Jan       Feb       Mar       Apr       May       Jun         Aug
Sales variable:  Sales_06  Sales_05  Sales_04  Sales_03  Sales_02  Sales_01
Deployment:      Feb       Mar       Apr       May       Jun       Jul         Sep
                 Mar       Apr       May       Jun       Jul       Aug         Oct

[Figures: each object has 12 time units of history. Instead of one record per object covering the same 12 units, each object contributes several training records built from shifted 8-unit windows (units 1–8, 2–9, …, 5–12).]

Confusion matrix of the data mining model (cutoff 0.5):
                     Classified
                     Class A      Class B     Total
Actual   Class A     250 (25%)    50 (5%)     300
         Class B     300 (30%)    400 (40%)   700
         Total       550          450         1000

Captured response rate:
  Data mining model: 250 / 300 = 0.83
  Naïve model:       165 / 300 = 0.55

[Figure: cumulative % of captured responses (gains) by decile for the data mining model versus the naïve model; the naïve model follows the diagonal baseline.]

[Figure: cumulative lift by decile for the data mining model versus the naïve model; the naïve model's lift stays at 1.]

DM model:
                  Top 10%    Remaining 90%    Total
Actual   A        250        750              1000
         B        250        3750             4000
         Total    500        4500             5000

Naive model:
                  Top 10%    Remaining 90%    Total
Actual   A        100        900              1000
         B        400        3600             4000
         Total    500        4500             5000

                  DM model           Naive model
Response rate     250/500 = 0.5      100/500 = 0.2
Lift value        0.5/0.2 = 2.5

Confusion matrix of the undersampled data set:
                  Classified
                  1        0        Total
Actual   1        850      150      1000
         0        400      3600     4000
         Total    1250     3750     5000

Adjusted confusion matrix:
                  Classified
                  1        0        Total
Actual   1        850      150      1000
         0        4900     44100    49000
         Total    5750     44250    50000

Profit matrix:
                  Predicted
                  1        0
Actual   1        7        -2
         0        -4       0

General profit matrix:
                  Predicted
                  1        0
Actual   1        Dtp      Dfn
         0        Dfp      Dtn
