Chapter 5: Classification Models
Contents
Logistic regression models and SAS Regression node.
Decision tree models and SAS Decision Tree node.
Neural network models and SAS Neural Network node.
Ensemble models and SAS Ensemble node.
Other SAS Utility and Assess nodes.
Logistic Regression Models
What is a logistic regression model?
Let Y denote a binary target variable taking the value 0 or 1.
Assume Y depends on the value of an input variable X (or a vector of Xs), such that
P(Y=0|X) and P(Y=1|X) are unknown but P(Y=0|X) + P(Y=1|X) =1.
E(Y|X) = 1 × P(Y=1|X) + 0 × P(Y=0|X) = P(Y=1|X)
E(Y|X) is bounded between 0 and 1 for all X values.
Var(Y|X) = P(Y=1|X) (1- P(Y=1|X))
Var(Y|X) is not constant; it depends on X.
Consider the standard linear regression model approach:
E(Y|X) = b0 + b1 X
The major problem of this approach is that the predicted E(Y|X) may fall below 0 or exceed 1, which is invalid for a probability.
Logistic Regression Models
What is a logistic regression model?
Suppose X and E(Y|X) can be represented by a non-linear equation:
E(Y|X) = P(Y=1|X) = exp(b0 + b1 X) / (1 + exp(b0 + b1 X))
The right-hand side of the equation is called the logistic response function.
Other forms of response function can also be used, for example the cumulative distribution function of the standard normal distribution (probit regression model).
Logistic Regression Models
What is a logistic regression model?
The log(odds) has a linear relationship:
log( P(Y=1|X) / (1 − P(Y=1|X)) ) = b0 + b1 X
The log(odds) is called the logit.
Let b0 and b1 now denote the estimates of the parameters; the fitted logistic response function is:
p(Y=1|X) = exp(b0 + b1 X) / (1 + exp(b0 + b1 X))
Logistic Regression Models
What is a logistic regression model?
Interpretation of b1.
The fitted logit at X = xj is logit(xj) = b0 + b1 xj.
The fitted logit at X = xj + z, where z is a constant, is logit(xj + z) = b0 + b1 (xj + z).
Logistic Regression Models
What is a logistic regression model?
Interpretation of b1
zb1 is the difference between the two fitted log-odds:
logit(xj + z) − logit(xj) = z b1
In other words, the fitted odds ratio equals exp(zb1).
E.g. if z = 1 and b1 = 0.2, the odds ratio equals exp(0.2) = 1.22; that is, the odds increase by 22% for a 1-unit increase in X. If the odds ratio equals exp(−0.2) = 0.82, the odds decrease by 18% for a 1-unit increase in X.
The amount of change is non-linear in b1: the odds ratio exp(zb1) increases rapidly for large positive b1 and approaches 0 slowly for very negative b1.
Logistic Regression Models
What is a logistic regression model?
Example
Whether a respondent would accept a news subscription service:
Target variable: Accept (No: Y = 0 ; Yes: Y = 1).
Input variable: Age (X).
Fitted logistic response function:
p(Y =1 | X) = exp(-2.1961+0.0545 X) / (1+ exp(-2.1961+0.0545 X) )
(Figures: the historical data and the fitted logistic regression curve of p(Y=1|X) against Age.)
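As a quick check of the fitted model, the following data step (a minimal sketch, not part of the original example) scores a hypothetical respondent aged 40 with the fitted equation above:

data _null_;
  age = 40;
  xb = -2.1961 + 0.0545*age;    /* fitted linear predictor (logit) */
  p  = exp(xb) / (1 + exp(xb)); /* fitted logistic response        */
  put p=;                       /* prints p=0.4959...              */
run;

The linear predictor is −2.1961 + 0.0545(40) = −0.0161, so the fitted probability of acceptance is about 0.496.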
Logistic Regression Models
General logistic response function with k input variables:
E(Y|X) = P(Y=1|X) = exp(b0 + b1 X1 + … + bk Xk) / (1 + exp(b0 + b1 X1 + … + bk Xk))
Let b0, b1, …, bk be the estimates of the parameters; the fitted logistic response function takes the same form with the estimates substituted in.
The fitted odds increase or decrease by a factor of exp(zbi) for a z-unit change in Xi while all other Xs are held constant.
Building a Logistic Regression Model (Self-Study)
Input variables could be interval or categorical
Use dummy variables to represent the levels of a categorical variable.
A dummy variable represents the presence or absence of a level.
It only has two values: 0 and 1.
A categorical (nominal or ordinal) variable with m levels requires m − 1 dummy variables.
Example: A risk variable with 3 levels (A, B, C) needs 2 dummy variables: A = (X1 = 0, X2 = 1), B = (X1 = 1, X2 = 0), and C = (X1 = 0, X2 = 0). (See the sketch below.)
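A minimal data-step sketch of this coding (the table train and variable risk are hypothetical; the SAS Regression node creates such dummies automatically):

data coded;
  set train;
  x1 = (risk = 'B');  /* 1 if level B, 0 otherwise */
  x2 = (risk = 'A');  /* 1 if level A, 0 otherwise */
  /* level C is the baseline: x1 = 0 and x2 = 0    */
run;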
Building a Logistic Regression Model (Self-Study)
Input variables in other forms
Higher order terms
If an input variable is not linearly related to the log-odds, higher-order terms of that variable can be included.
E.g.: logit = b0 + b1 X1 + b2 X1^2
Interaction terms
The amount of change in the log-odds due to X1 may depend on the value of X2 when X1 and X2 change simultaneously.
E.g.: logit = b0 + b1 X1 + b2 X2 + b3 X1 X2
Prior knowledge is important in selecting interaction terms.
Building a Logistic Regression Model
Stepwise model building
Split the model data set into training, validation and test.
Only use the training and validation sets for model building.
Training set – for parameters estimation.
Validation set – for selecting the best model according to a pre-specified criterion (such as classification accuracy maximization, or profit maximization).
Building a Logistic Regression Model
Stepwise model building
Assume there are k input variables.
At step j:
Fit all models that add one of the remaining variables to the variables selected in the previous steps.
Select the most statistically significant of these models.
If none of the models is significant, terminate the process.
Check whether any of the previously selected variables can be dropped.
Repeat until no more variables can be added or dropped (see the sketch below).
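Outside Enterprise Miner, the same kind of stepwise selection can be sketched in base SAS (the table train, target y, and inputs x1-x5 are hypothetical):

proc logistic data=train;
  model y(event='1') = x1-x5
        / selection=stepwise slentry=0.1 slstay=0.15;
run;

The slentry and slstay options play the role of the entry and stay significance levels used by the Regression node in Example 5.1.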
Building a Logistic Regression Model
Stepwise model building
Suppose k’ models have been considered.
The model at the last step is not necessarily the best model.
Select the best model according to selection criterion using the validation data set.
(Diagram: Model 1 through Model k′ are fitted on the training data set; the best is selected using the validation data set.)
Building a Logistic Regression Model
Advantages of logistic regression models
Well-established statistical method.
If the model is correctly specified, it provides insight into how the input variables are related to the target variable.
A single equation (for binary classification) computes the class membership probability for all values of the input variables.
Disadvantages of logistic regression models
The form of the relationship between the input variables and the target, and the inter-relationships among the inputs, must be known in advance.
Sensitive to outliers.
Cannot handle observations with missing values in input variables.
The model is mathematically more complicated.
Multiple Classification Models (Self-Study)
Single multiple-classification model
Multinomial logistic regression model
The logistic regression model can be extended to classify more than 2 classes.
Breaks the regression model into a series of binary regressions comparing each group to a baseline group.
All parameters in the model are estimated simultaneously.
Advantages:
Only one model is needed.
Estimates are statistically efficient.
Disadvantages:
Each response function must take the same form. However, a form that is good at separating level A from D may not do a good job separating level B from D. (See the sketch below.)
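As a sketch of how such a model could be fitted in base SAS (the table train, nominal target y with baseline level 'D', and inputs x1-x3 are hypothetical), the generalized logit link fits all the binary comparisons against the baseline simultaneously:

proc logistic data=train;
  model y(ref='D') = x1-x3 / link=glogit;  /* multinomial logistic model */
run;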
Multiple Classification Models (Self-Study)
Combining binary classification models
Create several binary models, each predicting one level against the rest.
Assemble the prediction probabilities from the outputs of these models.
Approach 1: The highest probability indicates the level.
Approach 2: The probabilities may be used as inputs for a single multiple-classification model.
SAS Regression Node
Features of the Regression Node
The node supports binary, interval, and ordinal target variables.
Automatically replaces categorical input variables with dummy variables.
Does not use observations that contain missing values.
Can generate higher-order or interaction variables.
Provides forward, backward, and stepwise model-building methods.
The training data set is used to estimate the parameters.
The validation data set is used to select the best model according to the model selection criterion.
Test data set will not be used.
SAS Regression Node
Features of the Regression Node
Allows different model selection criteria, such as classification error or profit/loss.
Exports the computed fitted score for each observation.
Provides assessment measures and gains charts for the training and validation data sets in the results.
Variables that appear in the data set but are not included in the model are assigned the Rejected role in the exported data set.
SAS Regression Node
Example 5.1: Refer to the diagram ClusterSim2 in Project2
Settings of Regression node:
Connect the Data Partition node to a Regression node. Name this node Regression1.
In Variable window of Regression node, set Use to ‘No’ for variable Class and the others to ‘Yes‘.
Set Model Selection | Selection Model to Stepwise.
Set Model Selection | Selection Criterion to Validation Misclassification
The cutoff value is set to 0.5 for binary target variable. This cutoff value cannot be changed in this node.
For highly unbalanced target levels, one should use Validation Error (negative log-likelihood for logistic regression), or decision matrix (see later discussion).
SAS Regression Node
Example 5.1:
Settings of Regression node:
Set Use Selection Defaults to No. Change the Entry Significance Level and Stay Significance Level of Selection Options to 0.1 and 0.15 respectively, and set the Maximum Number of Steps to 50.
Set Print Output Options | Details to Yes for printing the details of each model step.
Set Print Output Options | Design Matrix to Yes for printing the coded input for the categorical variables.
This example has no categorical input variable.
SAS Regression Node
Example 5.1
Output window
SAS Regression Node
Example 5.1
Output window
Confusion matrix, based on cutoff = 0.5
SAS Regression Node
Example 5.1
Results
View Assessment Charts
In the Results window, select View | Assessment | Score Rankings Matrix.
This includes the cumulative captured response chart, cumulative response chart, and cumulative lift chart.
All Best curves are computed using the actual responses. All Baseline curves are computed from the naïve model. The closer the Model curve is to the Best curve, the better the model.
X% Depth = Top X% scores
SAS Regression Node
Example 5.1
Results
View Assessment Charts
In Results window, select View | Assessment | Classification Chart.
The misclassification percentage (red colour) for each level should be low. The cutoff value used for this chart equals 0.5, which may not be appropriate for your classification task.
SAS Regression Node
Example 5.1
Results
View Assessment Charts
In Results window, select View | Assessment | Score Distribution.
The scores of events and non-events should be well separated.
SAS Regression Node
Example 5.1
Results
SAS EM only displays gains charts in the results of a classification model node; an ROC chart is not included.
To determine the area under the ROC curve for the validation data set, connect the Regression1 node to a SAS Code node.
Update the node and then click the Code Editor option.
SAS Regression Node
Example 5.1
Results
Type the following code to obtain the rank sums for the validation data set:
proc npar1way data = &em_import_validate wilcoxon;
class rep_class;
var p_rep_classyes;
run;
Replace &em_import_validate by &em_import_data or &em_import_test for training or test data set respectively as needed.
Save and run the node. From the npar1way output, the Wilcoxon rank sum of the event level is W = 1481694, with n1 = 1000 events and n0 = 1000 non-events in the validation data. The AUC follows from the Mann-Whitney statistic:
U = W − n1(n1 + 1)/2 = 1481694 − (1000)(1001)/2 = 981194
AUC = U / (n1 n0) = 981194 / ( (1000)(1000) ) = 0.981194
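The same arithmetic in a data step (a sketch; the rank sum W is read off the npar1way output above):

data _null_;
  W = 1481694;  n1 = 1000;  n0 = 1000;
  U = W - n1*(n1 + 1)/2;  /* Mann-Whitney U from the rank sum */
  auc = U / (n1*n0);      /* area under the ROC curve         */
  put U= auc=;
run;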
SAS Regression Node
Example 5.1
Can we further improve the results?
At the cutoff 0.5 the model accuracy is 0.9482; the fitted scores of target objects are high and the fitted scores of non-target objects are low; and the cumulative response and cumulative captured response curves are far from those of the naïve model. All results indicate that the selected logistic model classifies the objects in the training and validation data well.
Copy the first Regression node and paste it into the diagram so that the settings of the first Regression node are preserved in the second Regression node.
Connect the Partition node to the second Regression node. Name this node Regression2.
SAS Regression Node
Example 5.1
Can we further improve the results?
Set Equation | Polynomial Terms | Yes to include polynomial terms for interval variables. This includes cross products of interval variables.
(Set Equation | Two Factor Interactions | Yes to include all two-factor interactions for categorical variables. We do not need to do this here as the data set contains no categorical input variable.)
Use these settings with care, as they can be quite computationally intensive.
Run Regression2.
SAS Regression Node
Example 5.1
Can we further improve the results?
Even though the estimated coefficient of col3*col3 is insignificant, including the term lowers the misclassification rate.
SAS Regression Node
Example 5.1
Can we further improve the results?
Classification is almost perfect.
SAS Regression Node
Example 5.1
Model Comparison node
Assessment outputs are similar to those from model nodes such as Regression, Neural Network, and Decision Tree. It puts the assessment statistics from different models together for comparison.
It selects the best model according to specified criteria automatically, but we will only use this node to compute the assessment statistics for comparing the models manually. We should make our own decision in model selection.
For selection purpose, compare the models under Score Rankings Overlay (Cumulative Lift or Cumulative Response, and Cumulative Captured Response) based on the validation data.
For classification purpose, compare the models under the ROC Chart based on the validation data.
Assessment statistics that are computed on test data should be used only to understand how well a model will generalize when applied to new data. It should not be used for model selection.
SAS Regression Node
Example 5.1
Model Comparison node
Connect both Regression1 and Regression2 nodes to a Model Comparison node. Keep all default settings. Run the node.
SAS Regression Node
Example 5.1
Cutoff Node
The Cutoff node provides statistics to help the user determine an appropriate cutoff value for classification.
The default cutoff value is 0.5 for a binary target. The cutoff value can be changed by the user or selected by the system according to a selection criterion.
It classifies observations to the respective level based on the specified cutoff value.
The column EM_CUTOFF, containing the classified levels (1 for the target level of interest), and the Into_ column are included in the exported data set.
Usually the node is run at least twice: the first run produces the statistics used to determine the cutoff value, and the second run classifies the observations according to the determined cutoff value.
The node creates many graphs based on various statistics computed from the training set. We will only focus on the Positive Rates and True Rates charts.
SAS Regression Node
Example 5.1
Cutoff Node
For illustration purposes, connect the Regression1 node to a Cutoff node. Keep the default settings. Run the node.
Positive Rates (computed from the training data set):
In general, select a cutoff value that gives a high True Positive Rate and a low (or acceptable) False Positive Rate. A cutoff that maximizes the difference between the True Positive Rate and the False Positive Rate often gives a satisfactory result.
SAS Regression Node
Example 5.1
Cutoff Node
True Rates (computed from Training data set):
In general, select a cutoff value that gives a high True Positive Rate and a high True Negative Rate. Choosing a cutoff value that minimizes the absolute difference between the True Positive Rate and the True Negative Rate often gives a satisfactory result.
SAS Regression Node
Example 5.1
Cutoff Node
Each observation in the Training data set, Validation data set, and Test data set is assigned to one of the target levels according to its computed score and cutoff value.
The Exported data of the Cutoff node:
SAS Regression Node (Self-Study)
Example 5.1
Cutoff Node
Rate definitions (TP, TN, FP = true positive, true negative, and false positive counts; AP, AN = actual positive and actual negative counts; CP = classified positive count):
True Positive Rate: TP / AP (= Sensitivity, Captured Response Rate)
True Negative Rate: TN / AN (= Specificity)
False Positive Rate: FP / AN (= 1 − Specificity)
Event Precision Rate: TP / CP (= Response Rate)
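As a worked illustration (hypothetical counts, not from the example data): suppose a data set has AP = 100 actual events and AN = 900 actual non-events, and a given cutoff classifies CP = 160 records as events, of which TP = 70 are correct (so FP = 90). Then the True Positive Rate is 70/100 = 0.70, the False Positive Rate is 90/900 = 0.10, the True Negative Rate is 810/900 = 0.90, and the Event Precision Rate is 70/160 = 0.4375.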
SAS Regression Node (Self-Study)
Example 5.1
If a model has been selected and the cutoff value (for classification purpose) or the top X% of scores (for selection purpose) has been determined, it is then necessary to estimate the performance of the model when it is actually deployed. The performance of the model on new data can be estimated by scoring the Test data set with the model.
Suppose the task is to select the objects in the top 20% of classification scores from Regression1.
Connect Regression1 to a Model Comparison node. Run the node.
In the result of the Model Comparison node, view either Score Rankings Overlay or Score Rankings Matrix. Then click the Table icon to view the data used to plot these graphs.
SAS Regression Node (Self-Study)
Example 5.1
At the top 20% of scores:
Confusion matrix
SAS Regression Node (Self-Study)
Example 5.1
Suppose now the task is to classify an object to the target event level when its computed score from Regression1 is above the cutoff 0.42.
From the Cutoff node that follows Regression1, click any chart, and then click the Table icon to review the data that are used to plot the chart. From the data table, locate the record with Data Role =‘Test’ and Cutoff=0.42.
SAS Regression Node
Example 5.2: Refer to the diagram Donor in Project2
A charity wants to develop a model for selecting potential donors (a selection task).
The data donor_raw_data contains:
19,372 observations.
Variable Target_B indicates whether the respondent donated in the last campaign (1 = donated (25%), 0 = not donated (75%)).
Variable Target_D indicates the amount donated by the donor in the last campaign (set to Rejected in the Input Data node).
The processed dataset has been partitioned into: training (60%), validation (20%), and test (20%);
Variable Selection node suggested these variables are associated with Target_B: Frequency_Status_97NK, G_Rep_Cluster_Code, Opt_Card_Prom_12, Opt_Last_Gift_Amt, Opt_Recent_Avg_Card_Gift_Amt, Opt_Rep_Months_Since_Last_Pr0m_R, Pep_Star
SAS Regression Node
Example 5.2
Trial 1:
Connect the second Variable Selection node to a Regression node.
Name the Regression node as Regression1.
In Variables window, set Use of Donation to Yes. (You may leave it as Default).
The Regression Type should have been set to Logistic Regression automatically. If not, change it to Logistic Regression.
Set Input Coding to GLM.
Set Selection Model to Stepwise.
Set Selection Criteria to Validation error.
Set Use Selection Defaults to No. Set Entry Significance level and Stay Significance level to 0.1 and 0.15 respectively. Set Maximum Number of Steps to 50.
Set Details to Yes, and Design Matrix to Yes.
Run the node. (The stepwise selection process can be computationally demanding when there are many input variables.)
SAS Regression Node
Example 5.2
Trial 1:
Results: View | Score Rankings Matrix
The cumulative response and cumulative captured response curves are both far from the Best curve, indicating that the model does not separate the levels of Donation very well.
(Charts: Cumulative Response % and Cumulative Captured Response %.)
SAS Regression Node
Example 5.2
Trial 2:
Copy and paste Regression1 node. Name the pasted node as Regression2.
Connect the Transform Variables node to Regression2 node and update it.
Both the original variables and the binned variables will be considered in this Regression node.
Since the variable Donor_Age contains many missing values, observations with a missing Donor_Age would not be used in training. Set Use of this variable to No in the node.
Set Polynomial Terms and Polynomial Degree in Equation property of Regression2 node to Yes and 2 respectively.
Run the node. (It takes much longer to run than Regression1.)
Results:
Some improvement over Regression1.
SAS Regression Node
Example 5.2
Trial 3:
Connect the Transform Variables node to Impute node.
For Impute node:
Set Default Input Method for both Class Variables and Interval Variables to None.
In Variables property, set Use of Donor_Age to Yes, and Method to Tree.
Run the node.
Copy and paste the Regression2 node. Name the pasted node Regression3. Connect the Impute node to the Regression3 node and update it. Run the node.
Results: Performance is similar to the other two Regression nodes.
SAS Regression Node
Example 5.2
Let's say the modelling goal is to select the top 20% of potential donors.
Connect all three regression nodes to a Model Comparison node. Set the following properties of Model Comparison node:
Set Selection Statistics to Cumulative Lift, and Selection Depth to 20.
Set Selection Table to Validation.
Run the node (Running time of this node can be very long).
Decision Tree Models
What is a decision tree model?
A structure that divides a set of records into successively smaller sets of records.
With each successive division, the members of the resulting sets become more and more homogeneous with respect to the target variable.
The target variable can be categorical as well as interval, but we will only focus on a binary target.
Presents decision rules in plain language.
Allows one to explain the reason for a decision, but it is not supposed to be used for explaining the causal relationship between the target variable and the set of input variables.
Decision Tree Models
What is a decision tree?
Structure of a tree
Root node
Child node / Decision node
Leaf node
A unique path from the root to each leaf
(Tree diagram: the root node splits on Age at 44. The Age > 44 branch ends in a leaf; the Age <= 44 branch splits on Income at $30K into two leaves. Each leaf is labelled No or Yes, and the figure marks the root node, a child node, a leaf node, and one root-to-leaf path.)
Decision Tree Models
What is a decision tree?
Classification
The path from the root to a leaf forms the rule to classify a record.
E.g.:
IF Customer's Age <= 44
AND Customer's Income > $30K
THEN Customer Responds
All records that are contained in the same leaf node are classified into the same class.
Majority voting is employed for classification; that is, for binary classification the default cutoff equals 0.5. However, a lower or higher cutoff can still be employed.
Different leaves may make the same classifications, but for a different reason.
Hence, the paths (or rules) are different but the conclusions are the same.
Decision Tree Models
What is a decision tree?
Scoring
The proportion of objects at the target level in a leaf node is used as the score: an estimate of how likely an object in that leaf belongs to the target level.
All objects in the same leaf node receive the same score regardless of the values of their respective input variables.
E.g. suppose the target level of interest is Yes. All objects in Node 1 receive a score of 0.048, i.e. the estimated probability that an object with Age <= 44 and Income <= $30K will respond is 0.048.
(Tree diagram: the root node (51.4% No, 48.6% Yes) splits on Age at 44. The Age > 44 leaf contains 9.4% No, 90.6% Yes. The Age <= 44 node (70.5% No, 29.5% Yes) splits on Income at $30K: the Income > $30K leaf contains 90.5% No, 9.5% Yes, and the Income <= $30K leaf, marked Node 1, contains 95.2% No, 4.8% Yes.)
Decision Tree Models
Preparing the data
Data cleaning
Most decision tree algorithms have some mechanism for handling missing values. Possible approaches include:
A separate branch for missing values.
Assigning records with missing values to one of the branches according to the adopted tree-building criterion.
Applying imputation for missing values may be helpful but is not essential.
Decision Tree Models
Preparing the data
Transformation
Skewness or outliers have no effect on the performance of a decision tree.
Decision trees use the rank order of an interval input, not its actual value.
No need to include higher-order polynomial terms.
But including such terms manually may speed up the tree-building process.
Including new variables derived from existing inputs can help the decision tree capture hypothesized relationships among the variables, such as interaction terms. A decision tree splits the training data on one variable at a time, so it is not very good at detecting joint effects of the input variables on the target.
No need to group levels of a categorical variable in advance
Doing so manually can speed up the training process.
Building A Decision Tree Model
General steps of building a decision tree
Step 1:
The tree starts as a single node (root node) containing the training records.
If the records are all of the same class, the partitioning stops.
Step 2:
Otherwise, search for a variable with a split that will best separate the training records.
A splitting criterion is used to choose the variable and split.
The partitioning stops if no variable is selected.
The root node is then partitioned into 2 or more child nodes according to the selected split.
Building A Decision Tree Model
General steps of building a decision tree
Step 3:
Repeat for each child node:
If it contains only one class (a pure node), or certain pre-specified criteria are satisfied, such as the maximum depth of the tree having been reached, it becomes a leaf node.
Else if a variable can be found that best separates the training samples in the node, the node is split into two or more child nodes.
Else if no variable can be found to best separate the training samples in the node, it then becomes a leaf node.
Building A Decision Tree Model
Major components of a decision tree:
Searching for splits
Missing value strategy
Splitting criterion
Stopping rules
Tree pruning
Searching for Splits
Determine splits for nominal input variables
If the maximum number of allowed branches from each node is higher than or equal to the number of classes of the variable being considered, split the variable at every class.
If maximum number of allowed branches is lower than the number of classes:
Compare all possible combination of splits, and then choose the best split according to the pre-specified splitting criterion.
Searching for Splits
Determine splits for ordinal input variables
If maximum number of allowed branches is higher than or equal to the number of orders of the variable being considered, split the variable at every order.
If maximum number of allowed branches is lower than the number of orders:
Only adjacent orders are grouped to preserve the order.
Compare all possible splits.
Choose the best split according to the pre-specified splitting criterion.
Searching for Splits
Determine splits for interval input variables
First sort the distinct values of the variable being considered. Denote the sorted values by {v1, v2, …, vn}, say.
Choose the midpoint of two adjacent values, (vi + vi+1) / 2, as a candidate split point.
The smaller value vi may also be chosen as the split point.
There are at most n − 1 candidate split points, and at most n branches.
Then proceed as for ordinal variables.
Searching for Splits
Binary split or multiway splits?
Multiway splits
It may fragment the training set too quickly, leaving insufficient data at the next level down.
The tree structure is more likely to be shallow, so fewer input variables may contribute to the rules unless a very large number of training samples is available.
It tends to favour variables with a higher number of distinct values. Adjustments can be applied for this kind of bias.
Binary splits
It allows only 2 branches at each node. The tree structure is often deeper than with multiway splits.
Since the same variable can be considered at more than one level, a binary tree may give results similar to a tree with multiway splits.
More effort may be needed to fully understand a binary tree than one with multiway splits.
Splitting Criteria
Entropy
The information function I(p) = log2(1/p) measures the amount of information needed to represent the occurrence of an event having probability p.
In information theory, it is common to compute I(p) with log base 2, but other bases, such as e or 10, can also be used.
log2(x) = loga(x) / loga(2), where a is any real number greater than 1, such as 10 or e.
For example, flip a fair coin once and observe the outcome, say Head. The probability of the occurrence of Head is 0.5. Thus it takes log2(1/0.5) = 1 bit of information (a 1 or a 0) to represent the event.
Splitting Criteria
Entropy
Now consider a target variable with k levels, and let pi be the probability of occurrence of level i in a given tree node.
Let I(pi) = log2(1/pi) be the information needed to represent the occurrence of level i in the node.
Over a long run of N occurrences, we expect to see Npi occurrences of level i, so the amount of information needed for level i is Npi log2(1/pi).
The total information I needed to represent all k levels in the node in the long run is
I = N p1 log2(1/p1) + N p2 log2(1/p2) + … + N pk log2(1/pk)
Splitting Criteria
Entropy
Then the average information H needed to represent one occurrence in the tree node is
H = I / N = p1 log2(1/p1) + p2 log2(1/p2) + … + pk log2(1/pk)
Note that, taking the limit, pi log2(1/pi) is defined to be 0 when pi = 0.
H is called the entropy. It measures the level of impurity: the higher the impurity, the larger the value of the entropy.
The maximum of H is log2 of the number of possible levels, attained when all levels are equally likely.
For a binary target, the maximum entropy is 1.
Splitting Criteria
Entropy
The true pi of a given tree node is unknown, but it can be estimated by the observed proportion: pi = (number of level-i records in the node) / (total number of records in the node).
E.g. a node contains a target variable with 7 Yes and 7 No objects, so
pYes = 7/14 and pNo = 7/14
Hence the entropy of this node is:
H = - ( (7/14) log2(7/14) + (7/14) log2(7/14) ) = 1
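The same computation in a short data step (a sketch, using the counts from the example above):

data _null_;
  p_yes = 7/14;
  p_no  = 7/14;
  h = -(p_yes*log2(p_yes) + p_no*log2(p_no)); /* node entropy */
  put h=;                                     /* prints h=1   */
run;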
Splitting Criteria
Entropy
Choose the split that minimizes the average entropy due to the split.
Suppose HD measures the entropy of a tree node D, and suppose D can be split into S child nodes {D1, D2, …, DS}.
The average entropy of the split is defined as
Hsplit = (n1/nD) H(D1) + (n2/nD) H(D2) + … + (nS/nD) H(DS)
where ni is the number of training records in child node Di, nD is the number of training records in D, and H(Di) is the entropy of child node Di.
(HD − average entropy of the split) is called the information gain, or the reduction of entropy.
Splitting Criteria
Entropy
E.g. Suppose one possible split of the parent node (7 Yes, 7 No) is by Age, such that Young contains 3 Yes and 2 No, Mid contains 2 Yes and 2 No, and Old contains 2 Yes and 3 No.
Entropy of each child node:
Young: −( (3/5)log2(3/5) + (2/5)log2(2/5) ) = 0.9710
Mid: −( (2/4)log2(2/4) + (2/4)log2(2/4) ) = 1
Old: −( (2/5)log2(2/5) + (3/5)log2(3/5) ) = 0.9710
Average entropy of all child nodes:
(5/14) 0.971 + (4/14) 1 + (5/14) 0.971 = 0.9793
The information gain of this split is therefore 1 − 0.9793 = 0.0207.
Splitting Criteria
Entropy
E.g. Suppose the parent node (7 Yes, 7 No) can be split by a competing variable Gender such that Male contains 5 Yes and 1 No, and Female contains 2 Yes and 6 No.
Entropy of each child node:
Male: −( (5/6)log2(5/6) + (1/6)log2(1/6) ) = 0.6500
Female: −( (2/8)log2(2/8) + (6/8)log2(6/8) ) = 0.8113
Average entropy: (6/14) 0.6500 + (8/14) 0.8113 = 0.7422
Variable Gender is selected because its split has the smaller average entropy (i.e. it returns a higher information gain).
Splitting Criteria
Gini index
Gini index of a tree node Di is defined as
G(Di) = 1 − Σj pj^2
where pj is the proportion of class j in the node.
The value of the index indicates the impurity of the node: the smaller the value, the purer the node. For example, a node with 7 Yes and 7 No has G = 1 − ( (0.5)^2 + (0.5)^2 ) = 0.5.
Choose the split that minimizes the average Gini index due to the split.
Splitting Criteria
Gini index
E.g. Suppose the parent node (7 Yes, 7 No) is split by the variable Gender as before (Male: 5 Yes, 1 No; Female: 2 Yes, 6 No).
Gini index of each child node:
Male: 1 − ( (5/6)^2 + (1/6)^2 ) = 0.2778
Female: 1 − ( (2/8)^2 + (6/8)^2 ) = 0.3750
Average Gini index due to the split: (6/14) 0.2778 + (8/14) 0.3750 = 0.3333
Splitting Criteria
Chi-square statistic
The Chi-square test statistic measures the difference between the observed cell counts and what would be expected if the branches and target classes were independent.
The Chi-square test statistic is defined as
χ² = Σi Σj (Oij − Eij)² / Eij
where Oij is the observed number of records in class i and branch j, and Eij is the corresponding expected count under independence.
Used mostly with discrete or qualitative variables. Continuous variables need to be discretized into a pre-determined number of bins first.
Splitting Criteria
Chi-square statistic
P-value = P(χ² >= the observed χ², given no association between branches and target classes).
Degrees of freedom: (L − 1)(S − 1), where L equals the number of target levels and S equals the number of branches.
A large P-value suggests no association between branches and target classes.
A small P-value indicates association between branches and target classes.
Choose the split with
the smallest P-value,
or the largest logworth = −log10(P-value).
Splitting Criteria
Chi-square statistic
E.g. Suppose the parent node (7 Yes, 7 No) is split by the variable Gender as before:
Observed: Male 5 Yes, 1 No; Female 2 Yes, 6 No.
Expected under independence: Male 3 Yes, 3 No; Female 4 Yes, 4 No.
The Chi-square test statistic is
χ² = (5−3)²/3 + (1−3)²/3 + (2−4)²/4 + (6−4)²/4 = 4.67
The P-value = 0.0306 at (2 − 1)(2 − 1) = 1 degree of freedom.
Splitting Criteria
Which is the best criterion?
All criteria are biased towards variables with many levels if multiway splits are allowed.
Chi-Squared test allows adjustment for the number of branches.
Gini Index and Entropy do not allow adjustment for the number of branches.
For multi-way splits, Chi-Squared test with adjustment is more appropriate.
For binary splits, try all three criteria and choose the best according to a pre-determined criterion.
Stopping Rules and Tree Pruning
Overfitting
A tree may be allowed to grow until all leaf nodes are pure.
A tree that adapts itself to the training data, generally, does not fit as well when applied to new data.
(Figure: accuracy versus number of nodes; accuracy on the training data keeps increasing while accuracy on new data peaks and then declines.)
Stopping Rules and Tree Pruning
Overfitting
Early stopping rules (Pre-pruning) to prevent overfitting
Some possible restrictions on partitioning:
Specify the minimum number of records in a leaf node.
Specify the minimum number of records required in a node being split.
For binary split, the minimum number of records required in a node being split >= 2 × the minimum number of records in a leaf node.
Specify the minimum reduction in impurity (Gini index or Entropy) or Chi-square P-value.
Specify the maximum depth of the tree.
Stopping Rules and Tree Pruning
Overfitting
Prevent overfitting by post-pruning
In a very large decision tree, some leaf nodes may contain very few records. The classification rule derived from this node is therefore unreliable.
Post-pruning is used to simplify a very large tree by discarding some weak branches.
Steps of post-pruning:
After the tree construction stops, regardless of why it stops, a set of best trees, one for each possible number of leaves, is created based on the training data set.
Each of these trees is then evaluated on the validation data set.
The smallest tree with the best assessment value (such as misclassification rate) will be chosen.
Missing Values Strategy
All splitting criteria can incorporate missing values in the determination of splits. Missing values may be assigned to whichever single branch yields the highest purity.
For a nominal variable, missing values can be replaced by a label. This label can be treated as one of the levels of the nominal variable.
A decision tree may also put all missing values into a separate branch during splitting.
Other Issues
Instability of trees
A small change in the data can result in a very different tree, but with nearly the same accuracy.
Training multiple decision trees using different samples of the training data may make the model more robust on new data.
See section Ensemble Models in later discussions.
Auxiliary Uses of Trees (Self-Study)
Selecting input variables for other classification models
Can be used to identify variables that do not have strong association with the target variable.
Collapsing levels of nominal inputs
Leaves of a tree fitted with only a nominal input represent subsets of the levels.
Missing values imputation
Missing values of one input are predicted as a function of the other inputs.
Requires no functional form of the imputation model.
SAS Decision Tree node
Example 5.3: Refer to the diagram ClusterSim2 in Project2.
Connect the Data Partition node to a Decision Tree node.
If the Role of Class is not already set to Rejected, set its Use value to No.
Set Nominal Target criterion to Entropy.
Set Missing Values to Use in search. Missing values are assigned to the branch that maximizes the worth of the split.
Set Use Input Once to No, so that the same variable can be used more than once.
Set Maximum Branch to 2 for binary splits.
Set Maximum Depth to 10, which suits most applications.
Set Leaf Size to 50.
Set the Split Size property to 100.
Split Size >= 2 * Leaf Size.
Set Method to Assessment.
Set Assessment Measure to Average Square Error.
SAS Decision Tree node (Self-Study)
Example 5.3
Average Square(d) Error (ASE) computation:
ASE = (1/(2n)) Σi [ (y1i − ^p1i)^2 + (y0i − ^p0i)^2 ]
where
n is the number of objects in the data set
y1i = 1 and y0i = 0 if the ith object actually belongs to the target level
y1i = 0 and y0i = 1 if the ith object actually belongs to the non-target level
^p1i and ^p0i correspond to the ith object’s target level score and non-target level score respectively.
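A data-step sketch of this computation (the data set scored, the actual-level column target, and the event-score column p_targetyes are hypothetical stand-ins for the names exported by a model node):

data ase_parts;
  set scored;
  y1 = (target = 'Yes');  /* 1 if the object is an event, else 0 */
  p1 = p_targetyes;       /* event score ^p1i; ^p0i = 1 - p1     */
  sq = (y1 - p1)**2 + ((1 - y1) - (1 - p1))**2;
run;

proc means data=ase_parts mean;
  var sq;  /* ASE = (mean of sq) / 2 */
run;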
SAS Decision Tree node
Example 5.3
Results
Select View | Assessment | Score Rankings Matrix : Class Target to view different lift charts.
The Output window shows the usual confusion matrix.
The Tree window displays the decision tree.
Click the white box under a node to collapse or expand the paths after the node.
SAS Decision Tree node
Example 5.3
Results
Node Rules
Select View | Model | Node Rules.
Click Save to save the rules as a SAS program file.
The saved file is in fact a text file.
SAS Decision Tree Node
Example 5.3
Results
Subtree Assessment Plot
Select View | Model | Subtree Assessment Plot for the results of post-pruning.
SAS Decision Tree Node
Example 5.3
Undersampling adjustment
Suppose ClusterSim2 is an undersampled data set. The true proportion (prior) of the target level (Yes) in the population is only 5%.
The Decision node adjusts the classification scores and assessment statistics according to the specified prior (or profit and cost) of the target level.
The same information can also be specified, without using a Decision node, under
the Decision property of an imported data source (under the Data Sources category of the project explorer), provided a class target has been defined.
All analyses based on this data can make use of the decision information.
Or, the Decision property of an Input Data node in a Diagram if a class target has been defined.
All analyses in the subsequent nodes of the same path can make use of the decision information.
All scores and assessment statistics will be adjusted at the time of building the model if requested.
SAS Decision Tree Node
Example 5.3
Undersampling adjustment
Connect the Decision Tree node to a Decision node (under Assess Section). Update the Decision node.
Set Apply Decision under Train property to Yes. Open Custom Editor and define the following under the Prior Probabilities tab:
Select the Yes radio button for entering new prior probabilities.
Enter 0.05 for the adjusted prior of the Yes level and 0.95 for the No level. The two values must sum to 1.
Click OK. Run the node.
SAS Decision Tree Node
Example 5.3
Undersampling adjustment
Results:
Assessment statistics are now computed based on the adjusted scores according to the specified prior.
Adjusted and unadjusted scores
Assessment measures using adjusted scores
SAS Decision Tree Node
Example 5.4: Refer to the diagram LoanDefault in Project2.
Connect the SASCode1 node to a Decision Tree node. Name this node Tree1. Set the following properties for Tree1:
Set Nominal Target Criterion to Entropy.
Set Maximum Depth to 10 (higher for larger data set).
Set Maximum Branch to 2 (for binary splitting).
Set Leaf Size to 30 (more for larger data set), and Split Size to 60 (so that a qualified node can be split into two child nodes).
Set Method to Assessment, and Assessment Measure to Average Square Error.
The Misclassification rate is computed based on a 0.5 cutoff, which is probably too high for this case.
SAS Decision Tree Node
Example 5.4
Results
View | Model | Subtree Assessment Plot
A smaller tree was selected because its Average Square Error on the validation data set was smaller than that of the larger trees.
SAS Decision Tree Node
Example 5.4
Results
View | Assessment | Score Distribution
The scores for the event level (Charged Off) are rather low, none higher than 0.4.
This is typical for a highly unbalanced data set.
At cutoff 0.5, no record is classified as Charged off.
If a much lower cutoff value is adopted, many records at ‘Paid off’ will be misclassified as ‘Charged Off’.
SAS Decision Tree Node
Example 5.4
Connect SASCode2 to a new Decision Tree node. Name the node as Tree2. Update Tree2. Set the following properties for Tree2:
Set Nominal Target Criterion to Entropy (or Gini, or ProbChisq).
Set Maximum Depth to 10.
Set Maximum Branch to 2.
Set Leaf Size to 30, and Split Size to 60.
Set Method to Assessment, and Assessment Measure to Average Square Error (or Misclassification).
Run Tree2.
Since the decision tree was grown to the allowed full size without pruning, the tree may overfit the training data.
SAS Decision Tree Node
Example 5.4
Results
At cutoff 0.5:
SAS Decision Tree Node
Example 5.4
Copy Tree2 node and paste it into the diagram. Name the pasted node as Tree3.
Connect SASCode2 node to Tree3. Set the following properties for Tree3:
Set Perform Cross Validation to Yes.
Set Number of Subsets to 10.
Run the node.
At cutoff 0.5:
SAS Decision Tree Node
Example 5.4
Connect Tree1, Tree2, and Tree3 to a Comparison node.
There is no common training and validation data set across these three models, so we will use the test data set for comparison. The test data set is a representative sample (not undersampled) of the full data set.
Set the following properties for the comparison node:
Set Selection Statistic to ROC.
Set Selection Table to Test.
Run the node.
Results:
Tree3 gives the best ROC on the test data set.
Neural Network Models
What is a neural network?
It grew out of the artificial intelligence community during the 1980s.
It is a computer program which implements sophisticated pattern detection and machine learning algorithms,
used to build classification / prediction models from large historical databases.
Especially useful for classification / prediction problems where
no mathematical formula is known that relates variables to target variable,
classification / prediction is more important than explanation,
there are lots of training data.
Neural Network Models
What does a neural network look like?
Loosely based on the way the human brain is organized and how it learns.
Node – corresponds to the neuron in the human brain.
Link – corresponds to the connections between neurons in the human brain.
Neural Network Models
Basic structure of a neural network
Multilayer Perceptron (MLP)
This network becomes “deep” by adding more hidden layers and hidden nodes.
Most deep learning methods use multilayer perceptron as a building block.
(Figure: a multilayer perceptron with an input layer (Input 1 … Input 8), one or more hidden layers, and an output layer (Output 1 … Output k).)
Neural Network Models
Basic structure of a neural network
Input layer
These nodes take values from the variables of an object.
One node per interval variable.
One node for each of the C − 1 (or C) dummies of a categorical variable with C classes.
Hidden layer
Further process the values from the variables of an object so that more complex relationship between the variables of the object and its target can be established.
Possible to have more than 1 hidden layer.
Each hidden layer contains at least one node.
Each node combines the values from the input nodes and their associated weights, linearly or nonlinearly, and then outputs the transformed combined value to the nodes in the next layer.
Neural Network Models
Basic structure of a neural network
Output layer
This layer provides the output node(s). These are the targets whose values the network is trying to learn or to classify.
Only one node for an interval target.
One node per class for a categorical target. (1,0) dummies are often used to represent the presence of each class for convenience.
For a binary target, it is common to use just one target node to represent the presence (1) or absence (0) of the target level.
Each output node combines the values from the nodes in the preceding hidden layer and their associated weights, linearly or nonlinearly, and then outputs the transformed combined value.
Links between all the nodes
Each link has a connection strength or weight, maintained by the node on the receiving end of the link.
Neural Network Model
Computing the score for an object
From input nodes to a hidden node
A node in the hidden layer may receive values from many variables.
It first multiplies each input by a weight and adds the products to a bias weight. This is called a linear combination (other types of combination are possible).
Example: Suppose k inputs x1, x2, …, xk with weights w1, w2, …, wk and bias weight w0 are connected to a hidden node. The hidden node receives the combined value
u = w0 + w1 x1 + w2 x2 + … + wk xk
Neural Network Model
Computing the score for an object
From input nodes to a hidden node
The output f(u) from a hidden node is often the hyperbolic tangent function (the activation function) of the combined value u:
f(u) = tanh(u) = (e^(2u) − 1) / (e^(2u) + 1)
In this case, the output range of f(u) is between −1 and 1.
Other activation functions (as long as they are non-linear) can also be used.
For example, the logistic (or sigmoid) function, with 0 < f(u) < 1:
f(u) = 1 / (1 + e^(−u))
Neural Network Model
Computing the score for an object
From hidden nodes to the output node
Works in the same way as from the input nodes to a hidden node.
It is common to use logistic activation function for a binary target node as its value is automatically bounded in between 0 and 1.
It is possible to use two output nodes for binary classification, but there is no advantage in doing so.
u′ = w′0 + w′1 f(u1) + w′2 f(u2) + … + w′m f(um)
z(u′) = 1 / (1 + e^(−u′))
Neural Network Model
Computing the score for an object
Example: Compute the output Y for the following NN. For illustration purposes, all bias weights equal 0.
At Node 1
Input = 0 + 1(0.2) + 0.5(0.5) = 0.45
Output = (e^(2×0.45) − 1) / (e^(2×0.45) + 1) = 0.4219
At Node 2
Input = 0 + 1(−0.6) + 0.5(−1) = −1.1
Output = (e^(2×(−1.1)) − 1) / (e^(2×(−1.1)) + 1) = −0.8005
At Node 3
Input = 0 + 0.4219(1) + (−0.8005)(−0.5) = 0.82215
Output = 1 / (1 + e^(−0.82215)) = 0.69469
Since the score at Node 3 is higher than 0.5, if a cutoff of 0.5 is adopted, the object is classified as class Yes (the target level).
(Network diagram: inputs X1 = 1 and X2 = 0.5 feed hidden nodes 1 and 2, which feed output node 3 producing Y. Weights: X1→Node1 = 0.2, X2→Node1 = 0.5, X1→Node2 = −0.6, X2→Node2 = −1.0, Node1→Node3 = 1.0, Node2→Node3 = −0.5.)
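The same forward pass in a data step (a sketch reproducing the worked example, with the weights from the figure):

data _null_;
  x1 = 1;  x2 = 0.5;
  h1 = tanh( 0.2*x1 + 0.5*x2);  /* Node 1: tanh activation     */
  h2 = tanh(-0.6*x1 - 1.0*x2);  /* Node 2: tanh activation     */
  u  = 1.0*h1 - 0.5*h2;         /* Node 3: linear combination  */
  y  = 1 / (1 + exp(-u));       /* Node 3: logistic activation */
  put h1= h2= y=;               /* y is about 0.6947           */
run;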
Building A Neural Network Model
Data size
A larger data set allows more hidden nodes to be used in a NN model, thus capturing more complex patterns.
A network with K input nodes, a single hidden layer with m hidden nodes, and one output node has (K+1)m + (m+1) weights. We therefore need around 5[(K+1)m + (m+1)] training objects. For example, K = 10 inputs and m = 5 hidden nodes give 61 weights, suggesting roughly 300 training objects.
Data preparation
Missing values
Missing values need to be filled in before the analysis.
Variable selection
The more variables there are, the more hidden nodes are required.
Avoid correlated variables.
Selecting variables carefully will speed up the training process.
There is no built-in variable selection procedure in the NN algorithm; all provided variables will be used.
Building A Neural Network Model
Data preparation
Standardize (rescale, normalize) interval input variables or not?
Not necessary, in theory.
Any rescaling of an input can be effectively undone by changing the corresponding weights and biases.
Practically, standardizing input variables
may speed up the training process, and
may reduce the risk of obtaining a false optimal solution.
Common ways to standardize inputs in NN field include:
Range standardization
Central tendency standardization.
Building A Neural Network Model
Data preparation
Categorical input variables
For binary input variables, encode each of the two classes by 1 and -1 respectively.
(1,0) encoding can also be used, but some research indicates (1,−1) works better for neural network models.
For a categorical input with C levels, use the deviation coding method with C − 1 dummies. For example, a variable with red, blue, and yellow colours can be represented by two dummies: (1,0) for red, (0,1) for blue, and (−1,−1) for yellow (the reference level).
Conventional (1,0) encoding can also be used.
C dummies can also be used, but there is no advantage in doing so.
Building A Neural Network Model
Data preparation
Categorical input variables
For an ordinal variable with C levels, use bathtub coding with C − 1 dummies: for the ith category, the jth dummy equals 0.75 when i > j, and −0.75 otherwise.
For example, a variable with values low, medium, and high in that order, can be represented by two dummies, D1 and D2 as:
Other encodings, such as integer encoding or thermometer encoding, are also possible.
         D1      D2
Low     -0.75   -0.75
Medium   0.75   -0.75
High     0.75    0.75
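A data-step sketch of this coding (the table train and ordinal variable level with values 'Low', 'Medium', 'High' are hypothetical):

data coded;
  set train;
  /* bathtub coding: dummy j switches to 0.75 once the category exceeds level j */
  if      level = 'Low'    then do; d1 = -0.75; d2 = -0.75; end;
  else if level = 'Medium' then do; d1 =  0.75; d2 = -0.75; end;
  else                          do; d1 =  0.75; d2 =  0.75; end;
run;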
Building A Neural Network Model
Data preparation
Variable transformation or higher order terms
A neural network model can capture complex relationships between the target variable and the input variables if a sufficient number of hidden nodes and a large enough number of observations are available.
Additional information (interaction terms or higher-order terms, if known) can help speed up the weight estimation process and allow the model to focus on less obvious structure.
Building A Neural Network Model
Which activation function?
Any nonlinear function can be used, but a differentiable and bounded function is preferred.
Common choices for hidden nodes are logistic, tanh, and Standard Normal.
For target nodes, choose an activation function that suits the value range of the target.
Binary target: logistic function.
Building A Neural Network Model
How many hidden layers?
In theory, just one hidden layer is enough to approximate virtually any continuous function to any degree of accuracy.
Provided sufficient data, sufficient hidden nodes, and sufficient training time are available.
Two hidden layers can often yield an accurate approximation with fewer hidden nodes than one hidden layer.
Building A Neural Network Model
Number of hidden nodes
Optimal number of hidden nodes depends on the number of training cases, the amount of noise, and the complexity of the pattern contained in the data set.
In general, it lies between half of the number of input nodes and twice the number of input nodes. Provided sufficient number of hidden nodes are included in the model, the exact number of hidden nodes is unimportant.
If the number of hidden nodes is insufficient, the training error will normally stop decreasing at a high level during the iterations.
Building A Neural Network Model
Weights estimation
The output of a neural network is a highly non-linear function of the weights. The exact association depends on the type of combination function and activation function adopted for each node.
Special learning algorithms were proposed for estimating these weights by iteratively minimizing an error function. The weights are adjusted by a predetermined amount according to the gradient of the error surface.
Modern non-linear optimization algorithms are easier to use. These include Levenberg-Marquardt (a few hundred parameters), conjugate gradient (many thousands of parameters), and quasi-Newton.
Good starting values for the weights will accelerate convergence, decrease computational time, and improve classification performance.
Building A Neural Network Model
Overfitting
If a sufficient number of hidden nodes is allowed, a NN can easily model the noise in addition to the underlying function we are looking for.
Early stopping approach
Divide the available data into training and validation sets.
Use a large number of hidden nodes to avoid bad local optima.
Compute the validation error rate periodically during training.
Train the network to convergence, then go back to see which iteration gives the smallest validation error.
(Figure: overall error versus number of iterations; the training-sample error keeps decreasing while the validation-sample error reaches a minimum and then rises.)
SAS Neural Network Node
Standardizes all interval inputs by default.
Allows different model selection criteria.
Can choose from many combination functions, activation functions, and error functions.
Various optimization algorithms are available.
Uses preliminary training to select better initial weights if requested.
Allows early stopping with the validation data set.
Supports only one hidden layer, with up to 64 hidden nodes.
Many different combinations of options can be specified under the Network and Optimization properties.
SAS Neural Network Node
Example 5.5
Refer to the diagram ClusterSim2 in Project2.
Connect the Data Partition node to a Neural Network node.
Name this Neural Network node NN1.
In Network property window:
Set the number of hidden units to 1 (allowed values: 1 – 64).
Set Target Layer Combination to Linear.
Set Target Layer Activation Function to Logistic.
Set Target Layer Error Function to Bernoulli.
In Optimization property window:
Set Training Technique to Levenberg-Marquardt.
Set the Maximum iterations to 50.
In Preliminary Training section, set Enable to No.
Set Model Selection Criterion to Average Error (the average of the adopted error function) or to Misclassification.
SAS Neural Network Node
Example 5.5
Results:
The optimization algorithm stopped at around 30 iterations.
The estimated weights at the end of the 18th iteration are adopted.
If the algorithm selects the last iteration as the result, rerun the node with a higher maximum number of iterations or more hidden nodes.
See also the usual results of the Score Rankings Matrix.
In conclusion, the model fits the training data quite well even with only 1 hidden node.
SAS Neural Network Node
Example 5.6
Refer to the Donor_Raw diagram of Project 2.
Connect the first Variable Selection node to an Impute node. The variable Donor_Age contains missing values, which are to be imputed by the Tree method. Set the following properties of the Impute node:
Set Use and Method of Donor_Age to Yes and Tree respectively.
Set the Missing Cutoff of Train to 99.9%, so that variables with fewer than 99.9% missing values can be imputed in this example.
Set the Default Impute Method of Class Variables and Interval Variables to None for both.
Run the node.
Connect the Impute node to a Variable Selection node for selecting input variables. Set the following properties of the Variable Selection node:
Set Target Model to Chi-Square.
Run the node.
SAS Neural Network Node
Example 5.6
Connect the Variable Selection node to a Neural Network node. Name this node as NN1.
In Network property window:
Set the number of hidden units to 10.
Set Target Layer Combination to Linear.
Set Target Layer Activation Function to Logistic.
Set Target Layer Error Function to Bernoulli.
In Optimization property window:
Set Training Technique to Quasi-Newton.
Set the Maximum Iterations to 50; this count includes the preliminary training runs.
Set Model Selection Criterion to Average Error (Average of adopted error function).
Run the node.
SAS Neural Network Node
Example 5.6
Connect the Variable Selection node to a HP Neural node. Name this node as NN2. Set the following properties for NN2:
Set Architecture to Two Layer, i.e. two hidden layers.
Set Number of Hidden Neurons to 30, i.e. 15 hidden nodes in each hidden layer.
Set Number of Tries to 5, i.e. repeat the process 5 times with different initial weights for each run. The run with the smallest validation error will be selected.
SAS Neural Network Node
Example 5.6
Suppose the following profit & loss information is available: If a mail is sent to an individual and the individual does not respond, the cost is $2.0. If the individual does respond, then based on the previous experience, a donation of $12 is expected on average.
To classify an individual as a donor based on the expected donation > 0, connect the NN2 node to a Decisions node. Set the following properties for the Decisions node:
Set Apply Decision property to Yes.
In the Custom Editor property:
Under Decision Tab, select the Yes option for using the decision.
Decision 1 is the decision to mail a solicitation. Decision 2 is the decision not to mail a solicitation.
Under the Decision Weights Tab:
Enter 10 (= the $12 expected donation − the $2 mailing cost) as the Decision 1 weight for Level 1, −2 (the mailing cost) as the Decision 1 weight for Level 0, and 0 as the Decision 2 weight for both levels.
SAS Neural Network Node
Example 5.6
Custom Editor:
Run the node.
SAS Neural Network Node
Example 5.6
Sort the exported training data of the Decisions node by Expected profit:
Expected profit: The larger of the expected profit from Decision 1 and the expected profit from Decision 2.
Best Profit: The amount of generated average profit if the object’s true level is known.
Computed Profit: The amount of generated average profit due to the decision.
SAS Neural Network Node
Example 5.6
If all individuals with expected profit > 0 are classified as the Donor level:
This is equivalent to setting the cutoff value of the score to 1/6, since the expected profit of mailing is 10p − 2(1 − p) = 12p − 2, which is positive exactly when the score p > 1/6.
For these selected 9681 classified donors in the training data set:
Total expected net donations = $11,987.6144
Total best net donations = $26,280
Total computed net donations = $12,174
If all individuals in the training data set are classified as Donor level:
For these 11622 classified donors in the training data set:
Total best net donations = $29,060
Total computed net donations = $29,060 – $17,432 = $11,628
SAS Neural Network Node
Example 5.6
Scoring new records
Add the to-be-scored data set Donor_score.sas7bdat to the project.
No need to adjust the assigned role or measure level of the data set, but it must contain the same set of input variables as in Donor_raw_data.
At the Step 7 of Data Source Wizard, set the Role to Score.
Drag Donor_score from Data Source into the diagram and connect it to a Score node.
Connect the NN2 node (or Model Comparison node, or any model node) to the same Score node.
If more than 1 model nodes are connected to Model Comparison node, the selected model in the Model Comparison node will be used to score the new observations.
Set the Use Fixed Output Names property to Yes.
The score column for the event (Donation=1) will always be named Em_Eventprobability regardless of the original name.
Any data modifications applied on the path leading to the selected model node will be applied to the to-be-scored data set automatically.
Records in the scored new data set can be classified by using the Cutoff node as usual.
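If the scored data set is exported (see the next slide), the cutoff classification can also be reproduced outside SAS. A minimal sketch, assuming a hypothetical CSV export and the fixed score-column name from above:

import pandas as pd

scored = pd.read_csv("donor_score_out.csv")    # hypothetical CSV export of the Score node output
# Classify as donor whenever the event probability exceeds the profit-based cutoff of 1/6.
scored["classified_donor"] = (scored["Em_Eventprobability"] > 1/6).astype(int)
print(scored["classified_donor"].value_counts())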
125
SAS Decision node (Self-Study)
Example 5.6
Exporting scored new records
Any scored data set (or any data set exported by a node) can be exported using the Save Data node (in the Utility tab).
Connect the Score node to a Save Data node.
Set File Format of Output Format property to SAS (.sas7bdat).
Other formats, such as .txt or .xlsx, can also be selected.
If not all data sets are to be exported, set All Roles of Output Data property to No and specify Select Roles accordingly.
126
Ensemble Models
Since different models often capture different features of the data space, it can be shown that under certain conditions the combined classification power of an ensemble is better than that of the individual models.
Ensembles of models have emerged as one of the most important techniques for many practical classification problems.
127
Ensemble Models
128
An ensemble model has a set of base models that accept the same inputs and predict the outcome individually. The outputs of the base models are then combined, usually by voting, to form the ensemble output.
Ensemble Models
Ensemble models give better results than individual models under two conditions:
The base models should be independent of one another.
Each base model's misclassification rate should be less than 50% for binary classifiers on a balanced data set.
If a model's misclassification rate is more than 50%, it is worse than a random guess, and hence not a good model to begin with. The worked example below shows the gain these two conditions buy.
Achieving the first criterion, independence amongst the base models, is difficult since only one training data set is available. A few techniques are available to make the base models as diverse as possible.
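To see why the two conditions matter, take a majority vote over three independent base classifiers, each with a misclassification rate of 0.4. The vote errs only when at least two of the three err:

$$P(\text{ensemble error}) = \binom{3}{2}(0.4)^2(0.6) + (0.4)^3 = 0.288 + 0.064 = 0.352 < 0.4.$$

With dependent base models, or with individual error rates above 0.5, this improvement shrinks or reverses.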
129
Ensemble Models
To achieve diversity in the base models, we can alter the conditions under which each base model is built. The most commonly altered conditions are:
Different model algorithms
Different parameters within the models
Changing the training data set: A sufficiently large training data set can be divided into multiple data sets, and each set can be used to build one base model. Alternatively, we can sample the training data with replacement and repeat the process for the other base models.
Changing the input variable set: We can sample the input variables for each base model. This technique works when the training data have a large number of input variables.
130
Ensemble Models
Ensemble by voting
Build multiple base models using the same data set.
For each record, the predicted classes are tallied across all the base classifiers, and the class with the highest number of votes is the ensemble's predicted class.
The final score of a record is the average of the scores from the base models that predicted the winning class (see the sketch below).
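The tally-and-average rule is straightforward to state in code. A minimal sketch, assuming each base model supplies a (predicted class, event score) pair for the record; the function and variable names are illustrative, not SAS output names:

from collections import Counter

def ensemble_by_voting(predictions):
    # predictions: one (predicted_class, event_score) pair per base model.
    votes = Counter(cls for cls, _ in predictions)
    winner = votes.most_common(1)[0][0]                  # class with the most votes
    scores = [s for cls, s in predictions if cls == winner]
    return winner, sum(scores) / len(scores)             # average score of the agreeing models

print(ensemble_by_voting([(1, 0.70), (1, 0.60), (0, 0.40)]))   # -> (1, 0.65)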
131
[Diagram: the training data set feeds K base models (Model 1, Model 2, …, Model K), whose individual predictions are combined by voting into the ensemble output.]
Ensemble Models
Ensemble by bagging (or bootstrap aggregating)
Bagging is a technique where base models are developed by changing the training data set for every base model.
From a given training set T of n records, m training sets T1, T2, …, Tm are developed, each with the same record count n as T, by sampling with replacement.
When n is sufficiently large, each sample contains approximately 63% of the unique records on average, because the probability that a given record is never drawn in n draws is (1 − 1/n)^n ≈ e^−1 ≈ 0.37.
Each training set therefore contains some duplicate records. This resampling is called bootstrapping. Each training set is used to build one base model, and the predictions of the models are aggregated by voting to form the ensemble. This combination of bootstrapping and aggregating is called bagging (see the sketch below).
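A minimal bagging sketch, reusing the voting idea above; fit(data) is a hypothetical stand-in for any base-learner training routine that returns a model with a predict(record) method:

import random

def bagging(train, m, fit):
    # Draw m bootstrap samples (same size as the original, with replacement)
    # and fit one base model per sample.
    return [fit(random.choices(train, k=len(train))) for _ in range(m)]

def predict_by_vote(models, record):
    votes = [model.predict(record) for model in models]
    return max(set(votes), key=votes.count)              # majority class wins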
132
Ensemble Models
Ensemble by boosting
Boosting trains the base models in sequence, one by one, and assigns a weight to every training record.
The boosting process concentrates on the training records that are hard to classify and overrepresents them in the training data set of the next iteration.
To start, all training records have equal weight. The weights define the sampling distribution for selection with replacement: a training sample is drawn according to the weights and used for model building. Incorrectly classified records are then assigned higher weights and correctly classified records lower weights, so that hard-to-classify records have a higher propensity of selection in the next round.
133
Ensemble Models
Ensemble by boosting
The training sample for the next round will therefore be filled mostly with records that were misclassified in the previous round, so the next model focuses on the hard-to-classify part of the data space, and so forth.
All base learners are combined through a simple voting aggregation (see the sketch below).
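A minimal sketch of this resampling loop, with the same hypothetical fit/predict interfaces as in the bagging sketch; training records are (x, y) pairs, and the up/down-weighting factor of 2 is an arbitrary illustration rather than a specific algorithm such as AdaBoost:

import random

def boosting(train, rounds, fit):
    weights = [1.0] * len(train)                         # start with equal weights
    models = []
    for _ in range(rounds):
        sample = random.choices(train, weights=weights, k=len(train))
        model = fit(sample)
        models.append(model)
        for i, (x, y) in enumerate(train):
            # Up-weight misclassified records, down-weight correct ones.
            weights[i] *= 2.0 if model.predict(x) != y else 0.5
    return models                                        # combine with predict_by_vote as before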
134
[Diagram: boosting pipeline — draw a sample from the training set, fit a model, adjust the record weights, draw the next sample, fit the next model, and so on; all models feed into the ensemble.]
Ensemble Models
Ensemble by random forest
Similar to bagging, random forest trains many decision trees on different samples and then combines their predictions by voting. However, instead of considering all input variables when splitting a node, each tree considers only a random subset of the input variables at each split (see the sketch below).
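Relative to bagging, the only change is inside the tree's split search. A minimal sketch of that one step, where best_split is a hypothetical stand-in for the usual exhaustive split search passed in by the caller:

import random

def randomized_split_search(sample, variables, n_subset, best_split):
    # Consider only a random subset of the input variables at this node.
    candidates = random.sample(variables, min(n_subset, len(variables)))
    return best_split(sample, candidates)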
135
[Diagram: the training set is bootstrap-sampled; each sample trains one randomized tree, and the trees' predictions are combined into the random forest ensemble.]
Ensemble Models
Example 5.7. Refer to the diagram Donor in Project 2.
Connect the model nodes Regression1, Regression2, NN1, and NN2 to an Ensemble node. Set the following properties for the Ensemble node:
Set Posterior Probabilities to Voting (A cutoff value of 0.5 is adopted).
Set Voting Posterior Probabilities to Average.
Run the node.
If voting is not used, the predicted scores from all models can be averaged instead:
Set Posterior Probabilities to Average.
136
Ensemble Models
Example 5.8. Refer to the diagram LoanDefault in Project 2.
Connect the SAS Code node to a Start Groups node (in the Utility section). Set the following properties for the Start Groups node:
Set Mode to Bagging (or Boosting).
Set Index Count to 10 (or more) so that 10 samples are selected.
For Bagging, set Type to Percentage and set Percentage to 30 so that the sample size is not too small for this data set.
Research indicates that a low sampling rate improves the performance of bagging. If the data set is large enough, set Percentage to 10 or 20.
Connect the Start Groups node to a Decision Tree node with the usual settings for the Decision Tree node.
Set the Method to Largest so that no post-pruning is performed.
Connect the Decision Tree node to an End Groups node.
Run the End Groups node.
137
Ensemble Models
Example 5.9. Refer to the diagram Donor in Project 2.
Connect the Impute node to a HP Forest node (in HPDM section). Set the following properties for the HP Forest node:
Set Maximum Number of Trees to 10.
Set this to a smaller number for exploration purposes.
Set Number of Variables to Consider in Split Search to 20.
Setting this number too small may exclude variables with predictive power; try different values and check the performance of the node.
Set Smallest Number of Obs in Node to 30 (or higher for a larger data set).
Set Split Size to 60.
(Sampling properties: Type of Sample: Proportion; Proportion of Obs in Each Sample: 0.5.)
138
Rule Induction Node (Self-Study)
Rule Induction node can be used to classify rare events.
This node contains the following three steps:
It first builds a decision tree model to find the largest node that meets the purity threshold for the most common event. The records in that node are then removed from the data. This step is repeated until no node meets the purity threshold (see the sketch after this list).
It then builds a binary classification model (decision tree, logistic regression, or neural network) for the most common event level and removes the correctly classified records of this level from the data. It then builds another classification model for the next most common event level and removes the correctly classified records of that level, and so forth for each remaining event level.
This model building step can also be started from the rarest event.
Ideally, the records remaining in the data set after this step are close to balanced across the target levels.
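A minimal sketch of the Step 1 "ripping" loop described above; fit_tree(data) is a hypothetical stand-in returning a tree whose leaves expose their records and purity (the proportion of the most common event), so this illustrates the described procedure rather than the node's actual implementation:

def rip_pure_nodes(data, threshold, fit_tree):
    removed = []
    while True:
        tree = fit_tree(data)
        pure = [leaf for leaf in tree.leaves if leaf.purity >= threshold]
        if not pure:
            return removed, data                         # no leaf meets the threshold: stop
        biggest = max(pure, key=lambda leaf: len(leaf.records))
        removed.extend(biggest.records)                  # these records are scored by this tree
        data = [r for r in data if r not in biggest.records]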
139
Rule Induction Node (Self-Study)
Rule Induction node can be used to classify rare events.
This node contains the following steps:
If a sufficient number of records remain for at least two event levels after Step 2, a final clean-up model (decision tree or neural network) is built for the remaining records. The user has no control over the details of this model.
The models at Steps 1 and 2 determine the scores of the records they removed. The clean-up model at Step 3, if applicable, determines the scores of the remaining records. If there is insufficient data to build the clean-up model, the event proportions in the remaining data set are used as the scores for the remaining records.
This node has no variable selection mechanism. An adopted neural network or regression model will drop records with missing values.
140
Rule Induction Node (Self-Study)
Example 5.10. Refer to the LoanDefault diagram in project Classification.
Connect the Impute node to a Rule Induction node.
For Step 1, set Purity Threshold to 93% so that the largest node meeting the purity threshold is removed from the data iteratively.
If no leaf was ripped from the model at all (this can be checked from the Output in the Results of the node), reduce the threshold and rerun the model.
For Step 2, set the Binary Models property to Tree (or Regression, or Neural Network). Set Binary Order to Descending so that the binary model training starts with the most common event.
For Step 3, set Cleanup Model to Neural (or Tree, or Regression).
141
Classification counts (rows: actual level; columns: classified level):

Actual \ Classified    Yes     No     Total
Yes                   1874    125     1999
No                      82   1917     1999
Total                 1956   2042     3998
Generic confusion matrix (rows: actual level; columns: classified level):

Actual \ Classified           Positive (Target Level)        Negative (Non-Target Level)    Total
Positive (Target Level)       No. of true positives (TP)     No. of false negatives (FN)    AP
Negative (Non-Target Level)   No. of false positives (FP)    No. of true negatives (TN)     AN
Total                         CP                             CN                             N
$$I = N\sum_{i=1}^{k} p_i \log_2\!\left(\frac{1}{p_i}\right)$$
$$H = \frac{1}{N}\sum_{i=1}^{k} N p_i \log_2\!\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{k} p_i \log_2\!\left(p_i\right)$$
$$p_i = \frac{\text{number of objects belonging to level } i \text{ in the node}}{\text{total number of objects in the node}}$$
$$\text{Average Entropy} = \sum_{i=1}^{S} P(D_i)\,H(D_i)$$
$$P(D_i) = \frac{\text{number of objects in child node } D_i}{\text{number of objects in the node } D}$$
Target by Gender (observed counts):

Target    M    F
Yes       5    2
No        1    6
$$\text{M: } -\left(\tfrac{5}{6}\log_2\tfrac{5}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6}\right) = 0.6500$$
$$\text{F: } -\left(\tfrac{2}{8}\log_2\tfrac{2}{8} + \tfrac{6}{8}\log_2\tfrac{6}{8}\right) = 0.8113$$
$$\text{Average Entropy} = \tfrac{6}{14}(0.6500) + \tfrac{8}{14}(0.8113) = 0.7422$$
$$\text{Average } I_{\text{Gini}} = \sum_{i=1}^{S} P(D_i)\,I_{\text{Gini}}(D_i)$$
$$\text{M: } 1 - \left(\tfrac{5}{6}\right)^2 - \left(\tfrac{1}{6}\right)^2 = \tfrac{10}{36} \qquad \text{F: } 1 - \left(\tfrac{2}{8}\right)^2 - \left(\tfrac{6}{8}\right)^2 = \tfrac{24}{64}$$
$$\text{Average } I_{\text{Gini}} = \tfrac{6}{14}\cdot\tfrac{10}{36} + \tfrac{8}{14}\cdot\tfrac{24}{64} = 0.3333$$
Target by Gender (expected counts under independence):

Target    M                  F
Yes       (6)(7)/14 = 3      (8)(7)/14 = 4
No        (6)(7)/14 = 3      (8)(7)/14 = 4
$$u = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k$$