
ISOM3360 Data Mining for Business Analytics, Session 6&7
Model Evaluation
Measures & Cost-sensitive Classification
Instructor: Department of ISOM Spring 2022


Example: A Regression Tree
Problem: predict the price of used Toyota Corolla cars
Available attributes: Age (in months), Weight (in kilograms), HP
(Horsepower), KM (Accumulated kilometers), …
- There are 1005 training examples in total (the root node), mean value 10843. For a car in the test set, if we don't have any information on the car, the prediction is 10843.
- There are 137 training examples whose Age <= 32.5, mean value 18560. For a car in the test set, if we only know that its Age <= 32.5, the prediction is 18560.
- There are 868 training examples whose Age > 32.5, mean value 9625. For a car in the test set, if we only know that its Age > 32.5, the prediction is 9625.
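Below is a minimal sketch of fitting such a regression tree with scikit-learn. The file name and column names (Age, Weight, HP, KM, Price) are illustrative assumptions, not necessarily the exact course dataset.

# Sketch: regression tree for used Toyota Corolla prices (file/column names assumed)
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("ToyotaCorolla.csv")          # hypothetical file
X = df[["Age", "Weight", "HP", "KM"]]
y = df["Price"]

# A shallow tree; the root split (e.g., Age <= 32.5) already yields usable predictions
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)

# Each leaf predicts the mean price of the training examples that fall into it
print(tree.predict(X.head()))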

Post-Pruning
Reduced error pruning
- Split the training data into a training set and a validation set
  - The validation data is just for parameter tuning (pruning decisions), NOT for assessing final model performance
- Grow a full tree on the training set
- Trim the nodes of the decision tree in a bottom-up fashion
  - Eliminate a subtree if the resulting pruned tree performs no worse than the original tree over the validation set.
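Scikit-learn does not implement reduced error pruning directly; the sketch below approximates the same idea with its cost-complexity pruning path, using a held-out validation set to decide how much to prune (the dataset and split sizes are placeholders).

# Sketch: validation-set-guided post-pruning via cost-complexity pruning
# (an approximation of the reduced-error-pruning idea, not the exact algorithm)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Grow a full tree, then enumerate candidate pruning levels (ccp_alpha values)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:        # keep pruning as long as validation accuracy does not drop
        best_alpha, best_score = alpha, score

print("chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)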

Reduced Error Pruning
(Figure shown on slide: the data split into training data and validation data.)

Three-Way Data Splits (Optional)
If model selection and model evaluation (e.g., true error estimates) are to be computed simultaneously, the data should be divided into three disjoint sets [Ripley, 1996]
- Training set: a set of examples used for learning (to fit/learn the parameters of the classifier)
- Validation set: a set of examples used to select among several trained models
- Test set: a set of examples used only to assess the performance of a fully-trained-and-selected model
What if we also use validation set to assess the performance of the selected model?
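A minimal sketch of such a three-way split with scikit-learn (the 60/20/20 proportions and the candidate tree depths are illustrative choices).

# Sketch: three-way split -- train to fit, validation to select, test to assess
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 60% training, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

# Select among several trained models using the validation set only
candidates = [DecisionTreeClassifier(max_depth=d, random_state=1).fit(X_train, y_train)
              for d in (2, 4, 8)]
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# The untouched test set gives the final performance estimate
print("test accuracy of the selected model:", best.score(X_test, y_test))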

Three-Way Data Splits (Optional)

Pros and Cons of Decision Tree Learning
Pros:
- Simple to understand and interpret
- Requires little data preparation (does not need data normalization for numerical variables, one-hot encoding for categorical variables, etc.)
Cons:
- Can overfit, thus needs pruning steps
- Tree structure is not very stable*
*A learning algorithm is said to be unstable if it is sensitive to small changes in the training data.

A Few Questions to Ask Yourself
How do we choose which attribute to split over first?
What is overfitting? (A model that performs very well on the training set but badly on the testing set.)
When to stop splitting? Pre-pruning or post-pruning?

Evaluation

Evaluation
We are interested in generalization – the performance on data not used for training
- Holdout validation
- k-fold cross validation

Holdout Validation
Split data into training set and test set (e.g., 70% training
set, 30% test set)
Use training set to train the model
Use the test set to estimate the error rate of the model
Potential weaknesses
1) What if by accident you selected a particularly easy/hard test set?
2) Do you have an idea of the variation in model accuracy due to the split?
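A minimal holdout sketch under the 70/30 split mentioned above (the dataset is just a stand-in).

# Sketch: holdout validation -- 70% training, 30% test
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

model = DecisionTreeClassifier(random_state=7).fit(X_train, y_train)

# Estimated error rate on the held-out test set
print("holdout error rate:", 1 - model.score(X_test, y_test))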

Drawbacks of the Simple Holdout Method
The holdout estimate of the error can be highly variable across different splits.
- Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "abnormal" split (e.g., a very difficult test set).
- This is particularly true when we have a small dataset.
Solution: k-fold cross validation!

k-Fold Cross-Validation
Partition the data into k folds (randomly)
Run experiments k times; for each of the k experiments, use k-1 folds for training and a different fold for testing
(Figure shown on slide: illustration of k-fold cross validation with the data split into folds.)
— Each fold is the test set once (so we test on all the data)
— Can compute the average and variance of accuracy measures
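A minimal k-fold cross-validation sketch with scikit-learn's cross_val_score (k = 10 here; the dataset is a stand-in).

# Sketch: 10-fold cross validation -- average and variance of accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10, scoring="accuracy")

print("accuracy per fold:", scores)
print("mean:", scores.mean(), "variance:", scores.var())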

Leave-One-Out Cross-validation
Leave-one-out is a special case of k-fold cross validation, where k is chosen as the total number of examples.
- For a dataset with N examples, perform N experiments
- For each experiment, use N-1 examples for training and the remaining example for testing

How Many Folds Are Needed?
In practice, the choice of the number of folds depends on the size of the dataset
- For large datasets, even 3-fold cross validation will be quite accurate
- For very sparse/small datasets (e.g., fewer than 100 examples), we may have to use more folds in order to train on as many examples as possible (e.g., k=N, leave-one-out cross validation)
A common choice for k-fold cross validation is k=10

k-Fold Cross-Validation
Under k-fold cross-validation, we will get k decision
trees. Which one should we use for deployment?
The purpose of cross-validation is not to come up with the final model. The only use of cross-validation is model evaluation, i.e., for us to get an understanding of how well the model can perform on unseen examples.
For deployment, we can use all the examples available to get the best model possible (more data -> better model performance).
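A short sketch of this workflow (same stand-in dataset as above): cross-validation only estimates performance; the deployed model is then refit on all available examples.

# Sketch: cross-validation for evaluation, then refit on ALL data for deployment
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: estimate how well this kind of model performs on unseen examples
print("estimated accuracy:", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())

# Step 2: the model actually deployed is trained on every example available
final_model = DecisionTreeClassifier(random_state=0).fit(X, y)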

A Learner (or Inducer, or Algorithm)
- A method or algorithm used to generalize a model from a set of examples.
Learner: induces a pattern from examples.
In practice, people use model and learner interchangeably. But they are different.
Example of a model (an induced pattern):
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’

Model/Learner/Inducer/Algorithm Evaluation
In practice, people use model and learner interchangeably. But they are different.
By saying “model evaluation”, we essentially want to evaluate the performance of the algorithm at prediction, rather than the performance of any specific model.

Evaluating Classification Performance
Naïve rule: majority-class classifier (benchmark model)
- Classifying everyone as belonging to the majority class.
- Used as a baseline or benchmark for evaluating the performance of more complicated classifiers.
Consider direct mail marketing. Suppose the offer is only accepted by 1% of the households. A majority-class classifier simply classifies every household as a non- responder.

Confusion (Classification) Matrix
Confusion matrix: a table that is often used to describe the performance of a classification model. Entries are counts of correct and incorrect predictions.
                 Predicted (+)    Predicted (-)
Actual (+)       True + (TP)      False - (FN)
Actual (-)       False + (FP)     True - (TN)

1. How many records are labeled with + in the test set? How many records are labeled with -?
2. How many records are in the test set?
3. How many records are correctly predicted? How many records are incorrectly predicted?

Entries are counts of correct classifications and counts of errors.
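A minimal sketch of building a confusion matrix with scikit-learn (the label vectors are made-up illustrations, not the slide's data).

# Sketch: confusion matrix and accuracy from actual vs. predicted labels
from sklearn.metrics import confusion_matrix, accuracy_score

y_actual = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical actual labels (1 = +, 0 = -)
y_pred   = [1, 0, 0, 0, 1, 1, 0, 0]   # hypothetical predicted labels

# labels=[1, 0] puts the positive class first: rows = actual, columns = predicted
cm = confusion_matrix(y_actual, y_pred, labels=[1, 0])
print(cm)                              # [[TP, FN], [FP, TN]]
print(accuracy_score(y_actual, y_pred))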

Evaluation Measures for Classification
Accuracy/Error rate
- Accuracy: the percentage of correct predictions
  Accuracy = (number of correct predictions) / (total number of instances in the test set)
- Error rate: the percentage of incorrect predictions
  Error rate = (number of incorrect predictions) / (total number of instances in the test set)

Accuracy Measure is Limited
Consider direct mail marketing. Suppose the offer is only accepted by 1% of the households. We build two models, one by decision tree, the other one by majority- class. Which model has higher accuracy?
(Confusion matrices for the two models shown on slide.)
- Decision tree: accuracy ≈ 0.978
- Majority-class: accuracy = 0.99
+: responder
-: non-responder

Precision and Recall Measures
Precision is the number of correctly classified positive examples divided by the total number of examples that are classified as positive.
  Precision = TP / (TP + FP)   (ignores false negatives)
Recall is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
  Recall = TP / (TP + FN)   (ignores false positives)
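A minimal sketch of computing both measures with scikit-learn (same hypothetical labels as in the confusion-matrix sketch above).

# Sketch: precision and recall for the positive class
from sklearn.metrics import precision_score, recall_score

y_actual = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical actual labels
y_pred   = [1, 0, 0, 0, 1, 1, 0, 0]   # hypothetical predicted labels

print("precision:", precision_score(y_actual, y_pred))   # TP / (TP + FP)
print("recall:", recall_score(y_actual, y_pred))          # TP / (TP + FN)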

Tradeoff between Precision and Recall
Precision = Of all the fishes you caught, what % of them are red fish?
# red fishes / (# red fishes + # blue fishes)
Recall = Of all the red fishes, what % did you catch? # red fishes you caught / # all red fishes
There is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.

Still the direct mail marketing example. What is the precision and recall for both models?
Decision tree
Majority-class
Predicted (+)
Predicted (-)
Actual (+)
Actual (-)
Predicted (+)
Predicted (-)
Actual (+)
Actual (-)
Precision: ? Recall: ?
Precision: ? Recall: ?
+: responder
-: non-responder

Which one is more important for YouTube recommendation, precision or recall?
Which one is more important for COVID-19 detection, precision or recall?

Cost-Sensitive Classification
What we have discussed so far assumes errors have the same cost.
But in many business applications, different errors have different costs (actual costs or missed benefits).
- Wrongly approving a credit card application costs a lot more than wrongly denying an application
- Wrongly filtering out a good email costs a lot more than wrongly accepting a spam email
- Wrongly diagnosing an ill patient as normal costs a lot more than wrongly diagnosing a normal person as ill.

Asymmetric Misclassification Costs
Still the direct mail marketing example. Suppose the profit from a responder is $11 and the cost of sending the offer is $1. What will be the cost matrix?
*Assuming the cost of making correct classification is zero. The benefit of making a correct classification would be reflected in the missed benefit of making an incorrect classification.
Cost matrix:

                 Predicted (+)    Predicted (-)
Actual (+)       $0               $11 - $1 = $10
Actual (-)       $1               $0

+: responder
-: non-responder

Average Misclassification Cost
A popular performance measure is average misclassification cost, which measures the average cost of misclassification per classified example.
Average misclassification cost = (#FP * Cost(FP) + #FN * Cost(FN)) / n
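A small helper implementing this formula; the counts in the usage line are made-up numbers, not the slide's answer.

# Sketch: average misclassification cost per classified example
def avg_misclassification_cost(n_fp, n_fn, cost_fp, cost_fn, n_total):
    """(#FP * Cost(FP) + #FN * Cost(FN)) / n"""
    return (n_fp * cost_fp + n_fn * cost_fn) / n_total

# Hypothetical counts: 20 false positives at $1 each, 2 false negatives at $10 each, 1000 examples
print(avg_misclassification_cost(20, 2, 1.0, 10.0, 1000))   # 0.04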

Average Misclassification Cost
What if the performance measure is average misclassification cost?
(Confusion matrices for the two models shown on slide.)
- Decision tree: average misclassification cost = ?
- Majority-class: average misclassification cost = ?

Probability Estimates (PE)
So far, we have assumed that the classification output is discrete (i.e., either 0 or 1).
But most classification algorithms can return probability estimates, i.e., 0 <= P(+) <= 1.
- Used as an interim step for generating class membership predictions
- Used for the ROC curve
There are 30 positive training examples and 70 negative ones in this leaf. What would be your prediction for a test example that reaches this leaf (i.e., whose F1 >= 0.5 and F3 <= 0.3)?

Decision Threshold
The default decision threshold in 2-class classifiers is 0.5 (i.e., if P(+) >= 0.5, predict +; otherwise, predict -).
It is possible, however, to use a decision threshold that is either higher or lower than 0.5
- Typically, the error rate will rise if doing so.
Why would we want to use a decision threshold different from 0.5 if it increases the error rate?
Answer: asymmetric misclassification costs! We may want to tolerate a higher error rate for the lower-cost misclassification error.

Example for Using Decision Threshold ≠ 0.5
Still the direct mail marketing example. Suppose that P(+) = 0.3 for a customer. Should we classify the customer as a responder or a non-responder?
- Expected cost of classifying the customer as a responder (sending the offer): 0.7 * $1 = $0.7
- Expected cost of classifying the customer as a non-responder (not sending): 0.3 * $10 = $3
- Since $0.7 < $3, classify the customer as a responder.
+: responder
-: non-responder
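A minimal sketch of this expected-cost comparison; the equivalent non-0.5 threshold follows directly from the cost matrix above ($1 false-positive cost, $10 false-negative cost).

# Sketch: cost-sensitive classification from a probability estimate
COST_FP, COST_FN = 1.0, 10.0       # from the direct-mail cost matrix

def classify(p_positive):
    # Predict '+' when the expected cost of mailing is lower than the expected cost of not mailing
    expected_cost_mail = (1 - p_positive) * COST_FP   # incurred only if the customer is actually '-'
    expected_cost_skip = p_positive * COST_FN         # incurred only if the customer is actually '+'
    return "+" if expected_cost_mail < expected_cost_skip else "-"

print(classify(0.3))   # '+' because 0.7 * $1 = $0.7 < 0.3 * $10 = $3
# Equivalent decision threshold: Cost(FP) / (Cost(FP) + Cost(FN)) = 1/11, roughly 0.09 instead of 0.5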

ROC Analysis
It provides a better view of classification performance. It utilizes all four cells of the confusion matrix.
True Positive Rate (TPR) = TP / (TP + FN)   (= recall)
False Positive Rate (FPR) = FP / (FP + TN)
What do TPR and FPR really represent?

ROC Analysis of A Classifier
(ROC graph of a classifier shown on slide.)
What do (0, 0), (0,1), (1,0) and (1,1), and the diagonal line on ROC Curve mean?

ROC Analysis
Still the direct mail marketing example. Can you place two classifiers on the ROC graph?
- Decision tree: TPR = ?, FPR = ?
- Majority-class: TPR = ?, FPR = ?

Change Decision Threshold
Given the class probability estimates (PE), changing decision threshold could change TP or FP rate.
(Table shown on slide: the true values of 4 test examples, their predictions at the 0.5 threshold and at the 0.8 threshold, and the resulting true positive rate, false positive rate, and accuracy rate.)

ROC Analysis
Each decision threshold corresponds to a pair of TPR (on the y-axis) against FPR (on the x-axis), given a particular model
Changing the decision threshold may change the location of the point
(ROC graph shown on slide: one (FPR, TPR) point each for thresholds 0.9, 0.7, 0.5, and 0.3.)

Connecting dots to get a curve for a model
(ROC graph shown on slide: the threshold points 0.9, 0.7, 0.5, and 0.3 connected into a curve of TPR against FPR.)
ROC curve always starts at (0,0) and ends at (1,1).

Classifier A vs. Classifier B
(ROC graph comparing the two classifiers shown on slide.)
— Classifier A: a discrete classifier (output value is 0 or 1)
— Classifier B: a probabilistic classifier (outputs probability estimates)

Drawing an ROC Curve
(Table shown on slide: the actual class and probability estimate of each test example.)
As the decision threshold is lowered past each example, a positive example (+) moves the curve up and a negative example (-) moves it to the right.

Demo: Drawing an ROC Curve
Files needed
- ROC.ipynb (Python file)
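The notebook itself is not reproduced here; the sketch below shows what drawing an ROC curve typically looks like with scikit-learn and matplotlib (the probability estimates are made up).

# Sketch: ROC curve and AUC from probability estimates
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_actual = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]                        # hypothetical actual classes
p_est    = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]   # hypothetical P(+) estimates

fpr, tpr, thresholds = roc_curve(y_actual, p_est)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_actual, p_est))

plt.plot(fpr, tpr, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")   # diagonal baseline
plt.xlabel("false positive rate (FPR)")
plt.ylabel("true positive rate (TPR)")
plt.legend()
plt.show()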

Area under ROC Curve (AUC)
Performance measure: area under ROC curve (AUC)
The bigger the AUC, the better the model is
It measures performance under different decision thresholds, so AUC is a "deeper" measurement
- A better measure when the (uneven) costs of errors are unknown
- A better measure if the class distribution is uneven

What is the AUC of a perfect classifier?
What is the AUC of a random classifier (i.e., one that guesses positive 30% of the time and negative 70% of the time)?

Evaluating Regression Performance
Naïve rule: the average (benchmark model)
- Using the average outcome value as the prediction
- Used as a baseline or benchmark for evaluating the performance of more complicated models.
Benchmark prediction: Spending Amount for Eva (Female, 40) = ?
(Table of customers' spending amounts shown on slide.)

Evaluation Measures for Regression
Mean squared error (MSE)
- The average of the squared differences between predicted values and actual values
Root mean squared error (RMSE)
- The square root of the average of the squared differences between predicted values and actual values
Mean absolute error (MAE)
- The average of the absolute differences between predicted values and actual values
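A minimal sketch computing all three measures (the actual and predicted values are made-up numbers).

# Sketch: MSE, RMSE, and MAE for a regression model
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_actual = np.array([10843, 18560, 9625, 12000])   # hypothetical actual prices
y_pred   = np.array([11000, 18000, 9500, 13000])   # hypothetical predicted prices

mse = mean_squared_error(y_actual, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_actual, y_pred)
print("MSE:", mse, "RMSE:", rmse, "MAE:", mae)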
