Bagging boosting
performance
comparison
Option 2
Created by Zimin Zhao &
Steven Baranowitz
The goal
Weka
The goal is to determine whether a bagging method and a boosting method increase the performance of classifier models.
How to achieve
Each step to achieve the goal
Step 1
Find and select 20 datasets
Step 2
Preprocess the datasets
Step 3
Select 5 classification algorithms
Step 4
Results of 20 datasets x 5 classifier algorithms x 3 methods (10-fold cross-validation, Bagging, AdaBoostM1)
Step 5
Draw charts for comparison
Step 6
Discussion and Conclusion
Step 1: 20 datasets
http://archive.ics.uci.edu/ml/datasets.php
Abalone data
Acute Inflammations Data Set
Arrhythmia Data Set
Blood Transfusion Service Center Data Set
Breast cancer data
BUPA liver disorders
Echocardiogram Data
Glass Identification Database
Heart Disease Databases Cleveland
Heart Disease Databases Hungarian
Heart Disease Databases Switzerland
Heart Disease Databases va
Hepatitis Domain
Horse Colic database
Parkinsons Disease Data Set
Protein Localization Sites
soybean-large
soybean-small
Statlog (Heart)
Wine recognition data
Step 2: Preprocessing for datasets
Find the useful attribute names in the .names file and put them into Excel as columns.
Find the useful data in the .data file and put it into Excel as rows.
Save as .csv.
Open the .csv, find the class attribute we want to use, and convert it to nominal.
Filter functions used: NumericToNominal; MergeInfrequentNominalValues; AddValues; RemoveUseless; FirstOrder; ReplaceMissingValues
Save as .arff.
We found that much of the data downloaded from UCI cannot be used as-is,
so preprocessing is necessary.
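The steps above were done by hand in Excel and with Weka's filters. As a rough illustration of what the final .arff file contains, here is a minimal Python sketch that turns tabular rows into ARFF text. The function name `to_arff`, the relation name and the sample values are hypothetical; it assumes attribute names without spaces and `?` as the missing-value marker.

```python
def to_arff(relation, columns, rows, nominal):
    """Build ARFF text from tabular data.

    columns: list of attribute names (assumed to contain no spaces)
    rows:    list of lists of string values ('?' marks a missing value)
    nominal: set of column names to declare as nominal attributes
    """
    lines = ["@relation " + relation, ""]
    for index, name in enumerate(columns):
        if name in nominal:
            # Nominal attribute: list the distinct observed values.
            values = sorted({row[index] for row in rows if row[index] != "?"})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical three-row sample, not taken from any of the 20 datasets.
sample = to_arff(
    "demo",
    ["length", "class"],
    [["0.45", "M"], ["?", "F"], ["0.53", "M"]],
    nominal={"class"},
)
print(sample)
```

Declaring the class column as nominal here plays the same role as the NumericToNominal step in the list above.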
Step 3: 5 classification algorithms
1. NaïveBayes
2. Logistic
3. OneR
4. J48
5. RandomForest
Step 4: Results of 20 datasets x 5 classifier algorithms x 3 methods
(10-fold cross-validation, Bagging, AdaBoostM1)
Things we created for better analysis and understanding:
TPR Change:
How much TPR under Bagging and Boost changes compared to TPR under 10-fold cross-validation only.
ROCArea Change:
How much ROC area (AUC) under Bagging and Boost changes compared to ROC area under 10-fold cross-validation only.
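A minimal sketch of how such a change value can be computed, assuming it is the plain difference between the ensemble result and the 10-fold cross-validation result (positive means the ensemble did better). The function name `metric_change` and all numbers below are hypothetical, not results from the study.

```python
def metric_change(baseline, ensemble):
    """Change of a metric (TPR or ROC area) relative to the baseline run."""
    return ensemble - baseline


# Hypothetical numbers for one dataset and one classifier.
baseline_tpr = 0.80   # 10-fold cross-validation only
bagging_tpr = 0.83    # Bagging around the same classifier
boost_tpr = 0.78      # AdaBoostM1 around the same classifier

# Round to hide floating-point noise in the differences.
bag_change = round(metric_change(baseline_tpr, bagging_tpr), 4)
boost_change = round(metric_change(baseline_tpr, boost_tpr), 4)
print(bag_change, boost_change)
```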
Step 5: Draw charts for comparison.
Things we created for better analysis and understanding:
Only TPR Change and AUC Change are plotted in our charts.
The datasets are separated into two groups (1-10 is group one, 11-20 is group two).
Each chart represents one classification algorithm.
11-20 J48
TPR Change and ROC Area Change for BAG and BOOST (values rounded to five decimal places):
Dataset               TPR BAG    TPR BOOST  ROC BAG    ROC BOOST
switzerland           0          0          -0.003     0.003
va                    0.0528     -0.0404    0.0516     -0.0272
Hepatitis Domain      0.004      0.0345     0.161      -0.031
Horse Colic database  0.012      -0.029     0.066      -0.032
Parkinsons Disease    0.0315     -0.004     0.034      0.015
Protein Localization  0.0225     -0.0255    0.078      -0.02
soybean(large)        0.025      0.095      0.0598     0.005
soybean(small)        -0.025     0.025      -0.002     0.002
Statlog (Heart)       0.0185     0.021      0.128      -0.003
wine                  0.24133    -0.24133   0.30867    -0.30867
11-20 Logistic
TPR Change and ROC Area Change for BAG and BOOST (values rounded to five decimal places):
Dataset               TPR BAG    TPR BOOST  ROC BAG    ROC BOOST
switzerland           0.021      -0.1735    -0.0115    -0.2155
va                    0.0244     -0.0244    0.0628     -0.0628
Hepatitis Domain      0.004      -0.004     0.051      -0.143
Horse Colic database  0          0          0.012      -0.084
Parkinsons Disease    0.003      0.0075     0.0455     -0.115
Protein Localization  0          0          0.002      -0.074
soybean(large)        -0.04      0.06       0.0044     -0.0306
soybean(small)        0          0          0          0
Statlog (Heart)       -0.006     -0.005     -0.001     -0.031
wine                  0.013      -0.013     0.00633    -0.00633
11-20 NB
TPR Change and ROC Area Change for BAG and BOOST (values rounded to five decimal places):
Dataset               TPR BAG    TPR BOOST  ROC BAG    ROC BOOST
switzerland           -0.0165    0.021      -0.055     0.001
va                    -0.015     0.0194     -0.0012    -0.0868
Hepatitis Domain      0          -0.0775    0.026      -0.118
Horse Colic database  0.0015     0.022      0.0005     0.013
Parkinsons Disease    -0.007     0.007      -0.015     -0.0855
Protein Localization  0.005      -0.014     0.002      -0.019
soybean(large)        -0.005     0.01       0.001      -0.0362
soybean(small)        0          0          0.0165     -0.0165
Statlog (Heart)       -0.003     -0.008     0.005      -0.035
wine                  -0.01067   0.01067    -0.01567   0.01567
11-20 OneR
TPR Change and ROC Area Change for BAG and BOOST (values rounded to five decimal places):
Dataset               TPR BAG    TPR BOOST  ROC BAG    ROC BOOST
switzerland           0          0          -0.0275    0.0275
va                    0.0236     -0.0088    0.012      -0.0032
Hepatitis Domain      0.0365     -0.0245    0.151      -0.003
Horse Colic database  0.0195     -0.0535    0.089      -0.046
Parkinsons Disease    0.024      0.0825     0.111      0.031
Protein Localization  -0.0075    -0.0345    0.021      0.027
soybean(large)        0.145      -0.06      0.0774     -0.0534
soybean(small)        0.04875    0.069      0.08475    0.001
Statlog (Heart)       0.007      -0.006     0.107      -0.027
wine                  -0.11      -0.021     0.125      -0.09733
11-20 RF
TPR Change and ROC Area Change for BAG and BOOST (values rounded to five decimal places):
Dataset               TPR BAG    TPR BOOST  ROC BAG    ROC BOOST
switzerland           -0.0415    0.031      -0.014     0.011
va                    -0.0018    0.0572     -0.0102    0.0096
Hepatitis Domain      -0.125     0.1055     -0.068     0.056
Horse Colic database  -0.0045    -0.0315    -0.004     -0.0435
Parkinsons Disease    0.024      -0.0175    -0.001     0.002
Protein Localization  0.009      -0.0125    0.009      -0.009
soybean(large)        -0.02      0.03       -0.0002    0
soybean(small)        0          0          0          0
Statlog (Heart)       0.006      0.0105     0.004      -0.001
wine                  -0.01167   -0.011     0.01967    -0.04
Step 6:
Discussion and conclusion
Naïve Bayes
Bagging brings little improvement, but Boost does a good job on more than one dataset.
[Chart: 11-20 NB, TPR Change and ROC Area Change for BAG and BOOST; same data as the 11-20 NB chart in Step 5]
Logistic
Neither Bagging nor Boost brings much improvement; sometimes Boost even performs worse than the classifier on its own.
[Chart: 11-20 Logistic, TPR Change and ROC Area Change for BAG and BOOST; same data as the 11-20 Logistic chart in Step 5]
OneR
Overall, Bagging does a better job than Boost. In some cases Boost makes the classifier worse than before.
[Chart: 11-20 OneR, TPR Change and ROC Area Change for BAG and BOOST; same data as the 11-20 OneR chart in Step 5]
J48
Bagging is better than Boost. On the wine dataset, Boost even performs worse than the classifier on its own, while Bagging improves on it.
[Chart: 11-20 J48, TPR Change and ROC Area Change for BAG and BOOST; same data as the 11-20 J48 chart in Step 5]
Random Forest
Bagging not only fails to improve much, it also makes performance worse overall. Boost brings some large improvements.
[Chart: 11-20 RF, TPR Change and ROC Area Change for BAG and BOOST; same data as the 11-20 RF chart in Step 5]
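Keep digging into the difference:
sampling method
Bagging uses uniform sampling (with replacement).
Boosting samples according to the error rate: instances misclassified in earlier rounds are more likely to be drawn.
So boosting can reach better classification accuracy than bagging, although in our results it did not do so for every classifier.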
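The two sampling schemes can be sketched as follows. This is only an illustration of the idea, not Weka's implementation: boosting is shown in its resampling form (it can also be done by reweighting instances), and the function names and weights are hypothetical.

```python
import random


def bootstrap_sample(n, rng):
    """Bagging: draw n indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def weighted_sample(weights, rng):
    """Boosting by resampling: draw indices in proportion to their weights."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
bag_indices = bootstrap_sample(5, rng)

# Hypothetical weights: instance 4 was misclassified earlier, so it is heavier.
boost_indices = weighted_sample([0.1, 0.1, 0.1, 0.1, 0.6], rng)
print(bag_indices, boost_indices)
```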
Keep digging into the difference:
selection of the training set
In bagging, the training set is selected at random, and the training sets of different rounds are independent of each other.
In boosting, the training set selected in each round depends on the learning results of the previous round.
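To make the dependence on the previous round concrete, here is a minimal sketch of one weight update in the style of AdaBoost.M1: correctly classified instances are scaled down by error / (1 - error), then all weights are renormalised. The function name `update_weights` and the sample values are hypothetical, and this is not taken from Weka's source.

```python
def update_weights(weights, correct):
    """One AdaBoost.M1-style round: shrink the weights of correctly
    classified instances, then renormalise so the weights sum to 1."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error == 0 or error >= 0.5:
        # AdaBoost.M1 stops boosting in these cases; return weights unchanged.
        return list(weights)
    beta = error / (1 - error)
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical round: four equally weighted instances, the last one misclassified.
new_weights = update_weights([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(new_weights)
```

After the update the misclassified instance carries half of the total weight, so the next round concentrates on it.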
Keep digging into the difference:
prediction function
In bagging, the prediction functions have no weights.
In boosting, each prediction function has a weight.
Bagging's prediction functions can be generated in parallel, while boosting's prediction functions can only be generated sequentially.
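The difference in how predictions are combined can be sketched like this. It shows the general idea of equal versus weighted votes, not Weka's exact combination rule; the function names, predictions and weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Boosting: each model's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def majority_vote(predictions):
    """Bagging: every model has one equal vote."""
    return weighted_vote(predictions, [1.0] * len(predictions))


# Hypothetical predictions of three models for one instance.
predictions = ["yes", "no", "no"]
bagging_label = majority_vote(predictions)
boosting_label = weighted_vote(predictions, [2.0, 0.5, 0.5])
print(bagging_label, boosting_label)
```

With equal votes the two "no" models win; once the first model is given a larger weight, its "yes" decides the outcome.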
That’s all.
Thank you for listening!