School of Computing and Information Systems The University of Melbourne
COMP90049 Introduction to Machine Learning (Semester 1, 2022) Workshop: Week 11
1. For the following dataset:
ID   Outl  Temp  Humi  Wind  PLAY
TRAINING INSTANCES
A    s     h     h     F     N
B    s     h     h     T     N
C    o     h     h     F     Y
D    r     m     h     F     Y
E    r     c     n     F     Y
F    r     c     n     T     N

TEST INSTANCES
G    o     c     n     T     ?
H    s     m     h     F     ?

Classify the test instances using the ID3 Decision Tree method:
a) Using the Information Gain as a splitting criterion
For Information Gain, at each level of the decision tree, we're going to choose the attribute that has the largest difference between the entropy of the class distribution at the parent node and the average entropy across its daughter nodes (weighted by the fraction of instances at each node):
IG(A|R) = H(R) − Σ_{i ∈ A} P(A = i) H(A = i)
In this dataset, we have 6 instances in total: 3 Y and 3 N. The entropy at the top level of our tree is H(R) = −(3/6 log2(3/6) + 3/6 log2(3/6)) = 1.
This is a very even distribution. We hope that branching the tree according to an attribute will cause the daughters to have a more uneven distribution, which means that we will be able to select a class with more confidence, and so the entropy will go down.
For example, for the attribute Outl, we have three attribute values: s, o, r.
– When Outl=s, there are 2 instances, which are both N. The entropy of this distribution is H(Outl=s) = −(0/2 log2(0/2) + 2/2 log2(2/2)) = 0. Obviously, at this branch, we will choose N with a high degree of confidence.
– When Outl=o, there is a single instance, of class Y. The entropy here is going to be 0 as well.
– When Outl=r, there are 2 Y instances and 1 N instance. The entropy here is H(Outl=r) = −(2/3 log2(2/3) + 1/3 log2(1/3)) ≈ 0.9183.
To find the average entropy (the "mean information"), we sum the calculated entropy at each daughter multiplied by the fraction of instances at that daughter: MI(Outl) = (2/6)(0) + (1/6)(0) + (3/6)(0.9183) ≈ 0.4592.
The overall Information Gain here is IG(Outl) = H(R) − MI(Outl) = 1 − 0.4592 = 0.5408.
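As a quick sanity check, here is a minimal Python sketch (the variable and helper names are our own, purely illustrative) that reproduces these figures, up to rounding in the last digit:

import math
from collections import Counter

# Toy training data from the question: (ID, Outl, Temp, Humi, Wind, PLAY)
TRAIN = [
    ("A", "s", "h", "h", "F", "N"),
    ("B", "s", "h", "h", "T", "N"),
    ("C", "o", "h", "h", "F", "Y"),
    ("D", "r", "m", "h", "F", "Y"),
    ("E", "r", "c", "n", "F", "Y"),
    ("F", "r", "c", "n", "T", "N"),
]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mean_info(instances, attr_index):
    """Average daughter entropy, weighted by the fraction of instances at each daughter."""
    n = len(instances)
    groups = {}
    for inst in instances:
        groups.setdefault(inst[attr_index], []).append(inst[-1])
    return sum(len(lbls) / n * entropy(lbls) for lbls in groups.values())

labels = [inst[-1] for inst in TRAIN]
H_R = entropy(labels)          # 1.0
MI_outl = mean_info(TRAIN, 1)  # ~0.4591 (Outl is column index 1)
print(round(H_R, 4), round(MI_outl, 4), round(H_R - MI_outl, 4))  # IG(Outl) ~0.5409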
            Root   Outl                  Temp                  Humi           Wind            ID
                   s      o      r       h      m      c       h      n       T      F        A    B    C    D    E    F
Y           3      0      1      2       1      1      1       2      1       0      3        0    0    1    1    1    0
N           3      2      0      1       2      0      1       2      1       2      1        1    1    0    0    0    1
Total       6      2      1      3       3      1      2       4      2       2      4        1    1    1    1    1    1
P(Y)        1/2    0      1      2/3     1/3    1      1/2     1/2    1/2     0      3/4      0    0    1    1    1    0
P(N)        1/2    1      0      1/3     2/3    0      1/2     1/2    1/2     1      1/4      1    1    0    0    0    1
H           1      0      0      0.9183  0.9183 0      1       1      1       0      0.8112   0    0    0    0    0    0

            Outl     Temp     Humi     Wind     ID
MI          0.4592   0.7924   1        0.5408   0
IG          0.5408   0.2076   0        0.4592   1
SI          1.459    1.459    0.9183   0.9183   2.585
GR          0.3707   0.1423   0        0.5001   0.3868
The table above lists the Mean Information and Information Gain, for each of the 5 attributes.
At this point, ID has the best information gain, so hypothetically we would use that to split the root node. At that point, we would be done, because each daughter is purely of a single class — however, we would be left with a completely useless classifier! (Because the IDs of the test instances won’t have been observed in the training data.)
Instead, let's take the second-best attribute: Outl.
There are now three branches from our root node: for s, for o, and for r. The first two are
pure, so we can’t improve them anymore. Let’s examine the third branch (Outl=r):
– Three instances (D, E, and F) have the attribute value r; we’ve already calculated
the entropy here to be 0.9183.
– If we split now according to Temp, we observe that there is a single instance for the value m (of class Y, so the entropy is clearly 0), and there are two instances for the value c, one of class Y and one of class N (so the entropy here is 1). The mean information is (1/3)(0) + (2/3)(1) ≈ 0.6667, and the information gain at this point is 0.9183 − 0.6667 ≈ 0.2516.
– For Humi, we again have a single instance (with value h, class Y, H = 0), and two instances (of n) split between the two classes (H = 1). The mean information here will also be 0.6667, and the information gain 0.2516.
– For Wind, there are two F instances, both of class Y (H = 0), and one T instance of class N (H = 0). Here, the mean information is 0 and the information gain is 0.9183.
– ID would still look like a good attribute to choose, but we’ll continue to ignore it.
– All in all, we will choose to branch based on Wind for this daughter. All of the daughters of r are pure now, so our decision tree is complete:
– Outl=o ∪ (Outl=r ∩ Wind=F) → Y (so we classify G as Y)
– Outl=s ∪ (Outl=r ∩ Wind=T) → N (so we classify H as N)
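As an illustration (not part of the original answer), the finished tree can be written as a couple of plain if/else rules and applied to the two test instances:

def ig_tree(outl, wind):
    """ID3 tree built with Information Gain: split on Outl, then on Wind when Outl=r."""
    if outl == "s":
        return "N"
    if outl == "o":
        return "Y"
    return "Y" if wind == "F" else "N"   # Outl = r

# Test instances: G = (Outl=o, Temp=c, Humi=n, Wind=T), H = (Outl=s, Temp=m, Humi=h, Wind=F)
print("G ->", ig_tree("o", "T"))   # Y
print("H ->", ig_tree("s", "F"))   # N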
b) Using the Gain Ratio as a splitting criterion
Gain ratio is similar, except that we divide the information gain by the "split information": the entropy of the distribution of instances across the daughters of a given attribute. This can weight an attribute's score down (or up!) relative to its raw information gain.
For example, we found that, for the root node, Outl has an information gain of 0.5408. There are 2 (out of 6) instances at the s daughter, 1 at the o daughter, and 3 at the r daughter.
The split information for Outl is SI(Outl) = −(2/6 log2(2/6) + 1/6 log2(1/6) + 3/6 log2(3/6)) ≈ 1.459.
The Gain Ratio is consequently GR(Outl) = IG(Outl) / SI(Outl) ≈ 0.5408 / 1.459 ≈ 0.3707.
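For reference, a small illustrative sketch (the helper name entropy_from_counts is our own) reproduces the IG, SI, and GR figures for Outl at the root node, again up to rounding in the last digit:

import math

def entropy_from_counts(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

H_root = entropy_from_counts([3, 3])   # 3 Y and 3 N at the root -> 1.0

# Outl splits the 6 instances into daughters of size 2 (s), 1 (o) and 3 (r),
# whose class distributions are (0 Y, 2 N), (1 Y, 0 N) and (2 Y, 1 N).
sizes = [2, 1, 3]
daughters = [[0, 2], [1, 0], [2, 1]]

MI = sum(s / 6 * entropy_from_counts(d) for s, d in zip(sizes, daughters))
IG = H_root - MI                 # ~0.5409
SI = entropy_from_counts(sizes)  # ~1.459, the split information
print(round(IG, 4), round(SI, 4), round(IG / SI, 4))  # gain ratio ~0.3707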
The values for split information and gain ratio for each attribute at the root node are shown in the table above. The best attribute (with the greatest gain ratio) at the top level this time is Wind.
Wind has two branches: T is pure, so we focus on improving F (which has 3 Y instances (C, D, E), and 1 N instance (A)). The entropy of this daughter is 0.8112.
– For Outl, we have a single instance at s (class N, H = 0), a single instance at o (class Y, H = 0), and 2 Y instances at r (H = 0). The mean information here is clearly 0; the information gain is 0.8112. The split information is SI(Outl | (W=F)) = −(1/4 log2(1/4) + 1/4 log2(1/4) + 2/4 log2(2/4)) = 1.5, so the gain ratio is GR(Outl | (W=F)) = 0.8112 / 1.5 ≈ 0.5408.
– For Temp, we have two h instances (one Y and one N, so H = 1), a single m instance (Y, H = 0), and a single c instance (Y, H = 0). The mean information is (2/4)(1) + (1/4)(0) + (1/4)(0) = 0.5, so the information gain is 0.8112 − 0.5 = 0.3112. The distribution of instances here is the same as for Outl, so the split information is also 1.5, and the gain ratio is GR(Temp | (W=F)) = 0.3112 / 1.5 ≈ 0.2075.
– For Humi, we have 3 h instances (2 Y and 1 N, H = 0.9183), and 1 n instance (Y, H = 0): the mean information is (3/4)(0.9183) + (1/4)(0) ≈ 0.6887 and the information gain is 0.8112 − 0.6887 = 0.1225. The split information is SI(Humi | (W=F)) = −(3/4 log2(3/4) + 1/4 log2(1/4)) ≈ 0.8112, so the gain ratio is GR(Humi | (W=F)) = 0.1225 / 0.8112 ≈ 0.1510.
– For ID, the mean information is obviously still 0, so the information gain is 0.8112. The split information at this point is −(1/4 log2(1/4) + 1/4 log2(1/4) + 1/4 log2(1/4) + 1/4 log2(1/4)) = 2, so the gain ratio is approximately 0.4056.
Of our four choices at this point, Outl has the best gain ratio. The resulting daughters are all pure, so the decision tree is finished:
– Wind=F ∩(Outl=o ∪ Outl=r) → Y
– Wind=T ∪(Wind=F ∩ Outl=s) → N (so we classify G and H as N)
Note that this decision tree is superficially similar to the tree above, but it gives different classifications (G is now classified as N rather than Y) because of the order in which the attributes are considered.
Note also that we didn’t need to explicitly ignore the ID attribute for Gain Ratio (as we needed to do for Information Gain) — the split information pushed down its “goodness” to the point where we didn’t want to use it anyway!
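To put a number on this, here is a tiny illustrative check (reusing the root-node figures from the table above): the split information of a unique-ID attribute is log2 of the number of instances, which is what drags its gain ratio below Wind's.

import math

IG_id = 1.0                  # ID makes every daughter pure, so IG = H(R) = 1
SI_id = math.log2(6)         # 6 singleton daughters -> SI = log2(6) ~ 2.585
GR_id = IG_id / SI_id        # ~0.387

GR_wind = 0.4592 / 0.9183    # IG(Wind) / SI(Wind) from the root-node table ~0.5

print(f"GR(ID)   = {GR_id:.3f}")
print(f"GR(Wind) = {GR_wind:.3f}")  # Wind wins at the root despite ID's perfect IG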
2. Let's revisit the logic behind the voting method of classifier combination (used in Bagging, Random Forests, and, to some extent, Boosting). We are assuming that the errors between all of the classifiers are uncorrelated.
(a) First, let's assume our three independent classifiers all have an error rate of e = 0.4, calculated over 1000 instances with binary labels (500 A and 500 B).
(i) Build the confusion matrices for these classifiers, based on the assumptions above.
All our systems have an error rate of 0.4. This means that 40% of the time a system makes a mistake, and 60% of the time it makes a correct prediction. So, of the 500 instances with label A, 300 will be (correctly) predicted as A, and 200 will be (incorrectly) labelled as B.
Now, since the errors are assumed to be independent, of the 300 instances that system 1 labelled as A (and likewise the 200 it labelled as B), system 2 will again label 60% (correctly) as A and 40% (incorrectly) as B, and so on.
The table below contains all the counts for our three-classifier ensemble.
Actual class   Sys 1 prediction      Sys 2 prediction      Sys 3 prediction
A (500)        A (500 x 0.6 = 300)   A (300 x 0.6 = 180)   A (180 x 0.6 = 108)
                                                           B (180 x 0.4 =  72)
                                     B (300 x 0.4 = 120)   A (120 x 0.6 =  72)
                                                           B (120 x 0.4 =  48)
               B (500 x 0.4 = 200)   A (200 x 0.6 = 120)   A (120 x 0.6 =  72)
                                                           B (120 x 0.4 =  48)
                                     B (200 x 0.4 =  80)   A ( 80 x 0.6 =  48)
                                                           B ( 80 x 0.4 =  32)
B (500)        A (500 x 0.4 = 200)   A (200 x 0.4 =  80)   A ( 80 x 0.4 =  32)
                                                           B ( 80 x 0.6 =  48)
                                     B (200 x 0.6 = 120)   A (120 x 0.4 =  48)
                                                           B (120 x 0.6 =  72)
               B (500 x 0.6 = 300)   A (300 x 0.4 = 120)   A (120 x 0.4 =  48)
                                                           B (120 x 0.6 =  72)
                                     B (300 x 0.6 = 180)   A (180 x 0.4 =  72)
                                                           B (180 x 0.6 = 108)
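These counts can be reproduced with a short illustrative sketch (the names are ours): each leaf count is simply 500 multiplied by the probability of that particular chain of correct/incorrect predictions.

from itertools import product

e = 0.4                 # error rate of each of the three independent systems
n_per_class = 500

# For each combination of correct/wrong decisions by the three systems, the
# expected count is 500 x the product of the corresponding probabilities.
for outcome in product([True, False], repeat=3):   # True = correct prediction
    p = 1.0
    for correct in outcome:
        p *= (1 - e) if correct else e
    print(outcome, "->", round(n_per_class * p))
# e.g. (True, True, True) -> 108 and (False, False, False) -> 32, matching the
# Sys 3 column of the table above for the 500 instances of one actual class.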
(ii) Using majority voting, what is the expected error rate of the voting ensemble?
Since we have 3 systems, using majority voting is very easy: for each instance, we check the results of the 3 systems, and if 2 out of 3 vote for one class, we choose that class as the label for the whole system. The results for the voting system are shown in the following table, where the incorrect predictions are marked with *.
Actual class   Sys 1   Sys 2   Sys 3   #     Majority vote
A (500)        A       A       A       108   Maj(A,A,A) = A
               A       A       B        72   Maj(A,A,B) = A
               A       B       A        72   Maj(A,B,A) = A
               A       B       B        48   Maj(A,B,B) = B  *
               B       A       A        72   Maj(B,A,A) = A
               B       A       B        48   Maj(B,A,B) = B  *
               B       B       A        48   Maj(B,B,A) = B  *
               B       B       B        32   Maj(B,B,B) = B  *
B (500)        A       A       A        32   Maj(A,A,A) = A  *
               A       A       B        48   Maj(A,A,B) = A  *
               A       B       A        48   Maj(A,B,A) = A  *
               A       B       B        72   Maj(A,B,B) = B
               B       A       A        48   Maj(B,A,A) = A  *
               B       A       B        72   Maj(B,A,B) = B
               B       B       A        72   Maj(B,B,A) = B
               B       B       B       108   Maj(B,B,B) = B
From this table we can identify that the total count of incorrectly classified instances is:
error = 48 + 48 + 48 + 32 + 32 + 48 + 48 + 48 = 352
Therefore, the error rate of the final system (the ensemble of three learners, each with an error rate of 0.4) is 352/1000 = 35.2% = 0.352. This is better than the error rate of each system individually, mostly because using three systems allows us to disambiguate the instances that any single system cannot classify correctly. In other words, the voting system helps the learners to correct each other's mistakes.
This relies on the assumption of errors being uncorrelated: if the errors were perfectly correlated, we would see no improvement; if the errors were mostly correlated, we would see only a little improvement.
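Under the independence assumption, the same figure can be obtained directly: the majority vote is wrong exactly when at least two of the three classifiers are wrong. A minimal sketch (the function name is just an illustrative choice):

from itertools import product

def majority_vote_error(e1, e2, e3):
    """Probability that at least 2 of 3 independent classifiers are wrong."""
    total = 0.0
    for wrong in product([True, False], repeat=3):
        if sum(wrong) >= 2:   # the majority of the ensemble is wrong
            p = 1.0
            for w, e in zip(wrong, (e1, e2, e3)):
                p *= e if w else (1 - e)
            total += p
    return total

print(round(majority_vote_error(0.4, 0.4, 0.4), 4))   # 0.352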
(b) Now consider three classifiers, the first with e1 = 0.1, and the second and third with e2 = e3 = 0.2.
(i) Build the confusion matrices.
Similar to part (a), we can build a combined table of counts for all three systems, where the first system makes a correct prediction for 90% of the instances (and therefore an incorrect prediction for 10% of the instances), and the next two systems make correct predictions for only 80% of the instances (and so each makes 20% incorrect predictions). The following table contains the counts for all the different combinations of predictions in this ensemble.
Actual class   Sys 1 prediction      Sys 2 prediction      Sys 3 prediction
A (500)        A (500 x 0.9 = 450)   A (450 x 0.8 = 360)   A (360 x 0.8 = 288)
                                                           B (360 x 0.2 =  72)
                                     B (450 x 0.2 =  90)   A ( 90 x 0.8 =  72)
                                                           B ( 90 x 0.2 =  18)
               B (500 x 0.1 =  50)   A ( 50 x 0.8 =  40)   A ( 40 x 0.8 =  32)
                                                           B ( 40 x 0.2 =   8)
                                     B ( 50 x 0.2 =  10)   A ( 10 x 0.8 =   8)
                                                           B ( 10 x 0.2 =   2)
B (500)        A (500 x 0.1 =  50)   A ( 50 x 0.2 =  10)   A ( 10 x 0.2 =   2)
                                                           B ( 10 x 0.8 =   8)
                                     B ( 50 x 0.8 =  40)   A ( 40 x 0.2 =   8)
                                                           B ( 40 x 0.8 =  32)
               B (500 x 0.9 = 450)   A (450 x 0.2 =  90)   A ( 90 x 0.2 =  18)
                                                           B ( 90 x 0.8 =  72)
                                     B (450 x 0.8 = 360)   A (360 x 0.2 =  72)
                                                           B (360 x 0.8 = 288)
(ii) Using majority voting, what is the expected error rate of the voting ensemble?
The results for the voting system are shown in the following table, where the incorrect predictions are marked with *.
Actual class   Sys 1   Sys 2   Sys 3   #     Majority vote
A (500)        A       A       A       288   Maj(A,A,A) = A
               A       A       B        72   Maj(A,A,B) = A
               A       B       A        72   Maj(A,B,A) = A
               A       B       B        18   Maj(A,B,B) = B  *
               B       A       A        32   Maj(B,A,A) = A
               B       A       B         8   Maj(B,A,B) = B  *
               B       B       A         8   Maj(B,B,A) = B  *
               B       B       B         2   Maj(B,B,B) = B  *
B (500)        A       A       A         2   Maj(A,A,A) = A  *
               A       A       B         8   Maj(A,A,B) = A  *
               A       B       A         8   Maj(A,B,A) = A  *
               A       B       B        32   Maj(A,B,B) = B
               B       A       A        18   Maj(B,A,A) = A  *
               B       A       B        72   Maj(B,A,B) = B
               B       B       A        72   Maj(B,B,A) = B
               B       B       B       288   Maj(B,B,B) = B
From this table we can identify that the total count of incorrectly classified instances is:
Error = 18 + 8 + 8 + 2 + 2 + 8 + 8 + 18 = 72
Therefore, the error rate of the final system (the ensemble of three learners with e1 = 0.1 and e2 = e3 = 0.2) is now 72/1000 = 7.2% = 0.072. This is better than the error rate of even the best individual system (e1 = 0.1).
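The same kind of enumeration as in part (a), specialised to these error rates, confirms the figure (again just an illustrative check):

from itertools import product
from math import prod

errors = (0.1, 0.2, 0.2)

# Sum the probabilities of every outcome in which at least 2 of the 3
# independent classifiers are wrong (True = wrong prediction).
p_ensemble_wrong = sum(
    prod(e if w else 1 - e for w, e in zip(wrong, errors))
    for wrong in product([True, False], repeat=3)
    if sum(wrong) >= 2
)
print(round(p_ensemble_wrong, 4))   # 0.072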
(iii) What if we relax our assumption of independent errors? In other words, what will happen if the errors between the systems were very highly correlated instead? (Systems make similar mistakes.)
Basically, if all the systems make the same predictions, the error rate of the ensemble will be roughly the same as the error rate of the individual classifiers, and voting is unlikely to improve the ensemble. Even if two of the classifiers are correlated and the third is uncorrelated, the two correlated systems will tend to "out-vote" the third system on erroneous instances.
Therefore, if we want to use an ensemble method, it's best to use uncorrelated learners (classifiers).
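A small Monte Carlo sketch (the parameters and names are hypothetical, purely for illustration) makes the contrast visible: with independent errors the voting error falls towards 0.352, while with perfectly correlated errors it stays around the individual error rate of 0.4.

import random

random.seed(0)
N, e = 100_000, 0.4

def voting_error(correlated):
    """Fraction of instances on which the 3-classifier majority vote is wrong."""
    wrong_votes = 0
    for _ in range(N):
        if correlated:
            # All three systems err on exactly the same instances.
            mistakes = [random.random() < e] * 3
        else:
            # Each system errs independently with probability e.
            mistakes = [random.random() < e for _ in range(3)]
        if sum(mistakes) >= 2:   # the majority is wrong
            wrong_votes += 1
    return wrong_votes / N

print("independent errors:", voting_error(False))   # close to 0.352
print("correlated errors: ", voting_error(True))    # close to 0.4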