
ASSIGNMENT-3 SOLUTIONS

Given table:

Occupation  | Gender | Age | Salary  | Level
------------|--------|-----|---------|--------
Service     | Female | 45  | $48,000 | Level 3
Service     | Male   | 25  | $25,000 | Level 1
Service     | Male   | 33  | $35,000 | Level 2
Management  | Male   | 25  | $45,000 | Level 3
Management  | Female | 35  | $65,000 | Level 4
Management  | Male   | 26  | $45,000 | Level 3
Management  | Female | 45  | $70,000 | Level 4
Sales       | Female | 40  | $50,000 | Level 3
Sales       | Male   | 30  | $40,000 | Level 2
Staff       | Female | 50  | $40,000 | Level 2
Staff       | Male   | 25  | $25,000 | Level 1
1) Construct a classification and regression tree to classify salary based on the other variables. Do as much as you can by hand, before turning to the software.
Ans.)
Candidate Split | Left Child Node         | Right Child Node
----------------|-------------------------|----------------------------------------
1               | Occupation = Service    | Occupation = Management, Sales, Staff
2               | Occupation = Management | Occupation = Service, Sales, Staff
3               | Occupation = Sales      | Occupation = Service, Management, Staff
4               | Occupation = Staff      | Occupation = Service, Management, Sales
5               | Gender = Male           | Gender = Female
6               | Age <= 30               | Age > 30
7               | Age <= 40               | Age > 40
P_L = (number of records in the left child of node T) / (number of records in the training set)

P_R = (number of records in the right child of node T) / (number of records in the training set)

P(j|T_L) = (number of class-j records in the left child of T) / (number of records in the left child of T)

P(j|T_R) = (number of class-j records in the right child of T) / (number of records in the right child of T)

Goodness of split:  Φ(s|T) = 2 · P_L · P_R · Q(s|T),  where  Q(s|T) = Σ_j |P(j|T_L) − P(j|T_R)|

Split | P_L    | P_R    | P(j|T_L)                                   | P(j|T_R)                                   | 2·P_L·P_R | Q(s|T) | Φ(s|T)
------|--------|--------|--------------------------------------------|--------------------------------------------|-----------|--------|-------
1     | 0.2727 | 0.7273 | I: 0.333, II: 0.333, III: 0.333, IV: 0.0   | I: 0.125, II: 0.250, III: 0.375, IV: 0.250 | 0.3966    | 0.583  | 0.2312
2     | 0.3636 | 0.6363 | I: 0.0, II: 0.0, III: 0.5, IV: 0.5         | I: 0.285, II: 0.428, III: 0.285, IV: 0.0   | 0.4626    | 1.428  | 0.660
3     | 0.1818 | 0.8181 | I: 0.0, II: 0.5, III: 0.5, IV: 0.0         | I: 0.222, II: 0.222, III: 0.333, IV: 0.222 | 0.2974    | 0.889  | 0.2643
4     | 0.1818 | 0.8181 | I: 0.5, II: 0.5, III: 0.0, IV: 0.0         | I: 0.111, II: 0.222, III: 0.444, IV: 0.222 | 0.2974    | 1.333  | 0.3964
5     | 0.5454 | 0.4546 | I: 0.333, II: 0.333, III: 0.333, IV: 0.0   | I: 0.0, II: 0.2, III: 0.4, IV: 0.4         | 0.4958    | 0.933  | 0.4626
6     | 0.4546 | 0.5454 | I: 0.4, II: 0.2, III: 0.4, IV: 0.0         | I: 0.0, II: 0.333, III: 0.333, IV: 0.333   | 0.4958    | 0.933  | 0.4626
7     | 0.7273 | 0.2727 | I: 0.250, II: 0.250, III: 0.375, IV: 0.125 | I: 0.333, II: 0.0, III: 0.333, IV: 0.333   | 0.3966    | 0.583  | 0.2312
So the maximum value of Φ(s|T) is achieved by candidate split 2: Occupation = Management forms the left child node, and Service, Sales, and Staff form the right child node.
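As a cross-check on the table above, here is a minimal Python sketch (my own helper, not part of the assignment) that recomputes Φ(s|T) for the seven root-level candidate splits; filtering records down to the Staff, Sales, and Service rows reproduces the second-level table further below as well.

    # Recompute Phi(s|T) = 2*PL*PR * sum_j |P(j|TL) - P(j|TR)| for each
    # candidate split on the 11-record training set.
    records = [  # (occupation, gender, age, level)
        ("Service", "Female", 45, 3), ("Service", "Male", 25, 1),
        ("Service", "Male", 33, 2), ("Management", "Male", 25, 3),
        ("Management", "Female", 35, 4), ("Management", "Male", 26, 3),
        ("Management", "Female", 45, 4), ("Sales", "Female", 40, 3),
        ("Sales", "Male", 30, 2), ("Staff", "Female", 50, 2),
        ("Staff", "Male", 25, 1),
    ]

    def phi(left, right):
        """Goodness of split at a node whose records are left + right."""
        n = len(left) + len(right)
        pl, pr = len(left) / n, len(right) / n
        q = sum(abs(sum(r[3] == j for r in left) / len(left)
                    - sum(r[3] == j for r in right) / len(right))
                for j in (1, 2, 3, 4))
        return 2 * pl * pr * q

    splits = {
        "Occupation=Service":    lambda r: r[0] == "Service",
        "Occupation=Management": lambda r: r[0] == "Management",
        "Occupation=Sales":      lambda r: r[0] == "Sales",
        "Occupation=Staff":      lambda r: r[0] == "Staff",
        "Gender=Male":           lambda r: r[1] == "Male",
        "Age<=30":               lambda r: r[2] <= 30,
        "Age<=40":               lambda r: r[2] <= 40,
    }
    for name, test in splits.items():
        left = [r for r in records if test(r)]
        right = [r for r in records if not test(r)]
        print(f"{name:24s} Phi = {phi(left, right):.4f}")

The Occupation=Management split should come out at about 0.661, which the hand table rounds to 0.660.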
The tree so far (the Management branch is resolved by the Age < 30 split, which yields pure leaves):

Occupation
|-- Management
|   |-- Age < 30  -> Level 3 (records 4, 6)
|   `-- Age >= 30 -> Level 4 (records 5, 7)
`-- Staff, Sales, Service (split further below)

So now we calculate the split for the node containing Staff, Sales, and Service, ignoring records 4-7 (the Management records) in the calculations.
Candidate Split | Left Child Node      | Right Child Node
----------------|----------------------|----------------------------
1               | Occupation = Service | Occupation = Sales, Staff
3               | Occupation = Sales   | Occupation = Service, Staff
4               | Occupation = Staff   | Occupation = Service, Sales
5               | Gender = Male        | Gender = Female
6               | Age <= 30            | Age > 30
7               | Age <= 40            | Age > 40

(Candidate split 2, Occupation = Management, is omitted because no Management records remain in this node.)

Split | P_L    | P_R    | P(j|T_L)                                 | P(j|T_R)                               | 2·P_L·P_R | Q(s|T) | Φ(s|T)
------|--------|--------|------------------------------------------|----------------------------------------|-----------|--------|-------
1     | 0.4285 | 0.5714 | I: 0.333, II: 0.333, III: 0.333, IV: 0.0 | I: 0.250, II: 0.5, III: 0.250, IV: 0.0 | 0.4896    | 0.332  | 0.1625
3     | 0.2857 | 0.7142 | I: 0.0, II: 0.5, III: 0.5, IV: 0.0       | I: 0.4, II: 0.4, III: 0.2, IV: 0.0     | 0.4080    | 0.8    | 0.3264
4     | 0.2857 | 0.7142 | I: 0.5, II: 0.5, III: 0.0, IV: 0.0       | I: 0.2, II: 0.4, III: 0.4, IV: 0.0     | 0.4080    | 0.8    | 0.3264
5     | 0.5714 | 0.4285 | I: 0.5, II: 0.5, III: 0.0, IV: 0.0       | I: 0.0, II: 0.333, III: 0.666, IV: 0.0 | 0.4896    | 1.333  | 0.6526
6     | 0.4285 | 0.5714 | I: 0.666, II: 0.333, III: 0.0, IV: 0.0   | I: 0.0, II: 0.5, III: 0.5, IV: 0.0     | 0.4896    | 1.333  | 0.6526
7     | 0.7142 | 0.2857 | I: 0.4, II: 0.4, III: 0.2, IV: 0.0       | I: 0.0, II: 0.5, III: 0.5, IV: 0.0     | 0.4080    | 0.8    | 0.3264
Here splits 5 and 6 tie for the maximum value of Φ(s|T); we break the tie arbitrarily and choose split 5 (Gender).
The resulting CART tree:

Occupation
|-- Management
|   |-- Age < 30  -> Level 3 (records 4, 6)
|   `-- Age >= 30 -> Level 4 (records 5, 7)
`-- Staff, Sales, Service
    |-- Male
    |   |-- Age < 30  -> Level 1 (records 2, 11)
    |   `-- Age >= 30 -> Level 2 (records 3, 9)
    `-- Female
        |-- Staff            -> Level 2 (record 10)
        `-- Sales or Service -> Level 3 (records 1, 8)

Decision Rules:-

Antecedent                                                | Consequent
----------------------------------------------------------|------------------------------
If Occ = Management & Age < 30                            | then Level 3 (records 4, 6)
If Occ = Management & Age >= 30                           | then Level 4 (records 5, 7)
If Occ = Staff/Sales/Service & Male & Age < 30            | then Level 1 (records 2, 11)
If Occ = Staff/Sales/Service & Male & Age >= 30           | then Level 2 (records 3, 9)
If Occ = Staff/Sales/Service & Female & Staff             | then Level 2 (record 10)
If Occ = Staff/Sales/Service & Female & Sales or Service  | then Level 3 (records 1, 8)
NOTE: Software implementation of CART from Rattle is shown at the end.
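Rattle drives R's rpart under the hood. As a rough stand-in outside R (my assumption, not the tool the assignment prescribes), scikit-learn's CART can be fit along the following lines; scikit-learn requires numeric inputs, so the categorical fields are one-hot encoded, which changes the candidate splits slightly.

    # Approximate the Rattle/rpart CART run with scikit-learn's CART.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.DataFrame({
        "occupation": ["Service", "Service", "Service", "Management",
                       "Management", "Management", "Management",
                       "Sales", "Sales", "Staff", "Staff"],
        "gender": ["Female", "Male", "Male", "Male", "Female", "Male",
                   "Female", "Female", "Male", "Female", "Male"],
        "age": [45, 25, 33, 25, 35, 26, 45, 40, 30, 50, 25],
        "level": [3, 1, 2, 3, 4, 3, 4, 3, 2, 2, 1],
    })
    X = pd.get_dummies(df[["occupation", "gender", "age"]])  # one-hot encode
    clf = DecisionTreeClassifier(criterion="gini", random_state=0)
    clf.fit(X, df["level"])
    print(export_text(clf, feature_names=list(X.columns)))  # text dump of the tree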
2.) Construct a C4.5 decision tree to classify salary based on the other variables. Do as much
as you can by hand, before turning to the software.
Ans.)
Candidate Split | Child Nodes
----------------|------------------------------------------------------------
1               | Occupation = Service; = Management; = Sales; = Staff
2               | Gender = Male; = Female
3               | Age ≤ 25; Age > 25
4               | Age ≤ 35; Age > 35
5               | Age ≤ 45; Age > 45
Entropy before splitting:

The four salary classes (Levels) and their proportions in the 11-record training set are:

Level 1: less than $35,000          P = 2/11
Level 2: ≥ $35,000 and < $45,000    P = 3/11
Level 3: ≥ $45,000 and < $55,000    P = 4/11
Level 4: ≥ $55,000                  P = 2/11

H(T) = −Σ_j p_j log2(p_j)
     = −(2/11)log2(2/11) − (3/11)log2(3/11) − (4/11)log2(4/11) − (2/11)log2(2/11) = 1.927 bits

Split on Occupation: P(Service) = 3/11; P(Management) = 4/11; P(Sales) = 2/11; P(Staff) = 2/11

Occupation -> Service:
P(<35,000) = 1/3; P(≥35,000 & <45,000) = 1/3; P(≥45,000 & <55,000) = 1/3; P(≥55,000) = 0
Occupation -> Management:
P(<35,000) = 0; P(≥35,000 & <45,000) = 0; P(≥45,000 & <55,000) = 1/2; P(≥55,000) = 1/2
Occupation -> Sales:
P(<35,000) = 0; P(≥35,000 & <45,000) = 1/2; P(≥45,000 & <55,000) = 1/2; P(≥55,000) = 0
Occupation -> Staff:
P(<35,000) = 1/2; P(≥35,000 & <45,000) = 1/2; P(≥45,000 & <55,000) = 0; P(≥55,000) = 0

Entropy of the four branches (taking 0·log2(0) = 0):

H(Service)    = −(1/3)log2(1/3) − (1/3)log2(1/3) − (1/3)log2(1/3) = 1.58
H(Management) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1
H(Sales)      = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1
H(Staff)      = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1

Combining the entropies of the four branches, weighted by the branch proportions P_i:

H_O(T) = Σ_{i=1..k} P_i · H_O(T_i) = (3/11)(1.58) + (4/11)(1) + (2/11)(1) + (2/11)(1) = 1.1572 bits

Therefore, the information gain from splitting on the Occupation attribute is H(T) − H_O(T) = 1.927 − 1.1572 = 0.7698 bits.

Split on Gender: P(Male) = 6/11; P(Female) = 5/11

Gender -> Male:
P(<35,000) = 1/3; P(≥35,000 & <45,000) = 1/3; P(≥45,000 & <55,000) = 1/3; P(≥55,000) = 0
Gender -> Female:
P(<35,000) = 0; P(≥35,000 & <45,000) = 1/5; P(≥45,000 & <55,000) = 2/5; P(≥55,000) = 2/5

Entropy of the two branches:

H(Male)   = −(1/3)log2(1/3) − (1/3)log2(1/3) − (1/3)log2(1/3) = 1.58
H(Female) = −(1/5)log2(1/5) − (2/5)log2(2/5) − (2/5)log2(2/5) = 1.52

H_G(T) = Σ_{i=1..k} P_i · H_G(T_i) = (6/11)(1.58) + (5/11)(1.52) = 1.55 bits

Therefore, the information gain from splitting on the Gender attribute is H(T) − H_G(T) = 1.927 − 1.55 = 0.377 bits.

Split on Age (≤ 25 vs. > 25): P(≤25) = 3/11; P(>25) = 8/11

Age -> ≤ 25:
P(<35,000) = 2/3; P(≥35,000 & <45,000) = 0; P(≥45,000 & <55,000) = 1/3; P(≥55,000) = 0
Age -> > 25:
P(<35,000) = 0; P(≥35,000 & <45,000) = 3/8; P(≥45,000 & <55,000) = 3/8; P(≥55,000) = 2/8

H(Age≤25) = −(2/3)log2(2/3) − (1/3)log2(1/3) = 0.908
H(Age>25) = −(3/8)log2(3/8) − (3/8)log2(3/8) − (2/8)log2(2/8) = 1.5575

H_Age(T) = (3/11)(0.908) + (8/11)(1.5575) = 1.38 bits

Therefore, the information gain from splitting on Age at 25 is H(T) − H_Age(T) = 1.927 − 1.38 = 0.547 bits.

Split on Age (≤ 35 vs. > 35): P(≤35) = 7/11; P(>35) = 4/11

Age -> ≤ 35:
P(<35,000) = 2/7; P(≥35,000 & <45,000) = 2/7; P(≥45,000 & <55,000) = 2/7; P(≥55,000) = 1/7
Age -> > 35:
P(<35,000) = 0; P(≥35,000 & <45,000) = 1/4; P(≥45,000 & <55,000) = 2/4; P(≥55,000) = 1/4

H(Age≤35) = −(2/7)log2(2/7) − (2/7)log2(2/7) − (2/7)log2(2/7) − (1/7)log2(1/7) = 1.93
H(Age>35) = −(1/4)log2(1/4) − (2/4)log2(2/4) − (1/4)log2(1/4) = 1.5

H_Age(T) = (7/11)(1.93) + (4/11)(1.5) = 1.77 bits

Therefore, the information gain from splitting on Age at 35 is H(T) − H_Age(T) = 1.927 − 1.77 = 0.157 bits.

Split on Age (≤ 45 vs. > 45): P(≤45) = 10/11; P(>45) = 1/11

Age -> ≤ 45:
P(<35,000) = 2/10; P(≥35,000 & <45,000) = 2/10; P(≥45,000 & <55,000) = 4/10; P(≥55,000) = 2/10
Age -> > 45:
Only one record (record 10, Level 2) has Age > 45, so P(≥35,000 & <45,000) = 1 and the other class proportions are 0.

H(Age≤45) = −(2/10)log2(2/10) − (2/10)log2(2/10) − (4/10)log2(4/10) − (2/10)log2(2/10) = 1.92
H(Age>45) = −1·log2(1) = 0

H_Age(T) = (10/11)(1.92) + (1/11)(0) = 1.745 bits

Therefore, the information gain from splitting on Age at 45 is H(T) − H_Age(T) = 1.927 − 1.745 = 0.182 bits.
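The sketch below (another hypothetical helper) recomputes H(T) and the information gain of each candidate split. Because it keeps full floating-point precision rather than the two- or three-digit rounding used in the hand calculations, its output can differ from the figures above in the second decimal place.

    # Recompute H(T) and gain(split) = H(T) - sum_i P_i * H(branch_i),
    # with 0*log2(0) taken as 0 (empty classes simply do not appear).
    from math import log2
    from collections import Counter

    records = [  # (occupation, gender, age, level)
        ("Service", "Female", 45, 3), ("Service", "Male", 25, 1),
        ("Service", "Male", 33, 2), ("Management", "Male", 25, 3),
        ("Management", "Female", 35, 4), ("Management", "Male", 26, 3),
        ("Management", "Female", 45, 4), ("Sales", "Female", 40, 3),
        ("Sales", "Male", 30, 2), ("Staff", "Female", 50, 2),
        ("Staff", "Male", 25, 1),
    ]

    def entropy(subset):
        counts = Counter(r[3] for r in subset)
        return -sum(c / len(subset) * log2(c / len(subset))
                    for c in counts.values())

    def gain(branch_key):
        branches = {}
        for r in records:
            branches.setdefault(branch_key(r), []).append(r)
        weighted = sum(len(b) / len(records) * entropy(b)
                       for b in branches.values())
        return entropy(records) - weighted

    print(f"H(T) = {entropy(records):.3f} bits")
    for name, key in [("Occupation", lambda r: r[0]),
                      ("Gender",     lambda r: r[1]),
                      ("Age<=25",    lambda r: r[2] <= 25),
                      ("Age<=35",    lambda r: r[2] <= 35),
                      ("Age<=45",    lambda r: r[2] <= 45)]:
        print(f"gain({name}) = {gain(key):.3f} bits")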
Candidate Split | Child Nodes                                       | Information Gain
----------------|---------------------------------------------------|-----------------
1               | Occupation = Service / Management / Sales / Staff | 0.7698 bits
2               | Gender = Male / Female                            | 0.377 bits
3               | Age ≤ 25 / Age > 25                               | 0.547 bits
4               | Age ≤ 35 / Age > 35                               | 0.157 bits
5               | Age ≤ 45 / Age > 45                               | 0.182 bits

Since splitting on Occupation yields the largest information gain, C4.5 places the Occupation split at the root.

C4.5 Tree:

Occupation
|-- Service
|   |-- Age ≤ 25 -> Level 1
|   `-- Age > 25
|       |-- Male   -> Level 2
|       `-- Female -> Level 3
|-- Management
|   |-- Age ≤ 25 -> Level 3
|   `-- Age > 25 -> Level 4
|-- Sales
|   `-- Age > 25
|       |-- Male   -> Level 2
|       `-- Female -> Level 3
`-- Staff
    |-- Age ≤ 25 -> Level 1
    `-- Age > 25 -> Level 2

Decision Rules:-
Antecedent                           | Consequent
-------------------------------------|--------------
If Occ = Service & Age ≤ 25          | then Level 1
If Occ = Service & Age > 25 & Male   | then Level 2
If Occ = Service & Age > 25 & Female | then Level 3
If Occ = Management & Age ≤ 25       | then Level 3
If Occ = Management & Age > 25       | then Level 4
If Occ = Sales & Age > 25 & Male     | then Level 2
If Occ = Sales & Age > 25 & Female   | then Level 3
If Occ = Staff & Age ≤ 25            | then Level 1
If Occ = Staff & Age > 25            | then Level 2
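As a sanity check (my own helper, not part of the assignment), the nine rules above can be encoded as a function and replayed against the eleven training records. On a data set this small the leaves need not be pure, so the check may report a disagreement.

    # Replay the C4.5 decision rules against the training records and
    # report any record whose predicted level disagrees with its actual one.
    def c45_level(occ, gender, age):
        if occ == "Service":
            return 1 if age <= 25 else (2 if gender == "Male" else 3)
        if occ == "Management":
            return 3 if age <= 25 else 4
        if occ == "Sales":  # both Sales records have Age > 25
            return 2 if gender == "Male" else 3
        return 1 if age <= 25 else 2  # Staff

    records = [  # (occupation, gender, age, level)
        ("Service", "Female", 45, 3), ("Service", "Male", 25, 1),
        ("Service", "Male", 33, 2), ("Management", "Male", 25, 3),
        ("Management", "Female", 35, 4), ("Management", "Male", 26, 3),
        ("Management", "Female", 45, 4), ("Sales", "Female", 40, 3),
        ("Sales", "Male", 30, 2), ("Staff", "Female", 50, 2),
        ("Staff", "Male", 25, 1),
    ]
    for occ, gender, age, actual in records:
        pred = c45_level(occ, gender, age)
        if pred != actual:
            print(f"{occ}/{gender}/{age}: predicted Level {pred}, "
                  f"actual Level {actual}")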
3) Compare the two decision trees and discuss the benefits and drawbacks of each.

Ans.) The C4.5 algorithm uses entropy to measure the impurity of a cluster. The higher the entropy (disorder) of a cluster, the more information is needed to describe it, and the less certain you can be about what it contains. As the probability of a single category in a cluster gets closer and closer to 1, the entropy H of the cluster gets closer and closer to 0, the disorder vanishes, and the cluster becomes most valuable in the sense that it contains a single category.
If the dependent variable is continuous, the CART algorithm is used to construct regression trees. For continuous variables there is no notion of a 'pure' partition; disorder is instead measured by variance. Homogeneous clusters are formed from items whose dependent-variable values are close to each other. The criterion is to form clusters with minimum within-cluster variance and well-separated cluster means; impurity and gain are measured by the pair (mean, SD).
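To make the variance criterion concrete, the sketch below (illustrative names of my own) scores a single candidate split, Age <= 30, on the salary values by the reduction in weighted within-node variance:

    # Variance-reduction score for one candidate split of a regression tree.
    def variance(ys):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys) / len(ys)

    def variance_reduction(parent, left, right):
        n = len(parent)
        return (variance(parent)
                - (len(left) / n) * variance(left)
                - (len(right) / n) * variance(right))

    salaries = [48000, 25000, 35000, 45000, 65000, 45000,
                70000, 50000, 40000, 40000, 25000]
    ages = [45, 25, 33, 25, 35, 26, 45, 40, 30, 50, 25]
    left = [s for s, a in zip(salaries, ages) if a <= 30]
    right = [s for s, a in zip(salaries, ages) if a > 30]
    print(f"variance reduction for Age<=30: "
          f"{variance_reduction(salaries, left, right):,.0f}")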
Note that C4.5 provides a separate branch for each field value, whereas CART is restricted to binary splits. A possible drawback of C4.5's strategy for splitting categorical variables is that it may lead to an overly bushy tree with many leaf nodes containing few records, as can be seen in the two trees above.
4.) Compare the two sets of decision rules and discuss the benefits and drawbacks of each.
Ans.) As can be seen from the two sets of decision rules above, the root nodes differ. In the C4.5 tree the root is a multiway split on Occupation whose first branch is Occupation = Service, while in the CART tree the root is the binary split Occupation = Management versus the rest. Moreover, the CART tree has fewer splits than the C4.5 tree. In the C4.5 rules the first rule's consequent is Level 1, while in CART it is Level 3.
CART uses the Gini diversity index as its splitting criterion, while C4.5 uses information-based criteria. CART prunes trees using a cost-complexity model whose parameters are estimated by cross-validation; C4.5 uses a single-pass pruning algorithm derived from binomial confidence limits.
When the tested attribute has an unknown value, CART looks for alternative (surrogate) tests that approximate the outcome, whereas C4.5 distributes such cases probabilistically among the outcomes.
CART Implementation

Fig. -> CART tree built with 100% of the data as the training set (figure not reproduced here).

Decision Rules:-
Antecedent                                                                    | Consequent
------------------------------------------------------------------------------|----------------------
If OCCUPATION = Management, Sales & GENDER = Female & OCCUPATION = Management | then SALARY = Level 4
If OCCUPATION = Management, Sales & GENDER = Female & OCCUPATION = Sales      | then SALARY = Level 3
If OCCUPATION = Management, Sales & GENDER = Male & OCCUPATION = Management   | then SALARY = Level 3
If OCCUPATION = Management, Sales & GENDER = Male & OCCUPATION = Sales        | then SALARY = Level 2
If OCCUPATION = Service, Staff & AGE >= 29 & OCCUPATION = Service & GENDER = Female | then SALARY = Level 3
If OCCUPATION = Service, Staff & AGE >= 29 & OCCUPATION = Service & GENDER = Male   | then SALARY = Level 2
If OCCUPATION = Service, Staff & AGE >= 29 & OCCUPATION = Staff               | then SALARY = Level 2
If OCCUPATION = Service, Staff & AGE < 29                                     | then SALARY = Level 1