
Tutorial 1: Solutions
ELEC5518: IoT for Critical Infrastructure Data Analytics
I. OBJECTIVES
In this tutorial, we work through examples of several data analytics techniques commonly used for big data.
EXAMPLE 1: K-MEANS CLUSTERING
Consider the following data set consisting of the scores of three variables (A, B, C) for each of seven subjects. This data set is to be grouped into two clusters. Use K-means clustering with initial centroids (2, 3, 2) and (4, 6, 4) to cluster the data.
Subject   A     B     C
1         1.0   1.0   3.0
2         1.5   2.0   2.5
3         3.0   4.0   1.0
4         5.0   7.0   4.5
5         3.5   5.0   4.0
6         4.5   5.0   2.5
7         3.5   4.5   4.5
Fig. 1. Example Data Set 1
Solution: To form each cluster, we need to calculate the distance from each data point to every centroid. Each data point is assigned to the cluster whose centroid is closest. For example, subject 1 at (1, 1, 3) is at distance √6 ≈ 2.45 from (2, 3, 2) and √35 ≈ 5.92 from (4, 6, 4), so it joins the first cluster. The clusters can be found as follows:
Step 1
Center point   Cluster members                                              Mean point (new center point)
(2, 3, 2)      (1, 1, 3), (1.5, 2, 2.5), (3, 4, 1)                          (1.83, 2.33, 2.17)
(4, 6, 4)      (5, 7, 4.5), (3.5, 5, 4), (4.5, 5, 2.5), (3.5, 4.5, 4.5)     (4.125, 5.375, 3.875)
Step 2
We repeat Step 1 with the new centroids to update the clusters.
Center point            Cluster members                                              Mean point (new center point)
(1.83, 2.33, 2.17)      (1, 1, 3), (1.5, 2, 2.5), (3, 4, 1)                          (1.83, 2.33, 2.17)
(4.125, 5.375, 3.875)   (5, 7, 4.5), (3.5, 5, 4), (4.5, 5, 2.5), (3.5, 4.5, 4.5)     (4.125, 5.375, 3.875)

After this step, there is no further change in the clusters, so the algorithm has converged.
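The two iterations above can be reproduced with a short script. Below is a minimal K-means sketch in Python using NumPy; the variable names (X, centroids, labels) are ours and not part of the tutorial:

```python
import numpy as np

# Data points (subjects 1-7) and the initial centroids given in the example.
X = np.array([[1.0, 1.0, 3.0], [1.5, 2.0, 2.5], [3.0, 4.0, 1.0],
              [5.0, 7.0, 4.5], [3.5, 5.0, 4.0], [4.5, 5.0, 2.5],
              [3.5, 4.5, 4.5]])
centroids = np.array([[2.0, 3.0, 2.0], [4.0, 6.0, 4.0]])

for step in range(100):
    # Assignment step: each point joins the cluster with the nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # no further change in the clusters
    centroids = new_centroids

print(labels)     # expected: [0 0 0 1 1 1 1]
print(centroids)  # expected: approx. (1.83, 2.33, 2.17) and (4.125, 5.375, 3.875)
```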

EXAMPLE 2: REGRESSION ANALYSIS
A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see the table below).
Observation Number   Temperature (xi)   Yield (yi)
1                    50                 122
2                    53                 118
3                    54                 128
4                    55                 121
5                    56                 125
6                    59                 136
7                    62                 144
8                    65                 142
9                    67                 149
10                   71                 161
11                   72                 167
12                   74                 168
13                   75                 162
14                   76                 171
15                   79                 175
16                   80                 182
17                   82                 180
18                   85                 183
19                   87                 188
20                   90                 200
21                   93                 194
22                   94                 206
23                   95                 207
24                   97                 210
25                   100                219

Fig. 2. Example Data Set 2 (also shown as a scatter plot of Yield versus Temperature).

Solution:
In regression analysis, we find a linear relationship between the inputs and the outputs. In this example, we have only one input (xi) and one output (yi), each measured 25 times. More specifically, we find α and β such that the line y = αx + β fits the data well. To find α and β, we minimize the sum of squared differences between the actual data and the estimated regression line:
min_{α,β} Σ_{i=1}^{N} (y_i − (α x_i + β))².   (1)

We take the derivatives of the function Δ(α, β) = Σ_{i=1}^{N} (y_i − (α x_i + β))², as follows:

∂Δ/∂α = −2 Σ_{i=1}^{N} x_i (y_i − (α x_i + β)),   (2)
∂Δ/∂β = −2 Σ_{i=1}^{N} (y_i − (α x_i + β)).   (3)

The minimum of Δ(α, β) is found by setting ∂Δ/∂α = ∂Δ/∂β = 0. Therefore, we have:

∂Δ/∂β = 0 ⇒ β = (1/N) Σ_{i=1}^{N} (y_i − α x_i),   (4)
∂Δ/∂α = 0 ⇒ Σ_{i=1}^{N} x_i (y_i − α x_i − β) = 0 ⇒ α = [Σ_{i=1}^{N} x_i y_i − (1/N) Σ_{i=1}^{N} x_i Σ_{i=1}^{N} y_i] / [Σ_{i=1}^{N} x_i² − (1/N) (Σ_{i=1}^{N} x_i)²].   (5)

For the above example, substituting the data into (4) and (5) gives α = 1.9952 and β = 17.0016. The resulting regression line and residuals are shown below.
[Scatter plot of Yield versus Temperature with the fitted line y = 1.9952 x + 17.0016, together with a plot of the residuals, which lie roughly between −10 and 10.]
Fig. 3. Regression line for Example 2.
The root mean square error can be found as follows:

e = √( (1/N) Σ_{i=1}^{N} (y_i − (α x_i + β))² ) = 3.8555.   (6)
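The closed-form estimates in (4)-(5) and the error in (6) can be checked numerically. Below is a minimal NumPy sketch (the array names x and y are ours); with the 25 observations of Fig. 2 it should return α ≈ 1.9952, β ≈ 17.00 and an RMSE of about 3.86:

```python
import numpy as np

# Temperature (x) and yield (y) observations from Example Data Set 2.
x = np.array([50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79,
              80, 82, 85, 87, 90, 93, 94, 95, 97, 100], dtype=float)
y = np.array([122, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168,
              162, 171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210,
              219], dtype=float)
N = len(x)

# Equation (5): slope alpha from the sums of the data.
alpha = (np.sum(x * y) - np.sum(x) * np.sum(y) / N) / (np.sum(x ** 2) - np.sum(x) ** 2 / N)
# Equation (4): intercept beta given alpha.
beta = np.mean(y - alpha * x)
# Equation (6): root mean square error of the residuals.
rmse = np.sqrt(np.mean((y - (alpha * x + beta)) ** 2))

print(alpha, beta, rmse)  # expected: approx. 1.9952, 17.0016, 3.8555
```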

EXAMPLE 3: DECISION TREE CLASSIFIERS
Consider the following data set. The task is to predict whether a certain player is going to play tennis on a given day. The table shows whether John played over a number of days, together with factors that might influence his decision to play. On day 15, the outlook is rain, the humidity is high, and the wind is weak. Determine whether John plays tennis on that day by finding the optimal decision tree for this data set.
TABLE I EXAMPLE 3 DATA SET
Day   Outlook    Humidity   Wind     Play
D1    Sunny      High       Weak     No
D2    Sunny      High       Strong   No
D3    Overcast   High       Weak     Yes
D4    Rain       High       Weak     Yes
D5    Rain       Normal     Weak     Yes
D6    Rain       Normal     Strong   No
D7    Overcast   Normal     Strong   Yes
D8    Sunny      High       Weak     No
D9    Sunny      Normal     Weak     Yes
D10   Rain       Normal     Weak     Yes
D11   Sunny      Normal     Strong   Yes
D12   Overcast   High       Strong   Yes
D13   Overcast   Normal     Weak     Yes
D14   Rain       High       Strong   No
Solution: A decision tree recursively splits the training data into subsets based on the value of a single attribute. Each split corresponds to a node in the tree. Splitting stops when every subset is pure (all elements belong to a single class); this can always be achieved unless there are duplicate training examples with different classes. The decision tree tries to understand when John plays: it splits the data into subsets until it reaches pure states.
To find the optimal decision tree, we need to find the attribute which has the highest information gain.
The entropy of the data set (9 Yes, 5 No), which is the starting point for every attribute, can be calculated as follows:
H_Outlook = H_Humidity = H_Wind = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94.   (7)
The conditional entropies can be found as follows:
H_{Sunny|Outlook} = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.9710,
H_{Overcast|Outlook} = −(0/4) log2(0/4) − (4/4) log2(4/4) = 0,
H_{Rain|Outlook} = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.9710,
H_{High|Humidity} = −(3/7) log2(3/7) − (4/7) log2(4/7) = 0.9852,
H_{Normal|Humidity} = −(6/7) log2(6/7) − (1/7) log2(1/7) = 0.5917,
H_{Weak|Wind} = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.8113,
H_{Strong|Wind} = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1,

where the class counts per attribute value are: Outlook (9 Yes / 5 No in total): Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rain 3 Yes / 2 No; Humidity: High 3 Yes / 4 No, Normal 6 Yes / 1 No; Wind: Weak 6 Yes / 2 No, Strong 3 Yes / 3 No.
The information gains can then be found as follows:
I_Outlook = H_Outlook − (5/14) H_{Sunny|Outlook} − (4/14) H_{Overcast|Outlook} − (5/14) H_{Rain|Outlook} = 0.2464,
I_Humidity = H_Humidity − (7/14) H_{High|Humidity} − (7/14) H_{Normal|Humidity} = 0.1515,
I_Wind = H_Wind − (8/14) H_{Weak|Wind} − (6/14) H_{Strong|Wind} = 0.0478,
which shows that the attribute "Outlook" has the highest information gain, so we start with it when building the decision tree. The Overcast state is a pure state, so we do not need to split it further. For the other states we need to find the best attribute to split on next (see Fig. 4).
Sunny (2 Yes / 3 No): D1 Sunny High Weak No; D2 Sunny High Strong No; D8 Sunny High Weak No; D9 Sunny Normal Weak Yes; D11 Sunny Normal Strong Yes.
Overcast (4 Yes / 0 No): D3 Overcast High Weak Yes; D7 Overcast Normal Strong Yes; D12 Overcast High Strong Yes; D13 Overcast Normal Weak Yes.
Rain (3 Yes / 2 No): D4 Rain High Weak Yes; D5 Rain Normal Weak Yes; D6 Rain Normal Strong No; D10 Rain Normal Weak Yes; D14 Rain High Strong No.

Fig. 4. First step of the decision tree: the root node (9 Yes / 5 No) is split on "Outlook".
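The root-node entropy (7) and the information gains above can be verified with a few lines of code. Below is a minimal Python sketch; the data rows follow Table I, and the helper names entropy and information_gain are ours:

```python
from math import log2

# (Outlook, Humidity, Wind, Play) rows of Table I.
DATA = [
    ("Sunny", "High", "Weak", "No"),        ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),      ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"), ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),     ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),   ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),  ("Rain", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Humidity": 1, "Wind": 2}

def entropy(rows):
    """Entropy of the Play label (last element) over a set of rows."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def information_gain(rows, attr):
    """Entropy of rows minus the weighted entropy after splitting on attr."""
    i = ATTRS[attr]
    gain = entropy(rows)
    for value in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

print(round(entropy(DATA), 2))  # expected: 0.94
for attr in ATTRS:
    print(attr, round(information_gain(DATA, attr), 4))
# expected: Outlook 0.2464, Humidity 0.1515, Wind 0.0478
```

Calling information_gain on only the Sunny rows or only the Rain rows reproduces the branch-level gains in (8) and (9) as well.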
In the "Sunny" state we have:
H_{High|Sunny|Outlook} = −(0/3) log2(0/3) − (3/3) log2(3/3) = 0,
H_{Normal|Sunny|Outlook} = −(2/2) log2(2/2) − (0/2) log2(0/2) = 0,
H_{Weak|Sunny|Outlook} = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183,
H_{Strong|Sunny|Outlook} = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1,

and the information gains can be found as follows:

I_{Humidity|Sunny} = H_{Sunny|Outlook} − (3/5) H_{High|Sunny|Outlook} − (2/5) H_{Normal|Sunny|Outlook} = 0.9710,
I_{Wind|Sunny} = H_{Sunny|Outlook} − (3/5) H_{Weak|Sunny|Outlook} − (2/5) H_{Strong|Sunny|Outlook} = 0.02,   (8)
which means that after the state “Sunny”, we have to branch the tree using attribute “Humidity”.

At state “Rain”, we have
H_{High|Rain|Outlook} = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1,
H_{Normal|Rain|Outlook} = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183,
H_{Weak|Rain|Outlook} = −(3/3) log2(3/3) − (0/3) log2(0/3) = 0,
H_{Strong|Rain|Outlook} = −(0/2) log2(0/2) − (2/2) log2(2/2) = 0,

and the information gains can be found as follows:

I_{Humidity|Rain} = H_{Rain|Outlook} − (2/5) H_{High|Rain|Outlook} − (3/5) H_{Normal|Rain|Outlook} = 0.02,
I_{Wind|Rain} = H_{Rain|Outlook} − (3/5) H_{Weak|Rain|Outlook} − (2/5) H_{Strong|Rain|Outlook} = 0.9710.   (9)
which means that after the state “Rain” we have to branch the tree using attribute “Wind”. The complete decision tree can then be shown as follows:
Outlook (9 Yes / 5 No)
- Sunny (2 Yes / 3 No) → split on Humidity:
    High → NO (D1 Sunny High Weak No; D2 Sunny High Strong No; D8 Sunny High Weak No)
    Normal → YES (D9 Sunny Normal Weak Yes; D11 Sunny Normal Strong Yes)
- Overcast (4 Yes / 0 No) → YES (D3 Overcast High Weak Yes; D7 Overcast Normal Strong Yes; D12 Overcast High Strong Yes; D13 Overcast Normal Weak Yes)
- Rain (3 Yes / 2 No) → split on Wind:
    Weak → YES (D4 Rain High Weak Yes; D5 Rain Normal Weak Yes; D10 Rain Normal Weak Yes)
    Strong → NO (D6 Rain Normal Strong No; D14 Rain High Strong No)

Fig. 5. Second Step of Decision Tree (the complete tree).

For day 15 (Outlook = Rain, Humidity = High, Wind = Weak), the tree follows the Rain branch and then the Weak-wind leaf, so the prediction is YES: John plays tennis on day 15 (D15 Rain High Weak → YES).
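Reading Fig. 5 as a set of rules gives a direct way to classify day 15 (or any new day). Below is a minimal sketch of those rules as a Python function; the function name play_tennis is ours:

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Classify a day using the decision tree of Fig. 5."""
    if outlook == "Overcast":
        return "Yes"                                    # pure leaf: 4 Yes / 0 No
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"  # split on Humidity
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"        # split on Wind
    raise ValueError(f"unknown outlook value: {outlook}")

# Day 15: Rain outlook, High humidity, Weak wind -> Rain/Weak leaf.
print(play_tennis("Rain", "High", "Weak"))  # expected: Yes
```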