Chapter 5
Data Analytics
In this chapter, we review some of the most common data analytic techniques. We particularly review clustering, regression, and decision tree analysis, and provide several examples of each technique.
5.1 Overview of Data Analytic Techniques
Several techniques have so far been proposed for analysing data. They are, however, application specific, which means that depending on the data and the information we want to extract from it, the analysis might be different. Table 5.1 shows different data analytic techniques and their applications in solving different problems. In the following, we briefly explain some of these techniques and provide several examples for each of them.
Table 5.1: Data analytic techniques.

The problem to solve | The category of techniques | Algorithms
I want to group items by similarity; I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear regression, Logistic regression
I want to assign (know) labels to objects | Classification | Naive Bayes, Decision Trees
I want to find the structure in a temporal process; I want to forecast the behaviour of a temporal process | Time series analysis | ACF, PACF, ARIMA
I want to analyse my text data | Text analysis | Regular expressions, document representation (Bag of Words), TF-IDF
5.2 K-means Clustering
Clustering techniques are usually used to categorize the data according to some commonalities. For example, clustering is used to group documents by topic, or to group customers by their purchase patterns. The simple rule for clustering is that the items in a cluster are more similar to each other than they are to items in other clusters. The notion of similarity in this context needs to be carefully characterized and is data dependent. Alternatively, a distance measure, which can be viewed as the inverse of similarity, is used to cluster the data; this is much more suitable when the data are quantitative.
Clustering techniques are not predictive, which means that they cannot be used to predict future data. They are more suitable for finding similarities and relationships. Here, we explain K-means clustering, which is commonly used for clustering numerical data, usually a set of measurements.
In this method, a distance metric has to be defined over the variable space.
A distance function has the following properties:
• Distance is not negative (it is an absolute value),
• the distance from a record to itself is zero,
• the distance from record i to record j is the same as the distance from record j to record i,
• the distance between two records cannot be greater than the sum of the distances between each record and a third record.
Euclidean distance is usually used as the distance metric for N-dimensional data. Let x1 = (x_{1,1}, x_{1,2}, · · · , x_{1,N}) and x2 = (x_{2,1}, x_{2,2}, · · · , x_{2,N}) denote two N-dimensional data points. The Euclidean distance between these points is defined as follows:
d_{1,2} = \sqrt{ \sum_{i=1}^{N} (x_{1,i} - x_{2,i})^2 }.   (5.1)
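For instance, equation (5.1) translates directly into code. The following short Python function is a minimal illustration (the sample points are arbitrary):

import math

def euclidean(x1, x2):
    # Equation (5.1): square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(euclidean((0, 0), (3, 4)))  # prints 5.0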
The output of K-means clustering is the center of each cluster and the assignment of the input data points to the clusters. The center of each cluster is called the centroid. K-means clustering finds the k cluster centers and assigns the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.
5.2.1 Use cases
K-means clustering is often used as an exploratory technique to discover structure in the data and summarize the properties of each cluster. Examples include but are not limited to:
• Biology: The height, weight and average lifespan of animals,
• Markets: Household income, yearly purchase amount in dollars, number of household members,
• Medical: Patient record with measures of BMI, HBA1C, HDL
• Image/Video Processing: Analyze similarity of pixels between video frames.
5.2.2 Algorithm
The K-means clustering algorithm works as follows:
Step 1. Choose the number of clusters, K,
Step 2. Select K random centroids, c_1^{(0)}, · · · , c_K^{(0)},
Step 3. Compute the distance from each data point x_i to each centroid.
Step 4. Assign each point to the closest centroid (the ith cluster is denoted by C_i).
Step 5. Compute the center of mass (centroid) of the newly defined cluster as follows
c_j^{(1)} = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{j,i}^{(0)},   (5.2)

where m_j is the number of data points in the jth cluster.
Step 6. Repeat Step 3, 4, and 5 until record assignments no longer change.
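The steps above can be implemented in a few lines. The following Python sketch (assuming NumPy is available; the function name and defaults are illustrative, not a standard API) follows Steps 1-6 with randomly chosen initial centroids:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """A basic implementation of Steps 1-6 for an (n, d) data array X."""
    rng = np.random.default_rng(seed)
    # Step 2: use K randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: distances from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its closest centroid.
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: recompute each centroid as the mean of its assigned points.
        for j in range(K):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Example call: centroids, labels = kmeans(np.array(points), K=3)

In the worked example below, specific initial centroids are chosen by hand, so reproducing the intermediate tables exactly requires substituting those centroids for the random initialization.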
Example
Let us consider the following data set,
{(1.4033, -0.0124), (2.3754, 3.2762), (-0.6941, 1.5441), (1.6466, 0.9527),
(1.2391, 1.5361), (0.0192, 0.8463), (0.6748, 0.9069), (1.2570, 2.1173), (3.6838, 2.0568), (3.0771, 2.0629), (-0.6643, -0.5558), (-1.6037, -1.5735), (-0.6414, -1.5344), (-0.1849, -1.4047), (-0.7556, -2.4721), (-0.4827, -0.2808), (-0.6366, -0.8374), (-1.1517, -1.3775), (-0.8531, -0.3149), (-1.3936, -1.8558)},
which can be visualized as follows:
[Figure 5.1: Randomly generated data.]
For K = 3, we randomly choose 3 centroids. We choose c1 = (-1, 1), c2 = (1, 1), and c3 = (3, 3). We calculate the distance from each point to each centroid. The results are listed in the following table:
Data point | Distance to c1 | Distance to c2 | Distance to c3
x1 | 2.6078 | 1.0898 | 3.4094
x2 | 4.0712 | 2.6595 | 0.6829
x3 | 0.6241 | 1.7794 | 3.9707
x4 | 2.6471 | 0.6484 | 2.4542
x5 | 2.3023 | 0.5870 | 2.2900
x6 | 1.0308 | 0.9927 | 3.6774
x7 | 1.6774 | 0.3383 | 3.1285
x8 | 2.5184 | 1.1464 | 1.9538
x9 | 4.8015 | 2.8844 | 1.1650
x10 | 4.2133 | 2.3332 | 0.9403
x11 | 1.5916 | 2.2782 | 5.1059
x12 | 2.6434 | 3.6610 | 6.4894
x13 | 2.5597 | 3.0195 | 5.8156
x14 | 2.5391 | 2.6808 | 5.4356
x15 | 3.4807 | 3.8907 | 6.6369
x16 | 1.3813 | 1.9593 | 4.7846
x17 | 1.8730 | 2.4606 | 5.2868
x18 | 2.3823 | 3.2066 | 6.0332
x19 | 1.3230 | 2.2722 | 5.0827
x20 | 2.8828 | 3.7262 | 6.5485
The clusters can then be formed by assigning each point to the closest centroid:

Cluster | Members | New centroid
C1 | x3, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20 | (-0.8238, -0.9694)
C2 | x1, x4, x5, x6, x7, x8 | (1.0400, 1.0578)
C3 | x2, x9, x10 | (3.0454, 2.4653)
We then recalculate the distance from each point to the new centroids. The results are listed below:
Data point | Distance to c1 | Distance to c2 | Distance to c3
x1 | 2.4239 | 1.1302 | 2.9725
x2 | 5.3160 | 2.5893 | 1.0519
x3 | 2.5167 | 1.8010 | 3.8514
x4 | 3.1301 | 0.6157 | 2.0602
x5 | 3.2454 | 0.5180 | 2.0314
x6 | 2.0018 | 1.0424 | 3.4321
x7 | 2.4013 | 0.3951 | 2.8370
x8 | 3.7225 | 1.0815 | 1.8220
x9 | 5.4292 | 2.8262 | 0.7579
x10 | 4.9408 | 2.2715 | 0.4036
x11 | 0.4433 | 2.3469 | 4.7842
x12 | 0.9866 | 3.7300 | 6.1585
x13 | 0.5938 | 3.0898 | 5.4397
x14 | 0.7731 | 2.7504 | 5.0410
x15 | 1.5043 | 3.9604 | 6.2310
x16 | 0.7684 | 2.0274 | 4.4708
x17 | 0.2290 | 2.5303 | 4.9462
x18 | 0.5235 | 3.2763 | 5.6906
x19 | 0.6552 | 2.3383 | 4.7883
x20 | 1.0538 | 3.7962 | 6.1949

The new clusters are formed by assigning each point to the closest centroid:
Cluster | Members | New centroid
C1 | x11, x12, x13, x14, x15, x16, x17, x18, x19, x20 | (-0.8367, -1.2207)
C2 | x1, x3, x4, x5, x6, x7, x8 | (0.7923, 1.1273)
C3 | x2, x9, x10 | (3.0454, 2.4653)
After this, the clusters and centroids no longer change. The results of the clustering are shown in Figure 5.2.
[Figure 5.2: Clustering of the randomly generated data: the three cluster assignments (Cluster 1, Cluster 2, Cluster 3) and their centroids.]
5.2.3 Finding K
In the K-means clustering algorithm, the number of clusters K is usually determined so as to minimize the within-cluster sum of squares (WSS), which is defined as follows:
\mathrm{WSS} = \sum_{i=1}^{K} \sum_{j=1}^{m_i} |x_{i,j} - c_i|^2,   (5.3)
where x_{i,j} is the jth point in the ith cluster. It is clear that when K is set to N, the number of data points, WSS becomes 0. In practice, we need to run K-means clustering several times to find a suitable value for K. When choosing K, the following rules should be applied:
Do you have any clusters with few data points? | Try decreasing the value of K
Are there splits on variables that you would expect, but don't see? | Try increasing the value of K
Do any of the centroids seem too close to each other? | Try decreasing the value of K
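In practice, this search for K typically amounts to computing WSS for a range of K values and looking for the "elbow" where the decrease flattens out. A minimal Python sketch (assuming NumPy, and a kmeans function such as the one sketched in Section 5.2.2) is:

import numpy as np

def wss(X, centroids, labels):
    # Within-cluster sum of squares, equation (5.3).
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

# Illustrative usage (kmeans as in the Section 5.2.2 sketch):
# scores = {}
# for K in range(1, 11):
#     centroids, labels = kmeans(X, K)
#     scores[K] = wss(X, centroids, labels)
# Plotting K against scores[K] and looking for the "elbow" suggests a suitable K.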
5.2.4 Pros and Cons of K-means Clustering
K-means clustering is very easy to implement, and there are several computationally efficient implementations, which allows for quick analysis of big data. It is also scalable, which means that it is easy to assign new data to existing clusters, as we only need to find the closest centroid. The output of this technique is very concise: the coordinates of the K cluster centers.
K-means clustering, however, is not suitable for categorical variables, and it is very sensitive to the initialization (i.e., the first random centroids). Variables should all be measured on similar or compatible scales to have a meaningful distance calculation. The number of clusters must be known or decided a priori, and a wrong guess may result in a poor clustering. Also, for big data, K-means clustering tends to produce equi-sized clusters, which is not always desirable.
5.3 Regression Analysis
Regression focuses on the relationship between an outcome and its input variables. It provides an estimate of the outcome based on the input values and models how changes in the input variables affect the outcome. Possible use cases include but are not limited to:
• Estimate the lifetime value (LTV) of a customer and understand what influences LTV
• Estimate the probability that a loan will default and understand what leads to default
5.3.1 Linear Regression
Linear regression is used to estimate a continuous value as a linear (additive) function of other variables. For example, the income as a function of years of education, age, and gender or the house sales price as a function of square footage, number of bedrooms/bathrooms, and lot size. For this analysis, the outcome variable is continuous but the input variables can be continuous or discrete.
The output of this model is a set of estimated coefficients that indicate the relative impact of each input variable, together with a linear expression for estimating the outcome as a function of the input variables.
In linear regression analysis, we choose coefficients b0, b1, · · · , bp−1, in order to minimize the following expression
\sum_{i=1}^{N} \left( y_i - (b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_{p-1} x_{i,p-1}) \right)^2,   (5.4)
where x_{i,j} are the inputs. In the following example, we use regression analysis to estimate income as a linear function of age, education, and gender.
Example
Let us consider the following data set for 6 customers:
ID | Income | Age | Education | Gender
1 | 113 | 69 | 12 | 1
2 | 91 | 52 | 18 | 0
3 | 121 | 65 | 14 | 0
4 | 81 | 58 | 12 | 0
5 | 68 | 31 | 16 | 1
6 | 92 | 51 | 15 | 1
The income can be represented as a function of other variables as follows:
Income = f(Age, Education, Gender) = b_0 + b_1 × Age + b_2 × Education + b_3 × Gender.   (5.5)
For this data set, the best linear fit is obtained when b_0 = 50.4682, b_1 = 1.7172, b_2 = 3.2703, and b_3 = 8.1611. We can then interpret the results as follows:

• b_2 = 3.2703: an extra 3270.3 AUD per annum for each level of education,
• b_3 = 8.1611: men earn an extra 8161.1 AUD per annum compared with women,
• b_1 = 1.7172: an extra 1717.2 AUD per annum if you get one year older!
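As an illustration, the model in (5.5) can be fitted by ordinary least squares. The following Python sketch (assuming NumPy) builds the design matrix for the six customers above and minimizes (5.4); it only illustrates the computation and is not necessarily the exact procedure used to obtain the coefficients quoted above.

import numpy as np

# Columns taken from the table above: Income, Age, Education, Gender.
income = np.array([113, 91, 121, 81, 68, 92], dtype=float)
age    = np.array([ 69, 52,  65, 58, 31, 51], dtype=float)
edu    = np.array([ 12, 18,  14, 12, 16, 15], dtype=float)
gender = np.array([  1,  0,   0,  0,  1,  1], dtype=float)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(income), age, edu, gender])

# Least-squares solution of (5.4): b = argmin ||income - X b||^2.
b, residuals, rank, sv = np.linalg.lstsq(X, income, rcond=None)
print(b)  # [b0, b1, b2, b3]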
5.3.2 Special Case: 1-input 1-Output
In regression analysis, we find a linear relationship between the inputs and the outputs. In this special case, we have only one input (x_i) and one output (y_i), which have been measured N times. More specifically, we find α and β such that the line y = αx + β fits the data well. In order to find α and β, we minimize the (squared) Euclidean distance between the actual data and the estimated regression line:

\min_{\alpha, \beta} \sum_{i=1}^{N} (y_i - (\alpha x_i + \beta))^2.   (5.6)

We take the derivatives of the function \Delta(\alpha, \beta) = \sum_{i=1}^{N} (y_i - (\alpha x_i + \beta))^2 as follows:

\frac{\partial \Delta}{\partial \alpha} = -2 \sum_{i=1}^{N} x_i (y_i - (\alpha x_i + \beta)),   (5.7)

\frac{\partial \Delta}{\partial \beta} = -2 \sum_{i=1}^{N} (y_i - (\alpha x_i + \beta)).   (5.8)

The minimum of \Delta(\alpha, \beta) is found by setting \frac{\partial \Delta}{\partial \alpha} = \frac{\partial \Delta}{\partial \beta} = 0. Therefore, we have:

\frac{\partial \Delta}{\partial \beta} = 0 \Rightarrow \sum_{i=1}^{N} (y_i - \alpha x_i) - N\beta = 0 \Rightarrow \beta = \frac{1}{N} \sum_{i=1}^{N} (y_i - \alpha x_i),   (5.9)

\frac{\partial \Delta}{\partial \alpha} = 0 \Rightarrow \alpha = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N} \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2}.   (5.10)

The root mean square error can be found as follows:

e = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - (\alpha x_i + \beta))^2 }.   (5.11)
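The closed-form expressions (5.9)-(5.11) are easy to implement. The following Python sketch (the function name and the sample points are illustrative) computes α, β, and the root mean square error for a given data set:

def simple_linear_fit(x, y):
    """Return (alpha, beta, rmse) for the line y = alpha*x + beta, using (5.9)-(5.11)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    alpha = (sxy - sx * sy / n) / (sxx - sx * sx / n)   # equation (5.10)
    beta = (sy - alpha * sx) / n                        # equation (5.9)
    rmse = (sum((yi - (alpha * xi + beta)) ** 2
                for xi, yi in zip(x, y)) / n) ** 0.5    # equation (5.11)
    return alpha, beta, rmse

# Sanity check on points that lie exactly on y = 2x + 1:
print(simple_linear_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0, 0.0)

Applying the same function to the temperature/yield data of the next example should reproduce, up to rounding, the values of α and β reported there.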
Example
A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see the table below).
Observation Number | Temperature (x_i) | Yield (y_i)
1 | 50 | 122
2 | 53 | 118
3 | 54 | 128
4 | 55 | 121
5 | 56 | 125
6 | 59 | 136
7 | 62 | 144
8 | 65 | 142
9 | 67 | 149
10 | 71 | 161
11 | 72 | 167
12 | 74 | 168
13 | 75 | 162
14 | 76 | 171
15 | 79 | 175
16 | 80 | 182
17 | 82 | 180
18 | 85 | 183
19 | 87 | 188
20 | 90 | 200
21 | 93 | 194
22 | 94 | 206
23 | 95 | 207
24 | 97 | 210
25 | 100 | 219
[Figure 5.3: Example Data Set 2 (yield versus temperature).]
For the above example, by substituting the data into (5.9) and (5.10), we obtain α = 1.9952 and β = 17.0016. The resulting regression line and the residuals are shown in Figure 5.4.

[Figure 5.4: Regression line y = 1.9952x + 17.0016 and the residuals for Example 2.]

5.3.3 Pros and Cons of Regression Analysis
Regression analysis provides a concise relationship between the inputs and the outputs. It is robust to redundant or correlated variables. We can also interpret the coefficients in order to find the relative impact of each variable on the outcome.
It does not, however, handle missing values well. Regression analysis assumes that each variable affects the outcome linearly and additively, which is not always correct. It also does not easily handle variables that affect the outcome in a discontinuous way, and it does not work well with categorical attributes that have a lot of distinct values, such as ZIP codes.
5.4 Classifiers
Classification analysis tries to assign labels to objects. A common approach is the decision tree algorithm, which is used for classification; it returns probability scores of class membership and assigns a label based on the highest-scoring class. Some decision tree algorithms simply return the most likely class. The inputs of this algorithm are variables that can be continuous or discrete, and the outputs are: 1) a tree that describes the decision flow, 2) leaf nodes that return either a probability score or simply a classification, and 3) a set of "decision rules", e.g., "IF income < $50,000 AND mortgage > $100K THEN default = T with 75% probability".
Before we explain decision tree analysis, we need to introduce some background in information theory.
5.4.1 Information Theory: A quick review
Consider that you want to represent an event R which has 4 possible outcomes, A, B, C, and D, which are equally likely, i.e., p(A) = p(B) = p(C) = p(D) = 1/4.
It is possible to represent the outcome of this event by 2 bits only, that is, A, B, C, and D are respectively represented by 00, 01, 10, and 11. Then, if we want to communicate a sequence of events, we can send the bit representation instead. For example, BABBCD can be sent by sending 010001011011.
Now, let us consider the same event, but where the outcomes are not equally probable. More specifically, p(A) = 1/2, p(B) = 1/4, and p(C) = p(D) = 1/8. We can still use 2 bits to represent the outcome of this event. However, can we do better than this? It is possible to devise a coding for this event that only uses 1.75 bits on average! But how?
Let us use the following code to represent the outcomes:
Outcome | Code
A | 0
B | 10
C | 110
D | 111
Then every sequence of events can be uniquely represented by this code. For example, the sequence 110001000 corresponds to CAABAA. The rule of thumb for this coding is that the outcomes which are more probable are represented with fewer bits!
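A small Python sketch (purely illustrative) of encoding and decoding with this prefix code, and of its average length under p(A) = 1/2, p(B) = 1/4, p(C) = p(D) = 1/8:

code = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(symbols):
    return "".join(code[s] for s in symbols)

def decode(bits):
    inverse, out, buf = {v: k for k, v in code.items()}, [], ""
    for b in bits:
        buf += b
        if buf in inverse:       # prefix-free: no codeword is a prefix of another
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(encode("CAABAA"))          # 110001000
print(decode("110001000"))       # CAABAA, as in the text

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
print(sum(p[s] * len(code[s]) for s in code))   # 1.75 bits per symbol on average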
5.4.2 Entropy
C. E. Shannon introduced the notion of entropy into the field of communications in 1948. Entropy is a measure of the information that a random variable carries. Suppose that X is an event with N possible outcomes, x1, x2, · · ·, xN, with probabilities p1, p2, · · · , pN, respectively, where \sum_{i=1}^{N} p_i = 1. The entropy of X is defined by:
H(X) = -\sum_{i=1}^{N} p_i \log_2(p_i).   (5.12)
In a nutshell, high entropy corresponds to more unpredictable events, e.g., events with a uniform distribution, while low entropy means that the event follows a varied (peaks and valleys) distribution, in which most outcomes are predictable.
Shannon proved that the smallest possible number of bits, on average, needed to represent a stream of the outcomes of event X equals H(X). Let us go back to event R with 4 possible outcomes A, B, C, and D. When all the outcomes are equally probable, we have
H(R) = -4 × (1/4) \log_2(1/4) = 2,   (5.13)
and when p(A) = 1/2, p(B) = 1/4, and p(C) = p(D) = 1/8, the entropy of R is given by:

H(R) = -(1/2) \log_2(1/2) - (1/4) \log_2(1/4) - 2 × (1/8) \log_2(1/8) = 1.75.   (5.14)
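The entropy formula (5.12) is straightforward to compute. The following Python sketch reproduces the two values above:

import math

def entropy(probs):
    # H(X) = -sum_i p_i log2(p_i), equation (5.12); terms with p_i = 0 contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0, as in (5.13)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75, as in (5.14)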
5.4.3 Conditional Entropy
Let X and Y denote two random variables. The conditional entropy of X given Y, is defined as follows:
H(X|Y) = \sum_{y} p(Y = y) H(X|Y = y)
= -\sum_{y} \sum_{x} p(Y = y) p(X = x|Y = y) \log_2(p(X = x|Y = y)).   (5.15)
Conditional entropy can be interpreted as the average number of bits needed to represent an event X when the outcome of event Y is known. Let us consider the following events and outcomes:
X | Y
Math | Yes
History | No
CS | Yes
Math | No
Math | No
CS | Yes
History | No
Math | Yes
The entropy of X and Y can be calculated as follows:
H(X) = -(4/8) \log_2(4/8) - (2/8) \log_2(2/8) - (2/8) \log_2(2/8) = 1.5,   (5.16)

H(Y) = -(4/8) \log_2(4/8) - (4/8) \log_2(4/8) = 1.   (5.17)
The conditional entropies can be found as follows:
H(X|Y) = p(Y = Yes) H(X|Y = Yes) + p(Y = No) H(X|Y = No)
= (1/2) [-(2/4) \log_2(2/4) - (2/4) \log_2(2/4)] + (1/2) [-(2/4) \log_2(2/4) - (2/4) \log_2(2/4)]
= 1,   (5.18)
H(Y|X) = p(X = Math) H(Y|X = Math) + p(X = CS) H(Y|X = CS) + p(X = History) H(Y|X = History)
= (4/8) [-(2/4) \log_2(2/4) - (2/4) \log_2(2/4)] + (2/8) [-(2/2) \log_2(2/2) - 0] + (2/8) [-(2/2) \log_2(2/2) - 0]
= 0.5.   (5.19)
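These calculations can also be checked numerically. The following Python sketch (the helper names are illustrative) computes H(X), H(Y), H(X|Y), and H(Y|X) from the table of outcomes above:

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def H(values):
    counts = Counter(values)
    n = len(values)
    return entropy([c / n for c in counts.values()])

def H_cond(xs, ys):
    # H(X|Y) = sum_y p(Y = y) H(X | Y = y), equation (5.15).
    n = len(xs)
    return sum((ys.count(y) / n) * H([x for x, yy in zip(xs, ys) if yy == y])
               for y in set(ys))

pairs = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
         ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

print(H(xs), H(ys))            # 1.5 and 1.0, as in (5.16) and (5.17)
print(H_cond(xs, ys))          # H(X|Y) = 1.0, as in (5.18)
print(H_cond(ys, xs))          # H(Y|X) = 0.5, as in (5.19)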
5.4.4 Information Gain
Consider events X and Y. The information gain between X and Y is the reduction in uncertainty about X that we obtain by learning the outcome of Y. It is explicitly defined as follows:
I(X|Y ) = H(X) − H(X|Y ). (5.20)
For the above example, the information gain is I(X|Y ) = I(Y |X) = 0.5. If you are going to collect information from someone (e.g. asking questions sequentially in a decision tree), the “best” question is the one with the highest information gain.
Exercise
Prove that I(X|Y ) = I(Y |X).
5.4.5 The Decision Tree Algorithms
Suppose that we have a data set S, which has K attributes A1, A2, · · · , AK. This data set is the training set used to construct the decision tree and find the decision rules. We choose the most informative attribute, Am, and partition S according to the values of Am.
The most informative attribute is the one which has the highest information gain. After each partitioning, the resulting subsets are partitioned further until pure nodes are reached. A pure node is a node that has only one outcome (and therefore zero information gain!). We explain the decision tree algorithm in the following example.
Consider the data set in Table 5.2. The task is to predict whether a certain player is going to play tennis on a given day. The table shows how John has played over a number of days, together with factors that might influence John's decision to play tennis. On day 15, it is raining, the humidity is high, and the wind is weak. Determine whether John will play tennis or not, and find the optimal decision tree for this data set.
A decision tree recursively splits the training data into subsets based on the value of a single attribute. Each split corresponds to a node in the tree. Splitting stops when every subset is pure (all elements belong to a single class); this can always be achieved, unless there are duplicate training examples with different classes. The decision tree tries to capture when John plays: it splits the data into subsets until it reaches pure states.
To find the optimal decision tree, we need to find the attribute which has the highest information gain.
Table 5.2: Example 3 data set

Day | Outlook | Humidity | Wind | Play
D1 | Sunny | High | Weak | No
D2 | Sunny | High | Strong | No
D3 | Overcast | High | Weak | Yes
D4 | Rain | High | Weak | Yes
D5 | Rain | Normal | Weak | Yes
D6 | Rain | Normal | Strong | No
D7 | Overcast | Normal | Strong | Yes
D8 | Sunny | High | Weak | No
D9 | Sunny | Normal | Weak | Yes
D10 | Rain | Normal | Weak | Yes
D11 | Sunny | Normal | Strong | Yes
D12 | Overcast | High | Strong | Yes
D13 | Overcast | Normal | Weak | Yes
D14 | Rain | High | Strong | No
Attribute (9 Yes / 5 No overall) | Value | Yes | No
Outlook | Sunny | 2 | 3
Outlook | Overcast | 4 | 0
Outlook | Rain | 3 | 2
Humidity | High | 3 | 4
Humidity | Normal | 6 | 1
Wind | Weak | 6 | 2
Wind | Strong | 3 | 2
The entropy for each attribute can be calculated as follows:
H_{Outlook} = H_{Humidity} = H_{Wind} = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.94.   (5.21)
The conditional entropies can be found as follows:
H_{Sunny|Outlook} = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5) = 0.9710,
H_{Overcast|Outlook} = -(0/4) \log_2(0/4) - (4/4) \log_2(4/4) = 0,
H_{Rain|Outlook} = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5) = 0.9710,
H_{High|Humidity} = -(3/7) \log_2(3/7) - (4/7) \log_2(4/7) = 0.9852,
H_{Normal|Humidity} = -(6/7) \log_2(6/7) - (1/7) \log_2(1/7) = 0.5917,
H_{Weak|Wind} = -(6/8) \log_2(6/8) - (2/8) \log_2(2/8) = 0.8113,
H_{Strong|Wind} = -(3/6) \log_2(3/6) - (3/6) \log_2(3/6) = 1.
The information gains can then be found as follows:
I_{Outlook} = H_{Outlook} - (5/14) H_{Sunny|Outlook} - (4/14) H_{Overcast|Outlook} - (5/14) H_{Rain|Outlook} = 0.2464,
I_{Humidity} = H_{Humidity} - (7/14) H_{High|Humidity} - (7/14) H_{Normal|Humidity} = 0.1515,
I_{Wind} = H_{Wind} - (8/14) H_{Weak|Wind} - (6/14) H_{Strong|Wind} = 0.0478,
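The same computation can be carried out in code. The following Python sketch (the helper names are illustrative) computes the information gain of each attribute for the data in Table 5.2 and should reproduce the three values above:

import math
from collections import Counter

# Rows of Table 5.2: (Outlook, Humidity, Wind, Play).
data = [
    ("Sunny", "High", "Weak", "No"),         ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),     ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),       ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"), ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),      ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),    ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),   ("Rain", "High", "Strong", "No"),
]

def entropy_of(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr_index, label_index=3):
    labels = [r[label_index] for r in rows]
    gain = entropy_of(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        gain -= (len(subset) / len(rows)) * entropy_of(subset)
    return gain

for name, index in [("Outlook", 0), ("Humidity", 1), ("Wind", 2)]:
    print(name, round(info_gain(data, index), 4))
# Outlook 0.2464, Humidity 0.1515, Wind 0.0478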
[Figure 5.5: First step of the decision tree. The root node (9 Yes / 5 No) is split on Outlook into Sunny (2 Yes / 3 No: D1, D2, D8, D9, D11), Overcast (4 Yes / 0 No: D3, D7, D12, D13), and Rain (3 Yes / 2 No: D4, D5, D6, D10, D14).]
These values show that the attribute "Outlook" has the highest information gain, so we start with it for building the decision tree. The Overcast state is a pure state, so we do not need to split it further. For the other states, we need to find the best attribute. In the "Sunny" state we have:
H_{High|Sunny|Outlook} = -(0/3) \log_2(0/3) - (3/3) \log_2(3/3) = 0,
H_{Normal|Sunny|Outlook} = -(2/2) \log_2(2/2) - (0/2) \log_2(0/2) = 0,
H_{Weak|Sunny|Outlook} = -(1/3) \log_2(1/3) - (2/3) \log_2(2/3) = 0.9183,
H_{Strong|Sunny|Outlook} = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1,   (5.22)

and the information gains can be found as follows:

I_{Humidity|Sunny} = H_{Sunny|Outlook} - (3/5) H_{High|Sunny|Outlook} - (2/5) H_{Normal|Sunny|Outlook} = 0.9710,
I_{Wind|Sunny} = H_{Sunny|Outlook} - (3/5) H_{Weak|Sunny|Outlook} - (2/5) H_{Strong|Sunny|Outlook} = 0.02,
which means that after the state “Sunny”, we have to branch the tree using attribute “Humidity”.
At state “Rain”, we have
H_{High|Rain|Outlook} = -(1/2) \log_2(1/2) - (1/2) \log_2(1/2) = 1,
H_{Normal|Rain|Outlook} = -(1/3) \log_2(1/3) - (2/3) \log_2(2/3) = 0.9183,
H_{Weak|Rain|Outlook} = -(3/3) \log_2(3/3) - (0/3) \log_2(0/3) = 0,
H_{Strong|Rain|Outlook} = -(0/2) \log_2(0/2) - (2/2) \log_2(2/2) = 0,   (5.23)

and the information gains can be found as follows:

I_{Humidity|Rain} = H_{Rain|Outlook} - (2/5) H_{High|Rain|Outlook} - (3/5) H_{Normal|Rain|Outlook} = 0.02,
I_{Wind|Rain} = H_{Rain|Outlook} - (3/5) H_{Weak|Rain|Outlook} - (2/5) H_{Strong|Rain|Outlook} = 0.9710,
which means that after the state “Rain” we have to branch the tree using attribute “Wind”. The complete decision tree can then be shown as follows:
[Figure 5.6: The complete decision tree. The root (9 Yes / 5 No) splits on Outlook. The Sunny branch (2 Yes / 3 No) splits on Humidity: High → NO (D1, D2, D8) and Normal → YES (D9, D11). The Overcast branch (4 Yes / 0 No) is a pure YES leaf (D3, D7, D12, D13). The Rain branch (3 Yes / 2 No) splits on Wind: Weak → YES (D4, D5, D10) and Strong → NO (D6, D14).]

For day D15 (Rain, High humidity, Weak wind), the tree follows the Rain branch and then the Weak branch, so the prediction is YES: John plays tennis.
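For completeness, here is a minimal Python sketch of the recursive splitting procedure described above (an ID3-style construction). It reuses the data list and the info_gain function from the previous sketch; the remaining names are illustrative, and the code is a sketch rather than a production implementation.

from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, attributes, label_index=3):
    labels = [r[label_index] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                 # pure node: return its class
    if not attributes:
        return majority(labels)          # no attributes left: majority class
    # Split on the attribute with the highest information gain.
    name, idx = max(attributes, key=lambda a: info_gain(rows, a[1], label_index))
    remaining = [a for a in attributes if a != (name, idx)]
    branches = {}
    for value in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == value]
        branches[value] = build_tree(subset, remaining, label_index)
    return (name, branches)

def classify(tree, row, attr_index={"Outlook": 0, "Humidity": 1, "Wind": 2}):
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[row[attr_index[name]]]
    return tree

tree = build_tree(data, [("Outlook", 0), ("Humidity", 1), ("Wind", 2)])
print(classify(tree, ("Rain", "High", "Weak")))   # "Yes": the prediction for day D15

On the Table 5.2 data, this construction splits on Outlook at the root, then on Humidity under Sunny and on Wind under Rain, matching the tree in Figure 5.6.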
5.5 Exercises
Exercise 1: K-means Clustering
In this exercise, you will write the code (MATLAB, C, Python, R, etc.) to implement the K-means clustering algorithm. You can download the poker hand data set from http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand-training-true.data. This data set contains 11 attributes, which are explained at http://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand.names. Use this data set and cluster the data into 3, 4, and 10 clusters. Try to explain the results. Explain how the algorithm is sensitive to the initial centroids.
Exercise 2: Linear Regression Analysis
Consider the general linear regression problem in which you want to fit a linear curve to a data set. The input variables are x1, x2, · · · , xl, the output is y, and we have N observations. The goal is to determine the values of b_0, b_1, · · · , b_l in order to minimize the error:
\min_{b_0, b_1, \cdots, b_l} \sum_{i=1}^{N} \left( y_i - (b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_l x_{l,i}) \right)^2.   (5.24)
Hint: Follow the same approach as explained in Section 5.3.2, where l = 1.
Exercise 3: Decision Tree Analysis
Consider trying to predict Miles per Gallon (MPG) for a set of cars using the data set available at https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.DecisionTrees#toc2. The cars can be categorized as Good, OK, or Bad. For the attributes displacement, horsepower, weight, acceleration, and model year, you can take the average value in order to categorize the cars into two categories. For example, the cars can be categorized into two groups according to their displacement (> 96) or (< 96).
Draw the complete decision tree for this data set and predict the MPG for the following cases:
Car 1. It is an Asian car with 4 cylinders, displacement 102, horsepower 80, weight 2400, acceleration 13.5, and model year 79.
Car 2. It is a European car with 4 cylinders, displacement 112, horsepower 60, weight 1950, acceleration 15.5, and model year 71.
5.5.1 Further Reading
1. Information Theory, http://www.seas.upenn.edu/~cis520/papers/Bishop_1.6.pdf.
2. Linear Models for Regression, http://www.seas.upenn.edu/~cis520/papers/Bishop_3.1-3.3.pdf.
3. Decision Trees, http://www.seas.upenn.edu/~cis520/papers/Nilsson-ch6-ml.pdf.
4. Have a look at the Machine Learning course at the University of Pennsylvania: https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/index.php?n=Lectures.Lectures.