
COMP9318 Tutorial 2: Classification

Wei Wang @ UNSW

Q1 I

Consider the following training dataset and the original decision tree induction
algorithm (ID3).
Risk is the class label attribute. The Height values have already been
discretized into disjoint ranges.

1. Calculate the information gain if Gender is chosen as the test attribute.

2. Calculate the information gain if Height is chosen as the test attribute.

3. Draw the final decision tree (without any pruning) for the training dataset.

4. Generate all the “IF-THEN” rules from the decision tree.

Gender Height Risk

F (1.5, 1.6] Low
M (1.9, 2.0] High
F (1.8, 1.9] Medium
F (1.8, 1.9] Medium
F (1.6, 1.7] Low
M (1.8, 1.9] Medium
F (1.5, 1.6] Low
M (1.6, 1.7] Low
M (2.0,∞] High
M (2.0,∞] High
F (1.7, 1.8] Medium
M (1.9, 2.0] Medium
F (1.8, 1.9] Medium
F (1.7, 1.8] Medium
F (1.7, 1.8] Medium

Solution to Q1 I

1. The original entropy is I_Risk = I(Low, Medium, High) = I(4, 8, 3) = 1.4566.
Consider Gender.

Gender  entropy
F       I(3, 6, 0)
M       I(1, 2, 3)

The expected entropy is (9/15) · I(3, 6, 0) + (6/15) · I(1, 2, 3) = 1.1346. The
information gain is 1.4566 − 1.1346 = 0.3220.
2. Consider Height.

Height entropy

(1.5, 1.6] I (2, 0, 0)
(1.6, 1.7] I (2, 0, 0)
(1.7, 1.8] I (0, 3, 0)
(1.8, 1.9] I (0, 4, 0)
(1.9, 2.0] I (0, 1, 1)
(2.0,∞] I (0, 0, 2)

The expected entropy is (2/15) · I(2, 0, 0) + (2/15) · I(2, 0, 0) + (3/15) · I(0, 3, 0) +
(4/15) · I(0, 4, 0) + (2/15) · I(0, 1, 1) + (2/15) · I(0, 0, 2) = 0.1333. The information
gain is 1.4566 − 0.1333 = 1.3233.
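These figures can be double-checked with a minimal Python sketch (the helper names entropy and info_gain are illustrative, not from the tutorial):

import math

def entropy(counts):
    # I(c1, ..., ck): entropy of a class distribution given as raw counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    # information gain = parent entropy minus the size-weighted entropy of the partitions
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

risk = [4, 8, 3]                                       # (Low, Medium, High) counts
print(entropy(risk))                                   # ≈ 1.4566
print(info_gain(risk, [[3, 6, 0], [1, 2, 3]]))         # Gender: ≈ 0.322
print(info_gain(risk, [[2, 0, 0], [2, 0, 0], [0, 3, 0],
                       [0, 4, 0], [0, 1, 1], [0, 0, 2]]))  # Height: ≈ 1.323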

Solution to Q1 II

3. ID3 decision tree:
▶ According to the computation above, we should first split on Height.
▶ After the split, the only problematic partition is the (1.9, 2.0] one. However, the
only remaining attribute, Gender, cannot separate its two tuples. Since this is a tie, we
can take either label.
▶ The final tree is shown in the figure below.

[Figure: decision tree rooted at Height, with leaves (1.5, 1.6] → Low, (1.6, 1.7] → Low,
(1.7, 1.8] → Medium, (1.8, 1.9] → Medium, (1.9, 2.0] → Medium or High, (2.0, ∞] → High.]

4. The rules are
▶ IF Height ∈ (1.5, 1.6], THEN Risk = Low.
▶ IF Height ∈ (1.6, 1.7], THEN Risk = Low.
▶ IF Height ∈ (1.7, 1.8], THEN Risk = Medium.
▶ IF Height ∈ (1.8, 1.9], THEN Risk = Medium.
▶ IF Height ∈ (1.9, 2.0], THEN Risk = Medium (or High).
▶ IF Height ∈ (2.0, ∞], THEN Risk = High.
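Since every rule tests only Height, they collapse into a single chain of comparisons; a minimal sketch (the function name predict_risk is illustrative; the (1.9, 2.0] tie is resolved as Medium, and heights are assumed to be above 1.5 as in the training data):

def predict_risk(height):
    # IF-THEN rules extracted from the Q1 decision tree
    if height <= 1.7:
        return 'Low'
    if height <= 2.0:
        return 'Medium'   # includes the (1.9, 2.0] tie, resolved here as Medium
    return 'High'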

Q2 I

Consider applying the SPRINT algorithm on the following training dataset

Age CarType Risk

23 family High
17 sports High
43 sports High
68 family Low
32 truck Low
20 family High

Answer the following questions:

1. Write down the attribute lists for attributes Age and CarType, respectively.

2. Assume the first split criterion is Age < 27.5. Write down the attribute lists for the
left child node (i.e., corresponding to the partition with Age < 27.5).

3. Assume that the two attribute lists for the root node are stored in relational tables
named AL_Age and AL_CarType, respectively. We can in fact generate the attribute lists for
the child nodes using standard SQL statements. Write down the SQL statements which will
generate the attribute lists for the left child node for the split criterion Age < 27.5.

4. Write down the final decision tree constructed by the SPRINT algorithm.

Solution to Q2 I

▶ The attribute list of Age is:

Age  class  Index
17   High   2
20   High   6
23   High   1
32   Low    5
43   High   3
68   Low    4

The attribute list of CarType is:

CarType  class  Index
family   High   1
sports   High   2
sports   High   3
family   Low    4
truck    Low    5
family   High   6

▶ For the left child node (Age < 27.5), the attribute list of Age is:

Age  class  Index
17   High   2
20   High   6
23   High   1

Solution to Q2 II

The attribute list of CarType for the left child node is:

CarType  class  Index
family   High   1
sports   High   2
family   High   6

▶ SQL for the attribute list of Age:

SELECT Age, Class, Index
FROM   AL_Age
WHERE  Age < 27.5

SQL for the attribute list of CarType:

SELECT C.CarType, C.Class, C.Index
FROM   AL_Age A, AL_CarType C
WHERE  A.Age < 27.5 AND A.index = C.index

▶ Consider the attribute list of Age: there are 5 possible "cut" positions. Each pair below
gives the (High, Low) class counts of the partitions below and above the cut, together with
the resulting gini index:

cut      below   above   gini_split
17 – 20  (1, 0)  (3, 2)  0.40
20 – 23  (2, 0)  (2, 2)  0.33
23 – 32  (3, 0)  (1, 2)  0.22
32 – 43  (3, 1)  (1, 1)  0.42
43 – 68  (4, 1)  (0, 1)  0.27

Solution to Q2 III

Therefore, the best split on Age is at the cut point 27.5 (i.e., the test Age < 27.5), with
gini_split = 0.22.
Consider the attribute list of CarType:

CarType High Low

f 2 1
s 2 0
t 0 1

Consider all the possible cuts:

CarType High Low

f 2 1
s, t 2 1

CarType High Low

s 2 0
f, t 2 2

CarType High Low

t 0 1
f, s 4 1

These three cuts have gini index values of 0.44, 0.33, and 0.27, respectively.
Therefore, the best split on CarType is CarType in ('truck').
Since 0.22 < 0.27, splitting on Age is better. Therefore, we split at the cut point 27.5,
i.e., using the test Age < 27.5.
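The gini_split values above can be reproduced with a minimal Python sketch (the helper names gini and gini_split are illustrative; each partition is written as its (High, Low) class counts):

def gini(counts):
    # gini index of one partition, given its per-class counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    # size-weighted gini index of a binary split
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# the five candidate cuts on Age: (High, Low) counts below / above each cut
age_cuts = [((1, 0), (3, 2)), ((2, 0), (2, 2)), ((3, 0), (1, 2)),
            ((3, 1), (1, 1)), ((4, 1), (0, 1))]
print([round(gini_split(cut), 2) for cut in age_cuts])  # [0.4, 0.33, 0.22, 0.42, 0.27]

# the three candidate subsets on CarType
print(round(gini_split([(2, 1), (2, 1)]), 2))           # {family} vs {sports, truck}: 0.44
print(round(gini_split([(2, 0), (2, 2)]), 2))           # {sports} vs {family, truck}: 0.33
print(round(gini_split([(0, 1), (4, 1)]), 2))           # {truck} vs {family, sports}: 0.27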
The attribute lists for each of the child nodes have already been computed above.
Since the tuples in the partition for Age < 27.5 are all "High", we only need to look at the
partition for Age ≥ 27.5.

Solution to Q2 IV

Age  class  Index
32   Low    5
43   High   3
68   Low    4

CarType  class  Index
sports   High   3
family   Low    4
truck    Low    5

It is obvious that CarType in ('sports') immediately cuts this partition into two "pure"
partitions and thus has a gini index value of 0, so we can skip a lot of calculations.
The final tree is:

[Figure: decision tree. Root test: Age < 27.5 (yes → High); otherwise test
CarType in ('sports') (yes → High; no → Low).]

Q3 I

Consider a (simplified) email classification example. Assume the training dataset contains
1000 emails in total, 100 of which are spam.

1. Calculate the class prior probability distribution. How would you classify a new incoming
email?

2. A friend of yours suggests that whether the email contains a $ character is a good feature
for detecting spam emails. You look into the training dataset and obtain the following
statistics ($ means emails containing a $, and $̄ means those not containing any $).

Class    $    $̄
SPAM     91   9
NOSPAM   63   837

Describe the (naive) Bayes classifier you can build on this new piece of "evidence". How
would this classifier predict the class label for a new incoming email that contains a $
character?

3. Another friend of yours suggests looking into the feature of whether the email's length is
longer than a fixed threshold (e.g., 500 bytes). You obtain the following results (this
feature is denoted as L, and its absence as L̄).

Q3 II

Class    L    L̄
SPAM     40   60
NOSPAM   400  500

How would a naive Bayes classifier predict the class label for a new incoming email that
contains a $ character and is shorter than the threshold?

Solution to Q3 I

1. The prior probabilities are:

P(SPAM) = 100/1000 = 0.10
P(NOSPAM) = (1000 − 100)/1000 = 0.90

Hence, with no other evidence, a new incoming email would be classified as NOSPAM, the class
with the larger prior.

2. In order to build a (naïve) Bayes classifier, we need to calculate (and store) the
likelihood of the feature for each class:

P($ | SPAM) = 91/100 = 0.91
P($ | NOSPAM) = 63/900 = 0.07

Solution to Q3 II

To classify the new object, we calculate the posterior probability for both classes as:

P(SPAM | X) = (1/P(X)) · P(X | SPAM) · P(SPAM)
            = (1/P(X)) · P($ | SPAM) · P(SPAM)
            = (1/P(X)) · 0.91 · 0.10 = (1/P(X)) · 0.091

P(NOSPAM | X) = (1/P(X)) · P(X | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · 0.07 · 0.90 = (1/P(X)) · 0.063

So the prediction will be SPAM.

3. The likelihood of the new feature for each class is:

P(L | SPAM) = 40/100 = 0.40
P(L | NOSPAM) = 400/900 ≈ 0.44

Solution to Q3 III

(Note: we can easily obtain the complementary probabilities, e.g.,
P(L̄ | SPAM) = 1 − P(L | SPAM) = 0.60.)

To classify the new object, we calculate the posterior probability for both classes as:

P(SPAM | X) = (1/P(X)) · P(X | SPAM) · P(SPAM)
            = (1/P(X)) · P($, L̄ | SPAM) · P(SPAM)
            = (1/P(X)) · P($ | SPAM) · P(L̄ | SPAM) · P(SPAM)
            = (1/P(X)) · 0.91 · 0.60 · 0.10 ≈ (1/P(X)) · 0.055

P(NOSPAM | X) = (1/P(X)) · P(X | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($, L̄ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($ | NOSPAM) · P(L̄ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · 0.07 · 0.56 · 0.90 ≈ (1/P(X)) · 0.035

Solution to Q3 IV

So the prediction will be SPAM.
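The posterior calculations above can be reproduced with a minimal Python sketch (names are illustrative; score returns the unnormalized posterior, i.e., without the common 1/P(X) factor):

priors   = {'SPAM': 100 / 1000, 'NOSPAM': 900 / 1000}
p_dollar = {'SPAM': 91 / 100,   'NOSPAM': 63 / 900}    # P($ | class)
p_long   = {'SPAM': 40 / 100,   'NOSPAM': 400 / 900}   # P(L | class)

def score(cls, has_dollar, is_long):
    # naive Bayes: multiply the prior by the (conditionally independent) feature likelihoods
    s = priors[cls] * (p_dollar[cls] if has_dollar else 1 - p_dollar[cls])
    return s * (p_long[cls] if is_long else 1 - p_long[cls])

# part 3: an email that contains a $ and is shorter than the threshold
for cls in ('SPAM', 'NOSPAM'):
    print(cls, round(score(cls, has_dollar=True, is_long=False), 3))  # 0.055 vs 0.035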
Q4 I

Based on the data in the following table,

1. estimate a Bernoulli Naive Bayes classifier (using add-one smoothing),
2. apply the classifier to the test document,
3. estimate a multinomial Naive Bayes classifier (using add-one smoothing),
4. apply the classifier to the test document.

You do not need to estimate parameters that you don't need for classifying the test document.

              docID  words in document                      class = China?
training set  1      Taipei Taiwan                          Yes
              2      Macao Taiwan Shanghai                  Yes
              3      Japan Sapporo                          No
              4      Sapporo Osaka Taiwan                   No
test set      5      Taiwan Taiwan Taiwan Sapporo Bangkok   ?

Solution to Q4 I

We use the following abbreviations to denote the words: TP = Taipei, TW = Taiwan, MC = Macao,
SH = Shanghai, JP = Japan, SP = Sapporo, OS = Osaka. The size of the vocabulary is 7.

1. (Bernoulli NB) We take each word in the vocabulary as a feature/attribute, and hence
obtain the following "relational" training set.

docID  TP  TW  MC  SH  JP  SP  OS  class
1      1   1   0   0   0   0   0   Y
2      0   1   1   1   0   0   0   Y
3      0   0   0   0   1   1   0   N
4      0   1   0   0   0   1   1   N

The test document is (ignoring the unknown token Bangkok):

docID  TP  TW  MC  SH  JP  SP  OS  class
5      0   1   0   0   0   1   0   ?

Solution to Q4 II

By looking at the test document, we calculate the necessary probabilities for the 'Y' class
as (note that there are 2 possible values for each variable):

P(Y) = 2/4
P(TP = 0 | Y) = (1 + 1)/(2 + 2)
P(TW = 1 | Y) = (2 + 1)/(2 + 2)
P(MC = 0 | Y) = (1 + 1)/(2 + 2)
P(SH = 0 | Y) = (1 + 1)/(2 + 2)
P(JP = 0 | Y) = (2 + 1)/(2 + 2)
P(SP = 1 | Y) = (0 + 1)/(2 + 2)
P(OS = 0 | Y) = (2 + 1)/(2 + 2)

Solution to Q4 III

Finally,

P(Y | X) ∝ P(Y) · P(TP = 0 | Y) · P(TW = 1 | Y) · P(MC = 0 | Y) · P(SH = 0 | Y)
           · P(JP = 0 | Y) · P(SP = 1 | Y) · P(OS = 0 | Y)
         = (1/2) · (1/2) · (3/4) · (1/2) · (1/2) · (3/4) · (1/4) · (3/4)
         = 27/4096 ≈ 0.0066

Solution to Q4 IV

We calculate the necessary probabilities for the 'N' class as:

P(N) = 2/4
P(TP = 0 | N) = (2 + 1)/(2 + 2)
P(TW = 1 | N) = (1 + 1)/(2 + 2)
P(MC = 0 | N) = (2 + 1)/(2 + 2)
P(SH = 0 | N) = (2 + 1)/(2 + 2)
P(JP = 0 | N) = (1 + 1)/(2 + 2)
P(SP = 1 | N) = (2 + 1)/(2 + 2)
P(OS = 0 | N) = (1 + 1)/(2 + 2)

Solution to Q4 V

Finally,

P(N | X) ∝ P(N) · P(TP = 0 | N) · P(TW = 1 | N) · P(MC = 0 | N) · P(SH = 0 | N)
           · P(JP = 0 | N) · P(SP = 1 | N) · P(OS = 0 | N)
         = (1/2) · (3/4) · (1/2) · (3/4) · (3/4) · (1/2) · (3/4) · (1/2)
         = 81/4096 ≈ 0.020

Therefore, doc 5 should belong to the 'No' class.

2. (Multinomial NB) We form the mega-document for each class:

Doc              class
TP TW MC TW SH   Y
JP SP SP OS TW   N

The test document is (ignoring the out-of-vocabulary (OOV) word Bangkok):

Doc           class
TW TW TW SP   ?

Solution to Q4 VI

By looking at the test document, we calculate the necessary probabilities for the 'Y' class
as (note that there are 7 possible values for the variable w_i):

P(Y) = 2/4
P(w_i = TW | Y) = (2 + 1)/(5 + 7)
P(w_i = SP | Y) = (0 + 1)/(5 + 7)

Finally,

P(Y | X) ∝ P(Y) · P(w_i = TW | Y) · P(w_i = TW | Y) · P(w_i = TW | Y) · P(w_i = SP | Y)
         = (1/2) · (1/4) · (1/4) · (1/4) · (1/12)
         = 1/1536 ≈ 0.000651

Solution to Q4 VII

We calculate the necessary probabilities for the 'N' class as:

P(N) = 2/4
P(w_i = TW | N) = (1 + 1)/(5 + 7)
P(w_i = SP | N) = (2 + 1)/(5 + 7)

Finally,

P(N | X) ∝ P(N) · P(w_i = TW | N) · P(w_i = TW | N) · P(w_i = TW | N) · P(w_i = SP | N)
         = (1/2) · (1/6) · (1/6) · (1/6) · (1/4)
         = 1/1728 ≈ 0.000579

Therefore, doc 5 should belong to the 'Yes' class.
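The Bernoulli and multinomial estimates above can be reproduced with a minimal Python sketch (names are illustrative; exact fractions are used so that 27/4096, 81/4096, 1/1536 and 1/1728 appear directly):

from fractions import Fraction as F

docs = [("Taipei Taiwan", "Y"), ("Macao Taiwan Shanghai", "Y"),
        ("Japan Sapporo", "N"), ("Sapporo Osaka Taiwan", "N")]
test = "Taiwan Taiwan Taiwan Sapporo".split()             # Bangkok (OOV) dropped
vocab = sorted({w for d, _ in docs for w in d.split()})   # 7 distinct words

def bernoulli_score(cls):
    # Bernoulli NB with add-one smoothing: one Boolean feature per vocabulary word
    cls_docs = [set(d.split()) for d, c in docs if c == cls]
    score = F(len(cls_docs), len(docs))                   # class prior
    for w in vocab:
        p = F(sum(w in d for d in cls_docs) + 1, len(cls_docs) + 2)
        score *= p if w in test else 1 - p
    return score

def multinomial_score(cls):
    # multinomial NB with add-one smoothing over the class mega-document
    tokens = [w for d, c in docs if c == cls for w in d.split()]
    score = F(sum(c == cls for _, c in docs), len(docs))  # class prior
    for w in test:
        score *= F(tokens.count(w) + 1, len(tokens) + len(vocab))
    return score

print(bernoulli_score("Y"), bernoulli_score("N"))         # 27/4096 vs 81/4096 -> 'No'
print(multinomial_score("Y"), multinomial_score("N"))     # 1/1536 vs 1/1728  -> 'Yes'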
Q5 I

Consider a binary classification problem.

1. First, we randomly obtained 47 training examples, among which we have 22 negative
instances (denoted as "-") and 25 positive instances (denoted as "+"). What is your estimate
of the probability that a novel test instance belongs to the positive class?

2. We then identify a feature x and rearrange the 47 training examples based on their x
values. The result is shown in the table below.

Q5 II

x  y  count
1  -  6
1  +  2
2  -  5
2  +  2
3  -  7
3  +  6
4  -  3
4  +  7
5  -  1
5  +  8

Table: Training Data

For each group of training examples with the same x value, compute its probability p_i and
logit(p) := log(p / (1 − p)).

3. What is your estimate of the probability that a novel test instance belongs to the
positive class if its x value is 1?

4. We can run a linear regression on the (x, logit) pairs from each group. Will this be the
same as what Logistic Regression does?

Solution to Q5 I

1. Pr(+) = 25/47.

2. See the table below.

x  cnt(y = 0)  cnt(y = 1)  p         logit(p)
1  6           2           0.250000  -1.098612
2  5           2           0.285714  -0.916291
3  7           6           0.461538  -0.154151
4  3           7           0.700000   0.847298
5  1           8           0.888889   2.079442

3. Pr(+ | x = 1) = 2/8.

4. Not the same. The main reason is that Logistic Regression maximizes the likelihood of the
data, which is in general different from minimizing the SSE as in Linear Regression.

Q6 I

Consider the two-dimensional vectors A = (2, 3)ᵀ and B = (−1, 0)ᵀ, and let C = A + B.

▶ Represent the vectors in the non-orthogonal basis B = [1 2; 0 −2] (its columns are the
basis vectors).

▶ Let Z_p denote a vector Z represented in polar coordinates (ρ, θ). What if we still compute
Z_p = A_p + B_p in the old "linear" (componentwise) way? Will Z_p be the same as C_p?

▶ Can you construct a matrix M such that its effect on vectors represented in polar
coordinates exhibits "linearity", i.e., M(x + y) = Mx + My?

Solution to Q6 I

▶ C = (1, 3)ᵀ.

(2, 3)ᵀ = B A′ ⇒ A′ = (5, −1.5)ᵀ
(−1, 0)ᵀ = B B′ ⇒ B′ = (−1, 0)ᵀ
(1, 3)ᵀ = B C′ ⇒ C′ = (4, −1.5)ᵀ

Obviously, we still have C′ = A′ + B′.

▶ (Obviously) No.

▶ One possible choice is M = [a b; z −z]. Then M(r, ρ)ᵀ = ((a + b)r, 0)ᵀ. Hence, the
"linearity" holds.
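A quick numerical check of the change-of-basis computation, as a minimal NumPy sketch (variable names are illustrative):

import numpy as np

B = np.array([[1.0, 2.0],
              [0.0, -2.0]])                 # the non-orthogonal basis from Q6 (columns)
A = np.array([2.0, 3.0])
Bv = np.array([-1.0, 0.0])
C = A + Bv

# coordinates with respect to the new basis: solve B x' = x for each vector
A1, B1, C1 = (np.linalg.solve(B, v) for v in (A, Bv, C))
print(A1, B1, C1)                           # ≈ (5, -1.5), (-1, 0), (4, -1.5)
print(np.allclose(C1, A1 + B1))             # True: addition is preserved by any change of basis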

Q7 I

Consider a set of d-dimensional points arranged in a data matrix X_{n×d} whose i-th row is
o_i. Now consider a linear projection A_{d×m} of all the points to an m-dimensional space
(m < d). Specifically, each o_i is mapped to a new vector π(o_i) = Aᵀo_i.

▶ Compute r := ‖π(o_i)‖² / ‖o_i‖². Can you guess what the maximum and minimum values of r
will be?

Solution to Q7 I

▶ Since

‖π(o)‖² = π(o)ᵀπ(o) = (Aᵀo)ᵀ(Aᵀo) = oᵀAAᵀo = oᵀ(AAᵀ)o,

we therefore have

r = ‖π(o)‖² / ‖o‖² = oᵀ(AAᵀ)o / (oᵀo).

Comment: the expression above is the Rayleigh quotient (cf. its Wikipedia page) with
M = AAᵀ. The maximum and minimum values of r are determined by the maximum and minimum
eigenvalues of M, respectively. This property is also used in the technical proof of
spectral clustering (not required).
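A minimal NumPy sketch illustrating the bound (the dimensions and random seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2
A = rng.standard_normal((d, m))            # the d x m projection matrix
lam = np.linalg.eigvalsh(A @ A.T)          # eigenvalues of M = AA^T, in ascending order

o = rng.standard_normal(d)                 # an arbitrary point
r = np.linalg.norm(A.T @ o) ** 2 / np.linalg.norm(o) ** 2
print(lam[0], "<=", r, "<=", lam[-1])      # r always lies between the extreme eigenvalues of M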