COMP9318 Tutorial 2: Classification
Wei Wang @ UNSW
Q1 I
Consider the following training dataset and the original decision tree induction
algorithm (ID3).
Risk is the class label attribute. The Height values have already been
discretized into disjoint ranges.
1. Calculate the information gain if Gender is chosen as the test attribute.
2. Calculate the information gain if Height is chosen as the test attribute.
3. Draw the final decision tree (without any pruning) for the training dataset.
4. Generate all the “IF-THEN” rules from the decision tree.
Gender Height Risk
F (1.5, 1.6] Low
M (1.9, 2.0] High
F (1.8, 1.9] Medium
F (1.8, 1.9] Medium
F (1.6, 1.7] Low
M (1.8, 1.9] Medium
F (1.5, 1.6] Low
M (1.6, 1.7] Low
M (2.0,∞] High
M (2.0,∞] High
F (1.7, 1.8] Medium
M (1.9, 2.0] Medium
F (1.8, 1.9] Medium
F (1.7, 1.8] Medium
F (1.7, 1.8] Medium
Solution to Q1 I
1. The original entropy is I_Risk = I(Low, Medium, High) = I(4, 8, 3) = 1.4566.
   Consider Gender.
   Gender   entropy
   F        I(3, 6, 0)
   M        I(1, 2, 3)
   The expected entropy is (9/15) · I(3, 6, 0) + (6/15) · I(1, 2, 3) = 1.1346.
   The information gain is 1.4566 − 1.1346 = 0.3220.
2. Consider Height.
   Height       entropy
   (1.5, 1.6]   I(2, 0, 0)
   (1.6, 1.7]   I(2, 0, 0)
   (1.7, 1.8]   I(0, 3, 0)
   (1.8, 1.9]   I(0, 4, 0)
   (1.9, 2.0]   I(0, 1, 1)
   (2.0, ∞]     I(0, 0, 2)
   The expected entropy is
   (2/15) · I(2, 0, 0) + (2/15) · I(2, 0, 0) + (3/15) · I(0, 3, 0) + (4/15) · I(0, 4, 0)
     + (2/15) · I(0, 1, 1) + (2/15) · I(0, 0, 2) = 0.1333.
   The information gain is 1.4566 − 0.1333 = 1.3233. (Both gains are re-derived
   in the short sketch below.)
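These figures can be double-checked mechanically. Below is a minimal Python sketch (not part of the original tutorial; the data encoding and helper names are mine) that recomputes both information gains:

import math
from collections import Counter

# Training data as (Gender, Height, Risk) tuples, copied from the table above.
data = [
    ("F", "(1.5,1.6]", "Low"),    ("M", "(1.9,2.0]", "High"),
    ("F", "(1.8,1.9]", "Medium"), ("F", "(1.8,1.9]", "Medium"),
    ("F", "(1.6,1.7]", "Low"),    ("M", "(1.8,1.9]", "Medium"),
    ("F", "(1.5,1.6]", "Low"),    ("M", "(1.6,1.7]", "Low"),
    ("M", "(2.0,inf]", "High"),   ("M", "(2.0,inf]", "High"),
    ("F", "(1.7,1.8]", "Medium"), ("M", "(1.9,2.0]", "Medium"),
    ("F", "(1.8,1.9]", "Medium"), ("F", "(1.7,1.8]", "Medium"),
    ("F", "(1.7,1.8]", "Medium"),
]

def entropy(labels):
    """I(c1, c2, ...): -sum p log2 p over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_idx):
    """Information gain of splitting on the attribute at position attr_idx."""
    labels = [row[-1] for row in data]
    groups = {}
    for row in data:
        groups.setdefault(row[attr_idx], []).append(row[-1])
    expected = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

print(info_gain(0), info_gain(1))  # approx. 0.322 (Gender) and 1.323 (Height)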
Solution to Q1 II
3. ID3 decision tree:
   - According to the computation above, we should first choose Height to split on.
   - After the split, the only impure partition is (1.9, 2.0]. However, the only
     remaining attribute, Gender, cannot separate its two tuples (both are M). As
     there is a tie, we can take either label.
   - The final tree is shown below.
Height
├── (1.5, 1.6] → Low
├── (1.6, 1.7] → Low
├── (1.7, 1.8] → Medium
├── (1.8, 1.9] → Medium
├── (1.9, 2.0] → Medium or High
└── (2.0, ∞]   → High
4. The rules are:
   - IF Height ∈ (1.5, 1.6], THEN Risk = Low.
   - IF Height ∈ (1.6, 1.7], THEN Risk = Low.
   - IF Height ∈ (1.7, 1.8], THEN Risk = Medium.
   - IF Height ∈ (1.8, 1.9], THEN Risk = Medium.
   - IF Height ∈ (1.9, 2.0], THEN Risk = Medium (or High).
   - IF Height ∈ (2.0, ∞], THEN Risk = High.
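For completeness, the rules can also be packaged as a tiny classifier function. This is only an illustrative sketch (the function name is mine; the Medium/High tie for (1.9, 2.0] is broken as Medium):

def risk_from_height(height: float) -> str:
    """IF-THEN rules read off the ID3 tree above (Height in metres)."""
    if height <= 1.7:
        return "Low"       # covers (1.5, 1.6] and (1.6, 1.7]
    if height <= 2.0:
        return "Medium"    # covers (1.7, 1.8], (1.8, 1.9] and (1.9, 2.0]
    return "High"          # covers (2.0, inf)

print(risk_from_height(1.65), risk_from_height(1.85), risk_from_height(2.10))
# -> Low Medium High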
Q2 I
Consider applying the SPRINT algorithm on the following training dataset
Age CarType Risk
23 family High
17 sports High
43 sports High
68 family Low
32 truck Low
20 family High
Answer the following questions:
1. Write down the attribute lists for attribute Age and CarType, respectively.
2. Assume the first split criterion is Age < 27.5. Write down the attribute
   lists for the left child node (i.e., corresponding to the partition where
   Age < 27.5).
3. Assume that the two attribute lists for the root node are stored in
   relational tables named AL_Age and AL_CarType, respectively. We can in
   fact generate the attribute lists for the child nodes using standard SQL
   statements. Write down the SQL statements which generate the attribute
   lists for the left child node under the split criterion Age < 27.5.
4. Write down the final decision tree constructed by the SPRINT algorithm.
Solution to Q2 I
1. The attribute list for Age is:
Age class Index
17 High 2
20 High 6
23 High 1
32 Low 5
43 High 3
68 Low 4
Attribute list of CarType is:
CarType class Index
family High 1
sports High 2
sports High 3
family Low 4
truck Low 5
family High 6
2. The attribute lists for the left child node (Age < 27.5) are as follows. The attribute list for Age is:
Age class Index
17 High 2
20 High 6
23 High 1
Solution to Q2 II
Attribute list of CarType is:
CarType class Index
family High 1
sports High 2
family High 6
3. SQL for the attribute list of Age:
SELECT Age, Class, Index
FROM AL_Age
WHERE Age < 27.5
SQL for the attribute list of CarType (a join with AL_Age is needed because the
split attribute Age does not appear in AL_CarType; the Index column links the
two lists):
SELECT C.CarType, C.Class, C.Index
FROM AL_Age A, AL_CarType C
WHERE A.Age < 27.5
AND A.Index = C.Index
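If you want to try these statements, they run essentially unchanged on SQLite via Python's built-in sqlite3 module. The sketch below is only illustrative (note that Index is a reserved word in SQLite, so it is quoted there):

import sqlite3

# Build the two root-node attribute lists and run the split queries above.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE AL_Age     (Age INTEGER, Class TEXT, "Index" INTEGER);
    CREATE TABLE AL_CarType (CarType TEXT, Class TEXT, "Index" INTEGER);
''')
conn.executemany('INSERT INTO AL_Age VALUES (?, ?, ?)',
                 [(17, "High", 2), (20, "High", 6), (23, "High", 1),
                  (32, "Low", 5), (43, "High", 3), (68, "Low", 4)])
conn.executemany('INSERT INTO AL_CarType VALUES (?, ?, ?)',
                 [("family", "High", 1), ("sports", "High", 2), ("sports", "High", 3),
                  ("family", "Low", 4), ("truck", "Low", 5), ("family", "High", 6)])

print(conn.execute('SELECT Age, Class, "Index" FROM AL_Age WHERE Age < 27.5').fetchall())
print(conn.execute('''SELECT C.CarType, C.Class, C."Index"
                      FROM AL_Age A, AL_CarType C
                      WHERE A.Age < 27.5 AND A."Index" = C."Index"''').fetchall())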
4. To build the tree, consider the attribute list of Age first: there are 5
   possible "cut" positions, each with the following gini index value (class
   counts are written as (High, Low) for the partition below and above the cut):
   cut position   below cut   above cut   gini_split
   17 – 20        (1, 0)      (3, 2)      0.40
   20 – 23        (2, 0)      (2, 2)      0.33
   23 – 32        (3, 0)      (1, 2)      0.22
   32 – 43        (3, 1)      (1, 1)      0.42
   43 – 68        (4, 1)      (0, 1)      0.27
Solution to Q2 III
Therefore, the best split on Age is at the midpoint of 23 and 32, i.e., Age < 27.5, with gini_split = 0.22.
Consider the attribute list of CarType:
CarType High Low
f 2 1
s 2 0
t 0 1
Consider all the possible cuts:
CarType High Low
f 2 1
s, t 2 1
CarType High Low
s 2 0
f, t 2 2
CarType High Low
t 0 1
f, s 4 1
Their gini index values are 0.44, 0.33, and 0.27, respectively.
Therefore, the best split on CarType is CarType in ('truck'), with gini_split = 0.27.
Since 0.22 < 0.27, splitting on Age is better. Therefore, we split the root node
by Age < 27.5 (the gini values above are re-derived in the short sketch below).
The attribute lists for the child nodes are computed as before; those for the
left child were already given in part 2. Since the tuples in the partition with
Age < 27.5 are all "High", that branch becomes a leaf, and we only need to look
at the partition for Age ≥ 27.5.
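Here is the verification sketch mentioned above: a short, illustrative Python snippet (helper names are mine) that recomputes gini_split for every Age cut point and for the three single-value CarType subsets:

# Verification only; SPRINT itself works on the sorted attribute lists
# and maintains class histograms incrementally.
records = [(23, "family", "High"), (17, "sports", "High"), (43, "sports", "High"),
           (68, "family", "Low"), (32, "truck", "Low"), (20, "family", "High")]

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p_high = sum(1 for y in labels if y == "High") / n
    return 1.0 - p_high ** 2 - (1.0 - p_high) ** 2

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Numeric attribute: try the midpoints between consecutive Age values.
ages = sorted(r[0] for r in records)
for cut in [(a + b) / 2 for a, b in zip(ages, ages[1:])]:
    below = [r[2] for r in records if r[0] < cut]
    above = [r[2] for r in records if r[0] >= cut]
    print(f"Age < {cut}: gini_split = {gini_split(below, above):.2f}")

# Categorical attribute: try each single-value subset.
for v in ("family", "sports", "truck"):
    inside = [r[2] for r in records if r[1] == v]
    outside = [r[2] for r in records if r[1] != v]
    print(f"CarType in ('{v}'): gini_split = {gini_split(inside, outside):.2f}")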
Solution to Q2 IV
The attribute lists for the right child node (Age ≥ 27.5) are:
Age class Index
32 Low 5
43 High 3
68 Low 4
CarType class Index
sports High 3
family Low 4
truck Low 5
It is obvious that CarType in ('sports') immediately cuts this partition into
two "pure" partitions and thus has a gini index of 0, so we can skip many
calculations.
The final tree is:

Age < 27.5?
├── yes → High
└── no  → CarType in ('sports')?
          ├── yes → High
          └── no  → Low
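The final tree can also be written as a small classification function (an illustrative sketch, not SPRINT itself):

def classify_risk(age: float, car_type: str) -> str:
    """Classify a record with the final SPRINT tree above."""
    if age < 27.5:
        return "High"
    return "High" if car_type == "sports" else "Low"

# Reproduces the labels of the training records, e.g.:
print(classify_risk(23, "family"), classify_risk(43, "sports"), classify_risk(68, "family"))
# -> High High Low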
Q3 I
Consider a (simplified) email classification example. Assume the training
dataset contains 1000 emails in total, 100 of which are spam.
1. Calculate the class prior probability distribution. How would you classify a
new incoming email?
2. A friend of yours suggests that whether the email contains a $ character is
   a good feature for detecting spam emails. You look into the training dataset
   and obtain the following statistics ($ denotes emails containing a $ and $̄
   denotes those not containing any $).
Class $ $̄
SPAM 91 9
NOSPAM 63 837
Describe the (naive) Bayes Classifier you can build on this new piece of
“evidence”. How would this classifier predict the class label for a new
incoming email that contains a $ character?
3. Another friend of yours suggests looking into the feature of whether the
   email's length exceeds a fixed threshold (e.g., 500 bytes). You obtain the
   following results (this feature is denoted as L, and its negation as L̄).
Q3 II
Class L L̄
SPAM 40 60
NOSPAM 400 500
How would a naive Bayes classifier predict the class label for a new
incoming email that contains a $ character and is shorter than the
threshold?
Solution to Q3 I
1. The prior probabilities are:
   P(SPAM) = 100/1000 = 0.10
   P(NOSPAM) = (1000 − 100)/1000 = 0.90
   With no other evidence, a new incoming email would be classified as NOSPAM,
   the class with the larger prior probability.
2. In order to build a (naïve) Bayes classifier, we need to calculate (and
   store) the likelihood of the feature for each class:
   P($ | SPAM) = 91/100 = 0.91
   P($ | NOSPAM) = 63/900 = 0.07
Solution to Q3 II
To classify the new object X (an email containing a $), we calculate the
posterior probability for both classes as:

P(SPAM | X) = (1/P(X)) · P(X | SPAM) · P(SPAM)
            = (1/P(X)) · P($ | SPAM) · P(SPAM)
            = (1/P(X)) · 0.91 · 0.10 = (1/P(X)) · 0.091

P(NOSPAM | X) = (1/P(X)) · P(X | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · 0.07 · 0.90 = (1/P(X)) · 0.063

Since 0.091 > 0.063, the prediction will be SPAM.
3. The likelihoods of the new feature for each class are:
   P(L | SPAM) = 40/100 = 0.40
   P(L | NOSPAM) = 400/900 ≈ 0.44
Solution to Q3 III
(Note: the complementary probabilities are easily obtained, e.g.,
P(L̄ | SPAM) = 1 − P(L | SPAM) = 0.60.)
To classify the new object X (an email containing a $ and shorter than the
threshold, i.e., with features $ and L̄), we calculate the posterior probability
for both classes as:

P(SPAM | X) = (1/P(X)) · P(X | SPAM) · P(SPAM)
            = (1/P(X)) · P($, L̄ | SPAM) · P(SPAM)
            = (1/P(X)) · P($ | SPAM) · P(L̄ | SPAM) · P(SPAM)   (conditional independence)
            = (1/P(X)) · 0.91 · 0.60 · 0.10 = (1/P(X)) · 0.055

P(NOSPAM | X) = (1/P(X)) · P(X | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($, L̄ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · P($ | NOSPAM) · P(L̄ | NOSPAM) · P(NOSPAM)
              = (1/P(X)) · 0.07 · 0.56 · 0.90 = (1/P(X)) · 0.035
Solution to Q3 IV
Since 0.055 > 0.035, the prediction will again be SPAM.
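The two comparisons can be reproduced with a few lines of Python. This sketch simply plugs in the probabilities from the tables above and ignores the common 1/P(X) factor:

# Unnormalised naive Bayes scores for the email example.
priors = {"SPAM": 0.10, "NOSPAM": 0.90}
p_dollar = {"SPAM": 91 / 100, "NOSPAM": 63 / 900}   # P($ | class)
p_long = {"SPAM": 40 / 100, "NOSPAM": 400 / 900}    # P(L | class)

# Part 2: an email containing a '$'.
score1 = {c: p_dollar[c] * priors[c] for c in priors}
print(score1, "->", max(score1, key=score1.get))    # SPAM (0.091 vs 0.063)

# Part 3: an email containing a '$' and shorter than the threshold (L-bar).
score2 = {c: p_dollar[c] * (1 - p_long[c]) * priors[c] for c in priors}
print(score2, "->", max(score2, key=score2.get))    # SPAM (~0.055 vs ~0.035)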
Q4 I
Based on the data in the following table,
1. estimate a Bernoulli Naive Bayes classifier (using add-one smoothing)
2. apply the classifier to the test document.
3. estimate a multinomial Naive Bayes classifier (using add-one smoothing)
4. apply the classifier to the test document
You do not need to estimate parameters that you don’t need for classifying the
test document.
docID words in document class = China?
training set 1 Taipei Taiwan Yes
2 Macao Taiwan Shanghai Yes
3 Japan Sapporo No
4 Sapporo Osaka Taiwan No
test set 5 Taiwan Taiwan Taiwan Sapporo Bangkok ?
Solution to Q4 I
We use the following abbreviations to denote the words, i.e., TP = Taipei, TW
= Taiwan, MC = Macao, SH = Shanghai, JP = Japan, SP = Sapporo, OS =
Osaka. The size of the vocabulary is 7.
1. (Bernoulli NB) We take each word in the vocabulary as a feature/attribute,
   and hence obtain the following "relational" training set.
docID TP TW MC SH JP SP OS class
1 1 1 0 0 0 0 0 Y
2 0 1 1 1 0 0 0 Y
3 0 0 0 0 1 1 0 N
4 0 1 0 0 0 1 1 N
The testing document is (ignoring the unknown token Bangkok):
docID TP TW MC SH JP SP OS class
5 0 1 0 0 0 1 0 ?
Solution to Q4 II
By looking at the test document, we calculate the necessary probabilities for
the 'Y' class as follows (note that each variable has 2 possible values, hence
the +2 in the denominators of the smoothed estimates):

P(Y) = 2/4
P(TP = 0 | Y) = (1 + 1)/(2 + 2) = 1/2
P(TW = 1 | Y) = (2 + 1)/(2 + 2) = 3/4
P(MC = 0 | Y) = (1 + 1)/(2 + 2) = 1/2
P(SH = 0 | Y) = (1 + 1)/(2 + 2) = 1/2
P(JP = 0 | Y) = (2 + 1)/(2 + 2) = 3/4
P(SP = 1 | Y) = (0 + 1)/(2 + 2) = 1/4
P(OS = 0 | Y) = (2 + 1)/(2 + 2) = 3/4
Solution to Q4 III
Finally,

P(Y | X) ∝ P(Y) · P(TP = 0 | Y) · P(TW = 1 | Y) · P(MC = 0 | Y) · P(SH = 0 | Y)
           · P(JP = 0 | Y) · P(SP = 1 | Y) · P(OS = 0 | Y)
         = (1/2) · (1/2) · (3/4) · (1/2) · (1/2) · (3/4) · (1/4) · (3/4)
         = 27/4096 ≈ 0.0066
Solution to Q4 IV
We calculate the necessary probabilities for the 'N' class as:

P(N) = 2/4
P(TP = 0 | N) = (2 + 1)/(2 + 2) = 3/4
P(TW = 1 | N) = (1 + 1)/(2 + 2) = 1/2
P(MC = 0 | N) = (2 + 1)/(2 + 2) = 3/4
P(SH = 0 | N) = (2 + 1)/(2 + 2) = 3/4
P(JP = 0 | N) = (1 + 1)/(2 + 2) = 1/2
P(SP = 1 | N) = (2 + 1)/(2 + 2) = 3/4
P(OS = 0 | N) = (1 + 1)/(2 + 2) = 1/2
Solution to Q4 V
Finally,

P(N | X) ∝ P(N) · P(TP = 0 | N) · P(TW = 1 | N) · P(MC = 0 | N) · P(SH = 0 | N)
           · P(JP = 0 | N) · P(SP = 1 | N) · P(OS = 0 | N)
         = (1/2) · (3/4) · (1/2) · (3/4) · (3/4) · (1/2) · (3/4) · (1/2)
         = 81/4096 ≈ 0.020

Since 81/4096 > 27/4096, doc 5 should belong to the 'No' class under the
Bernoulli model.
2. (Multinomial NB) We form the mega-documents for each class as:
Doc class
TP TW MC TW SH Y
JP SP SP OS TW N
The test document is (ignoring the out-of-vocabulary (OOV) word Bangkok):
Doc class
TW TW TW SP ?
Solution to Q4 VI
By looking at the test document, we calculate the necessary probabilities for
the 'Y' class as follows (note that there are 7 possible values for the
variable w_i, hence the +7 in the denominators of the smoothed estimates):

P(Y) = 2/4
P(w_i = TW | Y) = (2 + 1)/(5 + 7) = 1/4
P(w_i = SP | Y) = (0 + 1)/(5 + 7) = 1/12

Finally,

P(Y | X) ∝ P(Y) · P(w_i = TW | Y) · P(w_i = TW | Y) · P(w_i = TW | Y) · P(w_i = SP | Y)
         = (1/2) · (1/4) · (1/4) · (1/4) · (1/12)
         = 1/1536 ≈ 0.000651
Solution to Q4 VII
We calculate the necessary probabilities for the 'N' class as:

P(N) = 2/4
P(w_i = TW | N) = (1 + 1)/(5 + 7) = 1/6
P(w_i = SP | N) = (2 + 1)/(5 + 7) = 1/4

Finally,

P(N | X) ∝ P(N) · P(w_i = TW | N) · P(w_i = TW | N) · P(w_i = TW | N) · P(w_i = SP | N)
         = (1/2) · (1/6) · (1/6) · (1/6) · (1/4)
         = 1/1728 ≈ 0.000579
Since 1/1536 > 1/1728, doc 5 should belong to the 'Yes' class under the
multinomial model.
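Both models can be estimated and applied with a short script. The following Python sketch (variable and function names are mine) implements Bernoulli and multinomial NB with add-one smoothing on this toy corpus and reproduces 27/4096 vs. 81/4096 and 1/1536 vs. 1/1728:

from collections import Counter

# Toy corpus from the table above; the class is the answer to "China?".
train = [(["Taipei", "Taiwan"], "Y"),
         (["Macao", "Taiwan", "Shanghai"], "Y"),
         (["Japan", "Sapporo"], "N"),
         (["Sapporo", "Osaka", "Taiwan"], "N")]
test = ["Taiwan", "Taiwan", "Taiwan", "Sapporo", "Bangkok"]

vocab = sorted({w for doc, _ in train for w in doc})   # 7 words
classes = ["Y", "N"]
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}

def bernoulli_score(c):
    docs = [set(doc) for doc, y in train if y == c]
    score = prior[c]
    for w in vocab:                              # OOV words (Bangkok) play no role
        df = sum(1 for d in docs if w in d)
        p = (df + 1) / (len(docs) + 2)           # add-one smoothing, 2 outcomes
        score *= p if w in test else (1 - p)
    return score

def multinomial_score(c):
    mega = Counter(w for doc, y in train if y == c for w in doc)
    total = sum(mega.values())
    score = prior[c]
    for w in test:
        if w in vocab:                           # OOV words are ignored
            score *= (mega[w] + 1) / (total + len(vocab))   # add-one smoothing
    return score

print({c: bernoulli_score(c) for c in classes})    # 'N' wins: 81/4096 vs 27/4096
print({c: multinomial_score(c) for c in classes})  # 'Y' wins: 1/1536 vs 1/1728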
Q5 I
Consider a binary classification problem.
1. First, we randomly obtained 47 training examples among which we have
22 negative instances (denoted as “-”), and 25 positive instances (denoted
as “+”).
What is your estimate of the probability that a novel test instance belongs
to the positive class?
2. We then identify a feature x , and rearrange the 47 training examples
based on their x values. The result is shown in the table below.
Q5 II
x y count
1 - 6
1 + 2
2 - 5
2 + 2
3 - 7
3 + 6
4 - 3
4 + 7
5 - 1
5 + 8
Table: Training Data
For each group of training examples with the same x value, compute its
positive-class probability p_i and logit(p_i) := log(p_i / (1 − p_i)).
3. What is your estimate of the probability that a novel test instance belongs
to the positive class if its x value is 1?
4. We can run a linear regression on the (x , logit) pairs from each group.
Will this be the same as what Logistic Regression does?
Solution to Q5 I
1. Pr(+) = 25/47.
2. See table below.
x cnt(y = 0) cnt(y = 1) p logit(p)
1 6 2 0.250000 -1.098612
2 5 2 0.285714 -0.916291
3 7 6 0.461538 -0.154151
4 3 7 0.700000 0.847298
5 1 8 0.888889 2.079442
3. Pr(+ | x = 1) = 2/8 = 0.25.
4. Not the same. The main reason is that Logistic Regression maximizes the
   likelihood of the data, which is in general different from minimizing the
   SSE on the (x, logit(p)) pairs as Linear Regression does.
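The contrast in part 4 is easy to see numerically. The sketch below fits (a) ordinary least squares on the five (x, logit(p)) pairs and (b) a logistic regression by maximum likelihood on the 47 individual examples via plain gradient ascent (the step size and iteration count are arbitrary choices for this illustration); the two fitted lines generally differ:

import numpy as np

# Grouped training data from the table: (x value, count of '-', count of '+').
groups = [(1, 6, 2), (2, 5, 2), (3, 7, 6), (4, 3, 7), (5, 1, 8)]

# Part 2: per-group positive probability and its logit.
xs = np.array([x for x, _, _ in groups], dtype=float)
p = np.array([pos / (neg + pos) for _, neg, pos in groups])
logit = np.log(p / (1 - p))
print(np.round(p, 6), np.round(logit, 6))

# (a) Ordinary least squares on the five (x, logit(p)) pairs.
slope_ols, intercept_ols = np.polyfit(xs, logit, 1)

# (b) Logistic regression fitted by maximum likelihood on the raw examples.
X = np.array([x for x, neg, pos in groups for _ in range(neg + pos)], dtype=float)
y = np.array([lab for _, neg, pos in groups for lab in [0] * neg + [1] * pos], dtype=float)
intercept, slope = 0.0, 0.0
for _ in range(50000):
    pred = 1.0 / (1.0 + np.exp(-(intercept + slope * X)))
    intercept += 1e-3 * np.sum(y - pred)       # gradient of log-likelihood w.r.t. intercept
    slope += 1e-3 * np.sum((y - pred) * X)     # gradient of log-likelihood w.r.t. slope

print("OLS on logits :", intercept_ols, slope_ols)
print("Logistic (MLE):", intercept, slope)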
Q6 I
Consider the two-dimensional vectors A = (2, 3)^⊤ and B = (−1, 0)^⊤, and let C = A + B.
- Represent the vectors in the non-orthogonal basis B = ( 1 2 ; 0 −2 ), whose
  columns are the basis vectors.
- Let Z_p be a vector Z represented in polar coordinates: (ρ, θ). What if we
  still compute Z_p = A_p + B_p in the old "linear" (component-wise) way? Will
  Z_p be the same as C_p?
- Can you construct a matrix M such that its impact on vectors represented in
  polar coordinates exhibits "linearity", i.e., M(x + y) = Mx + My?
Solution to Q6 I
- C = (1, 3)^⊤.
  (2, 3)^⊤ = B A' ⇒ A' = (5, −1.5)^⊤
  (−1, 0)^⊤ = B B' ⇒ B' = (−1, 0)^⊤
  (1, 3)^⊤ = B C' ⇒ C' = (4, −1.5)^⊤
  Obviously, we still have C' = A' + B' (see the numerical check after this list).
- (Obviously) No.
- One possible M = ( a b ; z −z ). Then M (r, r)^⊤ = ((a + b)r, 0)^⊤. Hence, the
  "linearity" holds.
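A quick numerical check of the first two bullets (an illustrative sketch; np.linalg.solve expresses a vector in the basis B):

import numpy as np

A = np.array([2.0, 3.0])
B = np.array([-1.0, 0.0])
C = A + B                                      # (1, 3)

basis = np.array([[1.0, 2.0],
                  [0.0, -2.0]])                # columns = the new basis vectors

# Coordinates in the new basis: solve basis @ v_new = v.
A_new = np.linalg.solve(basis, A)              # [ 5. , -1.5]
B_new = np.linalg.solve(basis, B)              # [-1. ,  0. ]
C_new = np.linalg.solve(basis, C)              # [ 4. , -1.5]
print(np.allclose(C_new, A_new + B_new))       # True: linearity survives the change of basis

# Polar coordinates (rho, theta): component-wise addition is NOT vector addition.
def to_polar(v):
    return np.array([np.hypot(v[0], v[1]), np.arctan2(v[1], v[0])])

print(to_polar(A) + to_polar(B))               # not equal to ...
print(to_polar(C))                             # ... the polar coordinates of C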
Q7 I
Consider a set of d-dimensional points arranged in a data matrix X_{n×d} whose
rows are o_1, o_2, ..., o_n. Now we consider a linear projection A_{d×m} of all
the points to an m-dimensional space (m < d). Specifically, each o_i is mapped
to a new vector π(o_i) = A^⊤ o_i.
- Compute r := ‖π(o_i)‖² / ‖o_i‖². Can you guess what the maximum and minimum
  values of r will be?
Solution to Q7 I
- Since
  ‖π(o)‖² = π(o)^⊤ π(o) = (A^⊤ o)^⊤ (A^⊤ o) = o^⊤ A A^⊤ o = o^⊤ (A A^⊤) o,
  therefore
  r = ‖π(o)‖² / ‖o‖² = (o^⊤ (A A^⊤) o) / (o^⊤ o).
  Comment: the above is the Rayleigh quotient (c.f. its Wikipedia page) with
  M = A A^⊤. The maximum and minimum values of r are determined by the maximum
  and minimum eigenvalues of M, respectively. This property is also used in the
  technical proof of spectral clustering (not required).
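The eigenvalue bound can be checked empirically. A small sketch with random data (the dimensions are arbitrary); since M = AA^⊤ is symmetric, its eigenvalues can be obtained with eigvalsh:

import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 2, 1000
A = rng.normal(size=(d, m))                 # projection matrix
O = rng.normal(size=(n, d))                 # n random d-dimensional points

# r for every point: ||A^T o||^2 / ||o||^2
r = np.sum((O @ A) ** 2, axis=1) / np.sum(O ** 2, axis=1)

eig = np.linalg.eigvalsh(A @ A.T)           # eigenvalues of M = A A^T
print(r.min(), r.max())                     # observed range over the sample
print(eig.min(), eig.max())                 # r always lies within these bounds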