COMP20008 Elements of Data Processing Classification
Semester 1, 2020
Contact: uwe.aickelin@unimelb.edu.au
Where are we now?
2
Classification
• Introduction to classification
• Comparing Classification and Regression
• Decision tree classification
• Will make connections to entropy
• k nearest neighbour classification
3
Classification vs Regression
4
Classification
• Classification
• Making predictions about the future, based on historical data
• Predictive modelling/classification
• The sexy part of data science ?!
• A foundation for machine intelligence, AI, machines replacing humans and taking our jobs ….
5
Classification – Example 1
Predicting disease from microarray data
Training data:

          Gene 1  Gene 2  Gene 3  ...  Gene n  Develop cancer <1 year
Person 1  2.3     1.1     0.3     ...  2.1     1
Person 2  3.2     0.2     1.2     ...  1.1     1
Person 3  1.9     3.8     2.7     ...  0.2     1
...       ...     ...     ...     ...  ...     ...
Person m  2.8     3.1     2.5     ...  3.4     0

Testing data:

          Gene 1  Gene 2  Gene 3  ...  Gene n  Develop cancer <1 year
Person X  2.1     0.9     0.6     ...  1.9     ?
6
Classification – Example 2
Animal classification
Test data
https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
7
Classification – Example 3
Banking: classifying borrowers
Test data:

Tid  Home Owner  Marital status  Annual Income  Defaulted Borrower
11   No          Single          55K            ?
8
Classification – Example 4
Detecting tax cheats
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test data:

Tid  Refund  Marital Status  Taxable Income  Tax Cheat
11   Yes     Married         125K            ?
9
Classification: Definition
• Given a collection of records (training set)
  • Each record contains a set of attributes and one class label
• Find a predictive model for the class label as a function of the values of all attributes, i.e. y = f(x1, x2, ..., xn)
  • y: discrete value, the target variable
  • x1, ..., xn: attributes, predictors
  • f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model, i.e. the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
10
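Below is a minimal, illustrative sketch of this train/test workflow using scikit-learn; it is not part of the original slides, and the tiny feature matrix and labels are made up to mirror the microarray example.

# Illustrative sketch (assumption: scikit-learn is available); the data are made up.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# x1..xn: attribute values per record; y: discrete class label per record
X = [[2.3, 1.1, 0.3], [3.2, 0.2, 1.2], [1.9, 3.8, 2.7], [2.8, 3.1, 2.5]]
y = [1, 1, 1, 0]

# divide the full data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier()        # f: the predictive model
model.fit(X_train, y_train)             # build the model on the training set
y_pred = model.predict(X_test)          # assign classes to previously unseen records
print(accuracy_score(y_test, y_pred))   # use the test set to estimate accuracy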
Classification framework
11
Regression: Recall Definition
• Given a collection of records (training set)
• Each record contains a set of attributes and one target variable
• Learn predictive model from data, i.e. y = f(x1, x2, ..., xn)
• y: target variable is continuous (key difference to classification)
• x1, ..., xn: attributes, predictors
12
Regression example 1
Predicting ice-cream consumption from temperature: y = f(x)
13
Regression example 2
Predicting the activity level of a target gene
Training data:

            Gene 1  Gene 2  Gene 3  ...  Gene n  Gene n+1 (target)
Person 1    2.3     1.1     0.3     ...  2.1     3.2
Person 2    3.2     0.2     1.2     ...  1.1     1.1
Person 3    1.9     3.8     2.7     ...  0.2     0.2
...         ...     ...     ...     ...  ...     ...
Person m    2.8     3.1     2.5     ...  3.4     0.9

Test data:

            Gene 1  Gene 2  Gene 3  ...  Gene n  Gene n+1 (target)
Person m+1  2.1     0.9     0.6     ...  1.9     ?
14
Quiz
• Is it classification or regression?
• Prediction of rain or no rain tomorrow based on today’s rain data
• Prediction of the amount of rain tomorrow based on today’s rain data
• Prediction of exam mark based on study hours
• Prediction of exam pass or fail based on study hours
• Prediction of salary based on degree results
• Prediction of complications of surgery based on pre-op medical data
15
Decision Trees
16
Decision Tree example: Prediction of tax cheats
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
17
Decision Tree example: Prediction of tax cheats 2
Training Data: the same tax-cheat table as on the previous slide.

Model: Decision Tree

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There can be more than one tree that fits the same data!
18
Decision Tree Classification Task
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (a decision tree) from the Training Set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the Test Set to predict the unknown class labels.
19
Apply DT Model to Test Data (1)
Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
20
Apply DT Model to Test Data (2)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Same decision tree as above.)
21
Apply DT Model to Test Data (3)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Same decision tree as above.)
22
Apply DT Model to Test Data (4)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Same decision tree as above.)
23
Apply DT Model to Test Data (5)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Same decision tree as above.)
24
Apply DT Model to Test Data (6)
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Same decision tree as above.) The record follows Refund = No, then MarSt = Married, and reaches a NO leaf: assign Cheat to “No”.
25
Apply DT Model to Test Data (7)
Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Single          100K            ?

(Same decision tree as above.)
26
Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels
27
Decision Tree Algorithm
(This slide repeats the “Decision Tree Classification Task” workflow: a tree induction algorithm learns a model from the Training Set, and the model is then applied to the Test Set to deduce the unknown class labels.)
28
Decision Tree Algorithm (2)
• Many algorithms exist:
  • We will look at a representative one (Hunt’s algorithm)
• How would you code an algorithm that automatically constructs a decision tree for the tax cheat data?
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
29
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach node t
• General Procedure:
  • If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
  • If Dt is an empty set, then t is a leaf node labeled by the default class yd
  • If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
  • Recursively apply the procedure to each subset (see the sketch below)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Figure: Dt at a node, with the first split on Refund into Yes / No branches.)
30
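The following Python sketch, which is not from the slides, illustrates Hunt’s general procedure above; it picks the attribute test with the lowest weighted child entropy (equivalent to the information gain introduced later in the lecture), and the small data fragment at the end is made up.

# Illustrative sketch of Hunt's general procedure (an interpretation, not the slides' exact code).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def hunt(records, attributes, default_class=None):
    labels = [r["label"] for r in records]
    if not records:                        # empty set: leaf labeled with the default class
        return default_class
    if len(set(labels)) == 1:              # all records in the same class: leaf with that class
        return labels[0]
    if not attributes:                     # no attribute tests left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def weighted_child_entropy(attr):      # impurity after splitting on attr
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r["label"])
        return sum(len(g) / len(records) * entropy(g) for g in groups.values())

    best = min(attributes, key=weighted_child_entropy)   # attribute test used to split
    majority = Counter(labels).most_common(1)[0][0]
    node = {"split_on": best, "children": {}}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node["children"][value] = hunt(subset, remaining, majority)   # recurse on each subset
    return node

# Made-up fragment in the spirit of the tax-cheat data:
data = [
    {"Refund": "Yes", "Marital": "Single",   "label": "No"},
    {"Refund": "No",  "Marital": "Married",  "label": "No"},
    {"Refund": "No",  "Marital": "Divorced", "label": "Yes"},
    {"Refund": "No",  "Marital": "Single",   "label": "Yes"},
]
print(hunt(data, ["Refund", "Marital"]))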
Hunt’s Algorithm Example

If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Same tax-cheat training table as above; Dt is split on Refund into Yes / No branches.)
31
Stopping condition: leaf node
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
• If Dt is an empty set, then t is a leaf node labeled as the default class yd
32
Decision Tree splitting example
Splitting attributes: Refund, MarSt, TaxInc

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
33
Tree induction (creation)
34
Tree Induction
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
• When a node only has instances of a single class
35
How to Specify Test Conditions?
• Depends on attribute types
  • Nominal
  • Ordinal
  • Continuous
• Depends on number of ways to split
  • 2-way split
  • Multi-way split
36
Splitting Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}

• Binary split: divide the values into two subsets and find an optimal partitioning, e.g. CarType -> {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}
37
Splitting Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g. Size -> {Small}, {Medium}, {Large}

• Binary split: divide the values into two subsets and find an optimal partitioning, e.g. Size -> {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}

• What about this split? Size -> {Small, Large} vs {Medium}
38
Splitting Continuous Attributes
• Different ways of handling
• Discretisation to form an ordinal categorical attribute
• Static – discretise once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles) or clustering
• Binary Decision: (A < v) or (A ≥ v)
• Consider all possible splits and find the best cut (see the sketch below)
• Can be more computationally intensive
39
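The binary-decision case can be sketched as follows; this is not from the slides, but it tries every candidate cut point v for a continuous attribute and scores the split (A < v) vs (A >= v) by information gain, using the Taxable Income column of the tax-cheat table.

# Illustrative sketch: exhaustive search for the best binary cut on a continuous attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    parent = entropy(labels)
    distinct = sorted(set(values))
    best_v, best_gain = None, -1.0
    # only midpoints between consecutive distinct values need to be considered
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left  = [l for x, l in zip(values, labels) if x < v]
        right = [l for x, l in zip(values, labels) if x >= v]
        gain = parent - (len(left) / len(labels)) * entropy(left) \
                      - (len(right) / len(labels)) * entropy(right)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                     # Taxable Income (K)
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # Cheat labels
print(best_threshold(income, cheat))   # on this column alone the best cut is 97.5K, gain ~0.28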
Splitting Continuous Attributes
(i) Binary split:  Taxable Income > 80K?  Yes / No

(ii) Multi-way split:  Taxable Income?  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

(The tax-cheat training table as before.)
40
Tree Induction
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
• When a node only has instances of a single class
41
How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Own Car?
  Yes: C0: 6, C1: 4
  No:  C0: 4, C1: 6

Car Type?
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Which test condition is better?
42
How to determine the best split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

  C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
  C0: 9, C1: 1  -> homogeneous, low degree of impurity
43
Measures of Node Impurity
• Entropy
• We have seen entropy in the feature correlation section, where it was used to
measure the amount of uncertainty in an outcome
• Entropy can also be viewed as an impurity measure
• The set {A,B,C,A,A,A,A,A} has low entropy: low uncertainty and low impurity
• The set {A,B,C,D,B,E,A,F} has high entropy: high uncertainty and high impurity
44
Node Impurity Criteria based on entropy
Entropy (H) at a given node t:

H(t) = - Σj p(j|t) log2 p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t)

Measures the homogeneity of a node:
• Maximum (log2 nc) when records are equally distributed amongst all classes (nc is the number of classes)
• Minimum (0.0) when all records belong to one class
45
Examples of computing Entropy
C1: 0, C2: 6
P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

C1: 1, C2: 5
P(C1) = 1/6   P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6   P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
46
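As a quick check, not part of the slides, these three values can be reproduced in a few lines of Python, using logarithms to base 2 and the convention 0 log 0 = 0.

# Recomputing the entropy values above from the class counts.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)   # 0 log 0 treated as 0

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92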
Question: What is the entropy?
C1: 13, C2: 20
P(C1) = ?   P(C2) = ?   Entropy = ?
47
Question: What is the entropy?
C1: 13, C2: 20
P(C1) = 13/33   P(C2) = 20/33
Entropy = -((13/33) log2(13/33) + (20/33) log2(20/33)) = 0.97
48
How good is a Split?
Compare the impurity (entropy) of the parent node (before splitting) with the impurity (entropy) of the children nodes (after splitting):

Gain = H(parent) - Σj [ N(vj) / N ] * H(vj)

H(vj): impurity measure of child node vj
j: child node index
N(vj): number of data points in child node vj
N: number of data points in the parent node
The larger the gain, the better.
49
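A small Python sketch of this gain computation follows; it is illustrative rather than taken from the slides, it uses entropy as the impurity measure H, and the class counts in the example call are made up.

# Information gain of a split, given per-class counts for the parent and each child.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    n = sum(parent_counts)
    weighted_children = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted_children   # the larger the gain, the better

# e.g. a parent with 5 + 5 records split into two children:
print(information_gain([5, 5], [[4, 1], [1, 4]]))   # ~0.28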
How good is a Split?
Note: the information gain is equivalent to the mutual information between the class variable and the feature being split on.
Thus, choosing the split with the highest information gain means choosing the feature that shares the most information with the class variable.
50
How good is a Split?
Own Car?   Yes: C0: 6, C1: 4   No: C0: 4, C1: 6
Car Type?  Family: C0: 1, C1: 3   Sports: C0: 8, C1: 0   Luxury: C0: 1, C1: 7
51
How to determine the Best Split?
Before Splitting: 10 records of class 0, 10 records of class 1 (C0: 10, C1: 10)

Own Car?   Yes: C0: 6, C1: 4   No: C0: 4, C1: 6
Car Type?  Family: C0: 1, C1: 3   Sports: C0: 8, C1: 0   Luxury: C0: 1, C1: 7

Which test condition is better?
– Compute the gain of both splits
– Choose the one with the larger gain
52
How to determine the Best Split?
Before Splitting: 10 records of class 0, 10 records of class 1

Own Car?   Yes: C0: 6, C1: 4   No: C0: 4, C1: 6                          Information gain = 0.029
Car Type?  Family: C0: 1, C1: 3   Sports: C0: 8, C1: 0   Luxury: C0: 1, C1: 7   Information gain = 0.62

We should choose Car Type as the better split.
53
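The two quoted gains can be verified directly; this check is not part of the slides.

# Verifying Information gain = 0.029 (Own Car) and 0.62 (Car Type), entropy in bits.
import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent   = H([10, 10])                                                             # 1.0
own_car  = parent - (10/20) * H([6, 4]) - (10/20) * H([4, 6])                      # ~0.029
car_type = parent - (4/20) * H([1, 3]) - (8/20) * H([8, 0]) - (8/20) * H([1, 7])   # ~0.62
print(round(own_car, 3), round(car_type, 3))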
Creating a decision tree
• Calculate information gain [Left Child], [Right Child] for all:
  • Refund: [Yes], [No]
  • Marital status: [Single], [Married], [Divorced]
  • Taxable income:
    • [60,60], (60,220]
    • [60,70], (70,220]
    • [60,75], (75,220]
    • [60,85], (85,220]
    • [60,90], (90,220]
    • [60,95], (95,220]
    • [60,100], (100,220]
    • [60,120], (120,220]
    • [60,125], (125,220]
• Choose feature + split with the highest information gain and use this as the root node and its split
• Do this recursively, terminating when a node consists of only Cheat=No or only Cheat=Yes (see the sketch below).
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
54
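For comparison, the sketch below builds a tree for the same table with scikit-learn; this is not the slides’ procedure (scikit-learn grows binary trees, and the one-hot encoding of the categorical attributes is an implementation choice), so treat it as an illustrative alternative.

# Illustrative scikit-learn sketch for the tax-cheat table (not the slides' algorithm).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(train[["Refund", "Marital", "Income"]])   # one-hot encode categoricals
y = train["Cheat"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))      # inspect the learned splits

# Classify the unseen record Tid=11 (Refund=Yes, Married, 125K):
test = pd.get_dummies(pd.DataFrame({"Refund": ["Yes"], "Marital": ["Married"], "Income": [125]}))
test = test.reindex(columns=X.columns, fill_value=0)
print(tree.predict(test))                                    # expected: ['No']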
Question 2c) from 2016 exam
Given a dataset with two classes, A and B, suppose the root node of a decision tree has 50 instances of class A and 150 instances of class B. Consider a candidate split of this root node into two children, the first with (25 class A and 25 class B), the second with (25 class A and 125 class B). Write a formula to measure the utility of this split using the entropy criterion. Explain how this formula helps measure split utility.
55
Question 2c) from 2016 exam – worked answer

Root node: A: 50, B: 150, split into a left child with A: 25, B: 25 and a right child with A: 25, B: 125.

Entropy(root) = -(50/200)log(50/200) - (150/200)log(150/200)

Entropy(left child) = -(25/50)log(25/50) - (25/50)log(25/50)

Entropy(right child) = -(25/150)log(25/150) - (125/150)log(125/150)

Split utility = Information Gain
             = Entropy(root) - Entropy(root|split)
             = Entropy(root) - [(50/200)*Entropy(left child) + (150/200)*Entropy(right child)]

The formula measures how much the split reduces the impurity of the root node; the larger the gain, the more useful the split.
61
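Plugging the numbers in, as a numeric check that is not part of the original model answer, gives a gain of roughly 0.074 bits.

# Split utility (information gain) for the 2016 exam question, log base 2.
import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

gain = H([50, 150]) - (50/200) * H([25, 25]) - (150/200) * H([25, 125])
print(round(gain, 3))   # ~0.074: the split reduces the root impurity by this much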
MI (mutual information) decision trees: advantages and disadvantages
• Advantages
• Easy to interpret
• Relatively efficient to construct
• Fast for deciding about a test instance
• Disadvantages
• A simple greedy construction strategy, producing a set of if-then rules. Sometimes this is too simple for data with complex structure: “Everything should be as simple as possible, but not simpler.”
• For complex datasets, the tree might grow very big and difficult to understand
62
K-nearest neighbour classifier
63
Nearest Neighbour Classifier
Basic idea: “If it walks like a duck and quacks like a duck, then it’s probably a duck”
(Figure: given a test record, compute its distance to the training records and choose the k “nearest” ones.)
64
Nearest-Neighbour Classifier
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbours to retrieve

To classify an unknown record:
1. Compute its distance to the training records
2. Identify the k nearest neighbours
3. Use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g. by taking a majority vote)
65
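A minimal scikit-learn sketch of this procedure follows; it is not from the slides, and the tiny “duck” data set is made up.

# k-NN with scikit-learn: store the records, pick k and a distance metric, vote.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]]   # stored training records
y_train = ["duck", "duck", "not duck", "not duck"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)              # k-NN simply memorises the training records
print(knn.predict([[1.1, 0.9]]))       # majority vote of the 3 nearest -> ['duck']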
Definition of Nearest Neighbour
(Figure panels: (a) 1-nearest neighbour, (b) 2-nearest neighbour, (c) 3-nearest neighbour of a record x.)

The k-nearest neighbours of a record x are the data points that have the k smallest distances to x.
66
Distance measure
• Compute the distance between two points p = (p1, p2, …) and q = (q1, q2, …)
  • Euclidean distance: d(p, q) = sqrt( Σi (pi − qi)² )
• Can also use the Pearson coefficient (a similarity measure)
• Determine the class from the nearest-neighbour list:
  • take the majority vote of class labels among the k nearest neighbours, or
  • weigh each vote according to distance, e.g. wi = 1 / di²
67
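A hand-rolled sketch of the two ideas on this slide, Euclidean distance and a 1/d² distance-weighted vote, is shown below; it is illustrative only and the training points are made up.

# Euclidean distance and a distance-weighted k-NN vote (w_i = 1 / d_i^2).
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k=3):
    # train is a list of (point, label) pairs
    nearest = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = euclidean(point, query)
        votes[label] += 1.0 / (d ** 2 + 1e-9)   # small constant guards against d = 0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]
print(knn_predict(train, (2.0, 2.0), k=3))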
1 nearest-neighbour
Voronoi Diagram defines the classification boundary
The green area takes the class of the green point
68
K-Nearest Neighbour classifier
Choosing a value for k:
• If k is too small, it is sensitive to noise points
• If k is too large, the neighbourhood may include points from other classes
X
http://2.bp.blogspot.com/-EK4tA2525EM/U-c6Q4jJuwI/AAAAAAAADjw/hdqRXuunpnQ/s1600/knn.png
69
Points to remember
• Understand the difference between classification and regression and why it is useful to build models for these tasks
• Understand how a decision tree may be used to make predictions about the class of a test instance
• Understand the key steps in building a decision tree:
  • how to split the instances
  • how to specify the attribute test condition
  • how to determine the best split
  • how to decide when to stop splitting
70
Points to remember
• Understand the operation and rationale of the k nearest neighbour algorithm for classification
• Understand the advantages and disadvantages of using the k nearest neighbour or decision tree for classification
71
Acknowledgements
Materials are partially adopted from …
• Previous COMP2008 slides including material produced by James Bailey, Pauline Lin, Chris Ewin, Uwe Aickelin and others
• https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
• CS059 – Data Mining – Slides
• http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap4_basic_classification.ppt
• http://people.duke.edu/~kh269/teaching/b351/20evaluatinglearning.pptx
72