408216 Data Mining and Knowledge Engineering
Data Mining Algorithms for Classification
Store the training records
Use the stored training records to predict the class label of unseen cases
Examples:
Rote-learner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: training records and a test record – compute the distance from the test record to every training record, then choose the k "nearest" records]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Compute distance between two points:
Euclidean distance
Determine the class from nearest neighbor list
take the majority vote of class labels among the k-nearest neighbors
Alternatively, weigh each vote according to distance, e.g.
weight factor w = 1/d
weight factor w = 1 − d (sensible when distances are normalised to [0, 1])
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
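The steps above can be sketched in Python; this is a minimal illustration, and the names `euclidean` and `knn_classify` are illustrative, not from the source:

```python
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_point, k=3, weighted=False):
    # train: list of (point, label) pairs
    ranked = sorted((euclidean(p, test_point), label) for p, label in train)
    votes = Counter()
    for d, label in ranked[:k]:
        # majority vote, optionally weighted by w = 1/d
        # (a zero distance gets infinite weight: exact match dominates)
        votes[label] += (1.0 / d if d > 0 else math.inf) if weighted else 1.0
    return votes.most_common(1)[0][0]
```

For example, with three stored points of class A near the origin and two of class B far away, a test point near the origin is classified as A by its 3 nearest neighbours.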
For nominal data, Euclidean distance cannot be used; instead, a distance measure based on whether or not the attribute values match exactly can be used
For the credit scoring example we will use k = 3, and the distance between two instances is the number of attributes on which they differ (a 0/1 contribution per attribute)
To classify Sam (Debt=High, Income=High, Married = Yes), we find the 3 nearest neighbours to Sam, who are Joe, Sue, and Mary
The risk values of the neighbours are Good, Good and Poor, which means that Sam is classified as a Good risk
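The Sam example can be reproduced with a short sketch; the records come from the table in this lecture, while the function names are illustrative:

```python
from collections import Counter

# Training records from the credit scoring example:
# (Debt, Income, Married?) -> actual Risk
records = {
    "Joe":  (("High", "High", "Yes"), "Good"),
    "Sue":  (("Low",  "High", "Yes"), "Good"),
    "John": (("Low",  "High", "No"),  "Poor"),
    "Mary": (("High", "Low",  "Yes"), "Poor"),
    "Fred": (("Low",  "Low",  "Yes"), "Poor"),
}

def mismatch_distance(a, b):
    # 0/1 scheme: count the attributes that do not match exactly
    return sum(x != y for x, y in zip(a, b))

sam = ("High", "High", "Yes")
ranked = sorted(records.items(), key=lambda kv: mismatch_distance(kv[1][0], sam))
nearest = ranked[:3]   # Joe (d=0), Sue (d=1), Mary (d=1)
vote = Counter(label for _, (_, label) in nearest).most_common(1)[0][0]
```

The majority vote over {Good, Good, Poor} classifies Sam as a Good risk, matching the example.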
Accuracy of prediction depends on the value of k and the choice of distance metric
Most DM products choose a value of 10 as default for k
Distance metric needs to be defined by the miner – this requires some care
In datasets that contain numeric data, be careful with attributes defined on different scale ranges – e.g. Income and Age – as the larger-scaled attribute will otherwise dominate the distance
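One common remedy is min–max normalisation to [0, 1] before computing distances; the income and age values below are hypothetical:

```python
def min_max_scale(values):
    # rescale to [0, 1] so a large-range attribute (e.g. Income)
    # does not dominate a small-range one (e.g. Age)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 55_000, 90_000]   # hypothetical values
ages    = [25, 40, 60]
scaled_incomes = min_max_scale(incomes)
scaled_ages    = min_max_scale(ages)
```

After scaling, both attributes contribute on a comparable [0, 1] range to the Euclidean distance.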
For nominal data with a natural ordering (e.g. age groupings), recognise that the distance between pairs of groupings is not always the same:
d((21, 30), (51, 60)) > d((21, 30), (41, 50))
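One way to capture this is to encode the ordered groupings as ranks and take the absolute rank difference; the grouping labels and rank values below are assumptions for illustration:

```python
# Hypothetical rank encoding of the ordered age groupings
rank = {"21-30": 0, "31-40": 1, "41-50": 2, "51-60": 3}

def ordinal_distance(a, b):
    # groupings further apart in the ordering get a larger distance
    return abs(rank[a] - rank[b])

# d((21,30), (51,60)) = 3  >  d((21,30), (41,50)) = 2
```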
Chapter 4, Data Mining: Practical Machine Learning Tools and Techniques (3rd edition) / Ian Witten, Eibe Frank; Elsevier, 2011.
Chapter 4, Introduction to Data Mining / Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education, Inc, 2006.
[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a class label (A, B or C), and an unseen case with attributes Atr1 … AtrN whose class is to be predicted]
[Figure: an unknown record X and its (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
d(p, q) = sqrt( Σ_i (p_i − q_i)² )
Distance matrix (number of mismatching attributes):

       Joe  Sue  John  Mary  Fred
Joe     0    1    2     1     2
Sue     1    0    1     2     1
John    2    1    0     3     2
Mary    1    2    3     0     1
Fred    2    1    2     1     0
Name  Debt  Income  Married?  Actual Risk  Prob. Good Risk  Prob. Poor Risk  Predicted Risk
Joe   High  High    Yes       Good         0.82             0.18             Good
Sue   Low   High    Yes       Good         0.69             0.31             Good
John  Low   High    No        Poor         0                1.0              Poor
Mary  High  Low     Yes       Poor         0                1.0              Poor
Fred  Low   Low     Yes       Poor         0                1.0              Poor