
408216 Data Mining and Knowledge Engineering

Data Mining Algorithms for Classification


Instance-based classifiers:

Store the training records
Use the stored records to predict the class labels of unseen cases

Examples:

Rote learner
Memorizes the entire training data and classifies a record only if its attributes match one of the training examples exactly

Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification

Basic idea:

If it walks like a duck, quacks like a duck, then it’s probably a duck

Training Records
Test Record

Compute Distance

Choose k of the “nearest” records

The k-nearest neighbours of a record x are the data points that have the k smallest distances to x

Compute distance between two points:

Euclidean distance

Determine the class from nearest neighbor list

Take the majority vote of class labels among the k nearest neighbours
Optionally weigh each vote according to distance, e.g.:
weight factor w = 1/d
weight factor w = 1 − d
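The two voting schemes above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function names are hypothetical, and a small epsilon is assumed to guard the w = 1/d weight against zero distances:

```python
import math
from collections import Counter

def euclidean(p, q):
    # Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k, weighted=False):
    """train is a list of (point, label) pairs; x is the query point."""
    # the k records with the smallest distance to x
    neighbours = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = Counter()
    for point, label in neighbours:
        d = euclidean(point, x)
        # plain majority vote, or a vote weighted by w = 1/d
        # (epsilon avoids division by zero when d == 0)
        votes[label] += 1.0 / (d + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]
```

With distance weighting, a training point identical to the query dominates the vote regardless of k, which is why weighted voting softens the choice of k.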

Choosing the value of k:

If k is too small, the classifier is sensitive to noise points
If k is too large, the neighbourhood may include points from other classes

For nominal data, Euclidean distance cannot be used; instead, a distance measure based on whether the attribute values match exactly can be used
For the credit-scoring example we will use k = 3, and the distance between two instances is based on a 0/1 scheme (0 per attribute if the values match, 1 if they do not)

To classify Sam (Debt=High, Income=High, Married = Yes), we find the 3 nearest neighbours to Sam, who are Joe, Sue, and Mary
The risk values of the neighbours are Good, Good and Poor, which means that Sam is classified as a Good risk
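The Sam example can be reproduced directly with the 0/1 matching distance. This is a sketch using the records from the credit-scoring table in these notes; the helper names are hypothetical:

```python
from collections import Counter

# Credit-scoring records: (Name, Debt, Income, Married?, Risk)
train = [
    ("Joe",  "High", "High", "Yes", "Good"),
    ("Sue",  "Low",  "High", "Yes", "Good"),
    ("John", "Low",  "High", "No",  "Poor"),
    ("Mary", "High", "Low",  "Yes", "Poor"),
    ("Fred", "Low",  "Low",  "Yes", "Poor"),
]

def match_distance(a, b):
    # 0/1 scheme: count the attributes that do not match exactly
    return sum(x != y for x, y in zip(a, b))

sam = ("High", "High", "Yes")  # Sam's (Debt, Income, Married?) values
# rank every record by its 0/1 distance to Sam
ranked = sorted(train, key=lambda r: match_distance(r[1:4], sam))
nearest = ranked[:3]                   # Joe (0), Sue (1), Mary (1)
vote = Counter(r[4] for r in nearest)  # Good, Good, Poor
print([r[0] for r in nearest], vote.most_common(1)[0][0])
# → ['Joe', 'Sue', 'Mary'] Good
```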

Accuracy of prediction depends on the value of k and the choice of distance metric
Most DM products choose a value of 10 as the default for k
The distance metric needs to be defined by the miner – this requires some care
In datasets that contain numeric data, be careful with attributes defined on different scale ranges – e.g. Income and Age
For ordinal data (e.g. age groupings), recognise that the distances between pairs of groupings are not always the same:

d((21, 30), (51, 60)) > d((21, 30), (41, 50))
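One common way to put numeric attributes such as Income and Age on a comparable footing before computing distances is min-max scaling to [0, 1]. This is a sketch of one option (the sample values are hypothetical), not the only normalisation scheme:

```python
def min_max_scale(column):
    # Rescale a list of numeric values onto the [0, 1] range
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

ages    = [21, 35, 60]            # hypothetical Age values
incomes = [18000, 52000, 120000]  # hypothetical Income values

scaled_ages    = min_max_scale(ages)     # [0.0, ~0.359, 1.0]
scaled_incomes = min_max_scale(incomes)  # now on the same 0..1 scale
```

Without such scaling, Income differences in the thousands would swamp Age differences in the tens, so the Euclidean distance would effectively ignore Age.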


Chapter 4, Data Mining: Practical Machine Learning Tools and Techniques (3rd edition) / Ian Witten, Eibe Frank; Elsevier, 2011.
Chapter 4, Introduction to Data Mining / Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education, Inc, 2006.


Distance matrix (0/1 matching distance between each pair of records):

       Joe  Sue  John  Mary  Fred
Joe     0    1    2     1     2
Sue     1    0    1     2     1
John    2    1    0     3     2
Mary    1    2    3     0     1
Fred    2    1    2     1     0

Name   Debt   Income   Married?   Risk Actual   P(Good Risk)   P(Poor Risk)   Risk Predicted
Joe    High   High     Yes        Good          0.82           0.18           Good
Sue    Low    High     Yes        Good          0.69           0.31           Good
John   Low    High     No         Poor          0              1.0            Poor
Mary   High   Low      Yes        Poor          0              1.0            Poor
Fred   Low    Low      Yes        Poor          0              1.0            Poor

[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a Class label (A, B or C), and an unseen case with attributes Atr1 … AtrN whose class is to be predicted]

[Figure: the (a) 1-nearest, (b) 2-nearest and (c) 3-nearest neighbours of an unknown record X]

Euclidean distance between two points p and q:

d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )