408216 Data Mining and Knowledge Engineering
Data Mining Algorithms for Classification
Store the training records
Use the stored training records to predict the class label of unseen cases
Examples:
Rote-learner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: training records and a test record – compute the distance from the test record to every training record, then choose the k "nearest" records]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Compute distance between two points:
Euclidean distance
Determine the class from nearest neighbor list
take the majority vote of class labels among the k-nearest neighbors
Alternatively, weigh each vote according to distance, e.g.
weight factor w = 1/d
weight factor w = 1 − d (sensible when distances are normalised to [0, 1])
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
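The steps above can be sketched in Python; this is a minimal illustration, and the names `euclidean` and `knn_classify` are illustrative, not from the source:

```python
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_point, k=3, weighted=False):
    # train: list of (point, label) pairs
    ranked = sorted((euclidean(p, test_point), label) for p, label in train)
    votes = Counter()
    for d, label in ranked[:k]:
        # majority vote, optionally weighted by w = 1/d
        # (a zero distance gets infinite weight: exact match dominates)
        votes[label] += (1.0 / d if d > 0 else math.inf) if weighted else 1.0
    return votes.most_common(1)[0][0]
```

For example, with three stored points of class A near the origin and two of class B far away, a test point near the origin is classified as A by its 3 nearest neighbours.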
For nominal data, Euclidean distance cannot be used; instead, a distance measure based on whether or not the attribute values match exactly can be used
For the credit scoring example we will use k = 3, and the distance between two instances is the number of attributes on which they differ (a 0/1 contribution per attribute)
To classify Sam (Debt=High, Income=High, Married = Yes), we find the 3 nearest neighbours to Sam, who are Joe, Sue, and Mary
The risk values of the neighbours are Good, Good and Poor, which means that Sam is classified as a Good risk
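The Sam example can be reproduced with a short sketch; the records come from the table in this lecture, while the function names are illustrative:

```python
from collections import Counter

# Training records from the credit scoring example:
# (Debt, Income, Married?) -> actual Risk
records = {
    "Joe":  (("High", "High", "Yes"), "Good"),
    "Sue":  (("Low",  "High", "Yes"), "Good"),
    "John": (("Low",  "High", "No"),  "Poor"),
    "Mary": (("High", "Low",  "Yes"), "Poor"),
    "Fred": (("Low",  "Low",  "Yes"), "Poor"),
}

def mismatch_distance(a, b):
    # 0/1 scheme: count the attributes that do not match exactly
    return sum(x != y for x, y in zip(a, b))

sam = ("High", "High", "Yes")
ranked = sorted(records.items(), key=lambda kv: mismatch_distance(kv[1][0], sam))
nearest = ranked[:3]   # Joe (d=0), Sue (d=1), Mary (d=1)
vote = Counter(label for _, (_, label) in nearest).most_common(1)[0][0]
```

The majority vote over {Good, Good, Poor} classifies Sam as a Good risk, matching the example.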
Accuracy of prediction depends on the value of k and the choice of distance metric
Most DM products choose a value of 10 as default for k
Distance metric needs to be defined by the miner – this requires some care
In datasets that contain numeric data, be careful with attributes defined on different scale ranges – e.g. Income and Age – as the larger-scaled attribute will otherwise dominate the distance
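One common remedy is min–max normalisation to [0, 1] before computing distances; the income and age values below are hypothetical:

```python
def min_max_scale(values):
    # rescale to [0, 1] so a large-range attribute (e.g. Income)
    # does not dominate a small-range one (e.g. Age)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 55_000, 90_000]   # hypothetical values
ages    = [25, 40, 60]
scaled_incomes = min_max_scale(incomes)
scaled_ages    = min_max_scale(ages)
```

After scaling, both attributes contribute on a comparable [0, 1] range to the Euclidean distance.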
For nominal data with a natural ordering (e.g. age groupings), recognise that the distance between pairs of groupings is not always the same:
d((21, 30), (51, 60)) > d((21, 30), (41, 50))
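One way to capture this is to encode the ordered groupings as ranks and take the absolute rank difference; the grouping labels and rank values below are assumptions for illustration:

```python
# Hypothetical rank encoding of the ordered age groupings
rank = {"21-30": 0, "31-40": 1, "41-50": 2, "51-60": 3}

def ordinal_distance(a, b):
    # groupings further apart in the ordering get a larger distance
    return abs(rank[a] - rank[b])

# d((21,30), (51,60)) = 3  >  d((21,30), (41,50)) = 2
```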
Chapter 4, Data Mining: Practical Machine Learning Tools and Techniques (3rd edition) / Ian Witten, Eibe Frank; Elsevier, 2011.
Chapter 4, Introduction to Data Mining / Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Pearson Education, Inc, 2006.
[Figure: a set of stored cases, each with attributes Atr1 … AtrN and a class label (A, B or C), and an unseen case with attributes Atr1 … AtrN whose class is to be predicted]
[Figure: an unknown record X and its (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
d(p, q) = sqrt( Σ_i (p_i − q_i)² )
Distance matrix (number of mismatching attributes):

       Joe  Sue  John  Mary  Fred
Joe     0    1    2     1     2
Sue     1    0    1     2     1
John    2    1    0     3     2
Mary    1    2    3     0     1
Fred    2    1    2     1     0
Name  Debt  Income  Married?  Actual Risk  Prob. Good Risk  Prob. Poor Risk  Predicted Risk
Joe   High  High    Yes       Good         0.82             0.18             Good
Sue   Low   High    Yes       Good         0.69             0.31             Good
John  Low   High    No        Poor         0                1.0              Poor
Mary  High  Low     Yes       Poor         0                1.0              Poor
Fred  Low   Low     Yes       Poor         0                1.0              Poor