Data Mining (EECS 4412)
K Nearest Neighbour Classifier

Parke Godfrey
EECS
Lassonde School of Engineering, York University

Thanks to Professor Aijun An for the creation & use of these slides.
K-Nearest Neighbor Classifiers

- Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
- A new example is assigned to the most common class among the (K) examples that are most similar to it.
K-Nearest Neighbor Algorithm (k-NN)

To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the k nearest examples to E in the training set.
- Assign E to the most common class among its k nearest neighbors.
[Figure: training points labeled Response / No response; the new example is assigned Class: Response]
K-Nearest Neighbor: Instance-Based Learning

- No model is built: store all training examples.
- Any processing is delayed until a new instance must be classified.

[Figure: stored training points labeled Response / No response; the new example's class: Response]
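A minimal Python sketch of this lazy scheme (the class and helper names are mine, not from the course):

    import math
    from collections import Counter

    def euclidean(x, y):
        # D(X, Y) = sqrt(sum over i of (x_i - y_i)^2)
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    class KNNClassifier:
        """Instance-based: 'training' only stores the examples."""
        def __init__(self, k=3):
            self.k = k

        def fit(self, examples, labels):
            # No model is built; all work is deferred to predict().
            self.examples, self.labels = list(examples), list(labels)
            return self

        def predict(self, e):
            # 1. Distance from e to every stored training example.
            dists = [(euclidean(e, x), c) for x, c in zip(self.examples, self.labels)]
            # 2. Keep the k nearest.
            nearest = sorted(dists)[: self.k]
            # 3. Majority class among those k neighbors.
            return Counter(c for _, c in nearest).most_common(1)[0][0]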
Distance Between Neighbors
- Each example is represented with a set of numerical attributes.

  John:   Age=35, Income=95K,  No. of credit cards=3
  Rachel: Age=41, Income=215K, No. of credit cards=2

- "Closeness" can be defined in terms of the distance between two examples.
- The Euclidean distance between X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n) is defined as:

      D(X, Y) = sqrt( Σ_{i=1..n} (x_i - y_i)^2 )

  Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
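A quick check of this computation (assumes Python 3.8+ for math.dist; incomes written out in dollars):

    import math

    john   = (35,  95_000, 3)    # Age, Income, No. of credit cards
    rachel = (41, 215_000, 2)

    d = math.dist(john, rachel)  # sqrt((35-41)^2 + (95000-215000)^2 + (3-2)^2)
    print(d)                     # ~120000.00 -- income dominates the distance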
Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John        35      35K                  3   No
Rachel      22      50K                  2   Yes
Hannah      63     200K                  1   No
Tom         59     170K                  1   No
Nellie      25      40K                  4   Yes
David       37      50K                  2   ?
Example (3-NN)

Customer   Age   Income (K)   No. cards   Response   Distance from David
John        35           35           3   No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2]   = 15.16
Rachel      22           50           2   Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2]   = 15
Hannah      63          200           1   No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2]  = 152.23
Tom         59          170           1   No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2]  = 122
Nellie      25           40           4   Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2]   = 15.74
David       37           50           2   Yes

The 3 nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is classified as Yes.
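Feeding this table to the KNNClassifier sketch above (my illustration, not course code) reproduces the prediction for David:

    train = {
        "John":   ((35,  35, 3), "No"),
        "Rachel": ((22,  50, 2), "Yes"),
        "Hannah": ((63, 200, 1), "No"),
        "Tom":    ((59, 170, 1), "No"),
        "Nellie": ((25,  40, 4), "Yes"),
    }
    X = [x for x, _ in train.values()]
    y = [c for _, c in train.values()]

    knn = KNNClassifier(k=3).fit(X, y)
    print(knn.predict((37, 50, 2)))  # 'Yes' -- neighbors: Rachel (15), John (15.16), Nellie (15.74)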
A Problem and its Solution

John:   Age=35, Income=95K,  No. of credit cards=3
Rachel: Age=41, Income=215K, No. of credit cards=2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95,000 - 215,000)^2 + (3 - 2)^2 ]

- Distance between examples can be dominated by attributes with relatively large values (e.g., income in our example).
- It is important to normalize features (e.g., map numbers to values between 0 and 1).
  Example: Income. If highest income = 200K and lowest income = 0, then David's income is normalized to 50/200, John's income to 35/200, etc.
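A min-max normalization helper along these lines (the function name is mine; lo=0 mirrors the slide's income example):

    def min_max_normalize(values, lo=None, hi=None):
        # Map each value into [0, 1] via (v - lo) / (hi - lo).
        lo = min(values) if lo is None else lo
        hi = max(values) if hi is None else hi
        return [(v - lo) / (hi - lo) for v in values]

    incomes = [35, 50, 200, 170, 40, 50]             # in K; John..Nellie, David last
    print(min_max_normalize(incomes, lo=0, hi=200))  # [0.175, 0.25, 1.0, 0.85, 0.2, 0.25]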
k-NN with Normalization of Variables

Customer   Age          Income (K)     No. cards   Response
John       35/63=0.55   35/200=0.175   3/4=0.75    No
Rachel     22/63=0.34   50/200=0.25    2/4=0.5     Yes
Hannah     63/63=1      200/200=1      1/4=0.25    No
Tom        59/63=0.93   170/200=0.85   1/4=0.25    No
Nellie     25/63=0.39   40/200=0.2     4/4=1       Yes
David      37/63=0.58   50/200=0.25    2/4=0.5     Yes
Another Problem with k-NN

- Distance works naturally with numerical attributes:
  D(David, John) = sqrt[ (35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2 ] = 15.16
- What if we have nominal attributes? Example: married.
Customer   Married   Income (K)   No. cards   Response
John       Yes               35           3   No
Rachel     No                50           2   Yes
Hannah     No               200           1   No
Tom        Yes              170           1   No
Nellie     No                40           4   Yes
David      Yes               50           2   ?
K-NN with Nominal Attributes

- Method 1: Convert nominal attributes to numerical attributes.
  E.g., yes ⇒ 1 and no ⇒ 0; blue ⇒ 1, yellow ⇒ 2, red ⇒ 3, etc. Problem?
- Method 2: Use a per-attribute distance (see the sketch below):

      Distance(x, y) = Σ_{i=1..m} dist(x_i, y_i)

  where
      dist(x_i, y_i) = 0                         if x_i and y_i are nominal and x_i = y_i
      dist(x_i, y_i) = 1                         if x_i and y_i are nominal and x_i ≠ y_i
      dist(x_i, y_i) = |norm(x_i) - norm(y_i)|   if x_i and y_i are continuous

  and m is the number of attributes.
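A sketch of Method 2 in Python (treating any string attribute as nominal is my shortcut; continuous values are assumed already normalized):

    def mixed_distance(x, y):
        # Distance(x, y) = sum over the m attributes of dist(x_i, y_i).
        total = 0.0
        for xi, yi in zip(x, y):
            if isinstance(xi, str):              # nominal attribute
                total += 0.0 if xi == yi else 1.0
            else:                                # continuous attribute, pre-normalized
                total += abs(xi - yi)
        return total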
Example
- Distance between David and John:
  D(David, John) = 0 + |0.25 - 0.175| + |0.5 - 0.75| = 0.325
Customer   Married   Income (K)     No. cards   Response
John       Yes       35/200=0.175   3/4=0.75    No
Rachel     No        50/200=0.25    2/4=0.5     Yes
Hannah     No        200/200=1      1/4=0.25    No
Tom        Yes       170/200=0.85   1/4=0.25    No
Nellie     No        40/200=0.2     4/4=1       Yes
David      Yes       50/200=0.25    2/4=0.5     ?
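Applying mixed_distance from the sketch above to David and John reproduces the hand computation:

    david = ("Yes", 50/200, 2/4)   # Married, Income, No. cards (normalized)
    john  = ("Yes", 35/200, 3/4)

    print(mixed_distance(david, john))  # 0.325 = 0 + |0.25 - 0.175| + |0.5 - 0.75|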
K-Nearest Neighbor Classifier Strengths and Weaknesses

Strengths:
- Simple to implement and use.
- Comprehensible: easy to explain the prediction.
- Robust to noisy data by averaging the k nearest neighbors.
- Can be used to do regression (how? see the sketch below).
- Can learn complex target functions.

Weaknesses:
- Needs a lot of space to store all the examples.
- Classifying a new example takes more time than with a prebuilt model (the distance from the new example to all stored examples must be calculated and compared).
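On the regression question: one common answer (my sketch, not spelled out on the slide) is to average the k nearest neighbors' numeric targets instead of taking a majority vote:

    import math

    def knn_regress(examples, targets, e, k=3, dist=math.dist):
        # Predict the mean of the k nearest neighbors' numeric target values.
        nearest = sorted(zip(examples, targets), key=lambda pair: dist(e, pair[0]))[:k]
        return sum(t for _, t in nearest) / k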
Decision Tree vs K-Nearest Neighbor Classifier

Classification Tree Model:
[Figure: decision tree splitting on Age < 40, Income > 100K, and No. cards > 2, with leaves labeled Response / No Response]
K-Nearest Neighbors:
[Figure: k-NN view of the same data; points labeled Response / No response; the new example's class: Response]