ETX2250/ETF5922: Data Visualization and Analytics
k Nearest Neighbours
Lecturer:
Department of Econometrics and Business Statistics
Week 11
Nearest Neighbours
Recall that we discussed smoothing by nearby points when we covered density plots. Now we will use nearest neighbours in the context of classification.
We illustrate with 2 predictors, since these are easy to understand.
The data is a subsample of 100 observations from the credit default data.
Credit Data
Default or not?
Basic Idea
For the first point (limit 350000, age 50), most nearby points in the training data are not defaults (orange). Predict no default.
For the second point (limit 50000, age 45), most nearby points in the training data are defaults (black). Predict default.
More formally
Concentrate on the k nearest points, also known as the k Nearest Neighbours or k-NN. Let $N_{x,k}$ be the set of the k training points closest to $x$.
The predicted probability of Default is

$$\hat{p}(x) = \frac{1}{k} \sum_{i \in N_{x,k}} I(y_i = \text{Default})$$
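To make the formula concrete, here is a minimal hand-rolled sketch in R; knn_prob and x0 are our own hypothetical names, and train_x / train_y are assumed to be a numeric matrix of predictors and a factor of labels.

# Sketch: k-NN predicted probability of Default at a query point x0
knn_prob <- function(train_x, train_y, x0, k = 4) {
  d <- sqrt(colSums((t(train_x) - x0)^2))  # Euclidean distance to every training point
  neighbours <- order(d)[1:k]              # indices of the k closest points
  mean(train_y[neighbours] == "Default")   # proportion of Defaults among them
}

For example, knn_prob(as.matrix(train_x), train_y, c(350000, 50), k = 4) would estimate the default probability at a limit of 350000 and age 50 (unstandardised here; standardisation is covered later).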
What do we mean by near?
Recall the distance metrics from Week 7. Which would be appropriate here?
kNN requires many distance calculations – we need the distance between each point we want to classify and all of the points we have observed.
Euclidean distance is computationally cheap to calculate, so it is the most common choice (and what we use!). Remember that the Euclidean distance between $u$ and $v$ is
$$\sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_m - v_m)^2}$$

where $u = \{u_1, u_2, \ldots, u_m\}$ and $v = \{v_1, v_2, \ldots, v_m\}$.
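In R this is a one-liner; a quick sketch with two made-up vectors:

u <- c(1, 2, 3)
v <- c(4, 6, 8)
sqrt(sum((u - v)^2))  # sqrt(9 + 16 + 25), approximately 7.07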
Non-parametric
kNN is a non-parametric method (linear regression is a parametric method).
This means that, unlike linear regression, we do not assume a particular form for the relationship between the predictors $x_1, x_2, \ldots, x_m$ and the outcome.
This does NOT mean there are no assumptions at all. k nearest neighbours assumes that points that are close together in predictor space are more similar.
Your turn:
Thinking of the ethics of data-based decision making, what is one benefit of a simple algorithm like kNN?
First example (k=4)
Prediction
The four nearest neighbours are all non-defaults.
Therefore the predicted default probability for an individual with a limit of 350000 and age 50 is zero. We predict this individual does not default.
Second example (k=4)
Prediction
The four nearest neighbours include three defaults and one non-default.
Therefore the predicted default probability for an individual with a limit of 50000 and age 45 is 3/4 = 0.75. Using a threshold of 0.5, we predict this individual does default.
In practice a different threshold can be used.
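As a quick sketch, thresholding in R amounts to one line; phat is a hypothetical vector of predicted default probabilities.

phat <- c(0.00, 0.75, 0.40)  # hypothetical k-NN probabilities
ifelse(phat >= 0.5, "Default", "No Default")
## [1] "No Default" "Default"    "No Default"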
Your Turn
Consider the case where k = 10.
What is the predicted probability of default for the first example (350000 limit, 50 years age)? What is the predicted probability of default for the second example (50000 limit, 45 years age)?
First example (k=10)
Second example (k=10)
Answers
What is the predicted probability of default for the first example (350000 limit, 50 years age)? The predicted probability of default is 0.
What is the predicted probability of default for the second example (50000 limit, 45 years age)? The predicted probability of default is 6/10 = 0.6.
Decision Boundary
Data again
# Packages assumed: tidyverse (for the pipe, tibble and ggplot2) and ggthemes
library(tidyverse)
library(ggthemes)

grid <- Credit[, c(2, 6)]            # keep the limit and age columns
colnames(grid) <- c('Limit', 'Age')
grid %>%
  as_tibble() %>%
  add_column(Class = as.factor(pull(Credit, 25))) %>%  # default status
  ggplot(aes(x = Limit, y = Age, col = Class)) +
  geom_point() +
  scale_color_colorblind()
Decision Boundaries
For the next few slides, a grid of evaluation points has been created. If k-NN predicts "Default" the grid point is coloured black.
If k-NN predicts "No default" the grid point is coloured yellow. Plotting these gives an idea of the decision boundaries.
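A sketch of how such a grid can be built and classified, assuming the train_x_std and train_y objects defined later in these slides; the 100 x 100 resolution is an arbitrary choice.

library(class)

# Build a grid covering the (standardised) predictor space
lims <- apply(train_x_std, 2, range)
eval_grid <- expand.grid(
  Limit = seq(lims[1, 1], lims[2, 1], length.out = 100),
  Age   = seq(lims[1, 2], lims[2, 2], length.out = 100)
)

# Classify every grid point, then plot coloured by predicted class
eval_grid$Class <- knn(train_x_std, eval_grid, train_y, k = 3)
ggplot(eval_grid, aes(x = Limit, y = Age, col = Class)) + geom_point(size = 0.5)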
Decision boundary (k=1)
Decision boundary (k=3)
Decision boundary (k=11)
Decision boundary (k=17)
Choosing k
Smaller values of k lead to very complicated decision boundaries.
Larger values of k provide smoother boundaries, although there are still some unstable regions.
Larger values of k also slow down computation for big datasets.
Choosing k
A general rule of thumb is to choose $k \approx \sqrt{n}$.
It is also advised to choose a value of k that is NOT a multiple of the number of classes. This avoids ties.
When ties are present, a class is chosen randomly.
For the two-class problem, choose an odd k.
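A quick sketch of this rule of thumb in R; n_train is a hypothetical training-set size.

n_train <- 250               # hypothetical training-set size
k <- round(sqrt(n_train))    # rule of thumb: about sqrt(n), here 16
if (k %% 2 == 0) k <- k + 1  # make k odd to avoid ties with two classes
k
## [1] 17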
Choosing k
Finally, the data can be split into a training sample and a test sample.
We can then try many possible values of k.
The optimal value can be selected so that the test misclassification error rate is minimised. Another alternative is to use cross-validation to find the k that minimises the prediction error.
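A minimal sketch of choosing k by test misclassification, using the train_x_std, test_x_std, train_y and test_y objects introduced later in these slides.

library(class)

ks <- seq(1, 25, by = 2)  # candidate odd values of k
test_error <- sapply(ks, function(k) {
  yhat <- knn(train_x_std, test_x_std, train_y, k = k)
  mean(test_y != yhat)    # test misclassification rate
})
ks[which.min(test_error)] # the k with the lowest test error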
Instability of kNN
High variability
A problem of k-NN classification is that it is highly variable.
Suppose we use a fixed value of k but make small changes to the training data. This can change the decision boundaries dramatically.
This is much more pronounced for smaller values of k.
The next few slides show different subsamples of training data.
Original Training Sample (k=3)
Different Training Sample (k=3)
Original Training Sample (k=11)
Different Training Sample (k=11)
An ethical problem
A classifier with such high variability obviously poses statistical problems.
It also poses ethical issues.
Suppose the algorithm is used to decide who receives a loan, or who is audited by the tax office.
It should be possible to clearly explain to a person why their loan application was rejected or why they were chosen for audit.
Having such a sensitive algorithm makes this difficult.
k-NN Classification in R
Package
The k-NN algorithm can be implemented using the class package.
We will go through an example using data that has already been cleaned and is in the file knnexample.RData.
Load this file using

library(here)  # for project-relative file paths
load(here('data/knnexample.RData'))
Data
A data frame of 250 randomly selected training observations for the predictors LIMIT_BAL and AGE (train_x)
The values of the target variable for the training data, stored as a factor (train_y)
A data frame of 250 randomly selected test observations for the predictors LIMIT_BAL and AGE (test_x)
The values of the target variable for the test data, stored as a factor (test_y)
Standardisation
An important issue with k-NN is that distances are sensitive to the units of measurement. For this reason we should standardise, so that all predictors have a mean of 0 and a standard deviation of 1.
train_x_std <- scale(train_x)                       # centre and scale the training data
mean_train_x <- attr(train_x_std, "scaled:center")  # training means
std_train_x <- attr(train_x_std, "scaled:scale")    # training standard deviations
Standardise test data
The test data should also be standardised.
This transformation must be identical for the training and test data, so we use the mean and standard deviation of the training data.

test_x_std <- scale(test_x,
                    center = mean_train_x,
                    scale  = std_train_x)
Now the data are ready to be used in the knn function
Using knn
With k=1
library(class)  # provides the knn function

yhat_k1 <- knn(train_x_std, test_x_std, train_y, k = 1)
print(yhat_k1)
## [1]  No Default Default    Default    Default    No Default No Default
## [7]  No Default No Default No Default Default    No Default No Default
## [13] No Default No Default No Default Default    No Default No Default
## [19] No Default No Default Default    No Default Default    Default
## [25] No Default No Default Default    No Default Default    No Default
## [31] Default    No Default No Default No Default No Default No Default
## [37] Default    No Default No Default Default    Default    No Default
## [43] No Default No Default No Default No Default Default    No Default
## [49] No Default No Default Default    No Default No Default No Default
## [55] No Default No Default No Default Default    No Default No Default
## [61] Default    No Default No Default Default    No Default No Default
## [67] Default    Default    No Default No Default No Default No Default
## ... (output truncated)
Test misclassification rate
To compute the test misclassification rate:
miscl_k1 <- mean(test_y != yhat_k1)
print(miscl_k1)
## [1] 0.32
Overall, 32% of the test sample is incorrectly predicted.
As an exercise, you can now try to obtain the misclassification rate with k = 5.
Answer
yhat_k5 <- knn(train_x_std, test_x_std, train_y, k = 5)
miscl_k5 <- mean(test_y != yhat_k5)
print(miscl_k5)
## [1] 0.292
The misclassification rate when k = 5 is 0.292.
Sensitivity and specificity
It is also worth reporting results in a cross tabulation.
table(test_y, yhat_k5)

##             yhat_k5
## test_y       Default No Default
##   Default          8         52
##   No Default      21        169

Letting "Default" be the "positive" class, sensitivity is 8/(8+52) = 0.133 and specificity is 169/(21+169) = 0.889.
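These can also be computed directly from the confusion matrix; a quick sketch:

cm <- table(test_y, yhat_k5)
sensitivity <- cm["Default", "Default"] / sum(cm["Default", ])           # 8/60, about 0.133
specificity <- cm["No Default", "No Default"] / sum(cm["No Default", ])  # 169/190, about 0.889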
Additional features
If the argument prob = TRUE is included, then the function also returns probabilities (as an attribute).
prob_k5 <- knn(train_x_std, test_x_std, train_y, k = 5, prob = TRUE)
Predictions
The output still contains predictions
print(prob_k5)
## [1]  No Default Default    No Default No Default Default    No Default
## [7]  No Default No Default No Default Default    No Default No Default
## [13] No Default No Default No Default No Default No Default No Default
## [19] No Default No Default Default    No Default No Default No Default
## [25] No Default No Default No Default No Default Default    No Default
## [31] No Default No Default No Default Default    Default    No Default
## [37] No Default No Default No Default No Default No Default No Default
## [43] No Default No Default No Default No Default No Default No Default
## [49] No Default No Default No Default Default    Default    No Default
## [55] No Default Default    Default    Default    No Default Default
## [61] Default    No Default Default    Default    No Default No Default
## [67] No Default No Default No Default No Default No Default No Default
## [73] No Default No Default No Default No Default No Default No Default
## [79] No Default No Default No Default No Default No Default No Default
## ... (output truncated)
Probabilities
Probabilities can be stripped out using the attr function
print(attr(prob_k5,"prob"))
## [1] 1.0000000 0.5714286 0.6666667 0.6000000 0.6000000 0.8000000 1.0000000
## [8] 1.0000000 0.6666667 0.6000000 0.8333333 0.8000000 1.0000000 0.5714286
## [15] 1.0000000 0.6666667 1.0000000 1.0000000 0.6000000 1.0000000 0.6000000
## [22] 0.8333333 0.6666667 0.6000000 1.0000000 1.0000000 0.5714286 1.0000000
## [29] 0.6000000 0.5000000 0.6666667 0.8571429 1.0000000 0.6000000 0.6000000
## [36] 0.8000000 0.6666667 0.6000000 0.6000000 0.5000000 0.6000000 0.6666667
## [43] 0.6000000 0.6000000 0.8333333 1.0000000 0.6666667 0.8000000 1.0000000
## [50] 1.0000000 0.8000000 0.6000000 0.6000000 1.0000000 1.0000000 0.6000000
## [57] 0.6666667 0.8000000 1.0000000 0.6000000 0.8000000 1.0000000 0.6000000
## [64] 0.6000000 0.6666667 0.8000000 0.5714286 0.5714286 1.0000000 1.0000000
## [71] 0.6666667 0.6000000 1.0000000 0.6666667 0.8000000 0.8000000 1.0000000
## [78] 0.8000000 0.8000000 1.0000000 0.6666667 0.6000000 1.0000000 0.8000000
## [85] 0.5714286 1.0000000 0.8000000 0.6666667 1.0000000 0.8750000 1.0000000
## [92] 1.0000000 0.8000000 0.7142857 0.6000000 0.7142857 1.0000000 0.8571429
Prediction and Probabilities

Prediction    Probability
No Default    1.0000000
Default       0.5714286
No Default    0.6666667
No Default    0.6000000

For the first observation the probability of No Default is 1, but for the second observation the probability of Default is 0.5714.
The probability is relative to the prediction: it is the proportion of neighbour votes for the winning class.
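If a probability of the Default class specifically is wanted, the vote share can be converted with a small sketch like this:

vote_share <- attr(prob_k5, "prob")  # share of votes for the winning class
p_default <- ifelse(prob_k5 == "Default", vote_share, 1 - vote_share)
head(p_default, 4)  # 0.0000000 0.5714286 0.3333333 0.4000000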
Ties in neighbours
If there are exactly 5 neighbours, all probabilities should be either 0.6, 0.8 or 1. Clearly this is not true above. Why?
There can be ties when finding the nearest neighbours.
Two (or more) observations may be tied as equally fifth closest.
In this case all tied observations are used, so the vote can be out of more than 5 neighbours (giving values such as 4/7 = 0.5714).
Regression
The same principle can be used for regression.
The predicted value is simply the average value of $y$ for the k nearest neighbours. This leads to a highly non-linear regression function.
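A minimal hand-rolled sketch of k-NN regression in R; knn_reg and x0 are hypothetical names, with train_x a numeric matrix and train_y a numeric response (packages such as FNN offer ready-made versions).

# Predict y at x0 as the mean of y over the k nearest training points
knn_reg <- function(train_x, train_y, x0, k = 5) {
  d <- sqrt(colSums((t(train_x) - x0)^2))  # Euclidean distances to x0
  mean(train_y[order(d)[1:k]])             # average y of the k closest points
}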
Computational challenges
Where does the time to do this analysis go?
Time to work with the training data: negligible (just standardisation).
Time to make predictions: extensive (we need to work out the distance between each observation we want to predict and every training observation).
This is called being a "lazy learner" (learning is delayed until a prediction is needed), and it can impact implementation in real-time applications.
Lazy learning is useful when real-time updates to the training data are desirable.
One solution is to reduce the dimension (remember our clustering lectures?); see the sketch below.
There are also computational issues with lots of predictors (large $m$).
This is known as the curse of dimensionality.
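A sketch of that dimension-reduction idea, using principal components (base R's prcomp) before k-NN; here train_x has only two predictors, so imagine a wider data frame, and the choice of two components is arbitrary.

# Reduce the predictors to a few principal components, then run k-NN
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)
train_pc <- pca$x[, 1:2]                          # component scores for training data
test_pc <- predict(pca, newdata = test_x)[, 1:2]  # same rotation applied to test data
yhat_pc <- knn(train_pc, test_pc, train_y, k = 5)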
Conclusion
Using k-NN is a simple but often effective method for classification. There are limitations:
Complicated decision boundaries.
High variance.
It also works poorly when there is a large number of predictors.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.