ETX2250/ETF5922: Data Visualization and Analytics
k Nearest Neighbours
Lecturer:
Department of Econometrics and Business Statistics
Week 11
Nearest Neighbours
Recall that we discussed smoothing by nearby points when we covered density plots. Now we will use nearest neighbours in the context of classification.
We illustrate with 2 predictors, since these are easy to understand.
The data is a subsample of 100 observations from the credit default data.
Credit Data
Default or not?
Basic Idea
For the first point (limit 350000, age 50), most nearby points in the training data are not defaults (orange). Predict no default.
For the second point (limit 50000, age 45), most nearby points in the training data are defaults (black). Predict default.
More formally
Concentrate on the k nearest points, also known as the k Nearest Neighbours or k-NN. Let $N_{x,k}$ be the set of the k training points closest to $x$.
The predicted probability of Default is

$$\hat{p}(x) = \frac{1}{k} \sum_{i \in N_{x,k}} I(y_i = \text{Default})$$
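To make the formula concrete, here is a minimal hand-rolled sketch in R; knn_prob and x0 are our own hypothetical names, and train_x / train_y are assumed to be a numeric matrix of predictors and a factor of labels.

# Sketch: k-NN predicted probability of Default at a query point x0
knn_prob <- function(train_x, train_y, x0, k = 4) {
  d <- sqrt(colSums((t(train_x) - x0)^2))  # Euclidean distance to every training point
  neighbours <- order(d)[1:k]              # indices of the k closest points
  mean(train_y[neighbours] == "Default")   # proportion of Defaults among them
}

For example, knn_prob(as.matrix(train_x), train_y, c(350000, 50), k = 4) would estimate the default probability at a limit of 350000 and age 50 (unstandardised here; standardisation is covered later).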
What do we mean by near?
Recall the distance metrics from Week 7. Which would be appropriate here?
kNN requires many distance calculations – we need the distance between each point we want to classify and all of the points we have observed.
Euclidean distance is computationally cheap to calculate, so it is the most common choice (and what we use!). Remember that the Euclidean distance between $u$ and $v$ is
$$\sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_m - v_m)^2}$$

where $u = \{u_1, u_2, \ldots, u_m\}$ and $v = \{v_1, v_2, \ldots, v_m\}$.
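In R this is a one-liner; a quick sketch with two made-up vectors:

u <- c(1, 2, 3)
v <- c(4, 6, 8)
sqrt(sum((u - v)^2))  # sqrt(9 + 16 + 25), approximately 7.07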
Non-parametric
kNN is a non-parametric method (linear regression is a parametric method).
This means that, unlike linear regression, we do not assume a particular form for the relationship between the predictors $x_1, x_2, \ldots, x_m$ and the outcome.
This does NOT mean there are no assumptions at all. k nearest neighbours assumes that points that are close together in predictor space are more similar.
Your turn:
Thinking of the ethics of data-based decision making, what is one benefit of a simple algorithm like kNN?
First example (k=4)
Prediction
The four nearest neighbours are all non-defaults.
Therefore the predicted default probability for an individual with a limit of 350000 and age 50 is zero. We predict this individual does not default.
Second example (k=4)
Prediction
The four nearest neighbours include three defaults and one non-default.
Therefore the predicted default probability for an individual with a limit of 50000 and age 45 is 3/4 = 0.75. Using a threshold of 0.5, we predict this individual does default.
In practice a different threshold can be used.
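As a quick sketch, thresholding in R amounts to one line; phat is a hypothetical vector of predicted default probabilities.

phat <- c(0.00, 0.75, 0.40)  # hypothetical k-NN probabilities
ifelse(phat >= 0.5, "Default", "No Default")
## [1] "No Default" "Default"    "No Default"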
Your Turn
Consider the case where k = 10.
What is the predicted probability of default for the first example (350000 limit, 50 years age)? What is the predicted probability of default for the second example (50000 limit, 45 years age)?
First example (k=10)
Second example (k=10)
Answers
What is the predicted probability of default for the first example (350000 limit, 50 years age)? The predicted probability of default is 0.
What is the predicted probability of default for the second example (50000 limit, 45 years age)? The predicted probability of default is 6/10 = 0.6.
Decision Boundary
Data again
# Packages assumed: tidyverse (for the pipe, tibble and ggplot2) and ggthemes
library(tidyverse)
library(ggthemes)

grid <- Credit[, c(2, 6)]            # keep the limit and age columns
colnames(grid) <- c('Limit', 'Age')
grid %>%
  as_tibble() %>%
  add_column(Class = as.factor(pull(Credit, 25))) %>%  # default status
  ggplot(aes(x = Limit, y = Age, col = Class)) +
  geom_point() +
  scale_color_colorblind()
Decision Boundaries
For the next few slides, a grid of evaluation points has been created. If k-NN predicts "Default" the grid point is coloured black.
If k-NN predicts "No default" the grid point is coloured yellow. Plotting these gives an idea of the decision boundaries.
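A sketch of how such a grid can be built and classified, assuming the train_x_std and train_y objects defined later in these slides; the 100 x 100 resolution is an arbitrary choice.

library(class)

# Build a grid covering the (standardised) predictor space
lims <- apply(train_x_std, 2, range)
eval_grid <- expand.grid(
  Limit = seq(lims[1, 1], lims[2, 1], length.out = 100),
  Age   = seq(lims[1, 2], lims[2, 2], length.out = 100)
)

# Classify every grid point, then plot coloured by predicted class
eval_grid$Class <- knn(train_x_std, eval_grid, train_y, k = 3)
ggplot(eval_grid, aes(x = Limit, y = Age, col = Class)) + geom_point(size = 0.5)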
Decision boundary (k=1)
Decision boundary (k=3)
Decision boundary (k=11)
Decision boundary (k=17)
Choosing k
Smaller values of k lead to very complicated decision boundaries.
Larger values of k provide smoother boundaries, although there are still some unstable regions.
Larger values of k also slow down computation for big datasets.
Choosing k
A general rule of thumb is to choose $k \approx \sqrt{n}$.
It is also advised to choose a value of k that is NOT a multiple of the number of classes. This avoids ties.
When ties are present, a class is chosen randomly.
For the two-class problem, choose an odd k.
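A quick sketch of this rule of thumb in R; n_train is a hypothetical training-set size.

n_train <- 250               # hypothetical training-set size
k <- round(sqrt(n_train))    # rule of thumb: about sqrt(n), here 16
if (k %% 2 == 0) k <- k + 1  # make k odd to avoid ties with two classes
k
## [1] 17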
Choosing k
Finally, the data can be split into a training sample and a test sample.
We can then try many possible values of k.
The optimal value can be selected so that the test misclassification error rate is minimised. Another alternative is to use cross-validation to find the k that minimises the prediction error.
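A minimal sketch of choosing k by test misclassification, using the train_x_std, test_x_std, train_y and test_y objects introduced later in these slides.

library(class)

ks <- seq(1, 25, by = 2)  # candidate odd values of k
test_error <- sapply(ks, function(k) {
  yhat <- knn(train_x_std, test_x_std, train_y, k = k)
  mean(test_y != yhat)    # test misclassification rate
})
ks[which.min(test_error)] # the k with the lowest test error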
Instability of kNN
High variability
A problem of k-NN classification is that it is highly variable.
Suppose we use a fixed value of k but make small changes to the training data. This can change the decision boundaries dramatically.
This is much more pronounced for smaller values of k.
The next few slides show different subsamples of training data.
Original Training Sample (k=3)
Different Training Sample (k=3)
Original Training Sample (k=11)
Different Training Sample (k=11)
An ethical problem
A classifier with such high variability obviously poses statistical problems.
It also poses ethical issues.
Suppose the algorithm is used to decide who receives a loan, or who is audited by the tax office.
It should be possible to clearly explain to a person why their loan application was rejected or why they were chosen for audit.
Having such a sensitive algorithm makes this difficult.
k-NN Classification in R
Package
The k-NN algorithm can be implemented using the class package.
We will go through an example using data that has already been cleaned and is in the file knnexample.RData.
Load this file using

library(here)  # for project-relative file paths
load(here('data/knnexample.RData'))
Data
A data frame of 250 randomly selected training observations for the predictors LIMIT_BAL and AGE (train_x)
The values of the target variable for the training data, stored as a factor (train_y)
A data frame of 250 randomly selected test observations for the predictors LIMIT_BAL and AGE (test_x)
The values of the target variable for the test data, stored as a factor (test_y)
Standardisation
An important issue with k-NN is that distances are sensitive to the units of measurement. For this reason we should standardise, so that all predictors have a mean of 0 and a standard deviation of 1.
train_x_std <- scale(train_x)                       # centre and scale the training data
mean_train_x <- attr(train_x_std, "scaled:center")  # training means
std_train_x <- attr(train_x_std, "scaled:scale")    # training standard deviations
Standardise test data
The test data should also be standardised.
This transformation must be identical for the training and test data, so we use the mean and standard deviation of the training data.

test_x_std <- scale(test_x,
                    center = mean_train_x,
                    scale  = std_train_x)
Now the data are ready to be used in the knn function
Using knn
With k=1
library(class)  # provides the knn function

yhat_k1 <- knn(train_x_std, test_x_std, train_y, k = 1)
print(yhat_k1)
## [1]  No Default Default    Default    Default    No Default No Default
## [7]  No Default No Default No Default Default    No Default No Default
## [13] No Default No Default No Default Default    No Default No Default
## [19] No Default No Default Default    No Default Default    Default
## [25] No Default No Default Default    No Default Default    No Default
## [31] Default    No Default No Default No Default No Default No Default
## [37] Default    No Default No Default Default    Default    No Default
## [43] No Default No Default No Default No Default Default    No Default
## [49] No Default No Default Default    No Default No Default No Default
## [55] No Default No Default No Default Default    No Default No Default
## [61] Default    No Default No Default Default    No Default No Default
## [67] Default    Default    No Default No Default No Default No Default
## ... (output truncated)
Test misclassification rate
To compute the test misclassification rate:
miscl_k1 <- mean(test_y != yhat_k1)
print(miscl_k1)
## [1] 0.32
Overall, 32% of the test sample is incorrectly predicted.
As an exercise, you can now try to obtain the misclassification rate with k = 5.
Answer
yhat_k5 <- knn(train_x_std, test_x_std, train_y, k = 5)
miscl_k5 <- mean(test_y != yhat_k5)
print(miscl_k5)
## [1] 0.292
The misclassification rate when k = 5 is 0.292.
Sensitivity and specificity
It is also worth reporting results in a cross tabulation.
table(test_y, yhat_k5)

##             yhat_k5
## test_y       Default No Default
##   Default          8         52
##   No Default      21        169

Letting "Default" be the "positive" class, sensitivity is 8/(8+52) = 0.133 and specificity is 169/(21+169) = 0.889.
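These can also be computed directly from the confusion matrix; a quick sketch:

cm <- table(test_y, yhat_k5)
sensitivity <- cm["Default", "Default"] / sum(cm["Default", ])           # 8/60, about 0.133
specificity <- cm["No Default", "No Default"] / sum(cm["No Default", ])  # 169/190, about 0.889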
Additional features
If the argument prob = TRUE is included, then the function also returns probabilities (as an attribute).
prob_k5 <- knn(train_x_std, test_x_std, train_y, k = 5, prob = TRUE)
Predictions
The output still contains predictions
print(prob_k5)
## [1]  No Default Default    No Default No Default Default    No Default
## [7]  No Default No Default No Default Default    No Default No Default
## [13] No Default No Default No Default No Default No Default No Default
## [19] No Default No Default Default    No Default No Default No Default
## [25] No Default No Default No Default No Default Default    No Default
## [31] No Default No Default No Default Default    Default    No Default
## [37] No Default No Default No Default No Default No Default No Default
## [43] No Default No Default No Default No Default No Default No Default
## [49] No Default No Default No Default Default    Default    No Default
## [55] No Default Default    Default    Default    No Default Default
## [61] Default    No Default Default    Default    No Default No Default
## [67] No Default No Default No Default No Default No Default No Default
## [73] No Default No Default No Default No Default No Default No Default
## [79] No Default No Default No Default No Default No Default No Default
## ... (output truncated)
Probabilities
Probabilities can be stripped out using the attr function
print(attr(prob_k5,"prob"))
## [1] 1.0000000 0.5714286 0.6666667 0.6000000 0.6000000 0.8000000 1.0000000
## [8] 1.0000000 0.6666667 0.6000000 0.8333333 0.8000000 1.0000000 0.5714286
## [15] 1.0000000 0.6666667 1.0000000 1.0000000 0.6000000 1.0000000 0.6000000
## [22] 0.8333333 0.6666667 0.6000000 1.0000000 1.0000000 0.5714286 1.0000000
## [29] 0.6000000 0.5000000 0.6666667 0.8571429 1.0000000 0.6000000 0.6000000
## [36] 0.8000000 0.6666667 0.6000000 0.6000000 0.5000000 0.6000000 0.6666667
## [43] 0.6000000 0.6000000 0.8333333 1.0000000 0.6666667 0.8000000 1.0000000
## [50] 1.0000000 0.8000000 0.6000000 0.6000000 1.0000000 1.0000000 0.6000000
## [57] 0.6666667 0.8000000 1.0000000 0.6000000 0.8000000 1.0000000 0.6000000
## [64] 0.6000000 0.6666667 0.8000000 0.5714286 0.5714286 1.0000000 1.0000000
## [71] 0.6666667 0.6000000 1.0000000 0.6666667 0.8000000 0.8000000 1.0000000
## [78] 0.8000000 0.8000000 1.0000000 0.6666667 0.6000000 1.0000000 0.8000000
## [85] 0.5714286 1.0000000 0.8000000 0.6666667 1.0000000 0.8750000 1.0000000
## [92] 1.0000000 0.8000000 0.7142857 0.6000000 0.7142857 1.0000000 0.8571429
Prediction and Probabilities

Prediction    Probability
No Default    1.0000000
Default       0.5714286
No Default    0.6666667
No Default    0.6000000

For the first observation the probability of No Default is 1, but for the second observation the probability of Default is 0.5714.
The probability is relative to the prediction: it is the proportion of neighbour votes for the winning class.
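If a probability of the Default class specifically is wanted, the vote share can be converted with a small sketch like this:

vote_share <- attr(prob_k5, "prob")  # share of votes for the winning class
p_default <- ifelse(prob_k5 == "Default", vote_share, 1 - vote_share)
head(p_default, 4)  # 0.0000000 0.5714286 0.3333333 0.4000000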
Ties in neighbours
If there are exactly 5 neighbours, all probabilities should be either 0.6, 0.8 or 1. Clearly this is not true above. Why?
There can be ties when finding the nearest neighbours.
Two (or more) observations may be tied as equally fifth closest.
In this case all tied observations are used, so the vote can be out of more than 5 neighbours (giving values such as 4/7 = 0.5714).
Regression
The same principle can be used for regression.
The predicted value is simply the average value of $y$ for the k nearest neighbours. This leads to a highly non-linear regression function.
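A minimal hand-rolled sketch of k-NN regression in R; knn_reg and x0 are hypothetical names, with train_x a numeric matrix and train_y a numeric response (packages such as FNN offer ready-made versions).

# Predict y at x0 as the mean of y over the k nearest training points
knn_reg <- function(train_x, train_y, x0, k = 5) {
  d <- sqrt(colSums((t(train_x) - x0)^2))  # Euclidean distances to x0
  mean(train_y[order(d)[1:k]])             # average y of the k closest points
}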
Computational challenges
Where does the time to do this analysis go?
Time to work with the training data: negligible (just standardisation).
Time to make predictions: extensive (we need to work out the distance between each observation we want to predict and every training observation).
This is called being a "lazy learner" (learning is delayed until a prediction is needed), and it can impact implementation in real-time applications.
Lazy learning is useful when real-time updates to the training data are desirable.
One solution is to reduce the dimension (remember our clustering lectures?); see the sketch below.
There are also computational issues with lots of predictors (large $m$).
This is known as the curse of dimensionality.
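A sketch of that dimension-reduction idea, using principal components (base R's prcomp) before k-NN; here train_x has only two predictors, so imagine a wider data frame, and the choice of two components is arbitrary.

# Reduce the predictors to a few principal components, then run k-NN
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)
train_pc <- pca$x[, 1:2]                          # component scores for training data
test_pc <- predict(pca, newdata = test_x)[, 1:2]  # same rotation applied to test data
yhat_pc <- knn(train_pc, test_pc, train_y, k = 5)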
Conclusion
Using k-NN is a simple but often effective method for classification. There are limitations:
Complicated decision boundaries.
High variance.
It also works poorly when there is a large number of predictors.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.