K Nearest Neighbours
- The first classification algorithm we present is called K nearest neighbours, or KNN.
- KNN assigns a label to a new observation y_i based on the question "what is the most common class around the vector x_i?"
- The idea is that if two observations have similar x values, then their labels y should also be similar.
- One-dimensional space [draw example]
K Nearest Neighbours
Figure: A three class nearest neighbours visual example with two features X1, X2
K Nearest Neighbours
Algorithm (K Nearest Neighbours)
1. Store all the training data (y_i, x_{i1}, \ldots, x_{ip}).
2. For each new point x_q = (x_{q1}, \ldots, x_{qp}), find the K nearest neighbours in the training set, indexed by the set N_0.
3. For each possible class j \in C, compute
   \hat{P}(Y = j \mid X = x_q) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)
4. Assign the class j for which \hat{P}(Y = j \mid X = x_q) is maximal.
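To make the voting rule concrete, here is a minimal base-R sketch of steps 2 to 4 for a single query point; the function name knn_classify, the Euclidean distance, and the toy data are illustrative assumptions rather than anything from the slides.

# Minimal sketch of the KNN voting rule for one query point (illustrative only).
# X: numeric matrix of training features, y: vector of training labels,
# xq: numeric vector with the query point, K: number of neighbours.
knn_classify <- function(X, y, xq, K = 5) {
  # Euclidean distance from the query point to every training observation
  d <- sqrt(rowSums((X - matrix(xq, nrow(X), ncol(X), byrow = TRUE))^2))
  # indices of the K nearest neighbours (the set N_0 in the algorithm)
  N0 <- order(d)[1:K]
  # estimated class probabilities: share of each class among the neighbours
  phat <- table(y[N0]) / K
  # assign the class with the highest estimated probability
  names(phat)[which.max(phat)]
}

# toy usage with simulated data
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] > 0, "A", "B")
knn_classify(X, y, xq = c(0.5, 0.5), K = 5)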
K Nearest Neighbours Estimates Complicated Functions
Figure: A three class nearest neighbours visual example with two features X1, X2
Some thoughts on KNN
- KNN is among the simplest of all machine learning algorithms.
- It is explicitly non-parametric (you do not estimate any coefficients).
- KNN is referred to as a lazy learning algorithm: the function is only approximated locally and all computation is deferred until classification.
- So there is actually no explicit training step.
- Results from KNN are sensitive to the local structure of the data (for example, with highly skewed class frequencies the "majority voting" gets distorted; distance-based weights can help, see the sketch after this list).
- It can be (potentially) slow at prediction time, and it may be impossible to hold all the training data in memory.
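As a hedged illustration of the "weights" remark above, the sketch below replaces the plain majority vote with inverse-distance weighting; the weighting scheme and the function name knn_weighted are assumptions chosen for illustration.

# Distance-weighted KNN vote (illustrative sketch, not from the slides).
# Closer neighbours get larger weights, which softens the distortion that
# plain majority voting suffers when class frequencies are very skewed.
knn_weighted <- function(X, y, xq, K = 5, eps = 1e-8) {
  d  <- sqrt(rowSums((X - matrix(xq, nrow(X), ncol(X), byrow = TRUE))^2))
  N0 <- order(d)[1:K]
  w  <- 1 / (d[N0] + eps)           # inverse-distance weights (one common choice)
  votes <- tapply(w, y[N0], sum)    # weighted vote per class
  names(votes)[which.max(votes)]
}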
Applications in Research: Fetzer and Marden, 2014
Figure: Classifying historical Landsat imagery into Forest or Non-Forest status.
[See forested.R]
Plotting Out Band Values for Red, Near Infrared
> options(stringsAsFactors=FALSE)
> library(data.table)
> library(plyr)
> library(class)
> # load forest-cover labels, the Landsat composite bands, and MODIS land cover
> FORESTED <- data.table(read.csv("R/forested.csv"))
> COMPOSITE <- data.table(read.csv("R/composite.csv"))
> MODIS <- data.table(read.csv("R/modis.csv"))
> setnames(MODIS, "mean", "landcover")
> setnames(FORESTED, "mean", "forestcover")
> # merge the three tables on system.index and build the Forested label
> DF <- join(join(FORESTED[, c("system.index", "forestcover"), with=F], COMPOSITE),
+            MODIS[, c("system.index", "landcover"), with=F])
> DF[, Forested := forestcover > 0.8]
> DF[, MODISforested := (landcover > 0 & landcover <= 5)]
> # standardise the two Landsat bands
> DF[, B3 := scale(B3)]
> DF[, B4 := scale(B4)]
> df.xlim <- range(DF$B3)
> df.ylim <- range(DF$B4)
> plot(DF[Forested==TRUE][1:120, c("B3", "B4"), with=F], col="green",
+      xlim=df.xlim, ylim=df.ylim)
> points(DF[Forested==FALSE][1:80, c("B3", "B4"), with=F], col="red")
Plotting Out Band Values for Red, Near Infrared
Figure: Classifying historical Landsat imagery into Forest or Non-Forest status.
Results for KNN with K=10
Figure: 10-nearest neighbour classification (K = 10)
How do we perform in terms of classification?
> knn10 <- knn(DF[train, c("B3","B4"), with=F], test = DF[-train, c("B3","B4"), with=F],
+              cl = DF[train]$Forested, k = 10, prob = TRUE)
> prob <- attr(knn10, "prob")
> prob <- ifelse(knn10==TRUE, prob, 1-prob)
> table(prob>0.5, DF[-train]$Forested)
        FALSE TRUE
  FALSE    59   14
  TRUE     13  114
Overall error rate is quite low: 27/200 = 13.5%.
- Pixels wrongly classified as forested (type 1 error): 13/200.
- Pixels wrongly classified as non-forested (type 2 error): 14/200.
- Suppose you are sending an enforcement agent to fine people who have illegally deforested land (predicted value = FALSE), i.e. for 73 pixels (around 36.5%).
- It costs a lot of money to send police out there, but in 19% (14/(59+14)) of the cases you find the forest is actually standing... (a short computation reproducing these rates follows below).
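As a small sanity check (a sketch assuming the confusion table above, with predictions in rows and the true Forested status in columns), the quoted rates can be reproduced directly in R:

# confusion table from above: rows = prediction (prob > 0.5), cols = truth
tab <- table(prob > 0.5, DF[-train]$Forested)
n   <- sum(tab)                                                      # 200 held-out pixels
overall_error <- (tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]) / n   # 27/200 = 0.135
false_forested     <- tab["TRUE",  "FALSE"] / n   # predicted forested, actually not: 13/200
false_not_forested <- tab["FALSE", "TRUE"]  / n   # predicted not forested, actually forested: 14/200
wasted_visits <- tab["FALSE", "TRUE"] / sum(tab["FALSE", ])   # 14/73: forest still standing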
Changing c̄...
Make it less likely that pixels are wrongly classified as being forested... reduce the threshold c̄.
> table(prob>0.3,DF[-train]$Forested)
        FALSE TRUE
  FALSE    57    7
  TRUE     15  121
- Now we will send enforcement teams out 64 times, but only in about 10% of the cases (7/64) will the forest still be standing (i.e. there is nothing to enforce).
- While the threshold c̄ = 0.5 theoretically minimizes overall prediction error, you may still prefer a different c̄ in case there are different costs associated with type 1 and type 2 errors.
Exercise: Plot an ROC Curve…
Can you try to compute ROC curves and plot them out?
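One possible starting point is a manual threshold sweep (a hedged sketch assuming the held-out probabilities prob and labels DF[-train]$Forested from above; the threshold grid and object names are illustrative):

# Manual ROC sketch: sweep the threshold over a grid and record the
# true positive rate (TPR) and false positive rate (FPR) at each value.
truth <- DF[-train]$Forested
thresholds <- seq(0, 1, by = 0.01)
roc <- t(sapply(thresholds, function(thr) {
  pred <- prob > thr
  c(FPR = sum(pred & !truth) / sum(!truth),
    TPR = sum(pred &  truth) / sum(truth))
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # the 45-degree line of a random classifier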
How to Choose K?
- As you will have guessed, it turns out that K is a tuning parameter that we need to select.
- In binary (two-class) classification problems, it is helpful to choose K to be an odd number – why?
- Cross validation is our preferred method, i.e. for each different value of K, we perform cross validation and plot out the overall average error rates.
- Why does the choice of K matter?
Results for KNN with K=15
Figure: 15-nearest neighbour classification (K = 15)
Results for KNN with K=1
Figure: 1-nearest neighbour classification (K = 1)
Cross Validation and choice of K
> RES <- NULL
> for(K in 1:50) {
+   knncv <- knn.cv(DF[, c("B3","B4"), with=F],
+                   cl = DF$Forested, k = K, prob = TRUE)
+   # leave-one-out CV error rate: 1 minus the share of correctly classified pixels
+   RES <- rbind(RES, cbind(K, 1 - (table(knncv, DF$Forested)[1,1] +
+                                   table(knncv, DF$Forested)[2,2]) / 1000))
+ }
> plot(RES)
Figure: Leave-one-out cross-validation error rate plotted against K (K from 1 to 50; error rates roughly between 0.11 and 0.14).
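As a small follow-up (a sketch assuming the RES matrix built in the loop above, with K in the first column and the leave-one-out error rate in the second), the error-minimising K can be read off directly:

# K with the lowest leave-one-out CV error rate
colnames(RES) <- c("K", "cv_error")
RES[which.min(RES[, "cv_error"]), ]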
Comparing Logistic Regression Versus kNN Classification
- Cross validation suggested that we should use k ≈ 10 for kNN classification.
- Let's compare the performance of logistic regression versus 10NN for this example.
> set.seed(1151)
> train <- sample(1:1000, 800)
> glm.fit <- glm(Forested ~ B3 + B4, data=DF[train], family=binomial)
> glm.predict <- predict.glm(glm.fit, DF[-train], type="response")
> knn10 <- knn(DF[train, c("B3","B4"), with=F], test = DF[-train, c("B3","B4"), with=F],
+              cl = DF[train]$Forested, k = 10, prob = TRUE)
> prob <- attr(knn10, "prob")
> prob <- ifelse(knn10==TRUE, prob, 1-prob)
> table(prob>0.5, DF[-train]$Forested)
        FALSE TRUE
  FALSE    59   14
  TRUE     13  114
> table(glm.predict>0.5,DF[-train]$Forested)
        FALSE TRUE
  FALSE    58    9
  TRUE     14  119
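To put the two confusion tables on a single scale, a short sketch (assuming the objects prob, glm.predict and train from the code above) computes each method's held-out misclassification rate:

truth <- DF[-train]$Forested
err_knn <- mean((prob > 0.5)        != truth)   # 27/200 = 0.135
err_glm <- mean((glm.predict > 0.5) != truth)   # 23/200 = 0.115
c(knn10 = err_knn, logistic = err_glm)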
Comparing Logistic Regression Versus kNN Classification
- The result is not surprising. The data suggest a near-linear decision boundary, and both kNN and logistic regression estimate that linear boundary reasonably well.
- You may still decide that you prefer logistic regression, since it is much more interpretable and less of a black box.
- With logistic regression, you get interpretable coefficients; here we do not care much for inference, so it does not really matter.
- But in general, a (reasonable) rule of thumb is that you should prefer a simple, parametric, interpretable method over a non-parametric method in case they yield similar results.
When does kNN outperform?
Figure: Example of Non-Linear Classification Problems
kNN Regression
- With a numeric outcome variable, KNN is also a simple way to compute predictions.
- Compute the average outcome of the K nearest neighbours as the predicted value (a minimal sketch follows below).
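A minimal base-R sketch of kNN regression (the function name knn_regress and the Euclidean distance are illustrative assumptions):

# KNN regression: predict the outcome at xq as the average outcome
# of its K nearest neighbours in the training data (illustrative sketch).
knn_regress <- function(X, y, xq, K = 5) {
  d  <- sqrt(rowSums((X - matrix(xq, nrow(X), ncol(X), byrow = TRUE))^2))
  N0 <- order(d)[1:K]
  mean(y[N0])
}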
KNN is a discriminative classifier [non examinable]
KNN is a discriminative algorithm since it models the conditional probability of a sample belonging to a given class. To see this, just consider how one gets to the decision rule of kNN.
A class label corresponds to a set of points which belong to some region R of the feature space. If you draw sample points independently from the actual probability distribution p(x), then the probability of drawing a sample from that class is
P = \int_R p(x) \, dx
What if you have N points? The probability that K of those N points fall in the region R follows the binomial distribution,
\mathrm{Prob}(K) = \binom{N}{K} P^K (1 - P)^{N-K}
As N \to \infty this distribution is sharply peaked, so that P can be approximated by its mean value, P \approx K/N.
KNN is a discriminative classifier [non examinable]
An additional approximation is that the probability distribution over R remains approximately constant, so that one can approximate the integral by
\int_R p(x) \, dx \approx p(x) V
where V is the total volume of the region. Under these approximations,
p(x) \approx \frac{K}{N V}
Now, if we had several classes, we could repeat the same analysis for each one, which would give us
p(x \mid C_k) = \frac{K_k}{N_k V}
where K_k is the number of points from class k which fall within that region and N_k is the total number of points from class k. Notice that \sum_k N_k = N.
Repeating the analysis with the binomial distribution, it is easy to see that we can estimate the prior as P(C_k) = \frac{N_k}{N}.
Using Bayes' rule,
P(C_k \mid x) = \frac{p(x \mid C_k) P(C_k)}{p(x)} = \frac{K_k}{K}
which is the rule for kNNs.