ECE 657A: Classification
Lecture 7: k-NN, SVM and Kernel Methods
Mark Crowley
February 15, 2017
Today’s Class
Announcements
K-nearest neighbor classification
Kernel Methods
Support Vector Machines
Announcements
Assignment 1 complete!
Next week: no classes
Project Proposal due Friday.
TA: Rasoul Mohammadi Nasiri – r26mohammadinasiri@uwaterloo.ca –
E5 5013
New TA: Atinderdeep Saini – as3saini@uwaterloo.ca
Similarity Based Classifiers
Parametric vs Non-parametric Models
Analysing some data, we produce a probability model:
Supervised: p(y | x, θ)
Unsupervised: p(x | θ)
The distinction is in the number of parameters θ of the model.
Parametric: fixed in size, regardless of the amount of data N (e.g. µ, σ)
Usually fast.
Use strong assumptions, could be wrong.
Can use MLE or MAP.
Examples: linear regression, logistic regression, Naive Bayes, simple neural networks, k-means, HMMs
Parametric vs Non-parametric Models
Non-parametric: number of parameters grows with the amount of training data.
More flexible.
Slower than parametric, or even completely intractable for large datasets.
Examples: k-nearest neighbors, Support Vector Machines (SVMs), Parzen Window/KDE, Decision Trees, Gaussian processes
k-Nearest Neighbor (KNN) Classifier
A simple example of a supervised, non-parametric classification model.
General Algorithm
For some test datapoint x, find the set of K points in the training set nearest to x.
Count how many members of each class are in this neighbour set.
Return the empirical fraction for each class as a probability.
Optionally take the highest-probability class as the label for x.
p(y = c | x, D, K) = (1/K) Σ_{i ∈ N_K(x, D)} I(y_i = c)
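A minimal sketch of this estimate in NumPy, assuming Euclidean distance (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x, K=3):
    """Estimate p(y = c | x, D, K) as the fraction of the K nearest
    training points (by Euclidean distance) that belong to class c."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(dists)[:K]               # indices of the K nearest neighbours
    return {c: float(np.mean(y_train[nearest] == c)) for c in np.unique(y_train)}

# optional MAP label: the class with the highest empirical fraction
# proba = knn_predict_proba(X_train, y_train, x_test, K=3)
# label = max(proba, key=proba.get)
```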
Figure: KNN in 2D for K=3.
x1: p(y = 1 | x1, D, K = 3) = 2/3
x1: p(y = 0 | x1, D, K = 3) = 1/3
x2: p(y = 1 | x2, D, K = 3) = 3/3 = 1
x2: p(y = 0 | x2, D, K = 3) = 0/3 = 0
Special case K=1
If K = 1 then we have 1-Nearest Neighbour.
Induces a Voronoi diagram: every point inside a cell has that cell's training point as its nearest neighbour.
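As an illustration (not from the slides), SciPy can compute the Voronoi diagram that 1-NN induces over a set of training points:

```python
import numpy as np
from scipy.spatial import Voronoi

points = np.random.randn(20, 2)   # hypothetical 2-D training points
vor = Voronoi(points)

# Each Voronoi cell is exactly the region that 1-NN assigns to its training point.
# vor.vertices holds the cell corners; in vor.regions, an index of -1 marks an
# unbounded cell (its boundary extends to infinity).
print(vor.vertices.shape, len(vor.regions))
```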
KNN Example
Figure: (a) 3-class dataset. (b) Prob of class 1 for K = 10. (c) Class 2. (d) MAP
estimate of class label.
Comments on KNN
Non-parametric because the neighbourhood used for a prediction is defined by the training data itself, so the model grows with the data.
Biased towards classes with more samples.
Sensitive to noise in the data (e.g. random neighbours).
Computationally expensive in high dimensions or for large K.
Could compute distance using only some dimensions (which ones?).
Remove redundant points from the training set (e.g. ones surrounded by points of the same class).
For low dimensions, could use search trees to organize training points into independent subsets (see the sketch below).
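For example, a k-d tree (here SciPy's cKDTree, an assumed implementation choice rather than the lecture's) avoids the brute-force distance scan in low dimensions:

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical data: 1000 training points in 2 dimensions, 3 classes
X_train = np.random.randn(1000, 2)
y_train = np.random.randint(0, 3, size=1000)
x_test = np.zeros(2)

tree = cKDTree(X_train)                 # built once from the training set
dists, idx = tree.query(x_test, k=10)   # K = 10 nearest neighbours of x_test
proba = {c: float(np.mean(y_train[idx] == c)) for c in range(3)}
print(proba)
```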
Density Estimation
Estimate the density of points in a region of the data space.
Parametric approach: place one or more distributions to model clusters and fit their parameters to the data.
Non-parametric approach: place a distribution at each training data point and combine the distributions.
Parzen window density estimation
Kernel density estimation (KDE)
Parzen Window
p̂_h(x | D) = (1 / (N h)) Σ_{i=1}^{N} κ((x − x_i) / h)

where h is called the bandwidth (usually chosen by cross-validation on the risk).
Kernel Function κ
κ(x) = N(x | x_i, σ²I)
     = (1 / √(2π)) e^{−x²/2}   (Gaussian kernel)
or κ(x) = I(|x| ≤ 1)   (Boxcar kernel)
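A minimal 1-D sketch of the Parzen window estimate with both kernels (illustrative names; the boxcar follows the slide's I(|x| ≤ 1), which would need a factor of 1/2 to integrate to 1):

```python
import numpy as np

def gaussian_kernel(u):
    """(1 / sqrt(2*pi)) * exp(-u^2 / 2)"""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def boxcar_kernel(u):
    """I(|u| <= 1), as on the slide (a normalized boxcar would be 0.5 * I(|u| <= 1))"""
    return (np.abs(u) <= 1).astype(float)

def parzen_density(x, data, h, kernel=gaussian_kernel):
    """p_hat_h(x | D) = (1 / (N h)) * sum_i kernel((x - x_i) / h), for 1-D data"""
    return kernel((x - data) / h).sum() / (len(data) * h)

# usage on hypothetical bimodal data
data = np.concatenate([np.random.randn(50), np.random.randn(50) + 5])
print(parzen_density(0.0, data, h=1.0))                         # Gaussian kernel
print(parzen_density(0.0, data, h=2.0, kernel=boxcar_kernel))   # boxcar kernel
```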
Parzen Window Example
Figure: Boxcar kernel vs Gaussian kernel with h = 1 and h = 2
From KDE to KNN
We can define K-Nearest Neighbors using the Parzen window / KDE approach.
Instead of a single bandwidth h, define h_i for each datapoint x_i such that the K points nearest to x_i are inside the box.
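One way to read this, sketched below for 1-D data (my reading, not code from the lecture): take h_i to be the distance from x_i to its K-th nearest neighbour, so exactly K points fall inside the box around x_i.

```python
import numpy as np

def knn_bandwidths(data, K):
    """For each x_i, set h_i = distance to its K-th nearest neighbour (1-D data)."""
    dists = np.abs(data[:, None] - data[None, :])  # pairwise distances
    dists.sort(axis=1)                             # row i: distances from x_i, ascending
    return dists[:, K]                             # column 0 is x_i itself, so index K is the K-th neighbour

data = np.random.randn(100)
h = knn_bandwidths(data, K=5)
print(h[:5])   # per-point bandwidths instead of one global h
```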
Outline (current section: Kernel Methods)
Class Admin
Announcements
Similarity Based Classifiers
k-Nearest Neighbor Classifier
Parzen Window Density Estimation
Kernel Methods
Definition
The Kernel Trick
Support Vector Machines
Definitions
Kernel Function
Until now we assumed each object/data point can be represented with a fixed-size feature vector x_i ∈ R^D for purposes of estimation and classification.
What if the features are variable length? (e.g. text documents, protein geometry, evolutionary trees)
One solution is to define a similarity measure called a kernel function that doesn't require processing the features directly; it implicitly maps them to a new space of arbitrary dimensionality.
Kernel Function
κ(x, x′) ∈ R and κ(x, x′) ≥ 0
for x, x′ ∈ X, some abstract space X.
Usually symmetric, so κ(x, x′) = κ(x′, x).
A Mercer kernel, or positive definite kernel, has a number of useful properties that allow features to be extracted when the kernel is defined directly.
Requires that the Gram matrix is positive definite.
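A quick numerical illustration of the Gram-matrix condition, using the plain dot product xᵀx′ as the kernel (an example choice, not from the slides):

```python
import numpy as np

X = np.random.randn(30, 4)    # hypothetical sample of 30 points in R^4

# Gram matrix G_ij = kappa(x_i, x_j) for the dot-product kernel kappa(x, x') = x^T x'
G = X @ X.T

# Mercer condition: G must be symmetric positive (semi-)definite,
# i.e. all eigenvalues >= 0 (up to numerical noise)
eigvals = np.linalg.eigvalsh(G)
print(np.allclose(G, G.T), eigvals.min() >= -1e-10)
```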
Some example kernels
Gaussian Kernel
κ(x, x′) = exp(−(1/2)(x − x′)ᵀ Σ⁻¹ (x − x′))

Radial Basis Function
κ(x, x′) = exp(−‖x − x′‖² / (2σ²))

Cosine Similarity
κ(x_i, x_{i′}) = x_iᵀ x_{i′} / (‖x_i‖₂ ‖x_{i′}‖₂)

where x_ij is the number of times word j occurs in document i.
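Minimal NumPy versions of these three kernels (a sketch; function and variable names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, x2, Sigma):
    """kappa(x, x') = exp(-(1/2) (x - x')^T Sigma^{-1} (x - x'))"""
    d = x - x2
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

def rbf_kernel(x, x2, sigma=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); the Gaussian kernel with Sigma = sigma^2 I"""
    return np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

def cosine_similarity(xi, xj):
    """kappa(x_i, x_i') = x_i^T x_i' / (||x_i||_2 ||x_i'||_2), e.g. for word-count vectors"""
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

# usage on hypothetical vectors
x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
print(gaussian_kernel(x, y, Sigma=np.eye(2)), rbf_kernel(x, y), cosine_similarity(x, y))
```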
Linear Kernels
Use a linear function ψ(x) = x on the features.
Then we define a linear kernel
κ(x, x′) = ψ(x)ᵀψ(x′) = xᵀx′
This assumes
Features are linearly separable
Each feature is independently informative about the datapoint
Definitions:
R: the set of real numbers
D: training data D = {x1, . . . , xn}
K : number of nearest neighbors to include in kNN algorithm
N_K(x, D): set of the K nearest neighbors of point x
I(COND): Indicator function which equals 1 if and only if COND is true.
κ(x): kernel function (e.g. Gaussian or boxcar) used for density estimation and as a similarity measure
Summary
K-nearest neighbor classification
Kernel Methods
Support Vector Machines