Pattern Analysis & Machine Intelligence Research Group
SVM Classifier
• The goal of classification using SVM is to separate two classes by a hyperplane induced from the available examples.
• The goal is to produce a classifier that will work well on unseen examples (generalizes well).
• So it belongs to the decision (function) boundary approach.
1 of 22
Method Description
• Consider the problem of separating the set of training vectors belonging to two different classes,
  $s = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \quad x \in \mathbb{R}^d, \; y \in \{-1, +1\}$
• The hyperplane equation is $w^T x + b = 0$
• Many decision boundaries can separate these two classes.
• Which one should we choose?
[Figure: samples from Class 1 and Class 2 with several candidate decision boundaries]
2 of 22
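As a concrete illustration of the hyperplane decision rule, a minimal sketch: given some $w$ and $b$ (the values below are invented for the example), a point is assigned to a class by the sign of $w^T x + b$.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only)
w = np.array([1.0, -2.0])   # normal vector of the hyperplane
b = 0.5                     # bias / offset

def predict(x):
    """Return +1 or -1 depending on which side of w^T x + b = 0 the point lies."""
    return 1 if w @ x + b >= 0 else -1

print(predict(np.array([3.0, 0.0])))   # w^T x + b = 3.5 -> +1
print(predict(np.array([0.0, 2.0])))   # w^T x + b = -3.5 -> -1
```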
Method Description
Examples of decision boundaries
[Figure: a decision boundary with a large margin vs. one with a small margin]
3 of 22
Method Description
• Optimal separation:
  1. No errors.
  2. The distance from the closest vector to the hyperplane is maximized.
• The vectors closest to the hyperplane are called support vectors.
[Figure: Class 1 and Class 2 separated by the maximum-margin hyperplane; margin width $m$]
4 of 22
Method Description
[Figure: support vectors of Class 1 and Class 2 lying on the margin boundaries, margin width $m$, separating hyperplane $w^T x + b = 0$]
For a support vector $x^+$ of class $+1$ and a support vector $x^-$ of class $-1$:
$w^T x^+ + b = +1$
$w^T x^- + b = -1$
Subtracting gives $w^T (x^+ - x^-) = 2$, so the margin is $m = \dfrac{2}{\|w\|}$
5 of 22
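A small numeric sanity check of the relations above; the hyperplane and the two margin points below are invented for illustration.

```python
import numpy as np

w = np.array([1.0, 1.0])          # illustrative weight vector
b = -3.0                          # illustrative bias
x_plus = np.array([2.0, 2.0])     # a point on the +1 margin boundary
x_minus = np.array([1.0, 1.0])    # a point on the -1 margin boundary

print(w @ x_plus + b)             # +1.0
print(w @ x_minus + b)            # -1.0
print(w @ (x_plus - x_minus))     # 2.0
print(2 / np.linalg.norm(w))      # margin m = 2 / ||w|| ~= 1.414
```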
Method Description
To get optimal separation, it becomes a constrained
optimization problem
Minimize $\tfrac{1}{2}\|w\|^2$ (with $\|w\|^2$ replaced by $w^T w$)
Subject to $y_i (w^T x_i + b) \geq 1$ for all $i$, where $y_i = +1$ or $-1$
(equality holds for the support vectors)
6 of 22
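A minimal sketch of solving this primal problem numerically, assuming scipy is available; the tiny 2D dataset is invented for illustration, and a dedicated QP solver would normally be used instead of the general-purpose SLSQP routine.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w                       # (1/2) ||w||^2

# y_i (w^T x_i + b) - 1 >= 0 for every training point
constraints = [{"type": "ineq",
                "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1.0}
               for i in range(len(y))]

x0 = np.array([1.0, 1.0, -3.0])              # a feasible starting point for this toy data
res = minimize(objective, x0=x0, constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2 / np.linalg.norm(w))
```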
Using Lagrange Multipliers
$\min_{w,b}\;\left\{ \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \right\}$
One problem with the above minimization problem: if $y_i (w^T x_i + b) - 1 \geq 0$ for all $i$, then one can achieve the minimum by putting $\alpha_i = \infty$.
It is better instead to separate the problem like this:
$\min_{w,b}\; \max_{\alpha \geq 0}\;\left\{ \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \right\}$
7 of 22
Solution using Karush–Kuhn–Tucker
For all points with $y_i (w^T x_i + b) - 1 > 0$, set $\alpha_i = 0$. Then only a few $\alpha_i > 0$ (only for the support vectors)!
$w = \sum_{i=1}^{n} \alpha_i y_i x_i$
We can now substitute this into the objective function.
8 of 22
Method Description
Dual Problem
$\max_{\alpha}\;\left\{ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \right\}$
Subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$
9 of 22
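A minimal sketch of solving this dual as a quadratic program, assuming the cvxopt package is installed; the toy data is the same illustrative set used earlier.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt QP solver is installed

# Illustrative toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual in cvxopt form: minimize (1/2) a^T P a - 1^T a  s.t.  -a <= 0  and  y^T a = 0
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i^T x_j
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                   # encodes alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # encodes sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
print("alpha =", alpha)                  # non-zero entries mark the support vectors
```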
Method Description
• The parameters of the equation of the hyperplane are defined as
  $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $b = y_{sv} - w^T x_{sv}$,
  where $x_{sv}$ is an arbitrary support vector with its label $y_{sv}$.
• The support vectors are the $x_i$ with non-zero $\alpha_i$.
• The problem can be solved using Quadratic Programming or Gradient Descent.
10 of 22
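A small sketch of recovering $w$ and $b$ from a dual solution. For the tiny 1D example below (one point per class, chosen for illustration), the dual solution works out to $\alpha = (0.5, 0.5)$, so the recovered values should be $w = 1$ and $b = 0$.

```python
import numpy as np

# Two 1D training points, one per class (illustrative example)
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])          # dual solution for this toy problem

w = (alpha * y) @ X                   # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha > 1e-8)          # index of an arbitrary support vector
b = y[sv] - w @ X[sv]                 # b = y_sv - w^T x_sv

print("w =", w, " b =", b)            # w = [1.], b = 0.0
```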
Method Description
• Two possible problems:
  • The previous optimization problem does not have a solution if the classes are not linearly separable.
  • How to deal with possible noise in the training data?
• We can allow some error on the training data.
• This is called the soft-margin SVM.
11 of 22
Soft Margin SVM
• We allow some error $\xi_i$ in training.
[Figure: Class 1 and Class 2 with a soft margin; slack variables $\xi_i$ and $\xi_j$ for samples $x_i$ and $x_j$ that violate the margin]
12 of 22
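At the optimum the slack of a sample equals $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, i.e. how far the point falls on the wrong side of its margin. A minimal sketch (hyperplane and points invented for illustration):

```python
import numpy as np

w, b = np.array([1.0, 0.0]), 0.0      # illustrative hyperplane: x1 = 0
X = np.array([[2.0, 1.0],             # well inside the +1 side: slack 0
              [0.5, -1.0],            # inside the margin:       slack 0.5
              [-1.0, 3.0]])           # wrong side entirely:     slack 2.0
y = np.array([1.0, 1.0, 1.0])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)                             # [0.  0.5 2. ]
```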
New Optimization Formulation
The optimization problem becomes as follows
Minimize $v(w, b, \xi) = \tfrac{1}{2}\|w\|^2 + c \sum_{i=1}^{n} \xi_i$
Subject to $y_i (w^T x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
($c$ is the soft-margin parameter)
13 of 22
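In practice this soft-margin problem is what off-the-shelf SVM libraries solve; for example, in scikit-learn the soft-margin parameter (written $c$ above) corresponds to the C argument of SVC. A minimal sketch with invented data, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy data with one noisy point (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [1.8, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1])   # last point lies among the +1 class: noise

clf = SVC(kernel="linear", C=1.0)     # C plays the role of the soft-margin parameter
clf.fit(X, y)
print(clf.support_)                   # indices of the support vectors
print(clf.predict([[0.0, 0.0]]))
```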
Non-linear SVM
• SVM can be easily transformed into a non-linear learner.
• The vectors $x_i$ are transformed into a higher-dimensional feature space using a non-linear mapping $\Phi(x_i)$.
• The problem becomes linearly separable in the feature space.
14 of 22
Method Description
[Figure: a non-linear mapping $\Phi(x)$ from the input space $x$ to a feature space with coordinates $z_1, z_2, z_3$]
15 of 22
Method Description
The Kernel Trick
• Computing $\Phi(x)$ is computationally inefficient.
• Notice that in the formulation of the optimization problem we have $x_i^T x_j$ (a dot product).
• So we need to compute $\Phi(x_i) \cdot \Phi(x_j)$ without computing $\Phi(x_i)$ and $\Phi(x_j)$.
• For special mappings, such dot products can be computed very efficiently using a kernel function $k(x_i, x_j)$.
16 of 22
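A concrete instance of the trick: for the mapping $\Phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$ of a 2D vector, the dot product in feature space equals the kernel $k(x, z) = (x^T z)^2$, so $\Phi$ never has to be formed explicitly. A minimal numeric check (the mapping and the two sample vectors are chosen for illustration):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (no constant term)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Kernel that computes the same quantity directly in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # 1.0
print(k(x, z))           # 1.0 -- identical, without ever forming phi(...)
```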
Method Description
• The relation is $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
• Examples of common kernel functions:
  $k_{rbf}(x_1, x_2) = \exp\!\left(-\gamma \|x_1 - x_2\|^2\right)$
  $k_{poly}(x_1, x_2) = \left(x_1^T x_2 + 1\right)^d$
  $k_{sigmoid}(x_1, x_2) = \tanh\!\left(s\, x_1^T x_2 + c\right)$
• In practice, we specify $k$, thereby specifying $\Phi(\cdot)$ indirectly, instead of choosing $\Phi(\cdot)$.
17 of 22
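Hedged sketches of the three kernels above as plain functions; the parameter names gamma, d, s, c mirror the slide's symbols, and the default values are illustrative only.

```python
import numpy as np

def k_rbf(x1, x2, gamma=1.0):
    """RBF / Gaussian kernel: exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def k_poly(x1, x2, d=2):
    """Polynomial kernel: (x1^T x2 + 1)^d."""
    return (x1 @ x2 + 1.0) ** d

def k_sigmoid(x1, x2, s=1.0, c=0.0):
    """Sigmoid kernel: tanh(s * x1^T x2 + c)."""
    return np.tanh(s * (x1 @ x2) + c)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(k_rbf(a, b), k_poly(a, b), k_sigmoid(a, b))
```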
Steps of SVM
1- Solve for $\alpha$ as a quadratic program:
   $\max_{\alpha}\;\left\{ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \right\}$
   Subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $\alpha_i \geq 0$
2- Determine the support vectors by finding the points for which $\alpha_i > 0$.
18 of 22
3- Calculate
   $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $b = y_{sv} - w^T x_{sv}$,
   where $x_{sv}$ is a support vector with its label $y_{sv}$.
19 of 22
• To classify a new sample $s$:
  $y = w^T x + b = \sum_{i \in \text{SV}} \alpha_i y_i\, K(s, x_i) + b$, with $w = \sum_i \alpha_i y_i x_i$
  (the sum runs over the support vectors)
  If $y \geq 0$ then class 1, otherwise class 2.
20 of 22
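Putting the steps together: a minimal end-to-end sketch that solves the hard-margin dual with a kernel using scipy's general-purpose optimizer, picks the support vectors, computes $b$, and classifies a new sample. The data, kernel choice, and tolerance are illustrative; a real implementation would use a proper QP or SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

def k(x, z):
    return (x @ z + 1.0) ** 2                      # illustrative polynomial kernel

# Tiny toy training set (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

# Step 1: solve the dual  max sum(a) - 1/2 a^T (yy^T * K) a
Q = np.outer(y, y) * K

def neg_dual(a):
    return 0.5 * a @ Q @ a - np.sum(a)             # negated, so we can minimize

res = minimize(neg_dual, np.zeros(n), bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Step 2: support vectors are the points with alpha_i > 0 (up to numerical tolerance)
sv = np.where(alpha > 1e-6)[0]

# Step 3: compute b from an arbitrary support vector
j = sv[0]
b = y[j] - np.sum(alpha[sv] * y[sv] * K[sv, j])

# Step 4: classify a new sample s
s = np.array([2.5, 2.0])
score = np.sum(alpha[sv] * y[sv] * np.array([k(s, X[i]) for i in sv])) + b
print("class", 1 if score >= 0 else 2)
```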
SVM Decision Function
The number of support vectors is minimized by
maximizing the margin.
The decision function of the classifier can be written using a kernel function $k(x_i, x)$ of a new sample $x$ (to be classified) and a training sample (support vector) $x_i$:
$D(x) = \sum_{i \in s} \lambda_i \alpha_i\, k(x_i, x) + \alpha_0$
where
• $s$ is the set of support vectors (a subset of the training set)
• $\lambda_i = \pm 1$ is the label of object $x_i$
• $\alpha_i \geq 0$ are the parameters optimized during training by solving the optimization problem (using Quadratic Optimization), which can be costly
21 of 22
Comments on SVM
– QP; the choice of kernel can give a simple linear classifier ($x_i^T x_j$) or a nonlinear one, for example $k(x_i, x_j) = (x_i^T x_j + 1)^p$, a nonlinear polynomial classifier.
– Powerful classifier: maps to a higher-dimensional space to make nonlinearly separable data linearly separable.
22 of 22