CS373 Data Mining and
Machine Learning
Lecture 3
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Quick review of support vector machines…
• Need for more powerful classifiers
• Feature mappings, non-linear classifiers, kernels
- non-linear feature mappings
- kernels, kernel perceptron
Linear classifiers (with offset)
• A linear classifier with parameters $\theta, \theta_0$ makes predictions
$h(x) = \mathrm{sign}(\theta \cdot x + \theta_0)$
[Figure: + and – training examples separated by the linear decision boundary $\theta \cdot x + \theta_0 = 0$]
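As a concrete illustration (not part of the original slides), a minimal sketch of this decision rule with arbitrarily chosen parameters:

```python
import numpy as np

def predict_linear(theta, theta0, x):
    """Linear classifier with offset: returns +1 or -1."""
    return 1 if np.dot(theta, x) + theta0 >= 0 else -1

# Hypothetical parameters and a test point, purely for illustration
theta = np.array([2.0, -1.0])
theta0 = 0.5
print(predict_linear(theta, theta0, np.array([1.0, 3.0])))  # -> -1, since 2*1 - 1*3 + 0.5 = -0.5
```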
Support vector machine
• We get a max-margin decision boundary by solving a
quadratic programming problem
• The solution is unique and sparse (support vectors)
$$\min_{\theta,\theta_0}\ \frac{1}{2}\|\theta\|^2 \quad \text{subject to}\quad y^{(i)}\big(\theta \cdot x^{(i)} + \theta_0\big) \ge 1 \ \text{ for all } i$$
[Figure: + and – examples with the max-margin boundary; the geometric margin $1/\|\theta\|$ is the distance from the boundary to the closest examples]
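The following is a minimal sketch (not from the slides) of solving this quadratic program directly with the cvxpy library on a small linearly separable toy dataset; the data values are made up for illustration:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

theta = cp.Variable(2)
theta0 = cp.Variable()

# Hard-margin SVM: minimize (1/2)||theta||^2  s.t.  y_i (theta . x_i + theta0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(theta))
constraints = [cp.multiply(y, X @ theta + theta0) >= 1]
cp.Problem(objective, constraints).solve()

print("theta =", theta.value, "theta0 =", theta0.value)
print("geometric margin =", 1.0 / np.linalg.norm(theta.value))
```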
Support vector machine
• Relaxed quadratic optimization problem (SVM with slack):
$$\min_{\theta,\theta_0,\xi}\ \frac{1}{2}\|\theta\|^2 + C\sum_{i} \xi_i \quad \text{subject to}\quad y^{(i)}\big(\theta \cdot x^{(i)} + \theta_0\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \text{ for all } i$$
The value of C is an additional parameter we have to set
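In practice one rarely writes this QP out by hand; a minimal sketch using scikit-learn's SVC (a tooling assumption, not something the slides prescribe) shows where C enters:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one noisy/overlapping point (values are illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.5], [-1.0, 0.0], [1.8, 1.9]])
y = np.array([1, 1, -1, -1, -1])  # the last point violates a clean linear separation

# C trades off margin size against slack (misclassification) penalties
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: support vectors={len(clf.support_)}, train acc={clf.score(X, y):.2f}")
```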
Beyond linear classifiers…
• Many problems are not solved well by a linear classifier
even if we allow misclassified examples (SVM with slack)
• E.g., data from experiments typically involve “clusters”
of different types of examples
Non-linear feature mappings
• The easiest way to make the classifier more powerful is
to add non-linear coordinates to the feature vectors
• The classifier is still linear in the parameters, not in the inputs
linear classifier: $h(x) = \mathrm{sign}(\theta \cdot x + \theta_0)$
non-linear classifier: $h(x) = \mathrm{sign}(\theta \cdot \phi(x) + \theta_0)$
[Figure: the decision boundary is linear in the feature coordinates but non-linear in the original coordinates]
Non-linear feature mappings
• By expanding the feature coordinates, we still have a
linear classifier in the new feature coordinates but a
non-linear classifier in the original coordinates
[Figure: + and – examples; the margin is maximized in the feature space, producing a non-linear margin boundary in the original coordinates]
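A minimal sketch (my own illustration, not from the slides) of this idea: expand 2D inputs into second-order features with scikit-learn's PolynomialFeatures and fit a linear SVM in that feature space, which yields a non-linear boundary in the original coordinates. The dataset and the degree are arbitrary choices:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy "circle" data: +1 inside a radius, -1 outside (not linearly separable in 2D)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 0.4, 1, -1)

# phi(x) = [1, x1, x2, x1^2, x1*x2, x2^2]: second-order feature mapping
phi = PolynomialFeatures(degree=2)
X_feat = phi.fit_transform(X)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X_feat, y)  # linear classifier in feature space
print("train accuracy with 2nd-order features:", clf.score(X_feat, y))
```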
Learning non-linear classifiers
• We can apply the same SVM formulation, just replacing
the input examples with (higher dimensional) feature
vectors:
$$\min_{\theta,\theta_0,\xi}\ \frac{1}{2}\|\theta\|^2 + C\sum_{i} \xi_i \quad \text{subject to}\quad y^{(i)}\big(\theta \cdot \phi(x^{(i)}) + \theta_0\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \text{ for all } i$$
• Note that the cost of solving this quadratic
programming problem increases with the dimension of
the feature vectors (we will avoid this issue by solving
the dual instead)
Non-linear classifiers
• Many (low dimensional) problems are not solved well by
a linear classifier even with slack
• By mapping examples to feature vectors, and maximizing
a linear margin in the feature space, we obtain non-linear
margin curves in the original space
[Figure: decision boundaries obtained with linear features vs. 2nd order features]
Problems to resolve
By using non-linear feature mappings we get more
powerful sets of classifiers
• Computational efficiency?
- the cost of using higher dimensional feature vectors (seems
to) increase with the dimension
• Model selection?
- how do we choose among different feature mappings?
[Figure: decision boundaries obtained with linear, 2nd order, and 3rd order features]
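To make the computational-efficiency concern concrete, a quick illustrative calculation (my own, not from the slides) of how many monomial features an order-$p$ polynomial expansion of a $d$-dimensional input produces, namely $\binom{d+p}{p}$ including the constant term:

```python
from math import comb

# Number of monomials of degree <= p in d variables (includes the constant feature)
def num_poly_features(d, p):
    return comb(d + p, p)

for d in [2, 10, 100, 1000]:
    print(d, [num_poly_features(d, p) for p in (1, 2, 3)])
# e.g. d=1000, p=3 already gives ~1.7e8 features: explicit expansion becomes expensive
```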
Non-linear perceptron, kernels
• Non-linear feature mappings can be dealt with more
efficiently through their inner products or “kernels”
• We will begin by turning the perceptron classifier with
non-linear features into a “kernel perceptron”
• For simplicity, we drop the offset parameter; on a mistake,
i.e., when $y^{(i)}\,\theta \cdot \phi(x^{(i)}) \le 0$, the update is $\theta \leftarrow \theta + y^{(i)}\phi(x^{(i)})$
(applied in a sequence or repeatedly
over a fixed training set)
On perceptron updates
• Each update adds $y^{(i)}\phi(x^{(i)})$ to the parameter vector
• Repeated updates on the same example simply result in
adding the same term multiple times
• We can therefore write the current perceptron
solution as a function of the number of times $\alpha_i$ we performed
an update on each training example:
$\theta = \sum_{i} \alpha_i\, y^{(i)} \phi(x^{(i)})$
Kernel perceptron
• By switching to the “count” representation, we can
write the perceptron algorithm entirely in terms of
inner products between the feature vectors:
mistake on example $i$ when $y^{(i)} \sum_{j} \alpha_j\, y^{(j)}\, \big(\phi(x^{(j)}) \cdot \phi(x^{(i)})\big) \le 0$, in which case $\alpha_i \leftarrow \alpha_i + 1$
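Here is a minimal, self-contained sketch of this kernelized update rule in Python (my own rendering of the algorithm just described; the stopping rule and the tiny dataset are illustrative choices):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Perceptron in the 'count' representation: alpha[i] = number of updates on example i."""
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):                                 # repeated passes over a fixed training set
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:     # mistake: discriminant has the wrong sign
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, kernel):
    score = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(y)))
    return 1 if score >= 0 else -1

# Illustrative usage with a quadratic kernel on a tiny XOR-like dataset
quad_kernel = lambda u, v: (1 + np.dot(u, v)) ** 2
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, quad_kernel)
print([predict(x, X, y, alpha, quad_kernel) for x in X])  # reproduces the training labels
```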
Why inner products?
• For some feature mappings, the inner products can be
evaluated efficiently, without first expanding the feature
vectors
Why inner products?
• Instead of explicitly constructing feature vectors, we can
try to explicate their inner product or “kernel”
$K(x, x') = \phi(x) \cdot \phi(x')$
• Given such a kernel, what is the corresponding feature mapping $\phi$?
Does it even exist?
Feature mappings and kernels
• In the kernel perceptron algorithm, the feature vectors
appear only as inner products
• Instead of explicitly constructing feature vectors, we can
try to explicate their inner product or kernel
• $K(x, x')$ is a kernel function if there exists
a feature mapping $\phi$ such that $K(x, x') = \phi(x) \cdot \phi(x')$
• Examples of polynomial kernels: $K(x, x') = (x \cdot x')^p$ and $K(x, x') = (1 + x \cdot x')^p$
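As a sanity check (my own example, not from the slides), the quadratic kernel $(x \cdot x')^2$ on 2D inputs equals the inner product of the explicit feature vectors $\phi(x) = [x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2]$, so the kernel can be evaluated without ever expanding the features:

```python
import numpy as np

def quad_kernel(x, z):
    # Evaluated directly in the original 2D space: O(d) work
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit second-order feature mapping for 2D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(quad_kernel(x, z), np.dot(phi(x), phi(z)))  # the two values agree (up to rounding)
```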
Composition rules for kernels
• We can construct valid kernels from simple
components
• For any function $f(x)$, if $K_1$ is a kernel, then so
is $f(x)\, K_1(x, x')\, f(x')$
• The set of kernel functions is closed under addition and
multiplication: if $K_1$ and $K_2$ are kernels, then so are
$K_1(x, x') + K_2(x, x')$ and $K_1(x, x')\, K_2(x, x')$
• The composition rules are also helpful in verifying that a
kernel is valid (i.e., corresponds to an inner product of
some feature vectors)
1) $K(x, x') = f(x)\, K_1(x, x')\, f(x')$ is a kernel
2) $K(x, x') = K_1(x, x') + K_2(x, x')$ is a kernel
3) $K(x, x') = K_1(x, x')\, K_2(x, x')$ is a kernel
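One practical necessary-condition check, my own addition rather than something the slides spell out: a valid kernel must produce a positive semi-definite Gram matrix on any set of points, so we can spot-check composed kernels by looking at eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                     # arbitrary points

K1 = X @ X.T                                     # linear kernel (valid)
K2 = (1 + X @ X.T) ** 2                          # polynomial kernel (valid)
f = np.exp(-0.5 * np.sum(X ** 2, axis=1))        # some scalar function f(x)

candidates = {
    "f(x) K1 f(x')": f[:, None] * K1 * f[None, :],   # rule 1
    "K1 + K2": K1 + K2,                              # rule 2
    "K1 * K2": K1 * K2,                              # rule 3 (elementwise product)
}
for name, K in candidates.items():
    eigs = np.linalg.eigvalsh(K)
    print(name, "min eigenvalue ~", round(eigs.min(), 8))  # >= 0 up to numerical error
```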
Radial basis kernel
• The feature “vectors” corresponding to kernels may
also be infinite dimensional (functions)
• This is the case, e.g., for the radial basis kernel
$K(x, x') = \exp\!\big(-\tfrac{1}{2}\|x - x'\|^2\big)$
• Any distinct set of training points, regardless of their
labels, are separable using this kernel function!
• We can use the composition rules to show that this is
indeed a valid kernel:
$\exp\!\big(-\tfrac{1}{2}\|x - x'\|^2\big) = \exp\!\big(-\tfrac{1}{2}\|x\|^2\big)\,\exp(x \cdot x')\,\exp\!\big(-\tfrac{1}{2}\|x'\|^2\big)$,
where $\exp(x \cdot x') = \sum_{k=0}^{\infty} \frac{(x \cdot x')^k}{k!}$ ← infinite Taylor series expansion
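To illustrate the separability claim (my own demonstration, on an arbitrary toy dataset): the radial basis Gram matrix of distinct points is nonsingular, so we can always find coefficients that reproduce any labeling exactly:

```python
import numpy as np

def rbf_kernel(x, z):
    return np.exp(-0.5 * np.sum((x - z) ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))                 # distinct points
y = rng.choice([-1, 1], size=15)             # completely arbitrary labels

K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K, y.astype(float))  # K is positive definite for distinct points

scores = K @ alpha                           # discriminant values on the training points
print(np.all(np.sign(scores) == y))          # True: every labeling is fit exactly
```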
Kernel perceptron cont’d
• We can now apply the kernel perceptron algorithm
without ever explicating the feature vectors
Kernel perceptron: example
• With a radial basis kernel
[Figure: kernel perceptron decision boundary on a 2D dataset; only four mistakes are made during training]
Kernel SVM
• We can also turn SVM into its dual (kernel) form and
implicitly find the max-margin linear separator in the
feature space, e.g., corresponding to the radial basis
kernel
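A minimal usage sketch (the tooling choice is mine, not the slides'): scikit-learn's SVC solves this dual problem, and kernel='rbf' gives the implicit max-margin separator in the radial-basis feature space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy "circle" data that no linear boundary separates well
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where((X ** 2).sum(axis=1) < 0.4, 1, -1)

# gamma plays the role of the kernel bandwidth; C controls slack as before
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("support vectors:", len(clf.support_), " train accuracy:", clf.score(X, y))
```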