
CS373 Data Mining and Machine Learning

Lecture 3
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Quick review of support vector machines…
• Need for more powerful classifiers
• Feature mappings, non-linear classifiers, kernels

- non-linear feature mappings
- kernels, kernel perceptron

Linear classifiers (with offset)
• A linear classifier with parameters θ, θ₀ predicts h(x) = sign(θ · x + θ₀)

[Figure: linear decision boundary θ · x + θ₀ = 0 separating + and − examples]
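As a minimal sketch (the parameter values are made up for illustration), the decision rule looks like this in code:

```python
import numpy as np

theta = np.array([1.0, -2.0])   # weight vector (illustrative)
theta0 = 0.5                    # offset (illustrative)

def predict(x):
    """Classify x by which side of the boundary theta . x + theta0 = 0 it falls on."""
    return 1 if theta @ x + theta0 > 0 else -1

print(predict(np.array([3.0, 1.0])))   # +1
print(predict(np.array([0.0, 2.0])))   # -1
```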

Support vector machine

• We get a max-margin decision boundary by solving a
quadratic programming problem

• The solution is unique and sparse (support vectors)

[Figure: maximum-margin boundary with support vectors; the geometric margin is the distance from the boundary to the closest examples]
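A sketch of the max-margin quadratic program, assuming the cvxpy package and a small, linearly separable toy dataset (both are illustrative assumptions, not part of the slides):

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]])  # toy, separable
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
theta = cp.Variable(d)
theta0 = cp.Variable()

# minimize (1/2)||theta||^2 subject to y_i (theta . x_i + theta_0) >= 1
constraints = [cp.multiply(y, X @ theta + theta0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(theta)), constraints)
problem.solve()

print(theta.value, theta0.value)   # only the support vectors are active constraints
```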

Support vector machine
• Relaxed quadratic optimization problem: minimize ½‖θ‖² + C Σᵢ ξᵢ subject to yᵢ(θ · xᵢ + θ₀) ≥ 1 − ξᵢ, ξᵢ ≥ 0

The value of C is an additional parameter we have to set
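A sketch of the relaxed (soft-margin) SVM using scikit-learn, which is assumed available; the toy data are made up, and the loop simply illustrates that C is a parameter we have to choose:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0], [0.5, 0.2]])
y = np.array([1, 1, -1, -1, -1])       # small toy dataset (illustrative)

for C in (0.01, 1.0, 100.0):           # C trades margin width against slack
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.support_)             # indices of the support vectors change with C
```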

Beyond linear classifiers…
• Many problems are not solved well by a linear classifier even if we allow misclassified examples (SVM with slack)

• E.g., data from experiments typically involve “clusters”
of different types of examples

Non-linear feature mappings
• The easiest way to make the classifier more powerful is to add non-linear coordinates to the feature vectors
• The classifier is still linear in the parameters, not inputs

linear classifier: h(x) = sign(θ · x + θ₀)
non-linear classifier: h(x) = sign(θ · φ(x) + θ₀)

[Figure: the linear classifier gives a linear decision boundary; the non-linear classifier gives a non-linear decision boundary in the original coordinates]

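A minimal sketch of such a feature mapping for 2-D inputs; the particular coordinates added and the parameter values are illustrative assumptions:

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2) to linear plus second-order coordinates."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

theta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # parameters live in the 5-D feature space
theta0 = -1.0                                  # boundary x1^2 + x2^2 = 1 (a circle)

def predict(x):
    # linear in (theta, theta0), non-linear in the original inputs x
    return 1 if theta @ phi(x) + theta0 > 0 else -1

print(predict(np.array([2.0, 0.0])))   # +1 (outside the circle)
print(predict(np.array([0.5, 0.5])))   # -1 (inside the circle)
```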

Non-linear feature mappings
• By expanding the feature coordinates, we still have a linear classifier in the new feature coordinates but a non-linear classifier in the original coordinates

[Figure: mapped + and − examples become linearly separable; we maximize the margin in the feature space]

Learning non-linear classifiers
• We can apply the same SVM formulation, just replacing the input examples with (higher dimensional) feature vectors

• Note that the cost of solving this quadratic programming problem increases with the dimension of the feature vectors (we will avoid this issue by solving the dual instead)
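A minimal sketch of this idea, assuming scikit-learn is available: the same linear SVM formulation is applied to explicitly expanded (2nd order) feature vectors; the data and the particular feature map are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable in x

def phi(x):
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]           # 2nd order feature vector

Phi = np.array([phi(x) for x in X])                       # higher dimensional inputs
clf = LinearSVC(C=1.0).fit(Phi, y)                        # same linear SVM formulation
print(clf.score(Phi, y))                                  # non-linear boundary in x
```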

Non-linear classifiers
• Many (low dimensional) problems are not solved well by a linear classifier even with slack
• By mapping examples to feature vectors, and maximizing a linear margin in the feature space, we obtain non-linear margin curves in the original space

[Figure: decision boundaries obtained with linear features vs. 2nd order features]

Problems to resolve
By using non-linear feature mappings we get more
powerful sets of classifiers

• Computational efficiency?
- the cost of using higher dimensional feature vectors (seems to) increase with the dimension

• Model selection?
- how do we choose among different feature mappings?

[Figure: decision boundaries obtained with linear, 2nd order, and 3rd order features]

Non-linear perceptron, kernels
• Non-linear feature mappings can be dealt with more efficiently through their inner products or “kernels”

• We will begin by turning the perceptron classifier with
non-linear features into a “kernel perceptron”

• For simplicity, we drop the offset parameter

Perceptron update on a mistake, i.e., when yᵢ θ · φ(xᵢ) ≤ 0: θ ← θ + yᵢ φ(xᵢ)
(applied in a sequence or repeatedly over a fixed training set)

On perceptron updates
• Each update adds yᵢ φ(xᵢ) to the parameter vector θ
• Repeated updates on the same example simply result in adding the same term multiple times

• We can therefore write the current perceptron solution as a function of how many times we performed an update on each training example: θ = Σⱼ αⱼ yⱼ φ(xⱼ), where αⱼ counts the updates on example j
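A minimal sketch of this count representation (the feature matrix Phi, with feature vectors as rows, and the labels y are assumed given as numpy arrays):

```python
import numpy as np

def perceptron_counts(Phi, y, epochs=10):
    """Run the offset-free perceptron, returning alpha, where alpha[j]
    counts the updates performed on training example j."""
    n, d = Phi.shape
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            theta = (alpha * y) @ Phi           # theta = sum_j alpha_j y_j phi(x_j)
            if y[i] * (theta @ Phi[i]) <= 0:    # mistake on example i
                alpha[i] += 1                   # same as adding y_i phi(x_i) to theta
    return alpha
```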

Kernel perceptron
• By switching to the “count” representation, we can write the perceptron algorithm entirely in terms of inner products between the feature vectors: the discriminant on example i is Σⱼ αⱼ yⱼ φ(xⱼ) · φ(xᵢ), and a mistake (yᵢ times this sum ≤ 0) simply increments αᵢ

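A minimal sketch of the resulting kernel perceptron; the kernel K is passed in as a function, and the data are placeholders:

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Train on inputs X and labels y in {-1, +1}; only K(x, x') is ever evaluated."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # discriminant written entirely in terms of inner products (kernel values)
            f = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
            if y[i] * f <= 0:        # mistake: increment this example's update count
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, K):
    f = sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(X)))
    return 1 if f > 0 else -1
```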

Why inner products?
• For some feature mappings, the inner products can be evaluated efficiently, without first expanding the feature vectors

Why inner products?
• Instead of explicitly constructing feature vectors, we can try to explicate their inner product or “kernel”

• What is the corresponding feature mapping φ(x)? Does it even exist?
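For instance, a quick numeric check (a sketch, with a made-up input pair) that the quadratic kernel (x · x')² is the inner product of explicit feature vectors φ(x) = (x₁², x₂², √2 x₁x₂):

```python
import numpy as np

def K(x, xp):
    return (x @ xp) ** 2              # kernel evaluated without expanding features

def phi(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
print(K(x, xp), phi(x) @ phi(xp))     # the two values agree
```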

Feature mappings and kernels
• In the kernel perceptron algorithm, the feature vectors appear only as inner products

• Instead of explicitly constructing feature vectors, we can try to explicate their inner product or kernel

• K(x, x') is a kernel function if there exists a feature mapping φ such that K(x, x') = φ(x) · φ(x')

• Examples of polynomial kernels: K(x, x') = (x · x')² and, more generally, K(x, x') = (1 + x · x')ᵖ


• We can construct valid kernels from simple components

• For any function f(x), if K₁ is a kernel, then so is f(x) K₁(x, x') f(x')

• The set of kernel functions is closed under addition and multiplication: if K₁ and K₂ are kernels, then so are K₁ + K₂ and K₁ K₂

• The composition rules are also helpful in verifying that a kernel is valid (i.e., corresponds to an inner product of some feature vectors)

Composition rules for kernels

1) For any function f(x), K(x, x') = f(x) K₁(x, x') f(x') is a kernel whenever K₁ is a kernel

2) K(x, x') = K₁(x, x') + K₂(x, x') is a kernel whenever K₁ and K₂ are kernels

3) K(x, x') = K₁(x, x') K₂(x, x') is a kernel whenever K₁ and K₂ are kernels
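A small numeric sanity check (a sketch with random toy points): Gram matrices of valid kernels are positive semidefinite, and the constructions above preserve this.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

K1 = X @ X.T                             # Gram matrix of the linear kernel
K2 = (1.0 + X @ X.T) ** 2                # Gram matrix of a polynomial kernel
f = np.exp(-np.sum(X ** 2, axis=1) / 2)  # an arbitrary function of x

def min_eig(K):
    return np.linalg.eigvalsh(K).min()   # >= 0 (up to round-off) for a valid kernel

for K in (np.outer(f, f) * K1,           # rule 1: f(x) K1(x, x') f(x')
          K1 + K2,                       # rule 2: sum of kernels
          K1 * K2):                      # rule 3: product of kernels (elementwise Gram)
    print(min_eig(K))
```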


Radial basis kernel
• The feature “vectors” corresponding to kernels may also be infinite dimensional (functions)

• This is the case, e.g., for the radial basis kernel K(x, x') = exp(−½ ‖x − x'‖²)

• Any distinct set of training points, regardless of their labels, is separable using this kernel function!

• We can use the composition rules to show that this is indeed a valid kernel: K(x, x') = exp(−½‖x‖²) exp(x · x') exp(−½‖x'‖²), where exp(x · x') = Σₖ (x · x')ᵏ / k!  ← infinite Taylor series expansion
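A minimal sketch of the radial basis kernel (the ½ scaling is one common convention, assumed here) and of the separability claim, illustrated via the rank of the Gram matrix on a few toy points:

```python
import numpy as np

def rbf(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

# The Gram matrix of distinct points is full rank for this kernel, which is why
# any distinct, arbitrarily labeled training set is separable in the feature space.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
G = np.array([[rbf(a, b) for b in X] for a in X])
print(np.linalg.matrix_rank(G))   # 4: one effective dimension per distinct point
```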

Kernel perceptron cont’d
• We can now apply the kernel perceptron algorithm without ever explicating the feature vectors

Kernel perceptron: example
• With a radial basis kernel

[Figure: kernel perceptron decision boundary on a 2-D toy dataset; only four mistakes!]

Kernel SVM
• We can also turn the SVM into its dual (kernel) form and implicitly find the max-margin linear separator in the feature space, e.g., corresponding to the radial basis kernel
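A minimal sketch using scikit-learn's SVC (assumed available), which solves the dual internally with a radial basis kernel; the data and the gamma value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # non-linear concept in x

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)       # dual / kernel form of SVM
print(clf.score(X, y), len(clf.support_))                 # accuracy, number of support vectors
```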