CS373 Data Mining and
Machine Learning
Lecture 4
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Perceptron solution and kernels
• Support vector machine with kernels
- dual solution, with offset, slack
• Homework 1 due today Sep 4, 11.59pm EST
The perceptron solution
• Suppose the training set is linearly separable through the origin under a particular feature mapping ϕ, i.e.,
  yi (θ · ϕ(xi)) > 0 for all i = 1, …, n
  for some parameter vector θ
• The perceptron algorithm, applied repeatedly over the training set, will find a solution of the form
  θ = Σi αi yi ϕ(xi)
• We can recast the algorithm entirely in terms of these “mistake counts”
  αi = the number of mistakes made on the ith training example until convergence
  (each mistake on example i adds yi ϕ(xi) to θ, so the final θ is exactly this weighted sum)
Kernel perceptron
• We don’t need the parameters θ nor the feature vectors ϕ(x) explicitly
• All we need for predictions as well as updates is the value of the discriminant function
  f(x) = θ · ϕ(x) = Σj αj yj K(xj, x),  where K(xj, x) = ϕ(xj) · ϕ(x) is the kernel
• Update: whenever yi f(xi) ≤ 0 (the value of the discriminant function prior to the update), increment the mistake count αi by one
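As a concrete illustration, here is a minimal NumPy sketch of the kernel perceptron described above; the kernel function K, the arrays X and y, and the pass count n_epochs are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def kernel_perceptron(X, y, K, n_epochs=10):
    """Train mistake counts alpha for the kernel perceptron.

    X: (n, d) array of inputs, y: (n,) labels in {-1, +1},
    K: kernel function K(x, x') -> float.
    """
    n = len(X)
    alpha = np.zeros(n)                      # mistake counts, one per training example
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    for _ in range(n_epochs):                # repeated passes over the training set
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * G[:, i])    # discriminant value prior to the update
            if y[i] * f_i <= 0:                  # mistake (or exactly on the boundary)
                alpha[i] += 1                    # increment the mistake count
                mistakes += 1
        if mistakes == 0:                        # converged: training set separated
            break
    return alpha

def kernel_perceptron_predict(x, X, y, alpha, K):
    """Sign of the kernelized discriminant function."""
    return np.sign(sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)))
```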
Kernels
• By writing the algorithm in a “kernel” form, we can try
to work with the kernel (inner product) directly rather
than explicating the high dimensional feature vectors
• All we need to ensure is that the kernel is “valid”, i.e., there exists some underlying feature representation (see the example sketched below)
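As a standard illustration of such an underlying feature representation (the specific example on the original slide was lost, so this is a hedged stand-in), the quadratic kernel on x ∈ ℝ² factors as an explicit inner product:

```latex
K(x, x') = (x \cdot x')^2
         = x_1^2 {x'_1}^2 + 2\, x_1 x_2\, x'_1 x'_2 + x_2^2 {x'_2}^2
         = \phi(x) \cdot \phi(x'),
\qquad \phi(x) = \big(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\big)
```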
Valid kernels
• A kernel function K(x, x′) is valid (is a kernel) if there exists some feature mapping ϕ such that
  K(x, x′) = ϕ(x) · ϕ(x′)
• Equivalently, a kernel is valid if it is symmetric and, for all training sets {x1, …, xn}, the Gram matrix
  [K]ij = K(xi, xj),  i, j = 1, …, n
  is positive semi-definite
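A quick NumPy sanity check of this condition on a finite sample (the candidate kernel K, the sample X, and the tolerance tol are illustrative assumptions; passing on one sample does not prove validity, but failing disproves it):

```python
import numpy as np

def gram_is_psd(K, X, tol=1e-9):
    """Build the Gram matrix of kernel K on sample X and check symmetry + PSD."""
    n = len(X)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(G, G.T)
    eigvals = np.linalg.eigvalsh(G)          # real eigenvalues of the symmetric Gram matrix
    return symmetric and bool(np.all(eigvals >= -tol))

# Example check with an RBF kernel on random points (expected: True)
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
print(gram_is_psd(rbf, np.random.randn(20, 3)))
```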
Primal SVM
• Consider a simple max-margin classifier through the origin
  minimize (1/2)‖θ‖²  subject to  yi (θ · ϕ(xi)) ≥ 1,  i = 1, …, n
• We claim that the solution has the same form as in the perceptron case
  θ = Σi αi yi ϕ(xi)
• As before, we focus on estimating the αi, which are now non-negative real numbers: the Lagrange multipliers used to enforce the classification constraints
Primal SVM
• Consider a simple max-margin classifier through the origin
• To solve this, we can introduce Lagrange multipliers αi ≥ 0 for the classification constraints and minimize the resulting Lagrangian with respect to the parameters
  L(θ, α) = (1/2)‖θ‖² − Σi αi [ yi (θ · ϕ(xi)) − 1 ]
  positive values of αi enforce the classification constraints; each bracketed term yi (θ · ϕ(xi)) − 1 should become non-negative
Understanding the Lagrangian
• We can recover the primal problem by maximizing the
Lagrangian with respect to the Lagrange multipliers
to maximize L(θ, α) over α ≥ 0: since yi (θ · ϕ(xi)) − 1 ≥ 0 for all i, make αi = 0, so maxα≥0 L(θ, α) = (1/2)‖θ‖²
to maximize L(θ, α) over α ≥ 0: if yi (θ · ϕ(xi)) − 1 < 0 for some i, make αi = ∞, so maxα≥0 L(θ, α) = ∞
Note: to minimize maxα≥0 L(θ,α) with respect to θ, we should always fulfill constraints, to avoid ∞
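Putting the two cases together (a standard restatement of the slide's point, written in LaTeX for clarity): minimizing the max over α ≥ 0 reproduces the constrained primal.

```latex
\min_{\theta}\ \max_{\alpha \ge 0} L(\theta, \alpha)
\;=\;
\min_{\theta}\ \tfrac{1}{2}\|\theta\|^2
\quad \text{s.t.}\quad y_i\,\big(\theta \cdot \phi(x_i)\big) \ge 1 \ \ \text{for all } i
```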
Primal - Dual
• Primal: expressed in terms of explicit feature vectors
• Dual: expressed in terms of kernels
• How do we get from one to the other? In two steps: step 1 (minimize the Lagrangian over the parameters θ) and step 2 (maximize the result over the non-negative Lagrange multipliers α)
• Slater conditions (linear constraints ⇒ strong duality) guarantee that the primal and dual have the same optimal value
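A compact statement of what resolves the diagram's question mark (standard convex duality, added here for reference): because the constraints are linear in θ, Slater's condition holds and strong duality lets us swap the order of min and max; step 1 is the inner minimization over θ, step 2 the outer maximization over α.

```latex
\min_{\theta}\ \max_{\alpha \ge 0} L(\theta, \alpha)
\;=\;
\max_{\alpha \ge 0}\ \min_{\theta} L(\theta, \alpha)
```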
Lagrangian Dual (step 1)
• Minimizing the Lagrangian with respect to θ (setting ∂L/∂θ = 0) gives
  θ = Σi αi yi ϕ(xi)
  (the unique solution as a function of α)
• The dual objective is obtained simply by plugging this solution back into the Lagrangian and recalling that the Lagrange multipliers are non-negative
  L(θ(α), α) = Σi αi − (1/2) Σi Σj αi αj yi yj (ϕ(xi) · ϕ(xj))
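A minimal numerical sketch of maximizing this dual, using projected gradient ascent in NumPy rather than a proper QP solver; the Gram matrix G, the step size lr, and the iteration count are assumptions of the sketch.

```python
import numpy as np

def svm_dual_no_offset(G, y, lr=0.01, n_iters=2000):
    """Maximize sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j G_ij subject to a_i >= 0.

    G: (n, n) Gram matrix with G_ij = K(x_i, x_j); y: (n,) labels in {-1, +1}.
    """
    Q = (y[:, None] * y[None, :]) * G               # Q_ij = y_i y_j K(x_i, x_j)
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        grad = 1.0 - Q @ alpha                      # gradient of the dual objective
        alpha = np.maximum(alpha + lr * grad, 0.0)  # ascent step, then project onto alpha >= 0
    return alpha
```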
Dual SVM (step 2)
• Maximize the resulting dual objective over the non-negative multipliers
  maximize Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)  subject to αi ≥ 0,  i = 1, …, n
  where K(xi, xj) = ϕ(xi) · ϕ(xj) is the kernel
• This is again a quadratic programming problem but with simpler “box” constraints
• A training example with α̂i > 0 is a support vector
Dual SVM
• This is again a quadratic programming problem but with simpler “box” constraints
• Once we solve for the α̂i, we predict labels according to
  ŷ = sign( Σi α̂i yi K(xi, x) )
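The resulting predictor, as a short NumPy sketch (alpha, X, y, and K are assumed to come from the training step above; only support vectors, i.e., examples with αi > 0, actually contribute to the sum):

```python
import numpy as np

def svm_predict(x_new, X, y, alpha, K):
    """Kernel SVM prediction (no offset): sign of sum_i alpha_i y_i K(x_i, x_new)."""
    score = sum(a * yi * K(xi, x_new) for a, yi, xi in zip(alpha, y, X) if a > 0)
    return np.sign(score)
```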
Kernel SVM
• Solving the SVM dual implicitly finds the max-margin
linear separator in the feature space
RBF kernel, support vectors
• Assume no offset, no slack. A point is not a support vector (i.e., its αi = 0) if the margin constraint is satisfied without it; otherwise it has to be a support vector
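For reference, a sketch of the RBF kernel referred to in the slide title; the bandwidth parameter gamma is a free choice not specified on the slide.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF (Gaussian) kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))
```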
Dual SVM with offset
• Where’s the offset parameter θ0? How do we solve for it?
  maximize Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)  subject to αi ≥ 0 and Σi αi yi = 0
  (the offset drops out of the dual objective, leaving the extra constraint Σi αi yi = 0)
• We know that the classification constraints are tight for support vectors. If the ith point is a support vector, then
  yi ( Σj α̂j yj K(xj, xi) + θ0 ) = 1,  so  θ0 = yi − Σj α̂j yj K(xj, xi)
  Note: you can pick any SV
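A sketch of recovering the offset from the support vectors, continuing the NumPy notation above (alpha, y, and the Gram matrix G are assumed available; picking any single support vector works, and averaging over all of them is a common, more stable variant):

```python
import numpy as np

def solve_offset(alpha, y, G, tol=1e-8):
    """theta_0 = y_i - sum_j alpha_j y_j K(x_j, x_i), for any support vector i."""
    sv = np.where(alpha > tol)[0]                  # support vector indices (alpha_i > 0)
    # Average the estimate over all support vectors for numerical stability
    return float(np.mean([y[i] - np.sum(alpha * y * G[:, i]) for i in sv]))
```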
Dual SVM with offset and slack
• C is the same slack penalty as in the primal formulation
  maximize Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)  subject to 0 ≤ αi ≤ C and Σi αi yi = 0
  where K(xi, xj) = ϕ(xi) · ϕ(xj) is the kernel
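For completeness, the same kernel SVM with offset and slack is available off the shelf; a minimal usage sketch with scikit-learn (the toy data, kernel choice, and parameter values are arbitrary illustrations, not from the lecture):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# RBF-kernel SVM with offset and slack; C is the slack penalty, gamma the RBF bandwidth
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)

print(clf.support_)        # indices of the support vectors (alpha_i > 0)
print(clf.dual_coef_)      # alpha_i * y_i for the support vectors
print(clf.intercept_)      # the offset parameter
print(clf.predict(X[:5]))  # predicted labels
```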