
CS373 Data Mining and Machine Learning

Lecture 4
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Perceptron solution and kernels
• Support vector machine with kernels

- dual solution, with offset, slack

• Homework 1 due today, Sep 4, at 11:59pm EST

The perceptron solution
• Suppose the training set is linearly separable through
  origin given a particular feature mapping, i.e.,

      yi (θ · ϕ(xi)) > 0 for all i = 1, …, n, for some θ

• The perceptron algorithm, applied repeatedly over the
  training set, will find a solution of the form

      θ = Σi αi yi ϕ(xi)

• We can recast the algorithm entirely in terms of these
  “mistake counts”:

      αi = the number of mistakes made on
           the ith training example until convergence

Kernel perceptron
• We need neither the parameters θ nor the feature
  vectors ϕ(x) explicitly

• All we need for predictions as well as updates is the
  value of the discriminant function

      f(x) = Σj αj yj K(xj, x),   where K(xj, x) = ϕ(xj) · ϕ(x) is the kernel

• On a mistake at the ith example, i.e., when yi f(xi) ≤ 0,
  update αi ← αi + 1, using only the value of the discriminant
  function prior to the update
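A minimal sketch of this kernelized perceptron in Python (illustrative code, not the course's reference implementation; it assumes X is an array of feature vectors, y a numpy array of ±1 labels, and kernel any valid kernel function). Note that neither θ nor ϕ(x) ever appears; only kernel evaluations do.

    import numpy as np

    def kernel_perceptron(X, y, kernel, max_epochs=100):
        # alpha[i] = number of mistakes made on the ith training example
        y = np.asarray(y, dtype=float)
        n = len(y)
        alpha = np.zeros(n)
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
        for _ in range(max_epochs):
            mistakes = 0
            for i in range(n):
                # discriminant value prior to the update: f(xi) = sum_j alpha_j y_j K(xj, xi)
                f_i = np.sum(alpha * y * K[:, i])
                if y[i] * f_i <= 0:          # mistake on example i
                    alpha[i] += 1
                    mistakes += 1
            if mistakes == 0:                # converged (data separable under this kernel)
                break
        return alpha

    def predict(x_new, X, y, alpha, kernel):
        # sign of the kernelized discriminant function at a new point
        y = np.asarray(y, dtype=float)
        return np.sign(np.sum(alpha * y * np.array([kernel(xj, x_new) for xj in X])))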

Kernels
• By writing the algorithm in a “kernel” form, we can try
  to work with the kernel (inner product) directly rather
  than constructing the high-dimensional feature vectors
  explicitly

• All we need to ensure is that the kernel is “valid”, i.e.,
  that there exists some underlying feature representation

      e.g., K(x, x′) = (x · x′)², or the RBF kernel
      K(x, x′) = exp(−‖x − x′‖² / 2σ²)
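A quick numerical sanity check of the first example kernel above (my own illustration): on R², K(x, z) = (x · z)² is exactly the inner product of the explicit 3-dimensional feature vectors ϕ(x) = (x1², √2·x1·x2, x2²).

    import numpy as np

    def K(x, z):
        return np.dot(x, z) ** 2                     # kernel, computed directly

    def phi(x):
        # explicit feature map for K(x, z) = (x . z)^2 on R^2
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(K(x, z), np.dot(phi(x), phi(z)))           # both print 1.0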

Valid kernels
• A kernel function K is valid (is a kernel) if there exists
  some feature mapping ϕ such that

      K(x, x′) = ϕ(x) · ϕ(x′)

• Equivalently, a kernel is valid if it is symmetric and, for all
  training sets {x1, …, xn}, the Gram matrix

      G with entries Gij = K(xi, xj),  i, j = 1, …, n

  is positive semi-definite
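As an illustration (my own code, not from the slides), the condition can be checked numerically on a given sample: build the Gram matrix and test symmetry and non-negativity of its eigenvalues. This only certifies validity on that particular sample; a valid kernel must pass the test for every training set.

    import numpy as np

    def psd_on_sample(kernel, X, tol=1e-10):
        # Gram matrix G_ij = K(x_i, x_j) for the given sample
        G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        symmetric = np.allclose(G, G.T)
        psd = np.min(np.linalg.eigvalsh(G)) >= -tol   # eigenvalues of the symmetric matrix
        return symmetric and psd

    rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
    X = np.random.randn(50, 3)
    print(psd_on_sample(rbf, X))                      # True: the RBF kernel passes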

Primal SVM
• Consider a simple max-margin classifier through origin:

      minimize ½‖θ‖²  subject to  yi (θ · ϕ(xi)) ≥ 1 for all i = 1, …, n

• We claim that the solution has the same form as in the
  perceptron case:

      θ = Σi αi yi ϕ(xi)

• As before, we focus on estimating the coefficients αi, which are
  now non-negative real numbers: the non-negative Lagrange
  multipliers used to enforce the classification constraints

Primal SVM
• Consider a simple max-margin classifier through origin:

      minimize ½‖θ‖²  subject to  yi (θ · ϕ(xi)) ≥ 1 for all i

• To solve this, we can introduce Lagrange multipliers αi for
  the classification constraints and minimize the resulting
  Lagrangian with respect to the parameters θ:

      L(θ, α) = ½‖θ‖² − Σi αi [ yi (θ · ϕ(xi)) − 1 ]

  Positive values of αi enforce the classification constraints;
  the terms yi (θ · ϕ(xi)) − 1 should become non-negative at the solution
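The slides that follow carry out this minimization in two steps; for reference, here is the standard derivation they refer to, written out compactly (my own write-up, not quoted from the slides):

    \[
    \begin{aligned}
    L(\theta,\alpha) &= \tfrac{1}{2}\lVert\theta\rVert^2
      - \sum_i \alpha_i \bigl[\, y_i\,(\theta\cdot\phi(x_i)) - 1 \,\bigr],
      \qquad \alpha_i \ge 0 \\
    \nabla_\theta L = 0 \;&\Longrightarrow\;
      \theta = \sum_i \alpha_i y_i\,\phi(x_i)
      \quad\text{(step 1: unique minimizer as a function of }\alpha\text{)} \\
    \text{substituting back:}\quad
      \max_{\alpha\ge 0}\;& \sum_i \alpha_i
      - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j\, y_i y_j\, K(x_i,x_j),
      \qquad K(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)
      \quad\text{(step 2)}
    \end{aligned}
    \]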

Understanding the Lagrangian

• We can recover the primal problem by maximizing the
  Lagrangian with respect to the Lagrange multipliers:

      maxα≥0 L(θ, α)

  - if yi (θ · ϕ(xi)) − 1 ≥ 0 for all i, then to maximize we make αi = 0,
    so maxα≥0 L(θ, α) = ½‖θ‖²
  - if yi (θ · ϕ(xi)) − 1 < 0 for some i, then to maximize we make αi = ∞,
    so maxα≥0 L(θ, α) = ∞

  Note: to minimize maxα≥0 L(θ, α) with respect to θ, we should always
  fulfill the constraints, to avoid ∞

Primal - Dual
• Primal: expressed in terms of θ; explicit feature vectors ϕ(xi)
• Dual: expressed in terms of αi; kernels K(xi, xj)
• We go from the primal to the dual in two steps (step 1, step 2 below)
• Slater conditions hold (linear constraints ⇒ strong duality)

Lagrangian Dual (step 1)
• Minimizing the Lagrangian over θ gives the unique solution as a
  function of α:

      θ = Σi αi yi ϕ(xi)

• The dual problem is obtained by substituting this solution back into
  the Lagrangian and recalling that the Lagrange multipliers are
  non-negative

Dual SVM (step 2)
• The dual problem:

      maximize  Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)
      subject to  αi ≥ 0 for all i,   where K(xi, xj) = ϕ(xi) · ϕ(xj) is the kernel

• This is again a quadratic programming problem but with simpler
  “box” constraints
• Training examples with αi > 0 are the support vectors

Dual SVM
• Once we solve for α, we predict labels according to

      ŷ = sign( Σi αi yi K(xi, x) )

Kernel SVM
• Solving the SVM dual implicitly finds the max-margin linear
  separator in the feature space
  (figure: decision boundary with an RBF kernel; support vectors highlighted)
• Assume no offset, no slack. A point is not a support vector (i.e., αi = 0)
  if the margin constraint is satisfied without it; otherwise it has to be a SV

Dual SVM with offset
• Where’s the offset parameter θ0? How do we solve for it?
  With offset, predictions are ŷ = sign( Σj αj yj K(xj, x) + θ0 )
• We know that the classification constraints are tight for support
  vectors. If the ith point is a support vector, then

      yi ( Σj αj yj K(xj, xi) + θ0 ) = 1,   so   θ0 = yi − Σj αj yj K(xj, xi)

  Note: you can pick any SV

Dual SVM with offset and slack
• The dual problem:

      maximize  Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj)
      subject to  0 ≤ αi ≤ C for all i,   Σi αi yi = 0

• C is the same slack penalty as in the primal formulation
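A minimal end-to-end sketch in Python using scikit-learn's SVC (an illustration under assumed data and hyperparameters, not the course's reference code): it checks that the dual-form prediction Σj αj yj K(xj, x) + θ0 matches the library's decision function, and recovers the offset θ0 from a support vector, assuming at least one margin support vector (0 < αi < C) exists.

    import numpy as np
    from sklearn.svm import SVC

    # Illustrative 2-D data: two Gaussian blobs with labels in {-1, +1}
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
    y = np.array([1] * 20 + [-1] * 20)

    gamma, C = 0.5, 1.0
    clf = SVC(kernel='rbf', gamma=gamma, C=C).fit(X, y)

    def rbf(a, b):
        # K(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    # Dual prediction: f(x) = sum_j alpha_j y_j K(x_j, x) + theta_0.
    # sklearn stores alpha_j * y_j (support vectors only) in dual_coef_,
    # the support vectors in support_vectors_, and theta_0 in intercept_.
    x_new = np.array([0.5, 0.3])
    f = sum(c * rbf(sv, x_new)
            for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]
    print(f, clf.decision_function([x_new])[0])        # the two values agree

    # Offset from a margin support vector (0 < alpha_i < C), where the
    # constraint is tight: theta_0 = y_i - sum_j alpha_j y_j K(x_j, x_i)
    i = int(np.argmin(np.abs(clf.dual_coef_[0])))      # SV with smallest alpha: most likely a margin SV
    sv_i, y_i = clf.support_vectors_[i], y[clf.support_[i]]
    theta0 = y_i - sum(c * rbf(sv, sv_i)
                       for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
    print(theta0, clf.intercept_[0])                   # matches up to solver tolerance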